Big data was one of the biggest topics on this year’s useR conference in Albacete and it is definitely one of today’s hottest buzzwords. But what defines “Big Data”? And on the practical side: How can big data be tackled in R?
What data is big?
Hadley Wickham, one of the best known R developers, gave an interesting definition of Big Data on the conceptual level in his useR!-Conference talk “BigR data”. In traditional analysis, the development of a statistical model takes more time than the calculation by the computer. When it comes to Big Data this proportion is turned upside down. Big Data comes into play when the CPU time for the calculation takes longer than the cognitive process of designing a model.
Jan Wijffels proposed in his talk at the useR!-Conference a trisection of data according to its size. As a rule of thumb: Data sets that contain up to one million records can easily processed with standard R. Data sets with about one million to one billion records can also be processed in R, but need some additional effort. Data sets that contain more than one billion records need to be analyzed by map reduce algorithms. These algorithms can be designed in R and processed with connectors to Hadoop and the like.
The number of records of a data set is just a rough estimator of the data size though. It’s not about the size of the original data set, but about the size of the biggest object created during the analysis process. Depending on the analysis type, a relatively small data set can lead to very large objects. To give an example: The distance matrix in hierarchical cluster analysis on 10.000 records contains almost 50 Million distances.
Big Data Strategies in R
If Big Data has to be tackle with R, five different strategies can be considered: