plyrmr: bringing #rstats plyr to MapReduce for Hadoop

Just saw this package; it is in an early testing release now. Love the thought of Hadley’s Split-Apply-Combine package (plyr) being used for MapReduce, which is conceptually similar in many ways: both split the data by a key, apply a function to each piece and combine the results. I do think, though, that Revolution’s work in R&D needs to be applauded, given the number of packages they have created, or funded AND donated (separate blog post on this?), while RStudio seems more content with building basic blocks for infrastructure, without an adequate Big Data solution for RStudio itself.

Of course, usage stats on RevoScaleR, Revolution’s Big Data package, are not as transparent or as in line with the “free as in beer, free as in speech” philosophy that RStudio breathes.
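To make the Split-Apply-Combine vs MapReduce similarity a bit more concrete, here is a minimal sketch in plain Python (not R, and not plyrmr or Hadoop code). The records, function names and the in-memory "shuffle" are all assumptions made up for illustration; the only point is that the two patterns line up step by step.

```python
from collections import defaultdict

# Toy (key, value) records; hypothetical data for illustration only.
records = [("a", 1), ("b", 10), ("a", 3), ("b", 20), ("c", 5)]


def split_apply_combine(rows, apply_fn):
    """plyr-style: split rows by key, apply a function per piece, combine."""
    pieces = defaultdict(list)
    for key, value in rows:  # split
        pieces[key].append(value)
    # apply the function to each piece and combine into one result
    return {key: apply_fn(values) for key, values in pieces.items()}


def map_reduce(rows, mapper, reducer):
    """MapReduce-style: map emits pairs, shuffle groups them, reduce folds."""
    shuffled = defaultdict(list)
    for row in rows:
        for key, value in mapper(row):  # map: emit (key, value) pairs
            shuffled[key].append(value)  # shuffle: group values by key
    # reduce: fold each group of values into one result per key
    return {key: reducer(key, values) for key, values in shuffled.items()}


sac_result = split_apply_combine(records, sum)
mr_result = map_reduce(
    records,
    mapper=lambda row: [row],                 # identity map
    reducer=lambda key, values: sum(values),  # per-key sum
)
print(sac_result)
print(mr_result)
```

Both calls give the same per-key sums. The "split" step plays the role of map plus shuffle, and "apply" plus "combine" plays the role of reduce.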

https://github.com/RevolutionAnalytics/RHadoop/wiki/plyrmr

This R package enables the R user to perform common data manipulation operations, as found in popular packages such as plyr and reshape2, on very large data sets stored on Hadoop. Like rmr, it relies on Hadoop mapreduce to perform its tasks, but it provides a familiar plyr-like interface while hiding many of the mapreduce details. plyrmr provides:

Hadoop-capable versions of well known data.frame functions: transform, subset, mutate, summarize, melt, dcast and more from packages base, plyr and reshape2.
Simple but powerful ways of applying any function operating on data.frames to Hadoop data sets: do and magic.wand.
Simple but powerful ways of aggregating data: group, group.f, gather and ungroup.
All of the above can be combined by normal functional composition: delayed evaluation helps mitigate any performance penalty of doing so by minimizing the number of Hadoop jobs launched to evaluate an expression.
New data frame functions which are also Hadoop-capable that are more suitable for development than some of the above: select and where.
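
The delayed evaluation point in the list above is worth a small illustration. What follows is only a rough sketch in plain Python of the general idea, not plyrmr's actual implementation or API: the class `LazyPipeline`, its methods and the `jobs_launched` counter are hypothetical names invented here. Composed steps are merely recorded, and a single simulated "job" runs them all in one pass when the result is finally requested.

```python
class LazyPipeline:
    """Records row-wise steps and runs them in one pass on collect()."""

    jobs_launched = 0  # counts simulated "jobs" across all pipelines

    def __init__(self, rows, steps=()):
        self.rows = rows
        self.steps = tuple(steps)

    def where(self, predicate):
        # Nothing runs yet: just remember the filter.
        return LazyPipeline(self.rows, self.steps + (("where", predicate),))

    def transform(self, fn):
        # Nothing runs yet: just remember the row transformation.
        return LazyPipeline(self.rows, self.steps + (("transform", fn),))

    def collect(self):
        # One simulated job evaluates every recorded step in a single pass.
        LazyPipeline.jobs_launched += 1
        out = []
        for row in self.rows:
            keep = True
            for kind, fn in self.steps:
                if kind == "where":
                    if not fn(row):
                        keep = False
                        break
                else:
                    row = fn(row)
            if keep:
                out.append(row)
        return out


rows = [{"x": 1}, {"x": 2}, {"x": 3}, {"x": 4}]
pipeline = (
    LazyPipeline(rows)
    .where(lambda r: r["x"] % 2 == 0)
    .transform(lambda r: {**r, "y": r["x"] * 10})
    .where(lambda r: r["y"] > 20)
)
print(LazyPipeline.jobs_launched)  # still 0: composing steps launched nothing
result = pipeline.collect()
print(result, LazyPipeline.jobs_launched)
```

Three composed operations cost one pass over the data here instead of three, which is the kind of saving the package description attributes to delayed evaluation.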