Just saw this package; it is in an early testing release now. I love the thought of Hadley's split-apply-combine package (plyr) being used for MapReduce, which is conceptually similar in many ways. I do think, though, that Revolution's R&D work needs to be applauded, given the number of packages they have created, or funded and donated (separate blog post on this?), while RStudio seems more content building the basic blocks of infrastructure, without an adequate Big Data solution for RStudio itself.
Of course, usage stats on RevoScaleR, Revolution's Big Data package, are not as transparent, or as in line with the free-as-in-beer and free-as-in-speech philosophy that RStudio lives and breathes.
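To make the "conceptually similar" point concrete, here is a rough sketch of how a split-apply-combine summary lines up with the map, shuffle and reduce phases. It is plain Python rather than R, it is not plyrmr's or Hadoop's API, and the records and function names (`map_phase`, `shuffle`, `reduce_phase`) are made up for illustration.

```python
from collections import defaultdict

# Hypothetical toy records; not taken from any real data set.
rows = [
    {"dept": "a", "sales": 10},
    {"dept": "b", "sales": 5},
    {"dept": "a", "sales": 7},
]


def map_phase(records, key):
    """Split: emit (key, record) pairs, as a mapper would."""
    for rec in records:
        yield rec[key], rec


def shuffle(pairs):
    """Group the emitted pairs by key (the framework's shuffle step)."""
    groups = defaultdict(list)
    for k, rec in pairs:
        groups[k].append(rec)
    return groups


def reduce_phase(groups, fn):
    """Apply fn to each group and combine the results into one dict."""
    return {k: fn(recs) for k, recs in groups.items()}


totals = reduce_phase(
    shuffle(map_phase(rows, "dept")),
    lambda recs: sum(r["sales"] for r in recs),
)
print(totals)  # {'a': 17, 'b': 5}
```

The split step corresponds to the mapper emitting keys, the apply step to the function run on each group, and the combine step to collecting the reducer output.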
https://github.com/RevolutionAnalytics/RHadoop/wiki/plyrmr
This R package enables the R user to perform common data manipulation operations, as found in popular packages such as plyr and reshape2, on very large data sets stored on Hadoop. Like rmr, it relies on Hadoop mapreduce to perform its tasks, but it provides a familiar plyr-like interface while hiding many of the mapreduce details. plyrmr provides:
- Hadoop-capable versions of well known data.frame functions: transform, subset, mutate, summarize, melt, dcast and more from packages base, plyr and reshape2.
- Simple but powerful ways of applying any function operating on data.frames to Hadoop data sets: do and magic.wand.
- Simple but powerful ways of aggregating data: group, group.f, gather and ungroup.
- All of the above can be combined by normal functional composition: delayed evaluation helps mitigating any performance penalty of doing so by minimizing the number of Hadoop jobs launched to evaluate an expression.
- New data frame functions which are also Hadoop-capable that are more suitable for development than some of the above: select and where.
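The delayed-evaluation point in the list above is worth a small illustration. What follows is only a sketch in plain Python, not plyrmr code: the `LazyFrame` class, its method-chaining style and the `jobs_launched` counter are hypothetical, and the method names `where` and `transform` are borrowed from the list purely for flavour. The idea it tries to show is that composed steps are recorded first and then run together in one pass, which stands in for a single Hadoop job.

```python
class LazyFrame:
    """Records row-wise steps instead of running them immediately."""

    jobs_launched = 0  # counts simulated jobs across all pipelines

    def __init__(self, rows, steps=()):
        self.rows = rows
        self.steps = tuple(steps)

    def where(self, predicate):
        # Defer the filter: just remember it.
        return LazyFrame(self.rows, self.steps + (("where", predicate),))

    def transform(self, fn):
        # Defer the row transformation as well.
        return LazyFrame(self.rows, self.steps + (("transform", fn),))

    def collect(self):
        # All deferred steps run in a single pass, i.e. one simulated job.
        LazyFrame.jobs_launched += 1
        out = []
        for row in self.rows:
            keep = True
            for kind, fn in self.steps:
                if kind == "where":
                    if not fn(row):
                        keep = False
                        break
                else:
                    row = fn(row)
            if keep:
                out.append(row)
        return out


data = [{"x": 1}, {"x": 2}, {"x": 3}]
pipeline = (
    LazyFrame(data)
    .where(lambda r: r["x"] > 1)
    .transform(lambda r: {**r, "y": r["x"] * 10})
)
result = pipeline.collect()
print(result)                   # [{'x': 2, 'y': 20}, {'x': 3, 'y': 30}]
print(LazyFrame.jobs_launched)  # 1
```

Two composed steps cost one pass over the data here; an eager version would make one pass per step, which on Hadoop would presumably mean one job per step.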
Hi Ajay, Thanks for sharing this insightful update.
Regards, Sanjeev