Here’s a set of questions that David Smith of Revolution Analytics was kind enough to answer after the launch of RevoScaleR, the new R package which integrates Hadoop and R.
Ajay- How does RevoScaleR work from a technical viewpoint in terms of Hadoop integration?
David- The point isn’t that there’s a deep technical integration between Revolution R and Hadoop, but rather that we see them as complementary (not competing) technologies. Hadoop is amazing at reliably (if slowly) processing huge volumes of distributed data; the RevoScaleR package complements Hadoop by providing statistical algorithms to analyze the data processed by Hadoop. The analogy I use is to compare a freight train with a race car: use Hadoop to slog through a distributed data set and use Map/Reduce to output an aggregated, rectangular data file; then use RevoScaleR to perform statistical analysis on the processed data (and use the speed of RevoScaleR to iterate through many model options to find the best one).
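As a rough illustration of the two-stage workflow David describes, here is a minimal sketch in plain Python. It is not Hadoop or RevoScaleR code: the `packets` records, the `map_phase`/`reduce_phase` helpers and the `fit_line` function are all hypothetical stand-ins. The first stage collapses raw records into a small rectangular table (one row per key), and the second stage fits a simple model to that table.

```python
from collections import defaultdict

# Hypothetical raw records (call_id, packet_bytes); illustrative data only.
packets = [
    ("call-a", 100), ("call-a", 300),
    ("call-b", 200),
    ("call-c", 400), ("call-c", 800), ("call-c", 600),
]

def map_phase(records):
    """Emit (key, (count, bytes)) pairs, one per raw record."""
    for call_id, size in records:
        yield call_id, (1, size)

def reduce_phase(pairs):
    """Sum the values per key and return a rectangular table of rows."""
    totals = defaultdict(lambda: [0, 0])
    for key, (n, size) in pairs:
        totals[key][0] += n
        totals[key][1] += size
    return [(key, n, total) for key, (n, total) in sorted(totals.items())]

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Stage 1 (the "freight train"): aggregate to one row per call.
table = reduce_phase(map_phase(packets))

# Stage 2 (the "race car"): model total bytes as a function of packet count.
slope, intercept = fit_line([row[1] for row in table], [row[2] for row in table])
print(table)
print(slope, intercept)
```

In a real deployment the first stage would run as a Map/Reduce job over distributed data, and only the much smaller aggregated table would be handed to the statistical package.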
Ajay- How is it different from MapReduce and Rhipe, the existing R Hadoop packages?
David- They’re complementary. In fact, we’ll soon be publishing a white paper by Saptarshi Guha, author of the Rhipe R/Hadoop integration, showing how he uses Hadoop to process vast volumes of packet-level VOIP data to identify call time/duration from the packets, and then runs a regression on the table of calls using RevoScaleR. There’s a little more detail in this blog post: http://blog.revolutionanalytics.com/2010/08/announcing-big-data-for-revolution-r.html
Ajay- Is it going to be proprietary, free or licensable (open source)?
David- RevoScaleR is a proprietary package, available to paid subscribers (or free to academics) with Revolution R Enterprise. (If you haven’t seen it, you might be interested in this Q&A I did with Matt Shotwell: http://biostatmatt.com/archives/533)
Ajay- Are there any existing client case studies for terabyte-level analysis using R?
David- The VOIP example above gets close, but most of the case studies we’ve seen in beta testing have been in the tens to hundreds of GB range. We’ve tested RevoScaleR on larger data sets internally, but we’re eager to hear about real-life use cases in the terabyte range.
Ajay- How can I use RevoScaleR on my dual-chip Windows Intel laptop for, say, 5 GB of data?
David- One of the great things about RevoScaleR is that it’s designed to work on commodity hardware like a dual-core laptop. You won’t be constrained by the limited RAM available, and the parallel processing algorithms will make use of all cores available to speed up the analysis even further. There’s an example in this white paper (http://info.revolutionanalytics.com/bigdata.html) of doing linear regression on 13 GB of data on a simple dual-core laptop in less than 5 seconds.
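To give a feel for how a regression can run on data larger than RAM, here is a minimal Python sketch of one generic out-of-core approach: read the data in chunks and keep only a handful of running sums. This is an assumption about the general technique, not a description of how RevoScaleR is implemented; the `chunks` generator (which stands in for reading a large file from disk) and `fit_out_of_core` are hypothetical names.

```python
def chunks(n_rows, chunk_size):
    """Yield (xs, ys) one chunk at a time; stands in for reading a big file.

    The synthetic data follow y = 3x + 2 exactly (illustrative only).
    """
    for start in range(0, n_rows, chunk_size):
        xs = list(range(start, min(start + chunk_size, n_rows)))
        ys = [3 * x + 2 for x in xs]
        yield xs, ys

def fit_out_of_core(chunk_iter):
    """Simple linear regression from running sums; memory use is one chunk."""
    n = sx = sy = sxx = sxy = 0
    for xs, ys in chunk_iter:
        n += len(xs)
        sx += sum(xs)
        sy += sum(ys)
        sxx += sum(x * x for x in xs)
        sxy += sum(x * y for x, y in zip(xs, ys))
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept

slope, intercept = fit_out_of_core(chunks(10_000, 1_000))
print(slope, intercept)
```

Because the running sums are simply added together, each core could in principle accumulate them over its own share of the chunks and the partial sums could be combined at the end, which is one way such a computation can use every available core.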
Ajay- Thanks to David Smith for this fast response, and congratulations to him, Saptarshi Guha, Dr. Norman Nie and the rest of the guys at Revolution Analytics on this new product launch.