Big Data and R: New Product Release by Revolution Analytics

A press release from Revolution Analytics, this time claiming to enable terabyte-level analytics with R. Interesting stuff, but the technical details are awaited.

Revolution Analytics Brings Big Data Analysis to R

The world’s most powerful statistics language can now tackle terabyte-class data sets using Revolution R Enterprise at a fraction of the cost of legacy analytics products


JSM 2010 – VANCOUVER (August 3, 2010) — Revolution Analytics today introduced ‘Big Data’ analysis to its Revolution R Enterprise software, taking the popular R statistics language to unprecedented new levels of capacity and performance for analyzing very large data sets. For the first time, R users will be able to process, visualize and model terabyte-class data sets in a fraction of the time of legacy products—without employing expensive or specialized hardware.

The new version of Revolution R Enterprise introduces an add-on package called RevoScaleR that provides a new framework for fast and efficient multi-core processing of large data sets. It includes:

  • The XDF file format, a new binary ‘Big Data’ file format with an interface to the R language that provides high-speed access to arbitrary rows, blocks and columns of data.
  • A collection of widely-used statistical algorithms optimized for Big Data, including high-performance implementations of Summary Statistics, Linear Regression, Binomial Logistic Regression and Crosstabs, with more to be added in the near future.
  • Data Reading & Transformation tools that allow users to interactively explore and prepare large data sets for analysis.
  • Extensibility: expert R users can develop and extend their own statistical algorithms to take advantage of Revolution R Enterprise’s new speed and scalability capabilities.
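The feature list above suggests a workflow along these lines. A minimal sketch: the press release itself names no functions, so the function names below (`rxImport`, `rxSummary`, `rxLinMod`) are assumptions drawn from the RevoScaleR package documentation, not from this announcement:

```r
# Illustrative RevoScaleR workflow; function names are assumed from the
# RevoScaleR documentation rather than this press release.
library(RevoScaleR)

# 1. Convert a large CSV into the binary XDF format, which stores data
#    in blocks for fast access to arbitrary rows, blocks and columns.
rxImport(inData = "airline.csv", outFile = "airline.xdf")

# 2. Compute summary statistics without loading the file into memory.
rxSummary(~ ArrDelay + DepDelay, data = "airline.xdf")

# 3. Fit a linear regression over the full data set, processed block by
#    block across multiple cores.
fit <- rxLinMod(ArrDelay ~ DayOfWeek, data = "airline.xdf")
summary(fit)
```

Because each algorithm streams over XDF blocks rather than pulling everything into RAM, memory use stays roughly constant no matter how many rows the file holds.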

“The R language’s inherent power and extensibility has driven its explosive adoption as the modern system for predictive analytics,” said Norman H. Nie, president and CEO of Revolution Analytics. “We believe that this new Big Data scalability will help R transition from an amazing research and prototyping tool to a production-ready platform for enterprise applications such as quantitative finance and risk management, social media, bioinformatics and telecommunications data analysis.”

Sage Bionetworks is the nonprofit force behind the open-source collaborative effort, Sage Commons, a place where data and disease models can be shared by scientists to better understand disease biology. David Henderson, Director of Scientific Computing at Sage, commented: “At Sage Bionetworks, we need to analyze genomic databases hundreds of gigabytes in size with R. We’re looking forward to using the high-speed data-analysis features of RevoScaleR to dramatically reduce the times it takes us to process these data sets.”

Take Hadoop and Other Big Data Sources to the Next Level

Revolution R Enterprise fits well within the modern ‘Big Data’ architecture by leveraging popular sources such as Hadoop, NoSQL or key value databases, relational databases and data warehouses. These products can be used to store, regularize and do basic manipulation on very large datasets—while Revolution R Enterprise now provides advanced analytics at unparalleled speed and scale: producing speed on speed.

“Together, Hadoop and R can store and analyze massive, complex data,” said Saptarshi Guha, developer of the popular RHIPE R package that integrates the Hadoop framework with R in an automatically distributed computing environment. “Employing the new capabilities of Revolution R Enterprise, we will be able to go even further and compute Big Data regressions and more.”

Platforms and Availability

The new RevoScaleR package will be delivered as part of Revolution R Enterprise 4.0, which will be available for 32- and 64-bit Microsoft Windows in the next 30 days. Support for Red Hat Enterprise Linux (RHEL 5) is planned for later this year.

On its website (http://www.revolutionanalytics.com/bigdata), Revolution Analytics has published performance and scalability benchmarks for Revolution R Enterprise analyzing a 13.2 gigabyte data set of commercial airline information containing more than 123 million rows, and 29 columns.

Additionally, the company will showcase its new Big Data solution in a free webinar on August 25 at 9:00 a.m. Pacific.

Additional Resources

  • Big Data Benchmark whitepaper
  • The Revolution Analytics Roadmap whitepaper
  • Revolutions Blog
  • Download a free academic copy of Revolution R Enterprise
  • Visit Inside-R.org for the most comprehensive set of information on R
  • Spread the word: add a “Download R!” badge on your website
  • Follow @RevolutionR on Twitter

About Revolution Analytics

Revolution Analytics (http://www.revolutionanalytics.com) is the leading commercial provider of software and support for the popular open source R statistics language. Its Revolution R products help make predictive analytics accessible to every type of user and budget. The company is headquartered in Palo Alto, Calif. and backed by North Bridge Venture Partners and Intel Capital.

Media Contact

Chantal Yang
Page One PR, for Revolution Analytics
Tel: +1 415-875-7494

Email:  revolution@pageonepr.com

R Oracle Data Mining

Here is a new package called RODM, an interface for doing data mining on Oracle tables from R. You can read more here http://www.oracle.com/technetwork/database/options/odm/odm-r-integration-089013.html and here http://cran.fhcrc.org/web/packages/RODM/RODM.pdf . There is also a contest for creative use of R and ODM.

R Interface to Oracle Data Mining

The R Interface to Oracle Data Mining (R-ODM) allows R users to access the power of Oracle Data Mining’s in-database functions using the familiar R syntax. R-ODM provides a powerful environment for prototyping data analysis and data mining methodologies.

R-ODM is especially useful for:

  • Quick prototyping of vertical or domain-based applications where the Oracle Database supports the application
  • Scripting of “production” data mining methodologies
  • Customizing graphics of ODM data mining results (examples: classification, regression, anomaly detection)

The R-ODM interface allows R users to mine data using Oracle Data Mining from the R programming environment. It consists of a set of function wrappers written in R that pass data and parameters from the R environment to the Oracle RDBMS Enterprise Edition as standard user PL/SQL queries via an ODBC interface. The R-ODM interface code is a thin layer of logic and SQL that calls through an ODBC interface. R-ODM does not use or expose any Oracle product code; it is completely an external interface and not part of any Oracle product. R-ODM is similar to the example scripts (e.g., the PL/SQL demo code) that illustrate the use of Oracle Data Mining, for example, how to create data mining models, pass arguments, retrieve results, etc.

R-ODM is packaged as a standard R source package and is distributed freely as part of the R environment’s Comprehensive R Archive Network (CRAN). For information about the R environment, R packages and CRAN, see www.r-project.org.
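As a rough sketch of that wrapper pattern, here is what an R-ODM session might look like. The function and argument names below are taken from the RODM package’s CRAN documentation linked above, not from this post, and the DSN, user, password, table and column names are placeholders, so treat the whole thing as illustrative:

```r
# Illustrative R-ODM session; function names assumed from the RODM
# package documentation, connection details are placeholders.
library(RODM)

# Open an ODBC connection to the Oracle database.
DB <- RODM_open_dbms_connection(dsn = "orcl", uid = "rodm", pwd = "rodm")

# Push an R data frame named titanic_train into an Oracle table, then
# fit an in-database classification model on it; the actual computation
# runs inside the Oracle RDBMS, not in R.
RODM_create_dbms_table(DB, "titanic_train")
glm_fit <- RODM_create_glm_model(database = DB,
                                 data_table_name = "titanic_train",
                                 target_column_name = "survived",
                                 mining_function = "classification")

# Clean up the connection when done.
RODM_close_dbms_connection(DB)
```

The point of the design is that only parameters and results cross the ODBC boundary; the data and the mining work stay inside the database.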

and

Present and win an Apple iPod Touch!
The BI, Warehousing and Analytics (BIWA) SIG is giving an Apple iPod Touch to the best new presenter. Be part of the TechCast series and get a chance to win!

Consider highlighting a creative use of R and ODM.

BIWA invites all Oracle professionals (experts, end users, managers, DBAs, developers, data analysts, ISVs, partners, etc.) to submit abstracts for 45 minute technical webcasts to our Oracle BIWA (IOUG SIG) Community in our Wednesday TechCast series. Note that the contest is limited to new presenters to encourage fresh participation by the BIWA community.

Also see an interview with Oracle Data Mining head Charlie Berger: https://decisionstats.wordpress.com/2009/09/02/oracle/

Business Analytics Analyst Relations /Ethics/White Papers

Curt Monash, whom I respect and have tried to interview (unsuccessfully), points out some ethical dilemmas and gray areas in Analyst Relations in Business Intelligence at http://www.dbms2.com/2010/07/30/advice-for-some-non-clients/

If you don’t know what Analyst Relations are: think of them as the credit rating agencies of BI software. Read Curt’s landscaping of the field (I am quoting a summary) at http://www.strategicmessaging.com/the-ethics-of-white-papers/2010/08/01/

Vendors typically pay for white papers because:

  1. They want to connect with sales prospects.
  2. They want general endorsement from the analyst.
  3. They specifically want endorsement from the analyst for their marketing claims.
  4. They want the analyst to do a better job of explaining something than they think they could do themselves.
  5. They want to give the analyst some money to enhance the relationship.

Merv Adrian (I interviewed Merv at http://www.dudeofdata.com/?p=2505) has responded well at http://www.enterpriseirregulars.com/23040/white-paper-sponsorship-and-labeling/

None of the sites I checked clearly identify the work as having been sponsored in any way I found obvious in my (admittedly) quick scan. So this is an issue, but it’s not confined to Oracle.

My 2 cents (not being so well paid 😉) are:

I think Curt was calling out Oracle (which didn’t respond) and not Merv (whose subsequent blog post does much to clarify).

As a comparatively new/younger blogger in this field, I applaud both Curt for trying to bell the cat (or point out what everyone in AR winks at) and Merv for standing by him.

In the long run, it would strengthen analyst relations as a channel if they separated financial payment for content from bias in content. Credit rating agencies forgot to do so in BFSI, and we saw what happened.

Customers invest millions of dollars in BI systems trusting marketing collateral/white papers/webinars/tests etc. Perhaps it’s time for an industry association for analysts so that individual analysts don’t knuckle down under vendor pressure.

It is easier for someone of Curt’s or Merv’s stature to declare editing policy and disclosures before they write a white paper. It is much harder for everyone else who is not so well established.

White papers can cost as much as $25,000 to produce, and I know people in Business Analytics (as opposed to Business Intelligence) who slog for cents per hour cranking out books on R and SAS, webinars and trainings; yet there are almost no white papers in BA. Are there any independent analytics analysts who are not biased toward R or SAS or SPSS, etc.? I am not sure, but this looks like a good line to pursue 😉, provided ethical checks and balances are established.

Personally, I know of many so-called analytics communities that go all out to please their sponsors, so bias in writing does exist (you can’t praise SAS on an R blogging forum or at an R users’ meet, and you can’t write about WPS at SASCommunity.org).

At the same time, as someone once told me, it is tough to make a living as a writer, and that choice between easy money and credible writing needs to be respected.

Most sponsored white papers I read are pure advertisements, directed at CEOs rather than the techie community at large.

Almost every BI vendor claims to have the fastest database with 5X speed; benchmarking in technical terms is something they could do too.

Unlike gadget sites, which benchmark products freely, you cannot benchmark BI or even BA products, as many licensing terms explicitly forbid doing so.

Probably that is the reason billions are spent on BI while the positive claims remain doubtful (except to the sellers). Similarly, in Analytics, many vendors would have difficulty justifying their claims or prices if subjected to a side-by-side comparison. Unfortunately, the resulting confusion lets shoddy technology come out stronger on the back of more aggressive marketing.

More R please

Some R news:

0. The R Foundation website. I guess the http://www.r-project.org team is busy prettifying the site before the annual R users’ conference kicks in (I was told it has the aesthetic visual appeal of a dead cat splattered on the autobahn, a very HTML 4.0 kind of retro look).

I can’t believe the R site and R core honchos find the following image the prettiest one to represent the graphical abilities of R.

The R core site has tremendous functionality and demand, though I wonder if they could just put up some ads to get some funding, or a two-way research tie-up with Google. Google uses R extensively, can help with online methods as well, and is listed as a supporting organization at http://www.r-project.org/foundation/memberlist.html

The R archives are a collection of emails, and that’s not documentation at all. But:

1. The Revolution R website, and particularly David Smith’s blog, is a great way to stay updated on R news: http://blog.revolution-computing.com/

I have covered Revolution R before, and they are truly impressive:

http://www.decisionstats.com/2009/01/31/interviewrichard-schultz-ceo-revolution-computing/

It seems the domain name revolutioncomputing.com was squatted (by NC?), so that’s why the hyphenated web name. It is a very lucid website, though I do request that they put up more video/podcasts; a “Tweet this” button would be great too :))

And another, more techie, post here:

http://blog.revolution-computing.com/2009/05/verifying-zipfs-powerdistribution-law-for-cities.html

Another great source is Twitter: R users there use the hashtag #rstats to share and search for R news and code, which should help R bloggers and, at a later date, users.

Click here to check it out:

http://search.twitter.com/search?q=%23rstats

2. Some more R forums and sites

Forum for R Enterprise Users http://www.revolution-computing.com/forum

An R tips site: http://onertipaday.blogspot.com/

The R Journal (yes, there is a journal for all hard-working R fans): http://journal.r-project.org/

R on Linkedin http://www.linkedin.com/groups?about=&gid=77616

and the Analytic Bridge community group for R

http://www.analyticbridge.com/group/rprojectandotherfreesoftwaretools

3. Here is a terrific post by Robert Grossman

at http://blog.rgrossman.com/2009/05/17/running-r-on-amazons-ec2/

I liked the way he built the case for using R on Amazon EC2 (business case, not use case) and then proceeded to a step-by-step tutorial: a simple and powerful blog post. I hope R comes out with a standardized online R doc like that, a single-point searchable archive for code, something like the SAS online doc (which remains free for WPS users 😉), but the way the web is evolving, it seems the present mish-mash method will continue.

Here are the main steps to use R on a pre-configured AMI.

Set up.
The setup needs to be done just once.

1. Set up an Amazon Web Services (AWS) account by going to:

aws.amazon.com.

If you already have an Amazon account for buying books and other items from Amazon, then you can use this account also for AWS.
2. Log in to the AWS console.
3. Create a key pair by clicking on the link “Key Pairs” in the Configuration section of the Navigation Menu on the left-hand side of the AWS console page.
4. Click on the “Create Key Pair” button, about a quarter of the way down the page.
5. Name the key pair and save it to a working directory, say /home/rlg/work.

Launching the AMI. These steps are done whenever you want to launch a new AMI.

1. Log in to the AWS console. Click on the Amazon EC2 tab.
2. Click the “AMIs” button under the “Images and Instances” section of the left navigation menu of the AWS console.
3. Enter “opendatagroup” in the search box and select the AMI labeled
“opendatagroup/r-timeseries.manifest.xml”, which
is AMI instance “ami-ea846283”.
4. Enter the number of instances to launch (1), the name of the key pair that you have previously created, and select “web server” for the security group. Click the launch button to launch the AMI. Be sure to terminate the AMI when you are done.
5. Wait until the status of the AMI is “running.” This usually takes about 5 minutes.

Accessing the AMI.

1. Get the public IP address of the new AMI. The easiest way to do this is to select the AMI by checking its box, which displays some additional information about the AMI at the bottom of the window. You can copy the IP address from there.
2. Open a console window and cd to the working directory that contains the key pair you previously downloaded.
3. Type the command:
ssh -i testkp.pem -X root@ec2-67-202-44-197.compute-1.amazonaws.com

Here we assume that the name of the key pair you created is “testkp.pem”. The flag “-X” starts a session that supports X11. If you don’t have X11 on your machine, you can still log in and use R, but the graphics in the example below won’t be displayed on your computer.

Using R on the AMI.

1. Change your directory and start R

#cd examples
#R
2. Test R by entering a R expression, such as:

> mean(1:100)
[1] 50.5
>
3. From within R, you can also source one of the example scripts to see some time series computations:

> source('NYSE.r')
4. After a minute or so, you should see a graph on your screen. After the graph is finished being drawn, you should see a prompt:

CR to continue

Enter a carriage return and you should see another graph. You will need to enter a carriage return 8 times to complete the script (you can also choose to break out of the script if you get bored with all the graphs).
5. When you are done, exit your R session with a Control-D. Exit your ssh session with an “exit” and terminate your AMI from the Amazon AWS console. You can also choose to leave your AMI running (it is only a few dollars a day).

Acknowledgements: Steve Vejcik from Open Data Group wrote the R scripts and configured the AMI.

Terrific R companies, blogs, tweets, research and sites, but do let me know your feedback. Just another R day.

Ajay