Is Python going to be better than R for Big Data Analytics and Data Science? #rstats #python

Until now, the R ecosystem of package developers has mostly shrugged off the Big Data question. In a fascinating insight, Hadley Wickham said the following in a recent interview; shockingly, it mimics the FUD that you-know-who has been accused of (source: https://peadarcoyle.wordpress.com/2015/08/02/interview-with-a-data-scientist-hadley-wickham/):

5. How do you respond when you hear the phrase ‘big data’?

Big data is extremely overhyped and not terribly well defined. Many people think they have big data, when they actually don’t.

I think there are two particularly important transition points:

* From in-memory to disk. If your data fits in memory, it’s small data. And these days you can get 1 TB of ram, so even small data is big!

* From one computer to many computers.

R is a fantastic environment for the rapid exploration of in-memory data, but there’s no elegant way to scale it to much larger datasets. Hadoop works well when you have thousands of computers, but is incredibly slow on just one machine. Fortunately, I don’t think one system needs to solve all big data problems.

To me there are three main classes of problem:

1. Big data problems that are actually small data problems, once you have the right subset/sample/summary.

2. Big data problems that are actually lots and lots of small data problems.

3. Finally, there are irretrievably big problems where you do need all the data, perhaps because you are fitting a complex model. An example of this type of problem is recommender systems.
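The first two classes can be made concrete with a small sketch. It is not from the interview: the function names `reservoir_sample` and `per_group_mean` and the toy data are illustrative assumptions, and only the Python standard library is used. The first function draws a fixed-size uniform sample from a stream that is too large to hold in memory (class 1). The second treats a large stream as many small per-group problems and keeps only a running sum and count per group (class 2).

```python
import random
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Return k items chosen uniformly at random from a stream of unknown length.

    Only k items are ever held in memory (reservoir sampling, "Algorithm R").
    """
    rng = random.Random(seed)
    sample: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


def per_group_mean(records: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Compute one mean per group while streaming over (group, value) pairs."""
    sums: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for group, value in records:
        sums[group] += value
        counts[group] += 1
    return {g: sums[g] / counts[g] for g in sums}


# Class 1: a "big" stream reduced to a small sample that fits in memory.
sample = reservoir_sample(range(100_000), k=1_000)

# Class 2: many small problems, one per group (hypothetical toy data).
means = per_group_mean([("a", 1.0), ("b", 10.0), ("a", 3.0), ("b", 30.0)])
print(len(sample), means)
```

Neither function helps with the third class, where the model genuinely needs all of the data at once.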

Ajay: One of the reasons R Big Data packages have not been developed is that it takes money. The private sector in the R ecosystem is a duopoly: Revolution Analytics (acquired by Microsoft) and RStudio (created by Microsoft alum JJ Allaire). Since RStudio as a company actively tries NOT to step into areas that Revolution Analytics works in, it has, in my opinion, stayed out of Big Data for strategic reasons.

Revolution Analytics’ RHadoop project is actually just one consultant working on it (https://github.com/RevolutionAnalytics/RHadoop), and it has not been updated in six months.

We interviewed the creator of RHadoop here: https://decisionstats.com/2014/07/10/interview-antonio-piccolboni-big-data-analytics-rhadoop-rstats/

Python developers, however, have been actively building systems for Big Data. The Hadoop and Python ecosystems are also much more FOSS-friendly, even in enterprise solutions.

This is where Python is innovating over R in Big Data:

http://blaze.pydata.org/en/latest/

  • Blaze: Translates NumPy/Pandas-like syntax to systems like databases.

    Blaze presents a pleasant and familiar interface to us regardless of what computational solution or database we use. It mediates our interaction with files, data structures, and databases, optimizing and translating our query as appropriate to provide a smooth and interactive session.

  • Odo: Migrates data between formats.

    Odo moves data between formats (CSV, JSON, databases) and locations (local, remote, HDFS) efficiently and robustly with a dead-simple interface by leveraging a sophisticated and extensible network of conversions. http://odo.pydata.org/en/latest/perf.html

    odo takes two arguments, a source and a target for a data transfer.

    >>> from odo import odo
    >>> odo(source, target)  # load source into target

  • Dask.array: Multi-core / on-disk NumPy arrays

    Dask.arrays provide blocked algorithms on top of NumPy to handle larger-than-memory arrays and to leverage multiple cores. They are a drop-in replacement for a commonly used subset of NumPy algorithms.

  • DyND: In-memory dynamic arrays

    DyND is a dynamic ND-array library like NumPy. It supports variable length strings, ragged arrays, and GPUs. It is a standalone C++ codebase with Python bindings. Generally it is more extensible than NumPy but also less mature.  https://github.com/libdynd/libdynd

    The core DyND developer team consists of Mark Wiebe and Irwin Zaid. Much of the funding that made this project possible came through Continuum Analytics and DARPA-BAA-12-38, part of XDATA.

    LibDyND, a component of the Blaze project, is a C++ library for dynamic, multidimensional arrays. It is inspired by NumPy, the Python array programming library at the core of the scientific Python stack, but tries to address a number of obstacles encountered by some of its users. Examples of this are support for variable-sized string and ragged array types. The library is in a preview development state, and can be thought of as a sandbox where features are being tried and tweaked to gain experience with them.

    C++ is a first-class target of the library; the intent is that all its features should be easily usable in the language. This has many benefits, such as that development within LibDyND using its own components is more natural than in a library designed primarily for embedding in another language.

    This library is being actively developed together with its Python bindings.

http://dask.pydata.org/en/latest/

On a single machine dask increases the scale of comfortable data from fits-in-memory to fits-on-disk by intelligently streaming data from disk and by leveraging all the cores of a modern CPU.

Users interact with dask either by making graphs directly or through the dask collections which provide larger-than-memory counterparts to existing popular libraries:

  • dask.array = numpy + threading
  • dask.bag = map, filter, toolz + multiprocessing
  • dask.dataframe = pandas + threading

Dask primarily targets parallel computations that run on a single machine. It integrates nicely with the existing PyData ecosystem and is trivial to set up and use:

conda install dask
or
pip install dask
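The idea of moving from fits-in-memory to fits-on-disk by streaming blocks can be illustrated without dask. The sketch below is not dask's API: `read_blocks` and `blocked_mean` are hypothetical helper names, and the data file is a temporary file created only for the demo. It reads a file one block at a time and combines per-block partial results into a global mean, so at most one block is held in memory.

```python
import itertools
import os
import tempfile
from typing import Iterable, Iterator, List


def read_blocks(path: str, block_size: int) -> Iterator[List[float]]:
    """Yield successive blocks of numbers from a one-number-per-line text file."""
    with open(path) as fh:
        while True:
            # islice pulls at most block_size lines, so memory use stays bounded.
            block = [float(line) for line in itertools.islice(fh, block_size)]
            if not block:
                return
            yield block


def blocked_mean(blocks: Iterable[List[float]]) -> float:
    """Combine per-block sums and counts into the mean of the whole dataset."""
    total, count = 0.0, 0
    for block in blocks:
        total += sum(block)
        count += len(block)
    if count == 0:
        raise ValueError("no data")
    return total / count


# Demo: write 10,000 numbers to disk, then process them 1,000 at a time.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "numbers.txt")
    with open(path, "w") as fh:
        for i in range(10_000):
            fh.write(f"{i}\n")
    result = blocked_mean(read_blocks(path, block_size=1_000))

print(result)  # 4999.5
```

Dask goes further than this sketch: as described above, it expresses such blockwise work as graphs and can run the blocks on multiple cores.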

https://github.com/cloudera/ibis

When open source fights, closed source wins. When the Jedi fight, the Sith Lords win.

So will R people rise to the Big Data challenge, or will they bury their heads in the sand like an ostrich or a kiwi? Will Python people learn from R’s design philosophies and try to incorporate more of them without reinventing the wheel?

Converting code from one language to another automatically?

How I wish there were some kind of automated conversion tool that would convert a CRAN R package into a standard, pip-installable Python package.

Machine learning for more machine learning, anyone?

Revolution Analytics and Pricing Analytics

Cost of 1 day of Revolution Analytics Training at http://www.revolutionanalytics.com/services/training/

 

1. Intro to R

Price: Commercial: SGD$500.00
Academic: SGD$350.00

1 Singapore dollar = 0.8197 US dollars

10% Early Bird Discount Deadline: November 13, 2012 @ 12:00PM Pacific Time
Discount code: earlybird

2. Minimalistic Sufficient R (aptly titled; you would think the pricing would be minimalistic too, but...)

http://www.revolutionanalytics.com/services/training/public/minimalist-sufficient-r.php

Price: $750

$100 Early Bird Discount Deadline: November 16, 2012 @ 12:00PM Pacific Time
Discount code: earlybird

3. Advanced R (Italian)

Price: Commercial: €680.00
Academic: €480.00

1 euro = 1.2975 US dollars

4. Big Data Analytics with RevoScaleR

Price: $500 with a 2-month Revolution R Enterprise workstation evaluation.

$700 with a 1-year subscription to Revolution R Enterprise workstation ($1500 value)

10% Early Bird Discount Deadline: October 30, 2012 @ 12:00PM Pacific Time
Discount code: early

5. Revolution R Time Series Training

Price: Commercial: S$1,200.00
Academic: S$750.00

10% Early Bird Discount Deadline: October 30, 2012 @ 12:00PM Pacific Time
Discount code: earlybird

So training costs vary; different strokes for different folks, I guess.

BUT, me hearties:

Cost of 1 year of Revolution R Enterprise = $1,000

That’s a flat rate, so Linux and Windows cost the same, and so do 32-bit and 64-bit.

(see http://buy.revolutionanalytics.com/ )
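To compare the training prices with the license price on one scale, here is a rough conversion of the commercial prices to US dollars. It uses the two exchange rates quoted above and makes two assumptions: the plain "$" prices are US dollars, and the "S$" price uses the same Singapore dollar rate. Names such as `to_usd` and `COURSES` exist only for this sketch.

```python
from decimal import Decimal, ROUND_HALF_UP

# Exchange rates as quoted in the post (US dollars per unit of currency).
USD_PER = {
    "SGD": Decimal("0.8197"),
    "EUR": Decimal("1.2975"),
    "USD": Decimal("1"),  # assumption: plain "$" prices are US dollars
}

# (course, currency, commercial price) as listed above.
COURSES = [
    ("Intro to R", "SGD", Decimal("500")),
    ("Minimalistic Sufficient R", "USD", Decimal("750")),
    ("Advanced R (Italian)", "EUR", Decimal("680")),
    ("Big Data Analytics with RevoScaleR", "USD", Decimal("500")),
    ("Revolution R Time Series Training", "SGD", Decimal("1200")),
]

LICENSE_USD = Decimal("1000")  # one-year single-user workstation license


def to_usd(amount: Decimal, currency: str) -> Decimal:
    """Convert an amount to US dollars, rounded to cents."""
    usd = amount * USD_PER[currency]
    return usd.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


prices_usd = {name: to_usd(price, cur) for name, cur, price in COURSES}

for name, usd in prices_usd.items():
    # Express each course as a share of the annual license price.
    share = (usd / LICENSE_USD * 100).quantize(Decimal("1"), rounding=ROUND_HALF_UP)
    print(f"{name}: ${usd} ({share}% of a one-year license)")
```

On these figures the time series course comes to about $983.64, which is nearly the price of a full year of the software.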

(My comment: Revo should either give away the license for free to enterprises and rationalize training costs (seriously, how can 2 days of training cost about as much as 1 year of a license, when the software is definitely quite good?), or create a paid Amazon EC2 AMI for enterprises to rent the Revolution Analytics software (like SAP HANA), or even offer it on Windows Azure if they insist on hugging Microsoft. That said, I am clearly seeing various flavors of Linux beating Windows Server to a pulp in the Big Data market. I am probably more optimistic about Windows 8 on Surface, but because of the hardware, not the software, and about Azure as an alternative to Amazon given Google’s delayed offering. I don’t even know of many instances of Windows-related HPC or HPA. /end_of_rant)

Annual Subscription (includes software license and technical support)

Revolution R Enterprise Single-User Workstation (64-bit Windows): $1,000.00
Revolution R Enterprise Single-User Workstation (32-bit Windows): $1,000.00
Revolution R Enterprise Single-User Workstation (64-bit Red Hat 6 Enterprise Linux): $1,000.00
Revolution R Enterprise Single-User Workstation (64-bit Red Hat 5 Enterprise Linux): $1,000.00

 

Online Education- MongoDB and Oracle R Enterprise

I really liked the course developed by 10gen for MongoDB (there are two tracks, for Developers and DBAs, at https://education.10gen.com/).

The interface is very nice and is a step up from Coursera’s (https://www.coursera.org/) pioneering work (and even from http://www.codecademy.com/#!/exercises/0): each video has a small question, the videos are not cluttered, and the voice and transcription quality is impeccable. Lastly, a certificate for people who clear 65% acts as an academic incentive.

Yes, it is free.

 

Oracle recently launched a series of nicely made R tutorials at https://apex.oracle.com/pls/apex/f?p=44785:24:0::NO::P24_CONTENT_ID,P24_PREV_PAGE:6528,1 but I wish Oracle R had some certifications too!

If only more techie companies like SAS Institute (expensive SAS training), IBM (cluttered website), Revolution Analytics (expensive certification partners), and Google (unpolished Python lectures) put effort into polished e-learning interfaces rather than depending on external partners or internal gurus. Interfaces matter, especially in education.

Well, anyways!!

Happy MongoDBing / Oracle R!

 

JMP Student Edition

I really liked the initiatives at JMP/Academic. They offer the software bundled with a textbook, which is both good common sense and good business sense, given how fast students can get confused.

(Rant 1: Bundling with textbooks is something I think Revolution Analytics should consider, instead of just offering the academic version as a free download. It would be interesting to compare the penetration of the R academic market by Revolution’s version and by the open source version under the existing strategy.)

From http://www.jmp.com/academic/textbooks.shtml

Major publishers of introductory statistics textbooks offer a 12-month license to JMP Student Edition, a streamlined version of JMP, with their textbooks.

A glance through http://www.jmp.com/academic/pdf/jmp_se_comparison.pdf shows that it is a credible version, not an extremely whittled-down one, which would be just dishonest.

And I loved this Reference Card at http://www.jmp.com/academic/pdf/jmp10_se_quick_guide.pdf

 

Oracle, SAP HANA, Revolution Analytics and even SAS/STAT itself could make more reference cards like this: elegant solutions for students and new learners!

More creative rants: honestly, why do corporate sites still use PDFs when they can use Instapaper, or any of the SlideShare/Scribd formats, to show information in a better way without diverting the user from the main webpage?

But I digress; back to JMP.

 

Resources for Faculty Using JMP® Student Edition

Faculty who select a JMP Student Edition bundle for their courses may be eligible for additional resources, including course materials and training.

Special JMP® Student Edition for AP Statistics

JMP Student Edition is available in a convenient five-year license for qualified Advanced Placement statistics programs.

Try and have a look yourself at http://www.jmp.com/academic/student.shtml

 

 

 

RevoDeployR and commercial BI using R and R based cloud computing using Open CPU

Revolution Analytics has of course had RevoDeployR, and in a webinar it strives to bring it back into the spotlight.

BI is a good, lucrative market, and visualization is a strength of R, so it is a matter of time before we have more R-based BI solutions. I really liked the two slides below for explaining RevoDeployR better to newbies like me (and many others!)

Integrating R into 3rd party and Web applications using RevoDeployR


 

(I still think someone should make a commercial version of Jeroen Ooms’s web interfaces and Jeff Horner’s web infrastructure (see below) for building customized Business Intelligence (BI) / data visualization solutions; UCLA and Vanderbilt are not exactly Stanford when it comes to deploying great academic solutions in the startup-tech world.) I kind of think Google or someone at Revolution should at least take a look at OpenCPU as a credible cloud solution in R.

I still can’t figure out whether Revolution Analytics has a cloud computing strategy, and Google seems to be working as mysteriously as usual on broadening access to the Google Compute Cloud for the rest of the R community.

OpenCPU provides a free and open platform for statistical computing in the cloud. It is meant as an open, social analysis environment where people can share and run R functions and objects. For more details, visit the website: www.opencpu.org

and especially see

https://public.opencpu.org/userapps/opencpu/opencpu.demo/runcode/

Jeff Horner’s

http://rapache.net/

Jeroen Ooms’s

Online Education takes off

Udacity is a smaller player but welcome competition to Coursera. I think companies that have on-demand learning programs should consider donating a course to these online education players (for example SAS Institute for SAS, Revolution Analytics for R, and SAP or Oracle for in-memory analytics).

Any takers?

http://www.udacity.com/

 

Coursera is doing a superb job with a huge number of free courses from notable professors. 111 courses!

I am of course partial to the 7 courses that are related to my field:

https://www.coursera.org/

 

 

Teradata Analytics

A recent announcement shows Teradata partnering with KXEN and Revolution Analytics for Teradata Analytics.

http://www.teradata.com/News-Releases/2012/Teradata-Expands-Integrated-Analytics-Portfolio/

The Latest in Open Source Emerging Software Technologies
Teradata provides customers with two additional open source technologies – “R” technology from Revolution Analytics for analytics and GeoServer technology for spatial data offered by the OpenGeo organization – both of which are able to leverage the power of Teradata in-database processing for faster, smarter answers to business questions.

In addition to the existing world-class analytic partners, Teradata supports the use of the evolving “R” technology, an open source language for statistical computing and graphics. “R” technology is gaining popularity with data scientists who are exploiting its new and innovative capabilities, which are not readily available. The enhanced “R add-on for Teradata” has a 50 percent performance improvement, it is easier to use, and its capabilities support large data analytics. Users can quickly profile, explore, and analyze larger quantities of data directly in the Teradata Database to deliver faster answers by leveraging embedded analytics.

Teradata has partnered with Revolution Analytics, the leading commercial provider of “R” technology, because of customer interest in high-performing R applications that deliver superior performance for large-scale data. “Our innovative customers understand that big data analytics takes a smart approach to the entire infrastructure and we will enable them to differentiate their business in a cost-effective way,” said David Rich, chief executive officer, Revolution Analytics. “We are excited to partner with Teradata, because we see great affinity between Teradata and Revolution Analytics – we embrace parallel computing and the high performance offered by multi-core and multi-processor hardware.”

and

The Teradata Data Lab empowers business users and leading analytic partners to start building new analytics in less than five minutes, as compared to waiting several weeks for the IT department’s assistance.

“The Data Lab within the Teradata database provides the perfect foundation to enable self-service predictive analytics with KXEN InfiniteInsight,” said John Ball, chief executive officer, KXEN. “Teradata technologies, combined with KXEN’s automated modeling capabilities and in-database scoring, put the power of predictive analytics and data mining directly into the hands of business users. This powerful combination helps our joint customers accelerate insight by delivering top-quality models in orders of magnitude faster than traditional approaches.”

Read more at

http://www.sacbee.com/2012/03/06/4315500/teradata-expands-integrated-analytics.html