Is R going to be better than Python for Big Data Analytics and Data Science? #rstats #python

My last article seems to have touched a nerve or two, judging by the 2000 views I got in a single day, on a Sunday (which was also India’s Independence Day and V-J Day). Here I am simply reproducing the unedited and very interesting comments I got, along with an interesting R package.

On Google Plus, there is a vibrant community for R and Statistics. Yes, Google Plus still exists 😉 The following excellent comment makes you think.

This is pretty much a ho-hum topic with me. I don’t find this article very convincing. If you like Python, fine! Use Python. The problem I have with Python is that it is an interpreted language. Anything written in pure Python is going to take a long time to run on a big data set. Sure, there are Python packages for data analysis that run quickly, but you either have to depend on what someone else provides or develop your own package in compiled code.

I’ve found most software apps written specifically for “big data” to be very limited: a lot of them begin and end at N/N (pretty old hat now and inferior to a number of other methods for many analyses). If you can’t look under the hood and see what goes on in an analysis package, well, then good luck to you if you want to use it, but don’t expect me to.

So far I’ve found that R works well for the large data sets I work with. (I’ll leave aside the issue of graphics for now; I have yet to see anything else that can hold a candle to R in that regard.) If the base packages that come with R can’t do a particular task I’ll first search among the over 5,000 packages currently available on CRAN. If that doesn’t work I’ll send a request to the R help list server. If that doesn’t work I’ll write my own routine in C or C# (I prefer the latter). BTW, if you are in the data analysis game you need to know enough to be able to do your own numerical analysis programming, say at the level of Numerical Recipes. Otherwise you are going to be overly dependent on someone else to provide software for you.

I’m not writing this to persuade anyone to pick one over the other. It’s just that there are a lot of possible choices out there — it’s not just R vs Python. And I’m just tired of these endless debates that go nowhere. As we say in the software engineering world: don’t try to convince the other person that your text editor/IDE/programming language is better than theirs.

---

Anthony, the creator of RHadoop, was kind enough to not only write a comment here but also provide a tech solution AND throw a challenge at all Pythonistas.

The lack of activity on rmr2 reflects maturity of the package and a shift away from Hadoop mapreduce toward spark. Please check the dplyr.spark package on github. It’s the easiest way to run spark bar none, including python, in its author’s very biased opinion. Example: find the best and worst flight by arrival delay on each day:

group_by(flights, year, month, day) %>%
select(flight, arr_delay) %>%
filter(arr_delay == min(arr_delay) | arr_delay == max(arr_delay))

Runs on spark, scales to whatever your cluster can store. Please show me the equivalent in any other language, python included. I am waiting.
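
If you do not have a cluster handy, the same pipeline can be tried on an ordinary local data frame. What follows is only a minimal sketch: the toy flights table and the name extremes are made up for illustration and are not part of Anthony's comment. It uses the vectorized `|` operator, since on a local data frame the comparison has to be evaluated row by row within each day.

```r
library(dplyr)

# Hypothetical toy data standing in for the real flights table (an assumption)
flights <- data.frame(
  year = 2013, month = 1, day = c(1, 1, 1, 2, 2, 2),
  flight = c(101, 102, 103, 201, 202, 203),
  arr_delay = c(5, -10, 30, 0, 45, -3)
)

# For each day, keep only the flights with the smallest and largest arrival delay
extremes <- flights %>%
  group_by(year, month, day) %>%
  select(flight, arr_delay) %>%   # dplyr keeps the grouping columns automatically
  filter(arr_delay == min(arr_delay) | arr_delay == max(arr_delay)) %>%
  ungroup()

print(extremes)
```

With this toy data each day contributes two rows, one for the minimum delay and one for the maximum. Ties would return more than two rows per day.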

And finally, after all that violence and double talk (as Dire Straits sang in “Walk of Life”), here is the R package that will beat all packages on Big Data (apparently).

Download Spark and build it as follows:

cd <spark root>
build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests -Phive -Phive-thriftserver clean package

Then start the Thrift service.

sbin/start-thriftserver.sh

Next, in R, install the required packages:

install.packages(c("RJDBC", "dplyr", "DBI", "devtools"))
devtools::install_github("hadley/purrr")

Indirectly RJDBC needs rJava. Make sure that you have rJava working with:

library(rJava)
.jinit()
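
A slightly more talkative check is sketched below. It assumes rJava is installed and a JVM can be found on your machine. The variable names status and java_version are mine, and a clean start here is only a necessary condition for RJDBC to work, not a guarantee.

```r
library(rJava)

# .jinit() starts (or attaches to) the JVM; a negative return value indicates failure
status <- .jinit()

# Call java.lang.System.getProperty("java.version") to confirm the JVM responds
java_version <- .jcall("java/lang/System", "S", "getProperty", "java.version")
print(java_version)
```
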

Then install dplyr.spark itself from the release archive:

install.packages("devtools")
library(devtools)
install_url(
  "https://github.com/RevolutionAnalytics/dplyr-spark/releases/download/0.2.2/dplyr.spark_0.2.2.tar.gz")

library(dplyr)
library(dplyr.spark)

# Connect to the Thrift server started above; the host name must be a quoted string
spark.src = src_SparkSQL("localhost", 10000)

Author: Ajay Ohri

http://about.me/ajayohri
