My last articles seems to have touched a nerve or two judging by the 2000 views I got in a single day on a Sunday ( and India’s national Independence Day / and V-J Day). Here I am simply reproducing the unedited and very interesting comments I got with an interesting R package.
On Google Plus, there is a vibrant community for R and Statistics. Yes Google plus exists still 😉 The following excellent comment makes you think.
This is pretty much a ho-hum topic with me. I don’t find this article very convincing. If you like Python, fine! Use Python. The problem I have with Python is that it is an interpreted language. Anything written in pure Python is going to take a long time to run on a big data set. Sure, there are Python packages for data analysis that run quickly, but you either have to depend on what someone else provides or develop your own package in compiled code.
I’ve found most software apps written specifically for “big data” to be very limited: a lot of them begin and end at N/N (pretty old hat now and inferior to a number of other methods for many analyses). If you can’t look under the hood and see what goes on in an analysis package, well, then good luck to you if you to use it, but don’t expect me to.
So far I’ve found that R works well for the large data sets I work with. (I’ll leave aside the issue of graphics for now; I have yet to see anything else that can hold a candle to R in that regard.) If the base packages that come with R can’t do a particular task I’ll first search among the over 5,000 packages currently available on CRAN. If that doesn’t work I’ll send a request to the R help list server. If that doesn’t work I’ll write my own routine in C or C# (I prefer the latter). BTW, if you are in the data analysis game you need to know enough to be able to do your own numerical analysis programming, say at the level of Numerical Recipes. Otherwise you are going to be overly dependent on someone else to provide software for you.
I’m not writing this to persuade anyone to pick one over the other. It’s just that there are a lot of possible choices out there — it’s not just R vs Python. And I’m just tired of these endless debates that go nowhere. As we say in the software engineering world: don’t try to convince the other person that your text editor/IDE/programming language is better than theirs.
download spark and build it as follows
cd <spark root> build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests -Phive -Phive-thriftserver clean package
Then start the thift service.
install.packages(c("RJDBC", "dplyr", "DBI", "devtools")) devtools::install_github("hadley/purrr")
rJava. Make sure that you have
rJava working with:
spark.src = src_SparkSQL(“localhost“, “10000“)
The lack of activity on rmr2 reflects maturity of the package and a shift away from Hadoop mapreduce toward spark. Please check the dplyr.spark package on github. It’s the easiest way to run spark bar none, including python, in its author very biased opinion. Example: find the best and worst flight by arrival delay on each day:
group_by(flights, year, month, day) %>%
select(flight, arr_delay) %>%
filter(arr_delay == min(arr_delay) || arr_delay == max(arr_delay))
Runs on spark, scales to whatever your cluster can store. Please show me the equivalent in any other language, python included. I am waiting.