Uptil now the R ecosystem of package developers has mostly shrugged away the Big Data question. In a fascinating insight Hadley Wickham said this in a recent interview- shockingly it mimicks the FUD you know who has been accused of ( source
I think there are two particularly important transition points:
* From in-memory to disk. If your data fits in memory, it’s small data. And these days you can get 1 TB of ram, so even small data is big!
* From one computer to many computers.
R is a fantastic environment for the rapid exploration of in-memory data, but there’s no elegant way to scale it to much larger datasets. Hadoop works well when you have thousands of computers, but is incredible slow on just one machine. Fortunately, I don’t think one system needs to solve all big data problems.
To me there are three main classes of problem:
1. Big data problems that are actually small data problems, once you have the right subset/sample/summary.
2. Big data problems that are actually lots and lots of small data problems
3. Finally, there are irretrievably big problems where you do need all the data, perhaps because you fitting a complex model. An example of this type of problem is recommender systems
Ajay- One of the reasons of non development of R Big Data packages is- it takes money. The private sector in R ecosystem is a duopoly ( Revolution Analytics ( acquired by Microsoft) and RStudio (created by Microsoft Alum JJ Allaire). Since RStudio actively tries as a company to NOT step into areas Revolution Analytics works in- it has not ventured into Big Data in my opinion for strategic reasons.
Revolution Analytics project on RHadoop is actually just one consultant working on it here https://github.com/RevolutionAnalytics/RHadoop and it has not been updated since six months
We interviewed the creator of R Hadoop here https://decisionstats.com/2014/07/10/interview-antonio-piccolboni-big-data-analytics-rhadoop-rstats/
However Python developers have been trying to actually develop systems for Big Data actively. The Hadoop ecosystem and the Python ecosystem are much more FOSS friendly even in enterprise solutions.
This is where Python is innovating over R in Big Data-
Blaze: Translates NumPy/Pandas-like syntax to systems like databases.
Blaze presents a pleasant and familiar interface to us regardless of what computational solution or database we use. It mediates our interaction with files, data structures, and databases, optimizing and translating our query as appropriate to provide a smooth and interactive session.
Odo: Migrates data between formats.
Odo moves data between formats (CSV, JSON, databases) and locations (local, remote, HDFS) efficiently and robustly with a dead-simple interface by leveraging a sophisticated and extensible network of conversions. http://odo.pydata.org/en/latest/perf.html
odotakes two arguments, a target and a source for a data transfer.
>>> from odo import odo >>> odo(source, target) # load source into target
Dask.array: Multi-core / on-disk NumPy arrays
Dask.arrays provide blocked algorithms on top of NumPy to handle larger-than-memory arrays and to leverage multiple cores. They are a drop-in replacement for a commonly used subset of NumPy algorithms.
DyND: In-memory dynamic arrays
DyND is a dynamic ND-array library like NumPy. It supports variable length strings, ragged arrays, and GPUs. It is a standalone C++ codebase with Python bindings. Generally it is more extensible than NumPy but also less mature. https://github.com/libdynd/libdynd
LibDyND, a component of the Blaze project, is a C++ library for dynamic, multidimensional arrays. It is inspired by NumPy, the Python array programming library at the core of the scientific Python stack, but tries to address a number of obstacles encountered by some of its users. Examples of this are support for variable-sized string and ragged array types. The library is in a preview development state, and can be thought of as a sandbox where features are being tried and tweaked to gain experience with them.
C++ is a first-class target of the library, the intent is that all its features should be easily usable in the language. This has many benefits, such as that development within LibDyND using its own components is more natural than in a library designed primarily for embedding in another language.
This library is being actively developed together with its Python bindings,
On a single machine dask increases the scale of comfortable data from fits-in-memory to fits-on-diskby intelligently streaming data from disk and by leveraging all the cores of a modern CPU.
Users interact with dask either by making graphs directly or through the dask collections which provide larger-than-memory counterparts to existing popular libraries:
- dask.array = numpy + threading
- dask.bag = map, filter, toolz + multiprocessing
- dask.dataframe = pandas + threading
Dask primarily targets parallel computations that run on a single machine. It integrates nicely with the existing PyData ecosystem and is trivial to setup and use:
conda install dask or pip install dask
When open source fights- closed source wins. When the Jedi fight the Sith Lords will win
So will R people rise to the Big Data challenge or will they bury their heads in sands like an ostrich or a kiwi. Will Python people learn from R design philosophies and try and incorporate more of it without redesigning the wheel
Converting code from one language to another automatically?
How I wish there was some kind of automated conversion tool – that would convert a CRAN R package into a standard Python package which is pip installable
Machine learning for more machine learning anyone?