
Author Archives: Son of Gandhi

Cheat Sheets for Data Scientists

A compilation of cheat sheets for data scientists.


Movie Review – Two States (Bollywood 2014)

As a Punjabi engineer turned IIM MBA who became a writer after marrying a woman from a different state (that's me here!), I was a bit apprehensive about how good or bad this movie could be. It was a relief, and indeed a joy, to watch. The comedy is light and everyone acts well. Watch Two States as the feel-good movie of the new Bollywood of 2014. Alia Bhatt is lovely, Arjun Kapoor decides to move his acting and facial muscles, and the veteran cast fits its roles very naturally. It is a good way to spend a few hours and to be amazed at the diversity of India.


Talking on Analytics #rstats #analytics #bigdata #sas #conferences

This post lists my talks and conferences on analytics, along with their sponsors, and will be updated as I keep talking.


1) Big Data Big Analytics - http://krishnarajpm.com/bigdata/abstract.pdf at the Workshop on Statistical Machine Learning and Game Theory Approaches for Large Scale Data Analysis, 9 July 2012 – 14 July 2012, Bangalore, India. Sponsored by Mathematical Sciences, Division of Science and Engineering Research Board, Department of Science & Technology, Government of India (sponsored airfare, hotel accommodation, honorarium)

2) Data Analytics using the Cloud: Challenges and Opportunities for India, at the 1st International Symposium on Big Data and Cloud Computing Challenges (ISBCC-2014), March 27-28, 2014, VIT University, Chennai, India. Sponsored by BRNS (flight)


3) Open Source Analytics at OSSCamp 2014 http://osscamp.in/





1) First Big Data Conference, New York, sponsored by Aster Data (flight-stay)


2) Data Mining 2009, sponsored by SAS Institute (flight-stay)



1) Using R in HR Analytics - HR Tech Europe 2012 http://blog.hrtecheurope.com/2012/08/more-r-in-hr/ also http://blog.revolutionanalytics.com/2012/11/using-r-in-the-human-resources-department.html

2) R for Business Analytics- Analytics India Magazine


3) Data Mining Research People Award 2010:  Data Mining Blog http://www.dataminingblog.com/data-mining-research-interview-ajay-ohri/

4) Statsjobs – April 2008 http://statjobs.blogspot.in/2008/04/interview-with-ajay-ohri-data-mining.html 


1) Blogger of the Week Social Media Collective http://socialmediatoday.com/SMC/67268

2) Member of the Month (2 times) on Analyticbridge  $250 awarded twice



If you need me to talk on analytics, drop me a line at http://linkedin.com/in/ajayohri.




Talk on Open Source Analytics at OSSCamp Delhi

I presented on how open source has helped business analytics grow. This was at the unconference (my first) at http://osscamp.in/


Updating R for Business Analytics

I just updated my R for Business Analytics site (http://rforanalytics.wordpress.com/). The additions are listed below; the complete list is on the site. What I am trying to do is build a kind of Task View dedicated to Business Analytics (aimed at Business Analysts and Data Scientists) with slightly better HTML (maybe Markdown later on) and some visual appeal.


Interviews with R Community



Jeroen Ooms (OpenCPU)


Christian (StatAce)


Ian Fellows (Deducer)


Jeff Allen (Trestle)


Gergely Daróczi (Rapporter)


ODBC /Databases for R (including Hadoop and NoSQL)


R with MongoDB


This R package provides an interface to the NoSQL MongoDB database using the MongoDB C-driver version 0.8.


R with JSON


This package is a fork of the RJSONIO package 

R with CouchDB


R with MonetDB


MonetDB.R: Connect MonetDB to R

Allows you to pull data from MonetDB into R.

Cassandra with R


Neo4j with R


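# Function for querying Neo4j from within R
# from http://stackoverflow.com/questions/11188918/use-neo4j-with-r
library(RCurl)    # basicTextGatherer, curlPerform, curlEscape
library(RJSONIO)  # fromJSON
query <- function(querystring) {
    # collect the HTTP response body as text
    h <- basicTextGatherer()
    curlPerform(url = "localhost:7474/db/data/ext/CypherPlugin/graphdb/execute_query",
        postfields = paste("query", curlEscape(querystring), sep = "="),
        writefunction = h$update, verbose = FALSE)
    # parse the JSON reply and reshape it into one row per result
    result <- fromJSON(h$value())
    data <- data.frame(t(sapply(result$data, unlist)))
    names(data) <- result$columns
    data  # return the result as a data frame
}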
# -------------------------------------- 
# import all data into neo4j
# --------------------------------------
nrow(venueDataset)  # number of venues
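
The reshaping step at the end of the query function (pairing result$columns with each entry of result$data) can be sketched in Python as well. This is only an illustration: the payload below is a made-up example in the columns/data shape that the R code reads, and rows_from_response is a hypothetical helper name, not part of any Neo4j client.

```python
import json

# Hypothetical response body in the {"columns": [...], "data": [[...], ...]}
# shape that the R function reads via result$columns and result$data.
response_text = (
    '{"columns": ["name", "checkins"], '
    '"data": [["Cafe A", 12], ["Bar B", 7]]}'
)


def rows_from_response(text):
    """Turn a columns/data JSON payload into a list of dicts, one per row."""
    result = json.loads(text)
    return [dict(zip(result["columns"], row)) for row in result["data"]]


rows = rows_from_response(response_text)
print(rows)
```

Each dict plays the role of one data frame row, with the column names as keys.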


RHadoop consists of the following packages:

  • NEW! plyrmr - higher level plyr-like data processing for structured data, powered by rmr
  • rmr - functions providing Hadoop MapReduce functionality in R
  • rhdfs - functions providing file management of the HDFS from within R
  • rhbase - functions providing database management for the HBase distributed database from within R

R with Spark


SparkR is an R package that provides a light-weight frontend to use Apache Spark from R. SparkR exposes the Spark API through the RDD class and allows users to interactively run jobs from the R shell on a cluster.

R with Hive


RHive is an R extension facilitating distributed computing via Hive queries. RHive allows easy usage of HQL (Hive SQL) in R, and allows easy usage of R objects and R functions in Hive.


D&R (Divide and Recombine) with R – RHIPE (dormant)



A package to connect and run queries on Cloudera Impala (thanks to Mu Sigma)


Pig with R


Updates at StatAce: Early access to make your own R in the browser GUI #rstats

The guys at StatAce released major updates. I am particularly excited about the ability to create a custom GUI box for your own analysis or for sharing with consulting clients or students.

What does that mean? Basically they are making it a bit like R Commander extensions, so if you have a package or analysis you would rather do visually than in code, you can create a GUI module for it. The modular extension is quite cool in my opinion, but further proof will be in how well designed the pudding is.


Public sharing of results
Now you can share your analysis results for the world to see (example). Just click Share in the results pane.

Google Drive integration
We added integration with Google Drive. This makes collaboration and synchronization of large files even easier. Don’t forget we also support Dropbox. Just click the Connect to menu in the file manager.

Plots zoom and SVG export
Now you can open plots in a separate window that supports zoom in and zoom out. From it, you can also export to the SVG format which is ideal for printing. Just click the lens icon next to any plot.

Point-and-click PCA + data transformation without R knowledge
You can now carry out a PCA by just pointing and clicking through Analysis > Dimensional Analysis > Principal Components Analysis. We also added the Data menu which allows you to filter and sort datasets without any knowledge of R.

(Secret) Build your own visual dialog box to run R code
Do you have colleagues who don’t know R but need to use functionality you developed? Do you do consulting and want your customers to be able to run your models with point-and-click? Do you want to share a piece of R code with the world in an easy-to-use way?
StatAce now allows you to easily create a custom graphical interface for your R code. The process is entirely visual (no coding) and is what we use to build our own Data & Analysis menus (e.g. the bivariate correlation and linear regression dialog boxes). We are testing the functionality with a limited number of users, and their feedback has been great. Drop us a line at predict@statace.com to request early access.






Comparing Pig with Hive SQL


a = LOAD 'nyse' USING org.apache.hcatalog.pig.HCatLoader();
b = FILTER a BY stock_symbol =='IBM' ;
c = group b all;
d = foreach c generate AVG(b.stock_volume);
dump d;

In SQL (Hive)

SELECT AVG(stock_volume) FROM nyse WHERE stock_symbol = 'IBM';

(from the HDP 2.0 Hortonworks Sandbox example)
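
For readers who know neither language, here is a minimal Python sketch of the logic that both the Pig script and the Hive query above express: keep the rows for one symbol, then average the volume. The sample rows and the helper name average_volume are made up for illustration and are not taken from the sandbox data.

```python
from statistics import mean

# Hypothetical sample rows standing in for the 'nyse' table (not real data).
nyse = [
    {"stock_symbol": "IBM", "stock_volume": 100},
    {"stock_symbol": "IBM", "stock_volume": 300},
    {"stock_symbol": "GE", "stock_volume": 500},
]


def average_volume(rows, symbol):
    """Filter rows by symbol, then average stock_volume (FILTER + AVG)."""
    volumes = [r["stock_volume"] for r in rows if r["stock_symbol"] == symbol]
    # With no matching rows there is nothing to average; None stands in
    # for the empty result here.
    return mean(volumes) if volumes else None


ibm_average = average_volume(nyse, "IBM")
print(ibm_average)
```

Pig spells out the same steps one relation at a time (FILTER, GROUP ALL, then AVG), while Hive folds them into a single SELECT with a WHERE clause.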




