
Category Archives: Analytics

Updating R for Business Analytics

I just updated my R for Business Analytics site (http://rforanalytics.wordpress.com/). The additions are listed below; go to http://rforanalytics.wordpress.com/ for the complete list. What I am trying to do is build a kind of Task View dedicated to Business Analytics (aimed at business analysts and data scientists), with slightly better HTML (maybe Markdown later on) and some visual appeal.


Interviews with R Community

  • Jeroen Ooms (OpenCPU)
  • Christian (StatAce)
  • Ian Fellows (Deducer)
  • Jeff Allen (Trestle)
  • Gergely Daróczi (Rapporter)


ODBC / Databases for R (including Hadoop and NoSQL)


R with MongoDB


This R package provides an interface to the NoSQL MongoDB database using the MongoDB C driver, version 0.8.
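A minimal connection sketch with the rmongodb package (the local server address, database name `test`, and collection name `people` are assumptions for illustration; a MongoDB server must be running):

```r
library(rmongodb)

# connect to a MongoDB server on localhost (default port 27017)
mongo <- mongo.create()
if (mongo.is.connected(mongo)) {
  # build a BSON document and insert it into db "test", collection "people"
  buf <- mongo.bson.buffer.create()
  mongo.bson.buffer.append(buf, "name", "R")
  mongo.bson.buffer.append(buf, "year", 1993L)
  doc <- mongo.bson.from.buffer(buf)
  mongo.insert(mongo, "test.people", doc)

  # query the document back by name
  found <- mongo.find.one(mongo, "test.people",
                          mongo.bson.from.list(list(name = "R")))
  print(found)
  mongo.destroy(mongo)
}
```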


R with JSON


This package is a fork of the RJSONIO package.
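For example, round-tripping an R data frame through JSON with the jsonlite package (assuming it is installed; the example data are made up):

```r
library(jsonlite)

# serialize a small data frame to JSON text...
df <- data.frame(symbol = c("IBM", "AAPL"),
                 volume = c(100L, 200L),
                 stringsAsFactors = FALSE)
json <- toJSON(df)
print(json)  # [{"symbol":"IBM","volume":100},{"symbol":"AAPL","volume":200}]

# ...and parse it back into a data frame
df2 <- fromJSON(json)
```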

R with CouchDB


R with MonetDB


MonetDB.R: Connect MonetDB to R

Allows pulling data from MonetDB into R.
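MonetDB.R plugs into R's standard DBI interface, so pulling a table looks like any other DBI workflow. A sketch, assuming a local MonetDB instance with a database `demo` and a hypothetical `nyse` table:

```r
library(DBI)
library(MonetDB.R)

# connect to a running local MonetDB server (host/dbname are assumptions)
con <- dbConnect(MonetDB.R(), host = "localhost", dbname = "demo")

# pull query results straight into an R data frame with plain SQL
nyse <- dbGetQuery(con, "SELECT stock_symbol, stock_volume FROM nyse LIMIT 10")
head(nyse)

dbDisconnect(con)
```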

Cassandra with R


Neo4j with R


# Function for querying Neo4j from within R
# from http://stackoverflow.com/questions/11188918/use-neo4j-with-r
library(RCurl)    # for basicTextGatherer(), curlPerform(), curlEscape()
library(RJSONIO)  # for fromJSON()

query <- function(querystring) {
    h <- basicTextGatherer()
    curlPerform(url = "localhost:7474/db/data/ext/CypherPlugin/graphdb/execute_query",
        postfields = paste("query", curlEscape(querystring), sep = "="),
        writefunction = h$update, verbose = FALSE)
    result <- fromJSON(h$value())
    data <- data.frame(t(sapply(result$data, unlist)))
    names(data) <- result$columns
    data  # return the query result as a data frame
}


RHadoop consists of the following packages:

  • NEW! plyrmr - higher level plyr-like data processing for structured data, powered by rmr
  • rmr - functions providing Hadoop MapReduce functionality in R
  • rhdfs - functions providing file management of the HDFS from within R
  • rhbase - functions providing database management for the HBase distributed database from within R
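rmr ships a "local" backend, which runs MapReduce jobs in plain R without a Hadoop cluster, so the API can be tried on a laptop. A minimal sketch in the style of the standard rmr tutorial (squaring a vector of integers):

```r
library(rmr2)

# run MapReduce locally, without Hadoop, for testing
rmr.options(backend = "local")

# write the input to the (local) DFS abstraction
small.ints <- to.dfs(1:10)

# map each value v to the key-value pair (v, v^2); no reduce step needed
result <- mapreduce(
  input = small.ints,
  map   = function(k, v) keyval(v, v^2)
)

# read the key-value pairs back into R
from.dfs(result)
```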

R with Spark


SparkR is an R package that provides a light-weight frontend to use Apache Spark from R. SparkR exposes the Spark API through the RDD class and allows users to interactively run jobs from the R shell on a cluster.
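A minimal sketch of that RDD workflow from the R shell, using the standalone SparkR API of the time; `master = "local"` is an assumption so it can be tried on a single machine:

```r
library(SparkR)

# start a local Spark context
sc <- sparkR.init(master = "local")

# distribute a vector as an RDD, transform each element, and aggregate
rdd     <- parallelize(sc, 1:100)
squares <- lapply(rdd, function(x) x * x)   # lazily maps over the RDD
total   <- reduce(squares, function(x, y) x + y)
print(total)
```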

R with Hive


RHive is an R extension facilitating distributed computing via Hive queries. RHive allows easy use of HiveQL (Hive SQL) in R, and easy use of R objects and R functions in Hive.
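A sketch of running HiveQL from R with RHive (the Hive server hostname and the `nyse` table are assumptions; a Hive server must be reachable):

```r
library(RHive)

# initialize RHive and connect to a Hive server
rhive.init()
rhive.connect(host = "hive-server")

# run a HiveQL query; the result comes back as an R data frame
avg_vol <- rhive.query(
  "SELECT AVG(stock_volume) FROM nyse WHERE stock_symbol = 'IBM'")
print(avg_vol)

rhive.close()
```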


DDR with R – Rhipe (dormant)



A package to connect to and run queries on Cloudera Impala (thanks to Mu Sigma).


Pig with R


Updates at StatAce: Early access to make your own R-in-the-browser GUI #rstats

The guys at StatAce released major updates. I am particularly excited about the ability to create a custom GUI box for your own analysis, or for sharing with consulting clients or students.

What does that mean? Basically they are making it a bit like R Commander Extensions: if you have a package or analysis you would rather run visually (than code), you can create a GUI module for it. The modular extension is quite cool in my opinion, but the proof will be in how well the pudding is designed.


Public sharing of results
Now you can share your analysis results for the world to see (example). Just click Share in the results pane.

Google Drive integration
We added integration with Google Drive. This makes collaboration and synchronization of large files even easier. Don’t forget we also support Dropbox. Just click the Connect to menu in the file manager.

Plots zoom and SVG export
Now you can open plots in a separate window that supports zoom in and zoom out. From it, you can also export to the SVG format which is ideal for printing. Just click the lens icon next to any plot.

Point-and-click PCA + data transformation without R knowledge
You can now carry out a PCA by just pointing and clicking through Analysis > Dimensional Analysis > Principal Components Analysis. We also added the Data menu, which allows you to filter and sort datasets without any knowledge of R.

(Secret) Build your own visual dialog box to run R code
Do you have colleagues who don’t know R but need to use functionality you developed? Do you do consulting and want your customers to be able to run your models with point-and-click? Do you want to share a piece of R code with the world in an easy-to-use way?
StatAce now allows you to easily create a custom graphical interface for your R code. The process is entirely visual (no coding) and is what we use to build our own Data & Analysis menus (e.g. the bivariate correlation and linear regression dialog boxes). We are testing the functionality with a limited number of users, and their feedback has been great. Drop us a line at predict@statace.com to request early access.






Comparing Pig with Hive SQL


a = LOAD 'nyse' USING org.apache.hcatalog.pig.HCatLoader();
b = FILTER a BY stock_symbol == 'IBM';
c = GROUP b ALL;
d = FOREACH c GENERATE AVG(b.stock_volume);
DUMP d;

In SQL (Hive)

SELECT AVG(stock_volume) FROM nyse WHERE stock_symbol = 'IBM';

(from HDP 2.0 Horton Sandbox Example)
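For comparison, the same aggregation in plain R, on a hypothetical in-memory version of the `nyse` table:

```r
# hypothetical in-memory version of the nyse table
nyse <- data.frame(
  stock_symbol = c("IBM", "IBM", "AAPL"),
  stock_volume = c(100, 300, 500),
  stringsAsFactors = FALSE
)

# average volume for IBM, equivalent to the Pig and Hive queries above
avg_ibm <- mean(nyse$stock_volume[nyse$stock_symbol == "IBM"])
print(avg_ibm)  # 200
```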


Installing Scala on CentOS

Scala release files are now hosted at http://www.scala-lang.org/files/archive/

wget http://www.scala-lang.org/files/archive/scala-2.10.1.tgz
tar xvf scala-2.10.1.tgz
sudo mv scala-2.10.1 /usr/lib
sudo ln -s /usr/lib/scala-2.10.1 /usr/lib/scala
export PATH=$PATH:/usr/lib/scala/bin
scala -version


Writing for kdnuggets.com

I have been writing freelance for kdnuggets.com.

It has been a great learning experience for me, helping me become a better writer, especially in my discipline.

Here is a list of the articles (interviews are in bold); I will keep updating this list:

  1. Book Review: Data Just Right 2014/04/03
  2. Exclusive Interview: Richard Socher, founder of etcML, Easy Text Classification Startup 2014/03/31
  3. Trifacta – Tackling Data Wrangling with Automation and Machine Learning 2014/03/17
  4. Paxata automates Data Preparation for Big Data Analytics 2014/03/07
  5. etcML Promises to Make Text Classification Easy  2014/03/05
  6. Wolfram Breakthrough Knowledge-based Programming Language – what it means for Data Science? 2014/03/02

New Delhi R Meetup March 2014

We had a nice, small, and slightly quirky Meetup this weekend at Cafe Coffee Day, CP, New Delhi. Despite the weekend traffic that ensured almost everyone was late, and a ceiling that started dripping water, it was enjoyable meeting the mix of people present, including students, people moving away from SAS, and R-curious IT people, who mostly form the backbone of R Meetups here in New Delhi.

I presented on

1) R basics including R Commander

2) Hadoop basics including the Hortonworks Sandbox

3) Tableau Public basics for visualization

4) Slides from my talk last week on Cloud Computing

Overall, with 239 members, this is growing to be a critical and large presence in the traditional outsourcing hub of NCR, India. With thousands of SAS license users and programmers in Gurgaon and Noida, we hope we can make a difference in promoting open source awareness.


Is the biggest threat to the cloud US Government overreach?

A presentation I made for a talk today.

