Interview Dr. Jonathan Cornelissen, CEO Datamind #rstats

Here is an interview with Dr Jonathan Cornelissen, CEO of Datamind, which also makes RDocumentation and R-fiddle. I have written on them before here and here.

Ajay- Congrats on making the first page of Hacker News with R-Fiddle. What were your motivations for making http://www.r-fiddle.org/?

Jonathan- Thank you. I must admit it was very exciting to be mentioned on Hacker News, since a lot of people were exposed to the R-fiddle project immediately. In addition, it was a first good test on how our servers would perform!

The motivation for building R-fiddle was simple; our CTO Dieter frequently uses the popular site http://jsfiddle.net/ to prototype webpages, and to share his coding ideas with us. We were looking for something similar for R but it turned out a website allowing you to quickly write, run and share R-code right inside your browser didn’t exist yet. Since we were convinced a fiddle-like tool for R would be useful, we just started building it. Based on the positive reactions and the fast adoption of R-fiddle, I think we were right. That being said, this is a first version of R-fiddle, and we’ll definitely improve it over the coming months. Check out our blog for updates. (http://blog.datamind.org/)

Ajay- Why did you make http://www.rdocumentation.org/ given that there is so much #Rstats documentation all around the internet, including http://www.inside-r.org/?

Jonathan- When we started working on the www.datamind.org platform, we did an online survey to find out whether there would be interest in an interactive learning platform for R and statistics. Although the survey was not on this topic, one of the most striking findings was that a large percentage of R users apparently is frustrated about the documentation of R and its packages. This is interesting since it not only frustrates current users, but it also increases the barrier to entry for new R users and hence puts a brake on the growth and adoption of R as a language. It is mainly for the latter reason we started building Rdocumentation. The whole focus is on usability and letting all users contribute to make the documentation stronger. By the end of next week, we’ll launch a new version of Rdocumentation that introduces advanced search functionality for all R functions, shows the popularity of R packages and much more. So make sure to visit www.Rdocumentation.org for updates!

Ajay- What has been the response to http://www.datamind.org/#/? Any potential content creation partners or even corporate partners like statistics.com, Revolution, RStudio, Mango etc.?

Jonathan- The response to the beta version of DataMind has been great, thousands of learners signed up and took the demo course. We are talking to some of the leading companies in the space and some very well-known professors to develop courses together. It is too soon to disclose details, but we will put regular updates on www.datamind.org! Corporates interested in what we do should definitely get in contact with Martijn@datamind.org.

Ajay- Would it be accurate to call http://www.r-fiddle.org/#/ a browser based GUI for R on the cloud? What enhancements can we expect in the future?

Jonathan- R-fiddle is indeed a browser based GUI for R on the cloud. We have a lot of ideas to improve and extend it. Some of the ideas are: the ability for users to concurrently make changes to a fiddle (Google-docs-style), support for loading data sets, github integration, better security management, lists of popular fiddles or fiddles from popular people, etc. However, the strong point about R-fiddle is that it is really simple and there is absolutely no friction to start using it. In that respect, we want to differentiate R-fiddle from more advanced solutions such as StatAce or Rstudio Server, which focus on more advanced R users or R usage.

Ajay- You described your architecture for datamind.org at http://blog.datamind.org/ which is very open and transparent of you. What is the architecture for http://www.r-fiddle.org/#/ and what is it based on?

Jonathan- That’s an easy one. Although some details differ obviously, from a high-level perspective DataMind.org and R-fiddle.org have exactly the same IT architecture.

Ajay- http://www.datamind.org/#/dashboard describes course creation. How many courses are in the pipeline, and how many users and corporate training clients do you foresee in the next 12 months?

Jonathan- Since we launched DataMind, we were inundated by requests from teachers and industry experts eager to contribute their own coursework on the site. But up until last week, it was only possible to take courses instead of creating them yourself. We decided to change this since we do not want to be solely a content company, but also a platform for others to create courses.

Furthermore, by expanding DataMind with a content creation tool, we go beyond our naturally limited in-house ability to create courses. Now DataMind is ready to become a full on ecosystem to facilitate education between our users.

Ajay- Are you self-funded? Any funding constraints from being based in Europe?

Jonathan- We are a Belgian company, founded in November 2013 by Dieter De Mesmaeker, Martijn Theuwissen and myself. However, the DataMind team travelled to Santiago (Chile) last week to participate in the Start-up Chile incubator for the next 6 months (which offers $40k in equity-free funding and mentoring). Here in Santiago, a fourth team member Bram Jans joined us. Furthermore, we have raised $135k seed capital from the iMinds incubator in Belgium to market and further develop the technology. Next summer, we’d like to raise more capital to be able to execute faster on our strategy towards monetization. Tech savvy investors with an interest/network in the statistics and data science area, or in online education, can always send a mail to Jonathan@datamind.org to discuss potential collaboration.

Ajay- What do you think of R in the cloud for teaching (http://blog.datamind.org/2013/07/23/how-to-run-r-in-the-cloud-for-teaching/)?

Jonathan- We are convinced that cloud solutions are the future of teaching and learning in general. The main problem with the first wave of online education solutions (such as Coursera, EdX, Udacity, etc.) is that they “only” make a copy of the classroom online instead of leveraging technology to create a more engaging and efficient learning experience and interface. Therefore, I don’t think the future is in generic learning solutions. Learning interfaces will differ from domain to domain. Good examples are: Duolingo.com to learn languages, or Codeschool.com to learn web development. We are on a mission to build the best learning solutions for statistics and data science.

Ajay- What are some of the other ways we can help make R more popular on the cloud?

Jonathan- I really like the vision behind StatAce.com, and I think something like it will definitely increase further adoption of R. It is somewhat surprising that Rstudio is not offering something like that, but my assumption is they are working on it. That being said, what would be really cool is a very easy-to-use graphical user interface with R under the hood. Whether you like it or not, R has quite a steep learning curve for most people, and allowing them to analyze data with R through a graphical user interface on the web as a first step, could start the adoption of R in less technical areas.

Ajay- Any plans to make R (CRAN or Github) packages to help with these solutions?

Jonathan-  We’ll put a first version of the very simple Rdocumentation R package on CRAN soon. This would allow people to integrate Rdocumentation in their standard R work-flow (See an early draft version on Github: https://github.com/jonathancornelissen/Rdocumentation_package)


For DataMind, we are working on an R package as well to make the creation of interactive courses easier: https://github.com/jonathancornelissen/datamind. A part of this R package is actually just a wrapper around the great Slidify package (http://slidify.org/).

 

Ajay- Describe your work life balance at a tech startup?

Jonathan- Hmm, work life balance :)?

About-

You can also connect with Dr Jonathan here http://www.linkedin.com/pub/jonathan-cornelissen/4/22/426


(Standard Blogger Disclosure- they also support Decisionstats.com in case you didn’t notice the banner ad)

My Creativity

When not distributing my ideas for free and all kinds of people claiming me as an advisor based on a few emails-

This isn’t a new topic, one of our advisors Ajay Ohri, also the author of Springer’s book on R, wrote about this idea back in 2011   (http://readwrite.com/2011/06/01/an-app-store-for-algorithms#awesm=~ohfvTpPiq6Jmt5).

Some of you know I have been trying to write a movie

todayilearntincanada.wordpress.com

Some of you know I write a poetry blog, http://poemsforkush.com/

Coming up- a post on the different kinds of dashboards within different social media websites etc.

Garbage In Garbage Out

“Many people are like garbage trucks. They run around full of garbage, full of frustration, full of anger, and full of disappointment. As their garbage piles up, they need a place to dump it and sometimes they’ll dump it on you. NEVER take it personally. Just smile, wave, wish them well, and move on with the routine life.” Don’t take their garbage and spread it to other people at work, at home or on the streets.

Hat Tip – http://www.linkedin.com/pub/badri-s-evergreen-thoughts/51/461/209

Data Science Hype Bubble


  1. People selling business analytics software claim business analytics will solve everything for your business (including world peace and hunger if the govt chooses)
  2. People selling business analytics training claim there is a big shortage of analysts/data scientists and getting those skills will give you job and eternal bliss ( but in obtuse language to prevent lawsuits)
  3. People selling consulting services claim software (see 1) is incredibly difficult to customize without their help
  4. Everyone charges a lot of money without any transparency on why it is priced so. What are your costs etc?
  5. Everyone has a few shiny testimonials on their website. This is very confusing. How can everyone be equally good?
  6. The credit rating agencies of Data science world are as corrupted and prone to influence as the credit rating agencies of the financial world (Enough said, Gideon!)
  7. Pricing in data science solutions, products and services is like this- my website is better than that competitor website /blog so if he charges X let me charge X +dx
  8. Even companies that began with grand visions of revolution and changing the world slowly upped their price  of both software and training
  9. White papers in data science are a declining but still robust industry. The latest thing in data science- SLICK BLOGS by smart looking people
  10. No one bothers to explain total cost of ownership or total return on investment on data science and analytics. Very surprising, since every one is a quantitative expert and these two metrics should bother the dear beloved end customer the most
  11. I have seen some hype bubbles (yes, I am 36 years old): Business Intelligence- Business Analytics- Data Science- Big Data… What is the next big buzzword?
  12. Everyone is selling webinars for free. There is no free lunch. Why are there free webinars?
  13. How I can go from unpaid blogger to paid webinar guru— test my hypothesis (thinks a lot of people everyday)
  14. Somewhere in a West Coast college dormitory or an Eastern European garage, some geeks are plotting the next data revolution. You have been warned.
  15. How many bums must one guy kiss to get invited to conferences?
  16. In the age of Skype and video conferencing- why do you need a conference? Oh right- that’s another side industry too.
  17. The more billions a software company makes in analytics, the more haters it gets!
  18. 123,000 bloggers think they can run Google better than Eric Schmidt. Includes two.

ps- Sarcasm was totally unintentional. Direct all malevolence here http://plus.google.com/+AjayOhri

50 functions to clear a basic interview for Business Analytics #rstats

Due respect to all cheat sheets and ref cards, but these are the functions that I use in a sequence to analyze a business data set.

  • Packages
  1. install.packages("Hmisc") installs package Hmisc
  2. library(Hmisc) – loads package Hmisc
  3. update.packages() Updates all packages
  • Data Input
  1. getwd() – Gets you the current working directory
  2. setwd("C:/Path") - Sets the working directory to the new path, here C:/Path
  3. dir() – Lists all the files present in the current working directory
  4. a=read.csv("1.csv",header=T,sep=",",stringsAsFactors=T)

here

a= read.csv (assigns the object name a to whatever comes to the right side)

(You can also explicitly assign a character name to a data object using assign.)

read.csv is a type of input command derived from read.table to read in a rectangular object (rows and columns)

header specifies whether the first line has variable names or not

sep denotes separator (generally a comma for CSV files but can be space or tab for text files)

stringsAsFactors=T reads in strings as factor levels; use stringsAsFactors=F to keep them as character values
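A quick sketch of the read.csv call above, assuming a tiny made-up CSV written to a temporary file (the file contents, column names and the object name sales_data are hypothetical, only there so the example runs on its own). It shows what stringsAsFactors does to a text column and how assign() binds a value to a name held in a string.

```r
# Write a tiny made-up CSV to a temporary file so the example is self-contained
path <- tempfile(fileext = ".csv")
writeLines(c("name,sales", "north,100", "south,250"), path)

# stringsAsFactors = TRUE turns the text column into a factor
a <- read.csv(path, header = TRUE, sep = ",", stringsAsFactors = TRUE)
class(a$name)

# stringsAsFactors = FALSE keeps the text column as character values
a2 <- read.csv(path, header = TRUE, sep = ",", stringsAsFactors = FALSE)
class(a2$name)

# assign() binds a value to the name given as a character string
assign("sales_data", a2)
nrow(sales_data)
```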

  • Object Inspection
  1. str(a) Gives the structure of the object named a, including class, dimensions, type of variables, names of variables and a few values of variables as well. Only drawback is that it can throw up a lot of information for a big data set
  2. names(a) Gives the names of variables of the object
  3. class(a) Gives the class of an object like data.frame, list, matrix, vector etc
  4. dim(a) Gives the dimensions of the object (rows, columns)
  5. nrow(a) Gives the number of rows of object a- useful when used as an input for another function
  6. ncol(a) Gives the number of columns of object a
  7. length(a) Gives the length of object- useful for vectors, for a data frame it is the same as ncol
  8. a[i,j] Gives the value in ith row and jth column
  9. a$var1 Gives the variable named var1 in object a. This can be treated as a separate object on its own for inspection or analysis
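The inspection functions above can be tried on a small made-up data frame; the object a and its columns var1 and var2 below are only stand-ins for your own data.

```r
# A small made-up data frame standing in for the object a
a <- data.frame(var1 = c(10, 20, 30), var2 = c("x", "y", "z"))

str(a)      # structure: class, dimensions, column types, first few values
names(a)    # names of the variables
class(a)    # class of the object
dim(a)      # number of rows, then number of columns
nrow(a)     # number of rows
ncol(a)     # number of columns
length(a)   # for a data frame this is the same as ncol(a)
a[2, 1]     # value in row 2, column 1
a$var1      # the variable var1 as a vector
```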
  • Data Inspection
  1. head(a,10) gives the first ten rows of object a
  2. tail(a,20) gives the last twenty rows of object a
  3. b=ajay[sample(nrow(ajay),replace=F,size=0.05*nrow(ajay)),] 

Let's get a 5% random sample of the object ajay

[] subsets the object to return the specified rows

sample is the command for random sampling

So out of nrow(ajay), the total number of rows to sample from, size= sets the size of the sample (here 5% of the rows), and the row numbers drawn are what sample returns. replace=F means each row number is selected only once
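Here is the same 5% sample as a runnable sketch. The data frame ajay is made up (100 rows of random numbers), set.seed is there only to make the draw repeatable, and floor() is my own addition so that the sample size is a whole number even when 5% of the rows is not.

```r
set.seed(42)  # only so that the random draw is repeatable
ajay <- data.frame(id = 1:100, value = rnorm(100))  # made-up data

# 5% of the rows, rounded down to a whole number
n <- floor(0.05 * nrow(ajay))

# Draw n distinct row numbers, then keep only those rows
rows <- sample(nrow(ajay), size = n, replace = FALSE)
b <- ajay[rows, ]

nrow(b)                 # number of sampled rows
any(duplicated(rows))   # FALSE, since replace = FALSE
```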

  • Math Functions

Basic-

  1. sum  -sum
  2. sqrt -square root
  3. sd  -standard deviation
  4. log –log
  5. mean -mean
  6. median– median

Additional

  1. cumsum – Cumulative Sum for a column
  2. diff –Differencing
  3. lag – Lag
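A small sketch of the math functions on a made-up vector x. lag is left out here because base R's lag shifts the time base of a time series object rather than the values of a plain vector.

```r
x <- c(1, 4, 9, 16)  # made-up numbers

sum(x)      # 30
sqrt(x)     # 1 2 3 4
mean(x)     # 7.5
median(x)   # 6.5
sd(x)       # sample standard deviation
log(x)      # natural log of each value

cumsum(x)   # 1 5 14 30
diff(x)     # 3 5 7
```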
  • Data Manipulation
  1. paste(a$Var) converts Var from Factor/Numeric variable to Character Variable
  2. as.numeric(a$Var2) Converts a character variable into a numeric variable
  3. is.na(a) returns TRUE wherever it encounters a Missing Value within the object
  4. na.omit(a) Deletes all rows containing missing values (denoted by NA within R)
  5. na.rm=T (this option enables you to calculate values Ignoring Missing Values)
  6. nchar(abc) gives the number of characters in a character value
  7. substr("ajay",1,3) gives the sub string from starting place 1 to ending place 3. Note in R index starts from 1 for first object
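These data manipulation functions can be sketched on a made-up data frame with one missing value; the column names Var and Var2 follow the list above and the values are invented.

```r
# Made-up data: Var has a missing value, Var2 holds numbers stored as text
a <- data.frame(Var = c(1.5, NA, 3),
                Var2 = c("10", "20", "30"),
                stringsAsFactors = FALSE)

paste(a$Var)               # numeric values converted to character
as.numeric(a$Var2)         # character values converted to numeric
is.na(a$Var)               # TRUE where the value is missing
na.omit(a)                 # drops the row that contains NA
mean(a$Var, na.rm = TRUE)  # mean of the non-missing values

nchar("ajay")              # 4
substr("ajay", 1, 3)       # "aja"
```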
  • Date Manipulation  

library(lubridate)
> a="20Mar1987"
> dmy(a)
[1] "1987-03-20 UTC"
> b="20/7/89"
> dmy(b)
[1] "1989-07-20 UTC"
> c="October 12 93"
> mdy(c)
[1] "1993-10-12 UTC"

  • Data Analysis
  1. summary(a) Gives summary of object (including min, max, median, mean, 1st quartile, 3rd quartile) for numeric objects and frequency analysis of Factor variables
  2. table(a) Gives Frequency Analysis of variable or object
  3. table(a$var1,a$var2) Gives cross tabs of Var1 with respect to Var 2 of object a

library(Hmisc) loads Hmisc, which enables us to use the describe and summarize functions

  1. describe(a$var1) gives a much more elaborate summary of the variable Var 1- it’s a better version of summary
  2. summarize(a$var1,a$var2,FUN) applies a function (like sum, median, summary or mean) on Var 1, as GROUPED by Var2
  3. cor(a) gives correlation between all variables of a (every column of a must be numeric)
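A base R sketch of the analysis steps on made-up data. To keep it free of extra packages, tapply stands in for the grouped summary (it is a base R substitute, not the Hmisc summarize function itself), and the numeric columns are picked out before calling cor.

```r
# Made-up data: two numeric variables and one grouping variable
a <- data.frame(var1 = c(10, 20, 30, 40),
                var2 = c("east", "west", "east", "west"),
                var3 = c(1, 2, 3, 5))

summary(a$var1)               # min, quartiles, median, mean, max
table(a$var2)                 # frequency of each group
table(a$var2, a$var1 > 15)    # cross tab of group against a condition

# Mean of var1 grouped by var2 (base R stand-in for a grouped summary)
grouped <- tapply(a$var1, a$var2, mean)
grouped

# Correlation matrix of the numeric columns only
num_cols <- sapply(a, is.numeric)
cor(a[, num_cols])
```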
  • Data Visualization
  1. plot(a$var1,a$var2) Plots Var 1 with  Var 2
  2. boxplot(a) boxplot
  3. hist(a$Var1) Histogram
  4. plot(density(a$Var1)) Density Plot
  5. pie (pie chart)

  • Modeling
  1. a=lm(y~x) creates model
  2. summary(a) gives model summary including parameter estimates
  3. vif(a) gives Variance Inflation Factors (needs library(car) and a model with at least two predictors)
  4. outlierTest(a) gives Outliers (also from library(car))
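A minimal modeling sketch in base R with made-up x and y values. The model object is called fit here just to avoid clashing with the data object a used earlier; vif and outlierTest are not shown since they need the car package.

```r
# Made-up data that is roughly linear
x <- 1:5
y <- c(3.1, 4.9, 7.2, 8.8, 11.0)

fit <- lm(y ~ x)   # fit a simple linear regression
summary(fit)       # parameter estimates, standard errors, R-squared
coef(fit)          # just the intercept and slope
```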

  • Write Output
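  1. write.csv(a,"a.csv") Write output as a csv file (without a file name the output is printed to the console)
  2. png("graph.png") Opens a png file for the next plot; call dev.off() after plotting to write it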
  3. q() –Quits R Session
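A sketch of writing output, using temporary files so nothing is left in your working directory. The data frame is made up, and the png step is wrapped in a capabilities("png") check as a precaution, since not every R installation can open a png device.

```r
a <- data.frame(x = 1:3, y = c(2, 4, 6))  # made-up data

# Write the data frame to a csv file and read it back to check it
csv_path <- tempfile(fileext = ".csv")
write.csv(a, csv_path, row.names = FALSE)
back <- read.csv(csv_path)

# Write a plot to a png file: open the device, plot, then close the device
if (capabilities("png")) {
  png_path <- tempfile(fileext = ".png")
  png(png_path)
  plot(a$x, a$y)
  invisible(dev.off())
}
```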

——————————————————————————————————————————————————-

R is a huge language with 5000 packages and hundreds of thousands of functions.

But if you memorize these ~50 functions, I assure you that you will make a much more positive impression in your business analytics interview!

Bonus-

Sys.time() and Sys.Date() give the current time and date (note the change in case)
while system.time(expression) gives the time taken to evaluate an expression

Citation- http://www.ats.ucla.edu/stat/r/faq/timing_code.htm

Also

yo=function(a,b){ a*b*12} creates a custom function called yo which can be then invoked as yo(2,3)
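The bonus functions in one runnable sketch; the loop being timed is made up just to give system.time something to measure.

```r
# Custom function from above
yo <- function(a, b) { a * b * 12 }
yo(2, 3)   # 72

Sys.Date()   # current date
Sys.time()   # current date and time

# Time taken to evaluate an expression (here a throwaway loop)
timing <- system.time(for (i in 1:100000) sqrt(i))
timing["elapsed"]
```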

Coming up-

the apply family, the Hadley collective  and GUIs  for the second round of interview ;)

The Amazing R-Fiddle truly brings #rstats to the browser

Datamind.com, whom I interact with on and off, and who are also the masterminds behind http://www.rdocumentation.org/, have finally created their platform for interactive and gamified R learning on the web. Take a look- it does look slightly better than Codecademy’s interface, doesn’t it? The platform is called http://www.r-fiddle.org/#/

More power to R for Cloud Computing!


Now if they could only collaborate with other players like Quandl, BigML and even StatAce for an even cooler suggestion. Even Revolution Analytics and RStudio, who have very expensive training modules, should be able to use this for self paced online learning courses!

 

Quote- A software of beauty is a joy forever – Keats

RapidMiner 6.0 launched!

What’s new–

  • Revised visualization and display creation
  • A new “statistics” view
  • Improved results view
  • Better tours and tutorials

RapidMiner v6.0 provides four specific application wizards:

  • Churn reduction
  • Direct marketing
  • Sentiment analysis
  • Predictive maintenance

Check it out today!

http://rapidminer.com/my-account/
