My Creativity

When not distributing my ideas for free and all kinds of people claiming me as an advisor based on a few emails-

This isn’t a new topic; one of our advisors, Ajay Ohri, also the author of Springer’s book on R, wrote about this idea back in 2011 (http://readwrite.com/2011/06/01/an-app-store-for-algorithms#awesm=~ohfvTpPiq6Jmt5).

Some of you know I have been trying to write a movie

todayilearntincanada.wordpress.com

Some of you know I write a poetry blog at http://poemsforkush.com/
Coming up- a post on the different kinds of dashboards within different social media websites etc.

Garbage In Garbage Out

“Many people are like garbage trucks. They run around full of garbage, full of frustration, full of anger, and full of disappointment. As their garbage piles up, they need a place to dump it and sometimes they’ll dump it on you. NEVER take it personally. Just smile, wave, wish them well, and move on with your routine life.” Don’t take their garbage and spread it to other people at work, at home or on the streets.

Hat Tip – http://www.linkedin.com/pub/badri-s-evergreen-thoughts/51/461/209

Data Science Hype Bubble


  1. People selling business analytics software claim business analytics will solve everything for your business (including world peace and hunger, if the govt chooses)
  2. People selling business analytics training claim there is a big shortage of analysts/data scientists and that getting those skills will give you a job and eternal bliss (but in obtuse language, to prevent lawsuits)
  3. People selling consulting services claim the software (see 1) is incredibly difficult to customize without their help
  4. Everyone is charging money, expensively, without any transparency on why it is priced so. What are your costs, etc.?
  5. Everyone has a few shiny testimonials on their website. This is very confusing. How can everyone be equally good?
  6. The credit rating agencies of the data science world are as corrupted and prone to influence as the credit rating agencies of the financial world (enough said, Gideon!)
  7. Pricing in data science solutions, products and services goes like this- my website is better than that competitor’s website/blog, so if he charges X, let me charge X + dx
  8. Even companies that began with grand visions of revolution and changing the world slowly upped their prices for both software and training
  9. White papers in data science are a declining but still robust industry. The latest thing in data science- SLICK BLOGS by smart-looking people
  10. No one bothers to explain total cost of ownership or total return on investment on data science and analytics. Very surprising, since everyone is a quantitative expert and these two metrics should matter the most to the dear beloved end customer
  11. I have seen some hype bubbles (yes, I am 36 years old)- Business Intelligence- Business Analytics- Data Science- Big Data… What is the next big buzzword?
  12. Everyone is selling webinars for free. There is no free lunch. Why are there free webinars?
  13. How can I go from unpaid blogger to paid webinar guru? Test my hypothesis (think a lot of people every day)
  14. Somewhere in a West Coast college dormitory or an Eastern European garage, some geeks are plotting the next data revolution. You have been warned.
  15. How many bums must one guy kiss to get invited to conferences?
  16. In the age of Skype and video conferencing, why do you need a conference? Oh right- that’s another side industry too.
  17. The more billions a software company makes in analytics, the more haters it gets!
  18. 123,000 bloggers think they can run Google better than Eric Schmidt. Includes two.

ps- Sarcasm was totally unintentional. Direct all malevolence here http://plus.google.com/+AjayOhri

50 functions to clear a basic interview for Business Analytics #rstats

Due respect to all cheat sheets and ref cards, but these are the functions that I use in sequence to analyze a business data set.

  • Packages
  1. install.packages("Hmisc") – installs package Hmisc
  2. library(Hmisc) – loads package Hmisc
  3. update.packages() – updates all packages
  • Data Input
  1. getwd() – Gets you the current working directory
  2. setwd(“C:/Path”) -Sets the working directory to the new path , here C:/Path
  3. dir() – Lists all the files present in the current working directory
  4. a=read.csv("1.csv",header=T,sep=",",stringsAsFactors=F)

here

a=read.csv(…) assigns the object name a to whatever comes on the right side of =

You can also explicitly assign a character name to a data object using assign()

read.csv is a type of input command derived from read.table to read in a rectangular object (rows and columns)

header specifies whether the first line has variable names or not

sep denotes the separator (generally a comma for CSV files but can be space or tab for text files)

stringsAsFactors=F reads in strings as character values and not factor levels
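A minimal sketch of the data-input step above: the file name and columns here are made up for illustration, so a temporary file is written first and then read back with the same options.

```r
# Write a tiny hypothetical CSV to a temp file, then read it back with
# read.csv using the options discussed above.
tf = tempfile(fileext = ".csv")
writeLines(c("name,sales", "north,100", "south,250"), tf)
a = read.csv(tf, header = TRUE, sep = ",", stringsAsFactors = FALSE)
print(a)        # a 2-row, 2-column data frame
class(a$name)   # "character", because stringsAsFactors = FALSE
```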

  • Object Inspection
  1. str(a) Gives the structure of the object a, including its class, dimensions, types of variables, names of variables and a few values of the variables as well. The only drawback is that it can throw up a lot of information for a big data set
  2. names(a) Gives the names of variables of the object
  3. class(a) Gives the class of an object, like data.frame, list, matrix, vector etc
  4. dim(a) Gives the dimensions of the object (rows, columns)
  5. nrow(a) Gives the number of rows of object a- useful when used as an input for another function
  6. ncol(a) Gives the number of columns of object a
  7. length(a) Gives the length of the object- useful for vectors; for a data frame it is the same as ncol
  8. a[i,j] Gives the value in the ith row and jth column
  9. a$var1 Gives the variable named var1 in object a. This can be treated as a separate object on its own for inspection or analysis
  • Data Inspection
  1. head(a,10) gives the first ten rows of object a
  2. tail(a,20) gives the last twenty rows of object a
  3. b=ajay[sample(nrow(ajay),replace=F,size=0.05*nrow(ajay)),] 

Let’s get a 5% random sample of the rows of the object ajay.

[ , ] subsets the object, returning the values in the specified rows.

sample() is the command for sampling.

Out of nrow(ajay) row numbers (the total to sample from), size=0.05*nrow(ajay) takes 5% of them, and these are the row numbers that are returned. replace=F means each row number is selected only once.
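The inspection and sampling commands above can be sketched on a built-in data set (mtcars, standing in for a business file); a 25% sample is used here instead of 5% so the result is visible on a 32-row data set.

```r
# Inspect and sample the built-in mtcars data, renamed ajay for the sketch.
ajay = mtcars
str(ajay)                      # structure: 32 obs. of 11 variables
dim(ajay)                      # 32 rows, 11 columns
head(ajay, 3)                  # first three rows
set.seed(42)                   # make the random sample repeatable
b = ajay[sample(nrow(ajay), replace = FALSE, size = 0.25 * nrow(ajay)), ]
nrow(b)                        # 8 rows, i.e. a 25% sample
```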

  • Math Functions

Basic-

  1. sum – sum
  2. sqrt – square root
  3. sd – standard deviation
  4. log – log
  5. mean – mean
  6. median – median

Additional

  1. cumsum – Cumulative Sum for a column
  2. diff –Differencing
  3. lag – Lag
  • Data Manipulation
  1. paste(a$Var) converts Var from a Factor/Numeric variable to a Character variable
  2. as.numeric(a$Var2) Converts a character variable into a numeric variable
  3. is.na(a) returns TRUE whenever it encounters a Missing Value within the object
  4. na.omit(a) Deletes all missing values (denoted by NA within R)
  5. na.rm=T (this option enables you to calculate values ignoring Missing Values)
  6. nchar(abc) gives the number of characters in a character value
  7. substr("ajay",1,3) gives the substring from starting position 1 to ending position 3. Note that in R the index starts from 1 for the first object
  • Date Manipulation  

library(lubridate)
> a="20Mar1987"
> dmy(a)
[1] "1987-03-20 UTC"
> b="20/7/89"
> dmy(b)
[1] "1989-07-20 UTC"
> c="October 12 93"
> mdy(c)
[1] "1993-10-12 UTC"
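The math and manipulation helpers listed above can be sketched together on toy values (the vectors x and y are made up for illustration):

```r
# Basic math helpers on a toy numeric vector.
x = c(2, 4, 6, 8)
sum(x)                    # 20
sqrt(x); sd(x); mean(x); median(x)
cumsum(x)                 # running total: 2 6 12 20
diff(x)                   # first differences: 2 2 2

# Missing-value and string helpers on toy values.
y = c(1, NA, 3)
is.na(y)                  # FALSE TRUE FALSE
mean(y, na.rm = TRUE)     # 2, ignoring the missing value
na.omit(y)                # 1 3
nchar("ajay")             # 4 characters
substr("ajay", 1, 3)      # "aja"
```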

  • Data Analysis
  1. summary(a) Gives a summary of the object (including min, max, median, mean, 1st quartile, 3rd quartile) for numeric variables, and a frequency analysis of Factor variables
  2. table(a) Gives a frequency analysis of a variable or object
  3. table(a$var1,a$var2) Gives cross tabs of var1 with respect to var2 of object a

library(Hmisc) loads Hmisc, which enables us to use the describe and summarize functions

  1. describe(a$var1) gives a much more elaborate and concise summary of the variable Var 1- it’s a better version of summary
  2. summarize(a$var1,a$var2,FUN)  applies a function (like sum, median, summary or mean) on Var 1 , as GROUPED by Var2
  3. cor(a) gives correlation between all numeric variables of a
  • Data Visualization
  1. plot(a$var1,a$var2) Plots Var 1 with  Var 2
  2. boxplot(a) boxplot
  3. hist(a$Var1) Histogram
  4. plot(density(a$Var1)) Density Plot
  5. pie() – pie chart
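The analysis and plotting commands above can be sketched on the built-in iris data set instead of a business file:

```r
# Summarize, tabulate, correlate and plot the built-in iris data.
a = iris
summary(a)                        # min/quartiles/mean/max per numeric column
table(a$Species)                  # frequency analysis: 50 rows per species
cor(a[, 1:4])                     # correlation matrix of the numeric columns
hist(a$Sepal.Length)              # histogram
plot(density(a$Sepal.Length))     # density plot
```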

Modeling

  1. a=lm(y~x) creates model
  2. vif(a) gives Variance Inflation Factors (vif is in library(car))
  3. outlierTest(a) gives outliers (also from library(car))

summary(a) gives model summary including parameter estimates
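The modeling step can be sketched on simulated data; lm() and summary() are base R, while vif() and outlierTest() (not run here) come from the car package.

```r
# Fit a simple linear model on simulated data with a known true slope of 3.
set.seed(1)
x = 1:20
y = 3 * x + rnorm(20)     # true slope is 3, plus noise
m = lm(y ~ x)
summary(m)                # parameter estimates and R-squared
coef(m)["x"]              # fitted slope, close to 3
```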

  • Write Output
  1. write.csv(a,"output.csv") Writes the object out as a csv file
  2. png("graph.png") Writes the next plot to a png file (close the device with dev.off())
  3. q() –Quits R Session

——————————————————————————————————————————————————-

R is a huge language with 5000 packages and hundreds of thousands of functions.

But if you memorize these ~50 functions, I assure you that you will make a much more positive impression in your business analytics interview!

Bonus-

Sys.time() and Sys.Date() give the current time and date respectively (note the change in case),
while system.time(expression) gives the time taken to evaluate an expression.

Citation- http://www.ats.ucla.edu/stat/r/faq/timing_code.htm
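The difference between the clock functions and the timing function can be sketched as:

```r
# Sys.time()/Sys.Date() report the clock; system.time() measures how long
# an expression takes to evaluate.
Sys.time()                            # current date-time
Sys.Date()                            # current date
t = system.time(sum(runif(1e6)))      # time a million-draw summation
t["elapsed"]                          # elapsed seconds (a small number)
```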

Also

yo=function(a,b){ a*b*12 } creates a custom function called yo, which can then be invoked as yo(2,3)

Coming up-

the apply family, the Hadley collective  and GUIs  for the second round of interview ;)

The Amazing R-Fiddle truly brings #rstats to the browser

Datamind.com, whom I interact with on and off, and also the masterminds behind http://www.rdocumentation.org/,

have finally created their platform for interactive and gamified R learning on the web. Take a look- it looks slightly better than Codecademy’s interface, doesn’t it? The platform is called http://www.r-fiddle.org/#/

More power to R for Cloud Computing!


Now if they could only collaborate with other players like Quandl, BigML and even StatAce for an even cooler combination. Even Revolution Analytics and RStudio, who have very expensive training modules, should be able to use this for self-paced online learning courses!

Quote- A software of beauty is a joy forever – Keats

RapidMiner 6.0 launched!

What’s new–

  • Revised visualization and display creation
  • A new “statistics” view
  • Improved results view
  • Better tours and tutorials

RapidMiner v6.0 provides four specific application wizards:

  • Churn reduction
  • Direct marketing
  • Sentiment analysis
  • Predictive maintenance

Check it out today!

http://rapidminer.com/my-account/


Interview Christian Mladenov CEO StatAce Excellent and Hot #rstats StartUp

Here is an interview with Christian Mladenov, CEO of StatAce, a hot startup in cloud based data science and statistical computing.

Ajay Ohri (AO)- What is the difference between using R via StatAce and using R via an RStudio Server hosted on Amazon EC2?

Christian Mladenov (CM)- There are a few ways in which I think StatAce is better:

  • You do not need the technical skills to set up a server. You can instead start straight away at the click of a button.

  • You can save the full results for later reference. With an RStudio server you need to manually save and organize the text output and the graphics.

  • We are aiming to develop a visual interface for all the standard stuff. Then you will not need to know R at all.

  • We are developing features for collaboration, so that you can access and track changes to data, scripts and results in a team. With an RStudio server, you manage commits yourself, and Git is not suitable for large data files.

AO- How do you aim to differentiate yourself from other providers of R based software including Revolution, RStudio, Rapporter and even Oracle R Enterprise

CM- We aim to build a scalable, collaborative and easy to use environment. Pretty much everything else in the R ecosystem is lacking one, if not two of these. Most of the GUIs lack a visual way of doing the standard analyses. The ones that have it (e.g. Deducer) have a rather poor usability. Collaboration tools are hardly built in. RStudio has Git integration, but you need to set it up yourself, and you cannot really track large source data in Git.

Revolution Analytics have great technology, but you need to know R and you need to know how to maintain servers for large scale work. It is not very collaborative and can become quite expensive.

Rapporter is great for generating reports, but it is not very interactive – editing templates is a bit cumbersome if you just need to run a few commands. I think it wants to be the place to go to after you have finalized the development of the R code, so that you can share it.  Right now, I also do not see the scalability.

With Oracle R Enterprise you again need to know R. It is targeted at large enterprises and I imagine it is quite expensive, considering it only works with Oracle’s database. For that you need an IT team.

AO- How do you see the space for using R on a cloud?

CM- I think this is an area that has not received enough quality attention – there are some great efforts (e.g. ElasticR), but they are targeted at experienced R users. I see a few factors that facilitate the migration to the cloud:

  • Statisticians collaborate more and more, which means they need to have a place to share data, scripts and results.

  • The number of devices people use is increasing, and now frequently includes a tablet. Having things accessible through the web gives more freedom.

  • More and more data lives on servers. This is both because it is generated there (e.g. click streams) and because it is too big to fit on a user’s PC (e.g. raw DNA data). Using it where it already is prevents slow download/upload.

  • Centralizing data, scripts and results improves compliance (everybody knows where it is), reproducibility and reliability (it is easily backed up).

For me, bringing R to the cloud is a great opportunity.

AO-  What are some of the key technical challenges you currently face and are seeking to solve for R based cloud solutions

CM- Our main challenge is CPU use, since cloud servers typically have multiple slow cores and R is mostly single-threaded. We have yet to fully address that and are actively following the projects that aim to improve R’s interpreter – pqR, Renjin, Riposte, etc. One option is to move to bare metal servers, but then we will lose a lot of flexibility.

Another challenge is multi-server processing. This is also an area of progress where we do not yet have a stable solution.

AO- What are some of the advantages and disadvantages of being a Europe based tech startup vis a vis a San Francisco based tech startup?

CM-In Eastern Europe at least, you can live quite cheaply, therefore you can better focus on the product and the customers. In the US you need to spend a lot of time courting investors.

Eastern Europe also has a lot of technical talent – it is not that difficult or expensive to hire experienced engineers.

The disadvantages are many, and I think they outweigh the advantages:

  • Capital is scarce, especially after the seed stage. This means startups either have to focus on profit which limits their ability to execute a grander vision or they need to move to the US which wastes a lot of time and resources.

  • There is limited access to customers, partners, mentors and advisors. Most of the startup innovation happens in the US and its users prefer to deal with local companies.

  • The environment in Europe is not as supportive in terms of events, media coverage, and even social acceptance. In many countries founders are viewed with a bit of suspicion, and failure frequently means the end of one’s credibility.

AO- What advice would you give to aspiring data scientists

CM-Use open-source. R, Julia, Octave and the others are seeing a level of innovation that the commercial solutions just cannot match. They are very flexible and versatile, and if you need something specific, you should learn some Python and do it yourself.

Keep things reproducible, or at some point you will get lost. This includes a version control system.

Be active in the community. While books are great, sharing and seeking advice will improve your skills much faster.

Focus more on “why” you do something and “what” you want to achieve. Only then get technical about “how” you want to do it. Use a good IDE that facilitates your work and allows you to do the simple things fast. You know, like StatAce 🙂

AO- Describe your career journey from Student to CEO

CM-During my bachelor studies I worked as a software developer and customer intelligence analyst. This gave me a lot of perspective on software and data.

After graduating I got a job where I coordinated processes and led projects. This is where I discovered the importance of listening to customers, planning well in advance, and having good data to base decisions on.

In my master studies, it was my statistics-heavy thesis that made me think “why is there not a place where I can easily use the power of R on a machine with a lot of RAM?” This is when the idea for StatAce was born.

statacebetapitch

About StatAce-

Bulgarian StatAce is the winner of betapitch | global, which was held in Berlin on 6 July (read more about it here). The team, driven by the lack of software for low student budgets, came up with the idea of building “Google docs for professional statisticians” and eventually took home the first prize of the startup competition.