1-click Random Decision Forests

petersp, The Official Blog of BigML.com

One of the pitfalls of machine learning is that creating a single predictive model has the potential to overfit your data. That is, the performance on your training data might be very good, but the model does not generalize well to new data. Ensemble learning of decision trees, also referred to as forests or simply ensembles, is a tried-and-true technique for reducing the error of single machine-learned models. By learning multiple models over different subsamples of your data and taking a majority vote at prediction time, the risk of overfitting a single model to all of the data is mitigated. You can read more about this in our previous post.
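
As a rough illustration of the idea (not BigML's implementation), here is a minimal base R sketch of an ensemble: 25 one-split decision "stumps", each learned on a bootstrap resample of two iris species and combined by majority vote. The stump learner, the names `fit_stump` and `predict_stump`, the choice of `Petal.Width`, bootstrap resampling, and the number of models are all assumptions made for this example.

```r
# Sketch only: bagged decision stumps with a majority vote, in base R.
set.seed(42)

# Two-class toy data: keep two of the three iris species
d <- droplevels(iris[iris$Species != "setosa", ])

# Fit a one-split "stump" on a single numeric feature by picking the
# threshold with the lowest training error (hypothetical helper).
fit_stump <- function(data, feature = "Petal.Width") {
  x <- data[[feature]]
  y <- data$Species
  cuts <- sort(unique(x))
  errs <- sapply(cuts, function(cut) {
    pred <- ifelse(x <= cut, levels(y)[1], levels(y)[2])
    mean(pred != y)
  })
  list(feature = feature, cut = cuts[which.min(errs)], classes = levels(y))
}

# Predict a class label for each row of newdata (hypothetical helper).
predict_stump <- function(model, newdata) {
  ifelse(newdata[[model$feature]] <= model$cut,
         model$classes[1], model$classes[2])
}

# Learn 25 stumps, each on rows sampled with replacement
models <- lapply(1:25, function(i) {
  fit_stump(d[sample(nrow(d), replace = TRUE), ])
})

# One column of votes per model, one row per observation
votes <- sapply(models, predict_stump, newdata = d)

# Majority vote across the models for every observation
ensemble_pred <- apply(votes, 1, function(v) names(which.max(table(v))))

# Share of training rows the ensemble labels correctly
accuracy <- mean(ensemble_pred == d$Species)
accuracy
```

An odd number of models avoids tied votes in the two-class case. A real forest would use full trees and hold-out data rather than scoring on the training rows.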

Early this year, we showed how BigML ensembles outperform their solo counterparts and even beat other machine learning services. However, up until now creating ensembles with BigML has only been available via our API. We are excited to announce that ensembles are now available via our…


A SunBurst of Insight

A nice addition to big data visualization: sunbursts (which I have covered in the Data Viz chapter of my R book).

Great work by BigML.com

davidgerster, The Official Blog of BigML.com

This is a guest post by David Gerster (@gerster), a data scientist and investor in BigML.

I work at a consumer web company, and recently used BigML to understand what drives return visits to our site. I followed Standard Operating Procedure for data mining, sampling a group of users, dividing them into two classes, and creating several features that I hoped would be useful in predicting these classes. I then fed this training data to BigML, which quickly and obediently produced a decision tree:

[Image: decision tree]

Next I used BigML’s interface to examine the tree’s many subsets, shown as “nodes” in the diagram above. I moused over a node at the top of the tree and saw that it achieved high separation for a large fraction of the training set:

[Image: Shhh, I'm hunting nodes!]

This one node covered 58% of the data, and separated the two classes with 73% confidence. (“Confidence” is a measure…


Predictive Analytics World goes to Chicago

A message from our sponsors, and my favorite analytics conference. (If only I could attend a cool analytics conference nearby in Asia, say Singapore or Turkey. Sigh. Will even useR never come to Asia?)

This is the number 1 conference for analytics in the world, and it is next month in Chicago, USA. So you think you have the best analytics software, product, or service? Here is where you can find out!

It’s time to amp up your analytics strategy by attending Predictive Analytics World Chicago, June 10-13, 2013. With over 30 case studies from leading organizations across a spectrum of industries, this is the must-attend event for anyone serious about their analytics strategy.

Here’s what your peers had to say about their experience at PAW:

“Great speakers, interesting content, and great networking. PAW conferences are among my favorite analytic events!”
– Karl Rexer, Ph.D., Rexer Analytics

“This vendor neutral conference always gives me tangible ideas I can put to work right away.”
– Greg Hayworth, Humana

“Predictive Analytics World did a great job keeping up with the trends in Predictive Modeling. There were also plenty of opportunities to learn about the most valuable resources available to data scientists.”
– Conor Sontag, Marketing Evolution

“People who are in analytics must join Predictive Analytics World and see the state of the art projects.”
– Burak Buyuktombak, Avea Telecommunication Services (Turkey)

And there is more where that came from.

Who’s attending PAW Chicago 2013?

Here are just a few of the many companies attending:

[Image: companies attending PAW Chicago]

And many more!

Registration options for all budgets.

PAW Chicago has a variety of conference pass options available to meet budgets of all sizes.

Learn more about pricing and how to register.

Register Now!


Using a Linux only package in Windows #rstats

Here is some R code for using an R package that is available only as a tar.gz file (the source format used to install R packages on Linux), with no zip file available (the format used to install R packages on Windows).

Step 1- Download the tar.gz file.

Step 2- Unzip it twice using 7zip (once for the .gz layer, once for the .tar layer).

Step 3- Change the path variable below to the location of the R subfolder inside the unzipped package folder.

Step 4- Copy and paste the code below into R.

Step 5- Start using the R package in Windows (where 75% of the money and clients and businesses still are).

Caveat Emptor- No X Dependencies (ok!)

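# Folder holding the package's R source files (edit to your unzipped location)
path <- "C:\\Users\\KUs\\Desktop\\segue\\R"
# List every file in that folder, with its full path
files <- dir(path, full.names = TRUE)
# Source each file so its functions are defined in the current session
for (f in files) source(f)
# Show the objects that are now available
ls()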
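
A variation on the snippet above, offered as a sketch rather than a recommended method: the same idea wrapped in a function that sources the files into their own environment, so the package's functions do not crowd the global workspace. The helper name `source_dir` and the demo folder are hypothetical; `sys.source()` and `list.files()` are base R.

```r
# Hypothetical helper: source every .R file in a folder into one environment.
source_dir <- function(path, envir = new.env(parent = globalenv())) {
  # Only pick up R scripts, with full paths so no manual pasting is needed
  files <- list.files(path, pattern = "\\.[Rr]$", full.names = TRUE)
  for (f in files) sys.source(f, envir = envir)
  envir
}

# Self-contained demo: write a tiny "package" file to a temporary folder
demo_dir <- file.path(tempdir(), "demo_pkg_R")
dir.create(demo_dir, showWarnings = FALSE)
writeLines("add_one <- function(x) x + 1", file.path(demo_dir, "add_one.R"))

# Load it and call the function through the returned environment
pkg <- source_dir(demo_dir)
pkg$add_one(1)
```

For a real package you would point `source_dir()` at the unzipped R subfolder from Step 3. Like the original loop, this only defines the R functions; it does not register documentation or compile any C or Fortran code.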

 


Adding a + to a bit.ly link gets you analytics on your spammers

Just add a + sign to any bit.ly link and you get to see associated analytics for that link.

You can get information (traffic, referrers, locations, conversations) about any Bit.ly link simply by taking the short URL and adding a “+” at the end (minus the quotes).
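
If you have many short links to check, the trick is easy to script. A minimal sketch in R, where the function name `bitly_stats_url` and the link `http://bit.ly/abc123` are placeholders invented for this example:

```r
# Hypothetical helper: turn a bit.ly short link into its analytics link
# by appending "+" (any existing trailing "+" is dropped first).
bitly_stats_url <- function(short_url) {
  paste0(sub("\\+$", "", short_url), "+")
}

# Placeholder link, not a real one
bitly_stats_url("http://bit.ly/abc123")
```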

Click on the image below and notice the + sign in the URL.

Read more here (this can be useful, not just fun):

Using Bit.ly for Spying, Link Building and Happiness

Unrelated- I interview Hilary Mason, Analytics legend and Bit.ly Chief Scientist here –

Interview Hilary Mason Chief Scientist bitly


Using R for Cricket Analysis #rstats #IPL

# Download the batting data across all formats of cricket
library(XML)
url <- "http://stats.espncricinfo.com/ci/engine/stats/index.html?class=11;template=results;type=batting"
# stringsAsFactors = FALSE avoids getting factor variables,
# since we will need to convert these variables to numeric variables
tables <- readHTMLTable(url, stringsAsFactors = FALSE)
table2 <- tables$"Overall figures"
rm(tables)
# Create new variables from Span (first year, last year, years played)
table2$Debut <- as.numeric(substr(table2$Span, 1, 4))
table2$LastYr <- as.numeric(substr(table2$Span, 6, 10))
table2$YrsPlayed <- table2$LastYr - table2$Debut
# In cricket a not-out score is denoted by *, which can cause data quality errors.
# This is handled with grepl to find the * and gsub to remove it.
# Note the double \ to escape the regex character
table2$HSNotOut <- grepl("\\*", table2$HS)
table2$HS2 <- gsub("\\*", "", table2$HS)
# A for loop (!) to convert columns 3 to 17 to numeric variables
for (i in 3:17) {
  table2[, i] <- as.numeric(table2[, i])
}
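
To try the cleaning steps without hitting the website, here is a small self-contained sketch on a toy data frame. The player names and numbers are made up for illustration, and the Span format "YYYY-YYYY" is an assumption inferred from the substr() positions used above.

```r
# Toy data with invented values, shaped like the Span and HS columns above
toy <- data.frame(
  Player = c("Player A", "Player B"),
  Span   = c("1989-2013", "2004-2012"),
  HS     = c("248*", "183"),
  stringsAsFactors = FALSE
)

# First and last year from Span, then the length of the career
toy$Debut     <- as.numeric(substr(toy$Span, 1, 4))
toy$LastYr    <- as.numeric(substr(toy$Span, 6, 10))
toy$YrsPlayed <- toy$LastYr - toy$Debut

# Flag not-out high scores, then strip the * so the score becomes numeric
toy$HSNotOut <- grepl("\\*", toy$HS)
toy$HS2      <- as.numeric(gsub("\\*", "", toy$HS))

toy
```

Keeping the flag in HSNotOut before removing the * means no information is lost when HS2 is converted to a number.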

And we see why Sachin Tendulkar is the best (by using ggplot via Deducer).


Also see 

  • Freakonomics Challenge-
    1. Prove match fixing does not and cannot exist in IPL
    2. Create an ideal fantasy team
    
    

 

Using R for Cricket Analysis #rstats

ESPN Cricinfo is the best site for cricket data (you can see an earlier detailed post on the database here: https://decisionstats.com/2012/04/07/cricinfo-statsguru-database-for-statistical-and-graphical-analysis/ ), and using the XML package in R we can easily scrape and manipulate the data.

Here is the code.

library(XML)
url="http://stats.espncricinfo.com/ci/engine/stats/index.html?class=1;team=6;template=results;type=batting"
#Note I can also break the url string and use paste command to modify this url with parameters
tables=readHTMLTable(url)
tables$"Overall figures"

#Now see this- since I only got 50 results in each page, I look at the url of next page

table1=tables$"Overall figures"
url="http://stats.espncricinfo.com/ci/engine/stats/index.html?class=1;page=2;team=6;template=results;type=batting"
tables=readHTMLTable(url)
table2=tables$"Overall figures"

#Now I need to join these two tables vertically

table3=rbind(table1,table2)

Note- I can also automate the web scraping.
Now that the data is within R, we can use something like Deducer to visualize it.
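
One way the automation could look, as a sketch under stated assumptions: the URL is built with paste0 from the same parameters seen in the two URLs above, and the pages are fetched in a loop and stacked with rbind. The function names `build_url` and `scrape_pages` are hypothetical, and it is an assumption that `page=1` returns the same results as the first URL (which has no page parameter).

```r
# Hypothetical helper: build the Statsguru URL for one page of results
build_url <- function(page, team = 6, class = 1) {
  paste0("http://stats.espncricinfo.com/ci/engine/stats/index.html",
         "?class=", class, ";page=", page, ";team=", team,
         ";template=results;type=batting")
}

# Hypothetical helper: fetch several pages and join them vertically.
# Needs the XML package and a network connection, so it is only defined here.
scrape_pages <- function(pages) {
  stopifnot(requireNamespace("XML", quietly = TRUE))
  tables <- lapply(pages, function(p) {
    XML::readHTMLTable(build_url(p), stringsAsFactors = FALSE)[["Overall figures"]]
  })
  do.call(rbind, tables)
}

# The URL for page 2 matches the one typed by hand above
build_url(2)
```

With a working connection, `table3 <- scrape_pages(1:2)` would then stand in for the two manual downloads and the rbind.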