Random Sampling a Dataset in R

A common example in business  analytics data is to take a random sample of a very large dataset, to test your analytics code. Note most business analytics datasets are data.frame ( records as rows and variables as columns)  in structure or database bound.This is partly due to a legacy of traditional analytics software.

Here is how we do it in R-

• Refering to parts of data.frame rather than whole dataset.

Using square brackets to reference variable columns and rows

The notation dataset[i,k] refers to element in the ith row and jth column.

The notation dataset[i,] refers to all elements in the ith row .or a record for a data.frame

The notation dataset[,j] refers to all elements in the jth column- or a variable for a data.frame.

For a data.frame dataset

> nrow(dataset) #This gives number of rows

> ncol(dataset) #This gives number of columns

An example for corelation between only a few variables in a data.frame.

> cor(dataset1[,4:6])

Splitting a dataset into test and control.

ts.test=dataset2[1:200] #First 200 rows

ts.control=dataset2[201:275] #Next 75 rows

• Sampling

Random sampling enables us to work on a smaller size of the whole dataset.

use sample to create a random permutation of the vector x.

Suppose we want to take a 5% sample of a data frame with no replacement.

Let us create a dataset ajay of random numbers

ajay=matrix( round(rnorm(200, 5,15)), ncol=10)

#This is the kind of code line that frightens most MBAs!!

Note we use the round function to round off values.

ajay=as.data.frame(ajay)

 nrow(ajay)

[1] 20

> ncol(ajay)

[1] 10

This is a typical business data scenario when we want to select only a few records to do our analysis (or test our code), but have all the columns for those records. Let  us assume we want to sample only 5% of the whole data so we can run our code on it

Then the number of rows in the new object will be 0.05*nrow(ajay).That will be the size of the sample.

The new object can be referenced to choose only a sample of all rows in original object using the size parameter.

We also use the replace=FALSE or F , to not the same row again and again. The new_rows is thus a 5% sample of the existing rows.

Then using the square backets and ajay[new_rows,] to get-

b=ajay[sample(nrow(ajay),replace=F,size=0.05*nrow(ajay)),]

 

You can change the percentage from 5 % to whatever you want accordingly.

JMP 10 released

JMP , the visual data exploration, statistical quality control software from SAS Institute launched version 10 of its software today.

Source-http://jmp.com/about/events/webcasts/jmp_webcast.shtml?name=jmp10

JMP 10 includes:

Numerous enhancements to the drag-and-drop Graph Builder, including a new iPad application.

A cutting-edge Control Chart Builder to create process control charts with drag-and-drop ease.

New reliability capabilities, including growth and forecast models.

Additions and improvements for sorting and filtering data, design of experiments, statistical modeling, scripting, add-in and application development, script debugging and more.

From JohnSall’s blog post at http://blogs.sas.com/content/jmp/2012/03/20/discover-more-with-jmp-10/

Much of the development centered on four focus areas:

1. Graph Builder everywhere. The Graph Builder platform itself has new features like Heatmap and Treemap, an elements palette and properties panel, making the choices more visible. But Graph Builder also has some descendents now, including the new Control Chart Builder, which makes creating control charts an interactive process. In addition, some of the drag-and-drop features that are used to change columns in Graph Builder are also available in Distribution, Fit Y by X, and a few other places. Finally, Graph Builder has been ported to the iPad. For the first time, you can use JMP for exploration and presentation on a mobile device for free. So just think of Graph Builder as gradually taking over in lots of places.

2. Expert-driven design.reliability, measurement systems, and partial least squares analyses.

3. Performance.  this release has the most new multithreading so far

4. Application development

You can read more here –http://jmp.com/about/events/webcasts/jmpwebcast_detail.shtml?reglink=70130000001r9IP

Facebook and R

Part 1 How do people at Facebook use R?

tamar Rosenn, Facebook

Itamar conveyed how Facebook’s Data Team used R in 2007 to answer two questions about new users: (i) which data points predict whether a user will stay? and (ii) if they stay, which data points predict how active they’ll be after three months?

For the first question, Itamar’s team used recursive partitioning (via the rpartpackage) to infer that just two data points are significantly predictive of whether a user remains on Facebook: (i) having more than one session as a new user, and (ii) entering basic profile information.

For the second question, they fit the data to a logistic model using a least angle regression approach (via the lars package), and found that activity at three months was predicted by variables related to three classes of behavior: (i) how often a user was reached out to by others, (ii) frequency of third party application use, and (iii) what Itamar termed “receptiveness” — related to how forthcoming a user was on the site.

source-http://www.dataspora.com/2009/02/predictive-analytics-using-r/

and cute graphs like the famous

https://www.facebook.com/notes/facebook-engineering/visualizing-friendships/469716398919

 

and

studying baseball on facebook

https://www.facebook.com/notes/facebook-data-team/baseball-on-facebook/10150142265858859

by counting the number of posts that occurred the day after a team lost divided by the total number of wins, since losses for great teams are remarkable and since winning teams’ fans just post more.

 

But mostly at

https://www.facebook.com/data?sk=notes and https://www.facebook.com/data?v=app_4949752878

 

and creating new packages

1. jjplot (not much action here!)

https://r-forge.r-project.org/scm/viewvc.php/?root=jjplot

though

I liked the promise of JJplot at

http://pleasescoopme.com/2010/03/31/using-jjplot-to-explore-tipping-behavior/

2. ising models

https://github.com/slycoder/Rflim

https://www.facebook.com/note.php?note_id=10150359708746212

3. R pipe

https://github.com/slycoder/Rpipe

 

even the FB interns are cool

http://brenocon.com/blog/2009/02/comparison-of-data-analysis-packages-r-matlab-scipy-excel-sas-spss-stata/

 

Part 2 How do people with R use Facebook?

Using the API at https://developers.facebook.com/tools/explorer

and code mashes from

 

http://romainfrancois.blog.free.fr/index.php?post/2012/01/15/Crawling-facebook-with-R

http://applyr.blogspot.in/2012/01/mining-facebook-data-most-liked-status.html

but the wonderful troubleshooting code from http://www.brocktibert.com/blog/2012/01/19/358/

which needs to be added to the code first

 

and using network package

>access_token=”XXXXXXXXXXXX”

Annoyingly the Facebook token can expire after some time, this can lead to huge wait and NULL results with Oauth errors

If that happens you need to regenerate the token

What we need
> require(RCurl)
> require(rjson)
> download.file(url=”http://curl.haxx.se/ca/cacert.pem”, destfile=”cacert.pem”)

Roman’s Famous Facebook Function (altered)

> facebook <- function( path = “me”, access_token , options){
+ if( !missing(options) ){
+ options <- sprintf( “?%s”, paste( names(options), “=”, unlist(options), collapse = “&”, sep = “” ) )
+ } else {
+ options <- “”
+ }
+ data <- getURL( sprintf( “https://graph.facebook.com/%s%s&access_token=%s&#8221;, path, options, access_token ), cainfo=”cacert.pem” )
+ fromJSON( data )
+ }

 

Now getting the friends list
> friends <- facebook( path=”me/friends” , access_token=access_token)
> # extract Facebook IDs
> friends.id <- sapply(friends$data, function(x) x$id)
> # extract names
> friends.name <- sapply(friends$data, function(x) iconv(x$name,”UTF-8″,”ASCII//TRANSLIT”))
> # short names to initials
> initials <- function(x) paste(substr(x,1,1), collapse=””)
> friends.initial <- sapply(strsplit(friends.name,” “), initials)

This matrix can take a long time to build, so you can change the value of N to say 40 to test your network. I needed to press the escape button to cut short the plotting of all 400 friends of mine.
> # friendship relation matrix
> N <- length(friends.id)
> friendship.matrix <- matrix(0,N,N)
> for (i in 1:N) {
+ tmp <- facebook( path=paste(“me/mutualfriends”, friends.id[i], sep=”/”) , access_token=access_token)
+ mutualfriends <- sapply(tmp$data, function(x) x$id)
+ friendship.matrix[i,friends.id %in% mutualfriends] <- 1
+ }

 

Plotting using Network package in R (with help from the  comments at http://applyr.blogspot.in/2012/01/mining-facebook-data-most-liked-status.html)

> require(network)

>net1<- as.network(friendship.matrix)

> plot(net1, label=friends.initial, arrowhead.cex=0)

(Rgraphviz is tough if you are on Windows 7 like me)

but there is an alternative igraph solution at https://github.com/sciruela/facebookFriends/blob/master/facebook.r

 

After all that-..talk.. a graph..of my Facebook Network with friends initials as labels..

 

Opinion piece-

I hope plans to make the Facebook R package get fulfilled (just as the twitteR  package led to many interesting analysis)

and also Linkedin has an API at http://developer.linkedin.com/apis

I think it would be interesting to plot professional relationships across social networks as well. But I hope to see a LinkedIn package (or blog code) soon.

As for jjplot, I had hoped ggplot and jjplot merged or atleast had some kind of inclusion in the Deducer GUI. Maybe a Google Summer of Code project if people are busy!!

Also the geeks at Facebook.com can think of giving something back to the R community, as Google generously does with funding packages like RUnit, Deducer and Summer of Code, besides sponsoring meet ups etc.

 

(note – this is part of the research for the upcoming book ” R for Business Analytics”)

 

ps-

but didnt get time to download all my posts using R code at

https://gist.github.com/1634662#

or do specific Facebook Page analysis using R at

http://tonybreyal.wordpress.com/2012/01/06/r-web-scraping-r-bloggers-facebook-page-to-gain-further-information-about-an-authors-r-blog-posts-e-g-number-of-likes-comments-shares-etc/

Updated-

 #access token from https://developers.facebook.com/tools/explorer
access_token="AAuFgaOcVaUZAssCvL9dPbZCjghTEwwhNxZAwpLdZCbw6xw7gARYoWnPHxihO1DcJgSSahd67LgZDZD"
require(RCurl)
 require(rjson)
# download the file needed for authentication http://www.brocktibert.com/blog/2012/01/19/358/
download.file(url="http://curl.haxx.se/ca/cacert.pem", destfile="cacert.pem")
# http://romainfrancois.blog.free.fr/index.php?post/2012/01/15/Crawling-facebook-with-R
facebook <- function( path = "me", access_token = token, options){
if( !missing(options) ){
options <- sprintf( "?%s", paste( names(options), "=", unlist(options), collapse = "&", sep = "" ) )
} else {
options <- ""
}
data <- getURL( sprintf( "https://graph.facebook.com/%s%s&access_token=%s", path, options, access_token ), cainfo="cacert.pem" )
fromJSON( data )
}

 # see http://applyr.blogspot.in/2012/01/mining-facebook-data-most-liked-status.html

# scrape the list of friends
friends <- facebook( path="me/friends" , access_token=access_token)
# extract Facebook IDs
friends.id <- sapply(friends$data, function(x) x$id)
# extract names 
friends.name <- sapply(friends$data, function(x)  iconv(x$name,"UTF-8","ASCII//TRANSLIT"))
# short names to initials 
initials <- function(x) paste(substr(x,1,1), collapse="")
friends.initial <- sapply(strsplit(friends.name," "), initials)

# friendship relation matrix
#N <- length(friends.id)
N <- 200
friendship.matrix <- matrix(0,N,N)
for (i in 1:N) {
  tmp <- facebook( path=paste("me/mutualfriends", friends.id[i], sep="/") , access_token=access_token)
  mutualfriends <- sapply(tmp$data, function(x) x$id)
  friendship.matrix[i,friends.id %in% mutualfriends] <- 1
}
require(network)
net1<- as.network(friendship.matrix)
plot(net1, label=friends.initial, arrowhead.cex=0)

Created by Pretty R at inside-R.org

Interview Michal Kosinski , Concerto Web Based App using #Rstats

Here is an interview with Michal Kosinski , leader of the team that has created Concerto – a web based application using R. What is Concerto? As per http://www.psychometrics.cam.ac.uk/page/300/concerto-testing-platform.htm

Concerto is a web based, adaptive testing platform for creating and running rich, dynamic tests. It combines the flexibility of HTML presentation with the computing power of the R language, and the safety and performance of the MySQL database. It’s totally free for commercial and academic use, and it’s open source

Ajay-  Describe your career in science from high school to this point. What are the various stats platforms you have trained on- and what do you think about their comparative advantages and disadvantages?  

Michal- I started with maths, but quickly realized that I prefer social sciences – thus after one year, I switched to a psychology major and obtained my MSc in Social Psychology with a specialization in Consumer Behaviour. At that time I was mostly using SPSS – as it was the only statistical package that was taught to students in my department. Also, it was not too bad for small samples and the rather basic analyses I was performing at that time.

 

My more recent research performed during my Mphil course in Psychometrics at Cambridge University followed by my current PhD project in social networks and research work at Microsoft Research, requires significantly more powerful tools. Initially, I tried to squeeze as much as possible from SPSS/PASW by mastering the syntax language. SPSS was all I knew, though I reached its limits pretty quickly and was forced to switch to R. It was a pretty dreary experience at the start, switching from an unwieldy but familiar environment into an unwelcoming command line interface, but I’ve quickly realized how empowering and convenient this tool was.

 

I believe that a course in R should be obligatory for all students that are likely to come close to any data analysis in their careers. It is really empowering – once you got the basics you have the potential to use virtually any method there is, and automate most tasks related to analysing and processing data. It is also free and open-source – so you can use it wherever you work. Finally, it enables you to quickly and seamlessly migrate to other powerful environments such as Matlab, C, or Python.

Ajay- What was the motivation behind building Concerto?

Michal- We deal with a lot of online projects at the Psychometrics Centre – one of them attracted more than 7 million unique participants. We needed a powerful tool that would allow researchers and practitioners to conveniently build and deliver online tests.

Also, our relationships with the website designers and software engineers that worked on developing our tests were rather difficult. We had trouble successfully explaining our needs, each little change was implemented with a delay and at significant cost. Not to mention the difficulties with embedding some more advanced methods (such as adaptive testing) in our tests.

So we created a tool allowing us, psychometricians, to easily develop psychometric tests from scratch an publish them online. And all this without having to hire software developers.

Ajay -Why did you choose R as the background for Concerto? What other languages and platforms did you consider. Apart from Concerto, how else do you utilize R in your center, department and University?

Michal- R was a natural choice as it is open-source, free, and nicely integrates with a server environment. Also, we believe that it is becoming a universal statistical and data processing language in science. We put increasing emphasis on teaching R to our students and we hope that it will replace SPSS/PASW as a default statistical tool for social scientists.

Ajay -What all can Concerto do besides a computer adaptive test?

Michal- We did not plan it initially, but Concerto turned out to be extremely flexible. In a nutshell, it is a web interface to R engine with a built-in MySQL database and easy-to-use developer panel. It can be installed on both Windows and Unix systems and used over the network or locally.

Effectively, it can be used to build any kind of web application that requires a powerful and quickly deployable statistical engine. For instance, I envision an easy to use website (that could look a bit like SPSS) allowing students to analyse their data using a web browser alone (learning the underlying R code simultaneously). Also, the authors of R libraries (or anyone else) could use Concerto to build user-friendly web interfaces to their methods.

Finally, Concerto can be conveniently used to build simple non-adaptive tests and questionnaires. It might seem to be slightly less intuitive at first than popular questionnaire services (such us my favourite Survey Monkey), but has virtually unlimited flexibility when it comes to item format, test flow, feedback options, etc. Also, it’s free.

Ajay- How do you see the cloud computing paradigm growing? Do you think browser based computation is here to stay?

Michal – I believe that cloud infrastructure is the future. Dynamically sharing computational and network resources between online service providers has a great competitive advantage over traditional strategies to deal with network infrastructure. I am sure the security concerns will be resolved soon, finishing the transformation of the network infrastructure as we know it. On the other hand, however, I do not see a reason why client-side (or browser) processing of the information should cease to exist – I rather think that the border between the cloud and personal or local computer will continually dissolve.

About

Michal Kosinski is Director of Operations for The Psychometrics Centre and Leader of the e-Psychometrics Unit. He is also a research advisor to the Online Services and Advertising group at the Microsoft Research Cambridge, and a visiting lecturer at the Department of Mathematics in the University of Namur, Belgium. You can read more about him at http://www.michalkosinski.com/

You can read more about Concerto at http://code.google.com/p/concerto-platform/ and http://www.psychometrics.cam.ac.uk/page/300/concerto-testing-platform.htm

Predictive Models Ain’t Easy to Deploy

 

This is a guest blog post by Carole Ann Matignon of Sparkling Logic. You can see more on Sparkling Logic at http://my.sparklinglogic.com/

Decision Management is about combining predictive models and business rules to automate decisions for your business. Insurance underwriting, loan origination or workout, claims processing are all very good use cases for that discipline… But there is a hiccup… It ain’t as easy you would expect…

What’s easy?

If you have a neat model, then most tools would allow you to export it as a PMML model – PMML stands for Predictive Model Markup Language and is a standard XML representation for predictive model formulas. Many model development tools let you export it without much effort. Many BRMS – Business rules Management Systems – let you import it. Tada… The model is ready for deployment.

What’s hard?

The problem that we keep seeing over and over in the industry is the issue around variables.

Those neat predictive models are formulas based on variables that may or may not exist as is in your object model. When the variable is itself a formula based on the object model, like the min, max or sum of Dollar amount spent in Groceries in the past 3 months, and the object model comes with transaction details, such that you can compute it by iterating through those transactions, then the problem is not “that” big. PMML 4 introduced some support for those variables.

The issue that is not easy to fix, and yet quite frequent, is when the model development data model does not resemble the operational one. Your Data Warehouse very likely flattened the object model, and pre-computed some aggregations that make the mapping very hard to restore.

It is clearly not an impossible project as many organizations do that today. It comes with a significant overhead though that forces modelers to involve IT resources to extract the right data for the model to be operationalized. It is a heavy process that is well justified for heavy-duty models that were developed over a period of time, with a significant ROI.

This is a show-stopper though for other initiatives which do not have the same ROI, or would require too frequent model refresh to be viable. Here, I refer to “real” model refresh that involves a model reengineering, not just a re-weighting of the same variables.

For those initiatives where time is of the essence, the challenge will be to bring closer those two worlds, the modelers and the business rules experts, in order to streamline the development AND deployment of analytics beyond the model formula. The great opportunity I see is the potential for a better and coordinated tuning of the cut-off rules in the context of the model refinement. In other words: the opportunity to refine the strategy as a whole. Very ambitious? I don’t think so.

About Carole Ann Matignon

http://my.sparklinglogic.com/index.php/company/management-team

Carole-Ann Matignon Print E-mail

Carole-Ann MatignonCarole-Ann Matignon – Co-Founder, President & Chief Executive Officer

She is a renowned guru in the Decision Management space. She created the vision for Decision Management that is widely adopted now in the industry.  Her claim to fame is managing the strategy and direction of Blaze Advisor, the leading BRMS product, while she also managed all the Decision Management tools at FICO (business rules, predictive analytics and optimization). She has a vision for Decision Management both as a technology and a discipline that can revolutionize the way corporations do business, and will never get tired of painting that vision for her audience.  She speaks often at Industry conferences and has conducted university classes in France and Washington DC.

She started her career building advanced systems using all kinds of technologies — expert systems, rules, optimization, dashboarding and cubes, web search, and beta version of database replication. At Cleversys (acquired by Kurt Salmon & Associates), she also conducted strategic consulting gigs around change management.

While playing with advanced software components, she found a passion for technology and joined ILOG (acquired by IBM). She developed a growing interest in Optimization as well as Business Rules. At ILOG, she coined the term BRMS while brainstorming with her Sales counterpart. She led the Presales organization for Telecom in the Americas up until 2000 when she joined Blaze Software (acquired by Brokat Technologies, HNC Software and finally FICO).

Her 360-degree experience allowed her to gain appreciation for all aspects of a software company, giving her a unique perspective on the business. Her technical background kept her very much in touch with technology as she advanced.

Predictive analytics in the cloud : Angoss

I interviewed Angoss in depth here at http://www.decisionstats.com/interview-eberhard-miethke-and-dr-mamdouh-refaat-angoss-software/

Well they just announced a predictive analytics in the cloud.

 

http://www.angoss.com/predictive-analytics-solutions/cloud-solutions/

Solutions

Overview

KnowledgeCLOUD™ solutions deliver predictive analytics in the Cloud to help businesses gain competitive advantage in the areas of sales, marketing and risk management by unlocking the predictive power of their customer data.

KnowledgeCLOUD clients experience rapid time to value and reduced IT investment, and enjoy the benefits of Angoss’ industry leading predictive analytics – without the need for highly specialized human capital and technology.

KnowledgeCLOUD solutions serve clients in the asset management, insurance, banking, high tech, healthcare and retail industries. Industry solutions consist of a choice of analytical modules:

KnowledgeCLOUD for Sales/Marketing

KnowledgeCLOUD solutions are delivered via KnowledgeHUB™, a secure, scalable cloud-based analytical platform together with supporting deployment processes and professional services that deliver predictive analytics to clients in a hosted environment. Angoss industry leading predictive analytics technology is employed for the development of models and deployment of solutions.

Angoss’ deep analytics and domain expertise guarantees effectiveness – all solutions are back-tested for accuracy against historical data prior to deployment. Best practices are shared throughout the service to optimize your processes and success. Finely tuned client engagement and professional services ensure effective change management and program adoption throughout your organization.

For businesses looking to gain a competitive edge and put their data to work, Angoss is the ideal partner.

—-

Hmm. Analytics in the cloud . Reduce hardware costs. Reduce software costs . Increase profitability margins.

Hmmmmm

My favorite professor in North Carolina who calls cloud as a time sharing, are you listening Professor?