JMP and R – #rstats

An amazing example of R being used successfully in combination (and not in isolation) with other enterprise software is the add-in functionality of JMP and its R integration.

See the following JMP add-ins that use R:

http://support.sas.com/demosdownloads/downarea_t4.jsp?productID=110454&jmpflag=Y

JMP Add-in: Multidimensional Scaling using R

This add-in creates a new menu command under the Add-Ins Menu in the submenu R Add-ins. The script will launch a custom dialog (or prompt for a JMP data table if one is not already open) where you can cast columns into roles for performing MDS on the data table. The analysis results in a data table of MDS dimensions and associated output graphics. MDS is a dimension reduction method that produces coordinates in Euclidean space (usually 2D, 3D) that best represent the structure of a full distance/dissimilarity matrix. MDS requires that the input be a symmetric dissimilarity matrix. Input to this application can be data that is already in the form of a symmetric dissimilarity matrix, or the dissimilarity matrix can be computed from the input data (where dissimilarity measures are calculated between rows of the input data table in R).
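For readers who want a sense of what the R side of this add-in does, here is a minimal sketch using base R’s cmdscale() on the built-in mtcars data; the add-in’s own script may differ.

# compute a dissimilarity matrix between rows, then reduce it to 2D coordinates
d <- dist(scale(mtcars))             # symmetric dissimilarity matrix from the rows
fit <- cmdscale(d, k = 2)            # classical MDS into 2 Euclidean dimensions
plot(fit[, 1], fit[, 2], type = "n", xlab = "Dimension 1", ylab = "Dimension 2")
text(fit[, 1], fit[, 2], labels = rownames(mtcars), cex = 0.7)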

Submitted by: Kelci Miclaus (SAS employee). Initiative: All.
Application: Add-Ins. Analysis: Exploratory Data Analysis.

Chernoff Faces Add-in

One way to plot multivariate data is to use Chernoff faces. For each observation in your data table, a face is drawn such that each variable in your data set is represented by a feature in the face. This add-in uses JMP’s R integration functionality to create Chernoff faces. An R install and the TeachingDemos R package are required to use this add-in.
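For reference, the underlying R call is roughly as below (a sketch, assuming the TeachingDemos package is installed); the add-in wraps something like this behind a JMP dialog.

# install.packages("TeachingDemos")   # one-time install
library(TeachingDemos)
# one face per row; each column of the data drives one facial feature
faces(mtcars[1:12, ])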

Submitted by: Clay Barker (SAS employee). Initiative: All.
Application: Add-Ins. Analysis: Data Visualization.

Support Vector Machine for Classification

By simply opening a data table, specifying X and Y variables, selecting a kernel function, and specifying its parameters in the user-friendly dialog, you can build a classification model using a Support Vector Machine. Please note that the R package ‘e1071’ should be installed before running this dialog. The package can be found at http://cran.r-project.org/web/packages/e1071/index.html.
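The equivalent workflow directly in R looks roughly like this (a sketch, assuming the e1071 package is installed; the add-in’s dialog maps its choices onto arguments such as kernel, cost, and gamma).

library(e1071)
fit <- svm(Species ~ ., data = iris, kernel = "radial", cost = 1, gamma = 0.25)
pred <- predict(fit, iris)
table(predicted = pred, actual = iris$Species)   # confusion matrix on the training data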

Submitted by: Jong-Seok Lee (SAS employee). Initiative: All.
Application: Add-Ins. Analysis: Exploratory Data Analysis/Mining.

Penalized Regression Add-in

This add-in uses JMP’s R integration functionality to provide access to several penalized regression methods. Methods included are the LASSO (least absolute shrinkage and selection operator), LARS (least angle regression), Forward Stagewise, and the Elastic Net. An R install and the “lars” and “elasticnet” R packages are required to use this add-in.
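A hedged sketch of the corresponding R calls (not the add-in’s own script), assuming the lars and elasticnet packages are installed:

library(lars)
library(elasticnet)
x <- as.matrix(mtcars[, -1])               # predictor matrix
y <- mtcars$mpg                            # response
fit_lasso <- lars(x, y, type = "lasso")    # type can also be "lar" or "forward.stagewise"
fit_enet <- enet(x, y, lambda = 0.5)       # elastic net with a ridge penalty of 0.5
plot(fit_lasso)                            # coefficient paths along the regularization path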

Submitted by: Clay Barker (SAS employee). Initiative: All.
Application: Add-Ins. Analysis: Regression.

JMP Add-in: Univariate Nonparametric Bootstrapping

This script performs simple univariate, nonparametric bootstrap sampling by using the JMP to R Project integration. A JMP Dialog is built by the script where the variable you wish to perform bootstrapping over can be specified. A statistic to compute for each bootstrap sample is chosen and the data are sent to R using new JSL functionality available in JMP 9. The boot package in R is used to call the boot() function and the boot.ci() function to calculate the sample statistic for each bootstrap sample and the basic bootstrap confidence interval. The results are brought back to JMP and displayed using the JMP Distribution platform.
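Stripped of the JSL dialog and the data transfer, the R side of this add-in boils down to something like the following sketch (assuming the boot package is installed).

library(boot)
stat_fn <- function(d, idx) mean(d[idx])                  # statistic computed on each resample
b <- boot(data = mtcars$mpg, statistic = stat_fn, R = 1000)
boot.ci(b, type = "basic")                                # basic bootstrap confidence interval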

Submitted by: Kelci Miclaus (SAS employee). Initiative: All.
Application: Add-Ins. Analysis: Basic Statistics.

Interview Zach Goldberg, Google Prediction API

Here is an interview with Zach Goldberg, the product manager of the Google Prediction API, Google’s next-generation, cloud-based machine-learning-as-an-API service for building predictive models.
Ajay- Describe your journey in science and technology from high school to your current job at Google.

Zach- First, thanks so much for the opportunity to do this interview Ajay!  My personal journey started in college where I worked at a startup named Invite Media.   From there I transferred to the Associate Product Manager (APM) program at Google.  The APM program is a two year rotational program.  I did my first year working in display advertising.  After that I rotated to work on the Prediction API.

Ajay- How does the Google Prediction API help an average business analytics customer who is already using enterprise software and servers to generate his business forecasts? How does the Google Prediction API fit in with, or complement, other APIs in the Google API suite?

Zach- The Google Prediction API is a cloud based machine learning API.  We offer the ability for anybody to sign up and within a few minutes have their data uploaded to the cloud, a model built and an API to make predictions from anywhere. Traditionally the task of implementing predictive analytics inside an application required a fair amount of domain knowledge; you had to know a fair bit about machine learning to make it work.  With the Google Prediction API you only need to know how to use an online REST API to get started.

You can learn more about how we help businesses by watching our video and going to our project website.

Ajay- What are the additional use cases of the Google Prediction API that you think traditional enterprise software in business analytics ignores, or is not so strong on? What use cases would you suggest NOT using the Google Prediction API for in an enterprise?

Zach- We are living in a world that is changing rapidly thanks to technology.  Storing, accessing, and managing information is much easier and more affordable than it was even a few years ago.  That creates exciting opportunities for companies, and we hope the Prediction API will help them derive value from their data.

The Prediction API focuses on providing predictive solutions to two types of problems: regression and classification. Businesses facing problems where there is sufficient data to describe an underlying pattern in either of these two areas can expect to derive value from using the Prediction API.

Ajay- What are your separate incentives to teach about Google APIs to academics or researchers in universities globally?

Zach- I’d refer you to our university relations page

Google thrives on academic curiosity. While we do significant in-house research and engineering, we also maintain strong relations with leading academic institutions world-wide pursuing research in areas of common interest. As part of our mission to build the most advanced and usable methods for information access, we support university research, technological innovation and the teaching and learning experience through a variety of programs.

Ajay- What is the biggest challenge you face while communicating about the Google Prediction API to traditional users of enterprise software?

Zach- Businesses often expect that implementing predictive analytics is going to be very expensive and require a lot of resources.  Many have already begun investing heavily in this area.  Quite often we’re faced with surprise, and even skepticism, when they see the simplicity of the Google Prediction API.  We work really hard to provide a very powerful solution and take care of the complexity of building high quality models behind the scenes so businesses can focus more on building their business and less on machine learning.

Moving data between Windows and Ubuntu VMWare partition

I use Windows 7 on my laptop (it came pre-installed) and Ubuntu using the VMWare Player. What are the advantages of using VMWare Player instead of creating a dual-boot system? I can quickly shift from Ubuntu to Windows and back again without restarting my computer every time. This approach lets me use software that runs only on Windows alongside software like Rattle, the R data mining GUI, which is much easier to install on Linux.

However, if your statistical software is on your virtual disk and your data is on your Windows disk, you need a way to move data from Windows to Ubuntu.

The solution to this, as described on the VMware community forums, is: http://communities.vmware.com/thread/55242

Open My Computer, browse to the folder you want to share.  Right-click on the folder, select Properties.  Sharing tab.  Select the radio button to “Share this Folder”.  Change the default generated name if you wish; add a description if you wish.  Click the Permissions button to modify the security settings of what users can read/write to the share.

On the Linux side, it depends on the distro, the shell, and the window manager.

Well, Ubuntu makes it really easy to configure the Linux steps for moving data between the Windows and Linux partitions.

 

NEW UPDATE-

VMware makes it easy to share folders between your Windows (host) and Linux (guest) OS.

Start the sharing wizard in VMware Player (the step-by-step screenshots are omitted here). When you finish the wizard and share a drive or folder, where do you see the shared ones? Look in this folder on the Linux side: /mnt/hgfs (bingo!)

Hacker HW – Make the folder /mnt/hgfs a shortcut in Places at your Ubuntu startup.

Hacker HW 2 – Upload your VM dark data to Ubuntu One using an anonymous email, delete the VM, purge it using software XX, then reinstall the VM and restore the backup. Note the time it takes to do this.

General Sharing in Windows –

Just open the Network tab in Ubuntu (screenshots omitted). Windows will now ask your Ubuntu user for a login.

Once logged in to Windows from within the Ubuntu VMware guest, a folder called “users on <windows username>-pc” appears on your Ubuntu Desktop (see the top right of the screenshot). If you double-click it, you see your Windows path.

You can now just click and drag data between your Windows and Linux partitions, just the way you do it in Windows.

So based on this, if you want to build decision trees, artificial neural networks, regression models, and even time series models for zero capital expenditure, you can use Ubuntu/R without compromising on your organization’s Windows-only IT policy (there is a shortage of Ubuntu-trained IT administrators in the enterprise world).

Revised Installation Procedure for utilizing Ubuntu/R/Rattle data mining on your Windows PC.

Using VMWare to build a free data mining system in R, as well as isolate your analytics system (thus using both Linux and Windows without overburdening your machine)

First Time

  1. Download and install VMware Player from http://downloads.vmware.com/d/info/desktop_end_user_computing/vmware_player/4_0
  2. Download (only) the Ubuntu image from http://www.ubuntu.com/download/ubuntu/download
  3. Create a new virtual image in VMware Player
  4. Applications -> Terminal -> sudo apt-get install r-base (to download and install R)
  5. sudo R (to open R)
  6. Once R is opened, type install.packages("rattle") to install rattle
  7. library(rattle) will load Rattle
  8. rattle() will open the GUI
Getting Data from Host to Guest VM
Next Time
  1. Go to VM Player
  2. Open the VM
  3. sudo R in terminal to bring up R
  4. library(rattle) within R
  5. rattle()
At this point, even if you don’t know any Linux and don’t know any R, you can create data mining models using the Rattle GUI (and time series models using the epack plugin in the R Commander GUI). What can Rattle do in data mining? See this slideshow: http://www.decisionstats.com/data-mining-with-r-gui-rattle-rstats/
If Google Docs is banned under your enterprise IT policy of allowing Windows Explorer only, you can instead see these screenshots: http://rattle.togaware.com/rattle-screenshots.html
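As a hypothetical illustration of getting data from the host into the guest R session, the shared folder mounted at /mnt/hgfs can be read directly from R; the folder name “Share” and the file name below are placeholders for whatever you shared from Windows.

mydata <- read.csv("/mnt/hgfs/Share/sales.csv")   # file shared from the Windows host
library(rattle)
rattle()   # then load the data frame mydata from Rattle's Data tab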

Building a Regression Model in R – Use #Rstats

One of the most common uses of statistical software is building models, particularly logistic regression models for propensity in the marketing of goods and services.

 

If building a model is what you do, here is a brief, easy essay on how to build a model in R.

1) Packages to be used-

For smaller datasets, use these:

  1. car package http://cran.r-project.org/web/packages/car/index.html
  2. gvlma package http://cran.r-project.org/web/packages/gvlma/index.html
  3. ROCR package http://rocr.bioinf.mpi-sb.mpg.de/
  4. relaimpo package
  5. DAAG package
  6. MASS package
  7. bootstrap package
  8. leaps package

Also see

http://cran.r-project.org/web/packages/rms/index.html or the rms package

rms works with almost any regression model, but it was especially written to work with binary or ordinal logistic regression, Cox regression, accelerated failure time models, ordinary linear models, the Buckley-James model, generalized least squares for serially or spatially correlated observations, generalized linear models, and quantile regression.
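A minimal sketch of a binary logistic fit with rms (assuming the package is installed; the variables are illustrative only):

library(rms)
df <- mtcars
dd <- datadist(df); options(datadist = "dd")   # rms uses a datadist for summaries and plots
fit <- lrm(am ~ mpg + wt, data = df)           # binary logistic regression
print(fit)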

For bigger datasets, also see the biglm package (http://cran.r-project.org/web/packages/biglm/index.html) and the RevoScaleR package.

http://www.revolutionanalytics.com/products/enterprise-big-data.php
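A rough sketch of biglm’s chunked fitting (assuming the biglm package is installed; the chunking below is artificial and only shows the update() pattern):

library(biglm)
chunk1 <- mtcars[1:16, ]
chunk2 <- mtcars[17:32, ]
fit <- biglm(mpg ~ wt + hp, data = chunk1)   # fit on the first chunk of data
fit <- update(fit, chunk2)                   # fold in the next chunk without reloading everything
summary(fit)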

2) Syntax

  1. outp=lm(y~x1+x2+xn, data=dataset) - model equation
  2. summary(outp) - model summary
  3. par(mfrow=c(2,2)); plot(outp) - model diagnostic graphs
  4. vif(outp) - multicollinearity (car package)
  5. gvlma(outp) - heteroscedasticity and other assumption checks (gvlma package)
  6. outlierTest(outp) - outliers (car package)
  7. predict(outp) - scoring the dataset with predicted values
  8. anova(outp) - ANOVA table
  9. predict(lm.result, data.frame(conc = newconc), level = 0.9, interval = "confidence") - confidence intervals for predictions on new data

A worked example on a built-in dataset follows below.
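Here is a worked version of these steps on the built-in mtcars dataset (a sketch; vif() and outlierTest() come from the car package and gvlma() from the gvlma package listed above):

library(car)      # vif(), outlierTest()
library(gvlma)    # gvlma()
outp <- lm(mpg ~ wt + hp + qsec, data = mtcars)   # 1. model equation
summary(outp)                                     # 2. model summary
par(mfrow = c(2, 2)); plot(outp)                  # 3. diagnostic graphs
vif(outp)                                         # 4. multicollinearity
gvlma(outp)                                       # 5. global validation of model assumptions
outlierTest(outp)                                 # 6. outliers
head(predict(outp))                               # 7. scored (fitted) values
anova(outp)                                       # 8. ANOVA table
predict(outp, newdata = data.frame(wt = 3, hp = 150, qsec = 17),
        interval = "confidence", level = 0.9)     # 9. interval prediction on new data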

 

For a Reference Card / Cheat Sheet, see

http://cran.r-project.org/doc/contrib/Ricci-refcard-regression.pdf

3) Also read-

http://cran.r-project.org/web/views/Econometrics.html

http://cran.r-project.org/web/views/Robust.html

 

Interview Dan Steinberg Founder Salford Systems

Here is an interview with Dan Steinberg, Founder and President of Salford Systems (http://www.salford-systems.com/ )

Ajay- Describe your journey from academia to technology entrepreneurship. What are the key milestones or turning points that you remember?

 Dan- When I was in graduate school studying econometrics at Harvard,  a number of distinguished professors at Harvard (and MIT) were actively involved in substantial real world activities.  Professors that I interacted with, or studied with, or whose software I used became involved in the creation of such companies as Sun Microsystems, Data Resources, Inc. or were heavily involved in business consulting through their own companies or other influential consultants.  Some not involved in private sector consulting took on substantial roles in government such as membership on the President’s Council of Economic Advisors. The atmosphere was one that encouraged free movement between academia and the private sector so the idea of forming a consulting and software company was quite natural and did not seem in any way inconsistent with being devoted to the advancement of science.

 Ajay- What are the latest products by Salford Systems? Any future product plans or modifications to work on Big Data analytics, mobile computing and cloud computing?

 Dan- Our central set of data mining technologies are CART, MARS, TreeNet, RandomForests, and PRIM, and we have always maintained feature rich logistic regression and linear regression modules. In our latest release scheduled for January 2012 we will be including a new data mining approach to linear and logistic regression allowing for the rapid processing of massive numbers of predictors (e.g., one million columns), with powerful predictor selection and coefficient shrinkage. The new methods allow not only classic techniques such as ridge and lasso regression, but also sub-lasso model sizes. Clear tradeoff diagrams between model complexity (number of predictors) and predictive accuracy allow the modeler to select an ideal balance suitable for their requirements.

The new version of our data mining suite, Salford Predictive Modeler (SPM), also includes two important extensions to the boosted tree technology at the heart of TreeNet.  The first, Importance Sampled learning Ensembles (ISLE), is used for the compression of TreeNet tree ensembles. Starting with, say, a 1,000 tree ensemble, the ISLE compression might well reduce this down to 200 reweighted trees. Such compression will be valuable when models need to be executed in real time. The compression rate is always under the modeler’s control, meaning that if a deployed model may only contain, say, 30 trees, then the compression will deliver an optimal 30-tree weighted ensemble. Needless to say, compression of tree ensembles should be expected to be lossy and how much accuracy is lost when extreme compression is desired will vary from case to case. Prior to ISLE, practitioners have simply truncated the ensemble to the maximum allowable size.  The new methodology will substantially outperform truncation.

The second major advance is RULEFIT, a rule extraction engine that starts with a TreeNet model and decomposes it into the most interesting and predictive rules. RULEFIT is also a tree ensemble post-processor and offers the possibility of improving on the original TreeNet predictive performance. One can think of the rule extraction as an alternative way to explain and interpret an otherwise complex multi-tree model. The rules extracted are similar conceptually to the terminal nodes of a CART tree but the various rules will not refer to mutually exclusive regions of the data.

 Ajay- You have led teams that have won multiple data mining competitions. What are some of your favorite techniques or approaches to a data mining problem?

 Dan- We only enter competitions involving problems for which our technology is suitable, generally, classification and regression. In these areas, we are  partial to TreeNet because it is such a capable and robust learning machine. However, we always find great value in analyzing many aspects of a data set with CART, especially when we require a compact and easy to understand story about the data. CART is exceptionally well suited to the discovery of errors in data, often revealing errors created by the competition organizers themselves. More than once, our reports of data problems have been responsible for the competition organizer’s decision to issue a corrected version of the data and we have been the only group to discover the problem.

In general, tackling a data mining competition is no different than tackling any analytical challenge. You must start with a solid conceptual grasp of the problem and the actual objectives, and the nature and limitations of the data. Following that comes feature extraction, the selection of a modeling strategy (or strategies), and then extensive experimentation to learn what works best.

 Ajay- I know you have created your own software. But are there other software that you use or liked to use?

 Dan- For analytics we frequently test open source software to make sure that our tools will in fact deliver the superior performance we advertise. In general, if a problem clearly requires technology other than that offered by Salford, we advise clients to seek other consultants expert in that other technology.

 Ajay- Your software is installed at 3500 sites including 400 universities as per http://www.salford-systems.com/company/aboutus/index.html What is the key to managing and keeping so many customers happy?

 Dan- First, we have taken great pains to make our software reliable, and we make every effort to avoid problems related to bugs. Our testing procedures are extensive and we have experts dedicated to stress-testing the software. Second, our interface is designed to be natural, intuitive, and easy to use, so the challenges to the new user are minimized. Also, clear documentation, help files, and training videos round out how we allow the user to look after themselves. Should a client need to contact us, we try to achieve 24-hour turnaround on tech support issues and monitor all tech support activity to ensure timeliness, accuracy, and helpfulness of our responses. WebEx/GoToMeeting and other internet-based contact permit real-time interaction.

 Ajay- What do you do to relax and unwind?

 Dan- I am in the gym almost every day combining weight and cardio training. No matter how tired I am before the workout I always come out energized so locating a good gym during my extensive travels is a must. I am also actively learning Portuguese so I look to watch a Brazilian TV show or Portuguese dubbed movie when I have time; I almost never watch any form of video unless it is available in Portuguese.

 Biography-

http://www.salford-systems.com/blog/dan-steinberg.html

Dan Steinberg, President and Founder of Salford Systems, is a well-respected member of the statistics and econometrics communities. In 1992, he developed the first PC-based implementation of the original CART procedure, working in concert with Leo Breiman, Richard Olshen, Charles Stone and Jerome Friedman. In addition, he has provided consulting services on a number of biomedical and market research projects, which have sparked further innovations in the CART program and methodology.

Dr. Steinberg received his Ph.D. in Economics from Harvard University, and has given full day presentations on data mining for the American Marketing Association, the Direct Marketing Association and the American Statistical Association. After earning his Ph.D. in Econometrics at Harvard, Steinberg began his professional career as a Member of the Technical Staff at Bell Labs, Murray Hill, and then as Assistant Professor of Economics at the University of California, San Diego. A book he co-authored on Classification and Regression Trees was awarded the 1999 Nikkei Quality Control Literature Prize in Japan for excellence in statistical literature promoting the improvement of industrial quality control and management.

His consulting experience at Salford Systems has included complex modeling projects for major banks worldwide, including Citibank, Chase, American Express, Credit Suisse, and has included projects in Europe, Australia, New Zealand, Malaysia, Korea, Japan and Brazil. Steinberg led the teams that won first place awards in the KDDCup 2000, and the 2002 Duke/TeraData Churn modeling competition, and the teams that won awards in the PAKDD competitions of 2006 and 2007. He has published papers in economics, econometrics, computer science journals, and contributes actively to the ongoing research and development at Salford.

England rule India- again

If you type the words “business intelligence expert” into Google, you may get as the top-ranked result http://goo.gl/pCqUh, or Peter James Thomas, as profound a name as can be, since it spans three of the most important saints in the church.

His current post is on the very non-business-intelligence topic of Wager: http://peterjamesthomas.com/2011/07/20/wager/

It details how Peter, a virtual friend whom I have never met, and who looks suspiciously like Hugh Grant with the hair, and Ajay Ohri (myself) made a wager on which cricket team would emerge victorious in the ongoing Test series. It was a 4-match series, and India needed to at least win the series, or avoid losing it by a margin of 2, to retain their number 1 world cricket ranking (in Tests).

Sadly, at the end of the third Test, the Indian cricket team has lost the series 3-0, along with the world number 1 ranking and some serious respect.

What is a Test Match? It is a game of cricket played over 5 days.
Why was Ajay so confident India would win? Because India won the one-day world championship in April 2011. A one-day match is a single-day format of cricket.

There lies the problem. From an analytic point of view, I had been lulled into thinking that past performance was an indicator of future performance, indeed the basis of most analytical assumptions. Quite critically, I managed to overlook the following cricketing points-

1) Cricket performance is different from credit performance. It is the people and their fitness.

India’s strike bowler Zaheer Khan was out due to injury, and we did not have an adequate replacement for him. India’s best opener Virender Sehwag was out due to a shoulder injury in the first two Tests.

Moral – Statistics can be misleading if you do not apply recent knowledge coupled with domain expertise (in this case cricket).

2) What goes up must come down. Indeed if a team has performed its best two months back, it is a good sign that cyclicality will ensure performance will go down.

Moral- Do not depend on regression or time series while ignoring cyclical trends.

India’s cricket team is aging. England’s cricket team is youthful.

I should have gotten this one right. One of the big and understated reasons that the Indian economy is booming is that we have the youngest population in the world, with a median age of 28.

or, as per http://en.wikipedia.org/wiki/Demographics_of_India:

India has more than 50% of its population below the age of 25 and more than 65% hovers below the age of 35. It is expected that, in 2020, the average age of an Indian will be 29 years, compared to 37 for China and 48 for Japan; and, by 2030, India’s dependency ratio should be just over 0.4

India’s population is 1.21 billion people, so potentially a much larger pool of athletes, once we put away our laptops that is.

http://en.wikipedia.org/wiki/Demographics_of_UK

 

the total population of the United Kingdom was 58,789,194 (I don’t have numbers for the average age).

 

Paradoxically, India has the oldest cricket team in the world. This calls for detailed investigation, and some old-timers should give way to newcomers after this drubbing.

Moral- Demographics matters. It is the people who vary more than any variable.

4) The Indian cricket team has played much less Test cricket and much more 20:20 and one-day matches. 20:20 is a format in which only twenty overs are bowled per side. In Test Matches, 90 overs are bowled every day for 5 days.

Stamina is critical in sports.

Moral- Context is important in extrapolating forecasts.

Everything said and done, the English cricket team played hard and fair and deserve to be number one. I would love to say more on the Indian cricket team, but I now intend to watch Manchester United play soccer.

Analytics 2011 Conference

From http://www.sas.com/events/analytics/us/

The Analytics 2011 Conference Series combines the power of SAS’s M2010 Data Mining Conference and F2010 Business Forecasting Conference into one conference covering the latest trends and techniques in the field of analytics. Analytics 2011 Conference Series brings the brightest minds in the field of analytics together with hundreds of analytics practitioners. Join us as these leading conferences change names and locations. At Analytics 2011, you’ll learn through a series of case studies, technical presentations and hands-on training. If you are in the field of analytics, this is one conference you can’t afford to miss.

Conference Details

October 24-25, 2011
Grande Lakes Resort
Orlando, FL

Analytics 2011 topic areas include: