R for Predictive Modeling- PAW Toronto

A nice workshop on using R for Predictive Modeling by Max Kuhn Director, Nonclinical Statistics, Pfizer is on at PAW Toronto.

Workshop

Monday, April 23, 2012 in Toronto
Full-day: 9:00am – 4:30pm

R for Predictive Modeling:
A Hands-On Introduction

Intended Audience: Practitioners who wish to learn how to execute on predictive analytics by way of the R language; anyone who wants “to turn ideas into software, quickly and faithfully.”

Knowledge Level: Either hands-on experience with predictive modeling (without R) or hands-on familiarity with any programming language (other than R) is sufficient background and preparation to participate in this workshop.


What prior attendees have exclaimed


Workshop Description

This one-day session provides a hands-on introduction to R, the well-known open-source platform for data analysis. Real examples are employed in order to methodically expose attendees to best practices driving R and its rich set of predictive modeling packages, providing hands-on experience and know-how. R is compared to other data analysis platforms, and common pitfalls in using R are addressed.

The instructor, a leading R developer and the creator of CARET, a core R package that streamlines the process for creating predictive models, will guide attendees on hands-on execution with R, covering:

  • A working knowledge of the R system
  • The strengths and limitations of the R language
  • Preparing data with R, including splitting, resampling and variable creation
  • Developing predictive models with R, including decision trees, support vector machines and ensemble methods
  • Visualization: Exploratory Data Analysis (EDA), and tools that persuade
  • Evaluating predictive models, including viewing lift curves, variable importance and avoiding overfitting

Hardware: Bring Your Own Laptop
Each workshop participant is required to bring their own laptop running Windows or OS X. The software used during this training program, R, is free and readily available for download.

Attendees receive an electronic copy of the course materials and related R code at the conclusion of the workshop.


Schedule

  • Workshop starts at 9:00am
  • Morning Coffee Break at 10:30am – 11:00am
  • Lunch provided at 12:30 – 1:15pm
  • Afternoon Coffee Break at 2:30pm – 3:00pm
  • End of the Workshop: 4:30pm

Instructor

Max Kuhn, Director, Nonclinical Statistics, Pfizer

Max Kuhn is a Director of Nonclinical Statistics at Pfizer Global R&D in Connecticut. He has been applying models in the pharmaceutical industries for over 15 years.

He is a leading R developer and the author of several R packages including the CARET package that provides a simple and consistent interface to over 100 predictive models available in R.

Mr. Kuhn has taught courses on modeling within Pfizer and externally, including a class for the India Ministry of Information Technology.

Source-

http://www.predictiveanalyticsworld.com/toronto/2012/r_for_predictive_modeling.php

Interview Kelci Miclaus, SAS Institute Using #rstats with JMP

Here is an interview with Kelci Miclaus, a researcher working with the JMP division of the SAS Institute, in which she demonstrates examples of how the R programming language is a great hit with JMP customers who like to be flexible.

 

Ajay- How has JMP been using integration with R? What has been the feedback from customers so far? Is there a single case study you can point out where the combination of JMP and R was better than any one of them alone?

Kelci- Feedback from customers has been very positive. Some customers are using JMP to foster collaboration between SAS and R modelers within their organizations. Many are using JMP’s interactive visualization to complement their use of R. Many SAS and JMP users are using JMP’s integration with R to experiment with more bleeding-edge methods not yet available in commercial software. It can be used simply to smooth the transition with regard to sending data between the two tools, or used to build complete custom applications that take advantage of both JMP and R.

One customer has been using JMP and R together for Bayesian analysis. He uses R to create MCMC chains and has found that JMP is a great tool for preparing the data for analysis, as well as displaying the results of the MCMC simulation. For example, the Control Chart platform and the Bubble Plot platform in JMP can be used to quickly verify convergence of the algorithm. The use of both tools together can increase productivity since the results of an analysis can be achieved faster than through scripting and static graphics alone.

I, along with a few other JMP developers, have written applications that use JMP scripting to call out to R packages and perform analyses like multidimensional scaling, bootstrapping, support vector machines, and modern variable selection methods. These really show the benefit of interactive visual analysis of coupled with modern statistical algorithms. We’ve packaged these scripts as JMP add-ins and made them freely available on our JMP User Community file exchange. Customers can download them and now employ these methods as they would a regular JMP platform. We hope that our customers familiar with scripting will also begin to contribute their own add-ins so a wider audience can take advantage of these new tools.

(see http://www.decisionstats.com/jmp-and-r-rstats/)

Ajay- Are there plans to extend JMP integration with other languages like Python?

Kelci- We do have plans to integrate with other languages and are considering integrating with more based on customer requests. Python has certainly come up and we are looking into possibilities there.

 Ajay- How is R a complimentary fit to JMP’s technical capabilities?

Kelci- R has an incredible breadth of capabilities. JMP has extensive interactive, dynamic visualization intrinsic to its largely visual analysis paradigm, in addition to a strong core of statistical platforms. Since our brains are designed to visually process pictures and animated graphs more efficiently than numbers and text, this environment is all about supporting faster discovery. Of course, JMP also has a scripting language (JSL) allowing you to incorporate SAS code, R code, build analytical applications for others to leverage SAS, R and other applications for users who don’t code or who don’t want to code.

JSL is a powerful scripting language on its own. It can be used for dialog creation, automation of JMP statistical platforms, and custom graphic scripting. In other ways, JSL is very similar to the R language. It can also be used for data and matrix manipulation and to create new analysis functions. With the scripting capabilities of JMP, you can create custom applications that provide both a user interface and an interactive visual back-end to R functionality. Alternatively, you could create a dashboard using statistical and/or graphical platforms in JMP to explore the data and with the click of a button, send a portion of the data to R for further analysis.

Another JMP feature that complements R is the add-in architecture, which is similar to how R packages work. If you’ve written a cool script or analysis workflow, you can package it into a JMP add-in file and send it to your colleagues so they can easily use it.

Ajay- What is the official view on R from your organization? Do you think it is a threat, or a complimentary product or another statistical platform that coexists with your offerings?

Kelci- Most definitely, we view R as complimentary. R contributors are providing a tremendous service to practitioners, allowing them to try a wide variety of methods in the pursuit of more insight and better results. The R community as a whole is providing a valued role to the greater analytical community by focusing attention on newer methods that hold the most promise in so many application areas. Data analysts should be encouraged to use the tools available to them in order to drive discovery and JMP can help with that by providing an analytic hub that supports both SAS and R integration.

Ajay-  While you do use R, are there any plans to give back something to the R community in terms of your involvement and participation (say at useR events) or sponsoring contests.

 Kelci- We are certainly open to participating in useR groups. At Predictive Analytics World in NY last October, they didn’t have a local useR group, but they did have a Predictive Analytics Meet-up group comprised of many R users. We were happy to sponsor this. Some of us within the JMP division have joined local R user groups, myself included.  Given that some local R user groups have entertained topics like Excel and R, Python and R, databases and R, we would be happy to participate more fully here. I also hope to attend the useR! annual meeting later this year to gain more insight on how we can continue to provide tools to help both the JMP and R communities with their work.

We are also exploring options to sponsor contests and would invite participants to use their favorite tools, languages, etc. in pursuit of the best model. Statistics is about learning from data and this is how we make the world a better place.

About- Kelci Miclaus

Kelci is a research statistician developer for JMP Life Sciences at SAS Institute. She has a PhD in Statistics from North Carolina State University and has been using SAS products and R for several years. In addition to research interests in statistical genetics, clinical trials analysis, and multivariate analysis/visualization methods, Kelci works extensively with JMP, SAS, and R integration.

.

 

JMP and R – #rstats

An amazing example of R being used sucessfully in combination (and not is isolation) with other enterprise software is the add-ins functionality of JMP and it’s R integration.

See the following JMP add-ins which use R

http://support.sas.com/demosdownloads/downarea_t4.jsp?productID=110454&jmpflag=Y

JMP Add-in: Multidimensional Scaling using R

This add-in creates a new menu command under the Add-Ins Menu in the submenu R Add-ins. The script will launch a custom dialog (or prompt for a JMP data table is one is not already open) where you can cast columns into roles for performing MDS on the data table. The analysis results in a data table of MDS dimensions and associated output graphics. MDS is a dimension reduction method that produces coordinates in Euclidean space (usually 2D, 3D) that best represent the structure of a full distance/dissimilarity matrix. MDS requires that input be a symmetric dissimilarity matrix. Input to this application can be data that is already in the form of a symmetric dissimilarity matrix or the dissimilarity matrix can be computed based on the input data (where dissimilarity measures are calculated between rows of the input data table in R).

Submitted by: Kelci Miclaus SAS employee Initiative: All
Application: Add-Ins Analysis: Exploratory Data Analysis

Chernoff Faces Add-in

One way to plot multivariate data is to use Chernoff faces. For each observation in your data table, a face is drawn such that each variable in your data set is represented by a feature in the face. This add-in uses JMP’s R integration functionality to create Chernoff faces. An R install and the TeachingDemos R package are required to use this add-in.

Submitted by: Clay Barker SAS employee Initiative: All
Application: Add-Ins Analysis: Data Visualization

Support Vector Machine for Classification

By simply opening a data table, specifying X, Y variables, selecting a kernel function, and specifying its parameters on the user-friendly dialog, you can build a classification model using Support Vector Machine. Please note that R package ‘e1071’ should be installed before running this dialog. The package can be found from http://cran.r-project.org/web/packages/e1071/index.html.

Submitted by: Jong-Seok Lee SAS employee Initiative: All
Application: Add-Ins Analysis: Exploratory Data Analysis/Mining

Penalized Regression Add-in

This add-in uses JMP’s R integration functionality to provide access to several penalized regression methods. Methods included are the LASSO (least absolutee shrinkage and selection operator, LARS (least angle regression), Forward Stagewise, and the Elastic Net. An R install and the “lars” and “elasticnet” R packages are required to use this add-in.

Submitted by: Clay Barker SAS employee Initiative: All
Application: Add-Ins Analysis: Regression

MP Addin: Univariate Nonparametric Bootstrapping

This script performs simple univariate, nonparametric bootstrap sampling by using the JMP to R Project integration. A JMP Dialog is built by the script where the variable you wish to perform bootstrapping over can be specified. A statistic to compute for each bootstrap sample is chosen and the data are sent to R using new JSL functionality available in JMP 9. The boot package in R is used to call the boot() function and the boot.ci() function to calculate the sample statistic for each bootstrap sample and the basic bootstrap confidence interval. The results are brought back to JMP and displayed using the JMP Distribution platform.

Submitted by: Kelci Miclaus SAS employee Initiative: All
Application: Add-Ins Analysis: Basic Statistics

#Rstats for Business Intelligence

This is a short list of several known as well as lesser known R ( #rstats) language codes, packages and tricks to build a business intelligence application. It will be slightly Messy (and not Messi) but I hope to refine it someday when the cows come home.

It assumes that BI is basically-

a Database, a Document Database, a Report creation/Dashboard pulling software as well unique R packages for business intelligence.

What is business intelligence?

Seamless dissemination of data in the organization. In short let it flow- from raw transactional data to aggregate dashboards, to control and test experiments, to new and legacy data mining models- a business intelligence enabled organization allows information to flow easily AND capture insights and feedback for further action.

BI software has lately meant to be just reporting software- and Business Analytics has meant to be primarily predictive analytics. the terms are interchangeable in my opinion -as BI reports can also be called descriptive aggregated statistics or descriptive analytics, and predictive analytics is useless and incomplete unless you measure the effect in dashboards and summary reports.

Data Mining- is a bit more than predictive analytics- it includes pattern recognizability as well as black box machine learning algorithms. To further aggravate these divides, students mostly learn data mining in computer science, predictive analytics (if at all) in business departments and statistics, and no one teaches metrics , dashboards, reporting  in mainstream academia even though a large number of graduates will end up fiddling with spreadsheets or dashboards in real careers.

Using R with

1) Databases-

I created a short list of database connectivity with R here at https://rforanalytics.wordpress.com/odbc-databases-for-r/ but R has released 3 new versions since then.

The RODBC package remains the package of choice for connecting to SQL Databases.

http://cran.r-project.org/web/packages/RODBC/RODBC.pdf

Details on creating DSN and connecting to Databases are given at  https://rforanalytics.wordpress.com/odbc-databases-for-r/

For document databases like MongoDB and CouchDB

( what is the difference between traditional RDBMS and NoSQL if you ever need to explain it in a cocktail conversation http://dba.stackexchange.com/questions/5/what-are-the-differences-between-nosql-and-a-traditional-rdbms

Basically dispensing with the relational setup, with primary and foreign keys, and with the additional overhead involved in keeping transactional safety, often gives you extreme increases in performance

NoSQL is a kind of database that doesn’t have a fixed schema like a traditional RDBMS does. With the NoSQL databases the schema is defined by the developer at run time. They don’t write normal SQL statements against the database, but instead use an API to get the data that they need.

instead relating data in one table to another you store things as key value pairs and there is no database schema, it is handled instead in code.)

I believe any corporation with data driven decision making would need to both have atleast one RDBMS and one NoSQL for unstructured data-Ajay. This is a sweeping generic statement 😉 , and is an opinion on future technologies.

  • Use RMongo

From- http://tommy.chheng.com/2010/11/03/rmongo-accessing-mongodb-in-r/

http://plindenbaum.blogspot.com/2010/09/connecting-to-mongodb-database-from-r.html

Connecting to a MongoDB database from R using Java

http://nsaunders.wordpress.com/2010/09/24/connecting-to-a-mongodb-database-from-r-using-java/

Also see a nice basic analysis using R Mongo from

http://pseudofish.com/blog/2011/05/25/analysis-of-data-with-mongodb-and-r/

For CouchDB

please see https://github.com/wactbprot/R4CouchDB and

http://digitheadslabnotebook.blogspot.com/2010/10/couchdb-and-r.html

  • First install RCurl and RJSONIO. You’ll have to download the tar.gz’s if you’re on a Mac. For the second part, we’ll need to installR4CouchDB,

2) External Report Creating Software-

Jaspersoft- It has good integration with R and is a certified Revolution Analytics partner (who seem to be the only ones with a coherent #Rstats go to market strategy- which begs the question – why is the freest and finest stats software having only ONE vendor- if it was so great lots of companies would make exclusive products for it – (and some do -see https://rforanalytics.wordpress.com/r-business-solutions/ and https://rforanalytics.wordpress.com/using-r-from-other-software/)

From

http://www.jaspersoft.com/sites/default/files/downloads/events/Analytics%20-Jaspersoft-SEP2010.pdf

we see

http://jasperforge.org/projects/rrevodeployrbyrevolutionanalytics

RevoConnectR for JasperReports Server

RevoConnectR for JasperReports Server RevoConnectR for JasperReports Server is a Java library interface between JasperReports Server and Revolution R Enterprise’s RevoDeployR, a standardized collection of web services that integrates security, APIs, scripts and libraries for R into a single server. JasperReports Server dashboards can retrieve R charts and result sets from RevoDeployR.

http://jasperforge.org/plugins/esp_frs/optional_download.php?group_id=409

 

Using R and Pentaho
Extending Pentaho with R analytics”R” is a popular open source statistical and analytical language that academics and commercial organizations alike have used for years to get maximum insight out of information using advanced analytic techniques. In this twelve-minute video, David Reinke from Pentaho Certified Partner OpenBI provides an overview of R, as well as a demonstration of integration between R and Pentaho.
and from
R and BI – Integrating R with Open Source Business
Intelligence Platforms Pentaho and Jaspersoft
David Reinke, Steve Miller
Keywords: business intelligence
Increasingly, R is becoming the tool of choice for statistical analysis, optimization, machine learning and
visualization in the business world. This trend will only escalate as more R analysts transition to business
from academia. But whereas in academia R is often the central tool for analytics, in business R must coexist
with and enhance mainstream business intelligence (BI) technologies. A modern BI portfolio already includes
relational databeses, data integration (extract, transform, load – ETL), query and reporting, online analytical
processing (OLAP), dashboards, and advanced visualization. The opportunity to extend traditional BI with
R analytics revolves on the introduction of advanced statistical modeling and visualizations native to R. The
challenge is to seamlessly integrate R capabilities within the existing BI space. This presentation will explain
and demo an initial approach to integrating R with two comprehensive open source BI (OSBI) platforms –
Pentaho and Jaspersoft. Our efforts will be successful if we stimulate additional progress, transparency and
innovation by combining the R and BI worlds.
The demonstration will show how we integrated the OSBI platforms with R through use of RServe and
its Java API. The BI platforms provide an end user web application which include application security,
data provisioning and BI functionality. Our integration will demonstrate a process by which BI components
can be created that prompt the user for parameters, acquire data from a relational database and pass into
RServer, invoke R commands for processing, and display the resulting R generated statistics and/or graphs
within the BI platform. Discussion will include concepts related to creating a reusable java class library of
commonly used processes to speed additional development.

If you know Java- try http://ramanareddyg.blog.com/2010/07/03/integrating-r-and-pentaho-data-integration/

 

and I like this list by two venerable powerhouses of the BI Open Source Movement

http://www.openbi.com/demosarticles.html

Open Source BI as disruptive technology

http://www.openbi.biz/articles/osbi_disruption_openbi.pdf

Open Source Punditry

TITLE AUTHOR COMMENTS
Commercial Open Source BI Redux Dave Reinke & Steve Miller An review and update on the predictions made in our 2007 article focused on the current state of the commercial open source BI market. Also included is a brief analysis of potential options for commercial open source business models and our take on their applicability.
Open Source BI as Disruptive Technology Dave Reinke & Steve Miller Reprint of May 2007 DM Review article explaining how and why Commercial Open Source BI (COSBI) will disrupt the traditional proprietary market.

Spotlight on R

TITLE AUTHOR COMMENTS
R You Ready for Open Source Statistics? Steve Miller R has become the “lingua franca” for academic statistical analysis and modeling, and is now rapidly gaining exposure in the commercial world. Steve examines the R technology and community and its relevancy to mainstream BI.
R and BI (Part 1): Data Analysis with R Steve Miller An introduction to R and its myriad statistical graphing techniques.
R and BI (Part 2): A Statistical Look at Detail Data Steve Miller The usage of R’s graphical building blocks – dotplots, stripplots and xyplots – to create dashboards which require little ink yet tell a big story.
R and BI (Part 3): The Grooming of Box and Whiskers Steve Miller Boxplots and variants (e.g. Violin Plot) are explored as an essential graphical technique to summarize data distributions by categories and dimensions of other attributes.
R and BI (Part 4): Embellishing Graphs Steve Miller Lattices and logarithmic data transformations are used to illuminate data density and distribution and find patterns otherwise missed using classic charting techniques.
R and BI (Part 5): Predictive Modelling Steve Miller An introduction to basic predictive modelling terminology and techniques with graphical examples created using R.
R and BI (Part 6) :
Re-expressing Data
Steve Miller How do you deal with highly skewed data distributions? Standard charting techniques on this “deviant” data often fail to illuminate relationships. This article explains techniques to re-express skewed data so that it is more understandable.
The Stock Market, 2007 Steve Miller R-based dashboards are presented to demonstrate the return performance of various asset classes during 2007.
Bootstrapping for Portfolio Returns: The Practice of Statistical Analysis Steve Miller Steve uses the R open source stats package and Monte Carlo simulations to examine alternative investment portfolio returns…a good example of applied statistics using R.
Statistical Graphs for Portfolio Returns Steve Miller Steve uses the R open source stats package to analyze market returns by asset class with some very provocative embedded trellis charts.
Frank Harrell, Iowa State and useR!2007 Steve Miller In August, Steve attended the 2007 Internation R User conference (useR!2007). This article details his experiences, including his meeting with long-time R community expert, Frank Harrell.
An Open Source Statistical “Dashboard” for Investment Performance Steve Miller The newly launched Dashboard Insight web site is focused on the most useful of BI tools: dashboards. With this article discussing the use of R and trellis graphics, OpenBI brings the realm of open source to this forum.
Unsexy Graphics for Business Intelligence Steve Miller Utilizing Tufte’s philosophy of maximizing the data to ink ratio of graphics, Steve demonstrates the value in dot plot diagramming. The R open source statistical/analytics software is showcased.
I think that the report generation package Brew would also qualify as a BI package, but large scale implementation remains to be seen in
a commercial business environment
  • brew: Creating Repetitive Reports
 brew: Templating Framework for Report Generation

brew implements a templating framework for mixing text and R code for report generation. brew template syntax is similar to PHP, Ruby's erb module, Java Server Pages, and Python's psp module. http://bit.ly/jINmaI
  • Yarr- creating reports in R
to be continued ( when I have more time and the temperature goes down from 110F in Delhi, India)

R for Predictive Modeling:Workshop

A view of the Oakland-San Francisco Bay Bridge...
Image via Wikipedia

A workshop on using R for Predictive Modeling, by the Director, Non Clinical Stats, Pfizer. Interesting Bay Area Event- part of next edition of Predictive Analytics World

Sunday, March 13, 2011 in San Francisco

R for Predictive Modeling:
A Hands-On Introduction

Intended Audience: Practitioners who wish to learn how to execute on predictive analytics by way of the R language; anyone who wants “to turn ideas into software, quickly and faithfully.”

Knowledge Level: Either hands-on experience with predictive modeling (without R) or hands-on familiarity with any programming language (other than R) is sufficient background and preparation to participate in this workshop.


Workshop Description

This one-day session provides a hands-on introduction to R, the well-known open-source platform for data analysis. Real examples are employed in order to methodically expose attendees to best practices driving R and its rich set of predictive modeling packages, providing hands-on experience and know-how. R is compared to other data analysis platforms, and common pitfalls in using R are addressed.

The instructor, a leading R developer and the creator of CARET, a core R package that streamlines the process for creating predictive models, will guide attendees on hands-on execution with R, covering:

  • A working knowledge of the R system
  • The strengths and limitations of the R language
  • Preparing data with R, including splitting, resampling and variable creation
  • Developing predictive models with R, including decision trees, support vector machines and ensemble methods
  • Visualization: Exploratory Data Analysis (EDA), and tools that persuade
  • Evaluating predictive models, including viewing lift curves, variable importance and avoiding overfitting

Hardware: Bring Your Own Laptop
Each workshop participant is required to bring their own laptop running Windows or OS X. The software used during this training program, R, is free and readily available for download.

Attendees receive an electronic copy of the course materials and related R code at the conclusion of the workshop.


Schedule

  • Workshop starts at 9:00am
  • Morning Coffee Break at 10:30am – 11:00am
  • Lunch provided at 12:30 – 1:15pm
  • Afternoon Coffee Break at 2:30pm – 3:00pm
  • End of the Workshop: 4:30pm

Instructor

Max Kuhn, Director, Nonclinical Statistics, Pfizer

Max Kuhn is a Director of Nonclinical Statistics at Pfizer Global R&D in Connecticut. He has been apply models in the pharmaceutical industries for over 15 years.

He is a leading R developer and the author of several R packages including the CARET package that provides a simple and consistent interface to over 100 predictive models available in R.

Mr. Kuhn has taught courses on modeling within Pfizer and externally, including a class for the India Ministry of Information Technology.

 

http://www.predictiveanalyticsworld.com/sanfrancisco/2011/r_for_predictive_modeling.php

 

Troubleshooting Rattle Installation- Data Mining R GUI

Screenshot of Synaptic Package Manager running...
Image via Wikipedia

I really find the Rattle GUI very very nice and easy to do any data mining task. The software is available from http://rattle.togaware.com/

The only issue is Rattle can be quite difficult to install due to dependencies on GTK+

After fiddling for a couple of years- this is what I did

1) Created dual boot OS- Basically downloaded the netbook remix from http://ubuntu.com I created a dual boot OS so you can choose at the beginning whether to use Windows or Ubuntu Linux in that session.  Alternatively you can download VM Player www.vmware.com/products/player/ if you want to do both

2) Download R packages using Ubuntu packages and Install GTK+ dependencies before that.

GTK + Requires

  1. Libglade
  2. Glib
  3. Cairo
  4. Pango
  5. ATK

If  you are a Linux newbie like me who doesnt get the sudo apt get, tar, cd, make , install rigmarole – scoot over to synaptic software packages or just the main ubuntu software centre and download these packages one by one.

For R Dependencies, you need

  • PMML
  • XML
  • RGTK2

Again use r-cran as the prefix to these package names and simply install (almost the same way Windows does it easily -double click)

see http://packages.ubuntu.com/search?suite=lucid&searchon=names&keywords=r-cran

4) Install Rattle from source

http://rattle.togaware.com/rattle-download.html

Advanced users can download the Rattle source packages directly:

Save theses to your hard disk (e.g., to your Desktop) but don’t extract them. Then, on GNU/Linux run the install command shown below. This command is entered into a terminal window:

  • R CMD INSTALL rattle_2.6.0.tar.gz

After installation-

5) Type library(rattle) and rattle.info to get messages on what R packages to update for a proper functioning

</code>

> library(rattle)
Rattle: Graphical interface for data mining using R.
Version 2.6.0 Copyright (c) 2006-2010 Togaware Pty Ltd.
Type 'rattle()' to shake, rattle, and roll your data.
> rattle.info()
Rattle: version 2.6.0
R: version 2.11.1 (2010-05-31) (Revision 52157)

Sysname: Linux
Release: 2.6.35-23-generic
Version: #41-Ubuntu SMP Wed Nov 24 10:18:49 UTC 2010
Nodename: k1-M725R
Machine: i686
Login: k1ng
User: k1ng

Installed Dependencies
RGtk2: version 2.20.3
pmml: version 1.2.26
colorspace: version 1.0-1
cairoDevice: version 2.14
doBy: version 4.1.2
e1071: version 1.5-24
ellipse: version 0.3-5
foreign: version 0.8-41
gdata: version 2.8.1
gtools: version 2.6.2
gplots: version 2.8.0
gWidgetsRGtk2: version 0.0-69
Hmisc: version 3.8-3
kernlab: version 0.9-12
latticist: version 0.9-43
Matrix: version 0.999375-46
mice: version 2.4
network: version 1.5-1
nnet: version 7.3-1
party: version 0.9-99991
playwith: version 0.9-53
randomForest: version 4.5-36 upgrade available 4.6-2
rggobi: version 2.1.16
survival: version 2.36-2
XML: version 3.2-0
bitops: version 1.0-4.1

Upgrade the packages with:

 > install.packages(c("randomForest"))

<code>

Now upgrade whatever package rattle.info tells to upgrade.

This is much simpler and less frustrating than some of the other ways to install Rattle.

If all goes well, you will see this familiar screen popup when you type

>rattle()

 

Choosing R for business – What to consider?

A composite of the GNU logo and the OSI logo, ...
Image via Wikipedia

Additional features in R over other analytical packages-

1) Source Code is given to ensure complete custom solution and embedding for a particular application. Open source code has an advantage that is extensively peer- reviewed in Journals and Scientific Literature.  This means bugs will found, shared and corrected transparently.

2) Wide literature of training material in the form of books is available for the R analytical platform.

3) Extensively the best data visualization tools in analytical software (apart from Tableau Software ‘s latest version). The extensive data visualization available in R is of the form a variety of customizable graphs, as well as animation. The principal reason third-party software initially started creating interfaces to R is because the graphical library of packages in R is more advanced as well as rapidly getting more features by the day.

4) Free in upfront license cost for academics and thus budget friendly for small and large analytical teams.

5) Flexible programming for your data environment. This includes having packages that ensure compatibility with Java, Python and C++.

 

6) Easy migration from other analytical platforms to R Platform. It is relatively easy for a non R platform user to migrate to R platform and there is no danger of vendor lock-in due to the GPL nature of source code and open community.

Statistics are numbers that tell (descriptive), advise ( prescriptive) or forecast (predictive). Analytics is a decision-making help tool. Analytics on which no decision is to be made or is being considered can be classified as purely statistical and non analytical. Thus ease of making a correct decision separates a good analytical platform from a not so good analytical platform. The distinction is likely to be disputed by people of either background- and business analysis requires more emphasis on how practical or actionable the results are and less emphasis on the statistical metrics in a particular data analysis task. I believe one clear reason between business analytics is different from statistical analysis is the cost of perfect information (data costs in real world) and the opportunity cost of delayed and distorted decision-making.

Specific to the following domains R has the following costs and benefits

  • Business Analytics
    • R is free per license and for download
    • It is one of the few analytical platforms that work on Mac OS
    • It’s results are credibly established in both journals like Journal of Statistical Software and in the work at LinkedIn, Google and Facebook’s analytical teams.
    • It has open source code for customization as per GPL
    • It also has a flexible option for commercial vendors like Revolution Analytics (who support 64 bit windows) as well as bigger datasets
    • It has interfaces from almost all other analytical software including SAS,SPSS, JMP, Oracle Data Mining, Rapid Miner. Existing license holders can thus invoke and use R from within these software
    • Huge library of packages for regression, time series, finance and modeling
    • High quality data visualization packages
    • Data Mining
      • R as a computing platform is better suited to the needs of data mining as it has a vast array of packages covering standard regression, decision trees, association rules, cluster analysis, machine learning, neural networks as well as exotic specialized algorithms like those based on chaos models.
      • Flexibility in tweaking a standard algorithm by seeing the source code
      • The RATTLE GUI remains the standard GUI for Data Miners using R. It was created and developed in Australia.
      • Business Dashboards and Reporting
      • Business Dashboards and Reporting are an essential piece of Business Intelligence and Decision making systems in organizations. R offers data visualization through GGPLOT, and GUI like Deducer and Red-R can help even non R users create a metrics dashboard
        • For online Dashboards- R has packages like RWeb, RServe and R Apache- which in combination with data visualization packages offer powerful dashboard capabilities.
        • R can be combined with MS Excel using the R Excel package – to enable R capabilities to be imported within Excel. Thus a MS Excel user with no knowledge of R can use the GUI within the R Excel plug-in to use powerful graphical and statistical capabilities.

Additional factors to consider in your R installation-

There are some more choices awaiting you now-
1) Licensing Choices-Academic Version or Free Version or Enterprise Version of R

2) Operating System Choices-Which Operating System to choose from? Unix, Windows or Mac OS.

3) Operating system sub choice- 32- bit or 64 bit.

4) Hardware choices-Cost -benefit trade-offs for additional hardware for R. Choices between local ,cluster and cloud computing.

5) Interface choices-Command Line versus GUI? Which GUI to choose as the default start-up option?

6) Software component choice- Which packages to install? There are almost 3000 packages, some of them are complimentary, some are dependent on each other, and almost all are free.

7) Additional Software choices- Which additional software do you need to achieve maximum accuracy, robustness and speed of computing- and how to use existing legacy software and hardware for best complementary results with R.

1) Licensing Choices-
You can choose between two kinds of R installations – one is free and open source from http://r-project.org The other R installation is commercial and is offered by many vendors including Revolution Analytics. However there are other commercial vendors too.

Commercial Vendors of R Language Products-
1) Revolution Analytics http://www.revolutionanalytics.com/
2) XL Solutions- http://www.experience-rplus.com/
3) Information Builder – Webfocus RStat -Rattle GUI http://www.informationbuilders.com/products/webfocus/PredictiveModeling.html
4) Blue Reference- Inference for R http://inferenceforr.com/default.aspx

  1. Choosing Operating System
      1. Windows

 

Windows remains the most widely used operating system on this planet. If you are experienced in Windows based computing and are active on analytical projects- it would not make sense for you to move to other operating systems. This is also based on the fact that compatibility problems are minimum for Microsoft Windows and the help is extensively documented. However there may be some R packages that would not function well under Windows- if that happens a multiple operating system is your next option.

        1. Enterprise R from Revolution Analytics- Enterprise R from Revolution Analytics has a complete R Development environment for Windows including the use of code snippets to make programming faster. Revolution is also expected to make a GUI available by 2011. Revolution Analytics claims several enhancements for it’s version of R including the use of optimized libraries for faster performance.
      1. MacOS

 

Reasons for choosing MacOS remains its considerable appeal in aesthetically designed software- but MacOS is not a standard Operating system for enterprise systems as well as statistical computing. However open source R claims to be quite optimized and it can be used for existing Mac users. However there seem to be no commercially available versions of R available as of now for this operating system.

      1. Linux

 

        1. Ubuntu
        2. Red Hat Enterprise Linux
        3. Other versions of Linux

 

Linux is considered a preferred operating system by R users due to it having the same open source credentials-much better fit for all R packages and it’s customizability for big data analytics.

Ubuntu Linux is recommended for people making the transition to Linux for the first time. Ubuntu Linux had an marketing agreement with revolution Analytics for an earlier version of Ubuntu- and many R packages can  installed in a straightforward way as Ubuntu/Debian packages are available. Red Hat Enterprise Linux is officially supported by Revolution Analytics for it’s enterprise module. Other versions of Linux popular are Open SUSE.

      1. Multiple operating systems-
        1. Virtualization vs Dual Boot-

 

You can also choose between having a VMware VM Player for a virtual partition on your computers that is dedicated to R based computing or having operating system choice at the startup or booting of your computer. A software program called wubi helps with the dual installation of Linux and Windows.

  1. 64 bit vs 32 bit – Given a choice between 32 bit versus 64 bit versions of the same operating system like Linux Ubuntu, the 64 bit version would speed up processing by an approximate factor of 2. However you need to check whether your current hardware can support 64 bit operating systems and if so- you may want to ask your Information Technology manager to upgrade atleast some operating systems in your analytics work environment to 64 bit operating systems.

 

  1. Hardware choices- At the time of writing this book, the dominant computing paradigm is workstation computing followed by server-client computing. However with the introduction of cloud computing, netbooks, tablet PCs, hardware choices are much more flexible in 2011 than just a couple of years back.

Hardware costs are a significant cost to an analytics environment and are also  remarkably depreciated over a short period of time. You may thus examine your legacy hardware, and your future analytical computing needs- and accordingly decide between the various hardware options available for R.
Unlike other analytical software which can charge by number of processors, or server pricing being higher than workstation pricing and grid computing pricing extremely high if available- R is well suited for all kinds of hardware environment with flexible costs. Given the fact that R is memory intensive (it limits the size of data analyzed to the RAM size of the machine unless special formats and /or chunking is used)- it depends on size of datasets used and number of concurrent users analyzing the dataset. Thus the defining issue is not R but size of the data being analyzed.

    1. Local Computing- This is meant to denote when the software is installed locally. For big data the data to be analyzed would be stored in the form of databases.
      1. Server version- Revolution Analytics has differential pricing for server -client versions but for the open source version it is free and the same for Server or Workstation versions.
      2. Workstation
    2. Cloud Computing- Cloud computing is defined as the delivery of data, processing, systems via remote computers. It is similar to server-client computing but the remote server (also called cloud) has flexible computing in terms of number of processors, memory, and data storage. Cloud computing in the form of public cloud enables people to do analytical tasks on massive datasets without investing in permanent hardware or software as most public clouds are priced on pay per usage. The biggest cloud computing provider is Amazon and many other vendors provide services on top of it. Google is also coming for data storage in the form of clouds (Google Storage), as well as using machine learning in the form of API (Google Prediction API)
      1. Amazon
      2. Google
      3. Cluster-Grid Computing/Parallel processing- In order to build a cluster, you would need the RMpi and the SNOW packages, among other packages that help with parallel processing.
    3. How much resources
      1. RAM-Hard Disk-Processors- for workstation computing
      2. Instances or API calls for cloud computing
  1. Interface Choices
    1. Command Line
    2. GUI
    3. Web Interfaces
  2. Software Component Choices
    1. R dependencies
    2. Packages to install
    3. Recommended Packages
  3. Additional software choices
    1. Additional legacy software
    2. Optimizing your R based computing
    3. Code Editors
      1. Code Analyzers
      2. Libraries to speed up R

citation-  R Development Core Team (2010). R: A language and environment for statistical computing. R Foundation for Statistical Computing,Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org.

(Note- this is a draft in progress)

%d bloggers like this: