R is an epic fail or is it just overhyped?

I came across this nice post from someone who is both knowledgeable and experienced in data. I fully agree that data visualization, user interfaces and unstructured data mining are the trends of the future.

What caught my attention were the words from http://www.thejuliagroup.com/blog/?p=433

However, for me personally and for most users, both individual and organizational, the much greater cost of software is the time it takes to install it, maintain it, learn it and document it. On that, R is an epic fail. It does NOT fit with the way the vast majority of people in the world use computers. The vast majority of people are NOT programmers. They are used to looking at things and clicking on things.

Let me analyze this scientifically and dispassionately

R Documentation

I believe that the SAS Online Doc and the SPSS documentation are both good examples of structured documentation. I do believe that, despite the many corporate R products floating around, R documentation is very extensive and perhaps too big to be put in one neat document. Something like “The Little R Book” or “R Online Doc” would really help.

Entering ? or ?? to search for documentation apparently seems too difficult and complex for corporate users. However, the point that R documentation is not really of enterprise-software quality is a valid one.

Maintaining R

It takes a single line of code or even a single click to update and maintain R.

Apparently the author of the aforementioned post believes that existing corporate users are too STUPID OR LAZY to do this.

I like to think most corporate users of statistical software are actually way smarter (one hint: they earn money doing that stuff).

Installing R

Anyone who cites installation costs as a reason for higher software costs and then mentions R is either biased against R or has not worked with R. Or both.

Learn R

I think no one can learn all R packages, just as you cannot learn all the modules of SAS (like ETS, STAT, etc.).

R does take more time to learn than Base SAS, and this is a valid enough point.

However, two R GUIs, Rattle and R Commander, can cut down the time this learning takes.

And increasingly R is taught in universities, which is where the battle for future developers and users of platforms like SAS, SPSS, Stata or R will ultimately be decided. While the short-term monetization of other software dazzles people, R has too many passionate developers and users to allow it to fail.

However, R is not perfect. It does need a better corporate version than is currently offered, especially for people who are simple users rather than developers, and it would also do well to improve its marketability and visibility.

Regarding software costs, ironically it is easier to estimate how much SAS will cost you in terms of licenses and training time than it is to estimate the same for R. A comparative document between R and SAS in terms of license costs, estimated training costs and so on should settle this debate more rationally and more dispassionately than is currently the norm when comparing software.
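To make the idea of such a cost comparison concrete, here is a minimal sketch in Python. It simply adds a license fee to the cost of the time spent installing, maintaining, learning and documenting a package. The function name, the package labels, the hourly rate and every figure are hypothetical placeholders for illustration, not real numbers for SAS, R or any other product.

```python
def total_cost_of_ownership(license_fee, hours_by_activity, hourly_rate):
    """Return the license fee plus the cost of staff time across all activities."""
    time_cost = sum(hours_by_activity.values()) * hourly_rate
    return license_fee + time_cost


# Hypothetical inputs (assumptions for illustration only, not real prices).
HOURLY_RATE = 50.0
packages = {
    "Package A (licensed)": {
        "license_fee": 5000.0,
        "hours": {"install": 2, "maintain": 5, "learn": 40, "document": 10},
    },
    "Package B (free)": {
        "license_fee": 0.0,
        "hours": {"install": 1, "maintain": 5, "learn": 80, "document": 20},
    },
}

for name, spec in packages.items():
    cost = total_cost_of_ownership(spec["license_fee"], spec["hours"], HOURLY_RATE)
    print(f"{name}: {cost:,.2f}")
```

With real license quotes and honestly estimated hours plugged in, the same few lines would give the kind of side-by-side figure the debate currently lacks.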

R for Stats : Updated

Here is the new website for statistical analysis using the free analytical software called R (which is enabled for cloud computing as well: see http://bit.ly/OhriCloud or http://rgrossman.com/2009/05/17/running-r-on-amazons-ec2/ for the R tutorial on running it on Amazon’s EC2 pay per demand RAM).

It is called R 4 stats or simply http://www.r4stats.com/

Hosted on Google’s updated Google Sites platform, it offers a preview of the update to Bob’s earlier runaway hit, R for SAS and SPSS Users, as well as his upcoming work, R for Stata Users.

In Bob’s own words:

I have substantially expanded the table that compares SAS and SPSS
add-on modules to somewhat equivalent R packages. This new version is
at:
http://r4stats.com/add-on-modules
and I would very much appreciate any feedback you might have on it.

The site http://r4stats.com is the replacement to
http://RforSASandSPSSusers.com and includes the support files for both
“R for SAS and SPSS Users” and the new “R for Stata Users”, due out in
March from Springer.

Topic | SAS Product | SPSS Product | R Package
Advanced Models | SAS/STAT | IBM SPSS Advanced Statistics | R, MASS, many others
Association Analysis | Enterprise Miner | IBM SPSS Association | arules, arulesNBMiner, arulesSequences
Basics | Base SAS | IBM SPSS Statistics Base | R
Bootstrapping | SAS/STAT | IBM SPSS Bootstrapping | BootCL, BootPR, boot, bootRes, BootStepAIC, bootspecdens, bootstrap, FRB, gPdtest, meboot, multtest, pvclust, rqmcmb2, scaleboot, simpleboot
Classification Analysis | Enterprise Miner | IBM SPSS Classification | rattle, see the neural networks and trees entries in this table.
Conjoint Analysis | SAS/STAT: PROC TRANSREG | IBM SPSS Conjoint | homals, psychoR, bayesm
Correspondence Analysis | SAS/STAT: PROC CORRESP | IBM SPSS Categories | ade4, cocorresp, FactoMineR, homals, made4, MASS, psychoR, PTAk, vegan
Custom Tables | Base SAS, PROC REPORT, PROC SQL, PROC TABULATE, Enterprise Reporter | IBM SPSS Custom Tables | reshape
Data Access | SAS/ACCESS | SPSS Data Access Pack | DBI, foreign, Hmisc: sas.get, sasxport.get, RODBC
Data Collection | SAS/FSP | IBM SPSS Data Collection Family | RSQLite, and the other open source programs MySQL or PostgreSQL are popular among R users for this purpose.
Data Mining | Enterprise Miner | IBM SPSS Modeler (formerly Clementine) | arules, FactoMineR, rattle, various functions
Data Mining, In-database Processing | SAS In-Database Initiative with Teradata | IBM SPSS Modeler | PL/R
Data Preparation | Various procedures | IBM SPSS Data Preparation, various commands | dprep, plyr, reshape, sqldf, various functions
Developer Tools | SAS/AF, SAS/FSP, SAS Integration Technologies, SAS/TOOLKIT | IBM SPSS Statistics Developer, IBM SPSS Statistics Programmability Extension | StatET, R links to most popular compilers, scripting languages, and databases.
Direct Marketing | Nothing quite like it | IBM SPSS Direct Marketing | Nothing quite like it
Exact Tests | SAS/STAT various | IBM SPSS Exact Tests | coin, elrm, exactLoglinTest, exactmaxsel, and options in many others
Excel Integration | SAS Enterprise BI Server | IBM SPSS Advantage for Excel 2007 | RExcel
Forecasting | SAS/ETS | IBM SPSS Forecasting | Over 40 packages that do time series are described at the Task View link above under Time Series.
Forecasting, Automated | Forecast Server | IBM SPSS Forecasting | forecast
Genetics | JMP Genomics | None | http://www.bioconductor.org
Geographic Information Systems | SAS/GIS, SAS/GRAPH | None (Maps is defunct) | maps, mapdata, mapproj, GRASS via spgrass6, RColorBrewer, see Spatial in Task Views at link at top
Graphical user interfaces | Enterprise Guide, IML Studio, SAS/ASSIST, Analyst, Insight | IBM SPSS Statistics Base | Deducer, JGR, R Commander, pmg, rattle, many others at http://www.sciviews.org/_rgui/
Graphics, Interactive | SAS/IML Studio, SAS/INSIGHT, JMP | None | GGobi via rggobi, iPlots, latticist, playwith
Graphics, Static | SAS/GRAPH | SPSS Base, Graphics Production Language | ggplot2, gplots, graphics, grid, gridBase, hexbin, lattice, plotrix, scatterplot3d, vcd, vioplot, geneplotter, Rgraphics
Graphics, Template Builder | Doesn’t use Grammar of Graphics model that forms the core of IBM SPSS Viz Designer or R’s ggplot2 | IBM SPSS Viz Designer | Doesn’t use templates, but this GUI for ggplot2 http://www.stat.ucla.edu/~jeroen/ggplot2.html works similarly to IBM SPSS Viz Designer.
Guided Analytics | SAS/LAB | None | None
Matrix/linear Algebra | SAS/IML Studio | IBM SPSS Matrix | R, matlab, Matrix, sparseM
Missing Values Imputation | SAS/STAT: PROC MI | IBM SPSS Missing Values | amelia, Hmisc: aregImpute, EMV, rms (replaces Design): fit.mult.impute, mice, mitools, mvnmle, VIM
Neural Networks | Enterprise Miner | IBM SPSS Neural Networks | AMORE, grnnR, neuralnet, nnet, rattle
Operations Research | SAS/OR | None | glpk, linprog, LowRankQP, TSP
Power Analysis | SAS Power and Sample Size Application, SAS/STAT: PROC POWER, PROC GLMPOWER | SamplePower | asypow, powerpkg, pwr, MBESS
Quality Control | SAS/QC | IBM SPSS Statistics Base | qcc, spc
Regression Models | SAS/STAT | IBM SPSS Regression | R, Hmisc, lasso, VGAM, pda, rms (replaces Design)
Sampling, Complex | SAS/STAT: PROC SURVEY SELECT, SURVEYMEANS, etc. | IBM SPSS Complex Samples | pps, sampfling, sampling, spsurvey, survey
Segmentation Analysis | Enterprise Miner | IBM Modeler Segmentation | cluster, rattle, som, see CRAN Task Views under Cluster for over 70 packages
Server Version | SAS for your particular server | IBM SPSS Statistics Server, IBM SPSS Modeler Server | rapache, R(D)COM Server, Rserve, StatET
Structural Equation Modeling | SAS/STAT: PROC CALIS | Amos | OpenMX, sem
Text Analysis/Mining | Text Miner | IBM SPSS Text Analytics, IBM SPSS Text Analysis for Surveys | Rstem, las, tm
Trees, Decision, Classification or Regression | Enterprise Miner | IBM SPSS Decision Trees, IBM SPSS AnswerTree, IBM SPSS Modeler (formerly Clementine) | ada, adabag, BayesTree, boost, GAMboost, gbev, gbm, maptree, mboost, mvpart, party, pinktoe, quantregForest, rpart, rpart.permutation, randomForest, rattle, tree

All SAS and SPSS product names are registered trademarks of their respective companies.

Disclaimer: Bob Muenchen and I work for the same university. While we often have interesting conflicts, his interview was one of the earliest on this blog.

See- http://sites.google.com/site/r4statistics/interview

What software do you plan to use/learn in the next one year?

The results for the question:

Which software do you plan to use/learn in the next one year?

SPSS Directions : Rexer Survey Results

Here are some results shared by Dr Karl Rexer of Rexer Analytics; they were presented at SPSS Directions.

When asked to select all of the software packages they use for data mining, each person selected an average of 5 tools. More data miners reported using SPSS Statistics than any other tool. And when we asked people to indicate their primary data mining tool, the tool selected by the most data miners was SPSS Modeler (Clementine). The SPSS people were also thrilled to see that Clementine was #1 in customer satisfaction: everyone (N=78) who identified it as their primary tool was satisfied or very satisfied. It’s pretty amazing that not even one person was neutral (it was a 5-point scale).

For a detailed poster on the results, contact www.RexerAnalytics.com. More than 710 data mining professionals had completed the survey.

Christmas Carol: The Best Software (BI-Stats-Analytics)

  1. There is no best software. Each package is just optimized for various constraints and for the tangible as well as intangible needs of its users. (Image citation: support.sas.com)
  2. Price in products is defined as Demand divided by Supply. Sometimes this is Expected Demand over Expected Supply (see oil prices). Everyone grumbles over prices, but we pay what we think is fair. (Citation: http://bm2.genes.nig.ac.jp/RGM2/index.php?ctv=Survival)
  3. Prices in services are defined by value creation as well: Value = Benefit divided by Cost. Benefits are tangible, as in how much money the software saves in fraud, as well as intangible, as in how easy it is to start using JMP versus R Commander. Costs are tangible: how much we have to pay using our cheque book for this annual license, perpetual license, one-time license, maintenance contract or application support. Intangible costs are how long I have to hold the phone while talking to customer support, and how much time it takes me to find the best solution using the website on my own without a salesperson bothering me with frequent calls. (Citation: http://academic.udayton.edu/gregelvers/psy216/spss/graphs.htm#tukey)
  4. All salespeople (especially in the software industry) spam you with frequent calls, email reminders and claims that their company is the best company ever with the best software in the history of mankind. That is their job, and they are pushed by sales quotas and pulled by their own enthusiasm to sell more to the same customer. If you ever bought three licences and found out at the end of the year that you just needed two, forgive the salesman. As Arthur Miller said, “All salesmen are dreamers.” (Citation for Stata graph: http://www.ats.ucla.edu/stat/Stata/library/GraphExamples/code/grbartall.htm)
  5. Technology moves faster than you can say Jackie Robinson, and it is getting faster. Research and Development (R and D) will always move slower than the speed at which Marketing thinks it can move. See http://www.dilbert.com for more insights on this. You either build a billion-dollar in-house lab (like Palo Alto, remember?) or you go for total outsourcing (like semiconductors and open source do). Or you go for a mix and match. (Citation: http://people.sc.fsu.edu/~burkardt/html/matlab_graphics/matlab_graphics.html)
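
The Value = Benefit divided by Cost idea from point 3 can be sketched in a few lines of Python. This is only an illustration: the function name `value_ratio` is made up for this example, intangible items are assumed to have been given a rough money estimate, and all the figures are hypothetical placeholders rather than real prices or savings.

```python
def value_ratio(tangible_benefit, intangible_benefit, tangible_cost, intangible_cost):
    """Value = total benefit / total cost, with every input in the same currency."""
    total_benefit = tangible_benefit + intangible_benefit
    total_cost = tangible_cost + intangible_cost
    if total_cost <= 0:
        raise ValueError("total cost must be positive")
    return total_benefit / total_cost


# Hypothetical figures (assumptions for illustration only).
example_value = value_ratio(
    tangible_benefit=100_000,   # e.g. money saved in fraud
    intangible_benefit=20_000,  # e.g. ease of getting started, priced roughly
    tangible_cost=30_000,       # e.g. annual license and support contract
    intangible_cost=10_000,     # e.g. time spent on hold with customer support
)
print(f"Value ratio: {example_value:.1f}")
```

A ratio above 1 means the estimated benefits outweigh the estimated costs; the hard part in practice is putting an honest number on the intangible items.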

Based on the above parameters, the best statistical software for 2009 continues to be the software that uses a mixture of Genetic Algorithms, Time Series Based Regression and Sampling: it is the software that runs in the head of the statistical / mathematical / customer BRAIN.

That’s the best software ever.

(Citation – Hugh of http://gapingvoid.com/ )

Happy Hols

News on R Commercial Development -Rattle- R Data Mining Tool

R RANT: While the European R Core leadership, led by the Great Dane Peter Dalgaard, focuses on the small picture and virtually hands the whole commercial side to Prof Nie and David Smith at Revo Computing, other smaller package developers have refused to be treated as cheap R and D developers for enterprise software. How are the book sales coming along, Prof Peter? Any plans to write another R book, or are you done with writing your version of Mathematica (Ref: Newton)? Running the R Core project team must be so hard that I recommend the Tarantino movie “Inglorious B…” for Herr Doktors. END

I believe that individual R package creators like Prof Harrell (Hmisc) or Hadley Wickham (plyr) deserve a share of the royalties or REVENUE that Revolution Computing, or ANY software company that uses R, earns.

On this note, here is some updated news on Rattle, the data mining tool created by Dr Graham Williams. Once again R development is taken ahead by Down Under chaps while the Big Guys thrash out the road map across the Pond.

Data Mining Resources

Citation –http://datamining.togaware.com/

Rattle is a free and open source data mining toolkit written in the statistical language R using the Gnome graphical interface. It runs under GNU/Linux, Macintosh OS X, and MS/Windows. Rattle is being used in business, government, research and for teaching data mining in Australia and internationally. Rattle can be purchased on DVD (or made available as a downloadable CD image) as a standalone installation for $450USD ($560AUD).

The free and open source book, The Data Mining Desktop Survival Guide (ISBN 0-9757109-2-3) simply explains the otherwise complex algorithms and concepts of data mining, with examples to illustrate each algorithm using the statistical language R. The book is being written by Dr Graham Williams, based on his 20 years research and consulting experience in machine learning and data mining. An electronic PDF version is available for a small fee from Togaware ($40AUD/$35USD to cover costs and ongoing development).

Other Resources

  • The Data Mining Software Repository makes available a collection of free (as in libre) open source software tools for data mining
  • The Data Mining Catalogue lists many of the free and commercial data mining tools that are available on the market.
  • The Australasian Data Mining Conferences are supported by Togaware, which also hosts the web site.
  • Information about the Pacific Asia Knowledge Discovery and Data Mining series of conferences is also available.
  • Data Mining course is taught at the Australian National University.
  • See also the Canberra Analytics Practise Group.
  • A Data Mining Course was held at the Harbin Institute of Technology Shenzhen Graduate School, China, 6 December – 13 December 2006. This course introduced the basic concepts and algorithms of data mining from an applications point of view and introduced the use of R and Rattle for data mining in practise.
  • Data Mining Workshop was held over two days at the University of Canberra, 27-28 November, 2006. This course introduced the basic concepts and algorithms for data mining and the use of R and Rattle.

Using R for Data Mining

The open source statistical programming language R (based on S) is in daily use in academia and in business and government. We use R for data mining within the Australian Taxation Office. Rattle is used by those wishing to interact with R through a GUI.

R is memory based so that on 32bit CPUs you are limited to smaller datasets (perhaps 50,000 up to 100,000, depending on what you are doing). Deploying R on 64bit multiple CPU (AMD64) servers running GNU/Linux with 32GB of main memory provides a powerful platform for data mining.

R is open source, thus providing assurance that there will always be the opportunity to fix and tune things that suit our specific needs, rather than rely on having to convince a vendor to fix or tune their product to suit our needs.

Also, by being open source, we can be sure that the code will always be available, unlike some of the data mining products that have disappeared (e.g., IBM’s Intelligent Miner).

See earlier interview-

https://decisionstats.wordpress.com/2009/01/13/interview-dr-graham-williams/

Analytics and BI for small biz

I saw a story on Warren Buffett and Goldman Sachs creating a $500 million pool for small business owners.

  • The program will contribute $200 million to community colleges, universities and other institutions to provide small- business owners with practical business education.

  • Goldman Sachs repaid the $10 billion it was given last year under the taxpayer-funded Troubled Asset Relief Program, plus dividends. The firm continues to benefit from federal guarantees on about $21 billion of long-term debt.

  • Buffett, known as the “Oracle of Omaha” for his investing prowess, is the second-richest American. Berkshire, which invests in companies ranging from retailers to insurers, paid $5 billion in September 2008 to acquire preferred stock in Goldman Sachs that pays a 10 percent dividend. Berkshire, based in Omaha, Nebraska, also gained five-year warrants to buy $5 billion of common stock at $115 per share.

  • (NOTE: The current price of GS shares is $172. That is roughly a 50% profit on $5 billion, or about $2.5 billion for Mr Buffett, but he is probably waiting for long-term capital gains tax rates to kick in before encashing his patriotic “Buy American. I am” warrants; see his NYT op-ed at http://www.nytimes.com/2008/10/17/opinion/17buffett.html )
  • A better analysis of the above Bloomberg story was given on Bloomberg itself at http://www.bloomberg.com/apps/news?pid=20601039&sid=asjp51YPDwJU
  • A small thought: could smaller businesses gain from the efficiencies of programs like SPSS, SAS and R? Or would they be better off with customized GUIs linked to their POS data?
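
The back-of-the-envelope warrant figure in the note above can be checked with a short Python sketch. It uses only the numbers quoted in this post ($5 billion of common stock at a $115 strike, a $172 share price) and deliberately ignores taxes, dilution and the time value of the warrants; the function name is a hypothetical helper for this example.

```python
def warrant_paper_gain(notional_usd, strike_price, market_price):
    """Return (total paper gain in USD, gain as a fraction of the strike price)."""
    shares = notional_usd / strike_price               # shares the warrants can buy
    gain_per_share = max(market_price - strike_price, 0.0)
    total_gain = shares * gain_per_share
    gain_fraction = gain_per_share / strike_price
    return total_gain, gain_fraction


# Figures quoted in the post: $5 billion of stock at $115, shares now at $172.
total_gain, gain_fraction = warrant_paper_gain(5e9, 115.0, 172.0)
print(f"Paper gain: ${total_gain / 1e9:.2f} billion ({gain_fraction:.1%})")
```

This comes to about $2.48 billion, or roughly 49.6%, which matches the rounded 50% and $2.5 billion mentioned above.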

Anyway, analytics for small businesses in inventory management and sales planning could help. Joe the Plumber could do with some ETS and regression models as well.

However, apart from Salesforce.com applications, this field seems to be totally vacant for analytics. What are IBM SPSS, SAS, or other stats packages doing for small businesses? Are they even developing Salesforce.com applications for their own equivalent software?

The market could be an interesting one to at least run a test in. Unless you don’t believe in test and control.

See the IBM Cognos app by IBM itself and the third-party app by Pervasive for SAP integration:

Citation-

http://sites.force.com/appexchange/listingDetail?listingId=a0N300000016YGYEA2

and

http://sites.force.com/appexchange/listingDetail?listingId=a0N300000016am1EAA