Trrrouble in land of R…and Open Source Suggestions

Recently, some comments by Ross Ihaka, co-founder of the R statistical software, on Revolution Analytics, the leading commercial vendor of R, came to my attention-

http://www.stat.auckland.ac.nz/mail/archive/r-downunder/2010-May/000529.html

[R-downunder] Article on Revolution Analytics

Ross Ihaka ihaka at stat.auckland.ac.nz
Mon May 10 14:27:42 NZST 2010


On 09/05/10 09:52, Murray Jorgensen wrote:
> Perhaps of interest:
>
> http://www.theregister.co.uk/2010/05/06/revolution_commercial_r/

Please note that R is "free software" not "open source".  These guys
are selling a GPLed work without disclosing the source to their part
of the work. I have complained to them and so far they have given me
the brush off. I am now considering my options.

Don't support these guys by buying their product. The are not feeding
back to the rights holders (the University of Auckland and I are rights
holders and they didn't even have the courtesy to contact us).

--
Ross Ihaka                         Email:  ihaka at stat.auckland.ac.nz
Department of Statistics           Phone:  (64-9) 373-7599 x 85054
University of Auckland             Fax:    (64-9) 373-7018
Private Bag 92019, Auckland
New Zealand

and from http://www.theregister.co.uk/2010/05/06/revolution_commercial_r/

Open source purists probably won't be all too happy to learn that Revolution is going to be employing an "open core" strategy, which means the core R programs will remain open source and be given tech support under a license model, but the key add-ons that make R more scalable will be closed source and sold under a separate license fee. Because most of those 2,500 add-ons for R were built by academics and Revolution wants to supplant SPSS and SAS as the tools used by students, Revolution will be giving the full single-user version of the R Enterprise stack away for free to academics. 

Conclusion-

1) So one co-founder of R is advising people not to buy from Revolution Analytics, which has the other co-founder of R, Robert Gentleman, on its board.
Source- http://www.revolutionanalytics.com/aboutus/leadership.php

2) If Revolution Analytics is using 2,500 packages for free but insisting on getting paid AND closing the source of its own packages (which raises a technical point: how exactly can you prevent the source code of an R package from being seen?), then maybe there can be a PACKAGE marketplace just like Android Apps, Facebook Apps, and Salesforce.com Apps, so that at least some of the thousands of R package developers can earn. Sorry, but email lists do not pay mortgages, and no one is disputing the NEED for commercializing R or rewarding developers.

Though Barr created SAS, he gave up control to Goodnight and Sall (https://decisionstats.wordpress.com/2010/06/02/sas-early-days/), and Goodnight and Sall do pay their developers well, to the envy of their not so well paid counterparts.

3) I really liked the innovation of Revolution Analytics' RevoScaleR, and I wish the default R dataset could be converted to the XDF dataset format so that it basically kills off the criticism of R being slow on bigger datasets. But I also realize the need for creating an analytics marketplace for R developers and R students, so the academic version of R being free and Revolution R being paid seems like a trade-off.

Note- You can still get a job faster as a stats student if you mention SAS and not R as a statistical skill- not all stats students go into academics.

4) There can be more elegant ways of handling this than calling for ignoring each other, as REVOLUTION and Ihaka seem to be doing.

I can almost hear people in Cary, NC chuckling at Norman Nie, their long-time opponent at SPSS and now REVOLUTION CEO, antagonizing R’s academicians within 1 year of taking over, so I hope this ends well for all. The road to hell is paved with good intentions. If REVOLUTION can share some source code with, say, R Core members (even Microsoft shares source code with partners), and R Core and Revolution agree on a licensing royalty with each other, they can actually speed up R package creation rather than allow this two-decade effort to end up the way S, S-PLUS and TIBCO did.

Maybe Richard Stallman can help, or maybe Ihaka has a better sense of where things will go in a couple of years. He must know something; he invented it, didn’t he?

Towards better analytical software

Here are some thoughts on using existing statistical software for better analytics and/or business intelligence (reporting)-

1) User Interface Design Matters- Most stats software takes a legacy approach to user interface design. Graphical user interfaces need to be more business friendly and user friendly. For example, you can call a button T Test, or you can call it Compare > Means of Samples (with a highlight called T Test). You can call a button Chi Square Test, or call it Compare > Counts Data. Also, excessive reliance on drop-down menus ignores the next-generation advances in operating systems, namely touchscreens instead of mouse point and click.

Given that the base statistical procedures are the same across software packages, a more thoughtfully designed (or revamped) user interface can give a package an edge over legacy designs.

2) Branding of Software Matters- One notable whine against SAS Institute products is their premium price. But that software is actually inexpensive if you look at other reporting software. What separates a Cognos from a Crystal Reports from a SAS BI is often branding (and user interface design). This is where branding events play a role, and social media is often the least expensive branding and marketing channel. The same goes for WPS and Revolution Analytics.

3) Alliances matter- The alliances of parent companies are reflected in the sales of bundled software. For a complete solution, you need a database plus reporting plus analytical software. If you are not making all three, you need to partner and cross sell. Technically this means that software (whether DB, reporting or analytics) needs to talk to as many different kinds of other software and formats as possible. This is why ODBC in R is important, and why alliances are just as important for small companies like Revolution Analytics, WPS and Netezza as for bigger companies like IBM SPSS, SAS Institute or SAP. Also, tie-ins such as Hadoop (like R and the Netezza appliance) or Teradata and SAS help create better usage.

4) Cloud Computing Interfaces could be the edge- Maybe cloud computing is all hot air. Even so, prudent business planning demands that any software maker in analytics or business intelligence have an extremely easy-to-load interface (whether a dedicated on-demand website or an Amazon EC2 image). Easier interfaces win, and with the cloud still in its early stages they can help create an early lead. For R software makers this is critical, since R handles larger data sets poorly on a PC in comparison to its counterparts. On the cloud that disadvantage vanishes. An easy-to-understand cloud interface framework is here (it is 2 years old but should still be okay): http://knol.google.com/k/data-mining-through-cloud-computing#

5) Platforms matter- Software should either natively embrace all possible platforms or bundle in middleware itself.

Here is a case study: SAS stopped supporting Apple OS after Base SAS 7. Today Apple OS is strong (3.47 million Macs during the most recent quarter), and the only way to use SAS on a Mac is to do either

http://goo.gl/QAs2

or do an install of Ubuntu on the Mac ( https://help.ubuntu.com/community/MacBook ) and do this

http://ubuntuforums.org/showthread.php?t=1494027

Why does this matter? Well, SAS is free to academics and students from this year, but the Mac is a preferred computer there. WPS, meanwhile, can be run straight away on the Mac (though they have curiously not been able to provide academic or discounted student copies 😉 ) as per

http://goo.gl/aVKu

Does this give a disadvantage based on platform? Yes. However, JMP continues to be supported on the Mac. This is also noteworthy given the upcoming Chromium OS by Google and the Windows Azure platform for cloud computing.

Open Source and Software Strategy

Curt Monash at Monash Research pointed out some ongoing open source GPL issues for WordPress and the Thesis issue (Also see http://ma.tt/2009/04/oracle-and-open-source/ and  http://www.mattcutts.com/blog/switching-things-around/).

As a user of both for upwards of 2 years, I believe open source and GPL license enforcement are general parts of the software strategy of most software companies nowadays. Some thoughts on open source and software strategy follow. Thesis, for one, remains a very, very popular theme and has earned upwards of $100,000 for its creator (an estimate based on 20k-plus installs and a $60 average price).
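
The Thesis estimate is simple arithmetic, sketched here in Python using the post's own round numbers. Both inputs are rough guesses taken from the text, not verified sales data, and the variable names are only illustrative.

```python
# Back-of-the-envelope revenue estimate for the Thesis theme.
# Both inputs are the post's rough guesses, not verified sales data.
installs = 20_000        # "20k plus installs"
avg_price_usd = 60       # "60$ avg price"

estimated_revenue_usd = installs * avg_price_usd
print(f"Estimated revenue: ${estimated_revenue_usd:,}")
```

Taken at face value, these inputs multiply out to about $1.2 million, so "upwards of $100,000" is a very conservative floor.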

  • Little guys like to give away code to get some satisfaction/ recognition, big guys give away free code only when its necessary or when they are not making money in that product segment anyway.
  • As Ethan Hunt said, “Every hero needs a villain”. Every software (market share) war needs one big company holding more market share and an open source strategy from the other player, who is not able to create the code in house and so effectively outsources it by creating an open source project. But the same open source proponent rarely gives away the secret to its own money-making project.
    • Examples- Google creates open source Android, but won’t reveal its secret algorithm for search, which drives its main profits,
    • Google again puts out a paper on MapReduce, but it’s Yahoo that champions Hadoop,
    • Apple creates open source projects (http://www.apple.com/opensource/) but won’t give away its operating system source code (why?), which helps people buy its more expensive hardware,
    • IBM, who helped kickstart the whole proprietary code thing (remember MS DOS), is the new champion of open source (http://www.ibm.com/developerworks/opensource/) and
    • Microsoft continues to spark open source debate, but read http://blogs.technet.com/b/microsoft_blog/archive/2010/07/02/a-perspective-on-openness.aspx and also http://www.microsoft.com/opensource/
    • SAS gives away a lot of open source code (read Jim Davis, CMO SAS, here), but will stick to Base SAS code (even though it seems to be making more money through its verticals focus and data mining).
    • SPSS was the first big analytics company to help support R (open source stats software), but will cling to its own code for its own software.
    • WordPress.org gives away its software (and I like Akismet just as well as blogging) as open source, but hey, anyone who is on WordPress.com knows how locked in you can get by its (pricey) platform.
    • Vendor Lock-in (wink wink price escalation) is the elephant in the room for Big Software Proprietary Companies.
    • SLA Quality, Maintenance and IP safety is the uh-oh for going in for open source software mostly.
  • Lack of IP protection for revenue models for open source code is the big bottleneck  for a lot of companies- as very few software users know what to do with source code if you give it to them anyways.
    • If companies were confident that they would still be earning same revenue and there would be less leakage or theft, they would gladly give away the source code.
    • Derivative softwares or extensions help popularize the original softwares.
      • Half-way steps like Facebook Applications (the original big company to create a platform for third-party creators),
      • iPhone Apps and Android Applications show the success of creating APIs to help protect IP and software control while still giving some freedom to developers.
      • Alternate user interfaces to R in both SAS/IML and JMP are a similar example.
  • Basically, open source is mostly done by the underdog while the top dog mostly rakes in the money (and envy).
  • There is yet to be a big commercial success in open source software, though there is very good open source software. Just as Google’s success helped establish advertising as an alternate (and now dominant) revenue source for online companies, open source needs a big example of a company that made billions while giving source code away and still retaining control and direction of software strategy.
  • Open source people love to hate proprietary packages, yet there are more shades of grey (than black and white) and hypocrisy (read lies) within  the open source software movement than the regulated world of big software. People will be still people. Software is just a piece of code.  😉

(Art citation- http://gapingvoid.com/about/ and http://gapingvoidgallery.com/)

Interview : R For Stata Users

Here is an interview with Bob Muenchen, author of “R for SAS and SPSS Users” and co-author with Joe Hilbe of “R for Stata Users”.

Describe your new book R for Stata Users and how it is helpful to users.

Stata is a marvelous software package. Its syntax is well designed, concise and easy to learn. However R offers Stata users advantages in two key areas: education and analysis.

Regarding education, R is quickly becoming the universal language of data analysis. Books, journal articles and conference talks often include R code because it’s a powerful language and everyone can run it. So R has become an essential part of the education of data analysts, statisticians and data miners.

Regarding analysis, R offers a vast array of methods that R users have written. Next to R, Stata probably has more useful user-written add-ons than any other analytic software. The Statistical Software Components collection at Boston College’s Department of Economics is quite impressive (http://ideas.repec.org/s/boc/bocode.html), containing hundreds of useful additions to Stata. However, R’s collection of add-ons currently contains 3,680 packages, and more are being added every week.  Stata users can access these fairly easily by doing their data management in Stata, saving a Stata format data set, importing it into R and running what they need. Working this way, the R program may only be a few lines long.

In our book, the section “Getting Started Quickly” outlines the most essential 50 pages for Stata users to read to work in this way. Of course the book covers all the basics of R, should the reader wish to learn more. Being enthusiastic programmers, we’ll be surprised if they don’t want to read it all.

There are many good books on R, but as I learned the language I found myself constantly wondering how each concept related to the packages I already knew. So in this book we describe R first using Stata terminology and then using R terminology. For example, when introducing the R data frame, we start out saying that it’s just like a Stata data set: a rectangular set of variables that are usually numeric with perhaps one or two character variables. Then we move on to say that R also considers it a special type of “list” which constrains all its “components” to be equal in length. That then leads into entirely new territory.
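
For readers who think in code, the idea of a "list" whose components must all be equal in length can be sketched outside R. What follows is a loose Python analogy, not how R actually implements data frames; the class name `TinyDataFrame` and the sample columns are invented for illustration.

```python
class TinyDataFrame:
    """A toy 'data frame': a named collection of equal-length columns."""

    def __init__(self, **columns):
        # The defining constraint: every component has the same length.
        lengths = {len(values) for values in columns.values()}
        if len(lengths) > 1:
            raise ValueError("all columns must have the same length")
        self.columns = {name: list(values) for name, values in columns.items()}

    def nrow(self):
        # Number of observations (rows); 0 if there are no columns.
        return len(next(iter(self.columns.values()), []))

    def row(self, i):
        # One observation across all variables.
        return {name: values[i] for name, values in self.columns.items()}


# Mostly numeric variables plus one character variable, as described above.
df = TinyDataFrame(id=[1, 2, 3], gender=["f", "m", "f"], score=[3.5, 4.0, 2.5])
print(df.nrow())
print(df.row(1))
```

Passing columns of unequal length raises an error, which is the constraint that separates a data frame from an ordinary list.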

The entire book is laid out to make learning easy for Stata users. The names used in the table of contents are Stata-based. The reader may look up how to “collapse” a data set by a grouping variable to find that one way R can do that is with the mysteriously named “tapply” function. A Stata user would never have guessed to look for that name. When reading from cover-to-cover that may not be that big of a deal, but as you go back to look things up it’s a huge time saver. The index is similar in that you can look every subject up by its Stata name to find the R function or vice versa. People see me with both my books near my desk and chuckle that they’re there for advertising. Not true! I look details up in them all the time.
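
To make the "collapse" idea concrete, here is a minimal sketch of what collapsing by a grouping variable means: one summary value per group. It is written in plain Python to illustrate the concept only, and it is not the syntax of either Stata or R; the function name `collapse_mean` and the data are made up.

```python
from collections import defaultdict


def collapse_mean(values, groups):
    """Return the mean of `values` within each level of `groups`."""
    buckets = defaultdict(list)
    for value, group in zip(values, groups):
        buckets[group].append(value)
    return {group: sum(vals) / len(vals) for group, vals in buckets.items()}


scores = [10.0, 20.0, 30.0, 40.0]
gender = ["f", "m", "f", "m"]
print(collapse_mean(scores, gender))
```

In R, the same per-group mean would typically be written as tapply(scores, gender, mean).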

I didn’t have enough in-depth knowledge of Stata to pull this off by myself, so I was pleased to get Joe Hilbe as a co-author. Joe is a giant in the world of Stata. He wrote several of the Stata commands that ship with the product including glm, logistic and manova. He was also the first editor of the Stata Technical Bulletin, which later turned into the Stata Journal. I have followed his work from his days as editor of the statistical software reviews section in the journal The American Statistician. There he not only edited but also wrote many of the reviews which I thoroughly enjoyed reading over the years. If you don’t already know Stata, his review of Stata 9.0 is still good reading (November 1, 2005, 59(4): 335-348).

Describe the relationship between Stata and R and how it is the same or different from SAS / SPSS and R.

This is a very interesting question. I pointed out in R for SAS and SPSS Users that SAS and SPSS are structured very similarly while R is totally different. Stata, on the other hand, has many similarities to R. Here I’ll quote directly from the book:

• Both include rich programming languages designed for writing new analytic methods, not just a set of prewritten commands.

• Both contain extensive sets of analytic commands written in their own languages.

• The pre-written commands in R, and most in Stata, are visible and open for you to change as you please.

• Both save command or function output in a form you can easily use as input to further analysis.

• Both do modeling in a way that allows you to readily apply your models for tasks such as making predictions on new data sets. Stata calls these postestimation commands and R calls them extractor functions.

• In both, when you write a new command, it is on an equal footing with commands written by the developers. There are no additional “Developer’s Kits” to purchase.

• Both have legions of devoted users who have written numerous extensions and who continue to add the latest methods many years before their competitors.

• Both can search the Internet for user-written commands and download them automatically to extend their capabilities quickly and easily.

• Both hold their data in the computer’s main memory, offering speed but limiting the amount of data they can handle.
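
A rough way to see what the last point implies: R stores a numeric value as an 8-byte double, so the raw size of an all-numeric data set is roughly rows × columns × 8 bytes. The Python sketch below ignores object overhead and the temporary copies made during analysis, and it does not apply directly to Stata, whose storage types can be smaller. The function name and the row and column counts are arbitrary examples.

```python
BYTES_PER_DOUBLE = 8  # size of one double-precision number


def approx_size_gb(rows, cols):
    """Raw size in gigabytes (10**9 bytes) of an all-numeric table."""
    return rows * cols * BYTES_PER_DOUBLE / 1e9


# Example: 10 million rows by 50 numeric columns.
print(approx_size_gb(10_000_000, 50))
```

If the result is close to or larger than the machine's RAM, the data set will not fit comfortably in memory.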

Can the book be used by an R user for learning Stata?

That’s certainly not ideal. The sections that describe the relationship between the two languages would be good to know and all the example programs are presented in both R and Stata form. However, we spend very little time explaining the Stata programs while going into the R ones step by step. That said, I continue to receive e-mails from R experts who learned SAS or SPSS from R for SAS and SPSS Users, so it is possible.

Describe the response to your earlier work, R for SAS and SPSS Users, and whether any new edition is forthcoming.

I am very pleased with the reviews for R for SAS and SPSS Users. You can read them all, even the one really bad one, at http://r4stats.com. We incorporated all the advice from those reviews into R for Stata Users, so we hope that this book will be well received too.

In the first book, Appendix B: A Comparison of SAS and SPSS Products with R Packages and Functions has been particularly popular for helping people find the R packages they need. As it expanded, I moved it to the web site: http://r4stats.com/add-on-modules. All three packages are changing so fast that I sometimes edit that table several times per week!

The second edition to R for SAS and SPSS Users is due to the publisher by the end of February, so it will be in the bookstores by sometime in April 2011, if all goes as planned. I have a list of thirty new topics to add, and those won’t all fit. I have some tough decisions to make!

On a personal note, Ajay, it was a pleasure getting to meet you when you came to UT, especially our chats on the current state of the analytics market and where it might be headed. I love the fact that the Internet allows people to meet across thousands of miles. I look forward to reading more on DecisionStats!

About –

Bob Muenchen has twenty-eight years of experience consulting, managing and teaching in a variety of complex, research oriented computing environments. You can read about him here http://web.utk.edu/~muenchen/RobertMuenchenResume.html

R for Stats : Updated

Here is the new website for statistical analysis using the free analytical software called R (which is enabled for cloud computing as well: see http://bit.ly/OhriCloud or http://rgrossman.com/2009/05/17/running-r-on-amazons-ec2/ for the R tutorial on running it on Amazon’s EC2 pay per demand RAM).

It is called R 4 stats or simply http://www.r4stats.com/

Hosted on Google’s updated Google Sites platform, it offers a preview of the update to Bob’s earlier runaway hit R for SAS and SPSS Users as well as his upcoming work R for Stata Users.

In Bob’s own words –

I have substantially expanded the table that compares SAS and SPSS
add-on modules to somewhat equivalent R packages. This new version is
at:
http://r4stats.com/add-on-modules
and I would very much appreciate any feedback you might have on it.

The site http://r4stats.com is the replacement to
http://RforSASandSPSSusers.com and includes the support files for both
“R for SAS and SPSS Users” and the new “R for Stata Users”, due out in
March from Springer.

Topic | SAS Product | SPSS Product | R Package
Advanced Models | SAS/STAT | IBM SPSS Advanced Statistics | R, MASS, many others
Association Analysis | Enterprise Miner | IBM SPSS Association | arules, arulesNBMiner, arulesSequences
Basics | Base SAS | IBM SPSS Statistics Base | R
Bootstrapping | SAS/STAT | IBM SPSS Bootstrapping | BootCL, BootPR, boot, bootRes, BootStepAIC, bootspecdens, bootstrap, FRB, gPdtest, meboot, multtest, pvclust, rqmcmb2, scaleboot, simpleboot
Classification Analysis | Enterprise Miner | IBM SPSS Classification | rattle, see the neural networks and trees entries in this table.
Conjoint Analysis | SAS/STAT: PROC TRANSREG | IBM SPSS Conjoint | homals, psychoR, bayesm
Correspondence Analysis | SAS/STAT: PROC CORRESP | IBM SPSS Categories | ade4, cocorresp, FactoMineR, homals, made4, MASS, psychoR, PTAk, vegan
Custom Tables | Base SAS, PROC REPORT, PROC SQL, PROC TABULATE, Enterprise Reporter | IBM SPSS Custom Tables | reshape
Data Access | SAS/ACCESS | SPSS Data Access Pack | DBI, foreign, Hmisc: sas.get, sasxport.get, RODBC
Data Collection | SAS/FSP | IBM SPSS Data Collection Family | RSQLite, and the other open source programs MySQL or PostgreSQL are popular among R users for this purpose.
Data Mining | Enterprise Miner | IBM SPSS Modeler (formerly Clementine) | arules, FactoMineR, rattle, various functions
Data Mining, In-database Processing | SAS In-Database Initiative with Teradata | IBM SPSS Modeler | PL/R
Data Preparation | Various procedures | IBM SPSS Data Preparation, various commands | dprep, plyr, reshape, sqldf, various functions
Developer Tools | SAS/AF, SAS/FSP, SAS Integration Technologies, SAS/TOOLKIT | IBM SPSS Statistics Developer, IBM SPSS Statistics Programmability Extension | StatET, R links to most popular compilers, scripting languages, and databases.
Direct Marketing | Nothing quite like it | IBM SPSS Direct Marketing | Nothing quite like it
Exact Tests | SAS/STAT various | IBM SPSS Exact Tests | coin, elrm, exactLoglinTest, exactmaxsel, and options in many others
Excel Integration | SAS Enterprise BI Server | IBM SPSS Advantage for Excel 2007 | RExcel
Forecasting | SAS/ETS | IBM SPSS Forecasting | Over 40 packages that do time series are described at the Task View link above under Time Series.
Forecasting, Automated | Forecast Server | IBM SPSS Forecasting | forecast
Genetics | JMP Genomics | None | http://www.bioconductor.org
Geographic Information Systems | SAS/GIS, SAS/GRAPH | None (Maps is defunct) | maps, mapdata, mapproj, GRASS via spgrass6, RColorBrewer, see Spatial in Task Views at link at top
Graphical user interfaces | Enterprise Guide, IML Studio, SAS/ASSIST, Analyst, Insight | IBM SPSS Statistics Base | Deducer, JGR, R Commander, pmg, rattle, many others at http://www.sciviews.org/_rgui/
Graphics, Interactive | SAS/IML Studio, SAS/INSIGHT, JMP | None | GGobi via rggobi, iPlots, latticist, playwith
Graphics, Static | SAS/GRAPH | SPSS Base, Graphics Production Language | ggplot2, gplots, graphics, grid, gridBase, hexbin, lattice, plotrix, scatterplot3d, vcd, vioplot, geneplotter, Rgraphics
Graphics, Template Builder | Doesn’t use Grammar of Graphics model that forms the core of IBM SPSS Viz Designer or R’s ggplot2 | IBM SPSS Viz Designer | Doesn’t use templates, but this GUI for ggplot2 http://www.stat.ucla.edu/~jeroen/ggplot2.html works similarly to IBM SPSS Viz Designer.
Guided Analytics | SAS/LAB | None | None
Matrix/linear Algebra | SAS/IML Studio | IBM SPSS Matrix | R, matlab, Matrix, sparseM
Missing Values Imputation | SAS/STAT: PROC MI | IBM SPSS Missing Values | amelia, Hmisc: aregImpute, EMV, rms (replaces Design): fit.mult.impute, mice, mitools, mvnmle, VIM
Neural Networks | Enterprise Miner | IBM SPSS Neural Networks | AMORE, grnnR, neuralnet, nnet, rattle
Operations Research | SAS/OR | None | glpk, linprog, LowRankQP, TSP
Power Analysis | SAS Power and Sample Size Application, SAS/STAT: PROC POWER, PROC GLMPOWER | SamplePower | asypow, powerpkg, pwr, MBESS
Quality Control | SAS/QC | IBM SPSS Statistics Base | qcc, spc
Regression Models | SAS/STAT | IBM SPSS Regression | R, Hmisc, lasso, VGAM, pda, rms (replaces Design)
Sampling, Complex | SAS/STAT: PROC SURVEY SELECT, SURVEYMEANS, etc. | IBM SPSS Complex Samples | pps, sampfling, sampling, spsurvey, survey
Segmentation Analysis | Enterprise Miner | IBM Modeler Segmentation | cluster, rattle, som, see CRAN Task Views under Cluster for over 70 packages
Server Version | SAS for your particular server | IBM SPSS Statistics Server, IBM SPSS Modeler Server | rapache, R(D)COM Server, Rserve, StatET
Structural Equation Modeling | SAS/STAT: PROC CALIS | Amos | OpenMX, sem
Text Analysis/Mining | Text Miner | IBM SPSS Text Analytics, IBM SPSS Text Analysis for Surveys | Rstem, las, tm
Trees, Decision, Classification or Regression | Enterprise Miner | IBM SPSS Decision Trees, IBM SPSS AnswerTree, IBM SPSS Modeler (formerly Clementine) | ada, adabag, BayesTree, boost, GAMboost, gbev, gbm, maptree, mboost, mvpart, party, pinktoe, quantregForest, rpart, rpart.permutation, randomForest, rattle, tree

All SAS and SPSS product names are registered trademarks of their respective companies.

Disclaimer- Bob Muenchen and I work for the same University. While we do often have interesting conflicts, his interview was one of the earliest on this blog.

See- http://sites.google.com/site/r4statistics/interview