Which software do we buy? It depends


Clients, friends and industry colleagues often ask me about the suitability or unsuitability of particular software for analytical needs. My answer is mostly:

It depends on:

1) The cost of a Type 1 error in the purchase decision versus a Type 2 error in the purchase decision. (Forgive me if I mix up Type 1 with Type 2 error; I do have some weird childhood learning disabilities which crop up now and then.)

Here I define a Type 1 error as paying more for software when equivalent functionality was available at a lower price, or buying components you do not need, like SPSS Trends (when only SPSS Base is required) or SAS ETS when only SAS/Stat would do.

The first kind is of course possible due to the presence of free tools with GUIs like R, R Commander and Deducer (Rattle does have a $500 commercial version).

The emergence of software vendors like WPS (for SAS language aficionados), which offer similar functionality to Base SAS, as well as the increasing convergence of business analytics (read: predictive analytics) and business intelligence (read: reporting), has led to some brand clutter in which all software packages promise to do everything, at all sorts of prices, though they all have specific strengths and weaknesses. To add to this, there are comparatively fewer independent business analytics analysts than, say, independent business intelligence analysts.

2) Here I define a Type 2 error as the opportunity cost of delayed projects, delayed business models, or lower accuracy: the consequences of buying a lower-priced software package that had less functionality than you required.

To compound the magnitude of a Type 2 error, you are probably in some kind of vendor lock-in, your software budget is exhausted because you bought too much or inappropriate software and hardware, and you could still do with some added help in business analytics. The fear of making a business-critical error is a substantial reason why open source software has to work harder at proving itself competent. Writing great software is not enough: we need great marketing to sell it and great customer support to sustain it.

As business decisions are made under constraints of time, information and money, I will try to create a software purchase matrix based on my knowledge of known software (and its unknown strengths and weaknesses), pricing (versus budgets), and ranges of data handling. I will basically lay out an optimum approach based on known constraints, and build in flexibility for unknown operational constraints.
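Before the matrix itself, the Type 1 versus Type 2 trade-off above can be sketched as a toy expected-cost comparison. All probabilities and dollar figures below are invented purely for illustration; plug in your own estimates.

```python
# Toy sketch: which purchase error is the expensive one for you?
# Every number here is hypothetical.

def expected_cost(prob_error, cost_if_error):
    """Expected cost = probability of making the error times its cost."""
    return prob_error * cost_if_error

# Type 1: overpaying, or buying components you do not need
type1 = expected_cost(prob_error=0.30, cost_if_error=50_000)   # license overspend

# Type 2: under-buying, leading to delays and lost opportunities
type2 = expected_cost(prob_error=0.10, cost_if_error=400_000)  # opportunity cost

# If the expected Type 2 cost dominates, err toward the fuller-featured option
decision = "budget software" if type1 > type2 else "full-featured software"
print(type1, type2, decision)  # 15000.0 40000.0 full-featured software
```

With these made-up figures the rarer but costlier Type 2 error dominates, which is exactly why "buy cheap" is not automatically the optimum.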

I will restrict this matrix to analytics software, though you could certainly extend it to other classes of enterprise software, including big data databases, infrastructure and computing.

Noted assumptions:

1) I am vendor neutral and do not suffer from subjective bias or affection for particular software (based on conferences, books, relationships, consulting etc.).

2) All software has bugs, so all software needs customer support.

3) All software has particular advantages, strengths and weaknesses in terms of functionality.

4) Cost includes the total cost of ownership and the opportunity cost of the business-analytics-enabled decision.

5) All software marketing people will praise their own software, sometimes over-selling and mis-selling product bundles.

The software compared will be SPSS, KXEN, R, SAS, WPS, Revolution R, and SQL Server, along with various flavors and sub-components within these. The optimized approach will include parallel programming, cloud computing, hardware costs, and dependent software costs.

To be continued.

Top R Interviews

Here is a list of the top R-related interviews I have done (in random order):

1) John Fox, creator of R Commander

https://decisionstats.com/2009/09/14/interview-professor-john-fox-creator-r-commander/

2) Dr Graham Williams, Creator of Rattle

https://decisionstats.com/2009/01/13/interview-dr-graham-williams/

3) David Smith, back when he was Community Director of the then Revolution Computing.

https://decisionstats.com/2009/05/29/interview-david-smith-revolution-computing/

and his second interview

https://decisionstats.com/2010/08/03/q-a-with-david-smith-revolution-analytics/

4) Richard Schultz, the first CEO of Revolution Computing (now Revolution Analytics)

https://decisionstats.com/2009/01/31/interviewrichard-schultz-ceo-revolution-computing/

5) Bob Muenchen, author of R for SAS and SPSS Users AND R for Stata Users

https://decisionstats.com/2010/06/29/interview-r-for-stata-users/

https://decisionstats.com/2008/10/16/r-for-sas-and-spss-users/

6) Karim Chine, creator of Biocep, cloud computing for R

https://decisionstats.com/2009/06/21/interview-karim-chine-biocep-cloud-computing-with-r/

7) Paul van Eikeran, Inference for R, the first enterprise package to use R from within MS Office.

https://decisionstats.com/2009/06/04/inference-for-r/

8) Hadley Wickham, creator of ggplot and R author

https://decisionstats.com/2010/01/12/interview-hadley-wickham-r-project-data-visualization-guru/

That's a lot of R interviews; I need to balance them out a bit, I guess.

CommeRcial R- Integration in software

Some updates to R on the commercial side.

Revolution Computing has apparently been renamed Revolution Analytics. Hopefully this, along with the GUI development, will help focus more attention on working with R in a mainstream office situation. I am still waiting for David Smith's cheery hey-guys-we-changed-again blog post, though, at the new site inside-r.org/ or his old blog at blog.revolution-computing.com.

They probably need to hire more people now. Curt Monash, noted all-things-data software guru, has the inside dope here.

Techworld writes more here at http://www.techworld.com.au/article/345288/startup_wants_r_alternative_ibm_sas

The company’s software is priced “aggressively” versus IBM and SAS. A single supported workstation costs $2,000 for an annual subscription. Pricing for server-based licenses varies depending on the implementation.

But Revolution Analytics faces a tough challenge from those larger vendors, as well as the likes of XLSolutions, which offers R training and a competing software package, R-Plus.

SPSS, though, continues to integrate R solidly and also to march ahead with Python (which is likely to be the next generation in statistical programming if it keeps up): http://insideout.spss.com/

With the release of Version 18 of IBM SPSS Statistics and the Developer product, easy-to-install versions of the Python and R materials are posted.  In particular, look for the R Essentials link on the main page or from the Plugins page.  It installs the R Plugin, the correct version of R, and a bunch of example R integrations as bundles.  It’s much easier to get going with this now.

Netezza, a business intelligence vendor, promises more integration and even training in R-based analytics here:

“R Modeling for TwinFin i-Class

Objective
Learn how to use TwinFin i-Class for scaling up the R language.

Description
In this class, you’ll learn how to use R to create models using huge data and how to create R algorithms that exploit our asymmetric massively parallel (AMPP®) architecture. Netezza has seamlessly integrated with R to offload the heavy lifting of the computational processing on TwinFin i-Class. This results in higher performance and increased scalability for R. Sign up for this class to learn how to take advantage of TwinFin i-Class for your R modeling. Topics include:

  1. R CRAN package installation on TwinFin i-Class
  2. Creating models using R on TwinFin i-Class
  3. Creating R algorithms for TwinFin i-Class

Format
Hands-on classroom lecture, lab exercises, tour

Audience
Knowledgeable R users – modelers, analytic developers, data miners

Course Length
0.5 day: 12pm-4pm Wednesday, June 23 OR 8am-12pm Thursday, June 24 OR 1pm-5pm Thursday, June 24, 2010

Delivery
Enzee Universe 2010, Boston, MA

Student Prerequisites

  • Working knowledge of R and parallel computing
  • Have analytic, compute-intensive challenges
  • Understanding of data mining and analytics”

My favourite GUI in stats, JMP (also from the SAS Institute), is going to deploy R integration as soon as this September. Read more here: http://www.sas.com/news/preleases/JMP-to-R-integrationSGF10.html

SAS/IML Studio is not lagging behind either:

The next release of SAS/IML will extend R integration to the server environment – enabling users to deploy results in batch mode and access R from SAS on additional platforms, such as UNIX and Linux.

I am kind of happy to see one of the best GUIs integrating with one of the most innovative stats software packages. It's like two of your best friends getting married. (See screenshots of the software.)

All in all, R as a platform is making good overall progress from all sides of the corporate software spectrum, which can only be good for R developers as well as users/students.

Norman Nie: R GUI and More

Here is an interview with Norman Nie, SPSS founder and now CEO of REvolution Computing (the R platform).

Some notable thoughts:

For example, SPSS was really among the first to deliver rich GUIs that make it easier to use by more people. This is why one of the first things you’ll see from REvolution is a GUI for R – to make R more accessible and hereby further accelerate adoption.

This is good news if executed. I have often written (in agony, actually, because I use it) about the need for GUIs for R; my last post on that was here. Indeed, the one reason SPSS was easily adopted by business school students (like me) in India in 2001-03 was its much better GUI compared to SAS's GUIs.

However, some self-delusion / PR / cognitive dissonance seems at play in Dr Nie's words:

If you look at the last 40 years of university curriculum, SPSS – the product I helped build – has been the dominant player, even becoming the common thread uniting a diverse range of disciplines, which have in turn been applied to business. Data is ubiquitous: tools and data warehouses allow you to query a given set of data repeatedly. R does these things better than the alternatives out there; it is indeed the wave of the future.

SPSS has been a strong number 2, but it has never overtaken SAS. Part of that is because SAS handles much bigger datasets much more easily than SPSS did (and that is where R's RAM-only limit can be a concern). Given the decreasing price of RAM, the biglm-like packages, and the shift toward cloud-based computing (with rampable memory on demand), this may become less of an issue, but analysts generally like a straightforward way of handling bigger datasets. Indeed SAS, with its vertical focus and its recent social media analytics, continues to innovate both on its own and through its alliance partnerships in the enterprise software world. REvolution Computing would further need to tie up with or sew up these analytical partners, especially data warehousing or BI providers, to ensure that R's analytical functions can be used where there is maximum value for the corporate customer as well as the academic customer.
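The biglm idea mentioned above, fitting a model without ever holding all the rows in RAM, can be sketched in a few lines. This is a hypothetical illustration of the streaming approach, not biglm's or Revolution's actual code, written in Python for brevity; the same sufficient-statistics trick is what lets such packages regress over data far larger than memory.

```python
# Sketch of chunked regression: fit y = a + b*x by streaming over chunks,
# accumulating sufficient statistics instead of loading every row at once.

def fit_streaming(chunks):
    """Simple linear regression over an iterable of chunks of (x, y) rows."""
    n = sx = sy = sxx = sxy = 0.0
    for chunk in chunks:              # each chunk could be one disk read
        for x, y in chunk:
            n += 1
            sx += x
            sy += y
            sxx += x * x
            sxy += x * y
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)  # slope
    a = (sy - b * sx) / n                          # intercept
    return a, b

# Two tiny chunks standing in for millions of rows read off disk;
# the data lie exactly on y = 1 + 2x.
chunks = [[(1, 3), (2, 5)], [(3, 7), (4, 9)]]
a, b = fit_streaming(chunks)
print(a, b)  # 1.0 2.0
```

Only the five running sums live in memory, so the dataset size is bounded by disk, not RAM.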

Part 2 of Nie's interview should be interesting.

2010-2011 would likely see

Round 2: Red Corner (Nie) vs Gray Corner (Goodnight)

if

Norman Nie can truly deliver a REvolution in Computing

or else

he becomes number two to Jim Goodnight's software giant for the second time around.

News on R Commercial Development -Rattle- R Data Mining Tool

R RANT: while the European R Core leadership, led by the Great Dane Peter Dalgaard, focuses on the small picture and virtually hands the whole commercial side to Prof Nie and David Smith at Revo Computing, other smaller package developers have refused to be treated as cheap R&D developers for enterprise software. How are the book sales coming along, Prof Peter? Any plans to write another R book, or are you done with writing your version of Mathematica (ref: Newton)? Running the R Core project team must be so hard; I recommend the Tarantino movie "Inglorious B…" for the Herr Doktors. -END

I believe that individual R package creators like Prof Harrell (Hmisc) or Hadley Wickham (plyr) deserve a share of the royalties or REVENUE of Revolution Computing, or of ANY software company that uses R.

On this note, some updated news on Rattle, the data mining tool created by Dr Graham Williams. Once again, R development is taken ahead by the Down Under chaps while the Big Guys thrash out the road map across the Pond.

Data Mining Resources

Citation: http://datamining.togaware.com/

Rattle is a free and open source data mining toolkit written in the statistical language R using the Gnome graphical interface. It runs under GNU/Linux, Macintosh OS X, and MS/Windows. Rattle is being used in business, government, research and for teaching data mining in Australia and internationally. Rattle can be purchased on DVD (or made available as a downloadable CD image) as a standalone installation for $450USD ($560AUD), using one of the following payment buttons.

The free and open source book, The Data Mining Desktop Survival Guide (ISBN 0-9757109-2-3) simply explains the otherwise complex algorithms and concepts of data mining, with examples to illustrate each algorithm using the statistical language R. The book is being written by Dr Graham Williams, based on his 20 years research and consulting experience in machine learning and data mining. An electronic PDF version is available for a small fee from Togaware ($40AUD/$35USD to cover costs and ongoing development);

Other Resources

  • The Data Mining Software Repository makes available a collection of free (as in libre) open source software tools for data mining
  • The Data Mining Catalogue lists many of the free and commercial data mining tools that are available on the market.
  • The Australasian Data Mining Conferences are supported by Togaware, which also hosts the web site.
  • Information about the Pacific Asia Knowledge Discovery and Data Mining series of conferences is also available.
  • Data Mining course is taught at the Australian National University.
  • See also the Canberra Analytics Practise Group.
  • A Data Mining Course was held at the Harbin Institute of Technology Shenzhen Graduate School, China, 6 December – 13 December 2006. This course introduced the basic concepts and algorithms of data mining from an applications point of view and introduced the use of R and Rattle for data mining in practise.
  • Data Mining Workshop was held over two days at the University of Canberra, 27-28 November, 2006. This course introduced the basic concepts and algorithms for data mining and the use of R and Rattle.

Using R for Data Mining

The open source statistical programming language R (based on S) is in daily use in academia and in business and government. We use R for data mining within the Australian Taxation Office. Rattle is used by those wishing to interact with R through a GUI.

R is memory based so that on 32bit CPUs you are limited to smaller datasets (perhaps 50,000 up to 100,000, depending on what you are doing). Deploying R on 64bit multiple CPU (AMD64) servers running GNU/Linux with 32GB of main memory provides a powerful platform for data mining.

R is open source, thus providing assurance that there will always be the opportunity to fix and tune things that suit our specific needs, rather than rely on having to convince a vendor to fix or tune their product to suit our needs.

Also, by being open source, we can be sure that the code will always be available, unlike some of the data mining products that have disappeared (e.g., IBM’s Intelligent Miner).
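As an aside, the "memory based" limit described in the quoted passage is easy to sanity-check with back-of-envelope arithmetic: a numeric value takes roughly 8 bytes, so a dataset needs about rows × columns × 8 bytes of RAM. The dataset shapes below are illustrative, not measurements.

```python
# Back-of-envelope memory footprint for an all-numeric dataset.
# 8 bytes per double-precision number; sizes below are hypothetical.

def dataset_gb(rows, cols, bytes_per_cell=8):
    """Approximate in-memory size of a rows x cols numeric dataset, in GB."""
    return rows * cols * bytes_per_cell / 2**30

small = dataset_gb(100_000, 50)       # comfortably fits a 32-bit desktop
large = dataset_gb(100_000_000, 50)   # needs a 64-bit, large-memory server
print(round(small, 2), round(large, 1))  # 0.04 37.3
```

That second figure is why the quoted setup of 64-bit servers with 32GB of main memory (or the chunked approaches discussed earlier) matters for data mining at scale.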

See the earlier interview:

https://decisionstats.wordpress.com/2009/01/13/interview-dr-graham-williams/

Who will forecast for the forecasters?

An interesting blog post appeared here at http://www.information-management.com/blogs/business_intelligence_bi_statistics-10016491-1.html basically laying down the competitive landscape for analytical companies.

“One safe bet is that IBM, with newly-acquired SPSS and Cognos, is gearing up to take on SAS in the high-end enterprise analytics market that features very large data and operational analytics with significant capacity challenges. In this segment, IBM can leverage its hardware, database and consulting strengths to become a formidable SAS competitor.

and

A number of start-up companies promoting competitive SAS language tools at a fraction of SAS prices may begin chipping away at many SAS annuity customers. As I wrote in last week’s blog, WPS from World Programming Systems is an outstanding SAS compiler that can replace expensive SAS licenses in many cases – especially those primarily used for data step programs. Similarly, another competitor, Carolina, from Dulles Research, LLC, converts Base SAS program code to Java, which can then be deployed in a Java run-time environment. Large SAS customer Discover Card is currently evaluating Carolina as a replacement for some of its SAS applications.

Citation: Steve Miller’s blog can also be found at miller.openbi.com.”

I think all companies have hired smart enough people and many of their efforts would cancel each other out in a true game theory manner.

I also find it extremely hypocritical for commercial vendors not to incentivize R algorithm developers and to treat the 2,000-plus packages as essentially freeware.

If used for academics and research, R package creators expect and demand no money. But if used commercially, shouldn't the leading analytical vendors like SAS, SPSS, and even the newly cash-infused REvolution create some kind of royalty-sharing agreement?

If iTunes can help sell songs for 99 cents per download and really help the music industry move to the next generation, how much would commercial vendors agree to share on their solutions, which ARE DEPENDENT on popular R packages like plyr or even Dr Frank Harrell's Hmisc?

Unless you think Britney Spears has a better right to intellectual property than venerable professors and academics.

Even a monthly $10,000 prize for the best R package created (one that the sponsoring company could use in its commercial packages) could help speed up the R software movement, just like the Netflix Prize.

More importantly, it could free up valuable resources for companies to concentrate on customer solutions like data quality, systems integration, and the computational shift to clouds, which even today is sadly lacking in the whole analytical ecosystem.

One interesting paradigm I find is that whoever masters the new computational requirements of large amounts of unstructured data (not just row-and-column numeric data, but text sentiment analysis and the like), and can integrate this into a complete customer solution in an easy-to-understand, data-visualization-enabled system, will lead the next decade, whether as a package, a platform, or a company.

(Q: if the '90s were the Nineties, will the next decade be the teen years?)

Portrait of a Lady

That's a screenshot of Danese Cooper's Wikipedia page. Danese was fired without severance by the Intel Capital Series B investors at http://www.revolution-computing.com. If this is what you get after a lifetime of working in open source, maybe I should recommend people get a job with Prof Jim Goodnight, PhD, who rarely fires people and has managed to steer his company profitably without an IPO or Series Z funding.

On the other hand, I kind of admire the women trying to make it in software companies. They are so few, and they look up to people like Danese to say that yes, they can make it big too.

Goodbye, Danese. May your big heart rest in peace on your blog: http://danesecooper.blogs.com/.
