KNIME and Zementis shake hands

Two very good, very customer-centric (and open source) companies shook hands on a strategic partnership today.

KNIME (www.knime.org) and Zementis (www.zementis.com).

Decision Stats has been covering both companies. Both products are amazingly good, sync very well thanks to their support of the PMML standard, and lower costs considerably for the consumer (see http://www.decisionstats.com/2009/02/knime/ and http://www.decisionstats.com/2009/02/interview-michael-zeller-ceozementis/).

KNIME is available under both a free personal license and a commercial license. It also supports R very well, in part thanks to PMML (the www.dmg.org initiative).

See http://www.knime.org/blog/export-and-convert-r-models-pmml-within-knime

The following example R script learns a decision tree based on the Iris-Data and exports this as PMML and as an R model which is understood by the R Predictor node:

# load the library for learning a tree model
library(rpart)
# load the pmml export library
library(pmml)
# R is assumed to be the input table that KNIME hands to the script as a data frame;
# use the class column as the predicted column to build the decision tree
dt <- rpart(class ~ ., data = R)
# export to PMML
r_pmml <- pmml(dt)
# write the PMML model to an export file
write(toString(r_pmml), file = "C:/R.pmml")
# provide the native R model at the out-port
R <- dt
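
Outside KNIME, the same steps can be tried on R's built-in iris data. The following is a minimal sketch, not the KNIME node's own code: it assumes the built-in `iris` data frame (where the class column is called `Species`), writes to a temporary file instead of `C:/R.pmml`, and only attempts the PMML export if the `pmml` and `XML` packages happen to be installed.

```r
# Decision tree on the built-in iris data (assumption: used here in place
# of the KNIME input table, so the class column is called Species)
library(rpart)

dt <- rpart(Species ~ ., data = iris)

# Sanity check: score the training data with the native R model
pred <- predict(dt, newdata = iris, type = "class")
train_accuracy <- mean(pred == iris$Species)

# Export to PMML only if the pmml package (and the XML package it builds on)
# is available; the file name is an arbitrary choice for this sketch
pmml_file <- file.path(tempdir(), "iris_tree.pmml")
if (requireNamespace("pmml", quietly = TRUE) &&
    requireNamespace("XML", quietly = TRUE)) {
  r_pmml <- pmml::pmml(dt)
  XML::saveXML(r_pmml, file = pmml_file)
}
```

The resulting PMML file is what a PMML-aware scoring engine would consume, while `dt` remains usable as an ordinary R model.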

 

Zementis brings the total cost of ownership, and the total pain, of deploying scored models down to something close to US$1 per hour, thanks to its proprietary ADAPA engine.

The big big Analytics Conference

The Predictive Analytics Conference (http://www.predictiveanalyticsworld.com/) starts today at the Hotel Nikko, San Francisco. A whole who’s who of analytics experts is gathering there, including SAS, SPSS, SAP, Click Forensics, Acxiom, Amazon and Google, along with a big R user conference. It is really, really huge, so stay tuned for some exciting announcements.


SAS, R and NYT – The Sequel

Here is a follow-up to the SAS vs. R articles by Ashlee V of the NYT.

 

The SAS Institute has borrowed a page from Sesame Street. It is now sponsoring the letter ‘R.’

Last month, I wrote an article about the rising popularity of the R programming language. The open-source software has turned into a favorite piece of technology for statisticians and other people looking to pull insights out of data.

On several levels, R represents a threat to SAS, which is the largest seller of commercial statistics software. Students at universities now learn R alongside SAS. In addition, the open-source nature of R allows the software to be tweaked at a pace that is hard for a commercial software maker to match.

All told, surging interest in the free R language could affect sales of SAS software, which can sell for thousands of dollars. Rather than running from the threat, SAS appears ready to try to understand R by adopting a more active role in its development.

You can read more at http://bits.blogs.nytimes.com/2009/02/16/sas-warms-to-open-source-one-letter-at-a-time/ or even by clicking on the Bits RSS feed in the sidebar on www.decisionstats.com

Ajay –

Note that SAS is only opening up the SAS/IML product to integrate R’s matrix language capabilities. Base SAS still does not seem to be integrated with R, and neither does the statistics module SAS/Stat (SAS Institute sells add-on modules based on functionality and prices them accordingly).

Many third-party sources like http://www.minequest.com have created interfaces from Base SAS to R; they are priced at around $50 apiece.

An additional threat to SAS’s dominance comes from the WPS software from a UK-based company, World Programming (http://www.teamwpc.co.uk/home), which has an alliance with IBM. WPS can read and write the SAS language, can read and write SAS datasets as well, and is priced at $660, almost one tenth of the price of SAS Institute’s licenses.

The recession is also forcing many large license holders of statistical software (like Banks and Financial Services) to seek discounts and alternatives. SAS Institute remains the industry leader in analytics software after almost 35 years of dominance.

However, this is a nice first step, and it will be interesting to see the follow-up steps from SAS Institute’s rivals.

We can all go on our respective open source and closed source jets now.

That is a reference to comments from Anne H. Milley, director for technology product marketing at SAS, who relegated R to a limited role.

In the article, Ms. Milley said, “I think it addresses a niche market for high-end data analysts that want free, readily available code. We have customers who build engines for aircraft. I am happy they are not using freeware when I get on a jet.”

Modeling: R Code, Books and Documents

Here is an equivalent of PROC GENMOD in R.

If the SAS code is as below:

PROC GENMOD DATA=X;
  CLASS FLH;
  MODEL BS/OCCUPANCY = distcrop distfor flh distcrop*flh / D=B LINK=LOGIT TYPE3;
RUN;

 

Then the R language equivalent would be :

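# flh needs to be a factor; distcrop*flh already expands to
# distcrop + flh + distcrop:flh, so only distfor has to be added
glm(bs/occupancy ~ distcrop*flh + distfor,
   family=binomial(logit), weights=occupancy, data=X)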
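
To try the translation without the original data, here is a minimal sketch on simulated data. Everything about the data frame `X` (sample size, value ranges, the three levels of `flh`, the coefficients used to simulate) is made up for illustration. The formula includes `distfor` so that it matches the SAS MODEL statement above, and the sketch does not reproduce the TYPE3 tests. It also shows the equivalent `cbind(successes, failures)` form of a binomial response.

```r
# Simulated stand-in for the SAS data set X (all values are hypothetical)
set.seed(42)
n <- 60
X <- data.frame(
  distcrop  = runif(n, 0, 10),
  distfor   = runif(n, 0, 10),
  flh       = factor(sample(c("a", "b", "c"), n, replace = TRUE)),
  occupancy = sample(5:20, n, replace = TRUE)   # number of trials
)
# bs = number of events out of occupancy trials
X$bs <- rbinom(n, size = X$occupancy,
               prob = plogis(-0.5 + 0.1 * X$distcrop - 0.05 * X$distfor))

# Events/trials form: proportion as response, trials as weights
fit_prop <- glm(bs / occupancy ~ distcrop * flh + distfor,
                family = binomial(link = "logit"),
                weights = occupancy, data = X)

# Equivalent form: two-column matrix of successes and failures
fit_cbind <- glm(cbind(bs, occupancy - bs) ~ distcrop * flh + distfor,
                 family = binomial(link = "logit"), data = X)

print(coef(fit_prop))
```

Both calls fit the same model, so their coefficients agree; which form to use is a matter of taste.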

 

Credit to Peter Dalgaard from the R-Help List 

Peter is also author of the splendid standard R book

 

Speaking of books – here is one R book I am waiting for

 

A similarly named free document (Introduction to Statistical Modelling in R by P.M.E. Altham, Statistical Laboratory, University of Cambridge) is available here –

http://www.statslab.cam.ac.uk/~pat/redwsheets.pdf

It is a pretty nice reference document if modelling is what you do and R is what you need to explore. It is dated 5 February 2009, so it is quite up to date. You can also check Dr Altham’s home page for a lot of R resources.

SAS adds support to R

From the official website itself http://support.sas.com/rnd/app/studio/Rinterface2.html

R Interface Coming to SAS/IML® Studio

While readers of the New York Times may have learned about R in recent weeks, it’s not news to many at SAS.

“R is a leading language for developing new statistical methods,” said Bob Rodriguez, Senior Director of Statistical Development at SAS. “Our new PhD developers learned R in their graduate programs and are quite versed in it.”

R is a matrix-based programming language that allows you to program statistical methods reasonably quickly. It’s open source software, and many add-on packages for R have emerged, providing statisticians with convenient access to new research. Many new statistical methods are first programmed in R.

While SAS is committed to providing the new statistical methodologies that the marketplace demands and will deliver new work more quickly with a recent decoupling of the analytical product releases from Base SAS, a commercial software vendor can only put out new work so fast. And never as fast as a professor and a grad student writing an academic implementation of brand-new methodology.

Both R and SAS are here to stay, and finding ways to make them work better with each other is in the best interests of our customers.

“We know a lot of our users have both R and SAS in their tool kit, and we decided to make it easier for them to access R by making it available in the SAS environment,” said Rodriguez. “Our first interface to R will be in an upcoming version of SAS/IML Studio (currently known as SAS Stat Studio), scheduled for this summer.”

The SAS/IML Studio interface allows you to integrate R functionality with IML or SAS programs. You can also exchange data between SAS and R as data sets or matrices.

“This is just the first step,” said Radhika Kulkarni, Vice President of Advanced Analytics. “We are busy working on an R interface that can be surfaced in the SAS server or via other SAS clients. For example, users will be able to interface with R through the IML procedure, possibly as soon as the first part of 2010.”

SAS/IML Studio is distributed with SAS/IML software. Stay tuned for details on availability.

 

Note: SAS/IML, Base SAS and SAS/Stat are copyrighted products of SAS Institute.

This is a welcome step from the industry leader SAS Institute and also puts an effective stop to rumors of it being too arrogant or too conservative to change.

Perhaps no other software maker has dominated the niche in which it operates for as long as SAS has (since even before I was born!) without getting into any kind of hassles. The decision to stay private as a company also looks incredibly wise given the carnage on stock markets today (but it requires a lot of willpower from the founders to say no to the easy billions that investment bankers would have lined up for an IPO).

This decision would also help the R project greatly, as SAS support definitely means the matrix part of the R language has come to stay. However, R is not just a matrix-based programming language; it has capabilities for data mining and other statistical analysis as well. Would SAS extend SAS/Stat capabilities to R? What does the recent decoupling of the analytical product releases from Base SAS mean (is this due to the WPS challenge)?

Either way, the consumer is the winner. Kudos, SAS Institute!!

As mentioned before, Zementis is at the forefront of using cloud computing (Amazon EC2) for open source analytics. Recently I came in contact with Michael Zeller over a business problem, and Mike, being the gentleman he is, not only helped me out but also agreed to an extensive and exclusive interview (!).


Ajay- What are the traditional rivals to the scoring solutions you offer? How does ADAPA compare to each of them? Case study: assume I have 50,000 leads daily on a car-buying website. How would ADAPA help me in scoring the model (created, say, by KXEN, R, SAS or SPSS)? What would my approximate cost advantages be if I intend to mail, say, the top 5 deciles every day?

Michael- Some of the traditional scoring solutions used today are based on SAS, in-database scoring like Oracle, MS SQL Server, or very often even custom code.  ADAPA is able to import the models from all tools that support the PMML standard, so any of the above tools, open source or commercial, could serve as an excellent development environment.

The key differentiators for ADAPA are simple and focus on cost-effective deployment:

1) Open Standards – PMML & SOA:

Freedom to select best-of-breed development tools without being locked into a specific vendor;  integrate easily with other systems.

2) SaaS-based Cloud Computing:

Delivers a quantum leap in cost-effectiveness without compromising on scalability.

In your example, I assume that you’d be able to score your 50,000 leads in one hour using one ADAPA engine on Amazon.  Therefore, you could choose to either spend US$100,000 or more on hardware, software, maintenance, IT services, etc., write a project proposal, get it approved by management, and be ready to score your model in 6-12 months…

OR, you could use ADAPA at something around US$1-$2 per day for the scenario above and get started today!  To get my point across here, I am of course simplifying the scenario a little bit, but in essence these are your choices.
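
The arithmetic behind that comparison can be sketched in a few lines of R. The throughput (50,000 leads per instance-hour) and the US$100,000 upfront figure come from the answer above; the hourly rate of US$1 is an assumption chosen to line up with the "US$1-$2 per day" estimate, and the variable names are only illustrative.

```r
# Figures taken from the interview scenario
leads_per_day           <- 50000
leads_per_instance_hour <- 50000    # one ADAPA engine scores the day's leads in about an hour
upfront_in_house_cost   <- 100000   # US$, "or more", for the in-house route

# Assumption for this sketch: roughly US$1 per machine hour
rate_per_hour <- 1

# Machine hours needed per day, rounded up to whole billed hours
hours_per_day <- ceiling(leads_per_day / leads_per_instance_hour)
daily_cost    <- hours_per_day * rate_per_hour
annual_cost   <- daily_cost * 365

# Mailing the top 5 deciles means contacting half of the scored leads
mailed_per_day <- leads_per_day * 5 / 10

# Years of daily scoring that the upfront budget alone would pay for
years_to_match <- upfront_in_house_cost / annual_cost

cat("Daily scoring cost (US$):", daily_cost, "\n")
cat("Leads mailed per day:", mailed_per_day, "\n")
cat("Years of scoring covered by the upfront budget:", round(years_to_match), "\n")
```

As in the interview, this simplifies the scenario: it ignores data transfer, storage and staff time on both sides.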

Sounds too good to be true?  We often get this response, so please feel free to contact us today [http://www.zementis.com/contact.htm] and we will be happy to show you how easy it can be to deploy predictive models with ADAPA!

 

Ajay- The ADAPA solution seems to save money on both hardware and software costs. Please comment. Also, have you done any benchmarking tests on a traditional scoring configuration versus ADAPA?

Michael-Absolutely, the ADAPA Predictive Analytics Edition [http://www.zementis.com/predictive_analytics_edition.htm] on Amazon’s cloud computing infrastructure (Amazon EC2) eliminates the upfront investment in hardware and software.  It is a true Software as a Service (SaaS) offering on Amazon EC2 [http://www.zementis.com/howtobuy.htm] whereby users only pay for the actual machine time starting at less than US$1 per machine hour.  The ADAPA SaaS model is extremely dynamic, e.g., a user is able to select an instance type most appropriate for the job at hand (small, large, x-large) or launch one or even 100 instances within minutes.

In addition to the above savings in hardware/software, ADAPA also cuts the time-to-market for new models (priceless!) which adds to business agility, something truly critical for the current economic climate.

Regarding a benchmark comparison, it really depends on what is most important to the business.  Business agility, time-to-market, open standards for integration, or pure scoring performance?  ADAPA addresses all of the above.  At its core, it is a highly scalable scoring engine which is able to process thousands of transactions per second.  To tackle even the largest problems, it is easy to scale ADAPA via more CPUs, clustering, or parallel execution on multiple independent instances. 

Need to score lots of data once a month which would take 100 hours on one computer?  Simply launch 10 instances and complete the job in 10 hours overnight.  No extra software licenses, no extra hardware to buy: that’s capacity truly on-demand, whenever needed, and cost-effective.
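
The 100-hours-on-10-instances example works out as follows. This is a rough sketch under two assumptions that are not in the interview: the job splits perfectly across instances, and each instance is billed in whole hours at a hypothetical US$1 per hour. The function name `scale_job` is made up for illustration.

```r
# Wall-clock time and cost of a scoring job spread over several instances.
# Assumes perfect parallelism and whole-hour billing per instance.
scale_job <- function(total_machine_hours, instances, rate_per_hour) {
  stopifnot(instances >= 1)
  wall_clock_hours <- total_machine_hours / instances
  billed_hours     <- instances * ceiling(wall_clock_hours)
  list(
    wall_clock_hours = wall_clock_hours,
    billed_hours     = billed_hours,
    cost             = billed_hours * rate_per_hour
  )
}

one_instance  <- scale_job(100, instances = 1,  rate_per_hour = 1)
ten_instances <- scale_job(100, instances = 10, rate_per_hour = 1)

cat("1 instance:  ", one_instance$wall_clock_hours, "hours, US$", one_instance$cost, "\n")
cat("10 instances:", ten_instances$wall_clock_hours, "hours, US$", ten_instances$cost, "\n")
```

Under these assumptions the bill is the same either way; what the extra instances buy is the shorter wall-clock time.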

Ajay- What has been your vision for Zementis? What exciting products are we going to see from it next?

Michael – Our vision at Zementis [http://www.zementis.com] has been to make it easier for users to leverage analytics.  The primary focus of our products is on the deployment side, i.e., how to integrate predictive models into the business process and leverage them in real-time.  The complexity of deployment and the cost associated with it has been the main hurdle for a more widespread adoption of predictive analytics. 

Adhering to open standards like the Predictive Model Markup Language (PMML) [http://www.dmg.org/] and SOA-based integration, our ADAPA engine [http://www.zementis.com/products.htm] paves the way for new use cases of predictive analytics — wherever a painless, fast production deployment of models is critical or where the cost of real-time scoring has been prohibitive to date.

We will continue to contribute to the R/PMML export package [http://www.zementis.com/pmml_exporters.htm] and extend our free PMML converter [http://www.zementis.com/pmml_converters.htm] to support the adoption of the standard.  We believe that the analytics industry will benefit from open standards and we are just beginning to grasp what data-driven decision technology can do for us.  Without giving away much of our roadmap, please stay tuned for more exciting products that will make it easier for businesses to leverage the power of predictive analytics!

Ajay- Any India- or Asia-specific plans for Zementis?

Michael-Zementis already serves customers in the Asia/Pacific region from its office in Hong Kong.  We expect rapid growth for predictive analytics in the region and we think our cost-effective SaaS solution on Amazon EC2 will be of great service to this market.  I could see various analytics outsourcing and consulting firms benefit from using ADAPA as their primary delivery mechanism to provide clients with predictive  models that are ready to be executed on-demand.

Ajay- What do you believe will be the biggest challenges for analytics in 2009? What are the biggest opportunities?

Michael-The biggest challenge for analytics will most likely be the reduction in technology spending in a deep, global recession.  At the same time, companies must take advantage of analytics to cut cost, optimize processes, and to become more competitive.  Therefore, the biggest opportunity for analytics will be in the SaaS field, enabling clients to employ analytics without upfront capital expenditures.

Ajay – What made you choose a career in science? Describe your journey so far. What would your advice be to young science graduates in these recessionary times?

Michael- As a physicist, my research focused on neural networks and intelligent systems.  Predictive analytics is a great way for me to stay close to science while applying such complex algorithms to solve real business problems.  Even in a recession, there is always a need for good people with the desire to excel in their profession.  Starting your career, I’d say the best way is to remain broad in expertise rather than being too specialized on one particular industry or proficient in a single analytics tool.  A good foundation of math and computer science, combined with curiosity in how to apply analytics to specific business problems will provide opportunities, even in the current economic climate.

About Zementis

Zementis, Inc. is a software company focused on predictive analytics and advanced Enterprise Decision Management technology. We combine science and software to create superior business and industrial solutions for our clients. Our scientific expertise includes statistical algorithms, machine learning, neural networks, and intelligent systems and our scientists have a proven record in producing effective predictive models to extract hidden patterns from a variety of data types. It is complemented by our product offering ADAPA®, a decision engine framework for real-time execution of predictive models and rules. For more information please visit www.zementis.com

Ajay- If you have a lot of data (GBs and GBs) and an existing model (in SAS, SPSS or R) that you have converted to PMML, and it is time for you to choose between spending more money to upgrade your hardware and renewing your software licenses, then instead take a look at ADAPA from www.zementis.com and score models for as low as US$1 per hour. Check it out (test and control!!).

Do you have any additional queries for Michael? Use the comments page to ask.

I just downloaded REvolution Computing’s latest release of REvolution R. The individual Win32 version is free, while the Enterprise version adds Win64 support. Tech support is included in the services contract for the software, which should help any corporation willing to take R on a trial basis.

 

From the press release:

REvolution Computing Makes High Performance ‘REvolution R’ Available For Download

New Haven, CT – January 28, 2009 – REvolution Computing, a leading provider of open source predictive analytics solutions, today announced that it has made a public version of its commercial grade REvolution R program available for download from its website. REvolution R is REvolution Computing’s distribution of the popular R statistical software, optimized for use in commercial environments.

With the latest release of REvolution R, REvolution Computing has added significant performance enhancements to the base system, which can prove to be of great value in both commercial and research settings. A key feature includes the use of powerful optimized libraries capable of boosting performance by a factor of 5 or 10 for commonly used operations. In addition, REvolution R has been put through a quality process designed to meet regulatory agency audit standards, making the subscription version reliable for use in mission critical research and production.

“In making our latest release of REvolution R available for download, REvolution Computing is providing all R users the ability to take advantage of optimized and validated software previously available only to commercial users,” said REvolution Computing CEO, Richard Schultz. “In a true commercial open source way, we have reached the point in our development that we are able to offer significant value to both sets of our community users – REvolution R for all users, and REvolution R Enterprise, with additional commercial-grade capabilities and support, available by annual subscription.”

REvolution’s commercial distribution, REvolution R Enterprise, features advanced functionality, including ParallelR, which speeds deployment across both multiprocessor workstations and clusters to enable the same codes to be used for prototyping and production. REvolution R Enterprise is functional with 64-bit platforms and Linux enterprise platforms and provides for telephone support and response guarantees.

Some background on the company, from the company itself:

 

About REvolution Computing

New Haven, Connecticut-based REvolution Computing is the leading commercial provider of software and support for the statistical computing language known as “R.” 

Our products, including REvolution R and REvolution R Enterprise, enable statisticians, scientists and others to create superior predictive models and derive meaning from large sets of mission-critical data in record time. REvolution Computing works closely with the R community to incorporate the latest developments in open source R, and with our clients to support their efforts to produce groundbreaking innovations in life sciences, financial services, defense technology and other industries where high-level analytics are crucial to success. At REvolution Computing, “We do the math.”

The product names “RPro,” “ParallelR,” “REvolution R,” and “REvolution R Enterprise,” are trademarks of REvolution Computing.

 

This basically gives the company first-mover advantage in commercial R. The timing is also fortunate, as companies across the world look to cut costs (unfortunately, labor costs are being cut faster than software costs) and to move beyond the traditional analytics software that performed oh so well in the sub-prime prediction market.

REvolution R is available for download on Windows and Intel MacOS X, both in 32-bit mode at http://www.revolution-computing.com/downloads/revolution-r.php