Interview – Michael Zeller, CEO, Zementis

As mentioned before, Zementis is at the forefront of using cloud computing (Amazon EC2) for open source analytics. I recently contacted Michael Zeller about a business problem, and Mike, being the gentleman he is, not only helped me out but also agreed to an extensive and exclusive interview (!).


Ajay- What are the traditional rivals to the scoring solutions you offer? How does ADAPA compare to each of them? Case study: assume I have 50,000 leads daily on a car-buying website. How would ADAPA help me in scoring the model (created, say, in KXEN, R, SAS, or SPSS)? What would my approximate cost advantage be if I intend to mail, say, the top 5 deciles every day?

Michael- Some of the traditional scoring solutions used today are based on SAS, in-database scoring like Oracle, MS SQL Server, or very often even custom code.  ADAPA is able to import the models from all tools that support the PMML standard, so any of the above tools, open source or commercial, could serve as an excellent development environment.

The key differentiators for ADAPA are simple and focus on cost-effective deployment:

1) Open Standards – PMML & SOA:

Freedom to select best-of-breed development tools without being locked into a specific vendor;  integrate easily with other systems.

2) SaaS-based Cloud Computing:

Delivers a quantum leap in cost-effectiveness without compromising on scalability.

In your example, I assume that you’d be able to score your 50,000 leads in one hour using one ADAPA engine on Amazon.  Therefore, you could choose to either spend US$100,000 or more on hardware, software, maintenance, IT services, etc., write a project proposal, get it approved by management, and be ready to score your model in 6-12 months…

OR, you could use ADAPA at something around US$1-$2 per day for the scenario above and get started today!  To get my point across here, I am of course simplifying the scenario a little bit, but in essence these are your choices.
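
To put rough numbers on that comparison, here is a minimal Python sketch. It uses only the figures quoted above (US$100,000 upfront versus about US$1-$2 per day) and ignores maintenance, staffing, and price changes; the function and constant names are made up for illustration.

```python
# Back-of-the-envelope comparison using the figures quoted in the interview.
# All numbers are illustrative; real pricing and throughput will differ.

UPFRONT_COST_USD = 100_000     # "US$100,000 or more" for an in-house setup
CLOUD_COST_PER_DAY_USD = 2.0   # upper end of "US$1-$2 per day"


def cloud_cost(days, cost_per_day=CLOUD_COST_PER_DAY_USD):
    """Total pay-as-you-go cost after the given number of scoring days."""
    return days * cost_per_day


def break_even_days(upfront=UPFRONT_COST_USD, cost_per_day=CLOUD_COST_PER_DAY_USD):
    """Days of daily scoring before cloud spend equals the upfront spend."""
    return upfront / cost_per_day


if __name__ == "__main__":
    print(f"One year of daily scoring in the cloud: ${cloud_cost(365):,.0f}")
    print(f"Break-even with the upfront option after {break_even_days():,.0f} days")
```

Even at the upper end of the quoted daily price, the pay-as-you-go total stays far below the upfront figure for any realistic planning horizon, which is the point Michael is making.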

Sounds too good to be true?  We often get this response, so please feel free to contact us today [http://www.zementis.com/contact.htm] and we will be happy to show you how easy it can be to deploy predictive models with ADAPA!
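
Coming back to the case study in the question, picking the top 5 deciles from a day's scored leads is the easy part once the scores exist. A minimal sketch in Python, assuming each lead comes back from the scoring step as a hypothetical (lead_id, score) pair; this is not ADAPA's API, just plain list handling:

```python
def top_deciles(scored_leads, deciles=5):
    """Return the leads in the top `deciles` tenths, highest score first.

    scored_leads: iterable of (lead_id, score) pairs (a made-up format).
    """
    if not 0 <= deciles <= 10:
        raise ValueError("deciles must be between 0 and 10")
    # Rank all leads from highest to lowest score.
    ranked = sorted(scored_leads, key=lambda pair: pair[1], reverse=True)
    # Each decile holds one tenth of the leads (rounded down overall).
    cutoff = len(ranked) * deciles // 10
    return ranked[:cutoff]


if __name__ == "__main__":
    # Ten toy leads with scores 0.0 to 0.9.
    leads = [(lead_id, lead_id / 10) for lead_id in range(10)]
    print(top_deciles(leads))
```

With 50,000 leads a day, the top 5 deciles would be the 25,000 highest-scoring leads.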

 

Ajay- The ADAPA solution seems to save money on both hardware and software costs. Please comment. Also, have you done any benchmarking tests of a traditional scoring configuration versus ADAPA?

Michael- Absolutely, the ADAPA Predictive Analytics Edition [http://www.zementis.com/predictive_analytics_edition.htm] on Amazon’s cloud computing infrastructure (Amazon EC2) eliminates the upfront investment in hardware and software.  It is a true Software as a Service (SaaS) offering on Amazon EC2 [http://www.zementis.com/howtobuy.htm] whereby users only pay for the actual machine time starting at less than US$1 per machine hour.  The ADAPA SaaS model is extremely dynamic, e.g., a user is able to select an instance type most appropriate for the job at hand (small, large, x-large) or launch one or even 100 instances within minutes.

In addition to the above savings in hardware/software, ADAPA also cuts the time-to-market for new models (priceless!) which adds to business agility, something truly critical for the current economic climate.

Regarding a benchmark comparison, it really depends on what is most important to the business.  Business agility, time-to-market, open standards for integration, or pure scoring performance?  ADAPA addresses all of the above.  At its core, it is a highly scalable scoring engine which is able to process thousands of transactions per second.  To tackle even the largest problems, it is easy to scale ADAPA via more CPUs, clustering, or parallel execution on multiple independent instances. 

Need to score lots of data once a month which would take 100 hours on one computer?  Simply launch 10 instances and complete the job in 10 hours overnight.  No extra software licenses, no extra hardware to buy: that’s capacity truly on demand, whenever needed, and cost-effective.
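
The arithmetic behind that claim can be sketched as follows, assuming the job splits evenly across instances and that each started hour on each instance is billed in full. Both are assumptions for illustration, as is the US$1 hourly rate; `batch_plan` is a made-up helper, not part of any product.

```python
import math


def batch_plan(total_machine_hours, instances, rate_per_hour):
    """Return (wall_clock_hours, total_cost) for an evenly split batch job.

    Assumes perfect parallel scaling and full-hour billing per instance.
    """
    if instances < 1:
        raise ValueError("instances must be at least 1")
    wall_clock = total_machine_hours / instances
    # Every instance runs for the same time; round up to whole billed hours.
    billed_hours = math.ceil(wall_clock) * instances
    return wall_clock, billed_hours * rate_per_hour


if __name__ == "__main__":
    for n in (1, 10):
        hours, cost = batch_plan(100, n, 1.0)
        print(f"{n} instance(s): {hours:.0f} hours wall clock, ${cost:.2f}")
```

Under these assumptions, one instance and ten instances both come to 100 billed machine hours; only the wall-clock time changes, from 100 hours to 10.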

Ajay- What has been your vision for Zementis? What exciting products are we going to see from it next?

Michael – Our vision at Zementis [http://www.zementis.com] has been to make it easier for users to leverage analytics.  The primary focus of our products is on the deployment side, i.e., how to integrate predictive models into the business process and leverage them in real-time.  The complexity of deployment and the cost associated with it have been the main hurdle for a more widespread adoption of predictive analytics.

Adhering to open standards like the Predictive Model Markup Language (PMML) [http://www.dmg.org/] and SOA-based integration, our ADAPA engine [http://www.zementis.com/products.htm] paves the way for new use cases of predictive analytics — wherever a painless, fast production deployment of models is critical or where the cost of real-time scoring has been prohibitive to date.

We will continue to contribute to the R/PMML export package [http://www.zementis.com/pmml_exporters.htm] and extend our free PMML converter [http://www.zementis.com/pmml_converters.htm] to support the adoption of the standard.  We believe that the analytics industry will benefit from open standards and we are just beginning to grasp what data-driven decision technology can do for us.  Without giving away much of our roadmap, please stay tuned for more exciting products that will make it easier for businesses to leverage the power of predictive analytics!

Ajay- Any India- or Asia-specific plans for Zementis?

Michael- Zementis already serves customers in the Asia/Pacific region from its office in Hong Kong.  We expect rapid growth for predictive analytics in the region and we think our cost-effective SaaS solution on Amazon EC2 will be of great service to this market.  I could see various analytics outsourcing and consulting firms benefit from using ADAPA as their primary delivery mechanism to provide clients with predictive models that are ready to be executed on-demand.

Ajay- What do you believe will be the biggest challenges for analytics in 2009? What are the biggest opportunities?

Michael- The biggest challenge for analytics will most likely be the reduction in technology spending in a deep, global recession.  At the same time, companies must take advantage of analytics to cut costs, optimize processes, and become more competitive.  Therefore, the biggest opportunity for analytics will be in the SaaS field, enabling clients to employ analytics without upfront capital expenditures.

Ajay – What made you choose a career in science? Describe your journey so far. What would your advice be to young science graduates in these recessionary times?

Michael- As a physicist, my research focused on neural networks and intelligent systems.  Predictive analytics is a great way for me to stay close to science while applying such complex algorithms to solve real business problems.  Even in a recession, there is always a need for good people with the desire to excel in their profession.  Starting your career, I’d say the best way is to remain broad in expertise rather than being too specialized on one particular industry or proficient in a single analytics tool.  A good foundation of math and computer science, combined with curiosity in how to apply analytics to specific business problems will provide opportunities, even in the current economic climate.

About Zementis

Zementis, Inc. is a software company focused on predictive analytics and advanced Enterprise Decision Management technology. We combine science and software to create superior business and industrial solutions for our clients. Our scientific expertise includes statistical algorithms, machine learning, neural networks, and intelligent systems and our scientists have a proven record in producing effective predictive models to extract hidden patterns from a variety of data types. It is complemented by our product offering ADAPA®, a decision engine framework for real-time execution of predictive models and rules. For more information please visit www.zementis.com

Ajay- If you have a lot of data (GBs and GBs) and an existing model (in SAS, SPSS, or R) that you have converted to PMML, and it is time to choose between spending more money to upgrade your hardware and renew your software licenses, take a look instead at ADAPA from www.zementis.com and score models for as little as $1 per hour. Check it out (test and control!!).

Do you have any additional queries for Michael? Use the comments page to ask.

SPSS and R

I rarely use SPSS now, but in college (www.iiml.ac.in) my marketing professors kind of ensured I was buried in it for weeks. Much later I did some ARIMA forecasting in SPSS to predict macroeconomic indicators (details coming up).

 

However, the SPSS help list (SPSSX-L@LISTSERV.UGA.EDU) is a great one, not just for staying in touch with SPSS but also for keeping up with the latest statistical modeling techniques. Here is an extract from the list (www.listserv.uga.edu/archives/spssx-l.html) on using SPSS and R together:

 

Assuming version 16 or later, you need to install the R plug-in from Developer Central.  Then your R syntax can be run in the syntax window between

BEGIN PROGRAM R.

and

END PROGRAM R.

The output automatically appears in the SPSS Viewer with two cautions.  1) In version 16, R graphics are written to files and don’t appear in the Viewer.  Version 17 integrates the graphics directly.  2) When using R interactively, expression output appears in your console windows, e.g.,

summary(dta)

displays the summary statistics for a data frame, dta.  In non-interactive mode, which is what you are in when running BEGIN PROGRAM, you need to enclose the expression in a print function for it to display, e.g.,

print(summary(dta))

The documentation for the APIs to communicate between SPSS and R is installed along with the plug-in, and there are examples in the Data Management book linked on Developer Central (www.spss.com/devcentral).

You might also go through the PowerPoint article on Developer Central, "Programmability in SPSS Statistics 17", which you will find on the front page of the site.  It includes a detailed example of using the R Quantreg package in SPSS as an extension command.  There is also a download in the R section on creating an SPSS dialog box that generates an R program directly.  Look for Rboxplot – Creating an R Program from a Dialog.  This has a simple dialog box that generates code for an R boxplot along with an article that explains what is happening.

 

Ajay’s 2 cents: SPSS treats R as an opportunity rather than a threat, partly because SPSS is much lower-priced software and has been working in vain to displace SAS for some time now.

SAS (the company, not the language), as the market leader, has the most to lose due to

  • its high market share (which it has maintained both by aggressively pursuing legal action and by investing huge amounts of money in hosting conferences, papers, and research, and in keeping alumni and current employees happy and loyal),

and

  • premium pricing (which comes under greater pressure amid a general economic downturn among its preferred customers, especially banks and companies like Amazon, GE Money, etc.),

and

  • multi-pronged competition with tacit support from bigger players waiting on the sidelines:
      • IBM has an alliance with WPS, which is almost a de facto Base SAS clone, as it can take in SAS datasets and SAS code and output SAS code and SAS datasets, besides having its own Eclipse-based design for the Workbench;
      • Microsoft is expanding data mining capabilities in SQL Server and initiatives like Microsoft Azure (an OS for cloud computers) and Microsoft Mesh;
      • open source players like R, KNIME, and Rapid Miner are gaining commercial momentum due to better value for cost (0),

and

  • model portability between SAS, SPSS, and R via the PMML standard, which means switching barriers are getting lower. There are almost no switching barriers between Base SAS and WPS in my testing experience.

The coming market share battles between SAS, WPS, and R will be interesting to watch for analysts and customers, that is, if the current economic crisis doesn’t claim any of the companies or the clients first. Alliances as well as community networking among users and developers could be critical.

Still, innovation flows from creative destruction of old ideas, mindsets, attitudes, and yes, even software.

SAS-L: Get Out the Vote

Voting is still on for SAS-L Rookie of the Year.

One of the earliest rookies nominated from India is, ahem, me, which is pretty rare.

Name               # of 2007 posts   # of 2008 posts
Ajay Ohri                        0               351
Joe Matise                       0               250
Karma Dorjetarap                 4               219
Akshaya Nathilvar                0               169
Scott Bucher                     0               154
Stefan Pohl                      5               148
Richard Wright                   0                79
Choy Junyu                       0                73
Scott Raynaud                    0                69
SAS 9 BI USER                    0                55
Tom Smith                        0                53
Jim Agnew                        0                52
Josh Lee                         0                51

 

If you are on the SAS-L list, you can vote (one vote per person, please) at:
http://ires.ku.edu/~ipsr/SGF2009/saslbof.htm
Voting will end February 12th.

KNIME

Check out KNIME from www.knime.org if you are looking for modular data extensibility and the ability to do exploratory analysis. You can use it for data modeling using decision trees and extend it further. Thanks to Bob Schultz and the REVolution Computing guys, as well as Mike Zeller of www.zementis.com, for pointing me this way.

  • Yes, they (KNIME.org) have a commercial version as well as a free version.
  • No, they won’t charge you in hidden costs, including training or learning time.
  • Yes, they do use PMML for porting models between platforms.
  • Best of all, it is a great website with video tutorials and segmented data downloads (German efficiency!!), while the www.r-project.org website is functional but uses HTML 4.0 (which is from the nineties, or whatever).
  • No, they won’t charge you for it!!!

From the website:

Welcome!

KNIME, pronounced [naim], is a modular data exploration platform that enables the user to visually create data flows (often referred to as pipelines), selectively execute some or all analysis steps, and later investigate the results through interactive views on data and models.

KNIME was developed (and will continue to be expanded) by the Chair for Bioinformatics and Information Mining at the University of Konstanz, Germany. The group headed by Michael Berthold is also using KNIME for teaching and research at the University. Quite a number of new data analysis methods developed at the chair are integrated in KNIME. Let us know if you are looking for something in particular, not all of those modules are part of the standard KNIME release just yet…

KNIME base version already incorporates over 100 processing nodes for data I/O, preprocessing and cleansing, modelling, analysis and data mining as well as various interactive views, such as scatter plots, parallel coordinates and others. It integrates all analysis modules of the well known Weka data mining environment and additional plugins allow R-scripts to be run, offering access to a vast library of statistical routines.

KNIME is based on the Eclipse platform and, through its modular API, easily extensible.

image

image

Coming up: a technical comparison of KNIME with Rapid Miner (http://rapid-i.com/content/view/26/82/), which is similar in both its free version and its commercial licensing. These are both data mining rather than predictive analytics products.

I wish they would host both KNIME and Rapid Miner on the cloud using the Ohri Framework (a joke which began on the SAS Consulting Group) on a 64-bit Windows OS, with remote-desktop-like functionality. Then I could just log in, upload the data, press a button, wait a second, and download the results.

Sigh!!

(All screenshots in this post are courtesy of www.knime.org.)

Trusting Google

If Google is to be believed, the error was a human one in their bad-site list: someone entered “/” as a bad URL. This led to all sites being flagged as malware.

When that happened, users, over a time sample of about 40 minutes, did one of the following:

1) Went ahead and clicked on sites they knew were okay

2) Wrote to Google about the error

3) Clicked on some sites but not on less trustworthy ones

The data collected from that sample is now being studied by Google. Why are they studying it? Because that click data, including time of search, time of clicks, and frequency of repeated searches, can in some way lead to a ranking system for flagging malware sites that use popular keywords through the AdWords system and serve newly discovered viruses (including the ones that turn computers into dummy bots).

Has AdWords been corrupted? Can AdWords be infiltrated? Would Google tell us, or try to fix the problem and then tell us?

As Andy Grove said, “Only the paranoid survive.” Store all the information from your Google searches, your Google Analytics data, your Orkut, your emails, and your YouTube for the last nine months, and anyone can have a pretty fair idea of what work, play, and hobbies you are up to. Remember, click fraud makes money for Google too, and even a 1% increase in click fraud rates increases Google’s quarterly earnings.

I trust Google and the “Don’t be evil” philosophy. But the philosophy and an apology cannot be the only safeguards for the privacy of billions of humans.

 

We trust God. Everyone else has to bring data. Even Google. But guess what: Google won’t share the data, even on how they build the Chinese walls between commercial ads and search results. That’s more like a closed-source route, isn’t it?

Don’t worry. Just trust Google.