Interview: David Smith, REvolution Computing

Here is an interview with REvolution Computing’s Director of Community, David Smith.

“Our development team spent more than six months making R work on 64-bit Windows (and optimizing it for speed), which we released as REvolution R Enterprise bundled with ParallelR.” - David Smith

Ajay- Tell us about your journey in science. In particular, tell us what attracted you to R and the open source movement.

David- I got my start in science in 1990 working with CSIRO (the government science organization in Australia) after I completed my degree in mathematics and computer science. Seeing the diversity of projects the statisticians there worked on really opened my eyes to statistics as the way of objectively answering questions about science.

That’s also when I was first introduced to the S language, the forerunner of R. I was hooked immediately; it was just so natural for doing the work I had to do. I also had the benefit of a wonderful mentor, Professor Bill Venables, who at the time was teaching S to CSIRO scientists at remote stations around Australia. He brought me along on his travels as an assistant. I learned a lot about the practice of statistical computing helping those scientists solve their problems (and got to visit some great parts of Australia, too).

Ajay- How do you think we should help bring more students to the fields of mathematics and science?

David- For me, statistics is the practical application of mathematics to the real world of messy data, complex problems and difficult conclusions. And in recent years, lots of statistical problems have broken out of geeky science applications to become truly mainstream, even sexy. In our new information society, graduating statisticians have a bright future ahead of them which I think will inevitably draw more students to the field.

Ajay- Your blog at REvolution Computing is one of the best technical corporate blogs. In particular, the monthly round-ups of new packages, R events and product launches are all written in a lucid style. Are there any plans for a REvolution Computing community or network as well, instead of just the blog?

David- Yes, definitely. We recently hired Danese Cooper as our Open Source Diva to help us in this area. Danese has a wealth of experience building open-source communities, such as for Java at Sun. We’ll be announcing some new community initiatives this summer. In the meantime, of course, we’ll continue with the Revolutions blog, which has proven to be a great vehicle for getting the word out about R to a community that hasn’t heard about it before. Thanks for the kind words about the blog, by the way — it’s been a lot of fun to write. It will be a continuing part of our community strategy, and I even plan to expand the roster of authors in the future, too. (If you’re an aspiring R blogger, please get in touch!)

Ajay- I kind of get confused between what exactly is 32-bit or 64-bit computing in terms of hardware and software. What is the deal there? How do Enterprise solutions from REvolution take care of 64-bit computing? And how exactly do parallel computing and optimized math libraries in REvolution R help, compared to other flavors of R?

David- Fundamentally, 64-bit systems allow you to process larger data sets with R — as long as you have a version of R compiled to take advantage of the increased memory available. (I wrote about some of the technical details behind this recently on the blog.) One of the really exciting trends I’ve noticed over the past six months is that R is being applied to larger and more complex problems in areas like predictive analytics and social networking data, so being able to process the largest data sets is key.
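(As a quick aside for readers unsure which kind of system they are on: on Linux you can check whether your kernel and userland are 64-bit from the shell. This is a generic sketch, not part of REvolution's tooling.)

```shell
# Check whether the OS and userland are 64-bit (Linux)
uname -m          # e.g. x86_64 indicates a 64-bit kernel
getconf LONG_BIT  # prints 64 on a 64-bit userland, 32 on a 32-bit one
```

Remember that a 64-bit OS alone is not enough; the R binary itself must be compiled for 64-bit to use the extra address space.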

One common misperception is that 64-bit systems are inherently faster than their 32-bit equivalents, but this isn’t generally the case. To speed up large problems, the best approach is to break the problem down into smaller components and run them in parallel on multiple machines. We created the ParallelR suite of packages to make it easy to break down such problems in R and run them on a multiprocessor workstation, a local cluster or grid, or even cloud computing systems like Amazon’s EC2.
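(The break-down-and-run-in-parallel idea can be sketched at the shell level. This illustrates the general pattern only — it is not ParallelR’s own API; each "chunk" stands in for a real unit of work.)

```shell
# Split a job into 8 chunks and run up to 4 of them at a time in parallel.
# xargs -P sets the maximum number of concurrent processes.
seq 1 8 | xargs -P 4 -I{} sh -c 'echo "processing chunk {}"'
```

The same pattern — partition the data, run the pieces concurrently, combine the results — is what parallel frameworks automate across multiple machines.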

“While the core R team produces versions of R for 64-bit Linux systems, they don’t make one for Windows. Our development team spent more than six months making R work on 64-bit Windows (and optimizing it for speed), which we released as REvolution R Enterprise bundled with ParallelR. We’re excited by the scale of the applications our subscribers are already tackling with a combination of 64-bit and parallel computing.”

Ajay- Command line is oh so commanding. Please describe any plans to support or help any R GUI like Rattle or R Commander. Do you think REvolution R can get more users if it supports a GUI?

David- Right now we’re focusing on making R easier to use for programmers by creating a new GUI for programming and debugging R code. We heard feedback from some clients who were concerned about training their programmers in R without a modern development environment available. So we’re addressing that by improving R to make the “standard” features programmers expect (like step debugging and variable inspection) work in R and integrating it with the standard environment for programmers on Windows, Visual Studio.

In my opinion, R’s strength lies in its combination of high-quality statistical algorithms with a language ideal for applying them, so “hiding” the language behind a general-purpose GUI negates that strength a bit, I think. On the other hand, it would be nice to have an open-source “user-friendly” tool for desktop statistical analysis, so I’m glad others are working to extend R in that area.

Ajay- Companies like SAS are investing in SaaS and cloud computing. Zementis offers scored models on the cloud through PMML. Any views on just building the model or analytics on the cloud itself?

David- To me, cloud computing is a cost-effective way of dynamically scaling hardware to the problem at hand. Not everyone has access to a 20-machine cluster for high-performance computing — and even those that do can’t instantly convert it to a cluster of 100 or 1000 machines to satisfy a sudden spike in demand. REvolution R Enterprise with ParallelR is unique in that it provides a platform for creating sophisticated data analysis applications distributed in the cloud, quickly and easily.

Using clouds for building models is a no-brainer for parallel-computing problems: I recently wrote about how parallel backtesting for financial trading can easily be deployed on Amazon EC2, for example. PMML is a great way of deploying static models, but one of the big advantages of cloud computing is that it makes it possible to update your model much more frequently, to keep your predictions in tune with the latest source data.

Ajay- What are the major alliances that REvolution has in the industry?

David- We have a number of industry partners. Microsoft and Intel, in particular, provide financial and technical support allowing us to really strengthen and optimize R on Windows, a platform that has been somewhat underserved by the open-source community. With Sybase, we’ve been working on combining REvolution R and Sybase RAP to produce some exciting advances in financial risk analytics. Similarly, we’ve been doing work with Vhayu’s Velocity database to provide high-performance data extraction. On the life sciences front, Pfizer is not only a valued client but in many ways a partner who has helped us “road-test” commercial-grade R deployment with great success.

Ajay- What are the major R packages that REvolution supports and optimizes and how exactly do they work/help?

David- REvolution R works with all the R packages: in fact, we provide a mirror of CRAN so our subscribers have access to the truly amazing breadth and depth of analytic and graphical methods available in third-party R packages. Those packages that perform intensive mathematical calculations automatically benefit from the optimized math libraries that we incorporate in REvolution R Enterprise. In the future, we plan to work with the authors of some key packages to provide further improvements — in particular, to make packages work with ParallelR to reduce computation times in multiprocessor or cloud computing environments.

Ajay- Are you planning to lay off people during the recession? Does REvolution Computing offer internships to college graduates? What do people at REvolution Computing do to have fun?

David- On the contrary, we’ve been hiring recently. We don’t have an intern program in place just yet, though. For me, it’s been a really fun place to work. Working for an open-source company has a different vibe than the commercial software companies I’ve worked for before. The most fun for me has been meeting with R users around the country and sharing stories about how R is really making a difference in so many different venues — over a few beers of course!


David Smith
Director of Community

David has a long history with the statistical community. After graduating with a degree in Statistics from the University of Adelaide, South Australia, David spent four years researching statistical methodology at Lancaster University (United Kingdom), where he also developed a number of packages for the S-PLUS statistical modeling environment. David continued his association with S-PLUS at Insightful (now TIBCO Spotfire) where for more than eight years he oversaw the product management of S-PLUS and other statistical and data mining products. David is the co-author (with Bill Venables) of the tutorial manual, An Introduction to R, and one of the originating developers of ESS: Emacs Speaks Statistics. Prior to joining REvolution, David was Vice President, Product Management at Zynchros, Inc.

Ajay- To know more about David Smith and REvolution Computing, do visit http://www.revolution-computing.com and

http://www.blog.revolution-computing.com
Also see the interview with Richard Schultz, CEO of REvolution Computing, here:

http://www.decisionstats.com/2009/01/31/interviewrichard-schultz-ceo-revolution-computing/

Google Custom Search

Here is a revised version of the Custom Search Engine that I first talked about last year; this year it also includes Business Intelligence sites.

Try it out and let me know if you want to help create a customized Data Mining Engine. Note: it already has 800-plus analytics and Business Intelligence sites.

I got much better results than Google when searching for R, but that’s to be expected 🙂

Building KXEN Models on Ubuntu

Doing analytics on Linux sometimes seems user-unfriendly, but the reality is that it is not so. It is actually cheaper for you, as you can focus your budget on the analytical software rather than on operating system licensing costs.

(Screenshots: KXEN running on Ubuntu with KVM virtualization)

Note: The software used here was KXEN Linux version 2.4 and Ubuntu Hardy Heron.

Using KXEN on Ubuntu Linux proved surprisingly easy. Thanks to some excellent help from the KXEN support team, and some discussions with KXEN’s head of research, Bertrand, the following five-step procedure should help you start building models in KXEN right away.

Using Ubuntu has the added advantages of security and low cost, as well as all the ease of a graphical user interface.

1) Backward Compatibility

$ sudo apt-get install libstdc++5

(It will then ask for your password.)

2) Installing Java

$ sudo apt-get install sun-java6-jre

3) Downloading and Unzipping the Software
Download the zipped folder from the KXEN download site.
Unzip the Linux version of the KXEN system. This creates the master folder (for example, Kxen_X86-Linux-2.4.21-4.Elsmp_v5_0_3).

4) Licensing
Run the KXEN Node Generator in the KxNodeCodeGenerator folder within the master folder above.
The new file, KXEN Node.txt, is then sent back to the support team, and they send back the License_nl.cfg file.

5) Installation (for a standalone client)

Install a JVM 1.4.2 and add its java/bin directory to the $PATH environment variable.

The executable is located in the KJWizardJNI folder. Run the following commands:
$ cd KJWizardJNI
$ PATH=/opt/j2sdk1.4.2_10/bin:$PATH ; export PATH
$ ./KJWizardJNI.sh

Run KXEN models happily ever after!

Note: KXEN offers the ability to export models in a variety of formats, including PMML, SAS, SQL and other languages.

Disclaimer: I am a social media consultant to KXEN.


saP or saS or sasR or saaS

Some pending news and posts: it appears that the company SAP is moving closer to major acquisitions. This includes launching more and more applications that are analytical in nature, as well as coming together in an alliance with hardware major Teradata. Teradata, of course, is a very close partner of the SAS Institute. So could SAP, SAS and/or Teradata be moving closer to a major announcement on BI and BA merging?

The open source database movement around Hadoop is the one that could be the real game-changer in the managed database industry, and AsterData is the company to watch here.

However, R with its modular extensions is a different paradigm in language development, and SAS no longer has the nimbleness or flexibility to create such apps. At the same time, it has lost a fair deal of credibility in young academia (due to R) as well as among cost-sensitive consumers (due to WPS).

The succession issue of Jim Goodnight continues to be the biggest problem for the SAS Institute. Jim is not getting younger, and his second line is not expected to be of the same class as the Sall/Goodnight partnership. Of all the major companies in software, Jim Goodnight stood alone in remaining private, and thus managed to escape the distractions of share prices while building up the franchise. Surviving oil shocks, cold wars and three recessions, Mr Goodnight has cared for his local community as well, despite being active in SAS and fending off sustained competition from open source languages.

An automatic partner for Mr Goodnight should have been Google, or even Google Labs, with the Brin/Page duo being the top data miners (commercially) of this generation, as Sall/Goodnight were 30 years ago.

SAP may spend a lot of its cash, but the supply chain paradigm is best served by SaaS, as exemplified by Salesforce.com and Force.com developers.

As the ancient Chinese saying goes: may you live in interesting times.

SPSS launches two more PASWs

Just got news from the Chicago school of analytics, otherwise known as SPSS. They have decided to launch two more PASW products, as you can see from the release itself.

SPSS Inc. and the value of Predictive Analytics.

This week we announced PASW Data Collection 5.6 feedback management and survey research software, and PASW Collaboration & Deployment Services 4, our integrated platform to share, manage, automate and integrate analytic assets directly into business processes.

PASW Data Collection 5.6 (formerly Dimensions)

* The use of surveys to capture “Voice of the Customer” across multiple touch-points is integral to bringing data about people’s attitudes into analytical decision-making to improve customer intimacy.
* PASW Data Collection 5.6 supports the entire survey lifecycle — from authoring to managing the data collection process to survey reporting and analysis — supporting global, multichannel research and feedback collection.
* New functionality includes data entry capabilities, an enhanced authoring interface suitable for the novice and the research professional, and new phone-based interviewing capabilities designed to shape the modern survey research call center. This release also further extends the enterprise readiness of the data collection platform with enhancements to performance and security.

You can read the press release at http://www.spss.com/press/template_view.cfm?PR_ID=1088

PASW Collaboration and Deployment Services 4 (formerly Predictive Enterprise Services)

* The platform automates analytical processes for greater consistency and control, and deploys results to business users, consumers or directly into operational systems to reduce customer churn, improve marketing campaigns or identify cases of fraud.
* PASW Collaboration and Deployment Services 4 provides the foundation to integrate analytics into key business processes, so the right decisions are made and the best actions are taken on a consistent, repeatable basis.
* New functionality includes enhanced collaboration capabilities that provide more options for publishing analytical results; enhancements to the Automation Service with additional integration options; and a Real-time Scoring Service to deploy analytical scores into existing applications.

You can read the full press release at http://www.spss.com/press/template_view.cfm?PR_ID=1087