Interview Dan Steinberg Founder Salford Systems

Here is an interview with Dan Steinberg, Founder and President of Salford Systems (http://www.salford-systems.com/)

Ajay- Describe your journey from academia to technology entrepreneurship. What are the key milestones or turning points that you remember?

 Dan- When I was in graduate school studying econometrics at Harvard,  a number of distinguished professors at Harvard (and MIT) were actively involved in substantial real world activities.  Professors that I interacted with, or studied with, or whose software I used became involved in the creation of such companies as Sun Microsystems, Data Resources, Inc. or were heavily involved in business consulting through their own companies or other influential consultants.  Some not involved in private sector consulting took on substantial roles in government such as membership on the President’s Council of Economic Advisors. The atmosphere was one that encouraged free movement between academia and the private sector so the idea of forming a consulting and software company was quite natural and did not seem in any way inconsistent with being devoted to the advancement of science.

Ajay- What are the latest products by Salford Systems? Are there any future product plans or modifications to work on Big Data analytics, mobile computing and cloud computing?

 Dan- Our central set of data mining technologies are CART, MARS, TreeNet, RandomForests, and PRIM, and we have always maintained feature rich logistic regression and linear regression modules. In our latest release scheduled for January 2012 we will be including a new data mining approach to linear and logistic regression allowing for the rapid processing of massive numbers of predictors (e.g., one million columns), with powerful predictor selection and coefficient shrinkage. The new methods allow not only classic techniques such as ridge and lasso regression, but also sub-lasso model sizes. Clear tradeoff diagrams between model complexity (number of predictors) and predictive accuracy allow the modeler to select an ideal balance suitable for their requirements.

The new version of our data mining suite, Salford Predictive Modeler (SPM), also includes two important extensions to the boosted tree technology at the heart of TreeNet. The first, Importance Sampled Learning Ensembles (ISLE), is used for the compression of TreeNet tree ensembles. Starting with, say, a 1,000 tree ensemble, the ISLE compression might well reduce this down to 200 reweighted trees. Such compression will be valuable when models need to be executed in real time. The compression rate is always under the modeler’s control, meaning that if a deployed model may only contain, say, 30 trees, then the compression will deliver an optimal 30-tree weighted ensemble. Needless to say, compression of tree ensembles should be expected to be lossy and how much accuracy is lost when extreme compression is desired will vary from case to case. Prior to ISLE, practitioners have simply truncated the ensemble to the maximum allowable size. The new methodology will substantially outperform truncation.

The second major advance is RULEFIT, a rule extraction engine that starts with a TreeNet model and decomposes it into the most interesting and predictive rules. RULEFIT is also a tree ensemble post-processor and offers the possibility of improving on the original TreeNet predictive performance. One can think of the rule extraction as an alternative way to explain and interpret an otherwise complex multi-tree model. The rules extracted are similar conceptually to the terminal nodes of a CART tree but the various rules will not refer to mutually exclusive regions of the data.
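As a side note on the ridge and lasso regression Dan mentions above, here is a minimal sketch of how the two kinds of coefficient shrinkage differ. It is not Salford's implementation. It assumes the textbook special case of an orthonormal design (X'X = I) with the objectives written in the comments, where both estimators have a closed form in terms of the ordinary least squares coefficients. The function names and the sample coefficients are made up for illustration.

```python
# Illustrative sketch only (not Salford's code). Assumes an orthonormal design,
# X'X = I, so both estimators have closed forms in the OLS coefficients b:
#   ridge: minimize 0.5*||y - Xb||^2 + 0.5*lam*||b||^2  ->  b / (1 + lam)
#   lasso: minimize 0.5*||y - Xb||^2 + lam*||b||_1      ->  soft-threshold(b, lam)

def ridge_shrink(beta_ols, lam):
    """Ridge scales every coefficient toward zero; none becomes exactly zero."""
    return [b / (1.0 + lam) for b in beta_ols]

def lasso_shrink(beta_ols, lam):
    """Lasso soft-thresholds; coefficients with |b| <= lam become exactly zero."""
    shrunk = []
    for b in beta_ols:
        magnitude = abs(b) - lam
        if magnitude <= 0.0:
            shrunk.append(0.0)          # predictor dropped from the model
        else:
            shrunk.append(magnitude if b > 0 else -magnitude)
    return shrunk

def model_size(beta):
    """Number of predictors with a non-zero coefficient."""
    return sum(1 for b in beta if b != 0.0)

# Hypothetical OLS coefficients for five predictors.
beta_ols = [3.0, -1.5, 0.4, -0.2, 0.05]

for lam in (0.0, 0.1, 0.5, 2.0):
    ridge = ridge_shrink(beta_ols, lam)
    lasso = lasso_shrink(beta_ols, lam)
    print(f"lam={lam}: ridge keeps {model_size(ridge)} predictors, "
          f"lasso keeps {model_size(lasso)}")
```

In this toy setting ridge keeps all five predictors at every penalty level, while the lasso model gets smaller as the penalty grows. That is the complexity-versus-accuracy tradeoff described in the answer above, in its simplest form.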

Ajay- You have led teams that have won multiple data mining competitions. What are some of your favorite techniques or approaches to a data mining problem?

 Dan- We only enter competitions involving problems for which our technology is suitable, generally, classification and regression. In these areas, we are  partial to TreeNet because it is such a capable and robust learning machine. However, we always find great value in analyzing many aspects of a data set with CART, especially when we require a compact and easy to understand story about the data. CART is exceptionally well suited to the discovery of errors in data, often revealing errors created by the competition organizers themselves. More than once, our reports of data problems have been responsible for the competition organizer’s decision to issue a corrected version of the data and we have been the only group to discover the problem.

In general, tackling a data mining competition is no different than tackling any analytical challenge. You must start with a solid conceptual grasp of the problem and the actual objectives, and the nature and limitations of the data. Following that comes feature extraction, the selection of a modeling strategy (or strategies), and then extensive experimentation to learn what works best.

Ajay- I know you have created your own software. But is there other software that you use or like to use?

 Dan- For analytics we frequently test open source software to make sure that our tools will in fact deliver the superior performance we advertise. In general, if a problem clearly requires technology other than that offered by Salford, we advise clients to seek other consultants expert in that other technology.

Ajay- Your software is installed at 3500 sites including 400 universities, as per http://www.salford-systems.com/company/aboutus/index.html . What is the key to managing and keeping so many customers happy?

Dan- First, we have taken great pains to make our software reliable and we make every effort to avoid problems related to bugs. Our testing procedures are extensive and we have experts dedicated to stress-testing software. Second, our interface is designed to be natural, intuitive, and easy to use, so the challenges to the new user are minimized. Also, clear documentation, help files, and training videos round out how we allow the user to look after themselves. Should a client need to contact us we try to achieve 24-hour turn around on tech support issues and monitor all tech support activity to ensure timeliness, accuracy, and helpfulness of our responses. WebEx/GotoMeeting and other internet based contact permit real time interaction.

 Ajay- What do you do to relax and unwind?

 Dan- I am in the gym almost every day combining weight and cardio training. No matter how tired I am before the workout I always come out energized so locating a good gym during my extensive travels is a must. I am also actively learning Portuguese so I look to watch a Brazilian TV show or Portuguese dubbed movie when I have time; I almost never watch any form of video unless it is available in Portuguese.

 Biography-

http://www.salford-systems.com/blog/dan-steinberg.html

Dan Steinberg, President and Founder of Salford Systems, is a well-respected member of the statistics and econometrics communities. In 1992, he developed the first PC-based implementation of the original CART procedure, working in concert with Leo Breiman, Richard Olshen, Charles Stone and Jerome Friedman. In addition, he has provided consulting services on a number of biomedical and market research projects, which have sparked further innovations in the CART program and methodology.

Dr. Steinberg received his Ph.D. in Economics from Harvard University, and has given full day presentations on data mining for the American Marketing Association, the Direct Marketing Association and the American Statistical Association. After earning a PhD in Econometrics at Harvard, Steinberg began his professional career as a Member of the Technical Staff at Bell Labs, Murray Hill, and then as Assistant Professor of Economics at the University of California, San Diego. A book he co-authored on Classification and Regression Trees was awarded the 1999 Nikkei Quality Control Literature Prize in Japan for excellence in statistical literature promoting the improvement of industrial quality control and management.

His consulting experience at Salford Systems has included complex modeling projects for major banks worldwide, including Citibank, Chase, American Express, Credit Suisse, and has included projects in Europe, Australia, New Zealand, Malaysia, Korea, Japan and Brazil. Steinberg led the teams that won first place awards in the KDDCup 2000, and the 2002 Duke/TeraData Churn modeling competition, and the teams that won awards in the PAKDD competitions of 2006 and 2007. He has published papers in economics, econometrics, computer science journals, and contributes actively to the ongoing research and development at Salford.

Do android hackers tweet about electric sheep?

Here is a very amusing site where a bunch of hackers discuss black hat techniques to game social media. They meet on the MJ website. LOL

That's actually the official MJ website. (Also see my poem on MJ at

https://decisionstats.com/2011/04/29/tribute-to-michael-jackson/

and https://decisionstats.com/2009/12/01/obama-and-mj-on-history/)

But back to the funny twitter gamers

http://www.michaeljackson.com/us/node/703109

MICHAEL JACKSON YOU ARE OVER THE STATUS UPDATE LIMIT. PLEASE WAIT A FEW HOURS AND TRY AGAIN.

Jim Goodnight on Open Source- and why he is right -sigh


Jim Goodnight – grand old man and Godfather of the Cosa Nostra of the BI/Database Analytics software industry – spoke recently about open source in BI (btw, R is generally classed as business analytics and NOT business intelligence software, so these remarks apply more to Pentaho and Jaspersoft).

Asked whether open source BI and data integration software from the likes of Jaspersoft, Pentaho and Talend is a growing threat, [Goodnight] said: “We haven’t noticed that a lot. Most of our companies need industrial strength software that has been tested, put through every possible scenario or failure to make sure everything works correctly.”

Quotes from Jim Goodnight are courtesy of Jason’s story here:
http://www.cbronline.com/news/sas-ceo-says-cep-open-source-and-cloud-bi-have-limited-appeal

and the Pentaho follow-up reaction is here

http://bi.cbronline.com/news/pentaho-fires-back-across-sas-bows-over-limited-open-source-appeal

 

 

While you can rage and screech- here is the reality in terms of market share-

From Merv Adrian’s excellent article on market shares in BI

http://www.enterpriseirregulars.com/22444/decoding-bi-market-share-numbers-%E2%80%93-play-sudoku-with-analysts/

The first, labeled BI Platforms, is drawn from Gartner Market Share Analysis: Business Intelligence, Analytics and Performance Management Software, Worldwide, 2009, published May 2010, and Gartner Dataquest Market Share: Business Intelligence, Analytics and Performance Management Software, Worldwide, 2009.

and the Advanced Analytics category.

So what is the performance of Talend, Pentaho and Jaspersoft?

From http://www.dbms2.com/category/products-and-vendors/talend/

It seems that Talend’s revenue was somewhat shy of $10 million in 2008.

and Talend itself says

http://www.talend.com/press/Talend-Announces-Record-2009-and-Continues-Growth-in-the-New-Year.php

Additional 2009 highlights include:

  • Achieved record revenue, more than doubling from 2008. The fourth quarter of 2009 was Talend’s tenth consecutive quarter of growth.
  • Grew customer base by 140% to over 1,000 customers, up from 420 at the end of 2008. Of these new customers, over 50% are Fortune 1000 companies.
  • Total downloads reached seven million, with over 300,000 users of the open source products.
  • Talend doubled its staff, increasing to 200 global employees. Continuing this trend, Talend has already hired 15 people in 2010 to support its rapid growth.
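A quick sanity check of the customer-growth bullet quoted above. This is only a sketch: the helper names are made up, and "over 1,000 customers" is treated as exactly 1,000, which makes it a lower bound.

```python
def percent_growth(old, new):
    """Percentage increase going from old to new."""
    return (new - old) / old * 100.0

def customers_needed(old, growth_pct):
    """Customer count that would correspond to a given percentage growth."""
    return old * (1.0 + growth_pct / 100.0)

customers_2008 = 420    # from the quoted press release
customers_2009 = 1000   # "over 1,000", so a lower bound

growth = percent_growth(customers_2008, customers_2009)
print(f"Growth from {customers_2008} to {customers_2009}: {growth:.1f}%")
print(f"Customers implied by 140% growth: {customers_needed(customers_2008, 140):.0f}")
```

Going from 420 to exactly 1,000 customers is about 138% growth, and 140% corresponds to roughly 1,008 customers, so the quoted figures are consistent with each other.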

now for Jaspersoft numbers

http://www.dbms2.com/2008/09/14/jaspersoft-numbers/

Highlights include:

  • Revenue run rate in the double-digit millions.
  • 40% sequential growth most recent quarter. (I didn’t ask whether there was any reason to suspect seasonality.)
  • 130% annual revenue growth run rate.
  • “Not quite” profitable.
  • Several hundred commercial subscribers, at an average of $25K annually per, including >100 in Europe.
  • 9,000 paying customers of some kind.
  • 100,000+ total deployments, “very conservatively,” counting OEMs as one deployment each and not double-counting for OEMs’ customers. (Nick said Business Objects quotes 45,000 deployments by the same standards.)
  • 70% of revenue from the mid-market, defined as $100 million – $1 billion revenue. 30% from bigger enterprises. (Hmm. That begs a couple of questions, such as where OEM revenue comes in, and whether <$100 million enterprises were truly a negligible part of revenue.)

and for Pentaho numbers-

http://www.dbms2.com/2009/01/27/introduction-to-pentaho/

and http://www.monash.com/uploads/Pentaho-January-2009.pdf

suggest they are far, far away from the top 5-6 vendors in BI

and a special mention for PostgreSQL, which is a non-profit project but is seriously denting Oracle/MySQL

http://www.postgresql.org/about/

Limit                         Value
Maximum Database Size         Unlimited
Maximum Table Size            32 TB
Maximum Row Size              1.6 TB
Maximum Field Size            1 GB
Maximum Rows per Table        Unlimited
Maximum Columns per Table     250 – 1600 depending on column types
Maximum Indexes per Table     Unlimited

and the leading vendor is EnterpriseDB, which is again an IBM partner as well as IBM-funded

http://www.sramanamitra.com/2009/05/18/enterprise-db/

and

http://www.enterprisedb.com/company/news_events/press_releases/2010_21.do

suggest it is still in early stages.

————————————————————–

So what do we conclude-

1) There is a complete lack of transparency in open source BI market shares as almost all these companies are privately held and do not disclose revenues.

2) What may look like a pure-play open source company may actually be a company funded by a big BI vendor: Revolution Analytics is funded by, among others, Intel-Microsoft, and EnterpriseDB has IBM as an investor. MySQL and Sun, of course, were bought by Oracle.

The degree of control by proprietary vendors over open source vendors is still not disclosed, nor is it clear whether they hold a stake for strategic reasons or otherwise.

3) None of the Open Source Vendors are even close to a 1 Billion dollar revenue number.

Jim Goodnight is pointing out market reality when he says he has not seen much impact (in terms of market share). As for the rest of his remarks, well, he’s got a job to do as CEO, and that’s to talk up his company and trash the competition, which he has been doing for 3 decades and is unlikely to change now unless there is severe market share impact. Unless, of course, you expect him to notice companies less than 5% of his size in revenue.


 

Libre Office

Some ambiguity about Libre Office and why it needed to change from Open Office- just when Open Office seemed so threatening on the desktop

FROM- http://www.documentfoundation.org/faq/

Q: So is this a breakaway project?

A: Not at all. The Document Foundation will continue to be focused on developing, supporting, and promoting the same software, and it’s very much business as usual. We are simply moving to a new and more appropriate organisational model for the next decade – a logical development from Sun’s inspirational launch a decade ago.

Q: Why are you calling yourselves “The Document Foundation”?

A: For ten years we have used the same name – “OpenOffice.org” – for both the Community and the software. We’ve decided it removes ambiguity to have a different name for the two, so the Community is now “The Document Foundation”, and the software “LibreOffice”. Note: there are other examples of this usage in the free software community – e.g. the Mozilla Foundation with the Firefox browser.

Q: Does this mean you intend to develop other pieces of software?

A: We would like to have that possibility open to us in the future…

Q: And why are you calling the software “LibreOffice” instead of “OpenOffice.org”?

A: The OpenOffice.org trademark is owned by Oracle Corporation. Our hope is that Oracle will donate this to the Foundation, along with the other assets it holds in trust for the Community, in due course, once legal etc issues are resolved. However, we need to continue work in the meantime – hence “LibreOffice” (“free office”).

Q: Why are you building a new web infrastructure?

A: Since Oracle’s takeover of Sun Microsystems, the Community has been under “notice to quit” from our previous Collabnet infrastructure. With today’s announcement of a Foundation, we now have an entity which can own our emerging new infrastructure.

Q: What does this announcement mean to other derivatives of OpenOffice.org?

A: We want The Document Foundation to be open to code contributions from as many people as possible. We are delighted to announce that the enhancements produced by the Go-OOo team will be merged into LibreOffice, effective immediately. We hope that others will follow suit.

Q: What difference will this make to the commercial products produced by Oracle Corporation, IBM, Novell, Red Flag, etc?

A: The Document Foundation cannot answer for other bodies. However, there is nothing in the licence arrangements to stop companies continuing to release commercial derivatives of LibreOffice. The new Foundation will also mean companies can contribute funds or resources without worries that they may be helping a commercial competitor.

Q: What difference will The Document Foundation make to developers?

A: The Document Foundation sets out deliberately to be as developer friendly as possible. We do not demand that contributors share their copyright with us. People will gain status in our community based on peer evaluation of their contributions – not by who their employer is.

Q: What difference will The Document Foundation make to users of LibreOffice?

A: LibreOffice is The Document Foundation’s reason for existence. We do not have and will not have a commercial product which receives preferential treatment. We only have one focus – delivering the best free office suite for our users – LibreOffice.

—————————————————————————————————-

Non-Microsoft and non-Oracle vendors are indeed going to find it useful to bundle a free Libre Office that reduces the total cost of ownership for analytics software. Right now, some of the best free advertising for Microsoft OS and Office is done by enterprise software vendors who create Windows-only products and enable MS Office integration better than Open Office integration. This is done citing user demand, but it is a chicken-and-egg dilemma, as functionality leads to enhanced demand. Microsoft, on the other hand, is aware of this dependence and has made SQL Server and SQL Analytics (besides investing in analytics startups like Revolution Analytics) along with its own infrastructure - Azure Cloud Platform/EC2 instances.

Microsoft Online Games

No, this is not about the X Box kind of games. It is about Microsoft’s tactical shift in the online space from going it alone, and building stuff itself, to partnering, and sometimes investing in and exiting businesses.

In Blogs- It recently announced a migration of MS Live Spaces to WordPress.com. It gives Automattic 30 million more users, no small change considering there were 26 million existing WP users.

Microsoft Messenger, which is the oldest online app in the suite, now provides instant messaging services to about 350 million users, and from now on Windows Live Writer works specifically with the WordPress.com blog service by default. Hopefully Skype and Google Voice will show MS the way to monetize that business app yet.

Google buying blogger-blogspot seems to have done little, but given Biz Stone room to create another content disruption-Twitter.

With the round of lawsuits by proxy, in Android - Motorola, or for acquisitions - MS is just doing what Marc Andreessen (who’s apparently a better VC than Paul Allen was), Sun and co did to it in the nineties.

Google seems to be regretting putting a spade in the Yahoo acquisition- that would have tied up a big chunk of Idle MS cash- leaving it little room for niche investments (like the 250 mill that helped Facebook ramp up in time).

The real surprise here could be Apple. It has shown little interest in cloud computing, and it seems to be testing the waters with Ping. But Apple sure smells competition, and Android is doing to the iPhone what Windows did to the Mac in the early 1990’s.

Google lacks presence in online gaming (despite its own Zynga investment) and needs to start monetizing properties like Android OS (say $10 for every phone license ??), Google Maps (as an app for GPS) and Google Voice. Indeed it may be time for the big G to start thinking of spinning off at least some products: earning better returns, while retaining control (dual stock splits) and killing those anti trust lawyer fees forever.

As the Ancient Chinese said, May you live in interesting times. Fun to watch the online games people play.

 

 

Hive Tutorial: Cloud Computing

Here is a nice video from Cloudera on a HIVE tutorial. I wonder what would happen if they put a real analytical system, and not just basic analytics and reporting, like R or SPSS or JMP or SAS on a big database system like Hadoop (including some text mined data from legacy company documents).

Unlike Oracle or other database systems, Hadoop is free now and should stay free for the foreseeable future (as MySQL used to be before it was acquired by big fish Sun, which was in turn acquired by the bigger Oracle).

Citation-

http://wiki.apache.org/hadoop/Hive

Hive is a data warehouse infrastructure built on top of Hadoop that provides tools to enable easy data summarization, adhoc querying and analysis of large datasets stored in Hadoop files.

Hive is based on Hadoop which is a batch processing system. Accordingly, this system does not and cannot promise low latencies on queries. The paradigm here is strictly of submitting jobs and being notified when the jobs are completed as opposed to real time queries. As a result it should not be compared with systems like Oracle where analysis is done on a significantly smaller amount of data but the analysis proceeds much more iteratively with the response times between iterations being less than a few minutes. For Hive queries response times for even the smallest jobs can be of the order of 5-10 minutes and for larger jobs this may even run into hours.

If your input data is small you can execute a query in a short time. For example, if a table has 100 rows you can ‘set mapred.reduce.tasks=1’ and ‘set mapred.map.tasks=1’ and the query time will be ~15 seconds.
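To make the small-table tip above concrete, here is a minimal sketch that assembles a Hive script with those two settings placed before the query. The property names come from the quoted wiki text. The helper name, the table name `tiny_table` and the query itself are hypothetical.

```python
def build_hive_script(query, settings):
    """Return a HiveQL script: one 'set key=value;' line per setting, then the query."""
    lines = [f"set {key}={value};" for key, value in settings.items()]
    # Normalize the query so it ends with exactly one semicolon.
    lines.append(query.strip().rstrip(";") + ";")
    return "\n".join(lines) + "\n"

# Settings quoted in the wiki excerpt for a very small input table.
small_job_settings = {
    "mapred.reduce.tasks": 1,
    "mapred.map.tasks": 1,
}

# Hypothetical query against a hypothetical 100-row table.
script = build_hive_script("SELECT COUNT(*) FROM tiny_table", small_job_settings)
print(script)
```

The generated text could be saved to a file and run with the Hive command line (for example `hive -f small_job.hql`, assuming a working Hive installation). The roughly 15 second figure is the wiki's own estimate and is not something this sketch measures.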