Interview Dan Steinberg Founder Salford Systems

Here is an interview with Dan Steinberg, Founder and President of Salford Systems (http://www.salford-systems.com/ )

Ajay- Describe your journey from academia to technology entrepreneurship. What are the key milestones or turning points that you remember?

 Dan- When I was in graduate school studying econometrics at Harvard,  a number of distinguished professors at Harvard (and MIT) were actively involved in substantial real world activities.  Professors that I interacted with, or studied with, or whose software I used became involved in the creation of such companies as Sun Microsystems, Data Resources, Inc. or were heavily involved in business consulting through their own companies or other influential consultants.  Some not involved in private sector consulting took on substantial roles in government such as membership on the President’s Council of Economic Advisors. The atmosphere was one that encouraged free movement between academia and the private sector so the idea of forming a consulting and software company was quite natural and did not seem in any way inconsistent with being devoted to the advancement of science.

 Ajay- What are the latest products by Salford Systems? Any future product plans or modifications to work on Big Data analytics, mobile computing and cloud computing?

 Dan- Our central set of data mining technologies comprises CART, MARS, TreeNet, RandomForests, and PRIM, and we have always maintained feature-rich logistic regression and linear regression modules. In our latest release, scheduled for January 2012, we will be including a new data mining approach to linear and logistic regression allowing for the rapid processing of massive numbers of predictors (e.g., one million columns), with powerful predictor selection and coefficient shrinkage. The new methods allow not only classic techniques such as ridge and lasso regression, but also sub-lasso model sizes. Clear tradeoff diagrams between model complexity (number of predictors) and predictive accuracy allow the modeler to select an ideal balance suitable for their requirements.
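The complexity/accuracy tradeoff Dan describes can be sketched with open-source tools. A minimal illustration (using scikit-learn's Lasso on synthetic data, not Salford's implementation) of how stronger shrinkage trades predictors for accuracy:

```python
# Sketch of the predictor-count vs. accuracy tradeoff, using scikit-learn's
# Lasso (not Salford's implementation) on synthetic data.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=100, n_informative=10,
                       noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Sweep the regularization strength: stronger penalties shrink more
# coefficients to exactly zero, giving smaller ("sub-lasso") model sizes.
for alpha in (0.1, 1.0, 10.0, 50.0):
    model = Lasso(alpha=alpha, max_iter=10000).fit(X_tr, y_tr)
    n_predictors = int(np.sum(model.coef_ != 0))
    print(f"alpha={alpha:5}: {n_predictors:3d} predictors, "
          f"R^2={model.score(X_te, y_te):.3f}")
```

Printing this sweep is the text equivalent of the tradeoff diagram: the modeler picks the alpha whose predictor count and accuracy balance best suits the deployment.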

The new version of our data mining suite, Salford Predictive Modeler (SPM), also includes two important extensions to the boosted tree technology at the heart of TreeNet. The first, Importance Sampled Learning Ensembles (ISLE), is used for the compression of TreeNet tree ensembles. Starting with, say, a 1,000-tree ensemble, the ISLE compression might well reduce this down to 200 reweighted trees. Such compression will be valuable when models need to be executed in real time. The compression rate is always under the modeler's control, meaning that if a deployed model may only contain, say, 30 trees, then the compression will deliver an optimal 30-tree weighted ensemble. Needless to say, compression of tree ensembles should be expected to be lossy, and how much accuracy is lost when extreme compression is desired will vary from case to case. Prior to ISLE, practitioners simply truncated the ensemble to the maximum allowable size. The new methodology will substantially outperform truncation.
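The ISLE idea can be roughly imitated with open-source pieces. A sketch under my own assumptions about the method (this is not Salford's code): grow a large boosted ensemble, then let the lasso select and reweight a small subset of its trees.

```python
# Rough sketch of ISLE-style ensemble compression: a lasso over the
# per-tree predictions zeroes out most trees and reweights the rest.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=400, n_features=20, noise=10.0, random_state=1)

ensemble = GradientBoostingRegressor(n_estimators=300, random_state=1).fit(X, y)

# Matrix of per-tree predictions: one column per tree in the ensemble.
tree_preds = np.column_stack([t[0].predict(X) for t in ensemble.estimators_])

# The lasso selects a sparse, reweighted subset of the 300 trees.
selector = Lasso(alpha=1.0, max_iter=10000).fit(tree_preds, y)
kept = np.flatnonzero(selector.coef_)
print(f"kept {len(kept)} of {tree_preds.shape[1]} trees, reweighted")
```

Raising the lasso penalty drives the kept-tree count down toward any deployment budget, which is the "compression rate under the modeler's control" point, at the cost of some accuracy.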

The second major advance is RULEFIT, a rule extraction engine that starts with a TreeNet model and decomposes it into the most interesting and predictive rules. RULEFIT is also a tree ensemble post-processor and offers the possibility of improving on the original TreeNet predictive performance. One can think of the rule extraction as an alternative way to explain and interpret an otherwise complex multi-tree model. The rules extracted are similar conceptually to the terminal nodes of a CART tree but the various rules will not refer to mutually exclusive regions of the data.
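To make "rules" concrete, here is a small helper of my own (an illustration, not RULEFIT itself) that walks a fitted scikit-learn tree and prints each terminal node as an if-then rule. RULEFIT harvests rules like these from every tree in an ensemble and then fits a sparse model over them, so the extracted rules overlap rather than partitioning the data.

```python
# Print each terminal node of a small tree as a human-readable rule.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(data.data, data.target)
t = tree.tree_

def rules(node=0, conds=()):
    if t.children_left[node] == -1:          # -1 marks a terminal node
        pred = data.target_names[t.value[node].argmax()]
        print(" AND ".join(conds) or "(always)", "->", pred)
        return
    name = data.feature_names[t.feature[node]]
    thr = t.threshold[node]
    rules(t.children_left[node], conds + (f"{name} <= {thr:.2f}",))
    rules(t.children_right[node], conds + (f"{name} > {thr:.2f}",))

rules()
```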

 Ajay- You have led teams that have won multiple data mining competitions. What are some of your favorite techniques or approaches to a data mining problem?

 Dan- We only enter competitions involving problems for which our technology is suitable, generally classification and regression. In these areas, we are partial to TreeNet because it is such a capable and robust learning machine. However, we always find great value in analyzing many aspects of a data set with CART, especially when we require a compact and easy to understand story about the data. CART is exceptionally well suited to the discovery of errors in data, often revealing errors created by the competition organizers themselves. More than once, our reports of data problems have been responsible for the competition organizer's decision to issue a corrected version of the data, and we have been the only group to discover the problem.

In general, tackling a data mining competition is no different than tackling any analytical challenge. You must start with a solid conceptual grasp of the problem and the actual objectives, and the nature and limitations of the data. Following that comes feature extraction, the selection of a modeling strategy (or strategies), and then extensive experimentation to learn what works best.

 Ajay- I know you have created your own software. But is there other software that you use or would like to use?

 Dan- For analytics we frequently test open source software to make sure that our tools will in fact deliver the superior performance we advertise. In general, if a problem clearly requires technology other than that offered by Salford, we advise clients to seek other consultants expert in that other technology.

 Ajay- Your software is installed at 3,500 sites, including 400 universities (as per http://www.salford-systems.com/company/aboutus/index.html ). What is the key to managing and keeping so many customers happy?

 Dan- First, we have taken great pains to make our software reliable, and we make every effort to avoid problems related to bugs. Our testing procedures are extensive, and we have experts dedicated to stress-testing the software. Second, our interface is designed to be natural, intuitive, and easy to use, so the challenges to the new user are minimized. Clear documentation, help files, and training videos round out how we allow users to look after themselves. Should a client need to contact us, we try to achieve 24-hour turnaround on tech support issues, and we monitor all tech support activity to ensure the timeliness, accuracy, and helpfulness of our responses. WebEx, GoToMeeting, and other internet-based tools permit real-time interaction.

 Ajay- What do you do to relax and unwind?

 Dan- I am in the gym almost every day combining weight and cardio training. No matter how tired I am before the workout I always come out energized so locating a good gym during my extensive travels is a must. I am also actively learning Portuguese so I look to watch a Brazilian TV show or Portuguese dubbed movie when I have time; I almost never watch any form of video unless it is available in Portuguese.

 Biography-

http://www.salford-systems.com/blog/dan-steinberg.html

Dan Steinberg, President and Founder of Salford Systems, is a well-respected member of the statistics and econometrics communities. In 1992, he developed the first PC-based implementation of the original CART procedure, working in concert with Leo Breiman, Richard Olshen, Charles Stone and Jerome Friedman. In addition, he has provided consulting services on a number of biomedical and market research projects, which have sparked further innovations in the CART program and methodology.

Dr. Steinberg received his Ph.D. in Economics from Harvard University, and has given full-day presentations on data mining for the American Marketing Association, the Direct Marketing Association and the American Statistical Association. After earning his PhD, Steinberg began his professional career as a Member of the Technical Staff at Bell Labs, Murray Hill, and then became Assistant Professor of Economics at the University of California, San Diego. A book he co-authored on Classification and Regression Trees was awarded the 1999 Nikkei Quality Control Literature Prize in Japan for excellence in statistical literature promoting the improvement of industrial quality control and management.

His consulting experience at Salford Systems has included complex modeling projects for major banks worldwide, including Citibank, Chase, American Express and Credit Suisse, and has included projects in Europe, Australia, New Zealand, Malaysia, Korea, Japan and Brazil. Steinberg led the teams that won first place awards in the KDD Cup 2000 and the 2002 Duke/Teradata churn modeling competition, and the teams that won awards in the PAKDD competitions of 2006 and 2007. He has published papers in economics, econometrics and computer science journals, and contributes actively to the ongoing research and development at Salford.

Credit Downgrade of USA and Triple A Whining

As a person trained, deployed, and often asked to comment on macroeconomic shenanigans, I have the following observations to make on the downgrade of US debt by S&P:

1) Credit rating is both a mathematical exercise of debt versus net worth as well as intention to repay. Given the recent deadlock in United States legislature on debt ceiling, it is natural and correct to assume that holding US debt is slightly more risky in 2011 as compared to 2001. That means if the US debt was AAA in 2001 it sure is slightly more risky in 2011.

2) Politicians are criticized the world over in democracies, including India, the UK and the US. This is natural, healthy, and enforced by the checks and balances of each country's constitution. At the time of writing, there are protests in India on corruption, in the UK on economic disparities, in the US on debt versus tax versus spending, and in Israel on inflation. It is the maturity of the media, as well as the average educational level of the citizenry, that amplifies and inflames or dampens sentiment regarding policy and business.

3) Conspicuous consumption has failed at both an environmental and an economic level. Cheap debt to buy things you do not need may have made good macroeconomic sense as long as the things were made by people locally, but that is no longer the case. Outsourcing is not all evil, but it is not a perfect solution to economics and competitiveness either. Is outsourcing good or bad? Well, it depends.

4) In 1944, the US took on debt to fight Nazism, build atomic power, and generally wage a lot of war, producing lots of dual-use inventions along the way. In 2004-2010, the US took on debt to fight wars in Iraq and Afghanistan and to bail out banks and automobile companies. Some erosion in the values represented by a free democracy has taken place, much to the delight of authoritarian regimes (who have managed to survive Google and Facebook).

5) A double-A rating is still quite a good rating. No one is moving out of US Treasuries; I mean, seriously, what are your alternative places to park your government or central bank assets: the euro, gold, oil, rare earth futures, metals or yen?

6) Income disparity as a trigger for social unrest in the UK, France and other parts of the world is an ominous looming threat that may lead to more action than the poor maths of S&P. It has been some time since riots occurred in the United States, and I believe in time series and cycles, especially given the rising Gini coefficients.

Gini indices for the United States at various times, according to the US Census Bureau:

  • 1929: 45.0 (estimated)
  • 1947: 37.6 (estimated)
  • 1967: 39.7 (first year reported)
  • 1968: 38.6 (lowest index reported)
  • 1970: 39.4
  • 1980: 40.3
  • 1990: 42.8
    • (Recalculations made in 1992 added a significant upward shift for later values)
  • 2000: 46.2
  • 2005: 46.9
  • 2006: 47.0 (highest index reported)
  • 2007: 46.3
  • 2008: 46.69
  • 2009: 46.8
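For readers who want to check such figures against raw data, the Gini coefficient is straightforward to compute; a minimal sketch using the standard rank-based closed form (the Census Bureau's own methodology differs in data handling, so this is only the textbook calculation):

```python
# Gini coefficient: mean absolute difference between all pairs of incomes,
# divided by twice the mean income. 0 = perfect equality; 1 (or 100 as a
# percentage, as in the table above) = one person holds everything.
import numpy as np

def gini(incomes):
    x = np.sort(np.asarray(incomes, dtype=float))
    n = x.size
    # Rank-based closed form, equivalent to the O(n^2) pairwise sum.
    return (2 * np.arange(1, n + 1) - n - 1).dot(x) / (n * x.sum())

print(gini([1, 1, 1, 1]))    # perfectly equal -> 0.0
print(gini([0, 0, 0, 100]))  # highly unequal -> 0.75
```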

7) Again, I am slightly suspicious of an American corporation downgrading the American government's debt when it failed to reconcile its numbers by $2 trillion and famously managed to avoid downgrading Lehman Brothers. What are the political affiliations of the S&P board? What are their backgrounds? Check the facts, Watson.

The Chinese government should be concerned if it is holding >1,000 tonnes of gold and >$1 trillion of US Treasuries, lest we have a third opium war (as either gold or US Treasuries will burst). Opium in 1850, like US Treasuries in 2010, has no inherent value except for those addicted to it.

8) Ron Paul and Paul Krugman are the two extremes of economic ideology in the US.

Reminds me of the old saying- Robbing Peter to pay Paul. Both the Pauls seem equally unhappy and biased.

I have to read both the WSJ and the NYT to make sense of what is actually happening in the US, as opinionated journalism has managed to elbow out fact-based journalism. Do we need analytics in journalism education and reporting?

9) Panic buying and selling will create short-term arbitrage positions. People like Warren Buffett made more money in the crash of 2008 than people did in the boom years of 2006-07.

If stocks are cheap, buy on the dips. Acquire companies before they go for IPOs. Buy back your own stock if you are sitting on a pile of cash. Buy some technology patents in cloud, mobile, tablet and statistical computing if you have a lot of cash and need some long-term assets.

10) Follow all advice above at own risk and no liability to this author 😉

 

Data Visualization: Central Banks


Trying to compare the transparency of central banks via the data visualization of two very different central banks.

One is the Reserve Bank of India and the other is the Federal Reserve Bank of New York.

Here are some points-

1) The Federal Reserve Bank gives you a huge clutter of charts to choose from, and some of them are very difficult to understand.

see http://www.newyorkfed.org/research/global_economy/usecon_charts.html

and http://www.newyorkfed.org/research/directors_charts/us18chart.pdf


2) The Reserve Bank of India chose Business Objects and gives you a proper drilldown kind of graph and tables. (That's a lot of the heavy metal and iron ore China needs from India 😉 😉)

Foreign Trade – Export      Time-line: ALL

TIME LINE COUNTRY COMMODITY AMOUNT (US $ MILLION) EXPORT QUANTITY
2010:07 (JUL) – P China IRON ORE (Units: TON) 205.06 1878456
2010:06 (JUN) – P China IRON ORE (Units: TON) 427.68 6808528
2010:05 (MAY) – P China IRON ORE (Units: TON) 550.67 5290450
2010:04 (APR) – P China IRON ORE (Units: TON) 922.46 9931500
2010:03 (MAR) – P China IRON ORE (Units: TON) 829.75 13177672
2010:02 (FEB) – P China IRON ORE (Units: TON) 706.04 10141259
2010:01 (JAN) – P China IRON ORE (Units: TON) 577.13 8498784
2009:12 (DEC) – P China IRON ORE (Units: TON) 545.68 9264544
2009:11 (NOV) – P China IRON ORE (Units: TON) 508.17 9509213
2009:10 (OCT) – P China IRON ORE (Units: TON) 422.6 7691652
2009:09 (SEP) – P China IRON ORE (Units: TON) 278.04 4577943
2009:08 (AUG) – P China IRON ORE (Units: TON) 276.96 4371847
2009:07 (JUL) China IRON ORE (Units: TON) 266.11 4642237
2009:06 (JUN) China IRON ORE (Units: TON) 241.08 4584354

Source: DGCI & S, Ministry of Commerce & Industry, GoI
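The drilldown invites some arithmetic of your own; for instance, the implied price per tonne of iron ore exported to China each month (amount in US$ million divided by quantity in tonnes), sketched here with a few rows transcribed from the table above:

```python
# Implied price per tonne from the RBI export table above.
rows = [  # (month, US$ million, tonnes), transcribed from the table
    ("2010:07", 205.06, 1878456),
    ("2010:06", 427.68, 6808528),
    ("2010:05", 550.67, 5290450),
    ("2010:04", 922.46, 9931500),
    ("2009:06", 241.08, 4584354),
]
for month, usd_m, tonnes in rows:
    print(f"{month}: ${usd_m * 1e6 / tonnes:.2f} per tonne")
```

Even this crude calculation shows the realized price roughly doubling between mid-2009 and mid-2010, which is exactly the kind of story a good drilldown visualization should surface.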

 

You can see the screenshots of the various visualization tools of the New York Fed and the Reserve Bank of India above. If the US Fed is serious about cutting the debt, maybe it should start publishing better visuals.

Which software do we buy? It depends


Often I am asked by clients, friends and industry colleagues about the suitability or unsuitability of particular software for analytical needs. My answer is mostly-

It depends on-

1) The cost of a Type 1 error in the purchase decision versus a Type 2 error in the purchase decision. (Forgive me if I mix up Type 1 with Type 2 error- I do have some weird childhood learning disabilities which crop up now and then.)

Here I define a Type 1 error as paying more for software when equivalent functionality was available at a lower price, or buying components you do not need, like SPSS Trends (when only SPSS Base is required) or SAS ETS (when only SAS/STAT would do).

The first kind of error arises, of course, due to the presence of free tools with GUIs, like R, R Commander and Deducer (Rattle does have a $500 commercial version).

The emergence of software vendors like WPS (for SAS language aficionados), which offer functionality similar to Base SAS, as well as the increasing convergence of business analytics (read: predictive analytics) and business intelligence (read: reporting), has led to a degree of brand clutter in which all software promises to do everything, at all different prices- though each product has specific strengths and weaknesses. To add to this, there are comparatively fewer independent business analytics analysts than, say, independent business intelligence analysts.

2) I define a Type 2 error as the opportunity cost of delayed projects, delayed business models, or lower accuracy: the consequences of buying lower-priced software which had less functionality than you required.

To compound the magnitude of a Type 2 error, you are probably in some kind of vendor lock-in, your software budget is exhausted because you bought too much or inappropriate software and hardware, and you could still do with some added help in business analytics. The fear of making a business-critical error is a substantial reason why open source software has to work harder at proving itself competent. This is because writing great software is not enough: we need great marketing to sell it, and great customer support to sustain it.

As business decisions are made under the constraints of time, information and money, I will try to create a software purchase matrix based on my knowledge of known software (and its unknown strengths and weaknesses), pricing (versus budgets), and ranges of data handling. I will basically add an optimum approach based on known constraints, with flexibility for unknown operational constraints.

I will restrict this matrix to analytics software, though you could certainly extend it to other classes of enterprise software, including big data databases, infrastructure and computing.

Noted assumptions- 1) I am vendor neutral and do not suffer from subjective bias or affection for particular software (based on conferences, books, relationships, consulting, etc.).

2) All software has bugs, so all software needs customer support.

3) All software has particular advantages, strengths and weaknesses in terms of functionality.

4) Cost includes total cost of ownership and opportunity cost of business analytics enabled decision.

5) All software marketing people will praise their own software- sometimes over-selling and mis-selling product bundles.

Software compared are SPSS, KXEN, R, SAS, WPS, Revolution R, SQL Server, and various flavors and sub-components within these. The optimized approach will include parallel programming, cloud computing, hardware costs, and dependent software costs.
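The matrix itself is still to come, but the expected-cost logic behind it can be sketched now. The option names, prices and probabilities below are purely illustrative placeholders of mine, not real vendor figures:

```python
# Toy purchase matrix: expected cost = license cost plus the
# probability-weighted costs of the Type 1 error (overbuying) and
# Type 2 error (functionality shortfall) defined in the post.
# All numbers are hypothetical illustrations.
options = {
    # name: (license cost, P(overbuy), overbuy cost, P(shortfall), shortfall cost)
    "full commercial suite": (100_000, 0.6, 40_000, 0.05, 10_000),
    "base module only":      (30_000,  0.1, 5_000,  0.30, 60_000),
    "open source + support": (10_000,  0.0, 0,      0.25, 80_000),
}
for name, (lic, p1, c1, p2, c2) in options.items():
    expected = lic + p1 * c1 + p2 * c2
    print(f"{name}: expected total cost ${expected:,.0f}")
```

The point of the exercise is that the cheapest license is not automatically the cheapest decision once both error types are priced in; the real matrix would need honest estimates for each cell.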

To be continued-


KXEN Case Studies : Financial Sector

Here are the summaries of some excellent success stories that KXEN has achieved working with partners in the financial world over the years.

Fraud Modeling- Disbank (acquired by Fortis) Turkey

1. Dısbank increased the number of identified fraudulent applications by 200%, from 7 to 21 per day.

2. More than 50 fraudsters using counterfeit cards at merchant locations or making fraudulent applications have been arrested since April 2004, when the fraud modeling system was set up.


A large Bank on the U.S. East Coast

1. Response Modeling

Previously it took the modeling group four weeks to build one model with several hundred variables, using traditional modeling tools. KXEN took one hour for the same problem and doubled the lift in the top decile because it included variables that had not been used for this business question before.
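"Lift in the top decile" is easy to make concrete; a small sketch on simulated data (the scores and responses here are synthetic stand-ins, not KXEN's):

```python
# Top-decile lift: rank customers by model score, take the top 10%,
# and compare their response rate to the overall response rate.
import numpy as np

rng = np.random.default_rng(0)
scores = rng.random(1000)                      # stand-in model scores
responded = rng.random(1000) < scores * 0.2    # response correlated with score

top = np.argsort(scores)[::-1][:100]           # top decile by score
lift = responded[top].mean() / responded.mean()
print(f"top-decile lift: {lift:.2f}")
```

A lift of 2.0 means the model's best-scored tenth of customers responds at twice the base rate, which is what "doubled the lift in the top decile" is measuring against the previous model.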

2. Data Quality

Building a Cross/Up-sell Model for a direct marketing campaign to high net worth customers, the modelers needed four weeks using 1500 variables. Again it took one hour with KXEN, which uncovered significant problems with some of the top predictive variables. Further investigation proved that these problems were created in the data merge of the mail file and response file, creating several “perfect” predictors. The model was re-run, removing these variables, and immediately put into production.

Le Crédit Lyonnais

1. Around 160 score models are now built annually – compared to around 10 previously – for 130 direct marketing campaigns.
2. KXEN software has allowed LCL to drive up response rates, leading to more value-added services for customers.

Finansbank, Turkey

1. Within four months of starting the project to combat dormancy using KXEN's solution, the bank had successfully reactivated half of its previously dormant customers, as per Kunter Kutluay, Finansbank Director of Marketing and Risk Analytics.

Bank Austria Creditanstalt, Austria

1. Some 4.5 terabytes of data are held in the bank's operational systems, with a further 2 terabytes archived. Analytical models created in KXEN are automatically fed through the bank's scoring engine in batches, weekly or monthly depending on the schema.

“But we are looking at a success rate of target customer deals in the area of three to five per cent with KXEN. Before that, it was one per cent or less.”
Werner Widhalm, Head of the Customer Knowledge Management Unit

Barclays

1. Barclays’ Teradata warehouse holds information on some 14 million active customers, with data on many different aspects of customer behaviour. Previously, analysts had to manually whittle down several thousand fields of data to a core of only a few hundred to fit the limitations of the modelling process. Now, all of the variables can be fed straight into the predictive model.

Summary– KXEN has achieved tremendous results in all aspects of data modelling in the financial sector, where the time taken to build, deploy and analyze a model is much more crucial than in many other sectors. I will follow this up with case studies of other KXEN successes across multiple domains.

Source – http://www.kxen.com/index.php?option=com_content&task=view&id=220&Itemid=786

Disclaimer- I am a social media consultant for KXEN.