Decisionstats Interviews

Here is a list of interviews that I have published. These are specific to analytics and data mining and include only the most recent interviews. If I have missed any notable recent interview related to analytics and data mining, kindly let me know. Hat tip to Karl Rexer for this suggestion.

Date    Name of Interviewee    Designation and Organization

09-Jun    Karl Rexer    President, Rexer Analytics
05-Jun    Jim Daves    CMO, SAS Institute
04-Jun    Paul van Eikeren    President and CEO, Blue Reference
29-May    David Smith    Director of Community, REvolution Computing
17-May    Dominic Pouzin    CEO, Data Applied
11-May    Bruno Delahaye    VP, KXEN
04-May    Ron Ramos    Director, Zementis
30-Apr    Oliver Jouve    VP, SPSS Inc
21-Apr    Fabian Dill    Co-Founder, Knime.com
18-Apr    Alicia Mcgreevey    Head of Marketing, Visual Numerics
27-Mar    Francoise Soulie Fogelman    VP, KXEN
17-Mar    Jon Peck    Principal Software Engineer, SPSS Inc
06-Mar    Anne Milley    Director of Product Marketing, SAS Institute
04-Mar    Anne Milley    Director of Product Marketing, SAS Institute
03-Feb    Phil Rack    Creator, Bridge to R, and CEO, MineQuest
03-Feb    Michael Zeller    CEO, Zementis
31-Jan    Richard Schultz    CEO, REvolution Computing
21-Jan    Bob Muenchen    Author, R for SAS and SPSS Users
13-Jan    Dr Graham Williams    Creator, Rattle GUI for R
05-Jan    Roger Haddad    CEO, KXEN
26-Sep    June Dershewitz    VP, Semphonic
04-Sep    Vincent Granville    Head, Analyticbridge

The URLs to the specific interviews are also in this sheet.

http://spreadsheets.google.com/pub?key=rWTqcMe9mqwHeFv1e4GS_yg&single=true&gid=0&range=a1%3Ae24&output=html

Interview Karl Rexer -Rexer Analytics

Here is an interview with Karl Rexer of Rexer Analytics. His annual survey is considered a benchmark in the data mining and analytics industry. Here Karl talks about his career, his annual survey, and his views on industry direction and trends.

Almost 20% of data miners report that their company/organizations have only minimal analytic capabilities – Karl Rexer


Ajay- Describe your career in science. What advice would you give to young science graduates in this recession? What advice would you give to high school students choosing between science and non-science careers?

Karl- My interests in science began as a child. My father has multiple science degrees, and I grew up listening to his descriptions of the cool things he was building, or the cool investigative tools he was using, in his lab. He worked in an industrial setting, so visiting was difficult. But when I could, I loved going in to see the high-temperature furnaces he was designing, the carbon-fiber production processes he was developing, and the electron microscope that allowed him to look at his samples. Both of my parents encouraged me to ask why, and to think critically about both scientific and social issues. It was also the time of the Apollo moon landings, and I was totally absorbed in watching and thinking about them. Together these things motivated me and shaped my world-view.

I have also had the good fortune to work across many diverse areas and with some truly outstanding people. In graduate school I focused on applied statistics and the use of scientific methods in the social sciences. As a grad student and young academic, I applied those skills to researching how our brains process language. But on the side, I pursued a passion for using the scientific method and analytics to address… well, anything I could. We called it “statistical consulting” then, but it often extended to research design and many other parts of the scientific process. Some early projects included assisting people with AIDS outcome studies, psycholinguistic research, and studies of adolescent adjustment.

My first taste of applying these skills outside of an academic environment was with my mentor Len Katz. The US Navy hired us to help assess the new recruits that were entering the submarine school. Early identification of sailors who would excel in this unusual and stressful environment was critical. Perhaps even more important was identifying sailors who would not perform well in that environment. Luckily, the Navy had years of academic and psychological testing on many sailors, and this data proved quite useful in predicting later job performance onboard the submarines. Even though we never got the promised submarine ride, I was hooked on applying measurement, scientific methods, and analytics in non-academic settings.

And that’s basically what I have continued to do – apply those skills and methods in diverse scientific and business settings. I worked for two banks and two consulting firms before founding Rexer Analytics in 2002. Last year we supported 30 clients. I’ve got great staff and they have great quant skills. Importantly, we also don’t hesitate to challenge each other, and we’re continually learning from each other and from each client engagement. We share a love of project diversity, and we seek it out in our engagements. We’ve forecasted sales for medical devices, measured B2B customer loyalty, identified manufacturing problems by analyzing product returns, predicted which customers will close their bank accounts, analyzed millions of tax returns, helped identify the dimensions of business team cohesion that result in better performance, found millions of dollars of B2B and B2C fraud, and helped many companies understand their customers better with segmentations, surveys, and analyses of sales and customer behavior.

The advice I would give to young science grads in this recession is to expand your view of where you can apply your scientific training. This applies to high school students considering science careers too. Not all science happens in universities, labs and other traditional science locations. Think about applying scientific methods everywhere! Sometimes our projects at Rexer Analytics seem far away from what most people would consider “science.” But we’re always asking “what data is available that can be brought to bear on the business issue we’re addressing?” Sometimes the best solution is to go out and collect more data – so we frequently help our clients improve their measurement processes or design surveys to collect the necessary data. I think there are enormous opportunities for science grads to apply their scientific training in the business world. The opportunities are not limited to physics whiz kids making models for Wall Street trading or computer science students moving to Silicon Valley. One of the best analytic teams I ever worked on was at Fleet Bank in the late 90s. We had an economist, two physicists, a sociologist, a psychologist, an operations research guy, and a person with a degree in marketing science. We were all very focused on data, measurement, and analytic methods.

I recommend that all science grads read Tom Davenport’s book Competing on Analytics. It illustrates, with compelling examples, how businesses can benefit from using science and analytics. Several examples in Tom’s book come from Gary Loveman, CEO of Harrah’s Entertainment. I think that Gary also serves as a great example of how scientific methods can be applied in every industry. Gary has a PhD in economics from MIT, he’s worked at the Federal Reserve Bank, and he’s been a professor at Harvard, but more recently he runs the world’s largest casino and gaming company. And he’s famously said many times that there are three ways to get fired at Harrah’s: steal, harass women, or not use a control group. Business leaders across all industries increasingly want data, analytics and scientific decision-making. Science grads have great training that enables them to take on these roles and to demonstrate the success of these methods.

Ajay- One more survey- How does the Rexer survey differentiate itself from other surveys out there?

Karl- The Annual Rexer Analytics Data Miner Survey is the only broad-reaching research that investigates the analytic behaviors, views and preferences of data mining professionals. Each year our sample grows — in 2009 we had over 700 people around the globe complete our survey. Our participants include large numbers of both academic and business people.

Another way our survey is differentiated from other surveys is that each year we ask our participants to provide suggestions on ways to improve the survey. Incorporating participants’ suggestions improves our survey. For example, in 2008 several people suggested adding questions about model deployment and off-shoring. We asked about both of these topics in the 2009 survey.

Ajay- Could you please share some sneak previews of the survey results? What impact is the recession likely to have on IT spending?

Karl- We’re just starting to analyze the 2009 survey data. But, yes, here’s a peek at some of the findings that relate to the impact of the recession:

* Many data miners report that funding for data mining projects can sometimes be a problem.
* However, when asked what will happen in 2009 if the economic downturn continues, many data miners still anticipate that their company/organization will conduct more data mining projects in 2009 than in previous years (41% anticipate more projects in 2009; 27% anticipate fewer projects).
* The vast majority of companies conduct their data mining internally, and very few are sending data mining off-shore.

I don’t have a crystal ball that tells me about the trends in overall corporate spending on IT, Business Intelligence, or Data Mining. It’s my personal experience that many budgets are tight this year, but that key projects are still getting funded. And it is my strong opinion that in the coming years many companies will increase their focus on analytics, and I think that increasingly analytics will be a source of competitive advantage for these companies.

There are other people and other surveys that provide better insight into the trends in IT spending. For example, Gartner’s recent survey of over 1,500 CIOs (http://www.gartner.com/it/page.jsp?id=855612 ) suggests that 2009 IT spending is likely to be flat. I’m personally happy to see that in the Gartner survey, Business Intelligence is again CIOs’ top technology priority, and that “increasing the use of information/analytics” is the #5 business priority.

Ajay- I noticed you advise SPSS among others. Describe what an advisory role is for an analytics company and how can small open source companies get renowned advisors?

Karl- We have advised Oracle, SPSS, Hewlett-Packard and several smaller companies. We find that advisory roles vary greatly. The biggest source of variation is what the company wants advice about. Examples include:

* assessing opportunity areas for the application of analytics
* strategic data assessments
* analytic strategy
* product strategy
* reviewing software

Both large and small companies that look to apply analytics to their businesses can benefit from analytic advisors. So can open source companies that sell analytic software. Companies can find analytic advisors in several ways. One way is to look around for analytic experts whose advice you trust, and hire them. Networking in your own industry and in the analytic communities can identify potential advisors. Don’t forget to look in both academia and the business world. Many skilled people cross back and forth between these two worlds. Another way for these companies to obtain analytic advice is to look in their business networks and user communities for analytic specialists who share some of the goals of the company – they will be motivated for your company to succeed. Especially if focused topic areas or time-constrained tasks can be identified, outside experts may be willing to donate their time, and they may be flattered that you asked.

Ajay- What made you decide to begin the Rexer Surveys? Describe some results of last year’s surveys and any trends from the last three years that you have seen.

Karl- I’ve been involved on the organizing committees of several data mining workshops and conferences. At these conferences I talk with a lot of data miners and companies involved in data mining. I found that many people were interested in hearing about what other data miners were doing: what algorithms, what types of data, what challenges were being faced, what they liked and disliked about their data mining tools, etc. Since we conduct online surveys for several of our clients, and my network of data miners is pretty large, I realized that we could easily do a survey of data miners, and share the results with the data mining community. In the first year, 314 data miners participated, and it’s just grown from there. In 2009 over 700 people completed the survey. The interest we’ve seen in our research summaries has also been astounding – we’ve had thousands of requests. Overall, this just confirms what we originally thought: people are hungry for information about data mining.

Here is a preview of findings from the initial analyses of the 2009 survey data:

* Each year we’ve seen that the most commonly used algorithms are decision trees, regression, and cluster analysis.
* Consistently, some of the top challenges data miners report are dirty data and explaining data mining to others. Previously, data access issues were also reported as a big challenge, but in 2009 fewer data miners reported facing this challenge.
* The most prevalent concerns with how data mining is being utilized are: insufficient training of some data miners, and resistance to using data mining in contexts where it would be beneficial.
* Data mining is playing an important role in organizations. Half of data miners indicate their results are helping to drive strategic decisions and operational processes.
* But there’s room for data mining to grow – almost 20% of data miners report that their company/organizations have only minimal analytic capabilities.

Bio-

Karl Rexer, PhD is President of Rexer Analytics, a small Boston-based consulting firm. Rexer Analytics provides analytic and CRM consulting to help clients use their data to make better strategic and tactical decisions. Recent projects include fraud detection, sales forecasting, customer segmentation, loyalty analyses, predictive modeling for cross-sell and attrition, and survey research. Rexer Analytics also conducts an annual survey of data miners and freely distributes research summaries to the data mining community. Karl has been on the organizing committees of several international data mining conferences, including 3 KDD conferences, and BIWA-2008. Karl is on the SPSS Customer Advisory Board and on the Board of Directors of the Oracle Business Intelligence, Warehousing, & Analytics (BIWA) Special Interest Group. Karl and other Rexer Analytics staff are frequent invited speakers at MBA data mining classes and conferences.

To know more, do check out the website at www.rexeranalytics.com


Interview David Smith REvolution Computing

Here is an interview with REvolution Computing’s Director of Community, David Smith.

“Our development team spent more than six months making R work on 64-bit Windows (and optimizing it for speed), which we released as REvolution R Enterprise bundled with ParallelR.” – David Smith

Ajay- Tell us about your journey in science. In particular, tell us what attracted you to R and the open source movement.

David- I got my start in science in 1990 working with CSIRO (the government science organization in Australia) after I completed my degree in mathematics and computer science. Seeing the diversity of projects the statisticians there worked on really opened my eyes to statistics as the way of objectively answering questions about science.

That’s also when I was first introduced to the S language, the forerunner of R. I was hooked immediately; it was just so natural for doing the work I had to do. I also had the benefit of a wonderful mentor, Professor Bill Venables, who at the time was teaching S to CSIRO scientists at remote stations around Australia. He brought me along on his travels as an assistant. I learned a lot about the practice of statistical computing helping those scientists solve their problems (and got to visit some great parts of Australia, too).

Ajay- How do you think we should help bring more students to the fields of mathematics and science?

David- For me, statistics is the practical application of mathematics to the real world of messy data, complex problems and difficult conclusions. And in recent years, lots of statistical problems have broken out of geeky science applications to become truly mainstream, even sexy. In our new information society, graduating statisticians have a bright future ahead of them which I think will inevitably draw more students to the field.

Ajay- Your blog at REvolution Computing is one of the best technical corporate blogs; in particular, the monthly round-ups of new packages, R events and product launches are all written in a lucid style. Are there any plans for a REvolution Computing community or network as well, instead of just the blog?

David- Yes, definitely. We recently hired Danese Cooper as our Open Source Diva to help us in this area. Danese has a wealth of experience building open-source communities, such as for Java at Sun. We’ll be announcing some new community initiatives this summer. In the meantime, of course, we’ll continue with the Revolutions blog, which has proven to be a great vehicle for getting the word out about R to a community that hasn’t heard about it before. Thanks for the kind words about the blog, by the way — it’s been a lot of fun to write. It will be a continuing part of our community strategy, and I even plan to expand the roster of authors in the future, too. (If you’re an aspiring R blogger, please get in touch!)

Ajay- I kind of get confused between what exactly 32-bit and 64-bit computing are, in terms of hardware and software. What is the deal there? How do Enterprise solutions from REvolution take care of 64-bit computing? How exactly do parallel computing and optimized math libraries in REvolution R help, compared to other flavors of R?

David- Fundamentally, 64-bit systems allow you to process larger data sets with R – as long as you have a version of R compiled to take advantage of the increased memory available. (I wrote about some of the technical details behind this recently on the blog.) One of the really exciting trends I’ve noticed over the past six months is that R is being applied to larger and more complex problems in areas like predictive analytics and social networking data, so being able to process the largest data sets is key.

One common misperception is that 64-bit systems are inherently faster than their 32-bit equivalents, but this isn’t generally the case. To speed up large problems, the best approach is to break the problem down into smaller components and run them in parallel on multiple machines. We created the ParallelR suite of packages to make it easy to break down such problems in R and run them on a multiprocessor workstation, a local cluster or grid, or even cloud computing systems like Amazon’s EC2.
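The divide-and-recombine pattern David describes is tool-agnostic. As a rough illustrative sketch (in Python rather than R, and with a thread pool standing in for ParallelR's cluster or cloud backends, so this is not REvolution's actual API), an embarrassingly parallel job like a bootstrap can be split into independent chunks, run concurrently, and the replicates recombined:

```python
import random
from concurrent.futures import ThreadPoolExecutor
from statistics import mean

# A toy data set whose mean we want to bootstrap.
data = [random.gauss(50, 10) for _ in range(500)]

def bootstrap_chunk(n_resamples):
    """Run n_resamples independent bootstrap replicates; return the resampled means."""
    rng = random.Random()
    return [mean(rng.choices(data, k=len(data))) for _ in range(n_resamples)]

# Break 1000 replicates into 4 independent chunks and dispatch them in
# parallel; on a real cluster each chunk would go to a separate worker.
with ThreadPoolExecutor(max_workers=4) as pool:
    chunks = pool.map(bootstrap_chunk, [250] * 4)

# Recombine the partial results into one set of bootstrap estimates.
replicates = [r for chunk in chunks for r in chunk]
print(len(replicates))  # 1000 bootstrap estimates of the mean
```

For genuinely CPU-bound work you would swap the thread pool for a process pool or a distributed backend; the structure of the code stays the same, which is exactly the appeal of the chunk-and-combine approach.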

While the core R team produces versions of R for 64-bit Linux systems, they don’t make one for Windows. Our development team spent more than six months making R work on 64-bit Windows (and optimizing it for speed), which we released as REvolution R Enterprise bundled with ParallelR. We’re excited by the scale of the applications our subscribers are already tackling with a combination of 64-bit and parallel computing.

Ajay- Command line is oh so commanding. Please describe any plans to support or help an R GUI like Rattle or R Commander. Do you think REvolution R can get more users if it does support a GUI?

David- Right now we’re focusing on making R easier to use for programmers by creating a new GUI for programming and debugging R code. We heard feedback from some clients who were concerned about training their programmers in R without a modern development environment available. So we’re addressing that by improving R to make the “standard” features programmers expect (like step debugging and variable inspection) work in R and integrating it with the standard environment for programmers on Windows, Visual Studio.

In my opinion R’s strength lies in its combination of high-quality statistical algorithms with a language ideal for applying them, so “hiding” the language behind a general-purpose GUI negates that strength a bit, I think. On the other hand, it would be nice to have an open-source, user-friendly tool for desktop statistical analysis, so I’m glad others are working to extend R in that area.

Ajay- Companies like SAS are investing in SaaS and cloud computing. Zementis offers scored models on the cloud through PMML. Any views on building the model or the analytics on the cloud itself?

David- To me, cloud computing is a cost-effective way of dynamically scaling hardware to the problem at hand. Not everyone has access to a 20-machine cluster for high-performance computing – and even those that do can’t instantly convert it to a cluster of 100 or 1000 machines to satisfy a sudden spike in demand. REvolution R Enterprise with ParallelR is unique in that it provides a platform for creating sophisticated data analysis applications distributed in the cloud, quickly and easily.

Using clouds for building models is a no-brainer for parallel-computing problems: I recently wrote about how parallel backtesting for financial trading can easily be deployed on Amazon EC2, for example. PMML is a great way of deploying static models, but one of the big advantages of cloud computing is that it makes it possible to update your model much more frequently, to keep your predictions in tune with the latest source data.

Ajay- What are the major alliances that REvolution has in the industry?

David- We have a number of industry partners. Microsoft and Intel, in particular, provide financial and technical support allowing us to really strengthen and optimize R on Windows, a platform that has been somewhat underserved by the open-source community. With Sybase, we’ve been working on combining REvolution R and Sybase RAP to produce some exciting advances in financial risk analytics. Similarly, we’ve been doing work with Vhayu’s Velocity database to provide high-performance data extraction. On the life sciences front, Pfizer is not only a valued client but in many ways a partner who has helped us “road-test” commercial-grade R deployment with great success.

Ajay- What are the major R packages that REvolution supports and optimizes and how exactly do they work/help?

David- REvolution R works with all the R packages: in fact, we provide a mirror of CRAN so our subscribers have access to the truly amazing breadth and depth of analytic and graphical methods available in third-party R packages. Those packages that perform intensive mathematical calculations automatically benefit from the optimized math libraries that we incorporate in REvolution R Enterprise. In the future, we plan to work with authors of some key packages to provide further improvements — in particular, to make packages work with ParallelR to reduce computation times in multiprocessor or cloud computing environments.

Ajay- Are you planning to lay off people during the recession? Does REvolution Computing offer internships to college graduates? What do people at REvolution Computing do to have fun?

David- On the contrary, we’ve been hiring recently. We don’t have an intern program in place just yet, though. For me, it’s been a really fun place to work. Working for an open-source company has a different vibe than the commercial software companies I’ve worked for before. The most fun for me has been meeting with R users around the country and sharing stories about how R is really making a difference in so many different venues — over a few beers of course!


David Smith
Director of Community

David has a long history with the statistical community. After graduating with a degree in Statistics from the University of Adelaide, South Australia, David spent four years researching statistical methodology at Lancaster University (United Kingdom), where he also developed a number of packages for the S-PLUS statistical modeling environment. David continued his association with S-PLUS at Insightful (now TIBCO Spotfire) where for more than eight years he oversaw the product management of S-PLUS and other statistical and data mining products. David is the co-author (with Bill Venables) of the tutorial manual, An Introduction to R, and one of the originating developers of ESS: Emacs Speaks Statistics. Prior to joining REvolution, David was Vice President, Product Management at Zynchros, Inc.

To know more about David Smith and REvolution Computing, do visit http://www.revolution-computing.com and

http://www.blog.revolution-computing.com
Also see the interview with Richard Schultz, CEO, REvolution Computing, here:

http://www.decisionstats.com/2009/01/31/interviewrichard-schultz-ceo-revolution-computing/

Business Intelligence and The Heisenberg Principle

The Heisenberg uncertainty principle states that, for certain pairs of qualities, accuracy in knowing one quality (say, the position of an atom) has to be traded off against certainty about another (like its momentum). I was drawn towards an application of this during an email exchange with Edith Ohri, a leading data mining expert in Israel who has her own customized GT solution. Edith said that it seems impossible to have data that is both accurate (data quality) and easy to view across organizations (data transparency). More often than not, the metrics that we measure are the metrics we are forced to measure due to data adequacy and data quality issues.

Now, there exists a tradeoff in the price of perfect information in managerial economics, but is it really true that the Business Intelligence we deploy is more often than not constrained by simple things like input data and historic database tables? And that, more often than not, data quality is the critical constraint that determines the speed and efficacy of deployment?

I personally find that much more of the time in database projects goes into data measurement, aggregation, massaging outliers, and missing-value assumptions than into “high value” activities like insight generation and business issue resolution.

Is it really true? Is the analysis easy, and the data what’s tough?

What do you think of the uncertainty inherent in data quality and data transparency?

Ten ways to build a wrong scoring model


Some ways to build a wrong scoring model are listed below. The author doesn’t offer any guarantee if your modeling team is using one of these and still getting a correct model.

1) Over-fit the model to the sample. Over-fitting can be checked by taking another random sample, applying the scoring equation, and comparing predicted conversion rates with actual conversion rates. An over-fit model does not rank-order: deciles with lower average predicted probability may show equal or more conversions than deciles with higher probability scores.

2) Choose non-random samples for building and validating the scoring equation. See over-fitting above.

3) Use multicollinearity (http://en.wikipedia.org/wiki/Multicollinearity) without business judgment to remove variables which may make business sense. This usually happens a few years after you studied, and then forgot, multicollinearity.

If you don’t know the difference between multicollinearity and heteroskedasticity (http://en.wikipedia.org/wiki/Heteroskedasticity), this could be the real deal-breaker for you.

4) Use legacy code for running scoring, usually with stepwise forward and backward regression. This usually happens on Fridays and when in a hurry to make models.

5) Ignore the signs or magnitudes of parameter estimates (that is, the weights of the variables in the equation).

6) Don’t learn the difference between Type I and Type II errors, especially when rejecting variables based on p-value. (If you don’t know what a p-value is, you may kindly stop reading and click the YouTube video in the right margin.)

7) Remove variables with excessive zeal. Why? Ask yourself this question every time you remove a variable.

8) Use the wrong causal event (like loan mailings) to predict the future with a scoring model (for deposit-account mailings), or use the right causal event in the wrong environment (a rapid decline or rise in sales due to factors not present in the model, like a competitor entering or going out of business, oil prices, or credit shocks – sob sob sigh).

9) Over-fitting (again).

10) Learn about creating models from blogs, and not by reading and refreshing your old statistics textbooks.
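The rank-ordering check from point 1 can be sketched in a few lines. This is an illustration (in Python, not tied to any particular scoring tool, with simulated data standing in for a real validation sample): sort cases by predicted score, cut them into deciles, and verify that the actual conversion rate falls as you move from the top decile to the bottom. A model that fails this on a fresh random sample is likely over-fit.

```python
import random

def decile_conversion_rates(scores, outcomes):
    """Sort cases by predicted score (highest first), split into ten
    deciles, and return the actual conversion rate in each decile."""
    paired = sorted(zip(scores, outcomes), key=lambda p: -p[0])
    n = len(paired)
    rates = []
    for d in range(10):
        segment = paired[d * n // 10:(d + 1) * n // 10]
        rates.append(sum(outcome for _, outcome in segment) / len(segment))
    return rates

# Simulated validation sample: the score tracks the true conversion
# probability, so a sound model should rank-order cleanly.
random.seed(1)
scores = [random.random() for _ in range(5000)]
converted = [1 if random.random() < s else 0 for s in scores]

rates = decile_conversion_rates(scores, converted)
print(rates[0], rates[-1])  # top decile should convert far more than the bottom
```

On a real project you would run this on a holdout sample scored with the frozen equation; flat or inverted decile rates are the warning sign the listicle describes.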

Award : Analyticbridge.com

Blog post on http://www.analyticbridge.com/group/memberofthemonth

Ajay Ohri has been selected as our Member of the Month for the second time. Most recently, Ajay recruited great new members, posted numerous interesting messages on his blog, added many applications and feeds to his profile, and contributed in many other ways to make AnalyticBridge better.
To be eligible for the Member of the Month award, a candidate must ….


More details here

http://www.analyticbridge.com/group/memberofthemonth

ps — I am still waiting for the SAS Rookie Slumdog of the Year Award.

Google Chrome won’t work with Google Analytics

As the screenshot shows, it treats the Google Analytics Terms of Service as unreliable content… a Freudian coding error?

[screenshot]