Here is an interview with noted Data Mining practitioner Michael Berry, author of seminal books in data mining, noted trainer and consultant
Ajay- Your famous book “Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management” came out in 2004, and an update is being planned for 2011. What are the various new data mining techniques and their application that you intend to talk about in that book.
Michael- Each time we do a revision, it feels like writing a whole new book. The first edition came out in 1997 and it is hard to believe how much the world has changed since then. I’m currently spending most of my time in the on-line retailing world. The things I worry about today–improving recommendations for cross-sell and up-sell,and search engine optimization–wouldn’t have even made sense to me back then. And the data sizes that are routine today were beyond the capacity of the most powerful super computers of the nineties. But, if possible, Gordon and I have changed even more than the data mining landscape. What has changed us is experience. We learned an awful lot between the first and second editions, and I think we’ve learned even more between the second and third.
One consequence is that we now have to discipline ourselves to avoid making the book too heavy to lift. For the first edition, we could write everything we knew (and arguably, a bit more!); now we have to remind ourselves that our intended audience is still the same–intelligent laymen with a practical interest in getting more information out of data. Not statisticians. Not computer scientists. Not academic researchers. Although we welcome all readers, we are primarily writing for someone who works in a marketing department and has a title with the word “analyst” or “analytics” in it. We have relaxed our “no equations” rule slightly for cases when the equations really do make things easier to explain, but the core explanations are still in words and pictures.
The third edition completes a transition that was already happening in the second edition. We have fully embraced standard statistical modeling techniques as full-fledged components of the data miner’s toolkit. In the first edition, it seemed important to make a distinction between old, dull, statistics, and new, cool, data mining. By the second edition, we realized that didn’t really make sense, but remnants of that attitude persisted. The third edition rectifies this. There is a chapter on statistical modeling techniques that explains linear and logistic regression, naive Bayes models, and more. There is also a brand new chapter on text mining, a curious omission from previous editions.
There is also a lot more material on data preparation. Three whole chapters are devoted to various aspects of data preparation. The first focuses on creating customer signatures. The second is focused on using derived variables to bring information to the surface, and the third deals with data reduction techniques such as principal components. Since this is where we spend the greatest part of our time in our work, it seemed important to spend more time on these subjects in the book as well.
Some of the chapters have been beefed up a bit. The neural network chapter now includes radial basis functions in addition to multi-layer perceptrons. The clustering chapter has been split into two chapters to accommodate new material on soft clustering, self-organizing maps, and more. The survival analysis chapter is much improved and includes material on some of our recent application of survival analysis methods to forecasting. The genetic algorithms chapter now includes a discussion of swarm intelligence.
Ajay- Describe your early career and how you came into Data Mining as a profession. What do you think of various universities now offering MS in Analytics. How do you balance your own teaching experience with your consulting projects at The Data Miners.
Michael- I fell into data mining quite by accident. I guess I always had a latent interest in the topic. As a high school and college student, I was a fan of Martin Gardner‘s mathematical games in in Scientific American. One of my favorite things he wrote about was a game called New Eleusis in which one players, God, makes up a rule to govern how cards can be played (“an even card must be followed by a red card”, say) and the other players have to figure out the rule by watching what plays are allowed by God and which ones are rejected. Just for my own amusement, I wrote a computer program to play the game and presented it at the IJCAI conference in, I think, 1981.
That paper became a chapter in a book on computer game playing–so my first book was about finding patterns in data. Aside from that, my interest in finding patterns in data lay dormant for years. At Thinking Machines, I was in the compiler group. In particular, I was responsible for the run-time system of the first Fortran Compiler for the CM-2 and I represented Thinking Machines at the Fortran 8X (later Fortran-90) standards meetings.
What changed my direction was that Thinking Machines got an export license to sell our first machine overseas. The machine went to a research lab just outside of Paris. The connection machine was so hard to program, that if you bought one, you got an applications engineer to go along with it. None of the applications engineers wanted to go live in Paris for a few months, but I did.
Paris was a lot of fun, and so, I discovered, was actually working on applications. When I came back to the states, I stuck with that applied focus and my next assignment was to spend a couple of years at Epsilon, (then a subsidiary of American Express) working on a database marketing system that stored all the “records of charge” for American Express card members. The purpose of the system was to pick ads to go in the billing envelope. I also worked on some more general purpose data mining software for the CM-5.
When Thinking Machines folded, I had the opportunity to open a Cambridge office for a Virginia-based consulting company called MRJ that had been a major channel for placing Connection Machines in various government agencies. The new group at MRJ was focused on data mining applications in the commercial market. At least, that was the idea. It turned out that they were more interested in data warehousing projects, so after a while we parted company.
That led to the formation of Data Miners. My two partners in Data Miners, Gordon Linoff and Brij Masand, share the Thinking Machines background.
To tell the truth, I really don’t know much about the university programs in data mining that have started to crop up. I’ve visited the one at NC State, but not any of the others.
I myself teach a class in “Marketing Analytics” at the Carroll School of Management at Boston College. It is an elective part of the MBA program there. I also teach short classes for corporations on their sites and at various conferences.
Ajay- At the previous Predictive Analytics World, you took a session on Forecasting and Predicting Subsciber levels (http://www.predictiveanalyticsworld.com/dc/2009/agenda.php#day2-6) .
It seems inability to forecast is a problem many many companies face today. What do you think are the top 5 principles of business forecasting which companies need to follow.
Michael- I don’t think I can come up with five. Our approach to forecasting is essentially simulation. We try to model the underlying processes and then turn the crank to see what happens. If there is a principal behind that, I guess it is to approach a forecast from the bottom up rather than treating aggregate numbers as a time series.
Ajay- You often partner your talks with SAS Institute, and your blog at http://blog.data-miners.com/ sometimes contain SAS code as well. What particular features of the SAS software do you like. Do you use just the Enterprise Miner or other modules as well for Survival Analysis or Forecasting.
Michael- Our first data mining class used SGI’s Mineset for the hands-on examples. Later we developed versions using Clementine, Quadstone, and SAS Enterprise Miner. Then, market forces took hold. We don’t market our classes ourselves, we depend on others to market them and then share in the revenue.
SAS turned out to be much better at marketing our classes than the other companies, so over time we stopped updating the other versions. An odd thing about our relationship with SAS is that it is only with the education group. They let us use Enterprise Miner to develop course materials, but we are explicitly forbidden to use it in our consulting work. As a consequence, we don’t use it much outside of the classroom.
Ajay- Also any other software you use (apart from SQL and J)
Michael- We try to fit in with whatever environment our client has set up. That almost always is SQL-based (Teradata, Oracle, SQL Server, . . .). Often SAS Stat is also available and sometimes Enterprise Miner.
We run into SPSS, Statistica, Angoss, and other tools as well. We tend to work in big data environments so we’ve also had occasion to use Ab Initio and, more recently, Hadoop. I expect to be seeing more of that.
Together with his colleague, Gordon Linoff, Michael Berry is author of some of the most widely read and respected books on data mining. These best sellers in the field have been translated into many languages. Michael is an active practitioner of data mining. His books reflect many years of practical, hands-on experience down in the data mines.
A data mining educator as well as a consultant, Michael is in demand as a keynote speaker and seminar leader in the area of data mining generally and the application of data mining to customer relationship management in particular.
Prior to founding Data Miners in December, 1997, Michael spent 8 years at Thinking Machines Corporation. There he specialized in the application of massively parallel supercomputing techniques to business and marketing applications, including one of the largest database marketing systems of the time.