Interview Dr. Ian Fellows #rstats Deducer

Here is an interview with Dr Ian Fellows, creator of acclaimed packages in R like Deducer and the Founder and President of
Ajay- Describe your involvement with the Deducer Project, and the various plugins associated with it. What has been the usage and response for Deducer from R Community.
Ian- Deducer is a graphical user interface for data analysis built on R. It sprung out of a disconnect between the toolchain used by myself and the toolchain of the psychologists that I worked with at the University of California, San DIego. They were primarily SPSS user, whereas I liked to use R, especially for anything that was not a standard analysis.
I felt that there was a big gap in the audience that R serves. Not all consumers or producers of statistics can be expected to have the computational background (command-line programming) that R requires. I think it is important to recognize and work with the areas of expertise that statistical users have. I’m not an expert in psychology, and they didn’t expect me to be one. They are not experts in computation, and I don’t think that we should expect them to be in order to be a part of the R toolchain community.ian
This was the impetus behind Deducer, so it is fundamentally designed to be a familiar experience for users coming from an SPSS background and provides a full implementation of the standard methods in statistics, and data manipulation from descriptives to generalized linear models. Additionally, it has an advanced GUI for creating visualizations which has been well received, and won the John Chambers award for statistical software in 2011.
Uptake of the system is difficult to measure as CRAN does not track package downloads, but from what I can tell there has been a steadily increasing user base. The online manual has been accessed by over 75,000 unique users, with over 400,000 page views. There is a small, active group of developers creating add-on packages supporting various sub-diciplines of statistics. There are 8 packages on CRAN extending/using Deducer, and quite a few more on r-forge.
Ajay- Do you see any potential for Deducer as an enterprise software product (like R Studio et al)
Ian- Like R Studio, Deducer is used in enterprise environments but is not specifically geared towards that environment. I do see potential in that realm, but don’t have any particular plan to make an enterprise version of Deducer.
Ajay- Describe your work in Texas Hold’em Poker. Do you see any potential for R for diversifying into the casino analytics – which has hitherto been served exclusively by non open source analytics vendors.
Ian- As a Statistician, I’m very much interested in problems of inference under uncertainty, especially when the problem space is huge. Creating an Artificial Intelligence that can play (heads-up limit) Texas Hold’em Poker at a high level is a perfect example of this. There is uncertainty created by the random drawing of cards, the problem space is 10^{18}, and our opponent can adapt to any strategy that we employ.
While high level chess A.I.s have existed for decades, the first viable program to tackle full scale poker was introduced in 2003 by the incomparable Computer Poker Research group at the University of Alberta. Thus poker represents a significant challenge which can be used as a test bed to break new ground in applied game theory. In 2007 and 2008 I submitted entries to the AAA’s annual computer poker competition, which pits A.I.s from universities across the world against each other. My program, which was based on an approximate game theoretic equilibrium calculated using a co-evolutionary process called fictitious play, came in second behind the Alberta team.
Ajay- Describe your work in social media analytics for R. What potential do you see for Social Network Analysis given the current usage of it in business analytics and business intelligence tools for enterprise.
Ian- My dissertation focused on new model classes for social network analysis ( and R has a great collection of tools for social network analysis in the statnet suite of packages, which represents the forefront of the literature on the statistical modeling of social networks. I think that if the analytics data is small enough for the models to be fit, these tools can represent a qualitative leap in the understanding and prediction of user behavior.
Most uses of social networks in enterprise analytics that I have seen are limited to descriptive statistics (what is a user’s centrality; what is the degree distribution), and the use of these descriptive statistics as fixed predictors in a model. I believe that this approach is an important first step, but ignores the stochastic nature of the network, and the dynamics of tie formation and dissolution. Realistic modeling of the network can lead to more principled, and more accurate predictions of the quantities that enterprise users care about.
The rub is that the Markov Chain Monte Carlo Maximum Likelihood algorithms used to fit modern generative social network models (such as exponential-family random graph models) do not scale well at all. These models are typically limited to fitting networks with fewer than 50,000 vertices, which is clearly insufficient for most analytics customers who have networks more on the order of 50,000,000.
This problem is not insoluble though. Part of my ongoing research involves scalable algorithms for fitting social network models.
Ajay- You decided to go from your Phd into consulting ( . What were some of the options you considered in this career choice.
Ian– I’ve been working in the role of a statistical consultant for the last 7 years, starting as an in-house consultant at UCSD after obtaining my MS. Fellows Statistics has been operating for the last 3 years, though not fulltime until January of this year. As I had already been consulting, it was a natural progression to transition to consulting fulltime once I graduated with my Phd.
This has allowed me to both work on interesting corporate projects, and continue research related to my dissertation via sub-awards from various universities.
Ajay- What does offer in its consulting practice.
Ian– Fellows Statistics offers personalized analytics services to both corporate and academic clients. We are a boutique company, that can scale from a single statistician to a small team of analysts chosen specifically with the client’s needs in mind. I believe that by being small, we can provide better, close-to-the-ground responsive service to our clients.
As a practice, we live at the intersection of mathematical sophistication, and computational skill, with a hint of UI design thrown into the mix. Corporate clients can expect a diverse range of analytic skills from the development of novel algorithms to the design and presentation of data for a general audience. We’ve worked with Revolution Analytics developing algorithms for their ScaleR product, the Center for Disease Control developing graphical user interfaces set to be deployed for world-wide HIV surveillance, and Prospectus analyzing clinical trial data for retinal surgery. With access to the cutting edge research taking place in the academic community, and the skills to implement them in corporate environments, Fellows Statistics is able to offer clients world-class analytics services.
Ajay- How does big data affect the practice of statistics in business decisions.
Ian– There is a big gap in terms of how the basic practice of statistics is taught in most universities, and the types of analyses that are useful when data sizes become large. Back when I was at UCSD, I remember a researcher there jokingly saying that everything is correlated rho=.2. He was joking, but there is a lot of truth to that statement. As data sizes get larger everything becomes significant if a hypothesis test is done, because the test has the power to detect even trivial relationships.
Ajay- How is the R community including developers coping with the Big Data era? What do you think R can do more for Big Data?
Ian- On the open source side, there has been a lot of movement to improve R’s handling of big data. The bigmemory project and the ff package both serve to extend R’s reach beyond in-memory data structures.  Revolution Analytics also has the ScaleR package, which costs money, but is lightning fast and has an ever growing list of analytic techniques implemented. There are also several packages integrating R with hadoop.
Ajay- Describe your research into data visualization including word cloud and other packages. What do you think of Shiny, D3.Js and online data visualization?
Ian- I recently had the opportunity to delve into d3.js for a client project, and absolutely love it. Combined with Shiny, d3 and R one can very quickly create a web visualization of an R modeling technique. One limitation of d3 is that it doesn’t work well with internet explorer 6-8. Once these browsers finally leave the ecosystem, I expect an explosion of sites using d3.
Ajay- Do you think wordcloud is an overused data visualization type and how can it be refined?
Ian- I would say yes, but not for the reasons you would think. A lot of people criticize word clouds because they convey the same information as a bar chart, but with less specificity. With a bar chart you can actually see the frequency, whereas you only get a relative idea with word clouds based on the size of the word.
I think this is both an absolutely correct statement, and misses the point completely. Visualizations are about communicating with the reader. If your readers are statisticians, then they will happily consume the bar chart, following the bar heights to their point on the y-axis to find the frequencies. A statistician will spend time with a graph, will mull it over, and consider what deeper truths are found there. Statisticians are weird though. Most people care as much about how pretty the graph looks as its content. To communicate to these people (i.e. everyone else) it is appropriate and right to sacrifice statistical specificity to design considerations. After all, if the user stops reading you haven’t conveyed anything.
But back to the question… I would say that they are over used because they represent a very superficial analysis of a text or corpus. The word counts do convey an aspect of a text, but not a very nuanced one. The next step in looking at a corpus of texts would be to ask how are they different and how are they the same. The wordcloud package has the comparison and commonality word clouds, which attempt to extend the basic word cloud to answer these questions (see:

Dr. Ian Fellows is a professional statistician based out of the University of California, Los Angeles. His research interests range over many sub-disciplines of statistics. His work in statistical visualization won the prestigious John Chambers Award in 2011, and in 2007-2008 his Texas Hold’em AI programs were ranked second in the world.

Applied data analysis has been a passion for him, and he is accustomed to providing accurate, timely analysis for a wide range of projects, and assisting in the interpretation and communication of statistical results. He can be contacted at

Interview John Myles White , Machine Learning for Hackers

Here is an interview with one of the younger researchers  and rock stars of the R Project, John Myles White,  co-author of Machine Learning for Hackers.

Ajay- What inspired you guys to write Machine Learning for Hackers. What has been the public response to the book. Are you planning to write a second edition or a next book?

John-We decided to write Machine Learning for Hackers because there were so many people interested in learning more about Machine Learning who found the standard textbooks a little difficult to understand, either because they lacked the mathematical background expected of readers or because it wasn’t clear how to translate the mathematical definitions in those books into usable programs. Most Machine Learning books are written for audiences who will not only be using Machine Learning techniques in their applied work, but also actively inventing new Machine Learning algorithms. The amount of information needed to do both can be daunting, because, as one friend pointed out, it’s similar to insisting that everyone learn how to build a compiler before they can start to program. For most people, it’s better to let them try out programming and get a taste for it before you teach them about the nuts and bolts of compiler design. If they like programming, they can delve into the details later.

We once said that Machine Learning for Hackers  is supposed to be a chemistry set for Machine Learning and I still think that’s the right description: it’s meant to get readers excited about Machine Learning and hopefully expose them to enough ideas and tools that they can start to explore on their own more effectively. It’s like a warmup for standard academic books like Bishop’s.
The public response to the book has been phenomenal. It’s been amazing to see how many people have bought the book and how many people have told us they found it helpful. Even friends with substantial expertise in statistics have said they’ve found a few nuggets of new information in the book, especially regarding text analysis and social network analysis — topics that Drew and I spend a lot of time thinking about, but are not thoroughly covered in standard statistics and Machine Learning  undergraduate curricula.
I hope we write a second edition. It was our first book and we learned a ton about how to write at length from the experience. I’m about to announce later this week that I’m writing a second book, which will be a very short eBook for O’Reilly. Stay tuned for details.

Ajay-  What are the key things that a potential reader can learn from this book?

John- We cover most of the nuts and bolts of introductory statistics in our book: summary statistics, regression and classification using linear and logistic regression, PCA and k-Nearest Neighbors. We also cover topics that are less well known, but are as important: density plots vs. histograms, regularization, cross-validation, MDS, social network analysis and SVM’s. I hope a reader walks away from the book having a feel for what different basic algorithms do and why they work for some problems and not others. I also hope we do just a little to shift a future generation of modeling culture towards regularization and cross-validation.

Ajay- Describe your journey as a science student up till your Phd. What are you current research interests and what initiatives have you done with them?

John-As an undergraduate I studied math and neuroscience. I then took some time off and came back to do a Ph.D. in psychology, focusing on mathematical modeling of both the brain and behavior. There’s a rich tradition of machine learning and statistics in psychology, so I got increasingly interested in ML methods during my years as a grad student. I’m about to finish my Ph.D. this year. My research interests all fall under one heading: decision theory. I want to understand both how people make decisions (which is what psychology teaches us) and how they should make decisions (which is what statistics and ML teach us). My thesis is focused on how people make decisions when there are both short-term and long-term consequences to be considered. For non-psychologists, the classic example is probably the explore-exploit dilemma. I’ve been working to import more of the main ideas from stats and ML into psychology for modeling how real people handle that trade-off. For psychologists, the classic example is the Marshmallow experiment. Most of my research work has focused on the latter: what makes us patient and how can we measure patience?

Ajay- How can academia and private sector solve the shortage of trained data scientists (assuming there is one)?

John- There’s definitely a shortage of trained data scientists: most companies are finding it difficult to hire someone with the real chops needed to do useful work with Big Data. The skill set required to be useful at a company like Facebook or Twitter is much more advanced than many people realize, so I think it will be some time until there are undergraduates coming out with the right stuff. But there’s huge demand, so I’m sure the market will clear sooner or later.

The changes that are required in academia to prepare students for this kind of work are pretty numerous, but the most obvious required change is that quantitative people need to be learning how to program properly, which is rare in academia, even in many CS departments. Writing one-off programs that no one will ever have to reuse and that only work on toy data sets doesn’t prepare you for working with huge amounts of messy data that exhibit shifting patterns. If you need to learn how to program seriously before you can do useful work, you’re not very valuable to companies who need employees that can hit the ground running. The companies that have done best in building up data teams, like LinkedIn, have learned to train people as they come in since the proper training isn’t typically available outside those companies.
Of course, on the flipside, the people who do know how to program well need to start learning more about theory and need to start to have a better grasp of basic mathematical models like linear and logistic regressions. Lots of CS students seem not to enjoy their theory classes, but theory really does prepare you for thinking about what you can learn from data. You may not use automata theory if you work at Foursquare, but you will need to be able to reason carefully and analytically. Doing math is just like lifting weights: if you’re not good at it right now, you just need to dig in and get yourself in shape.
John Myles White is a Phd Student in  Ph.D. student in the Princeton Psychology Department, where he studies human decision-making both theoretically and experimentally. Along with the political scientist Drew Conway, he is  the author of a book published by O’Reilly Media entitled “Machine Learning for Hackers”, which is meant to introduce experienced programmers to the machine learning toolkit. He is also working with Mark Hansenon a book for laypeople about exploratory data analysis.John is the lead maintainer for several R packages, including ProjectTemplate and log4r.

(TIL he has played in several rock bands!)

You can read more in his own words at his blog at
He can be contacted via social media at Google Plus at or twitter at

Book Review- Machine Learning for Hackers

This is review of the fashionably named book Machine Learning for Hackers by Drew Conway and John Myles White (O’Reilly ). The book is about hacking code in R.


The preface introduces the reader to the authors conception of what machine learning and hacking is all about. If the name of the book was machine learning for business analytsts or data miners, I am sure the content would have been unchanged though the popularity (and ambiguity) of the word hacker can often substitute for its usefulness. Indeed the many wise and learned Professors of statistics departments through out the civilized world would be mildly surprised and bemused by their day to day activities as hacking or teaching hackers. The book follows a case study and example based approach and uses the GGPLOT2 package within R programming almost to the point of ignoring any other native graphics system based in R. It can be quite useful for the aspiring reader who wishes to understand and join the booming market for skilled talent in statistical computing.

Chapter 1 has a very useful set of functions for data cleansing and formatting. It walks you through the basics of formatting based on dates and conditions, missing value and outlier treatment and using ggplot package in R for graphical analysis. The case study used is an Infochimps dataset with 60,000 recordings of UFO sightings. The case study is lucid, and done at a extremely helpful pace illustrating the powerful and flexible nature of R functions that can be used for data cleansing.The chapter mentions text editors and IDEs but fails to list them in a tabular format, while listing several other tables like Packages used in the book. It also jumps straight from installation instructions to functions in R without getting into the various kinds of data types within R or specifying where these can be referenced from. It thus assumes a higher level of basic programming understanding for the reader than the average R book.

Chapter 2 discusses data exploration, and has a very clear set of diagrams that explain the various data summary operations that are performed routinely. This is an innovative approach and will help students or newcomers to the field of data analysis. It introduces the reader to type determination functions, as well different kinds of encoding. The introduction to creating functions is quite elegant and simple , and numerical summary methods are explained adequately. While the chapter explains data exploration with the help of various histogram options in ggplot2 , it fails to create a more generic framework for data exploration or rules to assist the reader in visual data exploration in non standard data situations. While the examples are very helpful for a reader , there needs to be slightly more depth to step out of the example and into a framework for visual data exploration (or references for the same). A couple of case studies however elaborately explained cannot do justice to the vast field of data exploration and especially visual data exploration.

Chapter 3 discussed binary classification for the specific purpose for spam filtering using a dataset from SpamAssassin. It introduces the reader to the naïve Bayes classifier and the principles of text mining suing the tm package in R. Some of the example codes could have been better commented for easier readability in the book. Overall it is quite a easy tutorial for creating a naïve Bayes classifier even for beginners.

Chapter 4 discusses the issues in importance ranking and creating recommendation systems specifically in the case of ordering email messages into important and not important. It introduces the useful grepl, gsub, strsplit, strptime ,difftime and strtrim functions for parsing data. The chapter further introduces the reader to the concept of log (and affine) transformations in a lucid and clear way that can help even beginners learn this powerful transformation concept. Again the coding within this chapter is sparsely commented which can cause difficulties to people not used to learn reams of code. ( it may have been part of the code attached with the book, but I am reading an electronic book and I did not find an easy way to go back and forth between the code and the book). The readability of the chapters would be further enhanced by the use of flow charts explaining the path and process followed than overtly verbose textual descriptions running into multiple pages. The chapters are quite clearly written, but a helpful visual summary can help in both revising the concepts and elucidate the approach taken further.A suggestion for the authors could be to compile the list of useful functions they introduce in this book as a sort of reference card (or Ref Card) for R Hackers or atleast have a chapter wise summary of functions, datasets and packages used.

Chapter 5 discusses linear regression , and it is a surprising and not very good explanation of regression theory in the introduction to regression. However the chapter makes up in practical example what it oversimplifies in theory. The chapter on regression is not the finest chapter written in this otherwise excellent book. Part of this is because of relative lack of organization- correlation is explained after linear regression is explained. Once again the lack of a function summary and a process flow diagram hinders readability and a separate section on regression metrics that help make a regression result good or not so good could be a welcome addition. Functions introduced include lm.

Chapter 6 showcases Generalized Additive Model (GAM) and Polynomial Regression, including an introduction to singularity and of over-fitting. Functions included in this chapter are transform, and poly while the package glmnet is also used here. The chapter also introduces the reader formally to the concept of cross validation (though examples of cross validation had been introduced in earlier chapters) and regularization. Logistic regression is also introduced at the end in this chapter.

Chapter 7 is about optimization. It describes error metric in a very easy to understand way. It creates a grid by using nested loops for various values of intercept and slope of a regression equation and computing the sum of square of errors. It then describes the optim function in detail including how it works and it’s various parameters. It introduces the curve function. The chapter then describes ridge regression including definition and hyperparameter lamda. The use of optim function to optimize the error in regression is useful learning for the aspiring hacker. Lastly it describes a case study of breaking codes using the simplistic Caesar cipher, a lexical database and the Metropolis method. Functions introduced in this chapter include .Machine$double.eps .

Chapter 8 deals with Principal Component Analysis and unsupervised learning. It uses the ymd function from lubridate package to convert string to date objects, and the cast function from reshape package to further manipulate the structure of data. Using the princomp functions enables PCA in R.The case study creates a stock market index and compares the results with the Dow Jones index.

Chapter 9 deals with Multidimensional Scaling as well as clustering US senators on the basis of similarity in voting records on legislation .It showcases matrix multiplication using %*% and also the dist function to compute distance matrix.

Chapter 10 has the subject of K Nearest Neighbors for recommendation systems. Packages used include class ,reshape and and functions used include cor, function and log. It also demonstrates creating a custom kNN function for calculating Euclidean distance between center of centroids and data. The case study used is the R package recommendation contest on Kaggle. Overall a simplistic introduction to creating a recommendation system using K nearest neighbors, without getting into any of the prepackaged packages within R that deal with association analysis , clustering or recommendation systems.

Chapter 11 introduces the reader to social network analysis (and elements of graph theory) using the example of Erdos Number as an interesting example of social networks of mathematicians. The example of Social Graph API by Google for hacking are quite new and intriguing (though a bit obsolete by changes, and should be rectified in either the errata or next edition) . However there exists packages within R that should be atleast referenced or used within this chapter (like TwitteR package that use the Twitter API and ROauth package for other social networks). Packages used within this chapter include Rcurl, RJSONIO, and igraph packages of R and functions used include rbind and ifelse. It also introduces the reader to the advanced software Gephi. The last example is to build a recommendation engine for whom to follow in Twitter using R.

Chapter 12 is about model comparison and introduces the concept of Support Vector Machines. It uses the package e1071 and shows the svm function. It also introduces the concept of tuning hyper parameters within default algorithms . A small problem in understanding the concepts is the misalignment of diagram pages with the relevant code. It lastly concludes with using mean square error as a method for comparing models built with different algorithms.


Overall the book is a welcome addition in the library of books based on R programming language, and the refreshing nature of the flow of material and the practicality of it’s case studies make this a recommended addition to both academic and corporate business analysts trying to derive insights by hacking lots of heterogeneous data.

Have a look for yourself at-

Why open source companies dont dance?

I have been pondering on this seemingly logical paradox for some time now-

1) Why are open source solutions considered technically better but not customer friendly.

2) Why do startups and app creators in social media or mobile get much more press coverage than

profitable startups in enterprise software.

3) How does tech journalism differ in covering open source projects in enterprise versus retail software.

4) What are the hidden rules of the game of enterprise software.

Some observations-

1) Open source companies often focus much more on technical community management and crowd sourcing code. Traditional software companies focus much more on managing the marketing community of customers and influencers. Accordingly the balance of power is skewed in favor of techies and R and D in open source companies, and in favor of marketing and analyst relations in traditional software companies.

Traditional companies also spend much more on hiring top notch press release/public relationship agencies, while open source companies are both financially and sometimes ideologically opposed to older methods of marketing software. The reverse of this is you are much more likely to see Videos and Tutorials by an open source company than a traditional company. You can compare the websites of ClouderaDataStax, Hadapt ,Appistry and Mapr and contrast that with Teradata or Oracle (which has a much bigger and much more different marketing strategy.

Social media for marketing is also more efficiently utilized by smaller companies (open source) while bigger companies continue to pay influential analysts for expensive white papers that help present the brand.

Lack of budgets is a major factor that limits access to influential marketing for open source companies particularly in enterprise software.

2 and 3) Retail software is priced at 2-100$ and sells by volume. Accordingly technology coverage of these software is based on volume.

Enterprise software is much more expensively priced and has much more discreet volume or sales points. Accordingly the technology coverage of enterprise software is more discreet, in terms of a white paper coming every quarter, a webinar every month and a press release every week. Retail software is covered non stop , but these journalists typically do not charge for “briefings”.

Journalists covering retail software generally earn money by ads or hosting conferences. So they have an interest in covering new stuff or interesting disruptive stuff. Journalists or analysts covering enterprise software generally earn money by white papers, webinars, attending than hosting conferences, writing books. They thus have a much stronger economic incentive to cover existing landscape and technologies than smaller startups.

4) What are the hidden rules of the game of enterprise software.

  • It is mostly a white man’s world. this can be proved by statistical demographic analysis
  • There is incestuous intermingling between influencers, marketers, and PR people. This can be proved by simple social network analysis of who talks to who and how much. A simple time series between sponsorship and analysts coverage also will prove this (I am working on quantifying this ).
  • There are much larger switching costs to enterprise software than retail software. This leads to legacy shoddy software getting much chances than would have been allowed in an efficient marketplace.
  • Enterprise software is a less efficient marketplace than retail software in all definitions of the term “efficient markets”
  • Cloud computing, and SaaS and Open source threatens to disrupt the jobs and careers of a large number of people. In the long term, they will create many more jobs, but in the short term, people used to comfortable living of enterprise software (making,selling,or writing) will actively and passively resist these changes to the  paradigms in the current software status quo.
  • Open source companies dont dance and dont play ball. They prefer to hire 4 more college grads than commission 2 more white papers.

and the following with slight changes from a comment I made on a fellow blog-

  • While the paradigm on how to create new software has evolved from primarily silo-driven R and D departments to a broader collaborative effort, the biggest drawback is software marketing has not evolved.
  • If you want your own version of the open source community editions to be more popular, some standardization is necessary for the corporate decision makers, and we need better marketing paradigms.
  • While code creation is crowdsourced, solution implementation cannot be crowdsourced. Customers want solutions to a problem not code.
  • Just as open source as a production and licensing paradigm threatens to disrupt enterprise software, it will lead to newer ways to marketing software given the hostility of existing status quo.



Who writes white papers?

A social network diagram
Image via Wikipedia

There are four main types of commercial white papers:

  • Business benefits: Makes a business case for a certain technology or methodology.
  • Technical: Describes how a certain technology works.
  • Hybrid: Combines business benefits with technical details in a single document.
  • Policy: Makes a case for a certain political solution to a societal or economic challenge.
Name the best white paper you ever read? (comment that in the field)..
What categoy of white papers is the best?
Do you think white papers are too expensive or they give adequate ROI?
To be continued- including

  1. demographic and social network analysis of analysts and white paper sponsors to measure interaction effects.
  2. white papers segmented by type of software company
  3. proc freq analysis of the words frequency data viz in white papers written by same analysts for different companies on same topics.
  4. Race and ethnic analysis of influencers and analysts in Business Analysts and Business Intelligence. – Null hypothesis – it is not a white mans world, women, Hispanics and other minorities are adequately represented.
Why I am doing this?
I am writing a white paper on WHO writes a white paper? 
Sponsorships are invited- but academics and startups in analytics may be preferred.

Carole-Ann’s 2011 Predictions for Decision Management

Carole-Ann’s 2011 Predictions for Decision Management

For Ajay Ohri on

What were the top 5 events in 2010 in your field?
  1. Maturity: the Decision Management space was made up of technology vendors, big and small, that typically focused on one or two aspects of this discipline.  Over the past few years, we have seen a lot of consolidation in the industry – first with Business Intelligence (BI) then Business Process Management (BPM) and lately in Business Rules Management (BRM) and Advanced Analytics.  As a result the giant Platform vendors have helped create visibility for this discipline.  Lots of tiny clues finally bubbled up in 2010 to attest of the increasing activity around Decision Management.  For example, more products than ever were named Decision Manager; companies advertised for Decision Managers as a job title in their job section; most people understand what I do when I am introduced in a social setting!
  2. Boredom: unfortunately, as the industry matures, inevitably innovation slows down…  At the main BRMS shows we heard here and there complaints that the technology was stalling.  We heard it from vendors like Red Hat (Drools) and we heard it from bored end-users hoping for some excitement at Business Rules Forum’s vendor panel.  They sadly did not get it
  3. Scrum: I am not thinking about the methodology there!  If you have ever seen a rugby game, you can probably understand why this is the term that comes to mind when I look at the messy & confusing technology landscape.  Feet blindly try to kick the ball out while superhuman forces are moving randomly the whole pack – or so it felt when I played!  Business Users in search of Business Solutions are facing more and more technology choices that feel like comparing apples to oranges.  There is value in all of them and each one addresses a specific aspect of Decision Management but I regret that the industry did not simplify the picture in 2010.  On the contrary!  Many buzzwords were created or at least made popular last year, creating even more confusion on a muddy field.  A few examples: Social CRM, Collaborative Decision Making, Adaptive Case Management, etc.  Don’t take me wrong, I *do* like the technologies.  I sympathize with the decision maker that is trying to pick the right solution though.
  4. Information: Analytics have been used for years of course but the volume of data surrounding us has been growing to unparalleled levels.  We can blame or thank (depending on our perspective) Social Media for that.  Sites like Facebook and LinkedIn have made it possible and easy to publish relevant (as well as fluffy) information in real-time.  As we all started to get the hang of it and potentially over-publish, technology evolved to enable the storage, correlation and analysis of humongous volumes of data that we could not dream of before.  25 billion tweets were posted in 2010.  Every month, over 30 billion pieces of data are shared on Facebook alone.  This is not just about vanity and marketing though.  This data can be leveraged for the greater good.  Carlos pointed to some fascinating facts about catastrophic event response team getting organized thanks to crowd-sourced information.  We are also seeing, in the Decision management world, more and more applicability for those very technology that have been developed for the needs of Big Data – I’ll name for example Hadoop that Carlos (yet again) discussed in his talks at Rules Fest end of 2009 and 2010.
  5. Self-Organization: it may be a side effect of the Social Media movement but I must admit that I was impressed by the success of self-organizing initiatives.  Granted, this last trend has nothing to do with Decision Management per se but I think it is a great evolution worth noting.  Let me point to a couple of examples.  I usually attend traditional conferences and tradeshows in which the content can be good but is sometimes terrible.  I was pleasantly surprised by the professionalism and attendance at *un-conferences* such as P-Camp (P stands for Product – an event for Product Managers).  When you think about it, it is already difficult to get a show together when people are dedicated to the tasks.  How crazy is it to have volunteers set one up with no budget and no agenda?  Well, people simply show up to do their part and everyone has fun voting on-site for what seems the most appealing content at the time.  Crowdsourcing applied to shows: it works!  Similar experience with meetups or tweetups.  I also enjoyed attending some impromptu Twitter jam sessions on a given topic.  Social Media is certainly helping people reach out and get together in person or virtually and that is wonderful!

A segment of a social network
Image via Wikipedia

What are the top three trends you see in 2011?

  1. Performance:  I might be cheating here.   I was very bullish about predicting much progress for 2010 in the area of Performance Management in your Decision Management initiatives.  I believe that progress was made but Carlos did not give me full credit for the right prediction…  Okay, I am a little optimistic on timeline…  I admit it…  If it did not fully happen in 2010, can I predict it again in 2011?  I think that companies want to better track their business performance in order to correct the trajectory of course but also to improve their projections.  I see that it is turning into reality already here and there.  I expect it to become a trend in 2011!
  2. Insight: Big Data being available all around us with new technologies and algorithms will continue to propagate in 2011 leading to more widely spread Analytics capabilities.  The buzz at Analytics shows on Social Network Analysis (SNA) is a sign that there is interest in those kinds of things.  There is tremendous information that can be leveraged for smart decision-making.  I think there will be more of that in 2011 as initiatives launches in 2010 will mature into material results.
    5 Ways to Cultivate an Active Social Network
    Image by Intersection Consulting via Flickr
  3. Collaboration:  Social Media for the Enterprise is a discipline in the making.  Social Media was initially seen for the most part as a Marketing channel.  Over the years, companies have started experimenting with external communities and ideation capabilities with moderate success.  The few strategic initiatives started in 2010 by “old fashion” companies seem to be an indication that we are past the early adopters.  This discipline may very well materialize in 2011 as a core capability, well, or at least a new trend.  I believe that capabilities such Chatter, offered by Salesforce, will transform (slowly) how people interact in the workplace and leverage the volumes of social data captured in LinkedIn and other Social Media sites.  Collaboration is of course a topic of interest for me personally.  I even signed up for Kare Anderson’s collaboration collaboration site – yes, twice the word “collaboration”: it is really about collaborating on collaboration techniques.  Even though collaboration does not require Social Media, this medium offers perspectives not available until now.

Brief Bio-

Carole-Ann is a renowned guru in the Decision Management space. She created the vision for Decision Management that is widely adopted now in the industry. Her claim to fame is the strategy and direction of Blaze Advisor, the then-leading BRMS product, while she also managed all the Decision Management tools at FICO (business rules, predictive analytics and optimization). She has a vision for Decision Management both as a technology and a discipline that can revolutionize the way corporations do business, and will never get tired of painting that vision for her audience. She speaks often at Industry conferences and has conducted university classes in France and Washington DC.

Leveraging her Masters degree in Applied Mathematics / Computer Science from a “Grande Ecole” in France, she started her career building advanced systems using all kinds of technologies — expert systems, rules, optimization, dashboarding and cubes, web search, and beta version of database replication – as well as conducting strategic consulting gigs around change management.

She now tweets as @CMatignon, blogs at and interacts at

She started her career building advanced systems using all kinds of technologies — expert systems, rules, optimization, dashboarding and cubes, web search, and beta version of database replication.  At Cleversys (acquired by Kurt Salmon & Associates), she also conducted strategic consulting gigs mostly around change management.

While playing with advanced software components, she found a passion for technology and joined ILOG (acquired by IBM).  She developed a growing interest in Optimization as well as Business Rules.  At ILOG, she coined the term BRMS while brainstorming with her Sales counterpart.  She led the Presales organization for Telecom in the Americas up until 2000 when she joined Blaze Software (acquired by Brokat Technologies, HNC Software and finally FICO).

Her 360-degree experience allowed her to gain appreciation for all aspects of a software company, giving her a unique perspective on the business.  Her technical background kept her very much in touch with technology as she advanced.

She also became addicted to Twitter in the process.  She is active on all kinds of social media, always looking for new digital experience!

Outside of work, Carole-Ann loves spending time with her two boys.  They grow fruits in their Northern California home and cook all together in the French tradition.

profile on LinkedIn

TwitterFollow me on Twitter

Filtering to Gain Social Network Value
Image by Intersection Consulting via Flickr
Social Networks Hype Cycle
Image by fredcavazza via Flickr

PAW Videos

A message from Predictive Analytics World on  newly available videos. It has many free videos as well so you can check them out.

Predictive Analytics World March 2011 in San Francisco

Access PAW DC Session Videos Now

Predictive Analytics World is pleased to announce on-demand access to the videos of PAW Washington DC, October 2010, including over 30 sessions and keynotes that you may view at your convenience. Access this leading predictive analytics content online now:

View the PAW DC session videos online

Register by January 18th and receive $150 off the full 2-day conference program videos (enter code PAW150 at checkout)

Trial videos – view the following for no charge:

Select individual conference sessions, or recognize savings by registering for access to one or two full days of sessions. These on-demand videos deliver PAW DC right to your desk, covering hot topics and advanced methods such as:

Social data 

Text mining

Search marketing

Risk management

Survey analysis

Consumer privacy

Sales force optimization

Response & cross-sell

Recommender systems

Featuring experts such as:
Usama Fayyad, Ph.D.
CEO, Open Insights Former Chief Data Officer, Yahoo!

Andrew Pole
Sr Mgr, Media/DB Mktng
View Keynote for Free

John F. Elder, Ph.D.
CEO and Founder
Elder Research

Bruno Aziza
Director, Worldwide Strategy Lead, BI

Eric Siegel, Ph.D.
Conference Chair
Predictive Analytics World

PAW DC videos feature over 25 speakers with case studies from leading enterprises such as: CIBC, CEB, Forrester, Macy’s, MetLife, Microsoft, Miles Kimball,, Oracle, Paychex, SunTrust, Target, UPMC, Xerox, Yahoo!, YMCA, and more.

How video access works:

View Slides on the Left See & Hear Speaker in the Right Window

Sign up by January 18 for immediate video access and $150 discount

San Francisco
March 14-15, 2011
Washington DC
October, 2011
November, 2011
Contact Us

Produced by:


Session Gallery: Day 1 of 2

Viewing (17) Sessions of (31)


Add to Cart
Keynote: Five Ways Predictive Analytics Cuts Enterprise Risk  

Eric Siegel, Ph.D., Program Chair, Predictive Analytics World

All business is an exercise in risk management. All organizations would benefit from measuring, tracking and computing risk as a core process, much like insurance companies do.

Predictive analytics does the trick, one customer at a time. This technology is a data-driven means to compute the risk each customer will defect, not respond to an expensive mailer, consume a retention discount even if she were not going to leave in the first place, not be targeted for a telephone solicitation that would have landed a sale, commit fraud, or become a “loss customer” such as a bad debtor or an insurance policy-holder with high claims.

In this keynote session, Dr. Eric Siegel reveals:

– Five ways predictive analytics evolves your enterprise to reduce risk

– Hidden sources of risk across operational functions

– What every business should learn from insurance companies

– How advancements have reversed the very meaning of fraud

– Why “man + machine” teams are greater than the sum of their parts for enterprise decision support

Length – 00:45:57 | Email to a Colleague

Price: $195



Play video of session: Platinum Sponsor Presentation, Analytics: The Beauty of Diversity
Platinum Sponsor Presentation: Analytics – The Beauty of Diversity 

Anne H. Milley, Senior Director of Analytic Strategy, Worldwide Product Marketing, SAS

Analytics contributes to, and draws from, multiple disciplines. The unifying theme of “making the world a better place” is bred from diversity. For instance, the same methods used in econometrics might be used in market research, psychometrics and other disciplines. In a similar way, diverse paradigms are needed to best solve problems, reveal opportunities and make better decisions. This is why we evolve capabilities to formulate and solve a wide range of problems through multiple integrated languages and interfaces. Extending that, we have provided integration with other languages so that users can draw on the disciplines and paradigms needed to best practice their craft.

Length – 20:11 | Email to a Colleague

Free viewing enabled – no charge


gold sponsor.jpg
Play video of session: Gold Sponsor Presentation Predictive Analytics Accelerate Insight for Financial Services
Gold Sponsor Presentation: Predictive Analytics Accelerate Insight for Financial Services 

Finbarr Deely, Director of Business Development,ParAccel

Financial services organizations face immense hurdles in maintaining profitability and building competitive advantage. Financial services organizations must perform “what-if” scenario analysis, identify risks, and detect fraud patterns. The advanced analytic complexity required often makes such analysis slow and painful, if not impossible. This presentation outlines the analytic challenges facing these organizations and provides a clear path to providing the accelerated insight needed to perform in today’s complex business environment to reduce risk, stop fraud and increase profits. * The value of predictive analytics in Accelerating Insight * Financial Services Analytic Case Studies * Brief Overview of ParAccel Analytic Database

Length – 09:06 | Email to a Colleague

Free viewing enabled – no charge


Add to Cart
Case Study:
Creating Global Competitive Power with Predictive Analytics 

Jean Paul Isson, Vice President, Globab BI & Predictive Analytics, Monster Worldwide

Using Predictive analytics to gain a deeper understanding of customer behaviours, increase marketing ROI and drive growth

– Creating global competitive power with business intelligence: Making the right decisions – at the right time

– Avoiding common change management challenges in sales, marketing, customer service, and products

– Developing a BI vision – and implementing it: successful business intelligence implementation models

– Using predictive analytics as a business driver to stay on top of the competition

– Following the Monster Worldwide global BI evolution: How Monster used BI to go from good to great

Length – 51:17 | Email to a Colleague

Price: $195



Add to Cart
Case Study: YMCA
Turning Member Satisfaction Surveys into an Actionable Narrative 

Dean Abbott, President, Abbott Analytics

Employees are a key constituency at the Y and previous analysis has shown that their attitudes have a direct bearing on Member Satisfaction. This session will describe a successful approach for the analysis of YMCA employee surveys. Decision trees are built and examined in depth to identify key questions in describing key employee satisfaction metrics, including several interesting groupings of employee attitudes. Our approach will be contrasted with other factor analysis and regression-based approaches to survey analysis that we used initially. The predictive models described are currently in use and resulted in both greater understanding of employee attitudes, and a revised “short-form” survey with fewer key questions identified by the decision trees as the most important predictors.

Length – 50:19 | Email to a Colleague

Price: $195



Add to Cart
2010 Data Minter Survey Results: Highlights

Karl Rexer, Ph.D., Rexer Analytics

Do you want to know the views, actions, and opinions of the data mining community? Each year, Rexer Analytics conducts a global survey of data miners to find out. This year at PAW we unveil the results of our 4th Annual Data Miner Survey. This session will present the research highlights, such as:

– Analytic goals & key challenges

– Impact of the economy

– Regional differences

– Text mining trends

Length – 15:20 | Email to a Colleague

Price: $195



Add to Cart
Multiple Case Studies: U.S. DoD, U.S. DHS, SSA
Text Mining: Lessons Learned 

John F. Elder, Chief Scientist, Elder Research, Inc.

Text Mining is the “Wild West” of data mining and predictive analytics – the potential for gain is huge, the capability claims are often tall tales, and the “land rush” for leadership is very much a race.

In solving unstructured (text) analysis challenges, we found that principles from inductive modeling – learning relationships from labeled cases – has great power to enhance text mining. Dr. Elder highlights key technical breakthroughs discovered while working on projects for leading government agencies, including: Text Mining is the “Wild West” of data mining and predictive analytics – the potential for gain is huge, the capability claims are often tall tales, and the “land rush” for leadership is very much a race.

– Prioritizing searches for the Dept. of Homeland Security

– Quick decisions for Social Security Admin. disability

– Document discovery for the Dept. of Defense

– Disease discovery for the Dept. of Homeland Security

– Risk profiling for the Dept. of Defense

Length – 48:58 | Email to a Colleague

Price: $195



Play video of session: Keynote: How Target Gets the Most out of Its Guest Data to Improve Marketing ROI
Keynote: How Target Gets the Most out of Its Guest Data to Improve Marketing ROI 

Andrew Pole, Senior Manager, Media and Database Marketing, Target

In this session, you’ll learn how Target leverages its own internal guest data to optimize its direct marketing – with the ultimate goal of enhancing our guests’ shopping experience and driving in-store and online performance. You will hear about what guest data is available at Target, how and where we collect it, and how it is used to improve the performance and relevance of direct marketing vehicles. Furthermore, we will discuss Target’s development and usage of guest segmentation, response modeling, and optimization as means to suppress poor performers from mailings, determine relevant product categories and services for online targeted content, and optimally assign receipt marketing offers to our guests when offer quantities are limited.

Length – 47:49 | Email to a Colleague

Free viewing enabled – no charge


Play video of session: Platinum Sponsor Presentation: Driving Analytics Into Decision Making
Platinum Sponsor Presentation: Driving Analytics Into Decision Making  

Jason Verlen, Director, SPSS Product Strategy & Management, IBM Software Group

Organizations looking to dramatically improve their business outcomes are turning to decision management, a convergence of technology and business processes that is used to streamline and predict the outcome of daily decision-making. IBM SPSS Decision Management technology provides the critical link between analytical insight and recommended actions. In this session you’ll learn how Decision Management software integrates analytics with business rules and business applications for front-line systems such as call center applications, insurance claim processing, and websites. See how you can improve every customer interaction, minimize operational risk, reduce fraud and optimize results.

Length – 17:29 | Email to a Colleague

Free viewing enabled – no charge


Add to Cart
Case Study: Macy’s
The world is not flat (even though modeling software has to think it is) 

Paul Coleman, Director of Marketing Statistics, Macy’s Inc.

Software for statistical modeling generally use flat files, where each record represents a unique case with all its variables. In contrast most large databases are relational, where data are distributed among various normalized tables for efficient storage. Variable creation and model scoring engines are necessary to bridge data mining and storage needs. Development datasets taken from a sampled history require snapshot management. Scoring datasets are taken from the present timeframe and the entire available universe. Organizations, with significant data, must decide when to store or calculate necessary data and understand the consequences for their modeling program.

Length – 34:54 | Email to a Colleague

Price: $195



Add to Cart
Case Study: SunTrust
When One Model Will Not Solve the Problem – Using Multiple Models to Create One Solution 

Dudley Gwaltney, Group Vice President, Analytical Modeling, SunTrust Bank

In 2007, SunTrust Bank developed a series of models to identify clients likely to have large changes in deposit balances. The models include three basic binary and two linear regression models.

Based on the models, 15% of SunTrust clients were targeted as those most likely to have large balance changes. These clients accounted for 65% of the absolute balance change and 60% of the large balance change clients. The targeted clients are grouped into a portfolio and assigned to individual SunTrust Retail Branch. Since 2008, the portfolio generated a 2.6% increase in balances over control.

Using the SunTrust example, this presentation will focus on:

– Identifying situations requiring multiple models

– Determining what types of models are needed

– Combining the individual component models into one output

Length – 48:22 | Email to a Colleague

Price: $195



Add to Cart
Case Study: Paychex
Staying One Step Ahead of the Competition – Development of a Predictive 401(k) Marketing and Sales Campaign 

Jason Fox, Information Systems and Portfolio Manager,Paychex

In-depth case study of Paychex, Inc. utilizing predictive modeling to turn the tides on competitive pressures within their own client base. Paychex, a leading provider of payroll and human resource solutions, will guide you through the development of a Predictive 401(k) Marketing and Sales model. Through the use of sophisticated data mining techniques and regression analysis the model derives the probability a client will add retirement services products with Paychex or with a competitor. Session will include roadblocks that could have ended development and ROI analysis. Speaker: Frank Fiorille, Director of Enterprise Risk Management, Paychex Speaker: Jason Fox, Risk Management Analyst, Paychex

Length – 26:29 | Email to a Colleague

Price: $195



Add to Cart
Practitioner: Canadian Imperial Bank of Commerce
Segmentation Do’s and Don’ts 

Daymond Ling, Senior Director, Modelling & Analytics,Canadian Imperial Bank of Commerce

The concept of Segmentation is well accepted in business and has withstood the test of time. Even with the advent of new artificial intelligence and machine learning methods, this old war horse still has its place and is alive and well. Like all analytical methods, when used correctly it can lead to enhanced market positioning and competitive advantage, while improper application can have severe negative consequences.

This session will explore what are the elements of success, and what are the worse practices that lead to failure. The relationship between segmentation and predictive modeling will also be discussed to clarify when it is appropriate to use one versus the other, and how to use them together synergistically.

Length – 45:57 | Email to a Colleague

Price: $195



Add to Cart
Thought Leadership
Social Network Analysis: Killer Application for Cloud Analytics

James Kobielus, Senior Analyst, Forrester Research

Social networks such as Twitter and Facebook are a potential goldmine of insights on what is truly going through customers´minds. Every company wants to know whether, how, how often, and by whom they´re being mentioned across the billowing new cloud of social media. Just as important, every company wants to influence those discussions in their favor, target new business, and harvest maximum revenue potential. In this session, Forrester analyst James Kobielus identifies fruitful applications of social network analysis in customer service, sales, marketing, and brand management. He presents a roadmap for enterprises to leverage their inline analytics initiatives and leverage high-performance data warehousing (DW) clouds and appliances in order to analyze shifting patterns of customer sentiment, influence, and propensity. Leveraging Forrester’s ongoing research in advanced analytics and customer relationship management, Kobielus will discuss industry trends, commercial modeling tools, and emerging best practices in social network analysis, which represents a game-changing new discipline in predictive analytics.

Length – 48:16 | Email to a Colleague

Price: $195



Add to Cart
Case Study: Life Line Screening
Taking CRM Global Through Predictive Analytics 

Ozgur Dogan,
VP, Quantitative Solutions Group, Merkle Inc

Trish Mathe,
Director of Database Marketing, Life Line Screening

While Life Line is successfully executing a US CRM roadmap, they are also beginning this same evolution abroad. They are beginning in the UK where Merkle procured data and built a response model that is pulling responses over 30% higher than competitors. This presentation will give an overview of the US CRM roadmap, and then focus on the beginning of their strategy abroad, focusing on the data procurement they could not get anywhere else but through Merkle and the successful modeling and analytics for the UK. Speaker: Ozgur Dogan, VP, Quantitative Solutions Group, Merkle Inc Speaker: Trish Mathe, Director of Database Marketing, Life Line Screening

Length – 40:12 | Email to a Colleague

Price: $195



Add to Cart
Case Study: Forrester
Making Survey Insights Addressable and Scalable – The Case Study of Forrester’s Technographics Benchmark Survey 

Nethra Sambamoorthi, Team Leader, Consumer Dynamics & Analytics, Global Consulting, Acxiom Corporation

Marketers use surveys to create enterprise wide applicable strategic insights to: (1) develop segmentation schemes, (2) summarize consumer behaviors and attitudes for the whole US population, and (3) use multiple surveys to draw unified views about their target audience. However, these insights are not directly addressable and scalable to the whole consumer universe which is very important when applying the power of survey intelligence to the one to one consumer marketing problems marketers routinely face. Acxiom partnered with Forrester Research, creating addressable and scalable applications of Forrester’s Technographics Survey and applied it successfully to a number of industries and applications.

Length – 39:23 | Email to a Colleague

Price: $195



Add to Cart
Case Study: UPMC Health Plan
A Predictive Model for Hospital Readmissions 

Scott Zasadil, Senior Scientist, UPMC Health Plan

Hospital readmissions are a significant component of our nation’s healthcare costs. Predicting who is likely to be readmitted is a challenging problem. Using a set of 123,951 hospital discharges spanning nearly three years, we developed a model that predicts an individual’s 30-day readmission should they incur a hospital admission. The model uses an ensemble of boosted decision trees and prior medical claims and captures 64% of all 30-day readmits with a true positive rate of over 27%. Moreover, many of the ‘false’ positives are simply delayed true positives. 53% of the predicted 30-day readmissions are readmitted within 180 days.

Length – 54:18 | Email to a Colleague

Price: $195