DECISION STATS

KDD2009: Tutorials

Here are some great tutorials worth attending at KDD 2009

Source: http://www.sigkdd.org/kdd2009/tutorials.html#t1

Tutorials

All tutorials will be held on Sunday, June 28th.

Morning

T1 – Statistical Challenges in Computational Advertising
Deepayan Chakrabarti, Deepak Agarwal

T2 – How to do good research, get it published in SIGKDD and get it cited!
Eamonn Keogh

T3 – Large Graph-Mining: Power Tools and a Practitioner’s Guide
Christos Faloutsos, Gary Miller, Charalampos (Babis) Tsourakakis

T4 – Planning, Running, and Analyzing Controlled Experiments on the Web
Ronny Kohavi, Roger Longbotham, John Quarto-vonTivadar

Afternoon

T5 – Predictive Modelling in the Wild: Success Factors in Data Mining Competitions and Real-World Applications
Saharon Rosset, Claudia Perlich

T6 – New Directions in Data Quality Mining
Laure Berti-Equille, Tamraparni Dasu

T7 – Event Detection
Daniel Neill, Weng-Keen Wong

T8 – Advances in Mining the Web
Myra Spiliopoulou, Osmar Zaiane, Bamshad Mobasher, Olfa Nasraoui

T9 – Real World Text Mining
Ronen Feldman, Lyle Ungar

Abstracts

T1 – Statistical Challenges in Computational Advertising

Deepayan Chakrabarti, Deepak Agarwal

Many organizations now devote significant fractions of their advertising/outreach budgets to online advertising; ad-networks like Yahoo!, Google and MSN have responded by constructing new kinds of economic models and performing the fundamental task of matching the most relevant ads (selected from a large inventory) for a (query, user) pair in a given context. Nearly all of the challenges that arise are substantially data- or model-driven (or both). Computational Advertising is a relatively new scientific sub-discipline at the intersection of large scale search and text analysis, information retrieval, statistical modeling, machine learning, optimization and microeconomics that addresses this match-making problem and provides unprecedented opportunities to data miners.

Topics covered include a comprehensive introduction to several advertising forms (sponsored search, contextual advertising, display advertising), revenue models (pay-per-click, pay-per-view, pay-per-conversion) and the data mining challenges involved, along with an overview of state-of-the-art techniques in the area with a detailed discussion of open problems. We will cover information retrieval techniques and their limitations; data mining challenges involved in performing ad matching through clickstream data and challenging optimization issues that arise in display advertising. In particular, we will cover statistical modeling techniques for clickstream data and explore/exploit schemes to perform online experiments for better long-term performance using multi-armed bandit schemes. We also discuss the close relationship of techniques used in recommender systems to our problem but indicate several additional issues that need to be addressed before they become routine in computational advertising.

We will only assume basic knowledge of statistical methods; no prior knowledge of online advertising is required. In fact, the first hour, which provides an introduction to the area, would be appropriate for all registered attendees of KDD 2009. The second half would require familiarity with basic concepts like regression, probability distributions and appreciation of issues involved in fitting statistical models to large scale applications. No prior knowledge of multi-armed bandits would be assumed.
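For readers new to the explore/exploit idea mentioned in this abstract, here is a minimal sketch of one of the simplest multi-armed bandit schemes, epsilon-greedy. It is not taken from the tutorial; the function name, the simulated click-through rates and the parameter values are all assumptions chosen for illustration.

```python
import random

def epsilon_greedy(true_ctrs, rounds=10000, epsilon=0.1, seed=0):
    """Simulate epsilon-greedy ad selection.

    true_ctrs: hypothetical click-through rate of each ad (unknown to the policy).
    Returns (shows, clicks): how often each ad was shown and clicked.
    """
    rng = random.Random(seed)
    n = len(true_ctrs)
    shows = [0] * n
    clicks = [0] * n
    for _ in range(rounds):
        # Explore with probability epsilon, or while some ad has never been shown.
        if rng.random() < epsilon or 0 in shows:
            ad = rng.randrange(n)
        else:
            # Exploit: show the ad with the highest observed click-through rate.
            ad = max(range(n), key=lambda i: clicks[i] / shows[i])
        shows[ad] += 1
        # Simulate the user's click from the (hidden) true rate.
        if rng.random() < true_ctrs[ad]:
            clicks[ad] += 1
    return shows, clicks

if __name__ == "__main__":
    shows, clicks = epsilon_greedy([0.02, 0.05, 0.10])
    for i, (s, c) in enumerate(zip(shows, clicks)):
        print(f"ad {i}: shown {s} times, clicked {c} times")
```

A small epsilon spends most impressions on the ad that currently looks best while still gathering data on the others; real ad systems typically use more refined schemes.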

T2 – How to do good research, get it published in SIGKDD and get it cited!

Eamonn Keogh

While SIGKDD has traditionally enjoyed an unusually high quality of reviewing, there is no doubt that publishing in SIGKDD (and other high quality data mining conferences) is very challenging. This is especially true for young faculty, grad students whose primary advisor is not an experienced SIGKDD author, or people from outside the community (e.g., a biologist or mathematician who has a result that might greatly interest the data mining community).

In this tutorial Dr. Keogh will demonstrate some simple ideas to enhance the probability of success in getting your paper published in a top data mining conference; and after the work is published, getting it highly cited.

These tips and tricks are based on 12 years of experience as a SIGKDD author and reviewer, and wisdom solicited from many of the most prolific data mining researchers/reviewers.

Topics covered in the tutorial include:

Finding the right problems to work on (80% of the battle).

Don’t summarize, sell! Writing abstracts that put the reviewer on your side from the start.

Getting or creating the perfect dataset.

Experiments that tell a story.

Making effective and interesting figures.

Getting the reviewers on your side.

The top-ten avoidable reasons why papers get rejected from SIGKDD.

Three simple tricks to increase the number of citations to your work.

While Dr. Keogh does not claim to have a “magic bullet” for publishing in SIGKDD, his significant track record of publishing in top data mining venues, combined with extensive (and deliberately uncredited) experience in helping younger researchers “break in” to SIGKDD, has placed him in a unique position to share useful and actionable advice.

While writing this tutorial, Dr. Keogh sought and received advice from many respected data mining researchers; their advice is incorporated into this tutorial.

T3 – Large Graph-Mining: Power Tools and a Practitioner’s Guide

Christos Faloutsos, Gary Miller, Charalampos (Babis) Tsourakakis

Numerous real-world datasets are in matrix form, thus matrix algebra, linear and multilinear, provides important algorithmic tools for analyzing them. The main datasets of interest in this tutorial are graphs. Important datasets modeled as graphs include the Internet, the Web, social networks (e.g., Facebook, LinkedIn), computer networks, biological networks and many more.

We will discuss how we represent a graph as a matrix (adjacency matrix, Laplacian) and the important properties of those representations. We will then show how these properties are used in several important problems, including node importance via random walks (Pagerank), community detection (METIS, Cheeger inequality), graph isomorphism and graph similarity. Important dimensionality reduction techniques (SVD and random projections) will be discussed in the context of graph mining problems.
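To make the adjacency-matrix and random-walk ideas concrete, here is a small sketch of PageRank computed by power iteration on a toy graph. It is an illustration only, not material from the tutorial; the function name, the damping factor of 0.85, the iteration count and the example graph are assumptions.

```python
def pagerank(adjacency, damping=0.85, iterations=100):
    """Power-iteration PageRank on a graph given as an adjacency matrix.

    adjacency[i][j] is the weight of the edge from node i to node j (0 if none).
    Returns a list of scores that sum to 1.
    """
    n = len(adjacency)
    rank = [1.0 / n] * n
    out_weight = [sum(row) for row in adjacency]
    for _ in range(iterations):
        # Every node receives the "teleport" share of the random walk.
        new_rank = [(1.0 - damping) / n] * n
        for i in range(n):
            if out_weight[i] == 0:
                # Dangling node: the walker jumps to a uniformly random node.
                share = damping * rank[i] / n
                for j in range(n):
                    new_rank[j] += share
            else:
                # Otherwise the walker follows an out-edge in proportion to its weight.
                for j in range(n):
                    if adjacency[i][j]:
                        new_rank[j] += damping * rank[i] * adjacency[i][j] / out_weight[i]
        rank = new_rank
    return rank

if __name__ == "__main__":
    # Toy graph: 0 -> 2, 1 -> 2, 2 -> 0
    graph = [
        [0, 0, 1],
        [0, 0, 1],
        [1, 0, 0],
    ]
    for node, score in enumerate(pagerank(graph)):
        print(f"node {node}: {score:.4f}")
```

In this toy graph node 2 receives links from both other nodes and ends up with the highest score, while node 1, which nobody links to, keeps only the teleport share.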

Furthermore, we provide a survey of the work on the epidemic threshold, node proximity and center-piece subgraphs. State-of-the-art graph mining tools for analyzing time evolving graphs will also be presented. Throughout the tutorial, patterns in static and time evolving, weighted and unweighted real-world graphs will be presented.

The target audience is data mining professionals who wish to know the most important matrix algebra tools, their applications in large graph mining and the theory behind them.
Prerequisites: Computer science background (B.Sc or equivalent); familiarity with undergraduate linear algebra.
Demos will be presented.

http://www.cs.cmu.edu/~christos/TALKS/09-KDD-tutorial/

T4 – Planning, Running, and Analyzing Controlled Experiments on the Web

Ronny Kohavi, Roger Longbotham, John Quarto-vonTivadar

The web provides an unprecedented opportunity to evaluate ideas quickly using controlled experiments, also called randomized experiments, A/B tests (and their generalizations), split tests, and MultiVariable Tests (MVT). Controlled experiments embody the best scientific design for establishing a causal relationship between changes and their influence on user-observable behavior. Data Mining and Knowledge Discovery techniques can then be used to analyze the data from such experiments. The tutorial will provide a survey and practical guide to running controlled experiments based on the recently published survey article in the Data Mining and Knowledge Discovery Journal, co-authored by two of the tutorial co-presenters (http://exp-platform.com/dmkd_survey.aspx), and based on the book “Always Be Testing” co-authored by the third tutorial co-presenter (http://www.amazon.com/Always-Be-Testing-Complete-Optimizer/dp/0470290633). The book includes use of industry tools, such as Google Website Optimizer, and was recently ranked #2 on Amazon’s sales rank for computers/e-commerce books. The tutorial includes multiple real-world examples of actual controlled experiments (many with surprising results), a review of the theory and the statistics used to plan and analyze such experiments, and a discussion of the limitations and pitfalls that might face experimenters. Demos will be shown of some tools that support controlled experiments.

A video of a related talk can be found on the videolectures website:
http://videolectures.net/cikm08_kohavi_pgtce/
The shorter version of the DMKD survey paper is now part of the class reading for several classes at Stanford University (CS147, CS376), UCSD (CSE 291), and at the University of Washington (CSEP 510).

Topics covered include:

Why online experimentation using controlled experiments is important

What you need in order to conduct a valid experiment

Planning and Analysis of basic experiments

Benefits and limitations of experimentation

Multivariable experiments: setup, analysis, interpretation, and interactions

Architectures

Using online free and low-cost software services (demos)

Challenges and advanced statistical concepts for experiments
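As a rough illustration of the kind of statistics used to analyze a basic A/B test, here is a sketch of a two-proportion z-test comparing conversion rates between a control and a treatment. This is a standard textbook test, not necessarily the procedure the presenters recommend; the function name and the example numbers are assumptions.

```python
import math

def two_proportion_z_test(conversions_a, visitors_a, conversions_b, visitors_b):
    """Two-sided z-test for a difference in conversion rate between A and B.

    Returns (z, p_value). A positive z means B converted at a higher rate than A.
    """
    rate_a = conversions_a / visitors_a
    rate_b = conversions_b / visitors_b
    # Pooled rate under the null hypothesis that A and B convert equally often.
    pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

if __name__ == "__main__":
    # Hypothetical experiment: 1000 visitors per variant.
    z, p = two_proportion_z_test(100, 1000, 150, 1000)
    print(f"z = {z:.2f}, p-value = {p:.4f}")
```

The normal approximation behind this test assumes reasonably large samples and randomly assigned, independent visitors.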

T5 – Predictive Modelling in the Wild: Success Factors in Data Mining Competitions and Real-World Applications

Saharon Rosset, Claudia Perlich

In this tutorial, we give our perspective on the keys to success in application of predictive modeling to competitions like KDD Cup and real-life business intelligence projects. We argue that these two modes of applying predictive modeling share many similarities, but also have some important differences. We discuss the main success factors in predictive modeling: domain understanding, statistical acumen, and appropriate algorithmic approaches. We describe our relevant experiences in the context of three recent predictive modeling competitions where our team has had success (KDD Cup 2007 and 2008 and INFORMS DM challenge 2008) and two case studies of projects we have led at IBM Research. We also survey some of the recurring challenges and complexities in practical predictive modeling applications. One key issue is information leakage, and we discuss its definition, influence, detection and avoidance. We consider leakage to be the silent killer of many predictive modeling projects, and we demonstrate its impact on the competitions, and discuss the challenges in addressing it in real-life projects. Other challenges include framing real-life modeling objectives into predictive modeling, and usefully applying relational learning concepts when modeling “real-life” complex, relational datasets.

T6 – New Directions in Data Quality Mining

Laure Berti-Equille (Univ. of Rennes 1, France), Tamraparni Dasu (AT&T Labs – Research)

As data types and data structures change to keep up with evolving technologies and applications, data quality problems too have evolved and become more complex and interwoven. Data streams, web logs, wikis, biomedical applications, video streams and social networking websites generate a mind-boggling variety of data types. However, data quality mining, the use of data mining to manage, measure and improve data quality, has focused mostly on addressing each category of data glitch separately as a static entity.

In this tutorial we provide a technical, KDD-focused account of recent research and developments in discovering and treating complex data anomalies in a broad range of data. In particular, we highlight new directions in data quality mining: (a) the applicability and effectiveness of the methodologies for various data types such as structured, semi-structured and stream data, (b) the detection of concomitant data glitches and patterns like the occurrence of outliers in data with missing values and duplicates, or the co-occurrence of missing values and duplicates, (c) the design of sequential approaches to data quality mining, such as workflows composed of a sequence of tasks for data quality exploration and analysis. We give an overview of past work, introduce current research in this area including recent methods and techniques for discovering complex patterns of anomalies (e.g., multivariate outliers, disguised missing values, combination of different types of noise), and highlight new directions and open problems in data quality mining.

The tutorial includes extensive case studies and practical examples of mining data quality problems for a variety of large datasets and data types e.g., relational, XML, data streams. We discuss illustrative examples drawn from a variety of domains like CRM, networking, biology, and mobility.

T7 – Event Detection

Daniel Neill, Weng-Keen Wong

A common task in surveillance, scientific discovery and data cleaning involves monitoring routinely collected data for anomalous events. Detecting events in univariate time series data can be effectively accomplished using well-established techniques such as Box-Jenkins models, regression, and statistical quality control methods. In recent years, however, routinely collected data has become increasingly complex. At each time step, the data collected can consist of multivariate vectors and/or be spatial in nature. For instance, healthcare data used in disease surveillance often consists of multivariate patient records or spatially distributed pharmaceutical sales data. Consequently, new event detection algorithms have been developed that not only consider temporal information but also detect spatial patterns and integrate information from multiple spatio-temporal data streams.

This tutorial will present algorithms for event detection, with a focus on algorithms dealing with multivariate temporal and spatio-temporal data. We will introduce event detection by providing a general formulation of the event detection problem and describing its unique challenges. In the first half of the tutorial, we will cover algorithms for detecting events in both univariate and multivariate temporal data. The second half will present methods for detecting events in spatio-temporal data, including several recently proposed multivariate approaches.
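As a simple illustration of the univariate case, here is a sketch in the spirit of a statistical quality control chart: it estimates a mean and standard deviation from a baseline period and flags later observations that fall more than a chosen number of standard deviations away. It is not one of the tutorial's algorithms; the function name, the three-sigma threshold and the sample counts are assumptions.

```python
from statistics import mean, stdev

def detect_events(series, baseline_length, threshold=3.0):
    """Flag time steps whose value deviates strongly from a baseline period.

    series: list of numeric observations, one per time step.
    baseline_length: number of initial observations assumed to be event-free.
    Returns the indices (after the baseline) whose z-score exceeds the threshold.
    """
    baseline = series[:baseline_length]
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        raise ValueError("baseline has no variation; cannot compute z-scores")
    alerts = []
    for t in range(baseline_length, len(series)):
        z = (series[t] - mu) / sigma
        if abs(z) > threshold:
            alerts.append(t)
    return alerts

if __name__ == "__main__":
    # Hypothetical daily counts with one unusual spike at index 11.
    daily_counts = [10, 12, 11, 9, 10, 11, 10, 12, 9, 11, 10, 30, 11]
    print(detect_events(daily_counts, baseline_length=10))
```

A fixed baseline like this ignores trend and seasonality, which is part of why time series models and multivariate or spatial methods are needed for more complex data.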

T8 – Advances in Mining the Web

Myra Spiliopoulou, Osmar Zaiane, Bamshad Mobasher, Olfa Nasraoui

The Web has changed our way of life and the Web 2.0 has changed our way of perceiving and using the Web. Data analysis is now required in a plethora of applications that aim to enrich the experience of people with the Web. We first discuss data mining for the social Web. We elaborate on social network analysis and focus on community mining, then go over to recommendation engines and personalization. We discuss the challenges that emerged through the shift from the traditional Web to Web 2.0. We then focus on two issues – the need to protect Web applications from manipulation and the need to make them adaptive towards change. We first discuss manipulations/attacks in recommender systems and present counter-measures. We then elaborate on how changes/concept drifts can be dealt with in applications that analyze clickstream data, monitor topics in news and blogs, or monitor communities and their evolution.

This tutorial is aimed at novice researchers who have a general background in data mining and are interested in understanding the potential and challenges pertinent to the social Web. The participants should have a basic understanding of recommendation engines, personalization and text modeling for mining (vector space models). They will learn how basic techniques are extended and new techniques are designed for mining the Web, especially the social Web. They will also learn about issues that are still open and require further research – research that the tutorial participants may decide to perform themselves.

OUTLINE
PART I: Mining the Social Web [Osmar Zaiane]
PART II: Recommendations and Personalization in the Social Web [Bamshad Mobasher]
PART III: Dealing with Evolution in the Web [Myra Spiliopoulou]
PART IV: Mining Web Data Streams [Olfa Nasraoui]

T9 – Real World Text Mining

Ronen Feldman, Lyle Ungar

The proliferation of documents available on the Web and on corporate intranets is driving a new wave of text mining research and application. Earlier research addressed extraction of information from relatively small collections of well-structured documents such as newswire or scientific publications. Text mining from other corpora such as the web requires new techniques drawn from data mining, machine learning, NLP and IR. Text mining requires preprocessing document collections (text categorization, information extraction, term extraction), storage of the intermediate representations, analysis of these intermediate representations (distribution analysis, clustering, trend analysis, association rules, etc.), and visualization of the results. In this tutorial we will present the algorithms and methods used to build text mining systems including pre-processing techniques, supervised learning (e.g., CRF), entity resolution, relationship extraction, unsupervised learning and machine reading.

The tutorial will cover the state of the art in this rapidly growing area of research, including recent advances in unsupervised methods for extracting facts from text and methods used for web-scale mining. We will also present several real world applications of text mining. Special emphasis will be given to lessons learned from years of experience in developing real world text mining systems, including how to handle informal texts such as blogs and user reviews and how to build scalable systems.

The instructors are Ronen Feldman and Lyle Ungar. Ronen is an Associate Professor of Information Systems at the Business School of the Hebrew University in Jerusalem. He is the founder of the ClearForest text mining corporation, and the author of the book “The Text Mining Handbook” published by Cambridge University Press in 2007. Lyle is an Associate Professor of Computer and Information Science at the University of Pennsylvania. He recently returned from a sabbatical at Google, where he and a team built what is probably the world’s largest named entity recognition system.

Interview: Gary Cokins, SAS Institute

Here is an interview with Gary Cokins, a well-respected veteran of the Business Intelligence industry working with the SAS Institute. Gary has just launched his sixth book (wow!) and, being the gentleman he is, he agreed to answer these questions en route during his constant traveling. Gary is the expert on performance measurement, so we decided to quiz him a bit on this.

CIOs need to shift their mindset from a technical one to a managerial one. - Gary Cokins, SAS Institute

Ajay- Gary, please describe your career journey from a freshman in college to your position today. What are the key items of advice that you would give to high school students to encourage taking science careers in this recession?

COKINS: I have been very fortunate. After receiving my MBA in 1974 from the Northwestern University Kellogg Graduate School of Management, I worked in industry for ten years. I had the luck of being a financial controller at a Fortune 100 corporation division and then becoming operations manager at the same location. I then had to “eat the financial data I was serving,” and it was a true wake-up call – much of the information was at best useless and at worst misleading. Later with Deloitte I was trained on the theory of constraints (TOC) methodology which indicted cost accounting as “enemy number one of productivity.” I learned about the shortcomings with how accountants make assumptions.

In 1988, when Professor Kaplan struck an exclusive relationship with KPMG Peat Marwick, I was recruited to KPMG with about three others with operational backgrounds similar to mine to implement activity based cost management (ABC/M) systems using an ABC/M modeling software tool. I learned from experience. Four years later, my mentor Bob Bonsack, who had moved on from Deloitte to Electronic Data Systems (EDS), recruited me to head EDS’ cost management consulting. With about fifteen consultants, I was exposed to over a hundred implementations of cost systems. It was there that I experimented with creating a two day “ABC/M rapid prototyping” method that was radically different from the multi-month approach. By starting with a quick vision of what their ABC/M system would look like, companies could iteratively re-model to the level of detail, granularity, and accuracy needed to support analysis and decisions. It did not initially require a huge system, which was why some ABC/M system implementations got into trouble. My major self-realization is that costing is accomplished by modeling cost consumption relationships – an insight that continues to evade many accountants.

When I began to see the application of strategy maps and the balanced scorecard, more light bulbs went off in my brain. I then began truly seeing the organization as a “system” where all the performance improvement methodologies and core processes are inter-connected. I realized that the technologies are no longer the impediment because they are proven. The obstacle is the organization’s thinking – and the mindset of senior management who is presumably doing the leading.

My advice to high school students: take your studies more seriously than you even imagine, spend less time text-messaging everyone you know, and focus on the more meaningful relationships. They will eventually be your friends rather than just acquaintances. And take math courses!

Ajay- So what exactly do you do at SAS? And name some interesting anecdotes that led to a lot of value as well as fun for both your company and clients. How does Gary spend a typical day at SAS Institute?

COKINS: My primary role with SAS is to create and deliver thought leadership content about Performance Management leveraging business analytics. I present webinars and write articles, blogs, presentations and also books. For the last four years I have averaged visiting roughly 40 international cities where SAS offices are located to present seminars and meet SAS customers to educate them on the concepts and benefits from Performance Management methodologies.

Recent examples of having fun and providing value to organizations involved providing expert advice to the International Monetary Fund (IMF) in Washington DC and the European Patent Office (EPO) in Brussels. The IMF is at the beginning of implementing an activity based cost management (ABC/M) system whereas the EPO is completing their ABC/M system design. Both organizations were seeking tips for success and pitfalls to avoid. One of my major recommendations was to not under-estimate the natural resistance to change of managers and employees. That is, they need to focus much more on getting their buy-in than worrying if the system is perfect. The value to them is realizing that Performance Management methodologies are much more social than technical.

Regarding my daily activities, when I am not traveling, I am mainly reading articles written by other experts or journalists and then translating my relevant takeaways into content that I can educate others with. I also respond to questions and requests both internally within SAS and externally from customers, management consultants, and university faculty.

Ajay- When you were a young employee, what was the toughest challenge that you faced? What was your worst mistake and how did you overcome it? What lessons did you learn from it?

COKINS: In my first few years in business following my university graduation, my toughest challenge was persuading my supervisors, usually older men than I, to accept my new ideas. I have always been a creative thinker, almost a dreamer; and I was not accustomed to the resistance that managers have to innovations, particularly those suggested by young inexperienced employees fresh from their university schooling.

My worst mistake was developing a computer program that automatically suggested treasury cash balance transfers to optimize the corporate cash management system of my first employer, a large Fortune 100 corporation. My computer program was basically replacing the decisions made by the corporate cash manager and part of his job. I overcame this disappointment by learning what needs the corporate cash manager did have and developing a different computer program that solved his needs. With its success, he eventually accepted the first computer program.

My lesson was one should first understand what people may want rather than trying to impose on them what you think they need without involving them.

Ajay- Looking back on your distinguished career, what project are you proud of the most? What project would you do over again if given the chance?

COKINS: In 1973 I became a financial controller of a large division of another Fortune 100 manufacturer. I created a rolling financial planning and forecast software program, using pre-spreadsheet software from a mainframe (years before personal computers and Excel). The program modeled product line sales forecasts by month and integrated both the income statement and balance sheet. It became a valuable tool for the executive team to suggest and immediately see varying sales levels as a “what if” scenario builder to calculate the different profit and working capital results. The executive team marveled at how analytical software, in contrast to our transactional ERP-like system, could make sense of the complexity of our operations with thousands of products and customers.

Regarding a project that fell short of expectations, I actually did get a chance to do it over again. As a consultant with Deloitte, I led a project designing and implementing an activity based cost management (ABC/M) system using the client’s general ledger accounting software. It took many months, and when finished it was too complex for the client to fully understand. Several years later with a similar project I applied a rapid prototyping with iterative re-modeling approach that involved the company’s managers from the first day. (I mentioned this approach in my reply to the first question.) We completed the ABC/M system in just a few weeks, and everyone understood it and also how to interpret the information for analysis and decisions. I have since been a proponent of this type of rapid learning and system design approach.

Ajay- What do people at SAS Institute do for fun when not creating or selling algorithms? How is SAS reaching out to other members of the analytics community in terms of basic science and development?

COKINS: SAS employees are inspired by our CEO, Dr. Jim Goodnight, who founded SAS roughly 35 years ago. Dr. Goodnight loves solving problems of all flavors. For fun, but also part of our jobs, SAS employees search for problems that only computer software can solve.

SAS’ offerings evolve by listening to our customers, who are typically scientists, researchers, and business analysts. Drug development and marketing analysts are examples. Our customers are our “community.” We motivate them, with formal methods of collecting input from them, to share with us enhancements to our future versions of our software.

Ajay- Describe your new book on Performance Management from the point of a beginner. Assume that I am a college student who does not know why I should read it. Then assume that I am a CIO and have little time to read it. What is in it for a CIO?

COKINS: This is the sixth book I have written. My first four books were about activity based cost management (ABC/M) and the last two about Performance Management. What is different about this second book is it immediately clarifies the confusion and ambiguity about what Performance Management is and is not. It is also written in a humorous and simplified way with lots of analogies and metaphors, such as all of the Performance Management methodologies integrated together like gears in an automobile engine and with a GPS for predictive navigation and dashboards for feedback. Beginners perceive each methodology, such as a balanced scorecard or customer relationship management system, as a stand-alone tool. There is synergy when they are integrated.

CIOs have similar needs. They need to shift their mindset from a technical one to a managerial one. Just a few chapters from this book can help CIOs see the broad picture of how all of their organization’s processes fit together, and how they can be aligned to efficiently execute the ever-adjusting strategy that the executives continuously formulate with operations.

Biography and Contact Information

Gary Cokins, CPIM

(gary.cokins@sas.com; phone 919 531 2012)

http://blogs.sas.com/cokins

Gary Cokins is a global product marketing manager involved with performance management solutions with SAS, a leading provider of performance management and business analytics software headquartered in Cary, North Carolina. Gary is an internationally recognized expert, speaker, and author in advanced cost management and performance improvement systems. Gary received a BS degree with honors in Industrial Engineering/Operations Research from Cornell University in 1971. He received his MBA from Northwestern University’s Kellogg School of Management in 1974.

Gary began his career as a strategic planner with FMC’s Link-Belt Division and then served as Financial Controller and Operations Manager. In 1981 Gary began his management consulting career first with Deloitte Consulting. Next with KPMG Peat Marwick, Gary was trained on ABC by Harvard Business School Professors Robert S. Kaplan and Robin Cooper. More recently, Gary headed the National Cost Management Consulting Services for Electronic Data Systems (EDS)/ A.T. Kearney.

Gary was the lead author of the acclaimed An ABC Manager’s Primer (ISBN 0-86641-220-4) sponsored by the Institute of Management Accountants (IMA). Gary’s second book, Activity Based Cost Management: Making it Work (ISBN 0-7863-0740-4), was judged by the Harvard Business School Press as “read this book first.” A reviewer for Gary’s third book, Activity Based Cost Management: An Executive’s Guide (ISBN 0-471-44328-X) said, “Gary has the gift to take the concept that many view as complex and reduce it to its simplest terms.” This book was ranked number one in sales volume of 151 similar books on BarnesandNoble.com. Gary has also written Activity Based Cost Management in Government (ISBN 1-056726-110-8). His latest books are Performance Management: Finding the Missing Pieces to Close the Intelligence Gap (ISBN 0-471-57690-5) and Performance Management: Integrating Strategy Execution, Methodologies, Risk, and Analytics (ISBN 978-0-470-44998-1).

Mr. Cokins participates and serves on committees including CAM-I, the Supply Chain Council, the International Federation of Accountants (IFAC), and the Institute of Management Accountants. Mr. Cokins is a member of the Journal of Cost Management Editorial Advisory Board. Cokins can be reached at gary.cokins@sas.com. His blog is at http://blogs.sas.com/cokins and his latest book can also be previewed at http://www.sas.com/apps/pubscat/bookdetails.jsp?catid=1&pc=62401

Interesting Times

Probably for the first time, I am reproducing a comment from a reader in its entirety. As an ex-GE Finance and ex-Citi man, I found that the following parable struck close to the heart.

Here are some nice views from Randall Stross of http://enhilex.com/ on the current economic crisis. If his views strike a chord, do write back.

Hello Everyone, There is a lot of talk about what made Wall St. fail. Having spent some time inside firms that played pivotal roles, I must say that I believe there are as many reasons for the failure as there are ways to manifest a lack of integrity, diligence, responsibility and accountability. As recent failures demonstrate, all attempts to legislate all those characteristics have failed – for the same reasons. The manifestations I found were like these following examples:

1. Here’s a short conversation in a hallway. I would later realize that we were standing in front of the manager’s office… Me: “So, why don’t your models account for employment as a factor that influences the borrower’s ability to pay his mortgage?” Him, looking at me like I had 2 heads: “Er, uh… Well, that is because… very long pause… there is no reliable data on that. Yes, that’s right. So we leave it out…” That was the reply from a very intelligent person who knew he would be shown the door if he’d told me the truth. Lay persons understand that models are useful, though imperfect. But the example given above is something else. This model’s design was intended to mislead. People who do things like this may have been “good” employees, but they were not good citizens. The “good” employees remain… Anyone with an IQ over 50 knows that past performance does not predict future performance when you don’t account for the differences between past and present in the entirety of the nexus where the question belongs.

2. Whilst sitting next to me in a development lab, a young reporting analyst was directed to “hard code” the sum of a column of figures headed for regulators – because the bottom line “wasn’t good enough.” The young man hesitated. Relieved, I put my hand on his wrist and quietly suggested that he ask for our client’s manager to send that request to him in an email. The manager left the lab, calling us obscene names. Of course, the email never came.

3. A friend of mine working as a funder for a large mortgage firm was fired for refusing to fund a loan she knew was fraudulent. I’m proud to know the people in #2 and #3. These people have integrity and loyalty to principles the investing public can count on. But they both lost their jobs and have moved on into other fields, away from Wall St-ish areas. So, it is the people who are left that will work to rebuild confidence in the market. OK, now — who’s left? I wish we could let it all fall down and stand back up without the bums who caused it all. We do know who they are…

PS: I hope he is neither a spammer nor joking. This does bring back a lot of old memories from when I worked for the big, hot financial companies. Do you have a personal story like that?

Decisionstats Interviews

Here is a list of interviews that I have published. These are specific to analytics and data mining and include only the most recent interviews. If I have missed any notable recent interview related to analytics and data mining, kindly let me know. Hat tip to Karl Rexer for this suggestion.

Date      Name of Interviewee          Designation and Organization

09-Jun    Karl Rexer                   President, Rexer Analytics
05-Jun    Jim Daves                    CMO, SAS Institute
04-Jun    Paul van Eikeren             President and CEO, Blue Reference
29-May    David Smith                  Director of Community, REvolution Computing
17-May    Dominic Pouzin               CEO, Data Applied
11-May    Bruno Delahaye               VP, KXEN
04-May    Ron Ramos                    Director, Zementis
30-Apr    Oliver Jouve                 VP, SPSS Inc
21-Apr    Fabian Dill                  Co-Founder, Knime.com
18-Apr    Alicia Mcgreevey             Head Marketing, Visual Numerics
27-Mar    Francoise Soulie Fogelman    VP, KXEN
17-Mar    Jon Peck                     Principal Software Engineer, SPSS Inc
06-Mar    Anne Milley                  Director of product marketing, SAS Institute
04-Mar    Anne Milley                  Director of product marketing, SAS Institute
03-Feb    Phil Rack                    Creator, Bridge to R, and CEO, Minequest
03-Feb    Michael Zeller               CEO, Zementis
31-Jan    Richard Schultz              CEO, Revolution Computing
21-Jan    Bob Muenchen                 Author, R for SAS and SPSS Users
13-Jan    Dr Graham Williams           Creator, Rattle GUI for R
05-Jan    Roger Haddad                 CEO, KXEN
26-Sep    June Dershewitz              VP, Semphonic
04-Sep    Vincent Granville            Head, Analyticbridge

The URLs to specific interviews are also in this sheet.

http://spreadsheets.google.com/pub?key=rWTqcMe9mqwHeFv1e4GS_yg&single=true&gid=0&range=a1%3Ae24&output=html

Conferences: KXEN and KDD 09

Here is an announcement regarding one of the foremost conferences on knowledge discovery, KDD 2009, which is being held in Paris. We have interviewed the joint general chair of the conference, KXEN’s Francoise Soulie Fogelman, here at http://www.decisionstats.com/2009/03/27/interview-franoise-soulie-fogelman-kxen/

Indeed, following the exciting release of its social network analysis software, KSN, KXEN is also a gold sponsor of the conference. You can view the archives at http://www.kdd2008.com/ or read more at http://www.kdd.org/kdd2009/index.html

From KXEN’s press release:

World’s Best Data Mining Knowledge and Expertise on Show in Paris at KDD-09

Eminent data mining researchers, academics and practitioners from across the world are honing their presentation skills and charging their laptops in readiness for the industry’s largest and most respected conference, this year being staged for the first time in Europe, in the city of Paris.

The knowledge discovery and data mining 2009 (KDD-09) event will bring together more than 600 specialists, representing the single largest body of expertise in the science and application of data mining technology for industry, government and academia. They will discuss recent discoveries in data mining and share innovative ways of applying the technology in real world business.

Running from the 28th June to 1st July, KDD-09 will feature more than 120 presentations by experts from the US, Europe, Scandinavia and Asia-Pacific. A 20% increase in papers submitted reflects the growing importance of data mining in financially constrained markets. Companies taking part include Orange as a platinum sponsor and Microsoft adCenterLabs and KXEN as gold sponsors. Silver sponsors are Bayesia, Google, HP labs, Pervasive, SAS, Vadis and Yahoo!. Other sponsors include Alberta Center for Machine Learning, Pascal2, Socio Logiciels, Statsoft, Zementis, SFDS, IBM and SIGMOD.

Joint general chair of KDD-09, Francoise Soulie Fogelman, VP Business Development KXEN, says the conference offers a unique chance to see the very latest thinking in data mining. “Some of the best minds from the scientific and business communities will be there, ready and willing to share the results of their cutting edge research and data mining projects with end users. No other industry event offers anything like the depth and breadth of expertise on offer here.”

A particular focus for 2009 will be social network analysis: the discovery and use for competitive advantage of the links between people in social and professional networks. Currently a hot topic among data mining professionals – especially those working in the telecommunications sector – this technique will feature in theoretical and workshop presentations. Details will also be revealed of the world’s first practical applications involving industrial scale volumes of data. Gold sponsor KXEN will present on its booth its recently revealed KSN social network module, helping companies extract valuable new intelligence for better customer acquisition, retention, cross-sell and up-sell campaigns.

Other exhibitors include sponsors as well as Cambridge University Press, Cap Digital, Elsevier, Morgan Claypool Publishers, Oracle, Salford Systems, Springer and Taylor & Francis CRC press.

Also high on the agenda are real-time Web applications for data mining for custom advertising and personalized offers, both seen as crucial to online marketing and sales but both also requiring technologies able to handle very large volumes of data in real time.

Away from science and technology, delegates will also have a chance to sample the best of Paris architecture and hospitality on the evening of 29th June in the main reception room at the exclusive Hotel de Ville – a venue normally reserved for visiting heads of state. A cocktail reception hosted by KXEN will follow presentations, including a welcome from Jean-Louis Missika, the Deputy Mayor of Paris in charge of Innovation, Research and Universities.

There will also be the presentation of awards of the KDD cup by Dr. Isabelle Guyon (ClopiNet). The cup is awarded to the winners of a contest around predicting customer scores from large marketing databases. It, and other prize awards, are being sponsored by the French telecommunications company Orange and Google.

KDD-09 is organized by the data mining special interest group of the Association for Computing Machinery (ACM), the world’s largest educational and scientific computing society. The ACM provides resources that advance computing both as a science and a profession. ACM provides the computing field’s premier digital library and serves its members and the computing profession with leading-edge publications, conferences, and career resources.

More details, program & registration: http://www.kdd.org/kdd2009/index.html

About KXEN

KXEN, The Data Mining Automation Company™ delivers next-generation Customer Lifecycle Analytics to enterprises that depend on analytics as a competitive advantage. KXEN’s Data Mining Automation Solution drives significant improvements in customer acquisition, retention, cross-sell and risk applications. Its solution integrates predictive analytics into strategic business processes, allowing customers to drive greater value into their business. Find out more by visiting www.kxen.com.

Disclaimer- I am a social media consultant to KXEN.

Interview: Karl Rexer, Rexer Analytics

Here is an interview with Karl Rexer of Rexer Analytics. His annual survey is considered a benchmark in the data mining and analytics industry. Here Karl talks of his career, his annual survey and his views on the industry direction and trends.

Almost 20% of data miners report that their company/organizations have only minimal analytic capabilities – Karl Rexer

Ajay- Describe your career in science. What advice would you give to young science graduates in this recession? What advice would you give to high school students choosing from science – non science careers?

Karl- My interests in science began as a child. My father has multiple science degrees, and I grew up listening to his descriptions of the cool things he was building, or the cool investigative tools he was using, in his lab. He worked in an industrial setting, so visiting was difficult. But when I could, I loved going in to see the high-temperature furnaces he was designing, the carbon-fiber production processes he was developing, and the electron microscope that allowed him to look at his samples. Both of my parents encouraged me to ask why, and to think critically about both scientific and social issues. It was also the time of the Apollo moon landings, and I was totally absorbed in watching and thinking about them. Together these things motivated me and shaped my world-view.

I have also had the good fortune to work across many diverse areas and with some truly outstanding people. In graduate school I focused on applied statistics and the use of scientific methods in the social sciences. As a grad student and young academic, I applied those skills to researching how our brains process language. But on the side, I pursued a passion for using the scientific method and analytics to address… well, anything I could. We called it “statistical consulting” then, but it often extended to research design and many other parts of the scientific process. Some early projects included assisting people with AIDS outcome studies, psycholinguistic research, and studies of adolescent adjustment.

My first taste of applying these skills outside of an academic environment was with my mentor Len Katz. The US Navy hired us to help assess the new recruits that were entering the submarine school. Early identification of sailors who would excel in this unusual and stressful environment was critical. Perhaps even more important was identifying sailors who would not perform well in that environment. Luckily, the Navy had years of academic and psychological testing on many sailors, and this data proved quite useful in predicting later job performance onboard the submarines. Even though we never got the promised submarine ride, I was hooked on applying measurement, scientific methods, and analytics in non-academic settings.

And that’s basically what I have continued to do – apply those skills and methods in diverse scientific and business settings. I worked for two banks and two consulting firms before founding Rexer Analytics in 2002. Last year we supported 30 clients. I’ve got great staff and they have great quant skills. Importantly, we also don’t hesitate to challenge each other, and we’re continually learning from each other and from each client engagement. We share a love of project diversity, and we seek it out in our engagements. We’ve forecasted sales for medical devices, measured B2B customer loyalty, identified manufacturing problems by analyzing product returns, predicted which customers will close their bank accounts, analyzed millions of tax returns, helped identify the dimensions of business team cohesion that result in better performance, found millions of dollars of B2B and B2C fraud, and helped many companies understand their customers better with segmentations, surveys, and analyses of sales and customer behavior.

The advice I would give to young science grads in this recession is to expand your view of where you can apply your scientific training. This applies to high school students considering science careers too. All science does not happen in universities, labs and other traditional science locations. Think about applying scientific methods everywhere! Sometimes our projects at Rexer Analytics seem far away from what most people would consider “science.” But we’re always asking “what data is available that can be brought to bear on the business issue we’re addressing.” Sometimes the best solution is to go out and collect more data – so we frequently help our clients improve their measurement processes or design surveys to collect the necessary data. I think there are enormous opportunities for science grads to apply their scientific training in the business world. The opportunities are not limited to physics whiz kids making models for Wall Street trading or computer science students moving to Silicon Valley. One of the best analytic teams I ever worked on was at Fleet Bank in the late 90s. We had an economist, two physicists, a sociologist, a psychologist, an operations research guy, and a person with a degree in marketing science. We were all very focused on data, measurement, and analytic methods.

I recommend that all science grads read Tom Davenport’s book Competing on Analytics. It illustrates, with compelling examples, how businesses can benefit from using science and analytics. Several examples in Tom’s book come from Gary Loveman, CEO of Harrah’s Entertainment. I think that Gary also serves as a great example of how scientific methods can be applied in every industry. Gary has a PhD in economics from MIT, he’s worked at the Federal Reserve Bank, he’s been a professor at Harvard, but more recently he runs the world’s largest casino and gaming company. And he’s famously said many times that there are three ways to get fired at Harrah’s: steal, harass women, or not use a control group. Business leaders across all industries are increasingly wanting data, analytics and scientific decision-making. Science grads have great training that enables them to take on these roles and to demonstrate the success of these methods.

Ajay- One more survey- How does the Rexer survey differentiate itself from other surveys out there?

Karl- The Annual Rexer Analytics Data Miner Survey is the only broad-reaching research that investigates the analytic behaviors, views and preferences of data mining professionals. Each year our sample grows — in 2009 we had over 700 people around the globe complete our survey. Our participants include large numbers of both academic and business people.

Another way our survey is differentiated from other surveys is that each year we ask our participants to provide suggestions on ways to improve the survey. Incorporating participants’ suggestions improves our survey. For example, in 2008 several people suggested adding questions about model deployment and off-shoring. We asked about both of these topics in the 2009 survey.

Ajay- Could you please share some sneak previews of the survey results? What impact is the recession likely to have on IT spending?

Karl- We’re just starting to analyze the 2009 survey data. But, yes, here’s a peek at some of the findings that relate to the impact of the recession:

* Many data miners report that funding for data mining projects can sometimes be a problem.
* However, when asked what will happen in 2009 if the economic downturn continues, many data miners still anticipate that their company/organization will conduct more data mining projects in 2009 than in previous years (41% anticipate more projects in 2009; 27% anticipate fewer projects).
* The vast majority of companies conduct their data mining internally, and very few are sending data mining off-shore.

I don’t have a crystal ball that tells me about the trends in overall corporate spending on IT, Business Intelligence, or Data Mining. It’s my personal experience that many budgets are tight this year, but that key projects are still getting funded. And it is my strong opinion that in the coming years many companies will increase their focus on analytics, and I think that increasingly analytics will be a source of competitive advantage for these companies.

There are other people and other surveys that provide better insight into the trends in IT spending. For example, Gartner’s recent survey of over 1,500 CIOs (http://www.gartner.com/it/page.jsp?id=855612 ) suggests that 2009 IT spending is likely to be flat. I’m personally happy to see that in the Gartner survey, Business Intelligence is again CIOs’ top technology priority, and that “increasing the use of information/analytics” is the #5 business priority.

Ajay- I noticed you advise SPSS among others. Describe what an advisory role is for an analytics company and how can small open source companies get renowned advisors?

Karl- We have advised Oracle, SPSS, Hewlett-Packard and several smaller companies. We find that advisory roles vary greatly. The biggest source of variation is what the company wants advice about. Examples include:

* assessing opportunity areas for the application of analytics
* strategic data assessments
* analytic strategy
* product strategy
* reviewing software

Both large and small companies that look to apply analytics to their businesses can benefit from analytic advisors. So can open source companies that sell analytic software. Companies can find analytic advisors in several ways. One way is to look around for analytic experts whose advice you trust, and hire them. Networking in your own industry and in the analytic communities can identify potential advisors. Don’t forget to look in both academia and the business world. Many skilled people cross back and forth between these two worlds. Another way for these companies to obtain analytic advice is to look in their business networks and user communities for analytic specialists who share some of the goals of the company – they will be motivated for your company to succeed. Especially if focused topic areas or time-constrained tasks can be identified, outside experts may be willing to donate their time, and they may be flattered that you asked.

Ajay- What made you decide to begin the Rexer Surveys? Describe some results of last year’s surveys and any trends from the last three years that you have seen.

Karl- I’ve been involved on the organizing committees of several data mining workshops and conferences. At these conferences I talk with a lot of data miners and companies involved in data mining. I found that many people were interested in hearing about what other data miners were doing: what algorithms, what types of data, what challenges were being faced, what they liked and disliked about their data mining tools, etc. Since we conduct online surveys for several of our clients, and my network of data miners is pretty large, I realized that we could easily do a survey of data miners, and share the results with the data mining community. In the first year, 314 data miners participated, and it’s just grown from there. In 2009 over 700 people completed the survey. The interest we’ve seen in our research summaries has also been astounding – we’ve had thousands of requests. Overall, this just confirms what we originally thought: people are hungry for information about data mining.

Here is a preview of findings from the initial analyses of the 2009 survey data:

* Each year we’ve seen that the most commonly used algorithms are decision trees, regression, and cluster analysis.
* Consistently, some of the top challenges data miners report are dirty data and explaining data mining to others. Previously, data access issues were also reported as a big challenge, but in 2009 fewer data miners reported facing this challenge.
* The most prevalent concerns with how data mining is being utilized are: insufficient training of some data miners, and resistance to using data mining in contexts where it would be beneficial.
* Data mining is playing an important role in organizations. Half of data miners indicate their results are helping to drive strategic decisions and operational processes.
* But there’s room for data mining to grow – almost 20% of data miners report that their company/organizations have only minimal analytic capabilities.

Bio-

Karl Rexer, PhD is President of Rexer Analytics, a small Boston-based consulting firm. Rexer Analytics provides analytic and CRM consulting to help clients use their data to make better strategic and tactical decisions. Recent projects include fraud detection, sales forecasting, customer segmentation, loyalty analyses, predictive modeling for cross-sell and attrition, and survey research. Rexer Analytics also conducts an annual survey of data miners and freely distributes research summaries to the data mining community. Karl has been on the organizing committees of several international data mining conferences, including 3 KDD conferences, and BIWA-2008. Karl is on the SPSS Customer Advisory Board and on the Board of Directors of the Oracle Business Intelligence, Warehousing, & Analytics (BIWA) Special Interest Group. Karl and other Rexer Analytics staff are frequent invited speakers at MBA data mining classes and conferences.

To learn more, check out the website at www.rexeranalytics.com

KXEN Case Studies: Financial Sector

Here are the summaries of some excellent success stories that KXEN has achieved working with partners in the financial world over the years.

Fraud Modeling- Disbank (acquired by Fortis) Turkey

1. Dısbank increased the number of identified fraudulent applications by 200%, from 7 to 21 per day.

2. More than 50 fraudsters using counterfeit cards at merchant locations or fraudulent applications have been arrested since April 2004, when the fraud modeling system was set up.

A large Bank on the U.S. East Coast

1. Response Modeling

Previously it took the modeling group four weeks to build one model with several hundred variables, using traditional modeling tools. KXEN took one hour for the same problem and doubled the lift in the top decile because it included variables that had not been used for this business question before.

2. Data Quality

Building a Cross/Up-sell Model for a direct marketing campaign to high net worth customers, the modelers needed four weeks using 1500 variables. Again it took one hour with KXEN, which uncovered significant problems with some of the top predictive variables. Further investigation proved that these problems were created in the data merge of the mail file and response file, creating several “perfect” predictors. The model was re-run, removing these variables, and immediately put into production.
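The two ideas in this case study, lift in the top decile and "perfect" predictors created by a bad data merge, can be illustrated with a small sketch. This is not KXEN's method, just a plain Python illustration under stated assumptions: the column names (`tenure_months`, `mail_response_flag`), the toy data, and the 0.99 cut-off are all hypothetical. Lift here means the response rate among the top 10% of scored customers divided by the overall response rate, and a variable is treated as suspicious when it alone separates responders from non-responders almost perfectly.

```python
from typing import List, Mapping, Sequence


def top_decile_lift(scores: Sequence[float], responded: Sequence[int]) -> float:
    """Response rate in the top 10% of scores divided by the overall response rate."""
    if len(scores) != len(responded) or len(scores) == 0:
        raise ValueError("scores and responded must be non-empty and of equal length")
    overall_rate = sum(responded) / len(responded)
    if overall_rate == 0:
        raise ValueError("no responders, so lift is undefined")
    # Rank customers from highest to lowest score and keep the top tenth.
    ranked = sorted(zip(scores, responded), key=lambda pair: pair[0], reverse=True)
    top_n = max(1, len(ranked) // 10)
    top_rate = sum(flag for _, flag in ranked[:top_n]) / top_n
    return top_rate / overall_rate


def single_variable_auc(values: Sequence[float], responded: Sequence[int]) -> float:
    """Chance that a random responder has a higher value than a random
    non-responder, counting ties as half (the rank-based AUC)."""
    pos = [v for v, flag in zip(values, responded) if flag == 1]
    neg = [v for v, flag in zip(values, responded) if flag == 0]
    if not pos or not neg:
        raise ValueError("need at least one responder and one non-responder")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def suspicious_predictors(
    columns: Mapping[str, Sequence[float]],
    responded: Sequence[int],
    threshold: float = 0.99,  # hypothetical cut-off, tune for real data
) -> List[str]:
    """Names of columns that separate the outcome almost perfectly in either direction."""
    flagged = []
    for name, values in columns.items():
        auc = single_variable_auc(values, responded)
        if max(auc, 1.0 - auc) >= threshold:
            flagged.append(name)
    return flagged


# Toy data: 20 customers, 4 of whom responded.
responded = [1, 1, 0, 1, 0, 0, 0, 0, 1, 0] + [0] * 10
scores = [20 - i for i in range(20)]  # model scores, highest first
columns = {
    "tenure_months": [5, 30, 12, 18, 7, 22, 3, 15, 9, 26,
                      11, 4, 19, 8, 14, 6, 21, 10, 16, 13],
    # Hypothetical leak: a field copied from the response file during the merge.
    "mail_response_flag": list(responded),
}

print(top_decile_lift(scores, responded))
print(suspicious_predictors(columns, responded))
```

A variable flagged this way is not necessarily wrong, but as in the bank's case it is worth tracing back through the merge before the model goes into production.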

Le Crédit Lyonnais

1. Around 160 score models now built annually – compared to around 10 previously – for 130 direct marketing campaigns.
2. KXEN software has allowed LCL to drive up response rates, leading to more value-added services for customers.

Finansbank, Turkey

1. Within 4 months of starting the project to combat dormancy using KXEN’s solution, the bank had successfully reactivated half its previously dormant customers, according to Kunter Kutluay, Finansbank Director of Marketing and Risk Analytics.

Bank Austria Creditanstalt, Austria

1. Some 4.5 terabytes of data are held in the bank’s operational systems, with a further 2 terabytes archived. Analytical models created in KXEN are automatically fed through the bank’s scoring engine in batches weekly or monthly depending on the schema.

“But we are looking at a success rate of target customer deals in the area of three to five per cent with KXEN. Before that, it was one per cent or less.”
Werner Widhalm, Head of the Customer Knowledge Management Unit.

Barclays

1. Barclays’ Teradata warehouse holds information on some 14 million active customers, with data on many different aspects of customer behaviour. Previously, analysts had to manually whittle down several thousand fields of data to a core of only a few hundred to fit the limitations of the modelling process. Now, all of the variables can be fed straight into the predictive model.

Summary: KXEN has achieved a tremendous response across all aspects of data modelling in the financial sector, where the time taken to build, deploy and analyze a model is much more crucial than in many other sectors. I will follow this with other case studies on KXEN successes across multiple domains.

Source – http://www.kxen.com/index.php?option=com_content&task=view&id=220&Itemid=786

Disclaimer- I am a social media consultant for KXEN.
