All tutorials will be held on Sunday, June 28th.
T1 – Statistical Challenges in Computational Advertising
Deepayan Chakrabarti, Deepak Agarwal
Many organizations now devote significant fractions of their advertising/outreach budgets to online advertising; ad networks like Yahoo!, Google, and MSN have responded by constructing new kinds of economic models and performing the fundamental task of matching the most relevant ads (selected from a large inventory) to a (query, user) pair in a given context. Nearly all of the challenges that arise are substantially data- or model-driven (or both). Computational Advertising is a relatively new scientific sub-discipline at the intersection of large-scale search and text analysis, information retrieval, statistical modeling, machine learning, optimization, and microeconomics that addresses this match-making problem and provides unprecedented opportunities to data miners.
Topics covered include a comprehensive introduction to several advertising forms (sponsored search, contextual advertising, display advertising), revenue models (pay-per-click, pay-per-view, pay-per-conversion), and the data mining challenges involved, along with an overview of state-of-the-art techniques in the area and a detailed discussion of open problems. We will cover information retrieval techniques and their limitations, the data mining challenges involved in performing ad matching through clickstream data, and the challenging optimization issues that arise in display advertising. In particular, we will cover statistical modeling techniques for clickstream data and explore/exploit schemes to perform online experiments for better long-term performance using multi-armed bandit schemes. We also discuss the close relationship of techniques used in recommender systems to our problem, but indicate several additional issues that need to be addressed before they become routine in computational advertising.
We will assume only basic knowledge of statistical methods; no prior knowledge of online advertising is required. In fact, the first hour, which provides an introduction to the area, would be appropriate for all registered attendees of KDD 2009. The second half would require familiarity with basic concepts like regression and probability distributions, and an appreciation of the issues involved in fitting statistical models to large-scale applications. No prior knowledge of multi-armed bandits is assumed.
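For readers unfamiliar with explore/exploit schemes, the following is a minimal sketch of one of the simplest multi-armed bandit strategies, epsilon-greedy, applied to ad selection. It is an illustration only and not necessarily a method covered in the tutorial; the function name `epsilon_greedy`, the simulated click-through rates, and all parameter values are assumptions made for this example.

```python
import random


def epsilon_greedy(true_ctrs, rounds=20000, epsilon=0.1, seed=0):
    """Simulate epsilon-greedy ad selection.

    true_ctrs: hypothetical click-through rate of each ad (unknown to the policy).
    Returns (impressions, clicks) per ad.
    """
    rng = random.Random(seed)
    n_ads = len(true_ctrs)
    shows = [0] * n_ads
    clicks = [0] * n_ads
    for _ in range(rounds):
        if rng.random() < epsilon or 0 in shows:
            # Explore: pick an ad uniformly at random. This branch is also
            # taken until every ad has been shown at least once.
            ad = rng.randrange(n_ads)
        else:
            # Exploit: pick the ad with the highest observed click-through rate.
            ad = max(range(n_ads), key=lambda i: clicks[i] / shows[i])
        shows[ad] += 1
        # Simulated user feedback: a click occurs with the ad's true CTR.
        if rng.random() < true_ctrs[ad]:
            clicks[ad] += 1
    return shows, clicks
```

Calling `epsilon_greedy([0.02, 0.05, 0.20])` returns the impression and click counts per ad. A small `epsilon` spends most impressions on the ad that currently looks best while still gathering data on the others.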
T2 – How to do good research, get it published in SIGKDD and get it cited!
While SIGKDD has traditionally enjoyed an unusually high quality of reviewing, there is no doubt that publishing in SIGKDD (and other high-quality data mining conferences) is very challenging. This is especially true for young faculty, grad students whose primary advisor is not an experienced SIGKDD author, or people from outside the community (e.g., a biologist or mathematician who has a result that might greatly interest the data mining community).
In this tutorial Dr. Keogh will demonstrate some simple ideas to improve your chances of getting a paper published in a top data mining conference and, once the work is published, of getting it highly cited.
These tips and tricks are based on 12 years of experience as a SIGKDD author and reviewer, and on wisdom solicited from many of the most prolific data mining researchers and reviewers.
Topics covered in the tutorial include:
- Finding the right problems to work on (80% of the battle).
- Don’t summarize, sell! Writing abstracts that put the reviewer on your side from the start.
- Getting or creating the perfect dataset.
- Experiments that tell a story.
- Making effective and interesting figures.
- Getting the reviewers on your side.
- The top-ten avoidable reasons why papers get rejected from SIGKDD.
- Three simple tricks to increase the number of citations to your work.
While Dr. Keogh does not claim to have a “magic bullet” for publishing in SIGKDD, his significant track record of publishing in top data mining venues, combined with extensive (and deliberately uncredited) experience in helping younger researchers “break in” to SIGKDD, has placed him in a unique position to share useful and actionable advice.
While writing this tutorial, Dr. Keogh sought and received advice from many respected data mining researchers; their advice is incorporated into this tutorial.
T3 – Large Graph-Mining: Power Tools and a Practitioner’s Guide
Christos Faloutsos, Gary Miller, Charalampos (Babis) Tsourakakis
Numerous real-world datasets are in matrix form; thus matrix algebra, linear and multilinear, provides important algorithmic tools for analyzing them. The main type of dataset of interest in this tutorial is the graph. Important datasets modeled as graphs include the Internet, the Web, social networks (e.g., Facebook, LinkedIn), computer networks, biological networks, and many more.
We will discuss how we represent a graph as a matrix (adjacency matrix, Laplacian) and the important properties of those representations. We will then show how these properties are used in several important problems, including node importance via random walks (PageRank), community detection (METIS, Cheeger inequality), graph isomorphism, and graph similarity. Important dimensionality reduction techniques (SVD and random projections) will be discussed in the context of graph mining problems.
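As a concrete illustration of node importance via random walks, the sketch below computes PageRank scores by power iteration on a small graph given as an adjacency matrix. It is a minimal example under stated assumptions, not the tutorial's own material: the function name `pagerank`, the damping factor of 0.85, and the unweighted 0/1 adjacency format are all choices made for this sketch.

```python
def pagerank(adj, damping=0.85, tol=1e-10, max_iter=200):
    """PageRank by power iteration.

    adj[i][j] == 1 means there is a directed edge from node i to node j
    (unweighted graph assumed). Returns a list of scores that sum to 1.
    """
    n = len(adj)
    out_deg = [sum(row) for row in adj]
    rank = [1.0 / n] * n
    for _ in range(max_iter):
        # Rank held by dangling nodes (no out-links) is spread uniformly.
        dangling = sum(rank[i] for i in range(n) if out_deg[i] == 0)
        new = [(1.0 - damping) / n + damping * dangling / n] * n
        for i in range(n):
            if out_deg[i] == 0:
                continue
            # Each node splits its damped rank equally among its out-links.
            share = damping * rank[i] / out_deg[i]
            for j in range(n):
                if adj[i][j]:
                    new[j] += share
        # Stop once the L1 change between iterations is negligible.
        if sum(abs(a - b) for a, b in zip(new, rank)) < tol:
            return new
        rank = new
    return rank
```

For large graphs one would use a sparse matrix representation rather than nested lists; the dense form is used here only to keep the link to the adjacency matrix explicit.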
Furthermore, we provide a survey of the work on the epidemic threshold, node proximity, and center-piece subgraphs. State-of-the-art graph mining tools for analyzing time-evolving graphs will also be presented. Throughout the tutorial, patterns in static and time-evolving, weighted and unweighted real-world graphs will be presented.
The target audience is data mining professionals who wish to know the most important matrix algebra tools, their applications in large graph mining, and the theory behind them.
Prerequisites: Computer science background (B.Sc or equivalent); familiarity with undergraduate linear algebra.
Demos will be presented.
T4 – Planning, Running, and Analyzing Controlled Experiments on the Web
Ronny Kohavi, Roger Longbotham, John Quarto-vonTivadar
The web provides an unprecedented opportunity to evaluate ideas quickly using controlled experiments, also called randomized experiments, A/B tests (and their generalizations), split tests, and MultiVariable Tests (MVT). Controlled experiments embody the best scientific design for establishing a causal relationship between changes and their influence on user-observable behavior. Data Mining and Knowledge Discovery techniques can then be used to analyze the data from such experiments. The tutorial will provide a survey and practical guide to running controlled experiments based on the recently published survey article in the Data Mining and Knowledge Discovery Journal, co-authored by two of the tutorial co-presenters (http://exp-platform.com/dmkd_survey.aspx), and on the book “Always Be Testing”, co-authored by the third tutorial co-presenter (http://www.amazon.com/Always-Be-Testing-Complete-Optimizer/dp/0470290633). The book covers the use of industry tools such as Google Website Optimizer, and recently ranked #2 on Amazon’s sales rank for computers/e-commerce books. The tutorial includes multiple real-world examples of actual controlled experiments (many with surprising results), a review of the theory and statistics used to plan and analyze such experiments, and a discussion of the limitations and pitfalls that experimenters might face. Demos will be shown of some tools that support controlled experiments.
A video of a related talk can be found on the videolectures website.
The shorter version of the DMKD survey paper is now part of the class reading for several classes at Stanford University (CS147, CS376), UCSD (CSE 291), and the University of Washington (CSEP 510).
Topics covered include:
- Why online experimentation using controlled experiments is important
- What you need in order to conduct a valid experiment
- Planning and Analysis of basic experiments
- Benefits and limitations of experimentation
- Multivariable experiments: setup, analysis, interpretation, and interactions
- Using online free and low-cost software services (demos)
- Challenges and advanced statistical concepts for experiments
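To make the analysis of a basic experiment concrete, here is a minimal sketch of a two-proportion z-test comparing the conversion rates of a control (A) and a treatment (B). This is a standard textbook test chosen for illustration and not necessarily the exact procedure presented in the tutorial; the function name `two_proportion_z_test` and the example counts are hypothetical.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates.

    conv_a, conv_b: number of converting users in each variant.
    n_a, n_b: number of users assigned to each variant.
    Returns (z, p_value); assumes samples large enough for the
    normal approximation and a pooled rate strictly between 0 and 1.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis of no difference.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value: 2 * (1 - Phi(|z|)) == erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value
```

For example, `two_proportion_z_test(100, 1000, 150, 1000)` compares a 10% control conversion rate with a 15% treatment rate on 1,000 users each. A small p-value suggests that the observed difference is unlikely to be due to chance alone.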
T5 – Predictive Modelling in the Wild: Success Factors in Data Mining Competitions and Real-World Applications
Saharon Rosset, Claudia Perlich
In this tutorial, we give our perspective on the keys to success in applying predictive modeling to competitions like KDD Cup and to real-life business intelligence projects. We argue that these two modes of applying predictive modeling share many similarities, but also have some important differences. We discuss the main success factors in predictive modeling: domain understanding, statistical acumen, and appropriate algorithmic approaches. We describe our relevant experiences in the context of three recent predictive modeling competitions where our team has had success (KDD Cup 2007 and 2008 and INFORMS DM challenge 2008) and two case studies of projects we have led at IBM Research. We also survey some of the recurring challenges and complexities in practical predictive modeling applications. One key issue is information leakage, and we discuss its definition, influence, detection, and avoidance. We consider leakage to be the silent killer of many predictive modeling projects; we demonstrate its impact on the competitions and discuss the challenges of addressing it in real-life projects. Other challenges include framing real-life modeling objectives as predictive modeling problems, and usefully applying relational learning concepts when modeling “real-life” complex, relational datasets.
T6 – New Directions in Data Quality Mining
Laure Berti-Equille (Univ. of Rennes 1, France), Tamraparni Dasu (AT&T Labs – Research)
As data types and data structures change to keep up with evolving technologies and applications, data quality problems have also evolved and become more complex and interwoven. Data streams, web logs, Wikipedias, biomedical applications, video streams, and social networking websites generate a mind-boggling variety of data types. However, data quality mining, the use of data mining to manage, measure, and improve data quality, has focused mostly on addressing each category of data glitch separately as a static entity.
In this tutorial we provide a technical, KDD-focused account of recent research and developments in discovering and treating complex data anomalies in a broad range of data. In particular, we highlight new directions in data quality mining: (a) the applicability and effectiveness of the methodologies for various data types such as structured, semi-structured and stream data, (b) the detection of concomitant data glitches and patterns like the occurrence of outliers in data with missing values and duplicates, or the co-occurrence of missing values and duplicates, (c) the design of sequential approaches to data quality mining, such as workflows composed of a sequence of tasks for data quality exploration and analysis. We give an overview of past work, introduce current research in this area including recent methods and techniques for discovering complex patterns of anomalies (e.g., multivariate outliers, disguised missing values, combination of different types of noise), and highlight new directions and open problems in data quality mining.
The tutorial includes extensive case studies and practical examples of mining data quality problems for a variety of large datasets and data types (e.g., relational, XML, data streams). We discuss illustrative examples drawn from a variety of domains such as CRM, networking, biology, and mobility.
T7 – Event Detection
Daniel Neill, Weng-Keen Wong
A common task in surveillance, scientific discovery and data cleaning involves monitoring routinely collected data for anomalous events. Detecting events in univariate time series data can be effectively accomplished using well-established techniques such as Box-Jenkins models, regression, and statistical quality control methods. In recent years, however, routinely collected data has become increasingly complex. At each time step, the data collected can consist of multivariate vectors and/or be spatial in nature. For instance, healthcare data used in disease surveillance often consists of multivariate patient records or spatially distributed pharmaceutical sales data. Consequently, new event detection algorithms have been developed that not only consider temporal information but also detect spatial patterns and integrate information from multiple spatio-temporal data streams.
This tutorial will present algorithms for event detection, with a focus on algorithms dealing with multivariate temporal and spatio-temporal data. We will introduce event detection by providing a general formulation of the event detection problem and describing its unique challenges. In the first half of the tutorial, we will cover algorithms for detecting events in both univariate and multivariate temporal data. The second half will present methods for detecting events in spatio-temporal data, including several recently proposed multivariate approaches.
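As a small illustration of event detection in univariate temporal data, the sketch below applies a one-sided CUSUM chart, a classical statistical quality control method, to a series of counts. It is an example under stated assumptions rather than an algorithm taken from the tutorial: the function name `cusum_alarms`, the slack and threshold values, and the use of an initial window to estimate the baseline are all choices made for this sketch.

```python
import statistics


def cusum_alarms(series, baseline_len, slack=0.5, threshold=4.0):
    """One-sided CUSUM detector for upward shifts in a univariate series.

    The first `baseline_len` points are assumed to be event-free and are
    used to estimate the mean and standard deviation (which must be
    non-zero). Returns the indices at which an alarm is raised.
    """
    baseline = series[:baseline_len]
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)
    cusum = 0.0
    alarms = []
    for t in range(baseline_len, len(series)):
        z = (series[t] - mu) / sigma
        # Accumulate only deviations that exceed the slack; never go below 0.
        cusum = max(0.0, cusum + z - slack)
        if cusum > threshold:
            alarms.append(t)
            cusum = 0.0  # restart the chart after each alarm
    return alarms
```

The slack controls how large a deviation must be before it starts to accumulate, and the threshold trades detection delay against false alarms. Multivariate and spatio-temporal methods extend this basic idea of comparing observed data against an expected baseline.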
T8 – Advances in Mining the Web
Myra Spiliopoulou, Osmar Zaiane, Bamshad Mobasher, Olfa Nasraoui
The Web has changed our way of life and the Web 2.0 has changed our way of perceiving and using the Web. Data analysis is now required in a plethora of applications that aim to enrich the experience of people with the Web. We first discuss data mining for the social Web. We elaborate on social network analysis and focus on community mining, then go over to recommendation engines and personalization. We discuss the challenges that emerged through the shift from the traditional Web to Web 2.0. We then focus on two issues – the need to protect Web applications from manipulation and the need to make them adaptive towards change. We first discuss manipulations/attacks in recommender systems and present counter-measures. We then elaborate on how changes/concept drifts can be dealt with in applications that analyze clickstream data, monitor topics in news and blogs, or monitor communities and their evolution.
This tutorial is aimed at novice researchers who have a general background in data mining and are interested in understanding the potential and challenges pertinent to the social Web. The participants should have a basic understanding of recommendation engines, personalization, and text modeling for mining (vector space models). They will learn how basic techniques are extended and new techniques are designed for mining the Web, especially the social Web. They will also learn about issues that are still open and require further research – research that the tutorial participants may decide to perform themselves.
PART I: Mining the Social Web [Osmar Zaiane]
PART II: Recommendations and Personalization in the Social Web [Bamshad Mobasher]
PART III: Dealing with Evolution in the Web [Myra Spiliopoulou]
PART IV: Mining Web Data Streams [Olfa Nasraoui]
T9 – Real World Text Mining
Ronen Feldman, Lyle Ungar
The proliferation of documents available on the Web and on corporate intranets is driving a new wave of text mining research and application. Earlier research addressed extraction of information from relatively small collections of well-structured documents such as newswire or scientific publications. Text mining from other corpora, such as the web, requires new techniques drawn from data mining, machine learning, NLP, and IR. Text mining requires preprocessing document collections (text categorization, information extraction, term extraction), storage of the intermediate representations, analysis of these intermediate representations (distribution analysis, clustering, trend analysis, association rules, etc.), and visualization of the results. In this tutorial we will present the algorithms and methods used to build text mining systems, including pre-processing techniques, supervised learning (e.g., CRF), entity resolution, relationship extraction, unsupervised learning, and machine reading.
The tutorial will cover the state of the art in this rapidly growing area of research, including recent advances in unsupervised methods for extracting facts from text and methods used for web-scale mining. We will also present several real world applications of text mining. Special emphasis will be given to lessons learned from years of experience in developing real world text mining systems, including how to handle informal texts such as blogs and user reviews and how to build scalable systems.
The instructors are Ronen Feldman and Lyle Ungar. Ronen is an Associate Professor of Information Systems at the Business School of the Hebrew University in Jerusalem. He is the founder of the ClearForest text mining corporation, and the author of the book “The Text Mining Handbook” published by Cambridge University Press in 2007. Lyle is an Associate Professor of Computer and Information Science at the University of Pennsylvania. He recently returned from a sabbatical at Google, where he and a team built what is probably the world’s largest named entity recognition system.