Conference of the year: KDD 2009

This is one great conference you should attend if you have the time and inclination to check out the latest advances in the world of knowledge discovery. While KXEN (for whom I consult on social media) is a Gold Sponsor, the following posts on workshops, demos and papers will show you just how much technical stuff, as opposed to the marketing bullshit and jazz of other confs, is available at this conference. So pack your bags, and Viva La France for a grueling, refreshing course in knowledge discovery and text mining. Incidentally, KXEN intends to show its path-breaking, cutting-edge social network analysis software KSN here.

Disclaimer: I am a social media consultant to KXEN.

KDD2009: Workshops

Abstracts

W1 – Statistical and Relational Learning and Mining in Bioinformatics (StReBio’09)

Jan Ramon, Fabrizio Costa, Christophe Costa Florencio, Joost Kok

Bioinformatics is an application domain where information is naturally represented in terms of relations between heterogeneous objects. Modern experimentation and data acquisition techniques allow the study of complex interactions in biological systems. This raises interesting challenges because the amount of data is huge, some information cannot be observed, and measurements may be noisy.

The StReBio’09 workshop invites contributions concerning applications of statistical relational learning and mining methods in bioinformatics domains. In particular, the workshop invites regular papers, problem statements, and problem-solution papers.

W2 – The 3rd International Workshop on Knowledge Discovery from Sensor Data (SensorKDD-2009)

Olufemi Omitaomu, Auroop Ganguly, Joao Gama, Ranga Raju Vatsavai, Mohamed Medhat Gaber and Nitesh V. Chawla

Wide-area sensor infrastructures, remote sensors, RFIDs, and wireless sensor networks yield massive volumes of disparate, dynamic, and geographically distributed data. The Sensor-KDD 2009 workshop solicits papers that describe innovative solutions in offline data mining and/or real-time analysis of sensor or streaming data. Position papers that describe the challenges and requirements for sensor data based knowledge discovery in high-priority application domains, as well as relevant case studies, are particularly encouraged.

W3 – ACM SIGKDD Workshop on CyberSecurity and Intelligence Informatics (CSI-KDD)

Hsinchun Chen, Marc Dacier, Marie-Francine Moens, Gerhard Paaß, Christopher C. Yang

Computer-supported communication and infrastructure are integral parts of the modern economy. Their security is of incredible importance to a wide variety of practical domains ranging from Internet service providers to the banking industry and e-commerce, from corporate networks to the intelligence community. Of interest to this workshop are novel knowledge discovery methods addressing this field, e.g. adaptive, active or anticipatory approaches integrating new types of contents and protocols. Equally important are innovative applications demonstrating the effectiveness of data mining in solving real-world security problems.

W4 – Workshop on Visual Analytics and Knowledge Discovery (VAKD ’09)

Kai Puolamäki, Heikki Mannila, Alessio Bertone, Silvia Miksch, Mark A. Whiting, Jean Scholtz

The goal of Visual Analytics is to derive insight from massive, dynamic, ambiguous, and often conflicting data; detect the expected and discover the unexpected; provide timely, defensible, and understandable assessments; and communicate the assessment effectively for action. The goal of this workshop is to raise awareness in the KDD community of the importance of Visual Analytics and to bring together researchers from the underlying fields to bridge the gap between them and write a KDD research roadmap on Visual Analytics.

W5 – The Third International Workshop on Data Mining and Audience Intelligence for Advertising (ADKDD)

Ying Li, Arun C. Surendran, and Dou Shen

Advertising, especially online advertising, is growing rapidly and brings about large volumes of data along with challenging data mining problems. Following on the success of ADKDD 2007 and 2008, ADKDD 2009 is to be held in Paris, France, in conjunction with KDD 2009, to provide a high-level international forum for the academic community and the industry to present the state of the art of algorithms and applications of advertising.

We encourage papers that bring up and formalize new research problems in online advertising, or propose novel data mining techniques for existing problems. We plan to cover (but are not restricted to) the following areas: Mining for Ad Relevance and Ranking; Audience Intelligence & User Modeling; Content Understanding; Search Engine Marketing, Optimization (SEMs, SEOs) and Other Topics in Advertising. Accepted papers will be archived in the ACM Digital Library, and one or two papers will be recommended to SIGKDD Explorations.

W6 – The 3rd Workshop on Social Network Mining and Analysis (SNA-KDD)

Lee Giles, Prasenjit Mitra, Igor Perisic, John Yen, Haizheng Zhang

(Abstract Coming Soon)

W7 – Human Computation Workshop (HCOMP 2009)

Paul Bennett, Raman Chandrasekar, Max Chickering, Panos Ipeirotis, Edith Law, Foster Provost, Anton Mityagin, Luis von Ahn

Human computation is a new research area that studies the process of channeling the vast internet population to perform tasks or provide data towards solving difficult problems that no known computer algorithms can yet solve perfectly and efficiently, e.g. digitize books, recognize objects in images and songs, translate sentences, summarize news articles, annotate videos etc. The goal of HCOMP 2009 is to bring together academic and industry researchers in a stimulating discussion of existing human computation applications, such as Games With A Purpose (e.g. the ESP game), Mechanical Turk and CAPTCHAs, and future directions of this new subject area.

Included in the workshop are invited talks, presentations, posters, and a demo session where participants are invited to showcase their human computation applications.

W8 – Data Mining using Matrices and Tensors (DMMT’09)

Chris Ding, Tao Li

This workshop will present recent advances in algorithms and methods using matrix and scientific computing/applied mathematics for modeling and analyzing massive, high-dimensional, and nonlinear-structured data. One main goal of the workshop is to bring together leading researchers on many topic areas (e.g., computer scientists, computational and applied mathematicians) to assess the state-of-the-art, share ideas, and form collaborations. We also wish to attract practitioners who seek novel ideas for applications.

W9 – Third Workshop on Data Mining Case Studies and Practice Prize (DMCS)

Gabor Melli, Peter van der Putten, Brendan Kitts

The Data Mining Case Studies Workshop and Practice Prize was established to recognize the very best data mining deployments for the year. Data Mining Case Studies will highlight data mining implementations that have been responsible for a significant and measurable improvement in business operations, advanced scientific discoveries, or provided other benefits to humanity. The best paper will be awarded the Practice Prize. Do you have an outstanding data mining application? This is a unique opportunity to be recognized for your work.

W10 – KDD cup 2009: Fast Scoring on a Large Database (KDDcup09)

Isabelle Guyon, David Vogel

This workshop will discuss the results of the KDD cup 2009. The competition is organized around a large dataset provided by the French telecom company Orange. It is a problem of Customer Relationship Management (CRM), a key element of modern marketing strategies. Orange offered the opportunity to work on a large marketing database to predict the propensity of customers to switch provider (churn), buy new products or services (appetency), or buy upgrades or add-ons proposed to them to make the sale more profitable (up-selling).

W11 – The First ACM SIGKDD Workshop on Knowledge Discovery from Uncertain Data (U’09)

Jian Pei, Lise Getoor, Ander de Keijzer

The First ACM SIGKDD International Workshop on Knowledge Discovery from Uncertain Data (U’09) is to discuss in depth the challenges, opportunities and techniques on the topic of analyzing and mining uncertain data. The theme of this workshop is to make connections among the research areas of probabilistic databases, probabilistic reasoning, and data mining, as well as to build bridges among the aspects of models, data, applications, novel mining tasks and effective solutions. By making connections among different communities, we aim at understanding each other in terms of scientific foundation as well as commonality and differences in research methodology.

KDD-09 Call For Workshop Proposals (Expired)

The ACM KDD-2009 organizing committee invites proposals for workshops to be held in conjunction with the conference. The purpose of a workshop is to provide participants with the opportunity to present and discuss novel research ideas on active and emerging topics of knowledge discovery and data mining. A workshop should also support the interaction and feedback among topic specialists from academia, industry and government.

A workshop may be organized around industrial applications in a particular domain and the challenges this domain poses, such as the Netflix workshop on recommender systems (http://netflixkddworkshop2008.info/).

A workshop may also include a challenge problem, such as the one on time series classification that took place in 2007 (http://www.cs.ucr.edu/~eamonn/SIGKDD2007TimeSeries.html). A session with papers that address a challenge complements the more diverse sessions with regular papers and improves the potential for discussion. Because such challenges require extra time to plan, we may be willing to provide early notice of acceptance.

The organizers of approved workshops are required to announce the workshop and call for papers, gather submissions, conduct the reviewing process and decide upon the final workshop program. They must also prepare an informal set of workshop proceedings to be distributed with the registration materials at the conference. They may choose to form organizing or program committees for assistance in these tasks. The logistics of the workshops will be handled with help from the ACM KDD-2009 organizers.

Source: http://www.kdd.org/kdd/2009/workshops.html

KDD2009: Papers Research and Industrial

Research Papers

A Generalized Co-HITS Algorithm and Its Application to Bipartite Graphs
Hongbo Deng* The Chinese University of Hong Kong; Michael Lyu The Chinese University of Hong Kong; Irwin King The Chinese University of Hong Kong

A LRT Framework for Fast Spatial Anomaly Detection
Mingxi Wu* Oracle Corporation; Xiuyao Song ; Chris Jermaine University of Florida; Sanjay Ranka University of Florida; John Gums

A Multi-Relational Approach to Spatial Classification
Richard Frank* Simon Fraser University; Martin Ester Simon Fraser University; Arno Knobbe Leiden University

A Principled and Flexible Framework for Finding Alternative Clusterings
ZiJie Qi* UCDavis; Ian Davidson University of California Davis

A Viewpoint-based Approach for Interaction Graph Analysis
Sitaram Asur* Ohio State University; Srinivasan Parthasarathy Ohio State University

Adapting the Right Measures for K-means Clustering
Junjie Wu* Beihang University; Hui Xiong Rutgers University; Jian Chen

An Association Analysis Approach to Biclustering
Gaurav Pandey* University of Minnesota; Gowtham Atluri ; Michael Steinbach University of Minnesota; Chad Myers University of Minnesota; Vipin Kumar University of Minnesota

Analyzing Patterns of User Content Generation in Online Social Networks
Lei Guo* Yahoo!; Enhua Tan Ohio State University; Songqing Chen George Mason University; Xiaodong Zhang Ohio State University; Yihong (Eric) Zhao Yahoo!

Anomalous Window Discovery through Scan Statistics for Linear Intersecting Paths (SSLIP)
Lei Shi University of Maryland Baltimore County; Vandana Janeja* UMBC

Audience Selection for On-line Brand Advertising: Privacy-friendly Social Network Targeting
Foster Provost* NYU; Brian Dalessandro Media6degrees; Rod Hook Coriolis Ventures; Xiaohan Zhang New York University

Augmenting the Generalized Hough Transform to Enable the Mining of Petroglyphs
Qiang Zhu* Univ of California Riverside; Xiaoyue Wang Univ of California Riverside; Eamonn Keogh UC Riverside; Sang-Hee Lee UC Riverside

BBM: Bayesian Browsing Model from Petabyte-scale Data
Chao Liu* Microsoft Research; Fan Guo Carnegie Mellon University; Christos Faloutsos CMU

Cross Domain Distribution Adaptation via Kernel Mapping
Erheng Zhong* Sun Yat-Sen University; Wei Fan IBM T.J.Watson; Jing Peng Montclair State University; Kun Zhang Xavier University of Louisiana; Jiangtao Ren Sun Yat-Sen University; Olivier Verscheure IBM T.J.Watson; Deepak Turaga IBM

Cartesian Contour: A Concise Representation for a Collection of Frequent Sets
Ruoming Jin* Kent State University; Yang Xiang Kent State University; Lin Liu Kent State University

Category Detection Using Hierarchical Mean Shift
Pavan Vatturi Oregon State University; Weng-Keen Wong* Oregon State University

Causality Quantification and Its Applications: Structuring and Modeling of Multivariate Time Series
Takashi Shibuya* The University of Tokyo; Tatsuya Harada The University of Tokyo; Yasuo Kuniyoshi The University of Tokyo

Characteristic Relational Patterns
Arne Koopman* Universiteit Utrecht; Arno Siebes Universiteit Utrecht

Classification of Software Behaviors for Failure Detection: A Discriminative Pattern Mining Approach
David Lo Singapore Management University; Hong Cheng* Chinese University of Hong Kong; Jiawei Han University of Illinois at Urbana-Champaign; Siau-Cheng Khoo National University of Singapore; Chengnian Sun National University of Singapore

Co-Clustering on Manifolds
Quanquan Gu* Tsinghua University; Jie Zhou Tsinghua University

CoCo: Coding Cost for Parameter-free Outlier Detection
Christian Bohm University of Munich; Katrin Haegler University of Munich; Nikola Muller Max Planck Institute of Biochemistry Martinsried Germany; Claudia Plant* Technische Universitat Munchen

Co-evolution of Social and Affiliation Networks
Hossam Sharara* University of Maryland; Elena Zheleva University of Maryland College Park; Lise Getoor University of Maryland

Collaborative Filtering with Temporal Dynamics
Yehuda Koren* Yahoo! Research

Collective Annotation of Wikipedia Entities in Web Text
Sayali Kulkarni IIT Bombay; Amit Singh IIT Bombay; Ganesh Ramakrishnan IIT Bombay; Soumen Chakrabarti* IIT Bombay

Collusion-Resistant Anonymous Data Collection Method
Mafruz Zaman Ashrafi* Institute For Infocomm Researc; See-Kiong Ng Institute for Infocomm Research

Combining Link and Content for Community Detection: A Discriminative Approach
Tianbao Yang* Michigan State University; Rong Jin Michigan State University; Yun Chi NEC Laboratories America; Shenghuo Zhu NEC Laboratories America Inc.

Connections between the Lines: Augmenting Social Networks with Text
Jonathan Chang* Princeton University; Jordan Boyd-Graber Princeton University; David Blei Princeton University

Consensus Group Based Stable Feature Selection
Lei Yu* Binghamton University; Steven Loscalzo SUNY Binghamton; Chris Ding University of Texas at Arlington

Constant-Factor Approximation Algorithms for Identifying Dynamic Communities
Chayant Tantipathananandh* UIC; Tanya Berger-Wolf UIC

Constrained Optimization for Validation-Guided Conditional Random Field Learning
Minmin Chen ; Yixin Chen* Washington University in St. L

Correlated Itemset Mining in ROC Space: A Constraint Programming Approach
Siegfried Nijssen* Leuven University; Tias Guns Katholieke Universiteit Leuven; Luc De Raedt Katholieke Universiteit Leuven

CP-Summary: A Concise Representation for Browsing Frequent Itemsets
Ardian Poernomo* Nanyang Technological Universi; Vivekanand Gopalkrishnan Nanyang Technological Universi

Detection of Unique Temporal Segments by Information Theoretic Meta-clustering
Shin Ando* Gunma University; Einoshin Suzuki

Differentially-Private Recommender Systems
Frank McSherry* Microsoft Research; Ilya Mironov Microsoft Research

DOULION: Counting Triangles in Massive Graphs with a Coin
Charalampos Tsourakakis* Carnegie Mellon University; U Kang Carnegie Mellon University; Gary Miller Carnegie Mellon University; Christos Faloutsos CMU

Drosophila Gene Expression Pattern Annotation Using Sparse Features and Term-term Interactions
Shuiwang Ji* Arizona State University; Lei Yuan Arizona State University; Ying-Xin Li Nanjing University; Zhi-Hua Zhou Nanjing University; Sudhir Kumar ; Jieping Ye Arizona State University

DynaMMo: Mining and Summarization of Coevolving Sequences with Missing Values
Lei Li* Carnegie Mellon University; Jim McCann Carnegie Mellon University; Nancy Pollard Carnegie Mellon University; Christos Faloutsos CMU

Effective Multi-Label Active Learning for Text Classification
Bishan Yang* Peking University; JianTao Sun ; Zheng Chen

Efficient Anomaly Monitoring Over Moving Object Trajectory Streams
Lei Chen* HKUST; Ada Fu Chinese University of Hong Kong; Yingyi Bu CUHK

Efficient Influence Maximization in Social Networks
Wei Chen* Microsoft Research Asia; Yajun Wang Microsoft Research Asia; Siyu Yang Tsinghua University

Efficient Methods for Topic Model Inference on Streaming Document Collections
Limin Yao* University of Massachusetts Am; David Mimno University of Massachusetts Amherst; Andrew McCallum University of Massachusetts Amherst

Efficiently Learning the Accuracy of Labeling Sources for Selective Sampling
Pinar Donmez* Carnegie Mellon University; Jaime Carbonell Carnegie Mellon University; Jeff Schneider Carnegie Mellon University

Exploiting Wikipedia as External Knowledge for Document Clustering
Tony Hu* Drexel University; Xiaodan Zhang Drexel University; Caimei Lu Drexel University; E.K. Park University of Missouri at Kansas City; Xiaohua Zhou Drexel University

Exploring Social Tagging Graph for Web Object Classification
Zhijun Yin* University of Illinois; Rui Li ; Qiaozhu Mei ; Jiawei Han University of Illinois at Urbana-Champaign

Extracting Discriminative Concepts for Domain Adaptation in Text Mining
Bo Chen* CUHK; Wai Lam CUHK; Ivor Tsang NTU; Tak-lam Wong CUHK

Fast Approximate Spectral Clustering
Donghui Yan University of California Berkeley; Ling Huang* Intel Research; Michael Jordan University of California Berkeley

Feature Shaping for Linear SVM Classifiers
George Forman* Hewlett-Packard Labs; Martin Scholz HP Labs; Shyamsundar Rajaram Hewlett-Packard

Finding a Team of Experts in Social Networks
Theodoros Lappas Univ of California Riverside; Kun Liu IBM Almaden; Evimaria Terzi* IBM Almaden

Frequent Pattern Mining with Uncertain Data
Charu Aggarwal* IBM T J Watson Research Center; Yan Li Tsinghua University; Jianyong Wang Tsinghua University; Jing Wang New York University

Genre-based Decomposition of Email Class Noise
Aleksander Kolcz* Microsoft Live Labs; Gordon Cormack University of Waterloo

Grouped Graphical Granger Modeling Methods for Temporal Causal Modeling
Aurelie Lozano* IBM Research; Naoki Abe IBM T J Watson Research Center; Yan Liu IBM Research; Saharon Rosset Tel-Aviv University Israel

Heterogeneous Source Consensus Learning via Decision Propagation and Negotiation
Jing Gao* UIUC; Wei Fan IBM T.J.Watson; Yizhou Sun ; Jiawei Han University of Illinois at Urbana-Champaign

Improving Clustering Stability with Combinatorial MRFs
Ron Bekkerman* HP Labs; Martin Scholz HP Labs; Krishnamurthy Viswanathan HP Labs

Improving Data Mining Utility with Projective Sampling
Mark Last* BGU

Information Theoretic Regularization for Semi-Supervised Boosting
Lei Zheng Wright State University; Shaojun Wang* Wright State University; Yan Liu Wright State University; Chi-Hoon Lee Yahoo

Issues in Evaluation of Stream Learning Algorithms
Joao Gama* University of Porto; Raquel Sebastiao LIAAD; Pedro Rodrigues LIAAD

Large Human Communication Networks: Patterns and a Utility-Driven Generator
Nan Du* CMU; Christos Faloutsos CMU; Bai Wang ; Leman Akoglu Carnegie Mellon University

Large-Scale Behavioral Targeting
Ye Chen* Yahoo! Labs; Dmitry Pavlov Yahoo! Labs; John Canny Computer Science Division University of California Berkeley

Large-Scale Graph Mining Using Backbone Refinement Classes
Andreas Maunz* Freiburg Center for Data Analy; Christoph Helma in-silico toxicology; Stefan Kramer Institut fur Informatik Technische Universitat Munchen

Large-Scale Sparse Logistic Regression
Jun Liu* Arizona State University; Jianhui Chen ASU; Jieping Ye Arizona State University

Learning Optimal Ranking with Tensor Factorization for Tag Recommendation
Steffen Rendle* University of Hildesheim; Leandro Marinho University of Hildesheim; Alexandros Nanopoulos University of Hildesheim; Lars Schmidt-Thieme University of Hildesheim

Learning Patterns in the Dynamics of Biological Networks
Chang hun You* Washington State University; Lawrence Holder Washington State University; Diane Cook Washington State University

Learning with a Nonexhaustive Training Dataset
Murat Dundar* IUPUI; Arun Bhunia Purdue University; Daniel Hirleman Purdue University; Paul Robinson ; Bartek Rajwa Purdue University

Learning, Indexing, and Diagnosing Network Faults
Ting Wang* Georgia Tech; Mudhakar Srivatsa IBM T.J. Watson Research Cente; Dakshi Agrawal ; Ling Liu

Measuring the Effects of Preprocessing Decisions and Network Forces in Dynamic Network Analysis
Jerry Scripps* Michigan State University; Pang-Ning Tan Michigan State University; Abdol-Hossein Esfahanian Michigan State University

Meme-tracking and the Dynamics of the News Cycle
Jure Leskovec* Cornell University; Lars Backstrom Cornell University; Jon Kleinberg Cornell University

MetaFac: Community Discovery via Relational Hypergraph Factorization
Yu-Ru Lin* Arizona State University; Jimeng Sun IBM; Paul Castro IBM; Ravi Konuru IBM; Hari Sundaram ; Aisling Kelliher Arizona State University

Mind the Gaps: Weighting the Unknown in Large-Scale One-Class Collaborative Filtering
Rong Pan* HP Labs; Martin Scholz HP Labs

Mining Broad Latent Query Aspects from Search Sessions
Xuanhui Wang UIUC; Deepayan Chakrabarti Yahoo! Research; Kunal Punera* Yahoo! Research

Mining Discrete Patterns via Binary Matrix Factorization
Bao-Hong Shen Arizona State University; Shuiwang Ji Arizona State University; Jieping Ye* Arizona State University

Mining for the Most Certain Predictions from Dyadic Data
Meghana Deodhar* University of Texas at Austin; Joydeep Ghosh The University of Texas at Austin

Mining Rich Session Context to Improve Web Search
Guangyu Zhu* University of Maryland College Park; Gilad Mishne Yahoo! Search and Advertising Sciences

Mining Social Networks for Personalized Email Prioritization
Shinjae Yoo* Carnegie Mellon University; Yiming Yang ; Frank Lin ; Il-Chul Moon

Characterizing Individual Communication Patterns
Dean Malmgren* Northwestern University; Jake Hofman Yahoo! Research; Luis Amaral Northwestern University; Duncan Watts Yahoo! Research

Multi-focal Learning and Its Application to Customer Service Support
Yong Ge* Rutgers University; Hui Xiong Rutgers University; Wenjun Zhou Rutgers University; Ramendra Sahoo IBM T.J. Watson Research Center; Xiaofeng Gao ; Weili Wu

Name-Ethnicity Classification from Open Sources
Anurag Ambekar Stony Brook University; Charles Ward Stony Brook University; Jahangir Mohammed Stony Brook University; Swapna Male Stony Brook University; Steven Skiena* Stony Brook University

New ensemble methods for evolving data streams
Albert Bifet* Universitat Politecnica de Cat; Geoff Holmes University of Waikato; Bernhard Pfahringer University of Waikato Hamilton; Richard Kirkby University of Waikato; Ricard Gavalda Universitat Politecnica de Catalunya

On Burstiness-aware Search for Document Sequences
Theodoros Lappas* Univ of California Riverside; Benjamin Arai Univ of California Riverside; Dimitrios Gunopulos UCR NKUA; Manolis Platakis ; Dimitrios Kotsakos

On Compressing Social Networks
Flavio Chierichetti ; Ravi Kumar* Yahoo; Silvio Lattanzi ; Michael Mitzenmacher ; Alessandro Panconesi ; Prabhakar Raghavan

On the Tradeoff Between Privacy and Utility in Data Publishing
Tiancheng Li* Purdue University; Ninghui Li Purdue University

Optimizing Web Traffic via the Media Scheduling Problem
Lars Backstrom* Cornell University; Jon Kleinberg Cornell University; Ravi Kumar Yahoo

Parallel Community Detection on Large Networks with Propinquity Dynamics
Yuzhou Zhang* Tsinghua University; Jianyong Wang Tsinghua University; Yi Wang Google Beijing Research; Lizhu Zhou Tsinghua University

Primal Sparse Max-Margin Markov Networks
Jun Zhu* Tsinghua University; Eric Xing Carnegie Mellon University; Bo Zhang Tsinghua University

Probabilistic Frequent Itemset Mining in Uncertain Databases
Matthias Renz* Ludwig-Maximilians-Universitat; Thomas Bernecker Ludwig-Maximilians-Universitat Munchen; Florian Verhein Ludwig-Maximilians-Universitat Munchen; Andreas Zuefle Ludwig-Maximilians-Universitat Munchen; Hans-Peter Kriegel University of Munich

Quantification and Semi-supervised Classification Methods for Handling Changes in Class Distribution
Jack Chongjie Xue* Fordham University; Gary Weiss Fordham University

Ranking-Based Clustering of Heterogeneous Information Networks with Star Network Schema
Yizhou Sun* UIUC; Yintao Yu UIUC; Jiawei Han University of Illinois at Urbana-Champaign

Regression based Latent Factor Models
Deepak Agarwal* Yahoo!; Bee-Chung Chen Yahoo!

Regret-based Online Ranking for a Growing Digital Library
Erick Delage* Stanford University

Relational Learning via Latent Social Dimensions
Lei Tang* Arizona State University; Huan Liu

Scalable Graph Clustering Using Flows: Applications to Community Discovery
Venu Satuluri The Ohio State University; Srinivasan Parthasarathy* Ohio State University

Scalable Pseudo-Likelihood Estimation in Hybrid Random Fields
Antonino Freno* University of Siena; Edmondo Trentin ; Marco Gori

Social Influence Analysis in Large-scale Networks
Jie Tang* Tsinghua University; Jimeng Sun IBM TJ Watson Research Center; Chi Wang Tsinghua Univ.

Spatial-temporal causal modeling for climate change attribution
Aurelie Lozano* IBM Research; Hongfei Li IBM Research; Alexandru Niculescu-Mizil IBM Research; Yan Liu IBM Research; Claudia Perlich IBM USA; Jonathan Hosking IBM Research; Naoki Abe IBM T J Watson Research Center

Structured Correspondence Topic Models for Mining Captioned Figures in Biological Literature
Amr Ahmed* Carnegie Mellon University; Eric Xing Carnegie Mellon University; William Cohen Carnegie Mellon University; Robert Murphy Carnegie Mellon University

TANGENT: A Novel, “Surprise-Me”, Recommendation Algorithm
Kensuke Onuma Sony Corporation; Hanghang Tong* CMU; Christos Faloutsos CMU

Tell Me Something I Don’t Know: Randomization Strategies for Iterative Data Mining
Sami Hanhijarvi* Helsinki Univ. of Technology; Markus Ojala Helsinki University of Technology; Niko Vuokko ; Kai Puolamaki ; Nikolaj Tatti Helsinki Univ. of Technology; Heikki Mannila

Temporal Mining for Interactive Workflow Data Analysis
Michele Berlingerio* KDD Lab Pisa ISTI C.N.R.; Fosca Giannotti ISTI CNR; Mirco Nanni KDD Lab – ISTI – CNR; Fabio Pinelli Isti – CNR – Italy Pisa

The Offset Tree for Learning with Partial Labels
John Langford* ; Alina Beygelzimer IBM

Time Series Shapelets: A New Primitive for Data Mining
Lexiang Ye* UC Riverside; Eamonn Keogh UC Riverside

Toward Autonomic Grids: Analyzing the Job Flow with Affinity Streaming
Xiangliang Zhang* INRIA; Cyril Furtlehner ; Julien Perez ; Cecile Germain-Renaud Universite Paris Sud; Michele Sebag Universite Paris-Sud

Towards Efficient Mining of Proportional Fault-Tolerant Frequent Itemsets
Ardian Poernomo* Nanyang Technological Universi; Vivekanand Gopalkrishnan Nanyang Technological Universi

TrustWalker : A Random Walk Model for Combining Trust-based and Item-based Recommendation
Mohsen Jamali* Simon Fraser University; Martin Ester Simon Fraser University

Turning Down the Noise in the Blogosphere
Khalid El-Arini, Carnegie Mellon University; Gaurav Veda; Dafna Shahaf; Carlos Guestrin

User Grouping Behavior in Online Forums
Xiaolin Shi* University of Michigan; Jun Zhu Tsinghua University; Rui Cai Microsoft Research; Lei Zhang Microsoft Research Asia

Using Graph-based Metrics with Empirical Risk Minimization to Speed Up Active Learning on Networked Data
Sofus Macskassy* Fetch Technologies Inc.

WhereNext: a Location Predictor on Trajectory Pattern Mining
Anna Monreale Isti – CNR – Italy Pisa; Fabio Pinelli Isti – CNR – Italy Pisa; Roberto Trasarti* Isti – CNR – Italy Pisa; Fosca Giannotti ISTI CNR

Industrial Papers

A Case Study of Behavior-driven Conjoint Analysis on Yahoo! Front Page Today Module

Wei Chu*, Yahoo! Labs; Seung-Taek Park, Yahoo! Inc.; Todd Beaupre, Yahoo! Inc.; Nitin Motgi, Yahoo! Inc.; Amit Phadke, Yahoo! Inc.; Seinjuti Chakraborty, Yahoo! Inc.; Joe Zachariah, Yahoo! Inc.

Address Standardization with Latent Semantic Association

Honglei Guo*, IBM China Research Lab; Huijia Zhu, IBM China Research Lab; Zhili Guo, IBM China Research Lab; Xiaoxun Zhang, IBM China Research Lab; Zhong Su, IBM China Research Lab

Anonymizing Healthcare Data: A Case Study on the Blood Transfusion Service

Noman Mohammed, Concordia University; Benjamin C. M. Fung*, Concordia University; Patrick C. K. Hung, University of Ontario Institute of Technology; Cheuk-kwong Lee, Hong Kong Red Cross Blood Transfusion Service

Applying Syntactic Similarity Algorithms for Enterprise Information Management

Lucy Cherkasova*, HPLabs; Kave Eshghi, HPLabs; Brad Morrey, HPLabs; Joseph Tucek, HPLabs; Alistair Veitch, HPLabs

Beyond Blacklists: Learning to Detect Malicious Web Sites from Suspicious URLs

Justin Ma*, UC San Diego; Lawrence Saul, UCSD; Stefan Savage, UC San Diego; Geoffrey Voelker, UC San Diego

BGP-lens: Patterns and Anomalies in Internet Routing Updates

B. Aditya Prakash*, Carnegie Mellon University; Nicholas Valler, UCR; David Andersen, CMU; Michalis Faloutsos, UCR; Christos Faloutsos, CMU

Can We Learn a Template-Independent Wrapper for News Article Extraction from a Single Training Site?

Junfeng Wang*, Zhejiang University; Xiaofei He, ; Can Wang, ; Jian Pei, Simon Fraser University; Jiajun Bu, ; Chun Chen, ; Ziyu Guan, ; Wei Vivian Zhang, Microsoft

Catching the Drift: Learning Broad Matches from Clickthrough Data

Sonal Gupta*, University of Texas at Austin; Mikhail Bilenko, Microsoft Research; Matthew Richardson, Microsoft Research

Clustering of Event Logs Using Iterative Partitioning

Adetokunbo Makanju*, Dalhousie University; Nur Zincir-Heywood, Dalhousie University; Evangelos Milios, Dalhousie University

COA: Finding Novel Patents through Text Analysis

Mohammad Al Hasan*, RPI; W. Scott Spangler, IBM Corporation; Thomas Griffin, IBM Corporation; Alfredo Alba, IBM Corporation

Enabling Analysts in Managed Services for CRM Analytics

Indrajit Bhattacharya, IBM Research; Shantanu Godbole*, IBM Research; Ajay Gupta, IBM Research; Ashish Verma, IBM Research; Jeff Achtermann, IBM MBPS; Kevin English, IBM

Entity Discovery and Assignment for Opinion Mining Applications

Xiaowen Ding*, Univ of Illinois at Chicago; Bing Liu, UIC; Lei Zhang, UIC

Grocery Shopping Recommendations Based on Basket-Sensitive Random Walk

Ming Li*, Unilever UK; Malcolm Dias, Unilever UK; Ian Jarman, Liverpool John Moores University; Wael El-Deredy, University of Manchester; Paulo Lisboa, Liverpool John Moores University

Incorporating Site-Level Knowledge for Incremental Crawling of Web Forums: A List-wise Strategy

Jiang-Ming Yang*, Microsoft Research Asia; Rui Cai, Microsoft Research; Chunsong Wang, University of Wisconsin-Madison; Hua Huang, Beijing University of Posts and Telecommunications; Lei Zhang, Microsoft Research Asia; Wei-Ying Ma, Microsoft Research Asia

Intelligent File Scoring System for Malware Detection from the Gray List

Tao Li*, Florida International University

Learning Dynamic Temporal Graphs for Oil-drilling Equipment Monitoring System

Yan Liu*, IBM Research; Jayant Kalagnanam

Migration Motif: A Spatial-Temporal Pattern Mining Approach for Financial Markets

Xiaoxi Du, KSU; Ruoming Jin*, Kent State University; Liang Ding, Kent State University; Victor Lee, Kent State University; John Thornton, Kent State University

Mining Brain Region Connectivity for Alzheimer’s Disease Study via Sparse Inverse Covariance Estimation

Liang Sun*, Arizona State University; Rinkal Patel, Arizona State University; Jun Liu, Arizona State University; Kewei Chen, Neuroimaging Banner Alzheimer’s Institute; Teresa Wu, Arizona State University; Jing Li, Arizona State University; Eric Reiman, Banner Alzheimer’s Institute and Banner PET Center; Jieping Ye, Arizona State University

Modeling and Predicting User Behavior in Sponsored Search

Joshua Attenberg*, NYU Polytechnic Institute; Torsten Suel, Yahoo Research; Sandeep Pandey, Yahoo Research

Named Entity Mining from Click-Through Log Using Weakly Supervised Latent Dirichlet Allocation

Gu Xu*, Microsoft Research Asia; Shuang-Hong Yang, Georgia Tech; Hang Li, Microsoft Research Asia

Network Anomaly Detection based on Eigen Equation Compression

Shunsuke Hirose*, NEC Corporation; Kenji Yamanishi; Takayuki Nakata; Ryohei Fujimaki

OLAP on Search Logs: An Infrastructure Supporting Data-Driven Applications in Search Engines

Bin Zhou, Simon Fraser University; Daxin Jiang*, MSRA; Jian Pei, Simon Fraser University; Hang Li, Microsoft Research Asia

OpinionMiner: A Machine Learning System for Web Opinion Mining and Extraction

Wei Jin*, North Dakota State University; Hung Hay Ho

Pervasive Parallelism in Data Mining: Dataflow solution to Co-clustering Large and Sparse Netflix Data

Srivatsava Daruru, University of Texas at Austin; Nena Marin*, Pervasive Software; Matthew Walker, Pervasive Software; Joydeep Ghosh, The University of Texas at Austin

Predicting Bounce Rates in Sponsored Search Advertisements

D. Sculley*, Google, Inc.; Robert Malkin, Google, Inc; Sugato Basu, Google, Inc; Roberto Bayardo, Google

PSkip: Estimating relevance ranking quality from web search clickthrough data

Kuansan Wang*, Microsoft Research; Toby Walker; Zijian Zheng

Query Result Clustering for Object-level Search

Jongwuk Lee; Seung-won Hwang*, Postech; Zaiqing Nie; Ji-Rong Wen, Microsoft Research Asia

Improving Classification Accuracy Using Automatically Extracted Training Data

Ariel Fuxman*, Microsoft, USA; Anitha Kannan, Microsoft, USA; Andrew Goldberg, University of Wisconsin; Rakesh Agrawal, Microsoft; Panayiotis Tsaparas, Microsoft

Sentiment Analysis of Blogs by Combining Lexical Knowledge with Text Classification

Prem Melville*, IBM; Wojciech Gryc; Richard Lawrence, IBM, USA

Seven Pitfalls to Avoid when Running Controlled Experiments on the Web

Thomas Crook, Microsoft; Brian Frasca, Microsoft; Ron Kohavi*, Microsoft; Roger Longbotham, Microsoft

SNARE: A Link Analytic System for Graph Labeling and Risk Detection

Mary McGlohon*, Carnegie Mellon University; Stephen Bay, PricewaterhouseCoopers; Markus Anderle, PricewaterhouseCoopers; David Steier, PricewaterhouseCoopers; Christos Faloutsos, CMU

Sustainable Operation and Management of Data Center Chillers using Temporal Data Mining

Debprakash Patnaik, Virginia Tech; Manish Marwah, HP Labs; Ratnesh Sharma, HP Labs; Naren Ramakrishnan*, Virginia Tech

Towards a Universal Marketplace over the Web: Statistical Multi-label Classification of Service Provider Forms with Simulated Annealing

Kivanc Ozonat*, HP Labs

Towards Combining Web Classification and Web Information Extraction: A Case Study

Ping Luo*, HP Labs China

Source – http://www.kdd.org/kdd/2009/papers.html

KDD 2009: Demos

Demos

Exploratory Recommender Systems for Sales and Marketing
Michail Vlachos, Abdel Labbi

Open Mobile Miner: A Toolkit for Mobile Data Stream Mining
Shonali Krishnaswamy, Mohamed Medhat Gaber, Marian Harbach, Christian Hugues, Abhijat Sinha, Brett Gillick, Pari Delir Haghighi, and Arkady Zaslavsky

OSD: An Online Web Spam Detection System
Bin Zhou, Jian Pei

Visalix: A Web Application for Visual Data Analysis and Clustering
Loic Lecerf, Boris Chidlovskii

A Flexible Topic-driven Framework for News Exploration
Juanzi Li, Jun Li, and Jie Tang

Model Monitor: Tracking Model Performance in the Real World
Troy Raeder, Nitesh V. Chawla

SHIFTR: A Fast and Scalable System for Ad Hoc Sensemaking of Large Graphs
Duen Horng Chau, Aniket Kittur, Hanghang Tong, Christos Faloutsos, and Jason I. Hong

Curating and Searching the Annotated Web
Amit Singh, Sayali Kulkarni, Somnath Banerjee, Ganesh Ramakrishnan, Soumen Chakrabarti

Expert2Bólè: From Expert Finding to Bólè Search
Zi Yang, Jie Tang, Bo Wang, Jingyi Guo, and Juanzi

Spam Miner: A Platform for Detecting and Characterizing Spam Campaigns
Pedro H. Calais Guerra, Douglas E. V. Pires, Dorgival Guedes, Wagner Meira Jr., Cristine Hoepers, Klaus Steding-Jessen

Research Papers

A Generalized Co-HITS Algorithm and Its Application to Bipartite Graphs
Hongbo Deng* The Chinese Univ. of Hong Kong; Michael Lyu The Chinese University of Hong Kong; Irwin King Chinese University of Hong Kong

A LRT Framework for Fast Spatial Anomaly Detection
Mingxi Wu* Oracle Corporation; Xiuyao Song ; Chris Jermaine University of Florida; Sanjay Ranka University of Florida; John Gums

A Multi-Relational Approach to Spatial Classification
Richard Frank* Simon Fraser University; Martin Ester Simon Fraser University; Arno Knobbe Leiden University

A Principled and Flexible Framework for Finding Alternative Clusterings
ZiJie Qi* UCDavis; Ian Davidson University of California Davis

A Viewpoint-based Approach for Interaction Graph Analysis
Sitaram Asur* Ohio State University; Srinivasan Parthasarathy Ohio State University

Adapting the Right Measures for K-means Clustering
Junjie Wu* Beihang University; Hui Xiong Rutgers University; Jian Chen

An Association Analysis Approach to Biclustering
Gaurav Pandey* University of Minnesota; Gowtham Atluri ; Michael Steinbach University of Minnesota; Chad Myers University of Minnesota; Vipin Kumar University of Minnesota

Analyzing Patterns of User Content Generation in Online Social Networks
Lei Guo* Yahoo!; Enhua Tan Ohio State University; Songqing Chen George Mason University; Xiaodong Zhang Ohio State University; Yihong (Eric) Zhao Yahoo!

Anomalous Window Discovery through Scan Statistics for Linear Intersecting Paths (SSLIP)
Lei Shi University of Maryland Baltimore County; Vandana Janeja* UMBC

Audience Selection for On-line Brand Advertising: Privacy-friendly Social Network Targeting
Foster Provost* NYU; Brian Dalessandro Media6degrees; Rod Hook Coriolis Ventures; Xiaohan Zhang New York University

Augmenting the Generalized Hough Transform to Enable the Mining of Petroglyphs
Qiang Zhu* Univ of California Riverside; Xiaoyue Wang Univ of California Riverside; Eamonn Keogh UC Riverside; Sang-Hee Lee UC Riverside

BBM: Bayesian Browsing Model from Petabyte-scale Data
Chao Liu* Microsoft Research; Fan Guo Carnegie Mellon University; Christos Faloutsos CMU

Cross Domain Distribution Adaptation via Kernel Mapping
Erheng Zhong* Sun Yat-Sen University; Wei Fan IBM T.J.Watson; Jing Peng Montclair State University; Kun Zhang Xavier University of Louisiana; Jiangtao Ren Sun Yat-Sen University; Olivier Verscheure IBM T.J.Watson; Deepak Turaga IBM

Cartesian Contour: A Concise Representation for a Collection of Frequent Sets
Ruoming Jin* Kent State University; Yang Xiang Kent State University; Lin Liu Kent State University

Category Detection Using Hierarchical Mean Shift
Pavan Vatturi Oregon State University; Weng-Keen Wong* Oregon State University

Causality Quantification and Its Applications: Structuring and Modeling of Multivariate Time Series
Takashi Shibuya* The University of Tokyo; Tatsuya Harada The University of Tokyo; Yasuo Kuniyoshi The University of Tokyo

Characteristic Relational Patterns
Arne Koopman* Universiteit Utrecht; Arno Siebes Universiteit Utrecht

Classification of Software Behaviors for Failure Detection: A Discriminative Pattern Mining Approach
David Lo Singapore Management University; Hong Cheng* Chinese University of HongKong; Jiawei Han University of Illinois at Urbana-Champaign; Siau-Cheng Khoo National University of Singapore; Chengnian Sun National University of Singapore

Co-Clustering on Manifolds
Quanquan Gu* Tsinghua University; Jie Zhou Tsinghua University

CoCo: Coding Cost for Parameter-free Outlier Detection
Christian Bohm University of Munich; Katrin Haegler University of Munich; Nikola Muller Max Planck Institute of Biochemistry Martinsried Germany; Claudia Plant* Technische Universitat Munchen

Co-evolution of Social and Affiliation Networks
Hossam Sharara* University of Maryland; Elena Zheleva University of Maryland College Park; Lise Getoor University of Maryland

Collaborative Filtering with Temporal Dynamics
Yehuda Koren* Yahoo! Research

Collective Annotation of Wikipedia Entities in Web Text
Sayali Kulkarni IIT Bombay; Amit Singh IIT Bombay; Ganesh Ramakrishnan IIT Bombay; Soumen Chakrabarti* IIT Bombay

Collusion-Resistant Anonymous Data Collection Method
Mafruz Zaman Ashrafi* Institute For Infocomm Researc; See-Kiong Ng Institute for Infocomm Research

Combining Link and Content for Community Detection: A Discriminative Approach
Tianbao Yang* Michigan State University; Rong Jin Michigan State University; Yun Chi NEC Laboratories America; Shenghuo Zhu NEC Laboratories America Inc.

Connections between the Lines: Augmenting Social Networks with Text
Jonathan Chang* Princeton University; Jordan Boyd-Graber Princeton University; David Blei Princeton University

Consensus Group Based Stable Feature Selection
Lei Yu* Binghamton University; Steven Loscalzo SUNY Binghamton; Chris Ding University of Texas at Arlington

Constant-Factor Approximation Algorithms for Identifying Dynamic Communities
Chayant Tantipathananandh* UIC; Tanya Berger-Wolf UIC

Constrained Optimization for Validation-Guided Conditional Random Field Learning
Minmin Chen ; Yixin Chen* Washington University in St. L

Correlated Itemset Mining in ROC Space: A Constraint Programming Approach
Siegfried Nijssen* Leuven University; Tias Guns Katholieke Universiteit Leuven; Luc De Raedt Katholieke Universiteit Leuven

CP-Summary: A Concise Representation for Browsing Frequent Itemsets
Ardian Poernomo* Nanyang Technological Universi; Vivekanand Gopalkrishnan Nanyang Technological Universi

Detection of Unique Temporal Segments by Information Theoretic Meta-clustering
Shin Ando* Gunma University; Einoshin Suzuki

Differentially-Private Recommender Systems
Frank McSherry* Microsoft Research; Ilya Mironov Microsoft Research

DOULION: Counting Triangles in Massive Graphs with a Coin
Charalampos Tsourakakis* Carnegie Mellon University; U Kang Carnegie Mellon University; Gary Miller Carnegie Mellon University; Christos Faloutsos CMU

Drosophila Gene Expression Pattern Annotation Using Sparse Features and Term-term Interactions
Shuiwang Ji* Arizona State University; Lei Yuan Arizona State University; Ying-Xin Li Nanjing University; Zhi-Hua Zhou Nanjing University; Sudhir Kumar ; Jieping Ye Arizona State University

DynaMMo: Mining and Summarization of Coevolving Sequences with Missing Values
Lei Li* Carnegie Mellon University; Jim McCann Carnegie Mellon University; Nancy Pollard Carnegie Mellon University; Christos Faloutsos CMU

Effective Multi-Label Active Learning for Text Classification
Bishan Yang* Peking University; JianTao Sun ; Zheng Chen

Efficient Anomaly Monitoring Over Moving Object Trajectory Streams
Lei Chen* HKUST; Ada Fu Chinese University of Hong Kong; Yingyi Bu CUHK

Efficient Influence Maximization in Social Networks
Wei Chen* Microsoft Research Asia; Yajun Wang Microsoft Research Asia; Siyu Yang Tsinghua University

Efficient Methods for Topic Model Inference on Streaming Document Collections
Limin Yao* University of Massachusetts Am; David Mimno University of Massachusetts Amherst; Andrew McCallum University of Massachusetts Amherst

Efficiently Learning the Accuracy of Labeling Sources for Selective Sampling
Pinar Donmez* Carnegie Mellon University; Jaime Carbonell Carnegie Mellon University; Jeff Schneider Carnegie Mellon University

Exploiting Wikipedia as External Knowledge for Document Clustering
Tony Hu* Drexel University; Xiaodan Zhang Drexel University; Caimei Lu Drexel University; E.K Park University of Missouri at Kansas City; Xiaohua Zhou Drexel University

Exploring Social Tagging Graph for Web Object Classification
Zhijun Yin* University of Illinois; Rui Li ; Qiaozhu Mei ; Jiawei Han University of Illinois at Urbana-Champaign

Extracting Discriminative Concepts for Domain Adaptation in Text Mining
Bo Chen* CUHK; Wai Lam CUHK; Ivor Tsang NTU; Tak-lam Wong CUHK

Fast Approximate Spectral Clustering
Donghui Yan University of California Berkeley; Ling Huang* Intel Research; Michael Jordan University of California Berkeley

Feature Shaping for Linear SVM Classifiers
George Forman* Hewlett-Packard Labs; Martin Scholz HP Labs; Shyamsundar Rajaram Hewlett-Packard

Finding a Team of Experts in Social Networks
Theodoros Lappas Univ of California Riverside; Kun Liu IBM Almaden; Evimaria Terzi* IBM Almaden

Frequent Pattern Mining with Uncertain Data
Charu Aggarwal* IBM T J Watson Research Center; Yan Li Tsinghua University; Jianyong Wang Tsinghua University; Jing Wang New York University

Genre-based Decomposition of Email Class Noise
Aleksander Kolcz* Microsoft Live Labs; Gordon Cormack University of Waterloo

Grouped Graphical Granger Modeling Methods for Temporal Causal Modeling
Aurelie Lozano* IBM Research; Naoki Abe IBM T J Watson Research Center; Yan Liu IBM Research; Saharon Rosset Tel-Aviv University, Israel

Heterogeneous Source Consensus Learning via Decision Propagation and Negotiation
Jing Gao* UIUC; Wei Fan IBM T.J.Watson; Yizhou Sun ; Jiawei Han University of Illinois at Urbana-Champaign

Improving Clustering Stability with Combinatorial MRFs
Ron Bekkerman* HP Labs; Martin Scholz HP Labs; Krishnamurthy Viswanathan HP Labs

Improving Data Mining Utility with Projective Sampling
Mark Last* BGU

Information Theoretic Regularization for Semi-Supervised Boosting
Lei Zheng Wright State University; Shaojun Wang* Wright State University; Yan Liu Wright State University; Chi-Hoon Lee Yahoo

Issues in Evaluation of Stream Learning Algorithms
Joao Gama* University of Porto; Raquel Sebastiao LIAAD; Pedro Rodrigues LIAAD

Large Human Communication Networks: Patterns and a Utility-Driven Generator
Nan Du* CMU; Christos Faloutsos CMU; Bai Wang ; Leman Akoglu Carnegie Mellon University

Large-Scale Behavioral Targeting
Ye Chen* Yahoo! Labs; Dmitry Pavlov Yahoo! Labs; John Canny Computer Science Division University of California Berkeley

Large-Scale Graph Mining Using Backbone Refinement Classes
Andreas Maunz* Freiburg Center for Data Analy; Christoph Helma in-silico toxicology; Stefan Kramer Institut fur Informatik Technische Universitat Munchen

Large-Scale Sparse Logistic Regression
Jun Liu* Arizona State University; Jianhui Chen ASU; Jieping Ye Arizona State University

Learning Optimal Ranking with Tensor Factorization for Tag Recommendation
Steffen Rendle* University of Hildesheim; Leandro Marinho University of Hildesheim; Alexandros Nanopoulos University of Hildesheim; Lars Schmidt-Thieme University of Hildesheim

Learning Patterns in the Dynamics of Biological Networks
Chang hun You* Washington State University; Lawrence Holder Washington State University; Diane Cook Washington State University

Learning with a Nonexhaustive Training Dataset
Murat Dundar* IUPUI; Arun Bhunia Purdue University; Daniel Hirleman Purdue University; Paul Robinson ; Bartek Rajwa Purdue University

Learning, Indexing, and Diagnosing Network Faults
Ting Wang* Georgia Tech; Mudhakar Srivatsa IBM T.J. Watson Research Cente; Dakshi Agrawal ; Ling Liu

Measuring the Effects of Preprocessing Decisions and Network Forces in Dynamic Network Analysis
Jerry Scripps* Michigan State University; Pang-Ning Tan Michigan State University; Abdol-Hossein Esfahanian Michigan State University

Meme-tracking and the Dynamics of the News Cycle
Jure Leskovec* Cornell University; Lars Backstrom Cornell University; Jon Kleinberg Cornell University

MetaFac: Community Discovery via Relational Hypergraph Factorization
Yu-Ru Lin* Arizona State University; Jimeng Sun IBM; Paul Castro IBM; Ravi Konuru IBM; Hari Sundaram ; Aisling Kelliher Arizona State University

Mind the Gaps: Weighting the Unknown in Large-Scale One-Class Collaborative Filtering
Rong Pan* HP Labs; Martin Scholz HP Labs

Mining Broad Latent Query Aspects from Search Sessions
Xuanhui Wang UIUC; Deepayan Chakrabarti Yahoo! Research; Kunal Punera* Yahoo! Research

Mining Discrete Patterns via Binary Matrix Factorization
Bao-Hong Shen Arizona State University; Shuiwang Ji Arizona State University; Jieping Ye* Arizona State University

Mining for the Most Certain Predictions from Dyadic Data
Meghana Deodhar* University of Texas at Austin; Joydeep Ghosh The University of Texas at Austin

Mining Rich Session Context to Improve Web Search
Guangyu Zhu* University of Maryland College Park; Gilad Mishne Yahoo! Search and Advertising Sciences

Mining Social Networks for Personalized Email Prioritization
Shinjae Yoo* Carnegie Mellon University; Yiming Yang ; Frank Lin ; Il-Chul Moon

Characterizing Individual Communication Patterns
Dean Malmgren* Northwestern University; Jake Hofman Yahoo! Research; Luis Amaral Northwestern University; Duncan Watts Yahoo! Research

Multi-focal Learning and Its Application to Customer Service Support
Yong Ge* Rutgers University; Hui Xiong Rutgers University; Wenjun Zhou Rutgers University; Ramendra Sahoo IBM T.J. Watson Research Center; Xiaofeng Gao ; Weili Wu

Name-Ethnicity Classification from Open Sources
Anurag Ambekar Stony Brook University; Charles Ward Stony Brook University; Jahangir Mohammed Stony Brook University; Swapna Male Stony Brook University; Steven Skiena* Stony Brook University

New ensemble methods for evolving data streams
Albert Bifet* Universitat Politecnica de Cat; Geoff Holmes University of Waikato; Bernhard Pfahringer University of Waikato Hamilton; Richard Kirkby University of Waikato; Ricard Gavalda Universitat Politecnica de Catalunya

On Burstiness-aware Search for Document Sequences
Theodoros Lappas* Univ of California Riverside; Benjamin Arai Univ of California Riverside; Dimitrios Gunopulos UCR NKUA; Manolis Platakis ; Dimitrios Kotsakos

On Compressing Social Networks
Flavio Chierichetti ; Ravi Kumar* Yahoo; Silvio Lattanzi ; Michael Mitzenmacher ; Alessandro Panconesi ; Prabhakar Raghavan

On the Tradeoff Between Privacy and Utility in Data Publishing
Tiancheng Li* Purdue University; Ninghui Li Purdue University

Optimizing Web Traffic via the Media Scheduling Problem
Lars Backstrom* Cornell University; Jon Kleinberg Cornell University; Ravi Kumar Yahoo

Parallel Community Detection on Large Networks with Propinquity Dynamics
Yuzhou Zhang* Tsinghua University; Jianyong Wang Tsinghua University; Yi Wang Google Beijing Research; Lizhu Zhou Tsinghua University

Primal Sparse Max-Margin Markov Networks
Jun Zhu* Tsinghua University; Eric Xing Carnegie Mellon University; Bo Zhang Tsinghua University

Probabilistic Frequent Itemset Mining in Uncertain Databases
Matthias Renz* Ludwig-Maximilians-Universitat; Thomas Bernecker Ludwig-Maximilians-Universitat Munchen; Florian Verhein Ludwig-Maximilians-Universitat Munchen; Andreas Zuefle Ludwig-Maximilians-Universitat Munchen; Hans-Peter Kriegel University of Munich

Quantification and Semi-supervised Classification Methods for Handling Changes in Class Distribution
Jack Chongjie Xue* Fordham University; Gary Weiss Fordham University

Ranking-Based Clustering of Heterogeneous Information Networks with Star Network Schema
Yizhou Sun* UIUC; Yintao Yu UIUC; Jiawei Han University of Illinois at Urbana-Champaign

Regression based Latent Factor Models
Deepak Agarwal* Yahoo!; Bee-Chung Chen Yahoo!

Regret-based Online Ranking for a Growing Digital Library
Erick Delage* Stanford University

Relational Learning via Latent Social Dimensions
Lei Tang* Arizona State University; Huan Liu

Scalable Graph Clustering Using Flows: Applications to Community Discovery
Venu Satuluri The Ohio State University; Srinivasan Parthasarathy* Ohio State University

Scalable Pseudo-Likelihood Estimation in Hybrid Random Fields
Antonino Freno* University of Siena; Edmondo Trentin ; Marco Gori

Social Influence Analysis in Large-scale Networks
Jie Tang* Tsinghua University; Jimeng Sun IBM TJ Watson Research Center; Chi Wang Tsinghua Univ.

Spatial-temporal causal modeling for climate change attribution
Aurelie Lozano* IBM Research; Hongfei Li IBM Research; Alexandru Niculescu-Mizil IBM Research; Yan Liu IBM Research; Claudia Perlich IBM USA; Jonathan Hosking IBM Research; Naoki Abe IBM T J Watson Research Center

Structured Correspondence Topic Models for Mining Captioned Figures in Biological Literature
Amr Ahmed* Carnegie Mellon University; Eric Xing Carnegie Mellon University; William Cohen Carnegie Mellon University; Robert Murphy Carnegie Mellon University

TANGENT: A Novel, “Surprise-Me”, Recommendation Algorithm
Kensuke Onuma Sony Corporation; Hanghang Tong* CMU; Christos Faloutsos CMU

Tell Me Something I Don’t Know: Randomization Strategies for Iterative Data Mining
Sami Hanhijarvi* Helsinki Univ. of Technology; Markus Ojala Helsinki University of Technology; Niko Vuokko ; Kai Puolamaki ; Nikolaj Tatti Helsinki Univ. of Technology; Heikki Mannila

Temporal Mining for Interactive Workflow Data Analysis
Michele Berlingerio* KDD Lab Pisa ISTI C.N.R.; Fosca Giannotti ISTI CNR; Mirco Nanni KDD Lab – ISTI – CNR; Fabio Pinelli Isti – CNR – Italy Pisa

The Offset Tree for Learning with Partial Labels
John Langford* ; Alina Beygelzimer IBM

Time Series Shapelets: A New Primitive for Data Mining
Lexiang Ye* UC Riverside; Eamonn Keogh UC Riverside

Toward Autonomic Grids: Analyzing the Job Flow with Affinity Streaming
Xiangliang Zhang* INRIA; Cyril Furtlehner ; Julien Perez ; Cecile Germain-Renaud Universite Paris Sud; Michele Sebag Universite Paris-Sud

Towards Efficient Mining of Proportional Fault-Tolerant Frequent Itemsets
Ardian Poernomo* Nanyang Technological Universi; Vivekanand Gopalkrishnan Nanyang Technological Universi

TrustWalker: A Random Walk Model for Combining Trust-based and Item-based Recommendation
Mohsen Jamali* Simon Fraser University; Martin Ester Simon Fraser University

Turning Down the Noise in the Blogosphere
Khalid El-Arini, Carnegie Mellon University; Gaurav Veda; Dafna Shahaf; Carlos Guestrin

User Grouping Behavior in Online Forums
Xiaolin Shi* University of Michigan; Jun Zhu Tsinghua University; Rui Cai Microsoft Research; Lei Zhang Microsoft Research Asia

Using Graph-based Metrics with Empirical Risk Minimization to Speed Up Active Learning on Networked Data
Sofus Macskassy* Fetch Technologies Inc.

WhereNext: a Location Predictor on Trajectory Pattern Mining
Anna Monreale Isti – CNR – Italy Pisa; Fabio Pinelli Isti – CNR – Italy Pisa; Roberto Trasarti* Isti – CNR – Italy Pisa; Fosca Giannotti ISTI CNR

KDD2009: Tutorials

Here are some great tutorials worth attending at KDD 2009.

Source – http://www.sigkdd.org/kdd2009/tutorials.html#t1

Tutorials

All tutorials will be held on Sunday, June 28th.

Abstracts

T1 – Statistical Challenges in Computational Advertising

Deepayan Chakrabarti, Deepak Agarwal

Many organizations now devote significant fractions of their advertising/outreach budgets to online advertising; ad networks like Yahoo!, Google, and MSN have responded by constructing new kinds of economic models and performing the fundamental task of matching the most relevant ads (selected from a large inventory) to a (query, user) pair in a given context. Nearly all of the challenges that arise are substantially data- or model-driven (or both). Computational Advertising is a relatively new scientific sub-discipline at the intersection of large-scale search and text analysis, information retrieval, statistical modeling, machine learning, optimization and microeconomics that addresses this match-making problem and provides unprecedented opportunities to data miners.

Topics covered include a comprehensive introduction to several advertising forms (sponsored search, contextual advertising, display advertising), revenue models (pay-per-click, pay-per-view, pay-per-conversion) and the data mining challenges involved, along with an overview of state-of-the-art techniques in the area and a detailed discussion of open problems. We will cover information retrieval techniques and their limitations, the data mining challenges involved in performing ad matching through clickstream data, and the challenging optimization issues that arise in display advertising. In particular, we will cover statistical modeling techniques for clickstream data and explore/exploit schemes, based on multi-armed bandits, for running online experiments that improve long-term performance. We also discuss the close relationship between techniques used in recommender systems and our problem, but indicate several additional issues that need to be addressed before they become routine in computational advertising.

We will only assume basic knowledge of statistical methods; no prior knowledge of online advertising is required. In fact, the first hour, which provides an introduction to the area, would be appropriate for all registered attendees of KDD 2009. The second half would require familiarity with basic concepts like regression and probability distributions, and an appreciation of the issues involved in fitting statistical models to large scale applications. No prior knowledge of multi-armed bandits is assumed.

T2 – How to do good research, get it published in SIGKDD and get it cited!

Eamonn Keogh

While SIGKDD has traditionally enjoyed an unusually high quality of reviewing, there is no doubt that publishing in SIGKDD (and other high quality data mining conferences) is very challenging. This is especially true for young faculty, grad students whose primary advisor is not an experienced SIGKDD author, or people from outside the community (e.g., a biologist or mathematician who has a result that might greatly interest the data mining community).

In this tutorial Dr. Keogh will demonstrate some simple ideas to enhance the probability of success in getting your paper published in a top data mining conference; and after the work is published, getting it highly cited.

These tips and tricks are based on 12 years of experience as a SIGKDD author and reviewer, and wisdom solicited from many of the most prolific data mining researchers/reviewers.

Topics covered in the tutorial include:

  • Finding the right problems to work on (80% of the battle).
  • Don’t summarize, sell! Writing abstracts that put the reviewer on your side from the start.
  • Getting or creating the perfect dataset.
  • Experiments that tell a story.
  • Making effective and interesting figures.
  • Getting the reviewers on your side.
  • The top-ten avoidable reasons why papers get rejected from SIGKDD.
  • Three simple tricks to increase the number of citations to your work.

While Dr. Keogh does not claim to have a “magic bullet” for publishing in SIGKDD, his significant track record of publishing in top data mining venues, combined with extensive (and deliberately uncredited) experience in helping younger researchers “break in” to SIGKDD, has placed him in a unique position to share useful and actionable advice.

While writing this tutorial, Dr. Keogh sought and received advice from many respected data mining researchers; their advice is incorporated into this tutorial.

T3 – Large Graph-Mining: Power Tools and a Practitioner’s Guide

Christos Faloutsos, Gary Miller, Charalampos (Babis) Tsourakakis

Numerous real-world datasets are in matrix form, thus matrix algebra, linear and multilinear, provides important algorithmic tools for analyzing them. The main type of dataset of interest in this tutorial is the graph. Important datasets modeled as graphs include the Internet, the Web, social networks (e.g., Facebook, LinkedIn), computer networks, biological networks and many more.

We will discuss how we represent a graph as a matrix (adjacency matrix, Laplacian) and the important properties of those representations. We will then show how these properties are used in several important problems, including node importance via random walks (Pagerank), community detection (METIS, Cheeger inequality), graph isomorphism and graph similarity. Important dimensionality reduction techniques (SVD and random projections) will be discussed in the context of graph mining problems.
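
As a rough illustration of the matrix view described above, here is a small sketch that builds a graph Laplacian (degree matrix minus adjacency matrix) and scores node importance with a PageRank-style power iteration. It is not the presenters' code: the function names, the dense list-of-lists representation and the damping factor of 0.85 are assumptions chosen to keep the example self-contained, and a real large-graph tool would use sparse matrices.

```python
def laplacian(adj):
    """Graph Laplacian L = D - A for an undirected graph given as a 0/1 matrix."""
    n = len(adj)
    return [[(sum(adj[i]) if i == j else 0) - adj[i][j] for j in range(n)]
            for i in range(n)]

def pagerank(adj, damping=0.85, iters=100):
    """Random-walk node importance by power iteration; adj[i][j] = 1 means an edge i -> j."""
    n = len(adj)
    out_deg = [sum(row) for row in adj]
    rank = [1.0 / n] * n
    for _ in range(iters):
        new = [(1.0 - damping) / n] * n  # teleportation share
        for i in range(n):
            if out_deg[i] == 0:
                # Dangling node: spread its rank uniformly over all nodes.
                for j in range(n):
                    new[j] += damping * rank[i] / n
            else:
                for j in range(n):
                    if adj[i][j]:
                        new[j] += damping * rank[i] / out_deg[i]
        rank = new
    return rank

# Hypothetical 3-node web: pages 0 and 1 link to page 2, page 2 links back to page 0.
links = [[0, 0, 1],
         [0, 0, 1],
         [1, 0, 0]]
print(pagerank(links))
```

Page 2 collects links from both other pages and so ends up with the highest score, while page 1, which nobody links to, keeps only the teleportation share.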

Furthermore, we provide a survey of the work on the epidemic threshold, node proximity and center-piece subgraphs. State-of-the-art graph mining tools for analyzing time evolving graphs will also be presented. Throughout the tutorial, patterns in static and time evolving, weighted and unweighted real-world graphs will be presented.

The target audience is data mining professionals who wish to know the most important matrix algebra tools, their applications in large graph mining and the theory behind them.
Prerequisites: Computer science background (B.Sc or equivalent); familiarity with undergraduate linear algebra.
Demos will be presented.

http://www.cs.cmu.edu/~christos/TALKS/09-KDD-tutorial/

T4 – Planning, Running, and Analyzing Controlled Experiments on the Web

Ronny Kohavi, Roger Longbotham, John Quarto-vonTivadar

The web provides an unprecedented opportunity to evaluate ideas quickly using controlled experiments, also called randomized experiments, A/B tests (and their generalizations), split tests, and MultiVariable Tests (MVT). Controlled experiments embody the best scientific design for establishing a causal relationship between changes and their influence on user-observable behavior. Data Mining and Knowledge Discovery techniques can then be used to analyze the data from such experiments. The tutorial will provide a survey and practical guide to running controlled experiments based on the recently published survey article in the Data Mining and Knowledge Discovery Journal, co-authored by two of the tutorial co-presenters (http://exp-platform.com/dmkd_survey.aspx), and on the book “Always Be Testing”, co-authored by the third tutorial co-presenter (http://www.amazon.com/Always-Be-Testing-Complete-Optimizer/dp/0470290633). The book includes use of industry tools, such as Google Website Optimizer, and recently ranked #2 on Amazon’s sales rank for computers/e-commerce books. The tutorial includes multiple real-world examples of actual controlled experiments (many with surprising results), a review of the theory and the statistics used to plan and analyze such experiments, and a discussion of the limitations and pitfalls that might face experimenters. Demos will be shown of some tools that support controlled experiments.

A video of a related talk can be found on the videolectures website:
http://videolectures.net/cikm08_kohavi_pgtce/
The shorter version of the DMKD survey paper is now part of the class reading for several classes at Stanford University (CS147, CS376), UCSD (CSE 291), and at the University of Washington (CSEP 510).

Topics covered include:

  1. Why online experimentation using controlled experiments is important
  2. What you need in order to conduct a valid experiment
  3. Planning and Analysis of basic experiments
  4. Benefits and limitations of experimentation
  5. Multivariable experiments: setup, analysis, interpretation, and interactions
  6. Architectures
  7. Using online free and low-cost software services (demos)
  8. Challenges and advanced statistical concepts for experiments
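
For readers wondering what the analysis of a basic experiment might look like, here is a minimal sketch of a two-proportion z-test comparing conversion rates between a control (A) and a treatment (B). This is one standard textbook approach, not necessarily the method the presenters use; the function name and the visitor counts are made up for illustration.

```python
import math

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates B - A.

    Assumes at least one conversion and one non-conversion overall,
    otherwise the pooled standard error is zero.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical experiment: 10,000 visitors per variant, 200 vs 260 conversions.
z, p = two_proportion_ztest(200, 10000, 260, 10000)
print(round(z, 2), round(p, 4))
```

A small p-value only says the difference is unlikely to be noise; whether the experiment itself was valid (proper randomization, enough traffic, no peeking) is a separate question.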

T5 – Predictive Modelling in the Wild: Success Factors in Data Mining Competitions and Real-World Applications

Saharon Rosset, Claudia Perlich

In this tutorial, we give our perspective on the keys to success in application of predictive modeling to competitions like KDD Cup and real-life business intelligence projects. We argue that these two modes of applying predictive modeling share many similarities, but have also some important differences. We discuss the main success factors in predictive modeling: domain understanding, statistical acumen, and appropriate algorithmic approaches. We describe our relevant experiences in the context of three recent predictive modeling competitions where our team has had success (KDD Cup 2007 and 2008 and INFORMS DM challenge 2008) and two case studies of projects we have led at IBM Research. We also survey some of the recurring challenges and complexities in practical predictive modeling applications. One key issue is information leakage, and we discuss its definition, influence, detection and avoidance. We consider leakage to be the silent killer of many predictive modeling projects, and we demonstrate its impact on the competitions, and discuss the challenges in addressing it in the real-life projects. Other challenges include framing real-life modeling objectives into predictive modeling, and usefully applying relational learning concepts when modeling “real-life” complex, relational datasets.
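
One crude way to look for the kind of leakage mentioned above is to check whether any single feature separates the classes suspiciously well on its own. The sketch below computes a per-feature AUC and flags anything close to perfect. It is only an illustration under assumed names and an assumed threshold of 0.95, not the detection method the presenters describe, and a flagged feature still needs a human to judge whether it is a leak or a legitimately strong predictor.

```python
def auc(scores, labels):
    """Area under the ROC curve: chance that a positive outscores a negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def suspicious_features(columns, labels, threshold=0.95):
    """Flag features whose single-feature AUC (in either direction) reaches the threshold."""
    flagged = {}
    for name, values in columns.items():
        a = auc(values, labels)
        strength = max(a, 1 - a)
        if strength >= threshold:
            flagged[name] = strength
    return flagged

# Hypothetical data: record_id happens to encode the outcome, age does not.
labels = [0, 0, 0, 1, 1, 1]
columns = {
    "record_id": [1, 2, 3, 101, 102, 103],
    "age": [30, 50, 40, 35, 45, 55],
}
print(suspicious_features(columns, labels))
```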

T6 – New Directions in Data Quality Mining

Laure Berti-Equille (Univ. of Rennes 1, France), Tamraparni Dasu (AT&T Labs – Research)

As data types and data structures change to keep up with evolving technologies and applications, data quality problems too have evolved and become more complex and interwoven. Data streams, web logs, Wikipedias, biomedical applications, video streams and social networking websites generate a mind boggling variety of data types. However, data quality mining, the use of data mining to manage, measure and improve data quality, has focused mostly on addressing each category of data glitch separately as a static entity.

In this tutorial we provide a technical, KDD-focused account of recent research and developments in discovering and treating complex data anomalies in a broad range of data. In particular, we highlight new directions in data quality mining: (a) the applicability and effectiveness of the methodologies for various data types such as structured, semi-structured and stream data, (b) the detection of concomitant data glitches and patterns like the occurrence of outliers in data with missing values and duplicates, or the co-occurrence of missing values and duplicates, (c) the design of sequential approaches to data quality mining, such as workflows composed of a sequence of tasks for data quality exploration and analysis. We give an overview of past work, introduce current research in this area including recent methods and techniques for discovering complex patterns of anomalies (e.g., multivariate outliers, disguised missing values, combination of different types of noise), and highlight new directions and open problems in data quality mining.

The tutorial includes extensive case studies and practical examples of mining data quality problems for a variety of large datasets and data types e.g., relational, XML, data streams. We discuss illustrative examples drawn from a variety of domains like CRM, networking, biology, and mobility.

T7 – Event Detection

Daniel Neill, Weng-Keen Wong

A common task in surveillance, scientific discovery and data cleaning involves monitoring routinely collected data for anomalous events. Detecting events in univariate time series data can be effectively accomplished using well-established techniques such as Box-Jenkins models, regression, and statistical quality control methods. In recent years, however, routinely collected data has become increasingly complex. At each time step, the data collected can consist of multivariate vectors and/or be spatial in nature. For instance, healthcare data used in disease surveillance often consists of multivariate patient records or spatially distributed pharmaceutical sales data. Consequently, new event detection algorithms have been developed that not only consider temporal information but also detect spatial patterns and integrate information from multiple spatio-temporal data streams.

This tutorial will present algorithms for event detection, with a focus on algorithms dealing with multivariate temporal and spatio-temporal data. We will introduce event detection by providing a general formulation of the event detection problem and describing its unique challenges. In the first half of the tutorial, we will cover algorithms for detecting events in both univariate and multivariate temporal data. The second half will present methods for detecting events in spatio-temporal data, including several recently proposed multivariate approaches.
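
For the simplest case mentioned above, a single time series monitored with a statistical quality control method, a one-sided CUSUM chart is a common choice. The sketch below is a generic version of that idea, not the presenters' algorithm; the function name, the target level, the slack and the threshold are assumptions for the example.

```python
def cusum_alarms(series, target, slack, threshold):
    """One-sided upper CUSUM: return the time steps where an upward shift is flagged.

    target is the expected in-control level, slack the drift tolerated per step,
    threshold the cumulative excess that triggers an alarm.
    """
    s = 0.0
    alarms = []
    for t, x in enumerate(series):
        # Accumulate evidence above target + slack; never go below zero.
        s = max(0.0, s + (x - target - slack))
        if s > threshold:
            alarms.append(t)
            s = 0.0  # restart accumulation after raising an alarm
    return alarms

# Hypothetical daily counts that jump from about 10 to about 15 on day 5.
counts = [10, 11, 9, 10, 10, 15, 16, 15, 16]
print(cusum_alarms(counts, target=10, slack=1, threshold=8))  # [6, 8]
```

The chart reacts one day after the jump rather than immediately, which is the usual trade-off: a higher threshold means fewer false alarms but slower detection.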

T8 – Advances in Mining the Web

Myra Spiliopoulou, Osmar Zaiane, Bamshad Mobasher, Olfa Nasraoui

The Web has changed our way of life and the Web 2.0 has changed our way of perceiving and using the Web. Data analysis is now required in a plethora of applications that aim to enrich the experience of people with the Web. We first discuss data mining for the social Web. We elaborate on social network analysis and focus on community mining, then go over to recommendation engines and personalization. We discuss the challenges that emerged through the shift from the traditional Web to Web 2.0. We then focus on two issues – the need to protect Web applications from manipulation and the need to make them adaptive towards change. We first discuss manipulations/attacks in recommender systems and present counter-measures. We then elaborate on how changes/concept drifts can be dealt with in applications that analyze clickstream data, monitor topics in news and blogs, or monitor communities and their evolution.

This tutorial is aimed at novice researchers that have general background in data mining and are interested in understanding the potential and challenges pertinent to the social Web. The participants should have a basic understanding of recommendation engines, personalization and text modeling for mining (vector space models). They will learn how basic techniques are extended and new techniques are designed for mining the Web, especially the social Web. They will also learn about issues that are still open and require further research – research that the tutorial participants may decide to perform themselves.

OUTLINE
PART I: Mining the Social Web [Osmar Zaiane]
PART II: Recommendations and Personalization in the Social Web [Bamshad Mobasher]
PART III: Dealing with Evolution in the Web [Myra Spiliopoulou]
PART IV: Mining Web Data Streams [Olfa Nasraoui]

T9 – Real World Text Mining

Ronen Feldman, Lyle Ungar

The proliferation of documents available on the Web and on corporate intranets is driving a new wave of text mining research and application. Earlier research addressed extraction of information from relatively small collections of well-structured documents such as newswire or scientific publications. Text mining from other corpora such as the web requires new techniques drawn from data mining, machine learning, NLP and IR. Text mining requires preprocessing document collections (text categorization, information extraction, term extraction), storage of the intermediate representations, analysis of these intermediate representations (distribution analysis, clustering, trend analysis, association rules, etc.), and visualization of the results. In this tutorial we will present the algorithms and methods used to build text mining systems including pre-processing techniques, supervised learning (e.g., CRF), entity resolution, relationship extraction, unsupervised learning and machine reading.
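
As a small taste of the preprocessing and intermediate-representation steps listed above, here is a sketch that tokenizes a few documents and turns them into TF-IDF weighted term vectors. It is a bare-bones illustration, not the instructors' system: the function name, the regular-expression tokenizer and the plain log(N/df) weighting are assumptions, and real systems add stemming, stop-word handling and much more.

```python
import math
import re
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [re.findall(r"[a-z0-9]+", d.lower()) for d in docs]
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    n = len(docs)
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: (c / len(tokens)) * math.log(n / df[t])
                        for t, c in tf.items()})
    return vectors

# Hypothetical mini-corpus of informal text.
docs = ["text mining of blogs", "text mining of reviews"]
for vec in tfidf(docs):
    print(vec)
```

Terms that appear in every document get a weight of zero, so what remains are the words that actually distinguish one document from another.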

The tutorial will cover the state of the art in this rapidly growing area of research, including recent advances in unsupervised methods for extracting facts from text and methods used for web-scale mining. We will also present several real world applications of text mining. Special emphasis will be given to lessons learned from years of experience in developing real world text mining systems, including how to handle informal texts such as blogs and user reviews and how to build scalable systems.

The instructors are Ronen Feldman and Lyle Ungar. Ronen is an Associate Professor of Information Systems at the Business School of the Hebrew University in Jerusalem. He is the founder of the ClearForest text mining corporation, and the author of the book “The Text Mining Handbook” published by Cambridge University Press in 2007. Lyle is an Associate Professor of Computer and Information Science at the University of Pennsylvania. He recently returned from a sabbatical at Google, where he and a team built what is probably the world’s largest named entity recognition system.

Interview: Gary Cokins, SAS Institute

Here is an interview with Gary Cokins, a well-respected veteran of the Business Intelligence industry working with the SAS Institute. Gary has just launched his sixth book (wow!) and, gentleman that he is, he agreed to answer these questions in the middle of his constant traveling. Gary is the expert on performance measurement, so we decided to quiz him a bit on this.

CIOs need to shift their mindset from a technical one to a managerial one. - Gary Cokins, SAS Institute

Ajay- Gary, please describe your career journey from a freshman in college to your position today. What are the key items of advice that you would give to high school students to encourage taking science careers in this recession?

COKINS: I have been very fortunate. After receiving my MBA in 1974 from the Northwestern University Kellogg Graduate School of Management, I worked in industry for ten years. I had the luck of being a financial controller at a Fortune 100 corporation division and then becoming operations manager at the same location. I then had to “eat the financial data I was serving,” and it was a true wake-up call – much of the information was at best useless and at worst misleading. Later with Deloitte I was trained on the theory of constraints (TOC) methodology, which indicted cost accounting as “enemy number one of productivity.” I learned about the shortcomings in how accountants make assumptions.

In 1988, when Professor Kaplan struck an exclusive relationship with KPMG Peat Marwick, I was recruited to KPMG with about three others with operational backgrounds similar to mine to implement activity based cost management (ABC/M) systems using an ABC/M modeling software tool. I learned from experience. Four years later, my mentor Bob Bonsack, who had moved on from Deloitte to Electronic Data Systems (EDS), recruited me to head EDS’ cost management consulting. With about fifteen consultants, I was exposed to over a hundred implementations of cost systems. It was there that I experimented with creating a two day “ABC/M rapid prototyping” method that was radically different from the multi-month approach. By starting with a quick vision of what their ABC/M system would look like, companies could iteratively re-model to the level of detail, granularity, and accuracy needed to support analysis and decisions. It did not initially require a huge system, which was why some ABC/M system implementations got into trouble. My major self-realization is that costing is accomplished by modeling cost consumption relationships – an insight that continues to evade many accountants.

When I began to see the application of strategy maps and the balanced scorecard, more light bulbs went off in my brain. I then began truly seeing the organization as a “system” where all the performance improvement methodologies and core processes are inter-connected. I realized that the technologies are no longer the impediment because they are proven. The obstacle is the organization’s thinking – and the mindset of senior management who is presumably doing the leading.

My advice to high school students: take your studies more seriously than you even imagine, spend less time text-messaging everyone you know, and focus on the more meaningful relationships. They will eventually be your friends rather than just acquaintances. And take math courses!

Ajay- So what exactly do you do at SAS? And name some interesting anecdotes that led to a lot of value as well as fun for both your company and clients. How does Gary spend his daily day at SAS Institute?

COKINS: My primary role with SAS is to create and deliver thought leadership content about Performance Management leveraging business analytics. I present webinars and write articles, blogs, presentations and also books. For the last four years I have averaged visiting roughly 40 international cities where SAS offices are located to present seminars and meet SAS customers to educate them on the concepts and benefits from Performance Management methodologies.

Recent examples of having fun and providing value to organizations involved providing expert advice to the International Monetary Fund (IMF) in Washington DC and the European Patent Office (EPO) in Brussels. The IMF is at the beginning of implementing an activity based cost management (ABC/M) system whereas the EPO is completing their ABC/M system design. Both organizations were seeking tips for success and pitfalls to avoid. One of my major recommendations was to not under-estimate the natural resistance to change of managers and employees. That is, they need to focus much more on getting their buy-in than worrying if the system is perfect. The value to them is realizing that Performance Management methodologies are much more social than technical.

Regarding my daily activities, when I am not traveling, I am mainly reading articles written by other experts or journalists and then translating my relevant takeaways into content that I can educate others with. I also respond to questions and requests both internally within SAS and externally from customers, management consultants, and university faculty.

Ajay- When you were a young employee, what was the toughest challenge that you faced? What was your worst mistake and how did you overcome it? What lessons did you learn from it?

COKINS: In my first few years in business following my university graduation, my toughest challenge was persuading my supervisors, usually older men than I, to accept my new ideas. I have always been a creative thinker, almost a dreamer; and I was not accustomed to the resistance that managers have to innovations, particularly those suggested by young inexperienced employees fresh from their university schooling.

My worst mistake was developing a computer program that automatically suggested treasury cash balance transfers to optimize the corporate cash management system of my first employer, a large Fortune 100 corporation. My computer program was basically replacing the decisions made by the corporate cash manager and part of his job. I overcame this disappointment by learning what needs the corporate cash manager did have and developing a different computer program that solved his needs. With its success, he eventually accepted the first computer program.

My lesson was that one should first understand what people may want rather than trying to impose on them what you think they need without involving them.

Ajay- Looking back on your distinguished career, what project are you proud of the most? What project would you do over again if given the chance?

COKINS: In 1973 I became a financial controller of a large division of another Fortune 100 manufacturer. I created a rolling financial planning and forecast software program, using pre-spreadsheet software from a mainframe (years before personal computers and Excel). The program modeled product line sales forecasts by month and integrated both the income statement and balance sheet. It became a valuable tool for the executive team to suggest and immediately see varying sales levels as a “what if” scenario builder to calculate the different profit and working capital results. The executive team marveled at how analytical software, in contrast to our transactional ERP-like system, could make sense of the complexity of our operations with thousands of products and customers.

Regarding a project that fell short of expectations, I actually did get a chance to do it over again. As a consultant with Deloitte, I led a project designing and implementing an activity based cost management (ABC/M) system using the client’s general ledger accounting software. It took many months, and when finished it was too complex for the client to fully understand. Several years later, with a similar project, I applied a rapid prototyping with iterative re-modeling approach that involved the company’s managers from the first day. (I mentioned this approach in my reply to the first question.) We completed the ABC/M system in just a few weeks, and everyone understood it and also how to interpret the information for analysis and decisions. I have since been a proponent of this type of rapid learning and system design approach.

Ajay- What do people at SAS Institute do for fun when not creating or selling algorithms? How is SAS reaching out to other members of the analytics community in terms of basic science and development?

COKINS: SAS employees are inspired by our CEO, Dr. Jim Goodnight, who founded SAS roughly 35 years ago. Dr. Goodnight loves solving problems of all flavors. For fun, but also part of our jobs, SAS employees search for problems that only computer software can solve.

SAS’ offerings evolve by listening to our customers, who are typically scientists, researchers, and business analysts. Drug development and marketing analysts are examples. Our customers are our “community.” We motivate them, with formal methods of collecting input from them, to share with us enhancements to our future versions of our software.

Ajay- Describe your new book on Performance Management from the point of a beginner. Assume that I am a college student who does not know why I should read it. Then assume that I am a CIO and have little time to read it. What is in it for a CIO?

COKINS: This is the sixth book I have written. My first four books were about activity based cost management (ABC/M) and the last two about Performance Management. What is different about this second book is that it immediately clarifies the confusion and ambiguity about what Performance Management is and is not. It is also written in a humorous and simplified way with lots of analogies and metaphors, such as all of the Performance Management methodologies integrated together like gears in an automobile engine, with a GPS for predictive navigation and dashboards for feedback. Beginners perceive each methodology, such as a balanced scorecard or customer relationship management system, as a stand-alone tool. There is synergy when they are integrated.

CIOs have similar needs. They need to shift their mindset from a technical one to a managerial one. Just a few chapters from this book can help CIOs see the broad picture of how all of their organization’s processes fit together, and how they can be aligned to efficiently execute the ever-adjusting strategy that the executives continuously formulate with operations.


Biography and Contact Information

Gary Cokins, CPIM

(gary.cokins@sas.com; phone 919 531 2012)

http://blogs.sas.com/cokins

Gary Cokins is a global product marketing manager involved with performance management solutions with SAS, a leading provider of performance management and business analytics software headquartered in Cary, North Carolina. Gary is an internationally recognized expert, speaker, and author in advanced cost management and performance improvement systems. Gary received a BS degree with honors in Industrial Engineering/Operations Research from Cornell University in 1971. He received his MBA from Northwestern University’s Kellogg School of Management in 1974.

Gary began his career as a strategic planner with FMC’s Link-Belt Division and then served as Financial Controller and Operations Manager. In 1981 Gary began his management consulting career, first with Deloitte Consulting. Next, with KPMG Peat Marwick, Gary was trained on ABC by Harvard Business School Professors Robert S. Kaplan and Robin Cooper. More recently, Gary headed the National Cost Management Consulting Services for Electronic Data Systems (EDS)/A.T. Kearney.

Gary was the lead author of the acclaimed An ABC Manager’s Primer (ISBN 0-86641-220-4) sponsored by the Institute of Management Accountants (IMA). Gary’s second book, Activity Based Cost Management: Making it Work (ISBN 0-7863-0740-4), was judged by the Harvard Business School Press as “read this book first.” A reviewer for Gary’s third book, Activity Based Cost Management: An Executive’s Guide (ISBN 0-471-44328-X), said, “Gary has the gift to take the concept that many view as complex and reduce it to its simplest terms.” This book was ranked number one in sales volume of 151 similar books on BarnesandNoble.com. Gary has also written Activity Based Cost Management in Government (ISBN 1-056726-110-8). His latest books are Performance Management: Finding the Missing Pieces to Close the Intelligence Gap (ISBN 0-471-57690-5) and Performance Management: Integrating Strategy Execution, Methodologies, Risk, and Analytics (ISBN 978-0-470-44998-1).

Mr. Cokins participates and serves on committees including CAM-I, the Supply Chain Council, the International Federation of Accountants (IFAC), and the Institute of Management Accountants. Mr. Cokins is a member of the Journal of Cost Management Editorial Advisory Board. Cokins can be reached at gary.cokins@sas.com. His blog is at http://blogs.sas.com/cokins and his latest book can also be previewed at http://www.sas.com/apps/pubscat/bookdetails.jsp?catid=1&pc=62401

Interesting Times

Probably for the first time, I am reproducing a comment from a reader in its entirety. As an ex-GE Finance and ex-Citi man, I found that the following parable struck close to home.

Here are some nice views from Randall Stross of http://enhilex.com/ on the current economic crisis. If this strikes a chord, do write back.

Hello Everyone, There is a lot of talk about what made Wall St. fail. Having spent some time inside firms that played pivotal roles, I must say that I believe there are as many reasons for the failure as there are ways to manifest lack of integrity, diligence, lack of responsibility and accountability. As recent failures demonstrate, all attempts to legislate all those characteristics have failed – for the same reasons. The manifestations I found were like these following examples:

1. Here’s a short conversation in a hallway. I would later realize that we were standing in front of the manager’s office… Me: “So , why don’t your models account for employment as a factor that influences the borrower’s ability to pay his mortgage?” Him, looking at me like I had 2 heads: “Er, uh… Well, that is because… very long pause… there is no reliable data on that. Yes, that’s right. So we leave it out…” That was the reply from a very intelligent person who knew he would be shown the door if he’d told me the truth. Lay persons understand that models are useful, though imperfect. But the example given above is something else. This model’s design was intended to mislead. People who do things like this may have been “good” employees, but they were not good citizens. The “good” employees remain… Anyone with an IQ over 50 knows that past performance does not predict future performance when you don’t account for the differences between past and present in the entirety of the nexus where the question belongs.

2. Whilst sitting next to me in a development lab, a young reporting analyst was directed to “hard code” the sum of a column of figures headed for regulators – because the bottom line “wasn’t good enough.” The young man hesitated. Relieved, I put my hand on his wrist and quietly suggested that he ask for our client’s manager to send that request to him in an email. The manager left the lab, calling us obscene names. Of course, the email never came.

3. A friend of mine working as a funder for a large mortgage firm was fired for refusing to fund a loan she knew was fraudulent. I’m proud to know the people in #2 and #3. These people have integrity and loyalty to principles the investing public can count on. But they both lost their jobs and have moved on into other fields, away from Wall St-ish areas. So, it is the people who are left that will work to rebuild confidence in the market. OK, now — who’s left? I wish we could let it all fall down and stand back up without the bums who caused it all. We do know who they are…

ps- Hope he is neither a spammer nor joking. This does bring back a lot of old memories from when I worked for the big hot fin companies. Do you have a personal story like that?