Interview Karim Chine BIOCEP (Cloud Computing with R)

Here is an interview with Karim Chine of http://www.biocep.net/

Working with an R or Scilab on clusters/grids/clouds becomes as simple as working with them locally-

Karim Chine, Biocep.

Ajay- Please describe your career in the field of science. What advice would you give to young science graduates in this recession?

Karim- My original background is in theoretical physics. I did my Master’s thesis at the Ecole Normale’s Statistical Physics Laboratory, where I worked on phase separation in two-dimensional additive mixtures with Dr Werner Krauth. I came to computer science after graduating from the Ecole Polytechnique, and I spent two years at TELECOM ParisTech studying software architecture and distributed systems design. I then worked for the IBM Paris Laboratory (VisualAge Pacbase applications’ generator), Schlumberger (Over the Air Platform and Web platform for smartcards personalization services), Air France (SSO deployment) and ILOG (OPL-CPLEX-ODM Development System). This gave me intense exposure to real-world, large-scale software design. I crossed the borders of cultural, technical and organizational domains several times, and I worked with a broad palette of technologies alongside some of the best and most innovative engineers. I moved to Cambridge in 2006 and I worked for the European Bioinformatics Institute. That is where I started dealing with the integration of R into various types of applications. I left the EBI in November 2007. I was looking for institutional support to help me realize a vision that was becoming clearer and clearer: a universal platform for scientific and statistical computing. I failed to get that support, and I have been working on BIOCEP full time for most of the last 18 months without funding. A few days of consultancy here and there allowed me to keep going. I spent several weeks at Imperial College, at the National Center for e-Social Sciences and at Berkeley’s department of statistics during that period. Those visits were extremely useful in refining the use cases of my platform. I am still looking for a partner to back the project. You asked me to give advice. The only advice I would give is to be creative and to try again and again to do what you really want to do.
Crises come and go, they always will, and extreme situations are part of life. I believe hard work and sincerity can prevail over anything.

Ajay- Describe BIOCEP’s scope and ambition.

What are the current operational analytics that can be done by users who have data?

Karim- My first ambition with BIOCEP is to deliver a universal platform for scientific and statistical computing and to create an open, federative and collaborative environment for the production, sharing and reuse of all the artifacts of computing. My second ambition is to dramatically enhance the accessibility of mathematical and statistical computing, to make HPC commonplace and to put new analytical, numerical and processing capabilities in the hands of everyone (open science).

The open source software conquest has gone very far. Environments like R or Scilab, technologies like Java, operating systems like Linux-Ubuntu, and tools like OpenOffice are being used by millions of people. Very little doubt remains about OSS’s final victory in some domains. The cloud is already a reality and it will take computing to a whole new realm. What is currently missing is the software that, by making the Cloud’s usage seamless, will create new ecosystems and will provide room for creativity, innovation and knowledge discovery on an unprecedented scale.

BIOCEP is one more building block toward this. BIOCEP is built on top of R and Scilab, and anything that you can do within those environments is accessible through BIOCEP. Here is what you get uniquely with this new R/Scilab-based e-platform:

High productivity via the most advanced cross-platform workbench available for the R environment.

Advanced Graphics: with BIOCEP, a graphic transducer allows the rendering on client side of graphics produced on server side and enables advanced capabilities like zooming/unzooming/scrolling for R graphics. A client-side mouse tracker dynamically displays information related to the graphics, depending on coordinates. Several virtual R devices showing different data can be coupled in zooming/scrolling, which helps in visually comparing complex graphics.

Extensibility with plug-ins: new views (IDE-like views, analytical interfaces…) can be created very easily either programmatically or via drag-and-drop GUI designers.

Extensibility with server-side extensions: any Java code can be packaged and used on server side. The code can interact seamlessly with R and Scilab or provide generic bridges to other software. For example, I provide an extension that allows you to use OpenOffice as a universal converter between various file formats on server side.

Seamless High Performance Computing: working with R or Scilab on clusters/grids/clouds becomes as simple as working with them locally. Distributed computing becomes seamless: creating a large number of R and Scilab remote engines and using them to solve large-scale problems becomes easier than ever. From the R console, the user can create logical links to existing R engines or to newly created ones and use those logical links to pilot the remote workers from within the R session. R functions enable using the logical links to import/export variables from the R session to the different workers and vice versa. R commands/scripts can be executed by the R workers synchronously or asynchronously. Many logical R links can be aggregated into one logical cluster variable that can be used to pilot the R workers in a coordinated way. A cluster.apply function allows the usage of the logical cluster to apply a function to a big data structure by slicing it and sending elementary execution commands to the workers. The workers apply the user’s function to the slices in parallel. The elementary results are aggregated to compose the final result, which becomes available within the R session.
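The slice, apply and aggregate pattern described for cluster.apply can be sketched in Python. This is only an illustration of the general idea, not BIOCEP's implementation: local threads stand in for remote R workers, and the names `slice_data` and `cluster_apply` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def slice_data(data, n_slices):
    """Split a list into at most n_slices contiguous, near-equal slices."""
    n_slices = max(1, min(n_slices, len(data)))
    size, extra = divmod(len(data), n_slices)
    slices, start = [], 0
    for i in range(n_slices):
        # The first `extra` slices get one additional element.
        end = start + size + (1 if i < extra else 0)
        slices.append(data[start:end])
        start = end
    return slices


def cluster_apply(func, data, n_workers=4):
    """Apply func to every element of data, one slice per worker,
    then aggregate the per-slice results in the original order."""
    if not data:
        return []
    slices = slice_data(data, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # Each worker processes one slice; map preserves slice order.
        partial_results = list(
            pool.map(lambda chunk: [func(x) for x in chunk], slices)
        )
    # Flatten the elementary results into the final result.
    return [y for part in partial_results for y in part]


squares = cluster_apply(lambda x: x * x, list(range(10)), n_workers=3)
```

The caller sees a single result list, as if the function had been applied locally. Slicing and aggregation stay hidden behind one call.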

Collaboration: your R/Scilab server running in the cloud can be accessed simultaneously by you and your collaborators. Everything gets broadcast, including graphics. A spreadsheet makes it possible to view and edit data collaboratively. Anyone can write plug-ins to take advantage of the collaborative capabilities of the frameworks. If your IP address is public, you can provide a URL to anyone and let them connect to your locally running R.

Powerful frameworks for Java developers: BIOCEP provides frameworks and tools to use R as if it were an object-oriented Java toolkit or a web toolkit for R-based dynamic applications.

Web services for C#, Perl, Python users/developers: most of the capabilities of BIOCEP, including piloting of R/Scilab engines on the cloud for distributed computing or for building scalable analytical web applications, are accessible from most programming languages thanks to the SOAP front-end.

RESTful API: simple URLs can perform computing using R/Scilab engines and return the result as XML or as graphics in any format. This works like Google Charts and has all the power of R, since the graphic is described with an R script provided as a parameter of the URL. The same API can be exposed on demand by the workbench. This allows, for example, integrating a Cloud-R with Excel or OpenOffice. The workbench works as a bridge between the cloud and those applications.
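Passing an R script as a URL parameter can be sketched as follows. The endpoint path and the parameter names `script` and `format` are assumptions made for illustration; they are not taken from BIOCEP's documentation.

```python
from urllib.parse import urlencode


def build_plot_url(base_url, r_script, image_format="png"):
    """Build a URL that carries an R script as a query parameter.

    The parameter names "script" and "format" are hypothetical.
    """
    # urlencode percent-escapes the script so it survives inside a URL.
    query = urlencode({"script": r_script, "format": image_format})
    return f"{base_url}?{query}"


# Hypothetical local endpoint, used only to show the shape of the request.
url = build_plot_url("http://localhost:8080/rest/graphics", "plot(rnorm(100))")
```

Because the whole request is a plain URL, any client that can fetch a URL, including a spreadsheet application, could in principle trigger the computation and receive the image.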

Advanced Pooling framework for distributed resources: useful for deploying pools of R/Scilab engines on multi-node systems and having them used simultaneously by several distributed client processes in a scalable/optimal way. A supervision GUI is provided for user-friendly management of the pools/nodes/engines.

Simultaneous use of R and Scilab: using Java scripting, data can be transferred from R to Scilab and vice versa.

Ajay- Could you tell us about a successful BIOCEP installation and what it led to? Can BIOCEP be used by the rest of the R community for other packages? What would be an ideal BIOCEP user /customer for whom cloud based analytics makes more sense ?

Karim- BIOCEP is still in pre-beta stage. However, it is a robust and polished pre-beta that several organizations are already using. Janssen Pharmaceutica is using it to create and deliver statistical applications for drug discovery that use R engines running on their backend servers. The platform is seen there as the way to go for an ultimate optimization of some of their data analysis pipelines. Janssen’s head of statistics is said to be very much interested in the capabilities BIOCEP gives statisticians to create their own analytical user interfaces and deliver them with their models without needing specific software development skills. Shell is creating BIOCEP-based application prototypes to explore the feasibility and advantages of migrating some of Shell’s applications to the Cloud. One group from Shell Global Solutions is planning to use BIOCEP for running Scilab in the cloud for corrosion simulation modeling. Dr Ivo Dinov’s team at UCLA is studying the migration of some of the SOCR applications to the BIOCEP platform as plug-ins and extensions. Dr Ivo Dinov has also applied for an important grant for building DISCb (Distributed Infrastructure for Statistical Computing in Biomedicine). If the grant application is successful, BIOCEP will be the backbone of that new infrastructure at the software architecture level. In cooperation with the Institute of Biostatistics, Leibniz University of Hannover, Bernd Bischl and Kornelius Rohmeyer have developed a framework for building user-friendly R GUIs of different complexity. The toolkit has used BIOCEP as an R backend since release 2.0. Several small projects have been implemented using this framework and some are in production, such as an application for education in biostatistics at the University of Hannover. The ESNATS project is also planning to use the BIOCEP frameworks.
Some development is being done at the EBI to customize the workbench and use it to give the end user the possibility to run R and Bioconductor on the EBI’s LSF cluster.

I’ve been in touch with Phil Butcher, Sanger’s head of IT, and he is considering the deployment of BIOCEP on Sanger’s systems simultaneously with Eucalyptus. The same type of deployment has been discussed with the director of OMII-UK, Neil Chue Hong. BIOCEP’s deployment is probably going to follow the deployment of the Eucalyptus System on NGS. Tena Sakai deployed BIOCEP at the Ernest Gallo Clinic and Research Centre and he is currently exploring the usage of R on the Cloud via BIOCEP (Eucalyptus / AWS). The platform has been deployed by a small consultancy company specializing in R on several London-based investment banks’ systems. I have had a go-ahead from Nancy Wilkins-Diehr (Director for Science Gateways, SDSC) for deploying on TeraGrid, and a deployment on EGEE has been discussed with Dr Steven Newhouse (EGEE Technical Director). Both deployments are on standby at the moment.

Quest Diagnostics is planning to use BIOCEP extensively. Sudeep Talati (University of Manchester) is doing his Master’s project on BIOCEP. He is supervised by Professor Andy Brass and he is exploring the use of a BIOCEP-based infrastructure to deliver microarray analysis workflows in a simple and intuitive way to biologists with and without the Cloud. In Manchester, Robin Pinning (e-Science team leader, Research Computing Services) has the deployment of BIOCEP on Manchester’s research cluster on his agenda…

As I have said, anything that you can do with R including installing, loading and using any R package is accessible through BIOCEP. The platform aims to be universal and to become a tool for productivity and collaboration used by everyone dealing with computing/analytics with or without the cloud.

The Cloud, whether it is public or private, will be generalized and everyone will become a cloud user in one way or another.

Ajay- What motivated you to build BIOCEP and mash up cloud computing and R? What scope do you see for cloud computing in developing countries in Asia and Africa?

Karim– When I was at the EBI, I worked on the integration of R within scalable web applications. I explored and tested the available frameworks and tools and all of them were too low level or too simple to answer the problem. I decided to build new frameworks. I had the opportunity to be able to stand on the shoulders of giants.

Simon Urbanek’s packages already bridged the C-API of R with Java reliably. Martin Morgan’s RWebServices package defined class mappings between R types, including S4 classes, and Java.

Progressively R became usable as a Java object oriented toolkit, then as a Java Server. Then I built a pooling framework for distributed resources that made it possible for multiple clients to use multiple R engines optimally.

I started building a GUI to validate the server’s increasingly sophisticated API. That GUI became progressively the workbench.

When I was at Imperial, I worked with the National Grid Service team at the Oxford e-Research Centre to deploy my platform on Oxford’s core cluster. That deployment led to many changes in the architecture to meet all the security requirements.

It was obvious that the next step was to make BIOCEP available on Amazon’s Cloud. Academic Grids are for researchers and the cloud is for everyone. Making the platform work seamlessly on EC2 took a few months. With the cloud came the focus on collaborative features (collaborative views, graphics, spreadsheets…).

I can only talk about the example of a country I know, Tunisia, and I guess some of this applies to Asian Countries. Even if the broadband is everywhere today and is becoming accessible and affordable by a majority of Tunisians, I am not sure that the adoption of the cloud would happen soon.

Simple considerations like the obligation to pay for the compute cycles in dollars (and not in dinars) are a barrier for adoption. Spending foreign currencies is subject to several restrictions in general for companies and for individuals; few Tunisians have credit cards that can be used to pay Amazon. Companies would prefer to buy and administer their own machines because the cost of operation and maintenance is lower in Tunisia than it is in Europe/US.

Even if the cloud would help in giving Tunisian researchers access to affordable Computing cycles on demand, it seems that most of them have learned to live without HPC resources and that their research is more theoretical and less computational than it could be. Others are collaborating with research groups in Europe (France) and they are using those European groups’ infrastructures.

Ajay- How would BIOCEP address the problems of data hygiene, data security and privacy? Are encrypted and compressed data transfers supported or planned?

Karim- With BIOCEP, a computational engine is exposed as a distributed component via a single mono-directional HTTP port. When you run such an engine on an EC2 instance you have two options:

  • 1/ Totally sandbox the machine (via the security group) and leave only the SSH port open. Private key authentication is required to access the machine. In this case you use an SSH tunnel (created with a tool like PuTTY, for example), which lets you see the engine as if it were running on your local machine on a port of your choice, the one specified when creating the tunnel. When you start the Virtual Workbench and connect in HTTP mode to your local host via the specified port, you are effectively connecting to the EC2 R engine. 100% of the information exchanged between your workbench and the engine, including your data, is encrypted thanks to the SSH tunnel. The Virtual Workbench embeds JSch and can create the tunnel for you automatically. This mode doesn’t allow collaboration, since it requires the private key to let the workbench talk to the EC2 R/Scilab engine.
  • 2/ Tell the EC2 machine at startup (via the “user data”) to require specific credentials from the user. When the machine starts running, the user needs to provide those credentials to get a session ID and to be able to pilot a virtual EC2 R/Scilab engine. This mode enables collaboration. The client (workbench/scripts) connects to the EC2 machine instance via HTTP (this will be HTTPS in the near future).
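The SSH tunnel in the first option amounts to a local port forward. As a rough sketch, the equivalent OpenSSH command can be assembled like this; the key file, user name, host name and port numbers are placeholders, not values from BIOCEP.

```python
import shlex


def ssh_tunnel_command(key_file, user, host, local_port, remote_port):
    """Return the argument list for an OpenSSH local port forward."""
    return [
        "ssh",
        "-i", key_file,  # private key used for authentication
        "-N",            # forward ports only, run no remote command
        "-L", f"{local_port}:localhost:{remote_port}",  # local -> remote port
        f"{user}@{host}",
    ]


# Placeholder values for illustration only.
cmd = ssh_tunnel_command("ec2-key.pem", "ubuntu", "ec2-host.example.com", 8080, 8080)
print(shlex.join(cmd))
```

While such a tunnel is open, a client that connects to the local port is talking to the remote engine, and all traffic between the two machines travels inside the encrypted SSH session.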

Ajay- Suppose I have 20 GB per month of data and my organization has decided to cut back on expensive annual software. How can the current version of BIOCEP help me do the following?

Karim– Ways BIOCEP can help you right now.

1) Data aggregation and Reporting in terms of spreadsheet, presentation and graphs

  • BIOCEP provides a highly programmable server side spreadsheet.
  • It can be used interactively as a view of the workbench, and simple clicks allow the transfer of data from cells to R variables and vice versa. It can be created and populated from R (console / scripts).
  • Any R function can be used within dynamically computed cells. The evaluation of those dynamic cells is done on server side and can use high performance computing functions. Macros allow adding reactivity to the spreadsheets.
  • A macro allows the user to execute any R code in response to a value change of an R variable or of the content of a range within a spreadsheet. Variable docking macros allow the mirroring of R variables of any type (vectors, matrices, data frames…) with ranges within the spreadsheet in read/write mode.

Several ready-to-use User Interface components can be created and docked anywhere within the spreadsheet. Those components include

  • an R Graphics viewer (PDF viewer) showing Graphics produced by a user-defined R script and reactive on user-defined variables and cell ranges changes,
  • customizable sliders mirroring R variables,
  • Buttons executing user-defined R code when pressed,
  • Combo boxes mirroring factor variables…

The spreadsheet-based analytical user interface can pilot an R running at any location (local R, Grid R, Cloud R…). It can be created in minutes just by pointing, clicking and copy/pasting.

Cell contents, macros and reactive docked components can be saved in a zip file and become a Workbench plug-in. Like all BIOCEP plug-ins, the spreadsheet-based GUI can be delivered to the end user via a simple URL. It can use a cloud-R or a local R created transparently on the user’s machine.
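The reactivity that the macros bring to the spreadsheet is essentially an observer pattern: code is registered against a variable and runs whenever its value changes. The sketch below shows that idea in plain Python. The `ReactiveVariable` class is hypothetical and is not BIOCEP's macro API.

```python
class ReactiveVariable:
    """A value that runs registered macros whenever it changes."""

    def __init__(self, value):
        self._value = value
        self._macros = []

    def on_change(self, macro):
        """Register a callable taking (old_value, new_value)."""
        self._macros.append(macro)

    @property
    def value(self):
        return self._value

    @value.setter
    def value(self, new_value):
        if new_value == self._value:
            return  # no change, so no macro is triggered
        old_value, self._value = self._value, new_value
        for macro in self._macros:
            macro(old_value, new_value)


# A slider-like variable whose changes are recorded by a macro.
threshold = ReactiveVariable(10)
change_log = []
threshold.on_change(lambda old, new: change_log.append((old, new)))
threshold.value = 12
threshold.value = 12  # same value again: the macro does not fire
```

A docked slider, button or graphics viewer can be thought of in the same terms: each one either sets such a variable or reacts to it.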

2) Build time series models, regression models

BIOCEP’s workbench is extensible and I am hoping that contributors will soon start writing plug-ins or converting available GUIs to BIOCEP plug-ins in order to make the creation of those models as easy as possible.

Biography-

Karim Chine
Karim Chine graduated from the French Ecole Polytechnique and TELECOM ParisTech. He worked at Ecole Normale Supérieure-LPS (phase separation in two-dimensional additive mixture), IBM (VisualAge Pacbase), Schlumberger (Over the Air Platform and Web platform for smartcards personalization services), Air France (SSO deployment), ILOG (OPL-CPLEX-ODM Development System), European Bioinformatics Institute (Expression Profiler, Biocep) and Imperial College London-Internet Center (Biocep). He contributed to open source software (AdaBroker) and he is the author of the Biocep platform. He currently works on the seamless integration of the new platform within utility computing infrastructures (Amazon EC2), its deployment on Grids (NGS) and its usage as a tool for education, and he is trying to build collaborations with academic and industrial partners.

You can view his resume here http://www.biocep.net/scan/CV_Karim_Chine_June_2009.pdf

PAW is back

Predictive Analytics World is going to be back this October, and all those who missed the stellar event can start booking now.

Here is the official BR (blog release)

Source: http://www.predictiveanalyticsworld.com/blog/wp-trackback.php?p=20

June 5th 2009 10:46 am

Keynotes at October’s PAW: Stephen Baker and Usama Fayyad

Predictive Analytics World, coming October 20-21 to Washington DC, has a great line-up of keynote speakers:

Stephen Baker, author of The Numerati and senior writer at BusinessWeek, where he’s been since 1987. Steve’s book has received a tremendous amount of attention this year. It is a revealing and insightful exploration of the opportunities and pitfalls of applied analytics, and consumer perception thereof.

Usama Fayyad, Ph.D. — CEO, Open Insights and formerly Yahoo!’s Chief Data Officer and Executive Vice President of Research & Strategic Data Solutions. Dr. Fayyad will return as an acclaimed keynote speaker. His keynote at February’s PAW (San Francisco) received extremely strong ratings from conference attendees.

Finally, Eric Siegel, Ph.D., will be kicking off PAW with a reprise of his keynote, “Five Ways to Lower Costs with Predictive Analytics.”

PMML 4.0

There are some nice changes in the PMML 4.0 version. PMML is the XML format for data modeling or, to quote the DMG itself:

PMML uses XML to represent mining models. The structure of the models is described by an XML Schema. One or more mining models can be contained in a PMML document. A PMML document is an XML document with a root element of type PMML. The general structure of a PMML document is:

  <?xml version="1.0"?>
  <PMML version="4.0"
    xmlns="http://www.dmg.org/PMML-4_0"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" >

    <Header copyright="Example.com"/>
    <DataDictionary> ... </DataDictionary>

    ... a model ...

  </PMML>
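To make the general structure concrete, here is a small Python sketch that parses a skeleton PMML document with the standard library and reads its root element, version and top-level children. The `DataDictionary` is left empty because the excerpt elides its content, no model element is included, and the helper name `summarize_pmml` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Skeleton document following the general structure quoted above.
PMML_DOC = """<?xml version="1.0"?>
<PMML version="4.0"
  xmlns="http://www.dmg.org/PMML-4_0"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <Header copyright="Example.com"/>
  <DataDictionary/>
</PMML>
"""

PMML_NS = "http://www.dmg.org/PMML-4_0"


def summarize_pmml(xml_text):
    """Return the root name, version and child element names of a PMML document."""
    root = ET.fromstring(xml_text)
    prefix = "{" + PMML_NS + "}"

    def strip_ns(tag):
        # ElementTree reports namespaced tags as "{namespace}Name".
        return tag[len(prefix):] if tag.startswith(prefix) else tag

    return {
        "root": strip_ns(root.tag),
        "version": root.get("version"),
        "children": [strip_ns(child.tag) for child in root],
    }


summary = summarize_pmml(PMML_DOC)
```

The namespace URI carries the PMML version, so a consumer would typically check both the `version` attribute and the namespace before trying to interpret the models inside.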

So what is new in version 4? Here are some powerful modeling changes. For anyone with any XML knowledge, PMML is the way to go.

PMML 4.0 – Changes from PMML 3.2

Associations

  • Itemset and AssociationRule elements are no longer enclosed within a “Choice” element
  • Added different scoring procedures: recommendation, exclusiveRecommendation and ruleAssociation with explanation and example
  • Changed version to “4.0” from “3.2” in the example(s)

BuiltinFunctions

Added the following functions:
  • isMissing
  • isNotMissing
  • equal
  • notEqual
  • lessThan
  • lessOrEqual
  • greaterThan
  • greaterOrEqual
  • isIn
  • isNotIn
  • and
  • or
  • not
  • if

ClusteringModel

  • Changed version to “4.0” from “3.2” in the example(s)
  • Added reference to ModelExplanation element in the model XSD

Conformance

  • Changed all version references from “3.2” to “4.0”

DataDictionary

  • No changes

Functions

  • No changes

GeneralRegression

  • Changed to allow for Cox survival models and model ensembles
    • Add new model type: CoxRegression.
    • Allow empty regression model when model type is CoxRegression, so that baseline-only model could be represented.
    • Add new optional model attributes: endTimeVariable, startTimeVariable, subjectIDVariable, statusVariable, baselineStrataVariable, modelDF.
    • Add optional Matrix in Predictor to specify a contrast matrix, optional attribute referencePoint in Parameter.
    • Add new elements: BaseCumHazardTables, EventValues, BaselineStratum, BaselineCell.
    • Add examples of scoring for Cox Regression and contrast matrices.
    • Add new type of distribution: tweedie.
    • Add new attribute in model: targetReferenceCategory, so that the model can be used in MiningModel.
    • Changed version to “4.0” from “3.2” in the example(s)
    • Added reference to ModelExplanation element in the model XSD

GeneralStructure

Header

  • No changes

Interoperability

  • Changed: “As a result, a new approach for interoperability was required and is being introduced in PMML version 3.2.” to “As a result, a new approach for interoperability was introduced in PMML version 3.2.”

MiningSchema

  • Added frequencyWeight and analysisWeight as new options for usageType. They will not affect scoring, but will make model information more complete.

ModelComposition — No longer used, replaced by MultipleModels

ModelExplanation

  • New addition to PMML 4.0 that contains information to explain the models, model fit statistics, and visualization information.

ModelVerification

  • No changes

MultipleModels

  • Replaces ModelComposition. Important additions are segmentation and ensembles.
  • Added reference to ModelExplanation element in the model XSD

NaïveBayes

  • Changed version to “4.0” from “3.2” in the example(s)
  • Added reference to ModelExplanation element in the model XSD

NeuralNetwork

  • Changed version to “4.0” from “3.2” in the example(s)
  • Added reference to ModelExplanation element in the model XSD

Output

  • Extended output type to include Association rule models. The changes add a number of new attributes: “ruleFeature”, “algorithm”, “rank”, “rankBasis”, “rankOrder” and “isMultiValued”. A new enumeration type “ruleValue” is added to the RESULT-FEATURE

Regression

  • Changed version to “4.0” from “3.2” in the example(s)
  • Added reference to ModelExplanation element in the model XSD

RuleSet

  • Changed version to “4.0” from “3.2” in the example(s)
  • Added reference to ModelExplanation element in the model XSD

Sequence

  • Changed version to “4.0” from “3.2” in the example(s)

Statistics

  • accommodate weighted counts by replacing INT-ARRAY with NUM-ARRAY in DiscrStats and ContStats
  • change xs:nonNegativeInteger to xs:double in several places
  • add new boolean attribute ‘weighted’ to UnivariateStats and PartitionFieldStats elements
  • add new attribute cardinality in Counts
  • Also some very long lines in this document are now wrapped.

SupportVectorMachine

  • Added optional attribute threshold
  • Added optional attribute classificationMethod
  • Attribute alternateTargetCategory removed from SupportVectorMachineModel element and moved to SupportVectorMachine element
  • Changed the example slightly
  • Changed version to “4.0” from “3.2” in the example(s)
  • Added reference to ModelExplanation element in the model XSD

Targets

  • No changes

Taxonomy

  • Changed: “A TableLocator may contain any description which helps an application to locate a certain table. PMML 3.2 does not yet define the content. PMML users have to use their own extensions. The same applies to InlineTable.” to “A TableLocator may contain any description which helps an application to locate a certain table. PMML standard does not yet define the content. PMML users have to use their own extensions. The same applies to InlineTable.”

Text

  • Changed version to “4.0” from “3.2” in the example(s)
  • Added reference to ModelExplanation element in the model XSD

TimeSeriesModel

  • New addition to PMML 4.0 to support Time series models

Transformations

  • No changes

TreeModel

  • Changed version to “4.0” from “3.2” in the example(s)
  • Added reference to ModelExplanation element in the model XSD

Sources

http://www.dmg.org/v4-0/GeneralStructure.html

http://www.dmg.org/v4-0/Changes.html

and here are some companies using PMML already

http://www.dmg.org/products.html

I found the tool at http://www.dmg.org/coverage/ much more interesting though (see screenshot).

Screenshot-Mozilla Firefox

Zementis, whom we have covered in the interviews, has played a stellar role in bringing together this common standard for data mining. Note that a KXEN model is also highlighted there.

The best PMML converter tutorial is here

http://www.zementis.com/videos/PMML_Converter_iGoogle_gadget_2_demo.htm

Teratec : High Performance Computing Event

Here is a good HPC event.

The Ter@tec’09 Forum
June 30 and July 1st, 2009, Supélec (91- France)


Incidentally, it is also quite close to the KDD conference http://www.decisionstats.com/2009/06/19/conference-of-the-year-kdd-2009/

High performance Simulation and Computing for competitiveness, innovation and employment

The international HPC event
The Ter@tec annual Forum, created in 2006, is a major occasion of meetings, exchanges and reflection in the field of high performance simulation and computing.

Since the success of its first edition, the Ter@tec Forum has developed and is now organized on two days with plenary conferences, workshops and exhibition.

In 2008, more than 400 international attendees, from research and industry, providers and users, met to review the largest worldwide programs and discuss the perspectives and the major challenges we are facing, both on the technology side and on the user side.

The Forum was recognized as very successful, with high-level presentations and workshops, and the personal participation of Mrs Valérie Pécresse, French Minister for Higher Education and Research, and Mr Janez Potočnik, European Commissioner for Science and Research.

Ter@tec 2009, the meeting of the HPC community around the technological and economical aspects of the high performance simulation and computing development.

Source- http://www.teratec.eu/gb/forum/index.html

Conference of the year: KDD 2009

This is one great conference you should attend if you have the time and inclination to check out the latest advances in the world of knowledge discovery. While KXEN (for whom I consult on social media) is a Gold Sponsor, the following posts on workshops, demos and papers will show you just how much technical stuff, as opposed to marketing bullshit and jazz (as in other confs), is available in this conference. So pack your bags, and Vive la France for a grueling, refreshing course in Knowledge Discovery and Text Mining. Incidentally, KXEN intends to show its path-breaking, cutting-edge social network analysis software KSN here.

Disclaimer- I am a social media consultant to KXEN.

KDD2009: Workshops

Abstracts

W1 – Statistical and Relational Learning and Mining in Bioinformatics (StReBio’09)

Jan Ramon, Fabrizio Costa, Christophe Costa Florencio, Joost Kok

Bioinformatics is an application domain where information is naturally represented in terms of relations between heterogeneous objects. Modern experimentation and data acquisition techniques allow the study of complex interactions in biological systems. This raises interesting challenges because the amount of data is huge, some information cannot be observed, and measurements may be noisy.

The StReBio’09 workshop invites contributions concerning applications of statistical relational learning and mining methods in bio-informatics domains. In particular, the workshop invites both regular papers, problem statements and problem solution papers.

W2 – The 3rd International Workshop on Knowledge Discovery from Sensor Data (SensorKDD-2009)

Olufemi Omitaomu, Auroop Ganguly, Joao Gama, Ranga Raju Vatsavai, Mohamed Medhat Gaber and Nitesh V. Chawla

Wide-area sensor infrastructures, remote sensors, RFIDs, and wireless sensor networks yield massive volumes of disparate, dynamic, and geographically distributed data. The Sensor-KDD 2009 workshop solicits papers that describe innovative solutions in offline data mining and/or real-time analysis of sensor or streaming data. Position papers that describe the challenges and requirements for sensor data based knowledge discovery in high-priority application domains, as well as relevant case studies, are particularly encouraged.

W3 – ACM SIGKDD Workshop on CyberSecurity and Intelligence Informatics (CSI-KDD)

Hsinchun Chen, Marc Dacier, Marie-Francine Moens, Gerhard Paaß, Christopher C. Yang

Computer supported communication and infrastructure are integral parts of modern economy. Their security is of incredible importance to a wide variety of practical domains ranging from Internet service providers to the banking industry and e-commerce, from corporate networks to the intelligence community. Of interest to this workshop are novel knowledge discovery methods addressing this field, e.g. adaptive, active or anticipatory approaches integrating new types of contents and protocols. Equally important are innovative applications demonstrating the effectiveness of data mining in solving real-world security problems.

W4 – Workshop on Visual Analytics and Knowledge Discovery (VAKD ’09)

Kai Puolamäki, Heikki Mannila, Alessio Bertone, Silvia Miksch, Mark A. Whiting, Jean Scholtz

The goal of Visual Analytics is to derive insight from massive, dynamic, ambiguous, and often conflicting data; detect the expected and discover the unexpected; provide timely, defensible, and understandable assessments; and communicate the assessment effectively for action. The goal of this workshop is to raise awareness in the KDD community of the importance of Visual Analytics and to bring together researchers from the underlying fields to bridge the gap between them, with the aim of writing a KDD research roadmap on Visual Analytics.

W5 – The Third International Workshop on Data Mining and Audience Intelligence for Advertising (ADKDD)

Ying Li, Arun C. Surendran, and Dou Shen

Advertising, especially online advertising, is growing rapidly and brings about large volumes of data along with challenging data mining problems. Following on the success of ADKDD 2007 and 2008, ADKDD 2009 will be held in Paris, France, in conjunction with KDD 2009, to provide a high-level international forum for the academic community and the industry to present the state of the art in advertising algorithms and applications.

We encourage papers that bring up and formalize new research problems in online advertising, or propose novel data mining techniques for existing problems. We plan to cover (but are not restricted to) the following areas: Mining for Ad Relevance and Ranking; Audience Intelligence & User Modeling; Content Understanding; Search Engine Marketing, Optimization (SEMs, SEOs) and Other Topics in Advertising. Accepted papers will be archived in the ACM Digital Library and one or two papers will be recommended to SIGKDD Explorations.

W6 – The 3rd Workshop on Social Network Mining and Analysis (SNA-KDD)

Lee Giles, Prasenjit Mitra, Igor Perisic, John Yen, Haizheng Zhang

(Abstract Coming Soon)

W7 – Human Computation Workshop (HCOMP 2009)

Paul Bennett, Raman Chandrasekar, Max Chickering, Panos Ipeirotis, Edith Law, Foster Provost, Anton Mityagin, Luis von Ahn

Human computation is a new research area that studies the process of channeling the vast internet population to perform tasks or provide data towards solving difficult problems that no known computer algorithms can yet solve perfectly and efficiently, e.g. digitize books, recognize objects in images and songs, translate sentences, summarize news articles, annotate videos etc. The goal of HCOMP 2009 is to bring together academic and industry researchers in a stimulating discussion of existing human computation applications, such as Games With A Purpose (e.g. the ESP game), Mechanical Turk and CAPTCHAs, and future directions of this new subject area.

Included in the workshop are invited talks, presentations, posters, and a demo session where participants are invited to showcase their human computation applications.

W8 – Data Mining using Matrices and Tensors (DMMT’09)

Chris Ding, Tao Li

This workshop will present recent advances in algorithms and methods using matrix and scientific computing/applied mathematics for modeling and analyzing massive, high-dimensional, and nonlinear-structured data. One main goal of the workshop is to bring together leading researchers on many topic areas (e.g., computer scientists, computational and applied mathematicians) to assess the state-of-the-art, share ideas, and form collaborations. We also wish to attract practitioners who seek novel ideas for applications.

W9 – Third Workshop on Data Mining Case Studies and Practice Prize (DMCS)

Gabor Melli, Peter van der Putten, Brendan Kitts

The Data Mining Case Studies Workshop and Practice Prize was established to recognize the very best data mining deployments for the year. Data Mining Case Studies will highlight data mining implementations that have significantly and measurably improved business operations, advanced scientific discoveries, or provided other benefits to humanity. The best paper will be awarded the Practice Prize. Do you have an outstanding data mining application? This is a unique opportunity to be recognized for your work.

W10 – KDD cup 2009: Fast Scoring on a Large Database (KDDcup09)

Isabelle Guyon, David Vogel

This workshop will discuss the results of the KDD cup 2009. The competition is organized around a large dataset provided by the French telecom company Orange. It is a problem of Customer Relationship Management (CRM), a key element of modern marketing strategies. Orange offered the opportunity to work on a large marketing database to predict the propensity of customers to switch provider (churn), buy new products or services (appetency), or buy upgrades or add-ons proposed to them to make the sale more profitable (up-selling).

W11 – The First ACM SIGKDD Workshop on Knowledge Discovery from Uncertain Data (U’09)

Jian Pei, Lise Getoor, Ander de Keijzer

The First ACM SIGKDD International Workshop on Knowledge Discovery from Uncertain Data (U’09) is to discuss in depth the challenges, opportunities and techniques on the topic of analyzing and mining uncertain data. The theme of this workshop is to make connections among the research areas of probabilistic databases, probabilistic reasoning, and data mining, as well as to build bridges among the aspects of models, data, applications, novel mining tasks and effective solutions. By making connections among different communities, we aim at understanding each other in terms of scientific foundation as well as commonality and differences in research methodology.

KDD-09 Call For Workshop Proposals (Expired)

The ACM KDD-2009 organizing committee invites proposals for workshops to be held in conjunction with the conference. The purpose of a workshop is to provide participants with the opportunity to present and discuss novel research ideas on active and emerging topics of knowledge discovery and data mining. A workshop should also support the interaction and feedback among topic specialists from academia, industry and government.

A workshop may be organized around industrial applications in a particular domain and the challenges this domain poses, such as the Netflix workshop on recommender systems (http://netflixkddworkshop2008.info/).

A workshop may also include a challenge problem, such as the one on time series classification that took place in 2007 (http://www.cs.ucr.edu/~eamonn/SIGKDD2007TimeSeries.html). A session with papers that address a challenge complements the more diverse sessions with regular papers and improves the potential for discussion. Because such challenges require extra time to plan, we may be willing to provide early notice of acceptance.

The organizers of approved workshops are required to announce the workshop and call for papers, gather submissions, conduct the reviewing process and decide upon the final workshop program. They must also prepare an informal set of workshop proceedings to be distributed with the registration materials at the conference. They may choose to form organizing or program committees for assistance in these tasks. The logistics of the workshops will be handled with help from the ACM KDD-2009 organizers.

Source – http://www.kdd.org/kdd/2009/workshops.html

KDD2009: Papers Research and Industrial

Research Papers

A Generalized Co-HITS Algorithm and Its Application to Bipartite Graphs
Hongbo Deng* The Chinese Univ. of Hong Kong; Michael Lyu The Chinese University of Hong Kong; Irwin King Chinese University of Hong Kong

A LRT Framework for Fast Spatial Anomaly Detection
Mingxi Wu* Oracle Corporation; Xiuyao Song ; Chris Jermaine University of Florida; Sanjay Ranka University of Florida; John Gums

A Multi-Relational Approach to Spatial Classification
Richard Frank* Simon Fraser University; Martin Ester Simon Fraser University; Arno Knobbe Leiden University

A Principled and Flexible Framework for Finding Alternative Clusterings
ZiJie Qi* UCDavis; Ian Davidson University of California Davis

A Viewpoint-based Approach for Interaction Graph Analysis
Sitaram Asur* Ohio State University; Srinivasan Parthasarathy Ohio State University

Adapting the Right Measures for K-means Clustering
Junjie Wu* Beihang University; Hui Xiong Rutgers University; Jian Chen

An Association Analysis Approach to Biclustering
Gaurav Pandey* University of Minnesota; Gowtham Atluri ; Michael Steinbach University of Minnesota; Chad Myers University of Minnesota; Vipin Kumar University of Minnesota

Analyzing Patterns of User Content Generation in Online Social Networks
Lei Guo* Yahoo!; Enhua Tan Ohio State University; Songqing Chen George Mason University; Xiaodong Zhang Ohio State University; Yihong (Eric) Zhao Yahoo!

Anomalous Window Discovery through Scan Statistics for Linear Intersecting Paths (SSLIP)
Lei Shi University of Maryland Baltimore County; Vandana Janeja* UMBC

Audience Selection for On-line Brand Advertising: Privacy-friendly Social Network Targeting
Foster Provost* NYU; Brian Dalessandro Media6degrees; Rod Hook Coriolis Ventures; Xiaohan Zhang New York University

Augmenting the Generalized Hough Transform to Enable the Mining of Petroglyphs
Qiang Zhu* Univ of California Riverside; Xiaoyue Wang Univ of California Riverside; Eamonn Keogh UC Riverside; Sang-Hee Lee UC Riverside

BBM: Bayesian Browsing Model from Petabyte-scale Data
Chao Liu* Microsoft Research; Fan Guo Carnegie Mellon University; Christos Faloutsos CMU

Cross Domain Distribution Adaptation via Kernel Mapping
Erheng Zhong* Sun Yat-Sen University; Wei Fan IBM T.J.Watson; Jing Peng Montclair State University; Kun Zhang Xavier University of Louisiana; Jiangtao Ren Sun Yat-Sen University; Olivier Verscheure IBM T.J.Watson; Deepak Turaga IBM

Cartesian Contour: A Concise Representation for a Collection of Frequent Sets
Ruoming Jin* Kent State University; Yang Xiang Kent State University; Lin Liu Kent State University

Category Detection Using Hierarchical Mean Shift
Pavan Vatturi Oregon State University; Weng-Keen Wong* Oregon State University

Causality Quantification and Its Applications: Structuring and Modeling of Multivariate Time Series
Takashi Shibuya* The University of Tokyo; Tatsuya Harada The University of Tokyo; Yasuo Kuniyoshi The University of Tokyo

Characteristic Relational Patterns
Arne Koopman* Universiteit Utrecht; Arno Siebes Universiteit Utrecht

Classification of Software Behaviors for Failure Detection: A Discriminative Pattern Mining Approach
David Lo Singapore Management University; Hong Cheng* Chinese University of Hong Kong; Jiawei Han University of Illinois at Urbana-Champaign; Siau-Cheng Khoo National University of Singapore; Chengnian Sun National University of Singapore

Co-Clustering on Manifolds
Quanquan Gu* Tsinghua University; Jie Zhou Tsinghua University

CoCo: Coding Cost for Parameter-free Outlier Detection
Christian Bohm University of Munich; Katrin Haegler University of Munich; Nikola Muller Max Planck Institute of Biochemistry Martinsried Germany; Claudia Plant* Technische Universitat Munchen

Co-evolution of Social and Affiliation Networks
Hossam Sharara* University of Maryland; Elena Zheleva University of Maryland College Park; Lise Getoor University of Maryland

Collaborative Filtering with Temporal Dynamics
Yehuda Koren* Yahoo! Research

Collective Annotation of Wikipedia Entities in Web Text
Sayali Kulkarni IIT Bombay; Amit Singh IIT Bombay; Ganesh Ramakrishnan IIT Bombay; Soumen Chakrabarti* IIT Bombay

Collusion-Resistant Anonymous Data Collection Method
Mafruz Zaman Ashrafi* Institute For Infocomm Researc; See-Kiong Ng Institute for Infocomm Research

Combining Link and Content for Community Detection: A Discriminative Approach
Tianbao Yang* Michigan State University; Rong Jin Michigan State University; Yun Chi NEC Laboratories America; Shenghuo Zhu NEC Laboratories America Inc.

Connections between the Lines: Augmenting Social Networks with Text
Jonathan Chang* Princeton University; Jordan Boyd-Graber Princeton University; David Blei Princeton University

Consensus Group Based Stable Feature Selection
Lei Yu* Binghamton University; Steven Loscalzo SUNY Binghamton; Chris Ding University of Texas at Arlington

Constant-Factor Approximation Algorithms for Identifying Dynamic Communities
Chayant Tantipathananandh* UIC; Tanya Berger-Wolf UIC

Constrained Optimization for Validation-Guided Conditional Random Field Learning
Minmin Chen ; Yixin Chen* Washington University in St. L

Correlated Itemset Mining in ROC Space: A Constraint Programming Approach
Siegfried Nijssen* Leuven University; Tias Guns Katholieke Universiteit Leuven; Luc De Raedt Katholieke Universiteit Leuven

CP-Summary: A Concise Representation for Browsing Frequent Itemsets
Ardian Poernomo* Nanyang Technological Universi; Vivekanand Gopalkrishnan Nanyang Technological Universi

Detection of Unique Temporal Segments by Information Theoretic Meta-clustering
Shin Ando* Gunma University; Einoshin Suzuki

Differentially-Private Recommender Systems
Frank McSherry* Microsoft Research; Ilya Mironov Microsoft Research

DOULION: Counting Triangles in Massive Graphs with a Coin
Charalampos Tsourakakis* Carnegie Mellon University; U Kang Carnegie Mellon University; Gary Miller Carnegie Mellon University; Christos Faloutsos CMU

Drosophila Gene Expression Pattern Annotation Using Sparse Features and Term-term Interactions
Shuiwang Ji* Arizona State University; Lei Yuan Arizona State University; Ying-Xin Li Nanjing University; Zhi-Hua Zhou Nanjing University; Sudhir Kumar ; Jieping Ye Arizona State University

DynaMMo: Mining and Summarization of Coevolving Sequences with Missing Values
Lei Li* Carnegie Mellon University; Jim McCann Carnegie Mellon University; Nancy Pollard Carnegie Mellon University; Christos Faloutsos CMU

Effective Multi-Label Active Learning for Text Classification
Bishan Yang* Peking University; JianTao Sun ; Zheng Chen

Efficient Anomaly Monitoring Over Moving Object Trajectory Streams
Lei Chen* HKUST; Ada Fu Chinese University of Hong Kong; Yingyi Bu CUHK

Efficient Influence Maximization in Social Networks
Wei Chen* Microsoft Research Asia; Yajun Wang Microsoft Research Asia; Siyu Yang Tsinghua University

Efficient Methods for Topic Model Inference on Streaming Document Collections
Limin Yao* University of Massachusetts Am; David Mimno University of Massachusetts Amherst; Andrew McCallum University of Massachusetts Amherst

Efficiently Learning the Accuracy of Labeling Sources for Selective Sampling
Pinar Donmez* Carnegie Mellon University; Jaime Carbonell Carnegie Mellon University; Jeff Schneider Carnegie Mellon University

Exploiting Wikipedia as External Knowledge for Document Clustering
Tony Hu* Drexel University; Xiaodan Zhang Drexel University; Caimei Lu Drexel University; E.K Park University of Missouri at Kansas City; Xiaohua Zhou Drexel University

Exploring Social Tagging Graph for Web Object Classification
Zhijun Yin* University of Illinois; Rui Li ; Qiaozhu Mei ; Jiawei Han University of Illinois at Urbana-Champaign

Extracting Discriminative Concepts for Domain Adaptation in Text Mining
Bo Chen* CUHK; Wai Lam CUHK; Ivor Tsang NTU; Tak-lam Wong CUHK

Fast Approximate Spectral Clustering
Donghui Yan University of California Berkeley; Ling Huang* Intel Research; Michael Jordan University of California Berkeley

Feature Shaping for Linear SVM Classifiers
George Forman* Hewlett-Packard Labs; Martin Scholz HP Labs; Shyamsundar Rajaram Hewlett-Packard

Finding a Team of Experts in Social Networks
Theodoros Lappas Univ of California Riverside; Kun Liu IBM Almaden; Evimaria Terzi* IBM Almaden

Frequent Pattern Mining with Uncertain Data
Charu Aggarwal* IBM T J Watson Research Center; Yan Li Tsinghua University; Jianyong Wang Tsinghua University; Jing Wang New York University

Genre-based Decomposition of Email Class Noise
Aleksander Kolcz* Microsoft Live Labs; Gordon Cormack University of Waterloo

Grouped Graphical Granger Modeling Methods for Temporal Causal Modeling
Aurelie Lozano* IBM Research; Naoki Abe IBM T J Watson Research Center; Yan Liu IBM Research; Saharon Rosset Tel-Aviv University, Israel

Heterogeneous Source Consensus Learning via Decision Propagation and Negotiation
Jing Gao* UIUC; Wei Fan IBM T.J.Watson; Yizhou Sun ; Jiawei Han University of Illinois at Urbana-Champaign

Improving Clustering Stability with Combinatorial MRFs
Ron Bekkerman* HP Labs; Martin Scholz HP Labs; Krishnamurthy Viswanathan HP Labs

Improving Data Mining Utility with Projective Sampling
Mark Last* BGU

Information Theoretic Regularization for Semi-Supervised Boosting
Lei Zheng Wright State University; Shaojun Wang* Wright State University; Yan Liu Wright State University; Chi-Hoon Lee Yahoo

Issues in Evaluation of Stream Learning Algorithms
Joao Gama* University of Porto; Raquel Sebastiao LIAAD; Pedro Rodrigues LIAAD

Large Human Communication Networks: Patterns and a Utility-Driven Generator
Nan Du* CMU; Christos Faloutsos CMU; Bai Wang ; Leman Akoglu Carnegie Mellon University

Large-Scale Behavioral Targeting
Ye Chen* Yahoo! Labs; Dmitry Pavlov Yahoo! Labs; John Canny Computer Science Division University of California Berkeley

Large-Scale Graph Mining Using Backbone Refinement Classes
Andreas Maunz* Freiburg Center for Data Analy; Christoph Helma in-silico toxicology; Stefan Kramer Institut fur Informatik Technische Universitat Munchen

Large-Scale Sparse Logistic Regression
Jun Liu* Arizona State University; Jianhui Chen ASU; Jieping Ye Arizona State University

Learning Optimal Ranking with Tensor Factorization for Tag Recommendation
Steffen Rendle* University of Hildesheim; Leandro Marinho University of Hildesheim; Alexandros Nanopoulos University of Hildesheim; Lars Schmidt-Thieme University of Hildesheim

Learning Patterns in the Dynamics of Biological Networks
Chang hun You* Washington State University; Lawrence Holder Washington State University; Diane Cook Washington State University

Learning with a Nonexhaustive Training Dataset
Murat Dundar* IUPUI; Arun Bhunia Purdue University; Daniel Hirleman Purdue University; Paul Robinson ; Bartek Rajwa Purdue University

Learning, Indexing, and Diagnosing Network Faults
Ting Wang* Georgia Tech; Mudhakar Srivatsa IBM T.J. Watson Research Cente; Dakshi Agrawal ; Ling Liu

Measuring the Effects of Preprocessing Decisions and Network Forces in Dynamic Network Analysis
Jerry Scripps* Michigan State University; Pang-Ning Tan Michigan State University; Abdol-Hossein Esfahanian Michigan State University

Meme-tracking and the Dynamics of the News Cycle
Jure Leskovec* Cornell University; Lars Backstrom Cornell University; Jon Kleinberg Cornell University

MetaFac: Community Discovery via Relational Hypergraph Factorization
Yu-Ru Lin* Arizona State University; Jimeng Sun IBM; Paul Castro IBM; Ravi Konuru IBM; Hari Sundaram ; Aisling Kelliher Arizona State University

Mind the Gaps: Weighting the Unknown in Large-Scale One-Class Collaborative Filtering
Rong Pan* HP Labs; Martin Scholz HP Labs

Mining Broad Latent Query Aspects from Search Sessions
Xuanhui Wang UIUC; Deepayan Chakrabarti Yahoo! Research; Kunal Punera* Yahoo! Research

Mining Discrete Patterns via Binary Matrix Factorization
Bao-Hong Shen Arizona State University; Shuiwang Ji Arizona State University; Jieping Ye* Arizona State University

Mining for the Most Certain Predictions from Dyadic Data
Meghana Deodhar* University of Texas at Austin; Joydeep Ghosh The University of Texas at Austin

Mining Rich Session Context to Improve Web Search
Guangyu Zhu* University of Maryland College Park; Gilad Mishne Yahoo! Search and Advertising Sciences

Mining Social Networks for Personalized Email Prioritization
Shinjae Yoo* Carnegie Mellon University; Yiming Yang ; Frank Lin ; Il-Chul Moon

Characterizing Individual Communication Patterns
Dean Malmgren* Northwestern University; Jake Hofman Yahoo! Research; Luis Amaral Northwestern University; Duncan Watts Yahoo! Research

Multi-focal Learning and Its Application to Customer Service Support
Yong Ge* Rutgers University; Hui Xiong Rutgers University; Wenjun Zhou Rutgers University; Ramendra Sahoo IBM T.J. Watson Research Center; Xiaofeng Gao ; Weili Wu

Name-Ethnicity Classification from Open Sources
Anurag Ambekar Stony Brook University; Charles Ward Stony Brook University; Jahangir Mohammed Stony Brook University; Swapna Male Stony Brook University; Steven Skiena* Stony Brook University

New Ensemble Methods for Evolving Data Streams
Albert Bifet* Universitat Politecnica de Cat; Geoff Holmes University of Waikato; Bernhard Pfahringer University of Waikato Hamilton; Richard Kirkby University of Waikato; Ricard Gavalda Universitat Politecnica de Catalunya

On Burstiness-aware Search for Document Sequences
Theodoros Lappas* Univ of California Riverside; Benjamin Arai Univ of California Riverside; Dimitrios Gunopulos UCR NKUA; Manolis Platakis ; Dimitrios Kotsakos

On Compressing Social Networks
Flavio Chierichetti ; Ravi Kumar* Yahoo; Silvio Lattanzi ; Michael Mitzenmacher ; Alessandro Panconesi ; Prabhakar Raghavan

On the Tradeoff Between Privacy and Utility in Data Publishing
Tiancheng Li* Purdue University; Ninghui Li Purdue University

Optimizing Web Traffic via the Media Scheduling Problem
Lars Backstrom* Cornell University; Jon Kleinberg Cornell University; Ravi Kumar Yahoo

Parallel Community Detection on Large Networks with Propinquity Dynamics
Yuzhou Zhang* Tsinghua University; Jianyong Wang Tsinghua University; Yi Wang Google Beijing Research; Lizhu Zhou Tsinghua University

Primal Sparse Max-Margin Markov Networks
Jun Zhu* Tsinghua University; Eric Xing Carnegie Mellon University; Bo Zhang Tsinghua University

Probabilistic Frequent Itemset Mining in Uncertain Databases
Matthias Renz* Ludwig-Maximilians-Universitat; Thomas Bernecker Ludwig-Maximilians-Universitat Munchen; Florian Verhein Ludwig-Maximilians-Universitat Munchen; Andreas Zuefle Ludwig-Maximilians-Universitat Munchen; Hans-Peter Kriegel University of Munich

Quantification and Semi-supervised Classification Methods for Handling Changes in Class Distribution
Jack Chongjie Xue* Fordham University; Gary Weiss Fordham University

Ranking-Based Clustering of Heterogeneous Information Networks with Star Network Schema
Yizhou Sun* UIUC; Yintao Yu UIUC; Jiawei Han University of Illinois at Urbana-Champaign

Regression based Latent Factor Models
Deepak Agarwal* Yahoo!; Bee-Chung Chen Yahoo!

Regret-based Online Ranking for a Growing Digital Library
Erick Delage* Stanford University

Relational Learning via Latent Social Dimensions
Lei Tang* Arizona State University; Huan Liu

Scalable Graph Clustering Using Flows: Applications to Community Discovery
Venu Satuluri The Ohio State University; Srinivasan Parthasarathy* Ohio State University

Scalable Pseudo-Likelihood Estimation in Hybrid Random Fields
Antonino Freno* University of Siena; Edmondo Trentin ; Marco Gori

Social Influence Analysis in Large-scale Networks
Jie Tang* Tsinghua University; Jimeng Sun IBM TJ Watson Research Center; Chi Wang Tsinghua Univ.

Spatial-temporal Causal Modeling for Climate Change Attribution
Aurelie Lozano* IBM Research; Hongfei Li IBM Research; Alexandru Niculescu-Mizil IBM Research; Yan Liu IBM Research; Claudia Perlich IBM USA; Jonathan Hosking IBM Research; Naoki Abe IBM T J Watson Research Center

Structured Correspondence Topic Models for Mining Captioned Figures in Biological Literature
Amr Ahmed* Carnegie Mellon University; Eric Xing Carnegie Mellon University; William Cohen Carnegie Mellon University; Robert Murphy Carnegie Mellon University

TANGENT: A Novel, “Surprise-Me”, Recommendation Algorithm
Kensuke Onuma Sony Corporation; Hanghang Tong* CMU; Christos Faloutsos CMU

Tell Me Something I Don’t Know: Randomization Strategies for Iterative Data Mining
Sami Hanhijarvi* Helsinki Univ. of Technology; Markus Ojala Helsinki University of Technology; Niko Vuokko ; Kai Puolamaki ; Nikolaj Tatti Helsinki Univ. of Technology; Heikki Mannila

Temporal Mining for Interactive Workflow Data Analysis
Michele Berlingerio* KDD Lab Pisa ISTI C.N.R.; Fosca Giannotti ISTI CNR; Mirco Nanni KDD Lab – ISTI – CNR; Fabio Pinelli Isti – CNR – Italy Pisa

The Offset Tree for Learning with Partial Labels
John Langford* ; Alina Beygelzimer IBM

Time Series Shapelets: A New Primitive for Data Mining
Lexiang Ye* UC Riverside; Eamonn Keogh UC Riverside

Toward Autonomic Grids: Analyzing the Job Flow with Affinity Streaming
Xiangliang Zhang* INRIA; Cyril Furtlehner ; Julien Perez ; Cecile Germain-Renaud Universite Paris Sud; Michele Sebag Universite Paris-Sud

Towards Efficient Mining of Proportional Fault-Tolerant Frequent Itemsets
Ardian Poernomo* Nanyang Technological Universi; Vivekanand Gopalkrishnan Nanyang Technological Universi

TrustWalker: A Random Walk Model for Combining Trust-based and Item-based Recommendation
Mohsen Jamali* Simon Fraser University; Martin Ester Simon Fraser University

Turning Down the Noise in the Blogosphere
Khalid El-Arini Carnegie Mellon University; Gaurav Veda ; Dafna Shahaf ; Carlos Guestrin

User Grouping Behavior in Online Forums
Xiaolin Shi* University of Michigan; Jun Zhu Tsinghua University; Rui Cai Microsoft Research; Lei Zhang Microsoft Research Asia

Using Graph-based Metrics with Empirical Risk Minimization to Speed Up Active Learning on Networked Data
Sofus Macskassy* Fetch Technologies Inc.

WhereNext: a Location Predictor on Trajectory Pattern Mining
Anna Monreale Isti – CNR – Italy Pisa; Fabio Pinelli Isti – CNR – Italy Pisa; Roberto Trasarti* Isti – CNR – Italy Pisa; Fosca Giannotti ISTI CNR

Industrial Papers

A Case Study of Behavior-driven Conjoint Analysis on Yahoo! Front Page Today Module

Wei Chu*, Yahoo! Labs; Seung-Taek Park, Yahoo! Inc.; Todd Beaupre, Yahoo! Inc.; Nitin Motgi, Yahoo! Inc.; Amit Phadke, Yahoo! Inc.; Seinjuti Chakraborty, Yahoo! Inc.; Joe Zachariah, Yahoo! Inc.

Address Standardization with Latent Semantic Association

Honglei Guo*, IBM China Research Lab; Huijia Zhu, IBM China Research Lab; Zhili Guo, IBM China Research Lab; Xiaoxun Zhang, IBM China Research Lab; Zhong Su, IBM China Research Lab

Anonymizing Healthcare Data: A Case Study on the Blood Transfusion Service

Noman Mohammed, Concordia University; Benjamin C. M. Fung*, Concordia University; Patrick C. K. Hung, University of Ontario Institute of Technology; Cheuk-kwong Lee, Hong Kong Red Cross Blood Transfusion Service

Applying Syntactic Similarity Algorithms for Enterprise Information Management

Lucy Cherkasova*, HPLabs; Kave Eshghi, HPLabs; Brad Morrey, HPLabs; Joseph Tucek, HPLabs; Alistair Veitch, HPLabs

Beyond Blacklists: Learning to Detect Malicious Web Sites from Suspicious URLs

Justin Ma*, UC San Diego; Lawrence Saul, UCSD; Stefan Savage, UC San Diego; Geoffrey Voelker, UC San Diego

BGP-lens: Patterns and Anomalies in Internet Routing Updates

B. Aditya Prakash*, Carnegie Mellon University; Nicholas Valler, UCR; David Andersen, CMU; Michalis Faloutsos, UCR; Christos Faloutsos, CMU

Can We Learn a Template-Independent Wrapper for News Article Extraction from a Single Training Site?

Junfeng Wang*, Zhejiang University; Xiaofei He, ; Can Wang, ; Jian Pei, Simon Fraser University; Jiajun Bu, ; Chun Chen, ; Ziyu Guan, ; Wei Vivian Zhang, Microsoft

Catching the Drift: Learning Broad Matches from Clickthrough Data

Sonal Gupta*, University of Texas at Austin; Mikhail Bilenko, Microsoft Research; Matthew Richardson, Microsoft Research

Clustering of Event Logs Using Iterative Partitioning

Adetokunbo Makanju*, Dalhousie University; Nur Zincir-Heywood, Dalhousie University; Evangelos Milios, Dalhousie University

COA: Finding Novel Patents through Text Analysis

Mohammad Al Hasan*, RPI; W. Scott Spangler, IBM Corporation; Thomas Griffin, IBM Corporation; Alfredo Alba, IBM Corporation

Enabling Analysts in Managed Services for CRM Analytics

Indrajit Bhattacharya, IBM Research; Shantanu Godbole*, IBM Research; Ajay Gupta, IBM Research; Ashish Verma, IBM Research; Jeff Achtermann, IBM MBPS; Kevin English, IBM

Entity Discovery and Assignment for Opinion Mining Applications

Xiaowen Ding*, Univ of Illinois at Chicago; Bing Liu, UIC; Lei Zhang, UIC

Grocery Shopping Recommendations Based on Basket-Sensitive Random Walk

Ming Li*, Unilever UK; Malcolm Dias, Unilever UK; Ian Jarman, Liverpool John Moores University; Wael El-Deredy, University of Manchester; Paulo Lisboa, Liverpool John Moores University

Incorporating Site-Level Knowledge for Incremental Crawling of Web Forums: A List-wise Strategy

Jiang-Ming Yang*, Microsoft Research Asia; Rui Cai, Microsoft Research; Chunsong Wang, University of Wisconsin-Madison; Hua Huang, Beijing University of Posts and Telecommunications; Lei Zhang, Microsoft Research Asia; Wei-Ying Ma, Microsoft Research Asia

Intelligent File Scoring System for Malware Detection from the Gray List

Tao Li*, Florida International University

Learning Dynamic Temporal Graphs for Oil-drilling Equipment Monitoring System

Yan Liu*, IBM Research; Jayant Kalagnanam

Migration Motif: A Spatial-Temporal Pattern Mining Approach for Financial Markets

Xiaoxi Du, KSU; Ruoming Jin*, Kent State University; Liang Ding, Kent State University; Victor Lee, Kent State University; John Thornton, Kent State University

Mining Brain Region Connectivity for Alzheimer’s Disease Study via Sparse Inverse Covariance Estimation

Liang Sun*, Arizona State University; Rinkal Patel, Arizona State University; Jun Liu, Arizona State University; Kewei Chen, Neuroimaging Banner Alzheimer’s Institute; Teresa Wu, Arizona State University; Jing Li, Arizona State University; Eric Reiman, Banner Alzheimer’s Institute and Banner PET Center; Jieping Ye, Arizona State University

Modeling and Predicting User Behavior in Sponsored Search

Joshua Attenberg*, NYU Polytechnic Institute; Torsten Suel, Yahoo Research; Sandeep Pandey, Yahoo Research

Named Entity Mining from Click-Through Log Using Weakly Supervised Latent Dirichlet Allocation

Gu Xu*, Microsoft Research Asia; Shuang-Hong Yang, Georgia Tech; Hang Li, Microsoft Research Asia

Network Anomaly Detection based on Eigen Equation Compression

Shunsuke Hirose*, NEC Corporation; Kenji Yamanishi, ; Takayuki Nakata, ; Ryohei Fujimaki

OLAP on Search Logs: An Infrastructure Supporting Data-Driven Applications in Search Engines

Bin Zhou, Simon Fraser University; Daxin Jiang*, MSRA; Jian Pei, Simon Fraser University; Hang Li, Microsoft Research Asia

OpinionMiner: A Machine Learning System for Web Opinion Mining and Extraction

Wei Jin*, North Dakota State University; Hung Hay Ho

Pervasive Parallelism in Data Mining: Dataflow solution to Co-clustering Large and Sparse Netflix Data

Srivatsava Daruru, University of Texas at Austin; Nena Marin*, Pervasive Software; Matthew Walker, Pervasive Software; Joydeep Ghosh, The University of Texas at Austin

Predicting Bounce Rates in Sponsored Search Advertisements

D. Sculley*, Google, Inc.; Robert Malkin, Google, Inc.; Sugato Basu, Google, Inc.; Roberto Bayardo, Google

PSkip: Estimating Relevance Ranking Quality from Web Search Clickthrough Data

Kuansan Wang*, Microsoft Research; Toby Walker, ; Zijian Zheng

Query Result Clustering for Object-level Search

Jongwuk Lee, ; Seung-won Hwang*, Postech; Zaiqing Nie, ; Ji-Rong Wen, Microsoft Research Asia

Improving Classification Accuracy Using Automatically Extracted Training Data

Ariel Fuxman*, Microsoft, USA; Anitha Kanna, Microsoft, USA; Andrew Goldberg, University of Wisconsin; Rakesh Agrawal, Microsoft; Panayiotis Tsaparas, Microsoft

Sentiment Analysis of Blogs by Combining Lexical Knowledge with Text Classification

Prem Melville*, IBM; Wojciech Gryc, ; Richard Lawrence, IBM, USA

Seven Pitfalls to Avoid when Running Controlled Experiments on the Web

Thomas Crook, Microsoft; Brian Frasca, Microsoft; Ron Kohavi*, Microsoft; Roger Longbotham, Microsoft

SNARE: A Link Analytic System for Graph Labeling and Risk Detection

Mary McGlohon*, Carnegie Mellon University; Stephen Bay, PricewaterhouseCoopers; Markus Anderle, PricewaterhouseCoopers; David Steier, PricewaterhouseCoopers; Christos Faloutsos, CMU

Sustainable Operation and Management of Data Center Chillers using Temporal Data Mining

Debprakash Patnaik, Virginia Tech; Manish Marwah, HP Labs; Ratnesh Sharma, HP Labs; Naren Ramakrishnan*, Virginia Tech

Towards a Universal Marketplace over the Web: Statistical Multi-label Classification of Service Provider Forms with Simulated Annealing

Kivanc Ozonat*, HP Labs

Towards Combining Web Classification and Web Information Extraction: A Case Study

Ping Luo*, HP Labs China

Source – http://www.kdd.org/kdd/2009/papers.html