Interview- Top Data Mining Blogger on Earth , Sandro Saitta

Surajustement Modèle 2
Image via Wikipedia

If you do a Google search for Data Mining Blog- for the past several years one Blog will come on top. data mining blog – Google Search http://bit.ly/kEdPlE

To honor 5 years of Sandro Saitta’s blog (yes thats 5 years!) , we cover an exclusive interview with him where he reveals his unique sauce for cool techie blogging.

Ajay- Describe your journey as a scientist and data miner, from early experiences, to schooling to your work/research/blogging.

Sandro- My first experience with data mining was my master project. I used decision tree to predict pollen concentration for the following week using input data such as wind, temperature and rain. The fact that an algorithm can make a computer learn from experience was really amazing to me. I found it so interesting that I started a PhD in data mining. This time, the field of application was civil engineering. Civil engineers put a lot of sensors on their structure in order to understand how they behave. With all these sensors they generate a lot of data. To interpret these data, I used data mining techniques such as feature selection and clustering. I started my blog, Data Mining Research, during my PhD, to share with other researchers.

I then started applying data mining in the stock market as my first job in industry. I realized the difference between image recognition, where 99% correct classification rate is state of the art, and stock market, where you’re happy with 55%. However, the company ambiance was not as good as I thought, so I moved to consulting. There, I applied data mining in behavioral targeting to increase click-through rates. When you compare the number of customers who click with the ones who don’t, then you really understand what class imbalance mean. A few months ago, I accepted a very good opportunity at SICPA. I’m looking forward to resolving new challenges there.

Ajay- Your blog is the top ranked blog for “data mining blog”. Could you share some tips on better blogging for analytics and technical people

Sandro- It’s always difficult to start a blog, since at the beginning you have no reader. Writing for nobody may seem stupid, but it is not. By writing my first posts during my PhD I was reorganizing my ideas. I was expressing concepts which were not always clear to me. I thus learned a lot and also improved my English level. Of course, it’s still not perfect, but I hope most people can understand me.

Next come the readers. A few dozen each week first. To increase this number, I then started to learn SEO (Search Engine Optimization) by reading books and blogs. I tested many techniques that increased Data Mining Research visibility in the blogosphere. I think SEO is interesting when you already have some content published (which means not at the very beginning of your blog). After a while, once your blog is nicely ranked, the main task is to work on the content of the blog. To be of interest, your content must be particular: original, informative or provocative for example. I also had the chance to have a good visibility thanks to well-known people in the field like Kevin Hillstrom, Gregory Piatetsky-Shapiro, Will Dwinnell / Dean Abbott, Vincent Granville, Matthew Hurst and many others.

Ajay- Whats your favorite statistical software and what are the various softwares that you have worked with.
Could you compare and contrast these software as well.

Sandro- My favorite software at this point is SAS. I worked with it for two years. Once you know the language, you can perform ETL and data mining so easily. It’s also very fast compared to others. There are a lot of tools for data mining, but I cannot think of a tool that is as powerful as SAS and, in the same time, has a high-level programming language behind it.

I also worked with R and Matlab. R is very nice since you have all the up-to-date data mining algorithms implemented. However, working in the memory is not always a good choice, especially for ETL. Matlab is an excellent tool for prototyping. It’s not so fast and certainly not done for ETL, but the price is low regarding all the possibilities for data mining. According to me, SAS is the best choice for ETL and a good choice for data mining. Of course, there is the price.

Ajay- What are your favorite techniques and training resources for learning basics of data mining to say statisticians or business management graduates.

Sandro- I’m the kind of guy who likes to read books. I read data mining books one after the other. The fact that the same concepts are explained differently (and by different people) helps a lot in learning a topic like data mining. Of course, nothing replaces experience in the field. You can read hundreds of books, you will still not be a good practitioner until you really apply data mining in specific fields. My second choice after books is blogs. By reading data mining blogs, you will really see the issues and challenges in the field. It’s still not experience, but we are closer. Finally, web resources and networks such as KDnuggets of course, but also AnalyticBridge and LinkedIn.

Ajay- Describe your hobbies and how they help you ,if at all in your professional life.

Sandro- One of my hobbies is reading. I read a lot of books about data mining, SEO, Google as well as Sci-Fi and Fantasy. I’m a big fan of Asimov by the way. My other hobby is playing tennis. I think I simply use my hobbies as a way to find equilibrium in my life. I always try to find the best balance between work, family, friends and sport.

Ajay- What are your plans for your website for 2011-2012.

Sandro- I will continue to publish guest posts and interviews. I think it is important to let other people express themselves about data mining topics. I will not write about my current applications due to the policies of my current employer. But don’t worry, I still have a lot to write, whether it is technical or not. I will also emphasis more on my experience with data mining, advices for data miners, tips and tricks, and of course book reviews!

Standard Disclosure of Blogging- Sandro awarded me the Peoples Choice award for his blog for 2010 and carried out my interview. There is a lot of love between our respective wordpress blogs, but to reassure our puritan American readers- it is platonic and intellectual.

About Sandro S-



Sandro Saitta is a Data Mining Research Engineer at SICPA Security Solutions. He is also a blogger at Data Mining Research (www.dataminingblog.com). His interests include data mining, machine learning, search engine optimization and website marketing.

You can contact Mr Saitta at his Twitter address- 

https://twitter.com/#!/dataminingblog

KDNuggets Survey on R

CRISP-DM
Image via Wikipedia

From http://www.kdnuggets.com/2011/03/new-poll-r-in-analytics-data-mining-work.html?k11n07

A new poll/survey on actual usage of R in Data Mining

R has been steadily growing in popularity among data miners and analytic professionals.

In KDnuggets 2010 Data Mining / Analytic Tools Poll, R was used by 30% of respondents.
In 2010 Rexer Analytics Data Miner SurveyR was the most popular tool, used by 43% of the data miners.

Another aspect of tool usefulness is how much does it help with the entire data mining process from data preparation and cleaning, modeling, evaluation, visualization and presentation (excluding deployment).

New KDnuggets Poll is asking:
What part of your analytics / data mining work in the past 12 months was done in R?

http://www.kdnuggets.com/2011/03/new-poll-r-in-analytics-data-mining-work.html?k11n07

 

PAWCON -This week in London

Watch out for the twitter hash news on PAWCON and the exciting agenda lined up. If your in the City- you may want to just drop in

http://www.predictiveanalyticsworld.com/london/2010/agenda.php#day1-7

Disclaimer- PAWCON has been a blog partner with Decisionstats (since the first PAWCON ). It is vendor neutral and features open source as well proprietary software, as well case studies from academia and Industry for a balanced view.

 

Little birdie told me some exciting product enhancements may be in the works including a not yet announced R plugin 😉 and the latest SAS product using embedded analytics and Dr Elder’s full day data mining workshop.

Citation-

http://www.predictiveanalyticsworld.com/london/2010/agenda.php#day1-7

Monday November 15, 2010
All conference sessions take place in Edward 5-7

8:00am-9:00am

Registration, Coffee and Danish
Room: Albert Suites


9:00am-9:50am

Keynote
Five Ways Predictive Analytics Cuts Enterprise Risk

All business is an exercise in risk management. All organizations would benefit from measuring, tracking and computing risk as a core process, much like insurance companies do.

Predictive analytics does the trick, one customer at a time. This technology is a data-driven means to compute the risk each customer will defect, not respond to an expensive mailer, consume a retention discount even if she were not going to leave in the first place, not be targeted for a telephone solicitation that would have landed a sale, commit fraud, or become a “loss customer” such as a bad debtor or an insurance policy-holder with high claims.

In this keynote session, Dr. Eric Siegel will reveal:

  • Five ways predictive analytics evolves your enterprise to reduce risk
  • Hidden sources of risk across operational functions
  • What every business should learn from insurance companies
  • How advancements have reversed the very meaning of fraud
  • Why “man + machine” teams are greater than the sum of their parts for
  • enterprise decision support

 

Speaker: Eric Siegel, Ph.D., Program Chair, Predictive Analytics World

Top of this page ] [ Agenda overview ]


IBM9:50am-10:10am

Platinum Sponsor Presentation
The Analytical Revolution

The algorithms at the heart of predictive analytics have been around for years – in some cases for decades. But now, as we see predictive analytics move to the mainstream and become a competitive necessity for organisations in all industries, the most crucial challenges are to ensure that results can be delivered to where they can make a direct impact on outcomes and business performance, and that the application of analytics can be scaled to the most demanding enterprise requirements.

This session will look at the obstacles to successfully applying analysis at the enterprise level, and how today’s approaches and technologies can enable the true “industrialisation” of predictive analytics.

Speaker: Colin Shearer, WW Industry Solutions Leader, IBM UK Ltd

Top of this page ] [ Agenda overview ]


Deloitte10:10am-10:20am

Gold Sponsor Presentation
How Predictive Analytics is Driving Business Value

Organisations are increasingly relying on analytics to make key business decisions. Today, technology advances and the increasing need to realise competitive advantage in the market place are driving predictive analytics from the domain of marketers and tactical one-off exercises to the point where analytics are being embedded within core business processes.

During this session, Richard will share some of the focus areas where Deloitte is driving business transformation through predictive analytics, including Workforce, Brand Equity and Reputational Risk, Customer Insight and Network Analytics.

Speaker: Richard Fayers, Senior Manager, Deloitte Analytical Insight

Top of this page ] [ Agenda overview ]


10:20am-10:45am

Break / Exhibits
Room: Albert Suites


10:45am-11:35am
Healthcare
Case Study: Life Line Screening
Taking CRM Global Through Predictive Analytics

While Life Line is successfully executing a US CRM roadmap, they are also beginning this same evolution abroad. They are beginning in the UK where Merkle procured data and built a response model that is pulling responses over 30% higher than competitors. This presentation will give an overview of the US CRM roadmap, and then focus on the beginning of their strategy abroad, focusing on the data procurement they could not get anywhere else but through Merkle and the successful modeling and analytics for the UK.

Speaker: Ozgur Dogan, VP, Quantitative Solutions Group, Merkle Inc.

Speaker: Trish Mathe, Life Line Screening

Top of this page ] [ Agenda overview ]


11:35am-12:25pm
Open Source Analytics; Healthcare
Case Study: A large health care organization
The Rise of Open Source Analytics: Lowering Costs While Improving Patient Care

Rapidminer and R were the number 1 and 2 in this years annual KDNuggets data mining tool usage poll, followed by Knime on place 4 and Weka on place 6. So what’s going on here? Are these open source tools really that good or is their popularity strongly correlated with lower acquisition costs alone? This session answers these questions based on a real world case for a large health care organization and explains the risks & benefits of using open source technology. The final part of the session explains how these tools stack up against their traditional, proprietary counterparts.

Speaker: Jos van Dongen, Associate & Principal, DeltIQ Group

Top of this page ] [ Agenda overview ]


12:25pm-1:25pm

Lunch / Exhibits
Room: Albert Suites


1:25pm-2:15pm
Keynote
Thought Leader:
Case Study: Yahoo! and other large on-line e-businesses
Search Marketing and Predictive Analytics: SEM, SEO and On-line Marketing Case Studies

Search Engine Marketing is a $15B industry in the U.S. growing to double that number over the next 3 years. Worldwide the SEM market was over $50B in 2010. Not only is this a fast growing area of marketing, but it is one that has significant implications for brand and direct marketing and is undergoing rapid change with emerging channels such as mobile and social. What is unique about this area of marketing is a singularly heavy dependence on analytics:

 

  • Large numbers of variables and options
  • Real-time auctions/bids and a need to adjust strategies in real-time
  • Difficult optimization problems on allocating spend across a huge number of keywords
  • Fast-changing competitive terrain and heavy competition on the obvious channels
  • Complicated interactions between various channels and a large choice of search keyword expansion possibilities
  • Profitability and ROI analysis that are complex and often challenging

 

The size of the industry, its growing importance in marketing, its upcoming role in Mobile Advertising, and its uniquely heavy reliance on analytics makes it particularly interesting as an area for predictive analytics applications. In this session, not only will hear about some of the latest strategies and techniques to optimize search, you will hear case studies that illustrate the important role of analytics from industry practitioners.

Speaker: Usama Fayyad, , Ph.D., CEO, Open Insights

Top of this page ] [ Agenda overview ]


SAS2:15pm-2:35pm

Platinum Sponsor Presentation
Creating a Model Factory Using in-Database Analytics

With the ever-increasing number of analytical models required to make fact-based decisions, as well as increasing audit compliance regulations, it is more important than ever that these models can be created, monitored, retuned and deployed as quickly and automatically as possible. This paper, using a case study from a major financial organisation, will show how organisations can build a model factory efficiently using the latest SAS technology that utilizes the power of in-database processing.

Speaker: John Spooner, Analytics Specialist, SAS (UK)

Top of this page ] [ Agenda overview ]


2:35pm-2:45pm

Session Break
Room: Albert Suites


2:45pm-3:35pm

Retail
Case Study: SABMiller
Predictive Analytics & Global Marketing Strategy

Over the last few years SABMiller plc, the second largest brewing company in the world operating in 70 countries, has been systematically segmenting its markets in different countries globally in order optimize their portfolio strategy & align it to their long term country specific growth strategy. This presentation talks about the overall methodology followed and the challenges that had to be overcome both from a technical as well as from a change management stand point in order to successfully implement a standard analytics approach to diverse markets and diverse business positions in a highly global setting.

The session explains how country specific growth strategies were converted to objective variables and consumption occasion segments were created that differentiated the market effectively by their growth potential. In addition to this the presentation will also provide a discussion on issues like:

  • The dilemmas of static vs. dynamic solutions and standardization vs. adaptable solutions
  • Challenges in acceptability, local capability development, overcoming implementation inertia, cost effectiveness, etc
  • The role that business partners at SAB and analytics service partners at AbsolutData together play in providing impactful and actionable solutions

 

Speaker: Anne Stephens, SABMiller plc

Speaker: Titir Pal, AbsolutData

Top of this page ] [ Agenda overview ]


3:35pm-4:25pm

Retail
Case Study: Overtoom Belgium
Increasing Marketing Relevance Through Personalized Targeting

 

Since many years, Overtoom Belgium – a leading B2B retailer and division of the French Manutan group – focuses on an extensive use of CRM. In this presentation, we demonstrate how Overtoom has integrated Predictive Analytics to optimize customer relationships. In this process, they employ analytics to develop answers to the key question: “which product should we offer to which customer via which channel”. We show how Overtoom gained a 10% revenue increase by replacing the existing segmentation scheme with accurate predictive response models. Additionally, we illustrate how Overtoom succeeds to deliver more relevant communications by offering personalized promotional content to every single customer, and how these personalized offers positively impact Overtoom’s conversion rates.

Speaker: Dr. Geert Verstraeten, Python Predictions

Top of this page ] [ Agenda overview ]


4:25pm-4:50pm

Break / Exhibits
Room: Albert Suites


4:50pm-5:40pm
Uplift Modelling:
Case Study: Lloyds TSB General Insurance & US Bank
Uplift Modelling: You Should Not Only Measure But Model Incremental Response

Most marketing analysts understand that measuring the impact of a marketing campaign requires a valid control group so that uplift (incremental response) can be reported. However, it is much less widely understood that the targeting models used almost everywhere do not attempt to optimize that incremental measure. That requires an uplift model.

This session will explain why a switch to uplift modelling is needed, illustrate what can and does go wrong when they are not used and the hugely positive impact they can have when used effectively. It will also discuss a range of approaches to building and assessing uplift models, from simple basic adjustments to existing modelling processes through to full-blown uplift modelling.

The talk will use Lloyds TSB General Insurance & US Bank as a case study and also illustrate real-world results from other companies and sectors.

 

Speaker: Nicholas Radcliffe, Founder and Director, Stochastic Solutions

Top of this page ] [ Agenda overview ]


5:40pm-6:30pm

Consumer services
Case Study: Canadian Automobile Association and other B2C examples
The Diminishing Marginal Returns of Variable Creation in Predictive Analytics Solutions

 

Variable Creation is the key to success in any predictive analytics exercise. Many different approaches are adopted during this process, yet there are diminishing marginal returns as the number of variables increase. Our organization conducted a case study on four existing clients to explore this so-called diminishing impact of variable creation on predictive analytics solutions. Existing predictive analytics solutions were built using our traditional variable creation process. Yet, presuming that we could exponentially increase the number of variables, we wanted to determine if this added significant benefit to the existing solution.

Speaker: Richard Boire, BoireFillerGroup

Top of this page ] [ Agenda overview ]


6:30pm-7:30pm

Reception / Exhibits
Room: Albert Suites


Tuesday November 16, 2010
All conference sessions take place in Edward 5-7

8:00am-9:00am

Registration, Coffee and Danish
Room: Albert Suites


9:00am-9:55am
Keynote
Multiple Case Studies: Anheuser-Busch, Disney, HP, HSBC, Pfizer, and others
The High ROI of Data Mining for Innovative Organizations

Data mining and advanced analytics can enhance your bottom line in three basic ways, by 1) streamlining a process, 2) eliminating the bad, or 3) highlighting the good. In rare situations, a fourth way – creating something new – is possible. But modern organizations are so effective at their core tasks that data mining usually results in an iterative, rather than transformative, improvement. Still, the impact can be dramatic.

Dr. Elder will share the story (problem, solution, and effect) of nine projects conducted over the last decade for some of America’s most innovative agencies and corporations:

    Streamline:

  • Cross-selling for HSBC
  • Image recognition for Anheuser-Busch
  • Biometric identification for Lumidigm (for Disney)
  • Optimal decisioning for Peregrine Systems (now part of Hewlett-Packard)
  • Quick decisions for the Social Security Administration
    Eliminate Bad:

  • Tax fraud detection for the IRS
  • Warranty Fraud detection for Hewlett-Packard
    Highlight Good:

  • Sector trading for WestWind Foundation
  • Drug efficacy discovery for Pharmacia & UpJohn (now Pfizer)

Moderator: Eric Siegel, Program Chair, Predictive Analytics World

Speaker: John Elder, Ph.D., Elder Research, Inc.

Also see Dr. Elder’s full-day workshop

 

Top of this page ] [ Agenda overview ]


9:55am-10:30am

Break / Exhibits
Room: Albert Suites


10:30am-11:20am
Telecommunications
Case Study: Leading Telecommunications Operator
Predictive Analytics and Efficient Fact-based Marketing

The presentation describes what are the major topics and issues when you introduce predictive analytics and how to build a Fact-Based marketing environment. The introduced tools and methodologies proved to be highly efficient in terms of improving the overall direct marketing activity and customer contact operations for the involved companies. Generally, the introduced approaches have great potential for organizations with large customer bases like Mobile Operators, Internet Giants, Media Companies, or Retail Chains.

Main Introduced Solutions:-Automated Serial Production of Predictive Models for Campaign Targeting-Automated Campaign Measurements and Tracking Solutions-Precise Product Added Value Evaluation.

Speaker: Tamer Keshi, Ph.D., Long-term contractor, T-Mobile

Speaker: Beata Kovacs, International Head of CRM Solutions, Deutsche Telekom

Top of this page ] [ Agenda overview ]


11:20am-11:25am

Session Changeover


11:25am-12:15pm
Thought Leader
Nine Laws of Data Mining

Data mining is the predictive core of predictive analytics, a business process that finds useful patterns in data through the use of business knowledge. The industry standard CRISP-DM methodology describes the process, but does not explain why the process takes the form that it does. I present nine “laws of data mining”, useful maxims for data miners, with explanations that reveal the reasons behind the surface properties of the data mining process. The nine laws have implications for predictive analytics applications: how and why it works so well, which ambitions could succeed, and which must fail.

 

Speaker: Tom Khabaza, khabaza.com

 

Top of this page ] [ Agenda overview ]


12:15pm-1:30pm

Lunch / Exhibits
Room: Albert Suites


1:30pm-2:25pm
Expert Panel: Kaboom! Predictive Analytics Hits the Mainstream

Predictive analytics has taken off, across industry sectors and across applications in marketing, fraud detection, credit scoring and beyond. Where exactly are we in the process of crossing the chasm toward pervasive deployment, and how can we ensure progress keeps up the pace and stays on target?

This expert panel will address:

  • How much of predictive analytics’ potential has been fully realized?
  • Where are the outstanding opportunities with greatest potential?
  • What are the greatest challenges faced by the industry in achieving wide scale adoption?
  • How are these challenges best overcome?

 

Panelist: John Elder, Ph.D., Elder Research, Inc.

Panelist: Colin Shearer, WW Industry Solutions Leader, IBM UK Ltd

Panelist: Udo Sglavo, Global Analytic Solutions Manager, SAS

Panel moderator: Eric Siegel, Ph.D., Program Chair, Predictive Analytics World


2:25pm-2:30pm

Session Changeover


2:30pm-3:20pm
Crowdsourcing Data Mining
Case Study: University of Melbourne, Chessmetrics
Prediction Competitions: Far More Than Just a Bit of Fun

Data modelling competitions allow companies and researchers to post a problem and have it scrutinised by the world’s best data scientists. There are an infinite number of techniques that can be applied to any modelling task but it is impossible to know at the outset which will be most effective. By exposing the problem to a wide audience, competitions are a cost effective way to reach the frontier of what is possible from a given dataset. The power of competitions is neatly illustrated by the results of a recent bioinformatics competition hosted by Kaggle. It required participants to pick markers in HIV’s genetic sequence that coincide with changes in the severity of infection. Within a week and a half, the best entry had already outdone the best methods in the scientific literature. This presentation will cover how competitions typically work, some case studies and the types of business modelling challenges that the Kaggle platform can address.

Speaker: Anthony Goldbloom, Kaggle Pty Ltd

Top of this page ] [ Agenda overview ]


3:20pm-3:50pm

Breaks /Exhibits
Room: Albert Suites


3:50pm-4:40pm
Human Resources; e-Commerce
Case Study: Naukri.com, Jeevansathi.com
Increasing Marketing ROI and Efficiency of Candidate-Search with Predictive Analytics

InfoEdge, India’s largest and most profitable online firm with a bouquet of internet properties has been Google’s biggest customer in India. Our team used predictive modeling to double our profits across multiple fronts. For Naukri.com, India’s number 1 job portal, predictive models target jobseekers most relevant to the recruiter. Analytical insights provided a deeper understanding of recruiter behaviour and informed a redesign of this product’s recruiter search functionality. This session will describe how we did it, and also reveal how Jeevansathi.com, India’s 2nd-largest matrimony portal, targets the acquisition of consumers in the market for marriage.

 

Speaker: Suvomoy Sarkar, Chief Analytics Officer, HT Media & Info Edge India (parent company of the two companies above)

 

Top of this page ] [ Agenda overview ]


4:40pm-5:00pm
Closing Remarks

Speaker: Eric Siegel, Ph.D., Program Chair, Predictive Analytics World

Top of this page ] [ Agenda overview ]


Wednesday November 17, 2010

Full-day Workshop
The Best and the Worst of Predictive Analytics:
Predictive Modeling Methods and Common Data Mining Mistakes

Click here for the detailed workshop description

  • Workshop starts at 9:00am
  • First AM Break from 10:00 – 10:15
  • Second AM Break from 11:15 – 11:30
  • Lunch from 12:30 – 1:15pm
  • First PM Break: 2:00 – 2:15
  • Second PM Break: 3:15 – 3:30
  • Workshop ends at 4:30pm

Speaker: John Elder, Ph.D., CEO and Founder, Elder Research, Inc.

 

KDNuggets Poll on SAS: Churn in Analytics Users

Here are the some surprising results from the Bible of all Data Miners , KDNuggets.com with some interesting comments about SAS being the Microsoft of analytics.

I believe technically advanced users will probably want to try out R before going in for a commercial license from Revolution Analytics as it is free to try out. Also WPS offers a one month free preview for its software- the latest release of it competes with SAS/Stat and SAS/Access, SAS/Graph and Base SAS- so anyone having these installations on a server would be interested to atleast test it for free. Also WPS would be interested in increasing engines (like they have for Oracle and Teradata).

One very crucial difference for SAS is it’s ability to pull in data from almost all data formats- so if you are using SAS/Connect to remote submit code- then you may not be able to switch soon.

Also the more license heavy customers are not the kind of cutomers who have lots of data in their local desktops but is usually pulled and then crunched before analysed. R has recently made some strides with the RevoScaler package from Revolution Analytics but it’s effectiveness would be tested and tried in the coming months- it seems like a great step in the right direction.

For SAS, the feedback should be a call to improve their product bundling – some of which can feel like over selling at times- but they have been fighting off challenges since past 4 decades and have the pockets and intention to sustain market share battles including discounts ( for repeat customers SAS can be much cheaper than say a first time user of WPS or R)

http://teamwpc.co.uk/home

This really should come as a surprise to some people. You can see the comments on WPS and R at the site itself. Interesting stufff and we can see after say 1 year to see how many actually DID switch.

http://www.kdnuggets.com/polls/2010/switching-from-sas-to-wps.html

Best of Decision Stats- Modeling and Text Mining Part3

Here are some of the top articles by way of views, in an  area I love– of modeling and text mining.

1) Karl Rexer – Rexer Analytics

http://www.decisionstats.com/2009/06/09/interview-karl-rexer-rexer-analytics/

Karl produces one of the most respected surveys that captures emerging trends in data mining and technology. Karl was also one of the most enthusiastic people I have interviewed- and I am thankful for his help in getting me some more interviews.

2) Gregory Piatesky Shapiro

One of the earliest and easily the best Knowledge Discoverer of all times, Gregory produces http://www.kdnuggets.com and the newsletter is easily the must newsletter to be on. Gregory was doing data mining , while the Google boys were still debating whether to drop out of Stanford or not.
Continue reading “Best of Decision Stats- Modeling and Text Mining Part3”

Interview Gregory Piatetsky KDNuggets.com

Here is an interviw with Gregory Piatetsky, founder and editor of KDNuggets (www.KDnuggets.com ) ,the oldest and biggest independent industry websites in terms of data mining and analytics-

gps6

Ajay- Please describe your career in science, many challenges and rewards that came with it. Name any scientific research, degrees teaching etc.


Gregory-
I was born in Moscow, Russia and went to a top math high-school in Moscow. A unique  challenge for me was that my father was one of leading mathematicians in Soviet Union.  While I liked math (and still do), I quickly realized while still in high school that  I will never be as good as my father, and math career was not for me.

Fortunately, I discovered computers and really liked the process of programming and solving applied problems.  At that time (late 1970s) computers were not very popular and it was not clear that one can make a career in computers.  However I was very lucky that I was able to pursue what I liked and find demand for my skills.

I got my MS in 1979 and PhD in 1984 in Computer Science from New York University.
I was interested in AI (perhaps thanks to a lot of science fiction I read as a kid), but found a job in databases, so I was looking for ways to combine them.

In 1984 I joined GTE Labs where I worked on research in databases and AI, and in 1989 started the first project on Knowledge Discovery in data. To help convince my management that there will be a demand for this thing
called “data mining” (GTE management did not see much future for it), I also organized a AAAI workshop on the topic.

I thought “data mining” is not sexy enough name, and so I called it “Knowledge Discovery in Data”, or KDD.  Since 1989, I was working on KDD and data mining in all aspects – more on my page www.kdnuggets.com/gps.html

Ajay-  How would you encourage a young science entrepreneur in this recession.

Gregory- Many great companies were started or grew in a recession, e.g.
http://www.insidecrm.com/features/businesses-started-slump-111108/

Recession may be compared to a brush fire which removes dead wood and allows new trees to grow.

Ajay- What prompted you to set up KD Nuggets? Any reasons for the name (kNowledge Discovery Nuggets). Describe some key milestones in this iconic website for data mining people.

Gregory- After a third KDD workshop in 1993 I started a newsletter to connect about 50 people who attended the workshop and possibly others who were interested in data mining and KDD.  The idea was that it will have short items or “nuggets” of information. Also, at that time a popular metaphor for data miner was gold miners who were looking for gold “nuggets”.  So, I wanted a newsletter with “nuggets” – short, valuable items about Knowledge Discovery.  Thus, the name KDnuggets.

In 1994 I created a website on data mining at GTE and in 1997, after I left  GTE , I moved it to the current domain name www.kdnuggets.com .

In 1999, I was working for startup which provided data mining services to financial industry.  However, because of Y2K issues, all banks etc froze their systems in the second half of 1999, and we had very little work (and our salaries were reduced as well).  I decided that I will try to get some ads and was able to get companies like SPSS and Megaputer to advertise.

Since 2001, I am an independent consultant and KDnuggets is only part of what I am doing.  I also do data mining consulting, and actively participate in SIGKDD (Director 1998-2005, Chair 2005-2009).

Some people think that KDnuggets is a large company, with publisher, webmaster, editor, ad salesperson, billing dept, etc.  KDnuggets indeed has all this functions, but it is all me and my two cats.

Ajay- I am impressed by the fact KD nuggets is almost a dictionary or encyclopedia for data mining. But apart from advertising you have not been totally commercial- many features of your newsletter remain ad free – you still maintain a minimalistic look and do not take sponsership aligned with one big vendor. What is your vision for KD Nuggets for the years to come to keep it truly independent.

Gregory- My vision for KDnuggets is to be a comprehensive resource for data mining community, and I really enjoyed maintaining such resource for the first 7-8 years completely non-commercially. However, when I became self -employed, I could not do KDnuggets without any income, so I selectively introduced ads, and only those which are relevant to data mining.

I like to think of KDnuggets as a Craiglist for data mining community.

I certainly realize the importance of social media and Web 2.0 (and interested people can follow my tweets at tweeter.com/kdnuggets)  and plan to add more social features to KDnuggets.

Still, just like Wikipedia and Facebook do not make New York Times obsolete, I think there is room and need for an edited website, especially for such a nerdy and not very social group like data miners.

Ajay- What is the worst mistake/error in writing publishing that you did. What is the biggest triumph or high moment in the Nuggets history.

Gregory- My biggest mistake is probably in choosing the name kdnuggets – in retrospect,  I could have used a shorter and easier to spell domain name, but in 1997 I never expected that I will still be publishing www.KDnuggets.com 12 years later.

Ajay- Who are your favourite data mining students ( having known so many people). What qualities do you think set a data mining person apart from other sceinces.

Gregory- I was only an adjunct professor for a short time, so I did not really have data mining students, but I was privileged enough to know many current data mining leaders when they were students.  Among more recent students, I am very impressed with Jure Leskovec, who just finished his PhD and got the best KDD dissertation award.

Ajay- What does Gregory Piatetsky do for fun when he is not informing the world on analytics and knowledge discovery.

Gregory- I enjoy travelling with my family, and in the summer I like biking and windsurfing.
I also read a lot, and currently in the middle of reading Proust (which I periodically dilute by other, lighter books).

Ajay- What is your favourite reading blog and website ? Any India plans to visit.
Gregory
– I visit many blogs on www.kdnuggets.com/websites/blogs.html

and I like especially
– Matthew Hurst blog: Data Mining: Text Mining, Visualization, and Social Media
– Occam’s Razor by Avinash Kaushik, examining web analytics.
– Juice Analytics, blogging about analytics and visualization
– Geeking with Greg, exploring the future of personalized information.

I also like your website decisionstats.com and plan to visit it more frequently

I visited many countries, but not yet India – waiting for the right occasion !

Biography

(http://www.kdnuggets.com/gps.html)

Gregory Piatetsky-Shapiro, Ph.D. is the President of KDnuggets, which provides research and consulting services in the areas of data mining, web mining, and business analytics. Gregory is considered to be one of the founders of the data mining and knowledge discovery field.Gregory edited or co-edited many collections on data mining and knowledge discovery, including two best-selling books: Knowledge Discovery in Databases (AAAI/MIT Press, 1991) and Advances in Knowledge Discovery in Databases (AAAI/MIT Press, 1996), and has over 60 publications in the areas of data mining, artificial intelligence and database research.

Gregory is the founder of Knowledge Discovery in Database (KDD) conference series. He organized and chaired the first three Knowledge Discovery in Databases (KDD) workshops in 1989, 1991, and 1993. He then served as the Chair of KDD Steering committee and guided the conversion of KDD workshops into leading international conferences on data mining. He also was the General Chair of the KDD-98 conference.

Decisionstats on KDNuggets

image

That dear reader is your site third from bottom ( http://kdnuggets.com/websites/blogs.html ) .How big is recognition from this well HE was around and BIG before two Stanford Phd dropouts wrote that text mining algorithm .

Thanks for helping us come there hopefully it is not just April list (see http://www.decisionstats.com/2009/04/april-2-announcement/ )

 

And read this if you want to know how to catch visitors that link to you –http://www.decisionstats.com/privacy/

Use clicky software for row level data on your website . Get it here from this link www.getclicky.com . I used to pay them 5 $ a month before I ran out of $, but the free version is great as well.