Interview- Top Data Mining Blogger on Earth, Sandro Saitta

If you do a Google search for “data mining blog”, one blog has come out on top for the past several years (data mining blog – Google Search: http://bit.ly/kEdPlE).

To honor 5 years of Sandro Saitta’s blog (yes, that’s 5 years!), we present an exclusive interview with him in which he reveals his unique sauce for cool techie blogging.

Ajay- Describe your journey as a scientist and data miner, from early experiences, to schooling to your work/research/blogging.

Sandro- My first experience with data mining was my master’s project. I used a decision tree to predict pollen concentration for the following week using input data such as wind, temperature and rain. The fact that an algorithm can make a computer learn from experience was really amazing to me. I found it so interesting that I started a PhD in data mining. This time, the field of application was civil engineering. Civil engineers put a lot of sensors on their structures in order to understand how they behave. With all these sensors they generate a lot of data. To interpret these data, I used data mining techniques such as feature selection and clustering. I started my blog, Data Mining Research, during my PhD, to share with other researchers.

I then started applying data mining to the stock market in my first job in industry. I realized the difference between image recognition, where a 99% correct classification rate is state of the art, and the stock market, where you’re happy with 55%. However, the company atmosphere was not as good as I had expected, so I moved to consulting. There, I applied data mining in behavioral targeting to increase click-through rates. When you compare the number of customers who click with the ones who don’t, then you really understand what class imbalance means. A few months ago, I accepted a very good opportunity at SICPA. I’m looking forward to tackling new challenges there.
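
(Editor's aside: the class imbalance point can be made concrete with a little arithmetic. The sketch below is only an illustration. The impression and click counts are hypothetical numbers, not figures from Sandro's work, and `majority_baseline` is a helper name invented here. It shows why plain accuracy is misleading when almost nobody clicks: a "model" that always predicts the majority class looks excellent while finding no clickers at all.)

```python
# Hypothetical counts chosen for illustration only; not from the interview.
impressions = 100_000
clicks = 200  # an assumed 0.2% click-through rate


def majority_baseline(n_pos: int, n_total: int) -> tuple[float, float]:
    """Return (accuracy, recall on the positive class) for a classifier
    that always predicts whichever class is more frequent."""
    n_neg = n_total - n_pos
    if n_neg >= n_pos:
        # Always predicts "no click": every negative is right, every click is missed.
        return n_neg / n_total, 0.0
    # Always predicts "click": every positive is right.
    return n_pos / n_total, 1.0


accuracy, recall = majority_baseline(clicks, impressions)
print(f"accuracy={accuracy:.3f}, recall on clicks={recall:.1f}")
```

With these assumed counts the do-nothing baseline is right almost every time yet identifies none of the customers who click, which is why metrics such as recall, precision or lift tend to be more informative than accuracy in this setting.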

Ajay- Your blog is the top-ranked blog for “data mining blog”. Could you share some tips on better blogging for analytics and technical people?

Sandro- It’s always difficult to start a blog, since at the beginning you have no readers. Writing for nobody may seem stupid, but it is not. By writing my first posts during my PhD I was reorganizing my ideas. I was expressing concepts which were not always clear to me. I thus learned a lot and also improved my English. Of course, it’s still not perfect, but I hope most people can understand me.

Next come the readers. A few dozen each week at first. To increase this number, I then started to learn SEO (Search Engine Optimization) by reading books and blogs. I tested many techniques that increased the visibility of Data Mining Research in the blogosphere. I think SEO is interesting when you already have some content published (which means not at the very beginning of your blog). After a while, once your blog is nicely ranked, the main task is to work on the content of the blog. To be of interest, your content must stand out: original, informative or provocative, for example. I also had the chance to get good visibility thanks to well-known people in the field like Kevin Hillstrom, Gregory Piatetsky-Shapiro, Will Dwinnell / Dean Abbott, Vincent Granville, Matthew Hurst and many others.

Ajay- What’s your favorite statistical software, and which software packages have you worked with? Could you compare and contrast them as well?

Sandro- My favorite software at this point is SAS. I worked with it for two years. Once you know the language, you can perform ETL and data mining so easily. It’s also very fast compared to others. There are a lot of tools for data mining, but I cannot think of a tool that is as powerful as SAS and, at the same time, has a high-level programming language behind it.

I also worked with R and Matlab. R is very nice since you have all the up-to-date data mining algorithms implemented. However, working in memory is not always a good choice, especially for ETL. Matlab is an excellent tool for prototyping. It’s not so fast and certainly not designed for ETL, but the price is low given all the possibilities for data mining. In my opinion, SAS is the best choice for ETL and a good choice for data mining. Of course, there is the price.

Ajay- What are your favorite techniques and training resources for teaching the basics of data mining to, say, statisticians or business management graduates?

Sandro- I’m the kind of guy who likes to read books. I read data mining books one after the other. The fact that the same concepts are explained differently (and by different people) helps a lot in learning a topic like data mining. Of course, nothing replaces experience in the field. You can read hundreds of books, but you will still not be a good practitioner until you really apply data mining in specific fields. My second choice after books is blogs. By reading data mining blogs, you will really see the issues and challenges in the field. It’s still not experience, but it gets closer. Finally, there are web resources and networks such as KDnuggets of course, but also AnalyticBridge and LinkedIn.

Ajay- Describe your hobbies and how they help you, if at all, in your professional life.

Sandro- One of my hobbies is reading. I read a lot of books about data mining, SEO, Google as well as Sci-Fi and Fantasy. I’m a big fan of Asimov by the way. My other hobby is playing tennis. I think I simply use my hobbies as a way to find equilibrium in my life. I always try to find the best balance between work, family, friends and sport.

Ajay- What are your plans for your website for 2011-2012?

Sandro- I will continue to publish guest posts and interviews. I think it is important to let other people express themselves about data mining topics. I will not write about my current applications due to the policies of my current employer. But don’t worry, I still have a lot to write, whether it is technical or not. I will also put more emphasis on my experience with data mining, advice for data miners, tips and tricks, and of course book reviews!

Standard Disclosure of Blogging- Sandro awarded me the People’s Choice award on his blog for 2010 and carried out an interview with me. There is a lot of love between our respective WordPress blogs, but to reassure our puritan American readers- it is platonic and intellectual.

About Sandro S-



Sandro Saitta is a Data Mining Research Engineer at SICPA Security Solutions. He is also a blogger at Data Mining Research (www.dataminingblog.com). His interests include data mining, machine learning, search engine optimization and website marketing.

You can contact Mr. Saitta on Twitter- https://twitter.com/#!/dataminingblog

HIGHLIGHTS from REXER Survey: R gives best satisfaction

A summary report from the Rexer Analytics Annual Survey

HIGHLIGHTS from the 4th Annual Data Miner Survey (2010):

• FIELDS & GOALS: Data miners work in a diverse set of fields. CRM / Marketing has been the #1 field in each of the past four years. Fittingly, “improving the understanding of customers”, “retaining customers” and other CRM goals are also the goals identified by the most data miners surveyed.

• ALGORITHMS: Decision trees, regression, and cluster analysis continue to form a triad of core algorithms for most data miners. However, a wide variety of algorithms are being used. This year, for the first time, the survey asked about Ensemble Models, and 22% of data miners report using them. A third of data miners currently use text mining and another third plan to in the future.

• MODELS: About one-third of data miners typically build final models with 10 or fewer variables, while about 28% generally construct models with more than 45 variables.

• TOOLS: After a steady rise across the past few years, the open source data mining software R overtook other tools to become the tool used by more data miners (43%) than any other. STATISTICA, which has also been climbing in the rankings, is selected as the primary data mining tool by the most data miners (18%). Data miners report using an average of 4.6 software tools overall. STATISTICA, IBM SPSS Modeler, and R received the strongest satisfaction ratings in both 2010 and 2009.

• TECHNOLOGY: Data mining most often occurs on a desktop or laptop computer, and frequently the data is stored locally. Model scoring typically happens using the same software used to develop models. STATISTICA users are more likely than other tool users to deploy models using PMML.

• CHALLENGES: As in previous years, dirty data, explaining data mining to others, and difficult access to data are the top challenges data miners face. This year data miners also shared best practices for overcoming these challenges. The best practices are available online.

• FUTURE: Data miners are optimistic about continued growth in the number of projects they will be conducting, and growth in data mining adoption is the number one “future trend” identified. There is room to improve: only 13% of data miners rate their company’s analytic capabilities as “excellent” and only 8% rate their data quality as “very strong”.

Please contact us if you have any questions about the attached report or this annual research program. The 5th Annual Data Miner Survey will be launching next month. We will email you an invitation to participate.

Information about Rexer Analytics is available at www.RexerAnalytics.com. Rexer Analytics continues its impressive journey; see http://www.rexeranalytics.com/Clients.html

My only thought- since most data miners are using multiple tools, including free tools as well as paid software, perhaps a pie chart of market share by revenue and volume would be handy.

Some ideas on comparing diverse data mining projects by data size or complexity would also be useful.

Interview- Karl Rexer, Rexer Analytics

Here is an interview with Karl Rexer of Rexer Analytics. His annual survey is considered a benchmark in the data mining and analytics industry. Here Karl talks of his career, his annual survey and his views on the industry direction and trends.

Almost 20% of data miners report that their company/organizations have only minimal analytic capabilities – Karl Rexer

Ajay- Describe your career in science. What advice would you give to young science graduates in this recession? What advice would you give to high school students choosing between science and non-science careers?

Karl- My interests in science began as a child. My father has multiple science degrees, and I grew up listening to his descriptions of the cool things he was building, or the cool investigative tools he was using, in his lab. He worked in an industrial setting, so visiting was difficult. But when I could, I loved going in to see the high-temperature furnaces he was designing, the carbon-fiber production processes he was developing, and the electron microscope that allowed him to look at his samples. Both of my parents encouraged me to ask why, and to think critically about both scientific and social issues. It was also the time of the Apollo moon landings, and I was totally absorbed in watching and thinking about them. Together these things motivated me and shaped my world-view.

I have also had the good fortune to work across many diverse areas and with some truly outstanding people. In graduate school I focused on applied statistics and the use of scientific methods in the social sciences. As a grad student and young academic, I applied those skills to researching how our brains process language. But on the side, I pursued a passion for using the scientific method and analytics to address... well, anything I could. We called it “statistical consulting” then, but it often extended to research design and many other parts of the scientific process. Some early projects included assisting people with AIDS outcome studies, psycholinguistic research, and studies of adolescent adjustment.

My first taste of applying these skills outside of an academic environment was with my mentor Len Katz. The US Navy hired us to help assess the new recruits that were entering the submarine school. Early identification of sailors who would excel in this unusual and stressful environment was critical. Perhaps even more important was identifying sailors who would not perform well in that environment. Luckily, the Navy had years of academic and psychological testing on many sailors, and this data proved quite useful in predicting later job performance onboard the submarines. Even though we never got the promised submarine ride, I was hooked on applying measurement, scientific methods, and analytics in non-academic settings.

And that’s basically what I have continued to do – apply those skills and methods in diverse scientific and business settings. I worked for two banks and two consulting firms before founding Rexer Analytics in 2002. Last year we supported 30 clients. I’ve got great staff and they have great quant skills. Importantly, we also don’t hesitate to challenge each other, and we’re continually learning from each other and from each client engagement. We share a love of project diversity, and we seek it out in our engagements. We’ve forecasted sales for medical devices, measured B2B customer loyalty, identified manufacturing problems by analyzing product returns, predicted which customers will close their bank accounts, analyzed millions of tax returns, helped identify the dimensions of business team cohesion that result in better performance, found millions of dollars of B2B and B2C fraud, and helped many companies understand their customers better with segmentations, surveys, and analyses of sales and customer behavior.

The advice I would give to young science grads in this recession is to expand your view of where you can apply your scientific training. This applies to high school students considering science careers too. Not all science happens in universities, labs and other traditional science locations. Think about applying scientific methods everywhere! Sometimes our projects at Rexer Analytics seem far away from what most people would consider “science.” But we’re always asking “what data is available that can be brought to bear on the business issue we’re addressing?” Sometimes the best solution is to go out and collect more data, so we frequently help our clients improve their measurement processes or design surveys to collect the necessary data. I think there are enormous opportunities for science grads to apply their scientific training in the business world. The opportunities are not limited to physics whiz kids making models for Wall Street trading or computer science students moving to Silicon Valley. One of the best analytic teams I ever worked on was at Fleet Bank in the late 90s. We had an economist, two physicists, a sociologist, a psychologist, an operations research guy, and a person with a degree in marketing science. We were all very focused on data, measurement, and analytic methods.

I recommend that all science grads read Tom Davenport’s book Competing on Analytics. It illustrates, with compelling examples, how businesses can benefit from using science and analytics. Several examples in Tom’s book come from Gary Loveman, CEO of Harrah’s Entertainment. I think that Gary also serves as a great example of how scientific methods can be applied in every industry. Gary has a PhD in economics from MIT, he’s worked at the Federal Reserve Bank, he’s been a professor at Harvard, and he now runs the world’s largest casino and gaming company. And he’s famously said many times that there are three ways to get fired at Harrah’s: steal, harass women, or not use a control group. Business leaders across all industries increasingly want data, analytics and scientific decision-making. Science grads have great training that enables them to take on these roles and to demonstrate the success of these methods.

Ajay- One more survey- How does the Rexer survey differentiate itself from other surveys out there?

Karl- The Annual Rexer Analytics Data Miner Survey is the only broad-reaching research that investigates the analytic behaviors, views and preferences of data mining professionals. Each year our sample grows — in 2009 we had over 700 people around the globe complete our survey. Our participants include large numbers of both academic and business people.

Another way our survey is differentiated from other surveys is that each year we ask our participants to provide suggestions on ways to improve the survey. Incorporating participants’ suggestions improves our survey. For example, in 2008 several people suggested adding questions about model deployment and off-shoring. We asked about both of these topics in the 2009 survey.

Ajay- Could you please share some sneak previews of the survey results? What impact is the recession likely to have on IT spending?

Karl- We’re just starting to analyze the 2009 survey data. But, yes, here’s a peek at some of the findings that relate to the impact of the recession:

* Many data miners report that funding for data mining projects can sometimes be a problem.
* However, when asked what will happen in 2009 if the economic downturn continues, many data miners still anticipate that their company/organization will conduct more data mining projects in 2009 than in previous years (41% anticipate more projects in 2009; 27% anticipate fewer projects).
* The vast majority of companies conduct their data mining internally, and very few are sending data mining off-shore.

I don’t have a crystal ball that tells me about the trends in overall corporate spending on IT, Business Intelligence, or Data Mining. It’s my personal experience that many budgets are tight this year, but that key projects are still getting funded. And it is my strong opinion that in the coming years many companies will increase their focus on analytics, and I think that increasingly analytics will be a source of competitive advantage for these companies.

There are other people and other surveys that provide better insight into the trends in IT spending. For example, Gartner’s recent survey of over 1,500 CIOs (http://www.gartner.com/it/page.jsp?id=855612) suggests that 2009 IT spending is likely to be flat. I’m personally happy to see that in the Gartner survey, Business Intelligence is again CIOs’ top technology priority, and that “increasing the use of information/analytics” is the #5 business priority.

Ajay- I noticed you advise SPSS among others. Describe what an advisory role is for an analytics company and how small open source companies can get renowned advisors.

Karl- We have advised Oracle, SPSS, Hewlett-Packard and several smaller companies. We find that advisory roles vary greatly. The biggest source of variation is what the company wants advice about. Examples include:

* assessing opportunity areas for the application of analytics
* strategic data assessments
* analytic strategy
* product strategy
* reviewing software

Both large and small companies that look to apply analytics to their businesses can benefit from analytic advisors. So can open source companies that sell analytic software. Companies can find analytic advisors in several ways. One way is to look around for analytic experts whose advice you trust, and hire them. Networking in your own industry and in the analytic communities can identify potential advisors. Don’t forget to look in both academia and the business world. Many skilled people cross back and forth between these two worlds. Another way for these companies to obtain analytic advice is to look in their business networks and user communities for analytic specialists who share some of the goals of the company – they will be motivated for your company to succeed. Especially if focused topic areas or time-constrained tasks can be identified, outside experts may be willing to donate their time, and they may be flattered that you asked.

Ajay- What made you decide to begin the Rexer Surveys? Describe some results of last year’s surveys and any trends from the last three years that you have seen.

Karl- I’ve been involved on the organizing committees of several data mining workshops and conferences. At these conferences I talk with a lot of data miners and companies involved in data mining. I found that many people were interested in hearing about what other data miners were doing: what algorithms, what types of data, what challenges were being faced, what they liked and disliked about their data mining tools, etc. Since we conduct online surveys for several of our clients, and my network of data miners is pretty large, I realized that we could easily do a survey of data miners, and share the results with the data mining community. In the first year, 314 data miners participated, and it’s just grown from there. In 2009 over 700 people completed the survey. The interest we’ve seen in our research summaries has also been astounding – we’ve had thousands of requests. Overall, this just confirms what we originally thought: people are hungry for information about data mining.

Here is a preview of findings from the initial analyses of the 2009 survey data:

* Each year we’ve seen that the most commonly used algorithms are decision trees, regression, and cluster analysis.
* Consistently, some of the top challenges data miners report are dirty data and explaining data mining to others. Previously, data access issues were also reported as a big challenge, but in 2009 fewer data miners reported facing this challenge.
* The most prevalent concerns with how data mining is being utilized are: insufficient training of some data miners, and resistance to using data mining in contexts where it would be beneficial.
* Data mining is playing an important role in organizations. Half of data miners indicate their results are helping to drive strategic decisions and operational processes.
* But there’s room for data mining to grow – almost 20% of data miners report that their company/organizations have only minimal analytic capabilities.

Bio-

Karl Rexer, PhD is President of Rexer Analytics, a small Boston-based consulting firm. Rexer Analytics provides analytic and CRM consulting to help clients use their data to make better strategic and tactical decisions. Recent projects include fraud detection, sales forecasting, customer segmentation, loyalty analyses, predictive modeling for cross-sell and attrition, and survey research. Rexer Analytics also conducts an annual survey of data miners and freely distributes research summaries to the data mining community. Karl has been on the organizing committees of several international data mining conferences, including 3 KDD conferences, and BIWA-2008. Karl is on the SPSS Customer Advisory Board and on the Board of Directors of the Oracle Business Intelligence, Warehousing, & Analytics (BIWA) Special Interest Group. Karl and other Rexer Analytics staff are frequent invited speakers at MBA data mining classes and conferences.

To learn more, check out the website at www.rexeranalytics.com
