Interview- Top Data Mining Blogger on Earth, Sandro Saitta

If you do a Google search for “data mining blog”, one blog has come out on top for the past several years (data mining blog – Google Search: http://bit.ly/kEdPlE).

To honor 5 years of Sandro Saitta’s blog (yes, that’s 5 years!), we present an exclusive interview with him, in which he reveals his unique sauce for cool techie blogging.

Ajay- Describe your journey as a scientist and data miner, from early experiences, to schooling to your work/research/blogging.

Sandro- My first experience with data mining was my master project. I used a decision tree to predict pollen concentration for the following week using input data such as wind, temperature and rain. The fact that an algorithm can make a computer learn from experience was really amazing to me. I found it so interesting that I started a PhD in data mining. This time, the field of application was civil engineering. Civil engineers put a lot of sensors on their structures in order to understand how they behave. With all these sensors they generate a lot of data. To interpret these data, I used data mining techniques such as feature selection and clustering. I started my blog, Data Mining Research, during my PhD, to share with other researchers.

I then started applying data mining in the stock market as my first job in industry. I realized the difference between image recognition, where a 99% correct classification rate is state of the art, and the stock market, where you’re happy with 55%. However, the company ambiance was not as good as I thought, so I moved to consulting. There, I applied data mining in behavioral targeting to increase click-through rates. When you compare the number of customers who click with the ones who don’t, then you really understand what class imbalance means. A few months ago, I accepted a very good opportunity at SICPA. I’m looking forward to resolving new challenges there.

Ajay- Your blog is the top-ranked blog for “data mining blog”. Could you share some tips on better blogging for analytics and technical people?

Sandro- It’s always difficult to start a blog, since at the beginning you have no readers. Writing for nobody may seem stupid, but it is not. By writing my first posts during my PhD I was reorganizing my ideas. I was expressing concepts which were not always clear to me. I thus learned a lot and also improved my English. Of course, it’s still not perfect, but I hope most people can understand me.

Next come the readers. A few dozen each week at first. To increase this number, I then started to learn SEO (Search Engine Optimization) by reading books and blogs. I tested many techniques that increased the visibility of Data Mining Research in the blogosphere. I think SEO is interesting when you already have some content published (which means not at the very beginning of your blog). After a while, once your blog is nicely ranked, the main task is to work on the content of the blog. To be of interest, your content must stand out: original, informative or provocative, for example. I was also lucky to get good visibility thanks to well-known people in the field like Kevin Hillstrom, Gregory Piatetsky-Shapiro, Will Dwinnell / Dean Abbott, Vincent Granville, Matthew Hurst and many others.

Ajay- What’s your favorite statistical software, and which software packages have you worked with? Could you compare and contrast them as well?

Sandro- My favorite software at this point is SAS. I worked with it for two years. Once you know the language, you can perform ETL and data mining so easily. It’s also very fast compared to others. There are a lot of tools for data mining, but I cannot think of a tool that is as powerful as SAS and, at the same time, has a high-level programming language behind it.

I also worked with R and Matlab. R is very nice since you have all the up-to-date data mining algorithms implemented. However, working in memory is not always a good choice, especially for ETL. Matlab is an excellent tool for prototyping. It’s not so fast and certainly not designed for ETL, but the price is low considering all the possibilities for data mining. In my opinion, SAS is the best choice for ETL and a good choice for data mining. Of course, there is the price.

Ajay- What are your favorite techniques and training resources for teaching the basics of data mining to, say, statisticians or business management graduates?

Sandro- I’m the kind of guy who likes to read books. I read data mining books one after the other. The fact that the same concepts are explained differently (and by different people) helps a lot in learning a topic like data mining. Of course, nothing replaces experience in the field. You can read hundreds of books, but you will still not be a good practitioner until you really apply data mining in specific fields. My second choice after books is blogs. By reading data mining blogs, you will really see the issues and challenges in the field. It’s still not experience, but it gets you closer. Finally, there are web resources and networks such as KDnuggets of course, but also AnalyticBridge and LinkedIn.

Ajay- Describe your hobbies and how they help you, if at all, in your professional life.

Sandro- One of my hobbies is reading. I read a lot of books about data mining, SEO, Google as well as Sci-Fi and Fantasy. I’m a big fan of Asimov by the way. My other hobby is playing tennis. I think I simply use my hobbies as a way to find equilibrium in my life. I always try to find the best balance between work, family, friends and sport.

Ajay- What are your plans for your website for 2011-2012?

Sandro- I will continue to publish guest posts and interviews. I think it is important to let other people express themselves about data mining topics. I will not write about my current applications due to the policies of my current employer. But don’t worry, I still have a lot to write, whether it is technical or not. I will also put more emphasis on my experience with data mining, advice for data miners, tips and tricks, and of course book reviews!

Standard Disclosure of Blogging- Sandro awarded me his blog’s People’s Choice award for 2010 and interviewed me. There is a lot of love between our respective WordPress blogs, but to reassure our puritan American readers- it is platonic and intellectual.

About Sandro S-



Sandro Saitta is a Data Mining Research Engineer at SICPA Security Solutions. He is also a blogger at Data Mining Research (www.dataminingblog.com). His interests include data mining, machine learning, search engine optimization and website marketing.

You can contact Mr Saitta at his Twitter address- 

https://twitter.com/#!/dataminingblog

Browsing update- Dear Decisionstats.com Reader

In view of the recent root-level breach of WordPress, which may include viewing source code for hidden hacks or Trojans, please note that, effective immediately, Decisionstats.com takes no responsibility for any viruses or Trojans that you may inadvertently download while on this website. I will be responsible for any deliberate malicious honey traps I put up, but anybody posting an interesting comment with a link on this website can, and may, direct you to a phishing site.

All disputes will be subject to the jurisdiction of Tis Hazari Court, Delhi, India, as already mentioned.

Forecasting World Events Team

a large and diverse panel of forecasters, including substantial representation from government, academia, “think tanks,” and industry. Here are a few other details concerning your fellow participants:
  • At this time, over 600 people are being invited to participate. Please note that we expect that new participants will be joining the panel on a rolling basis for years to come.
  • Around 85% of these 600+ participants have at least a Bachelor’s degree, and over 60% of them have advanced degrees.
  • In terms of background training, participants represent a range of academic fields. Around 40% report a Social-Behavioral Science background, but there is also significant representation from those with backgrounds in Business (15%), the Humanities (13%), Engineering (12%), and the Natural Sciences (10%), among others.
  • The average participant age is 43 years old, with a standard deviation of 15 years.
  • The panel’s gender composition is 75% men / 25% women, and this closely mirrors the gender ratio for all FWE registrants.
  • In addition to participation from individuals overseas, we are pleased to have eligible participants representing 44 of the 50 United States.
We are currently scheduled to begin the core forecasting study in late summer, a few months later than we initially anticipated. In the meantime, we will be readying our web-based forecasting environment and assembling our initial set of forecasting questions. As our formal launch date approaches, we will be contacting you with a link to the forecasting website and any other information you’ll need to get started. Between now and then we may reach out to you with other related announcements.
Finally, registration remains open, and we encourage you to “spread the word” by sharing our registration homepage link with your friends and colleagues.
Thanks once again for your interest in Forecasting World Events. We look forward to you joining us this summer.
Sincerely,
The Forecasting World Events Team

PS- The above message was from this new contest. Enter at your own initiative. Buyer beware!

Heritage Prize = $3 million, now open

I am still angry with Netflix over the $1 million I lost out on. No sweat! This time the money is 3 times as much, it is legit, and yes baby, you can change the world, make it a better place and get rich! See details below- http://www.heritagehealthprize.com/c/hhp/Data

HERITAGE HEALTH PRIZE DATA FILES

You must accept this competition’s rules before you’ll be able to download data files.

IMPORTANT NOTE: The information provided below is intended only to provide general guidance to participants in the Heritage Health Prize Competition and is subject to the Competition Official Rules. Any capitalized term not defined below is defined in the Competition Official Rules. Please consult the Competition Official Rules for complete details.

Heritage Provider Network is providing Competition Entrants with deidentified member data collected during a forty-eight month period that is allocated among three data sets (the “Data Sets”). Competition Entrants will use the Data Sets to develop and test their algorithms for accurately predicting the number of days that the members will spend in a hospital (inpatient or emergency room visit) during the 12-month period following the Data Set cut-off date.

HHP_release2.zip contains the latest files, so you can ignore HHP_release1.zip. SampleEntry.CSV shows you how an entry should look.

Data Sets will be released to Entrants after registration on the Website according to the following schedule:

April 4, 2011: Claims Table – Y1 and DaysInHospital Table – Y2

May 4, 2011: All other Data Sets except Labs Table and Rx Table

From https://www.kaggle.com/

The $3 million Heritage Health Prize opens to entries

It’s been one month since the launch of the Heritage Health Prize. The prize has attracted some great publicity, receiving coverage from the Wall Street Journal, The Economist, Slate and Forbes.

By now, people have had a good chance to poke around the first portion of the data. Now the fun starts! HPN have released two more years’ worth of data, set the accuracy threshold and are opening up the competition to entries. The data are available from the Heritage Health Prize page. Good luck to all participants!

The Deloitte/FIDE Chess Ratings Competition results

The Deloitte/FIDE Chess Ratings Competition attracted one of the strongest fields ever seen in a Kaggle Competition. The competition attracted 189 teams, ranging from chess ratings experts to Netflix Prize winners. As Jeff Sonas wrote on the Kaggle blog last week, the competition has far exceeded his expectations. A big congratulations to the provisional winner, Tim Salimans, an econometrician at Erasmus University in Rotterdam. We look forward to reading about the approaches used by top performers on the Kaggle blog. We also look forward to the results of the FIDE prize, which could see the introduction of a new chess ratings system.

ICDAR 2011 Competition Results

The ICDAR 2011 competition also finished recently. The competition required participants to develop an algorithm that correctly matched handwriting samples. The winners were Lewis Griffin and Andrew Newell from University College London, who achieved Kaggle’s first ever perfect score by managing to match every sample correctly! Andrew and Lewis have posted a description of their winning method on the Kaggle blog.

Revolution R Enterprise

Since R is the most popular language used by Kaggle members, the Revolution Analytics team is making Revolution R Enterprise (the pre-eminent commercial version of R) available free of charge to Kaggle members. Revolution R Enterprise has several advantages over standard R, including the ability to seamlessly handle larger datasets. To get your free copy, visit http://info.revolutionanalytics.com/Kaggle.html.

Kaggle-in-Class

As many of you know, Kaggle offers a free platform, Kaggle-in-Class, for instructors who want to host competitions for their students. For those interested in hearing more about the use of Kaggle-in-Class as a teaching tool, Susan Holmes and Nelson Ray from Stanford University share their experience in a webinar organized by the Consortium for the Advancement of Undergraduate Statistics Education.

Lovely forecasting blog

I really loved this simple, smart and yet elegant explanation of forecasting. Even a high school quarterback could understand it, and maybe get an internship job building, running and re-running code for a Mars shot.

Despite my plea that you remain svelte in real life, I implore you to be naïve in business forecasting – and use a naïve forecasting model early and often. A naïve forecasting model is the most important model you will ever use in business forecasting.

And now the killer line-

Purists may argue that the only true naïve forecast is the “no-change” forecast, meaning either a random walk (forecast = last known actual) or a seasonal random walk (e.g. forecast = actual from corresponding period last year). These are referred to as NF1 and NF2 in the Makridakis text (where NF = Naïve Forecast). In our 2006 SAS webseries Finding Flaws in Forecasting, an attendee asked “What about using a simple time series forecast with no intervention as the naïve forecast?” Is that allowed?
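The two “no-change” forecasts in that quote fit in a few lines of code. Below is a minimal sketch in Python; the function names and the monthly sales figures are made up for illustration and do not come from the SAS post or the Makridakis text.

```python
def naive_forecast(history):
    """NF1 (random walk): forecast = last known actual."""
    if not history:
        raise ValueError("history must contain at least one observation")
    return history[-1]


def seasonal_naive_forecast(history, season_length):
    """NF2 (seasonal random walk): forecast = actual from the
    corresponding period one season ago."""
    if season_length < 1:
        raise ValueError("season_length must be at least 1")
    if len(history) < season_length:
        raise ValueError("need at least one full season of history")
    return history[-season_length]


# Hypothetical monthly sales for a small shop (two years, made-up numbers).
sales = [
    120, 180, 130, 140, 210, 150, 145, 140, 135, 150, 160, 200,  # year 1
    125, 190, 135, 150, 220, 155, 150, 145, 140, 155, 170, 210,  # year 2
]

# Forecast for next month (January of year 3).
nf1 = naive_forecast(sales)               # December of year 2
nf2 = seasonal_naive_forecast(sales, 12)  # January of year 2
print("NF1:", nf1, "NF2:", nf2)
```

Either number is a baseline: a fancier model only earns its keep if it forecasts better than these.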

I did write a blog article on forecasting some time back, but back then I was a little blogger, and the website was named http://iwannacrib.com

Great work in helping make forecasting easier to understand for people who have flower shops and don’t have a bee to help them with the forecasts, nor a geeky email list, nor $4,000.

Make it easier for the little guy to forecast his sales, so he cuts down on his supply chain inventory, lowering his carbon footprint.

Blogs.sas.com, take a bow on Labour Day for helping workers with easy-to-understand models.

http://blogs.sas.com/forecasting/index.php?/archives/68-Which-Naive-Model-to-Use.html

Try JMP for free in steps 1-2-3

Test a 30 day free trial of JMP, the beautiful software with the ugliest website.

In case you have never used JMP, but know the difference between a mean and a mode- take a look.

Step 1 Fill in a long, badly designed and outdated form (note the blue lightning graphic design and font)


Step 2 See a uselessly long message, as the website requires registration but offers no easy OAuth/social media sign-up, even though they help sell software on social media from the same campus

Step 3 Wait for 352 MB to download, without a BitTorrent option or mirror servers, or even a link for scheduling with Download Accelerator.

Note that internet connections can be lousy enough (globally, not just in India) to make a 352 MB download painful.


And after all the violence and double talk
There’s just a song in all the trouble and the strife

JMP is still the best, easiest-to-use, powerful Big Data software, with extensions into R and SAS.

Broad Guidelines for Graphs

Here are some broad guidelines for graphs from EIA.gov, so you could say these are the official graphical guidelines of the US government.

They can be really useful for sites planning to get into the Tableau Software/NYT/Guardian infographic mode, or even for communities of blogs that have recurring needs to display graphical plots, particularly since communication, statistics and design are different areas of expertise, usually handled by different people.

Energy Information Administration Standard

Broad Guidelines for Graphs- I am reproducing an example from EIA’s guidelines for graphs-
http://www.eia.gov/about/eia_standards.cfm#Standard25

Energy Information Administration Standard 2009-25

Title: Statistical Graphs
Superseded Version: Standard 2002-25
Purpose: To ensure the utility (usefulness to intended users) and objectivity (accuracy, clarity, completeness, and lack of bias) of energy information presented in statistical graphs.
Applicability: All EIA information products.
Required Actions:

  1. Graphs should be used to show and compare changes, trends and/or relationships, and to assist users in visualizing the conclusions drawn from the data represented.
  2. A graph should contain sufficient …