Interview- Top Data Mining Blogger on Earth , Sandro Saitta

Surajustement Modèle 2
Image via Wikipedia

If you do a Google search for Data Mining Blog- for the past several years one Blog will come on top. data mining blog – Google Search http://bit.ly/kEdPlE

To honor 5 years of Sandro Saitta’s blog (yes thats 5 years!) , we cover an exclusive interview with him where he reveals his unique sauce for cool techie blogging.

Ajay- Describe your journey as a scientist and data miner, from early experiences, to schooling to your work/research/blogging.

Sandro- My first experience with data mining was my master project. I used decision tree to predict pollen concentration for the following week using input data such as wind, temperature and rain. The fact that an algorithm can make a computer learn from experience was really amazing to me. I found it so interesting that I started a PhD in data mining. This time, the field of application was civil engineering. Civil engineers put a lot of sensors on their structure in order to understand how they behave. With all these sensors they generate a lot of data. To interpret these data, I used data mining techniques such as feature selection and clustering. I started my blog, Data Mining Research, during my PhD, to share with other researchers.

I then started applying data mining in the stock market as my first job in industry. I realized the difference between image recognition, where 99% correct classification rate is state of the art, and stock market, where you’re happy with 55%. However, the company ambiance was not as good as I thought, so I moved to consulting. There, I applied data mining in behavioral targeting to increase click-through rates. When you compare the number of customers who click with the ones who don’t, then you really understand what class imbalance mean. A few months ago, I accepted a very good opportunity at SICPA. I’m looking forward to resolving new challenges there.

Ajay- Your blog is the top ranked blog for “data mining blog”. Could you share some tips on better blogging for analytics and technical people

Sandro- It’s always difficult to start a blog, since at the beginning you have no reader. Writing for nobody may seem stupid, but it is not. By writing my first posts during my PhD I was reorganizing my ideas. I was expressing concepts which were not always clear to me. I thus learned a lot and also improved my English level. Of course, it’s still not perfect, but I hope most people can understand me.

Next come the readers. A few dozen each week first. To increase this number, I then started to learn SEO (Search Engine Optimization) by reading books and blogs. I tested many techniques that increased Data Mining Research visibility in the blogosphere. I think SEO is interesting when you already have some content published (which means not at the very beginning of your blog). After a while, once your blog is nicely ranked, the main task is to work on the content of the blog. To be of interest, your content must be particular: original, informative or provocative for example. I also had the chance to have a good visibility thanks to well-known people in the field like Kevin Hillstrom, Gregory Piatetsky-Shapiro, Will Dwinnell / Dean Abbott, Vincent Granville, Matthew Hurst and many others.

Ajay- Whats your favorite statistical software and what are the various softwares that you have worked with.
Could you compare and contrast these software as well.

Sandro- My favorite software at this point is SAS. I worked with it for two years. Once you know the language, you can perform ETL and data mining so easily. It’s also very fast compared to others. There are a lot of tools for data mining, but I cannot think of a tool that is as powerful as SAS and, in the same time, has a high-level programming language behind it.

I also worked with R and Matlab. R is very nice since you have all the up-to-date data mining algorithms implemented. However, working in the memory is not always a good choice, especially for ETL. Matlab is an excellent tool for prototyping. It’s not so fast and certainly not done for ETL, but the price is low regarding all the possibilities for data mining. According to me, SAS is the best choice for ETL and a good choice for data mining. Of course, there is the price.

Ajay- What are your favorite techniques and training resources for learning basics of data mining to say statisticians or business management graduates.

Sandro- I’m the kind of guy who likes to read books. I read data mining books one after the other. The fact that the same concepts are explained differently (and by different people) helps a lot in learning a topic like data mining. Of course, nothing replaces experience in the field. You can read hundreds of books, you will still not be a good practitioner until you really apply data mining in specific fields. My second choice after books is blogs. By reading data mining blogs, you will really see the issues and challenges in the field. It’s still not experience, but we are closer. Finally, web resources and networks such as KDnuggets of course, but also AnalyticBridge and LinkedIn.

Ajay- Describe your hobbies and how they help you ,if at all in your professional life.

Sandro- One of my hobbies is reading. I read a lot of books about data mining, SEO, Google as well as Sci-Fi and Fantasy. I’m a big fan of Asimov by the way. My other hobby is playing tennis. I think I simply use my hobbies as a way to find equilibrium in my life. I always try to find the best balance between work, family, friends and sport.

Ajay- What are your plans for your website for 2011-2012.

Sandro- I will continue to publish guest posts and interviews. I think it is important to let other people express themselves about data mining topics. I will not write about my current applications due to the policies of my current employer. But don’t worry, I still have a lot to write, whether it is technical or not. I will also emphasis more on my experience with data mining, advices for data miners, tips and tricks, and of course book reviews!

Standard Disclosure of Blogging- Sandro awarded me the Peoples Choice award for his blog for 2010 and carried out my interview. There is a lot of love between our respective wordpress blogs, but to reassure our puritan American readers- it is platonic and intellectual.

About Sandro S-



Sandro Saitta is a Data Mining Research Engineer at SICPA Security Solutions. He is also a blogger at Data Mining Research (www.dataminingblog.com). His interests include data mining, machine learning, search engine optimization and website marketing.

You can contact Mr Saitta at his Twitter address- 

https://twitter.com/#!/dataminingblog

The Year 2010

Nokia N800 internet tablet, with open source s...
Image via Wikipedia

My annual traffic to this blog was almost 99,000 . Add in additional views on networking sites plus the 400 plus RSS readers- so I can say traffic was 1,20,000 for 2010. Nice. Thanks for reading and hope it was worth your time. (this is a long post and will take almost 440 secs to read but the summary is just given)

My intent is either to inform you, give something useful or atleast something interesting.

see below-

Jan Feb Mar Apr May Jun
2010 6,311 4,701 4,922 5,463 6,493 4,271
Jul Aug Sep Oct Nov Dec Total
5,041 5,403 17,913 16,430 11,723 10,096 98,767

 

 

Sandro Saita from http://www.dataminingblog.com/ just named me for an award on his blog (but my surname is ohRi , Sandro left me without an R- What would I be without R :)) ).

Aw! I am touched. Google for “Data Mining Blog” and Sandro is the best that it is in data mining writing.

DMR People Award 2010
There are a lot of active people in the field of data mining. You can discuss with them on forums. You can read their blogs. You can also meet them in events such as PAW or KDD. Among the people I follow on a regular basis, I have elected:

Ajay Ori

He has been very active in 2010, especially on his blog . Good work Ajay and continue sharing your experience with us!”

What did I write in 2010- stuff.

What did you read on this blog- well thats the top posts list.

2009-12-31 to Today

Title Views
Home page More stats 21,150
Top 10 Graphical User Interfaces in Statistical Software More stats 6,237
Wealth = function (numeracy, memory recall) More stats 2,014
Matlab-Mathematica-R and GPU Computing More stats 1,946
The Top Statistical Softwares (GUI) More stats 1,405
About DecisionStats More stats 1,352
Using Facebook Analytics (Updated) More stats 1,313
Test drive a Chrome notebook. More stats 1,170
Top ten RRReasons R is bad for you ? More stats 1,157
Libre Office More stats 1,151
Interview Hadley Wickham R Project Data Visualization Guru More stats 1,007
Using Red R- R with a Visual Interface More stats 854
SAS Institute files first lawsuit against WPS- Episode 1 More stats 790
Interview Professor John Fox Creator R Commander More stats 764
R Package Creating More stats 754
Windows Azure vs Amazon EC2 (and Google Storage) More stats 726
Norman Nie: R GUI and More More stats 716
Startups for Geeks More stats 682
Google Maps – Jet Ski across Pacific Ocean More stats 670
Not so AWkward after all: R GUI RKWard More stats 579
Red R 1.8- Pretty GUI More stats 570
Parallel Programming using R in Windows More stats 569
R is an epic fail or is it just overhyped More stats 559
Enterprise Linux rises rapidly:New Report More stats 537
Rapid Miner- R Extension More stats 518
Creating a Blog Aggregator for free More stats 504
So which software is the best analytical software? Sigh- It depends More stats 473
Revolution R for Linux More stats 465
John Sall sets JMP 9 free to tango with R More stats 460

So how do people come here –

well I guess I owe Tal G for almost 9000 views ( incidentally I withdrew posting my blog from R- Bloggers and Analyticbridge blogs – due to SEO keyword reasons and some spam I was getting see (below))

http://r-bloggers.com is still the CAT’s whiskers and I read it  a lot.

I still dont know who linked my blog to a free sex movie site with 400 views but I have a few suspects.

2009-12-31 to Today

Referrer Views
r-bloggers.com 9,131
Reddit 3,829
rattle.togaware.com 1,500
Twitter 1,254
Google Reader 1,215
linkedin.com 717
freesexmovie.irwanaf.com 422
analyticbridge.com 341
Google 327
coolavenues.com 322
Facebook 317
kdnuggets.com 298
dataminingblog.com 278
en.wordpress.com 185
google.co.in 151
xianblog.wordpress.com 130
inside-r.org 124
decisionstats.com 119
ifreestores.com 117
bits.blogs.nytimes.com 108

Still reading this post- gosh let me sell you some advertising. It is only $100 a month (yes its a recession)

Advertisers are treated on First in -Last out (FILO)

I have been told I am obsessed with SEO , but I dont care much for search engines apart from Google, and yes SEO is an interesting science (they should really re name it GEO or Google Engine Optimization)

Apparently Hadley Wickham and Donald Farmer are big keywords for me so I should be more respectful I guess.

Search Terms for 365 days ending 2010-12-31 (Summarized)

2009-12-31 to Today

Search Views
libre office 925
facebook analytics 798
test drive a chrome notebook 467
test drive a chrome notebook. 215
r gui 203
data mining 163
wps sas lawsuit 158
wordle.net 133
wps sas 123
google maps jet ski 123
test drive chrome notebook 96
sas wps 89
sas wps lawsuit 85
chrome notebook test drive 83
decision stats 83
best statistics software 74
hadley wickham 72
google maps jetski 72
libreoffice 70
doug savage 65
hive tutorial 58
funny india 56
spss certification 52
donald farmer microsoft 51
best statistical software 49

What about outgoing links? Apparently I need to find a way to ask Google to pay me for the free advertising I gave their chrome notebook launch. But since their search engine and browser is free to me, guess we are even steven.

Clicks for 365 days ending 2010-12-31 (Summarized)

2009-12-31 to Today

URL Clicks
rattle.togaware.com 378
facebook.com/Decisionstats 355
rapid-i.com/content/view/182/196 319
services.google.com/fb/forms/cr48basic 313
red-r.org 228
decisionstats.wordpress.com/2010/05/07/the-top-statistical-softwares-gui 199
teamwpc.co.uk/products/wps 162
r4stats.com/popularity 148
r-statistics.com/2010/04/r-and-the-google-summer-of-code-2010-accepted-students-and-projects 138
socserv.mcmaster.ca/jfox/Misc/Rcmdr 138
spss.com/certification 116
learnr.wordpress.com 114
dudeofdata.com/decisionstats 108
r-project.org 107
documentfoundation.org/faq 104
goo.gl/maps/UISY 100
inside-r.org/download 96
en.wikibooks.org/wiki/R_Programming 92
nytimes.com/external/readwriteweb/2010/12/07/07readwriteweb-report-google-offering-chrome-notebook-test-11919.html 92
sourceforge.net/apps/mediawiki/rkward/index.php?title=Main_Page 92
analyticdroid.togaware.com 88
yeroon.net/ggplot2 87

so in 2010,

SAS remained top daddy in business analytics,

R made revolutionary strides in terms of new packages,

JMP  launched a new version,

SPSS got integrated with Cognos,

Oracle sued Google and did build a great Data Mining GUI,

Libre Office gave you a non Oracle Open office ( or open even more office)

2011 looks like  a fun year. Have safe partying .

The SEO mess on joining blog aggregators

 

Mug shot of Paris Hilton.
Image via Wikipedia

 

If you are an analytics blogger who writes, and is aggregated on an analytical community- read on- Here’s how blog aggregation communities can help you lose 30% of all future traffic long term, while giving you a short term.

The problem is not created by Blogging Communities (like R-Bloggers, or PlanteR, or Smart Data Collective or AnalyticBridge or even BeyeBlogs )

It is created by the way Google Page Rank is structured- you see given exactly the same content on two different we pages- Google Page Rank will place the higher Page Rank results higher. This is counter intutive and quite simple to rectify- The Google Spider can just use the Time Stamp for choosing which article was published where first (Obviously on your blog, AND then later to the aggregator).

How bad is the mess? Well joining ANY blog aggregation will lead to an instant lift of upto 10-50 % of your current traffic as similar bloggers try and read about you. However you can lose the long term 30% proportion which is a benchmark of search engine created traffic for you.

So do you opt out of blog aggregation? No. It’s a SEO mess and it’s unfair to punish your blog aggregator, most of whom are running on ad-supported sponsors or their own funds on dry fumes to publish your content. Most of the fore mentioned communities are created by excellent people I interacted with heavily- and they are genuinely motivated to give readers an easy way to keep up with blogs. Especially Smart Data Collective, Analyticbridge and R-bloggers whose founders I have known personally.

You can do one thing- create manual summaries in the excerpt feature of your blog posts- it’s just below the WordPress page. And switch your RSS feed to summary rather than full. It avoids losing keyword rank to other websites, it prevents the Blog Aggregation from gaining too much influence in key word related searches, and it keeps your whole eco system happy, Best of All it helps readers of Blog Aggregators- since most of them use a summary on the front page anyways.

An additional thought on Google Page Rank- something I have sulked over but not spoken for a long long time.  It ignores the value of reader- If Bill Gates, Steve Jobs, and 500 ceos from Fortune 500 companies read my blog but do not link to it- it will count daily traffic as 500. Probably it will give more weightage to Paris Hilton fans.

A suggestion-humbly- you can use IP Address lookup of visitors to see if traffic is coming from corporate sources or retail sources -Clicky from GetClicky does this. Use it as feedback in Google Analytics as well as Google Trends.

And maybe PageRank needs to add quantity and quality of visitors as additional variables . Do a A/B test guys some Chi Square juice- its not quite Mad Men Adverting but its still good fun.

 

PageRank
Image via Wikipedia

 

and the world is one big community as per xkcd


Blog Update

Some changes at Decisionstats-

1) We are back at Decisionstats.com and Decisionstats.wordpress.com will point to that as well. The SEO effects would be interesting and so would be the Instant Pagerank or LinkRank or whatever Coffee/Percolator they use in Cali to index the site.

2) AsterData is no longer a sponsor- but Predictive Analytics Conference is. Welcome PAWS! I have been a blog partner to PAWS ever since it began- and it’s a great marketing fit. Expect to see a lot of exclusive content and interviews from great speakers at PAWS.

3) The Feedblitz newsletter (now at 404 subscribers) is now a weekly subscription to send one big big email rather than lots of email through the week- this is because my blogging frequency is moving up as I collect material for a new book on business analytics that I would probably release in 2011 (if all goes well, touchwood). Linkedin group would be getting a weekly update announcement. If you are connected to Decisionstats on Analyticbridge _ I would soon try to find a way to update the whole post automatically using RSS and Ning.com . or not. Depends.

4) R continues to be a bigger focus. So will SPSS and maybe JMP. Newer softwares or older softwares that change more rapidly would get more coverage. Generally a particular software is covered if it has newer features, or an interesting techie conference, or it gets sued.

5) I will occasionally write a poem or post a video once a week randomly to prove geeks and nerds and analysts can have fun (much more fun actually dont we)

Thanks for reading this. Sept 2010 was the best ever for Decisionstats.com – we crossed 15,000 + visitors and thanks for that again! I promise to bore you less and less as we grow old together on the blog 😉

Decisionstats Interviews

Here is a list of interviews that I have published- these are specific to analytics and data mining and include only the most recent interviews. If I have missed out any notable recent interview related to analytics and data mining, kindly do let me know. Hat Tip to Karl Rexer, for this suggestion .

Date    Name of Interviewee    Designation and Organization

09-Jun    Karl Rexer                          President, Rexer Analytics
05-Jun    Jim Daves                          CMO, SAS Institute
04-Jun    Paul van Eikeren                 President and CEO, Blue Reference
29-May    David Smith                      Director of Community, REvolution Computing
17-May    Dominic Pouzin                 CEO, Data Applied
11-May    Bruno Delahaye                 VP, KXEN
04-May    Ron Ramos                        Director, Zementis
30-Apr    Oliver Jouve                       VP, SPSS Inc
21-Apr    Fabian Dill                         Co- Founder, Knime.com
18-Apr    Alicia Mcgreevey                 Head Marketing, Visual Numerics
27-Mar    Francoise Soulie Fogelman    VP, KXEN
17-Mar    Jon Peck                            Principal Software Engineer, SPSS Inc
06-Mar    Anne Milley                        Director of product marketing, SAS Institute
04-Mar    Anne Milley                        Director of product marketing, SAS Institute
03-Feb    Phil Rack                            Creator, Bridge to R,and CEO Minequest
03-Feb    Michael Zeller                     CEO, Zementis
31-Jan    Richard Schultz                   CEO, Revolution Computing
21-Jan    Bob Muenchen                    Author, R for SAS and SPSS Users
13-Jan    Dr Graham Williams           Creator, Rattle GUI for R
05-Jan    Roger Haddad                    CEO, KXEN
26-Sep    June Dershewitz                  VP, Semphonic
04-Sep    Vincent Granville                 Head, Analyticbridge

The URl’s to specific interviews are also in this sheet.

http://spreadsheets.google.com/pub?key=rWTqcMe9mqwHeFv1e4GS_yg&single=true&gid=0&range=a1%3Ae24&output=html