Interview- Top Data Mining Blogger on Earth , Sandro Saitta

Surajustement Modèle 2
Image via Wikipedia

If you do a Google search for Data Mining Blog- for the past several years one Blog will come on top. data mining blog – Google Search http://bit.ly/kEdPlE

To honor 5 years of Sandro Saitta’s blog (yes thats 5 years!) , we cover an exclusive interview with him where he reveals his unique sauce for cool techie blogging.

Ajay- Describe your journey as a scientist and data miner, from early experiences, to schooling to your work/research/blogging.

Sandro- My first experience with data mining was my master project. I used decision tree to predict pollen concentration for the following week using input data such as wind, temperature and rain. The fact that an algorithm can make a computer learn from experience was really amazing to me. I found it so interesting that I started a PhD in data mining. This time, the field of application was civil engineering. Civil engineers put a lot of sensors on their structure in order to understand how they behave. With all these sensors they generate a lot of data. To interpret these data, I used data mining techniques such as feature selection and clustering. I started my blog, Data Mining Research, during my PhD, to share with other researchers.

I then started applying data mining in the stock market as my first job in industry. I realized the difference between image recognition, where 99% correct classification rate is state of the art, and stock market, where you’re happy with 55%. However, the company ambiance was not as good as I thought, so I moved to consulting. There, I applied data mining in behavioral targeting to increase click-through rates. When you compare the number of customers who click with the ones who don’t, then you really understand what class imbalance mean. A few months ago, I accepted a very good opportunity at SICPA. I’m looking forward to resolving new challenges there.

Ajay- Your blog is the top ranked blog for “data mining blog”. Could you share some tips on better blogging for analytics and technical people

Sandro- It’s always difficult to start a blog, since at the beginning you have no reader. Writing for nobody may seem stupid, but it is not. By writing my first posts during my PhD I was reorganizing my ideas. I was expressing concepts which were not always clear to me. I thus learned a lot and also improved my English level. Of course, it’s still not perfect, but I hope most people can understand me.

Next come the readers. A few dozen each week first. To increase this number, I then started to learn SEO (Search Engine Optimization) by reading books and blogs. I tested many techniques that increased Data Mining Research visibility in the blogosphere. I think SEO is interesting when you already have some content published (which means not at the very beginning of your blog). After a while, once your blog is nicely ranked, the main task is to work on the content of the blog. To be of interest, your content must be particular: original, informative or provocative for example. I also had the chance to have a good visibility thanks to well-known people in the field like Kevin Hillstrom, Gregory Piatetsky-Shapiro, Will Dwinnell / Dean Abbott, Vincent Granville, Matthew Hurst and many others.

Ajay- Whats your favorite statistical software and what are the various softwares that you have worked with.
Could you compare and contrast these software as well.

Sandro- My favorite software at this point is SAS. I worked with it for two years. Once you know the language, you can perform ETL and data mining so easily. It’s also very fast compared to others. There are a lot of tools for data mining, but I cannot think of a tool that is as powerful as SAS and, in the same time, has a high-level programming language behind it.

I also worked with R and Matlab. R is very nice since you have all the up-to-date data mining algorithms implemented. However, working in the memory is not always a good choice, especially for ETL. Matlab is an excellent tool for prototyping. It’s not so fast and certainly not done for ETL, but the price is low regarding all the possibilities for data mining. According to me, SAS is the best choice for ETL and a good choice for data mining. Of course, there is the price.

Ajay- What are your favorite techniques and training resources for learning basics of data mining to say statisticians or business management graduates.

Sandro- I’m the kind of guy who likes to read books. I read data mining books one after the other. The fact that the same concepts are explained differently (and by different people) helps a lot in learning a topic like data mining. Of course, nothing replaces experience in the field. You can read hundreds of books, you will still not be a good practitioner until you really apply data mining in specific fields. My second choice after books is blogs. By reading data mining blogs, you will really see the issues and challenges in the field. It’s still not experience, but we are closer. Finally, web resources and networks such as KDnuggets of course, but also AnalyticBridge and LinkedIn.

Ajay- Describe your hobbies and how they help you ,if at all in your professional life.

Sandro- One of my hobbies is reading. I read a lot of books about data mining, SEO, Google as well as Sci-Fi and Fantasy. I’m a big fan of Asimov by the way. My other hobby is playing tennis. I think I simply use my hobbies as a way to find equilibrium in my life. I always try to find the best balance between work, family, friends and sport.

Ajay- What are your plans for your website for 2011-2012.

Sandro- I will continue to publish guest posts and interviews. I think it is important to let other people express themselves about data mining topics. I will not write about my current applications due to the policies of my current employer. But don’t worry, I still have a lot to write, whether it is technical or not. I will also emphasis more on my experience with data mining, advices for data miners, tips and tricks, and of course book reviews!

Standard Disclosure of Blogging- Sandro awarded me the Peoples Choice award for his blog for 2010 and carried out my interview. There is a lot of love between our respective wordpress blogs, but to reassure our puritan American readers- it is platonic and intellectual.

About Sandro S-



Sandro Saitta is a Data Mining Research Engineer at SICPA Security Solutions. He is also a blogger at Data Mining Research (www.dataminingblog.com). His interests include data mining, machine learning, search engine optimization and website marketing.

You can contact Mr Saitta at his Twitter address- 

https://twitter.com/#!/dataminingblog

FlattR goes free for all bloggers

This is the Flattr right now
Image via Wikipedia

Email from people at FlattR. What is FlattR- social micropayments. like small Paypal button that gives a fraction of your monthly budget (say 2 euros) to people you retweet/like (called Flattered in this social media service).

Some angels are smiling over some bays , it seems. congrats FlattR and their team (which includes tech team of largest bit torrent search engine in the world). Apparently this was done in time for the royal wedding.

Important service announcement

Hello!

Flattr’s first year has been great. And not just for us but for tens of thousands of bloggers, podcasters, developers, designers and other creators out there. Just ask Tim. We’re now making an important change to the service, one which should open the floodgates of Flattr, if you will.
From May 1st we no longer require users to flattr others before they can be flattrd. Or in other words, it’s not mandatory to add money to your account to have an active Flattr button.

How does this affect you?

If you’re mainly using Flattr to make payments you will soon have much more content to flattr.If you’re using Flattr both to make and receive payments then you no longer need to check your balance at the end of each month to see whether your Flattr button is still active or not. It is and always will be.

If your Flattr button was once deactivated because balance dropped to zero, it is now active again. Forever.

This makes Flattr simpler

We have good reasons for making this change and we’ve just added a post about it on our blog. In a nutshell, we just didn’t need to force the give before you get principle onto people. During the the last year we’ve learned that people want to flattr the content they like and therefore we decided to drop any rules that made the service restrictive or outright complicated.We hope you’ll like the simpler more straightforward Flattr. If you have any comments, questions or feedback please get in touch via our blog or support page.

Linus, Peter and the rest of Flattr team

Contest for SAS Users and Students

Heres a new contest for SAS users. The prizes are books, so students should be interested as well.

From http://www.sascommunity.org/mwiki/images/b/bc/PointsforprizesRules.pdf

HOW TO ENTER: To qualify for entry, go to the sasCommunity.org web site located at http://www.sascommunity.org/wiki/Main_Page
between April 11, 2011 and May 9, 2011 and either add or edit valid content as described herein to earn award points.
Creation of a first time profile on www.sascommunity.org will earn 1,000 points. For each valid article creation or edit, 100
points will be earned. Articles and subsequent edits should adhere to the sasCommunity.org terms of use as outlined on
http://www.sascommunity.org/wiki/sasCommunity:Terms_of_Use. All points’ accumulation will end at 5:00 PM GMT on
May 9, 2011 and only those points earned between 8:00 AM GMT on April 11, 2011 and 5:00 PM GMT on May 9, 2011
will be counted in this contest. Contest entries made through the Internet will be declared made by the registered user of
the sasCommunity.org profile account. Sponsor is not responsible for phone, technical, network, electronic, computer
hardware or software failures of any kind, misdirected, incomplete, garbled or delayed transmissions. Sponsor will not be
responsible for incorrect or inaccurate entry information, whether caused by entrants or by any of the equipment or
programming associated with or utilized in the contest.
ELIGIBILITY: The contest is open to all sasCommunity.org members 18 year of age or older on the start date of the
contest. Void where prohibited by law. Employees (including immediate family members and/or those living in the same 
household of each), the Sponsor, members of the sasCommunity.org Advisory Board, SAS Global Users Group Executive 
Board, their advertising, promotion and production agencies, the affiliated companies of each, and the immediate family 
members of each are not eligible. 

PRIZE: Three (3) prizes will be awarded based on total points accumulated during the contest as follows:
 1stPlace: 3 SAS®Press books - not to exceed $250 in combined retail value;
 2ndPlace: 2 SAS®Press books - not to exceed $150 in combined retail value; and
 3rdPlace: 1 SAS®Press book - not to exceed $100 in retail value.

What’s New

http://www.sascommunity.org/wiki/Main_Page

New Points for Prizes Contest
Points for Prizes Contest
Win SAS books!
Contribute content or SAS code to sasCommunity.org for your chance to WIN! To qualify, simply add or edit articles between April 11, 2011 and May 9, 2011 (GMT). Creation of a first-time profile on sasCommunity.org gives you 1,000 points. For each valid article creation or edit, 100 points will be earned. The user with the most points collected during this time wins SAS Press Books!

Become a sasCommunity Guru
Thanks for Contributing to sasCommunity.org!
New sasCommunity.org Point System
The sasCommunity support team has been hard at work adding new features and is pleased to announce a points system that recognizes each user’s contributions to the site. Every time you contribute by creating a page, updating it, or just doing a little wiki gardening, you earn points.Earning points is automatic and simple – all you have to do is contribute! Creating your account starts you with 1000 points and all the current users have been credited with points dating back to the site coming online in April 2007.

Why search optimization can make you like Rebecca Black

Felicia Day, actress and web content producer.
Image via Wikipedia

A highly optimized blog post or web content can get you a lot of attention just like Rebecca Black’s video (provided it passes through the new quality metrics \change*/ in the Search Engine)

But if the underlying content is weak, or based on a shoddy understanding of the content-it can drive lots of horrid comments as well as ensuring that bad word of mouth is spread about the content or you/despite your hard work.

An example of this is copy and paste journalism especially in technology circles, where even a bigger Page Ranked website /blog can get away with scraping or stealing content from a lower page ranked website (or many websites)  after adding a cursory “expert comment”. This is also true when someone who is basically a corporate communication specialist (or PR -public relations) person is given a techinical text and encourage to write about it without completely understanding it.

A mild technical defect in the search engine algorithm is that it does not seem to pay attention to when the content was published, so the copying website or blog actually can get by as fresher content even if it is practically has 90% of the same words). The second flaw is over punishment or manual punishment of excessive linking – this can encourage search optimization minded people to hoard links or discourage trackbacks.

A free internet is one which promotes free sharing of content and does not encourage stealing or un-authorized scraping or content copying. Unfortunately current search engine optimization can encourage scraping and content copying without paying too much attention to origin of the words.

In addition the analytical rigor by which search algorithms search your inboxes (as in search all emails for a keyword) or media rich sites (like Youtube) are quite on a different level of quality altogether. The chances of garbage results are much more while searching for media content and/or emails.

It's a code code summer

East-German pupils ("Junge Pioniere"...
Image via Wikipedia

and soc is back!

also expecting some #Rstats entries (open source!)

from https://code.google.com/soc/

Google Summer of Code 2011

Visit the Google Summer of Code 2011 site for more details about the program this year.

For a detailed timeline and further information about the program, review our Frequently Asked Questions.

About Google Summer of Code

Google Summer of Code is a global program that offers student developers stipends to write code for various open source software projects. We have worked with several open source, free software, and technology-related groups to identify and fund several projects over a three month period. Since its inception in 2005, the program has brought together over 4500 successful student participants and over 3000 mentors from over 100 countries worldwide, all for the love of code. Through Google Summer of Code, accepted student applicants are paired with a mentor or mentors from the participating projects, thus gaining exposure to real-world software development scenarios and the opportunity for employment in areas related to their academic pursuits. In turn, the participating projects are able to more easily identify and bring in new developers. Best of all, more source code is created and released for the use and benefit of all.

To learn more about the program, peruse our 2011 Frequently Asked Questions page. You can also subscribe to the Google Open Source Blog or the Google Summer of Code Discussion Group to keep abreast of the latest announcements.

Participating in Google Summer of Code

For those of you who would like to participate in the program, there are many resources available for you to learn more. Check out the information pages from the 20052006200720082009, and 2010 instances of the program to get a better sense of which projects have participated as mentoring organizations in Google Summer of Code each year. If you are interested in a particular mentoring organization, just click on its name and you’ll find more information about the project, a summary of their students’ work and actual source code produced by student participants. You may also find the program Frequently Asked Questions (FAQs) pages for each year to be useful. Finally, check out all the great content and advice on participation produced by the community, for the community, on our program wiki.

If you don’t find what you need in the documentation, you can always ask questions on our program discussion list or the program IRC channel, #gsoc on Freenode.

 

SAS Knowledge Exchange

Visual analytics : research and practice
Image via Wikipedia

Here is an interesting website by SAS.com – it showcases lots of business analytics content more from a conceptual rather than a tool based perspective- have a glance yourself.

http://www.sas.com/knowledge-exchange/business-analytics/

Copyright © SAS Institute Inc. All rights reserved

Carole-Ann’s 2011 Predictions for Decision Management

Carole-Ann’s 2011 Predictions for Decision Management

For Ajay Ohri on DecisionStats.com

What were the top 5 events in 2010 in your field?
  1. Maturity: the Decision Management space was made up of technology vendors, big and small, that typically focused on one or two aspects of this discipline.  Over the past few years, we have seen a lot of consolidation in the industry – first with Business Intelligence (BI) then Business Process Management (BPM) and lately in Business Rules Management (BRM) and Advanced Analytics.  As a result the giant Platform vendors have helped create visibility for this discipline.  Lots of tiny clues finally bubbled up in 2010 to attest of the increasing activity around Decision Management.  For example, more products than ever were named Decision Manager; companies advertised for Decision Managers as a job title in their job section; most people understand what I do when I am introduced in a social setting!
  2. Boredom: unfortunately, as the industry matures, inevitably innovation slows down…  At the main BRMS shows we heard here and there complaints that the technology was stalling.  We heard it from vendors like Red Hat (Drools) and we heard it from bored end-users hoping for some excitement at Business Rules Forum’s vendor panel.  They sadly did not get it
  3. Scrum: I am not thinking about the methodology there!  If you have ever seen a rugby game, you can probably understand why this is the term that comes to mind when I look at the messy & confusing technology landscape.  Feet blindly try to kick the ball out while superhuman forces are moving randomly the whole pack – or so it felt when I played!  Business Users in search of Business Solutions are facing more and more technology choices that feel like comparing apples to oranges.  There is value in all of them and each one addresses a specific aspect of Decision Management but I regret that the industry did not simplify the picture in 2010.  On the contrary!  Many buzzwords were created or at least made popular last year, creating even more confusion on a muddy field.  A few examples: Social CRM, Collaborative Decision Making, Adaptive Case Management, etc.  Don’t take me wrong, I *do* like the technologies.  I sympathize with the decision maker that is trying to pick the right solution though.
  4. Information: Analytics have been used for years of course but the volume of data surrounding us has been growing to unparalleled levels.  We can blame or thank (depending on our perspective) Social Media for that.  Sites like Facebook and LinkedIn have made it possible and easy to publish relevant (as well as fluffy) information in real-time.  As we all started to get the hang of it and potentially over-publish, technology evolved to enable the storage, correlation and analysis of humongous volumes of data that we could not dream of before.  25 billion tweets were posted in 2010.  Every month, over 30 billion pieces of data are shared on Facebook alone.  This is not just about vanity and marketing though.  This data can be leveraged for the greater good.  Carlos pointed to some fascinating facts about catastrophic event response team getting organized thanks to crowd-sourced information.  We are also seeing, in the Decision management world, more and more applicability for those very technology that have been developed for the needs of Big Data – I’ll name for example Hadoop that Carlos (yet again) discussed in his talks at Rules Fest end of 2009 and 2010.
  5. Self-Organization: it may be a side effect of the Social Media movement but I must admit that I was impressed by the success of self-organizing initiatives.  Granted, this last trend has nothing to do with Decision Management per se but I think it is a great evolution worth noting.  Let me point to a couple of examples.  I usually attend traditional conferences and tradeshows in which the content can be good but is sometimes terrible.  I was pleasantly surprised by the professionalism and attendance at *un-conferences* such as P-Camp (P stands for Product – an event for Product Managers).  When you think about it, it is already difficult to get a show together when people are dedicated to the tasks.  How crazy is it to have volunteers set one up with no budget and no agenda?  Well, people simply show up to do their part and everyone has fun voting on-site for what seems the most appealing content at the time.  Crowdsourcing applied to shows: it works!  Similar experience with meetups or tweetups.  I also enjoyed attending some impromptu Twitter jam sessions on a given topic.  Social Media is certainly helping people reach out and get together in person or virtually and that is wonderful!

A segment of a social network
Image via Wikipedia

What are the top three trends you see in 2011?

  1. Performance:  I might be cheating here.   I was very bullish about predicting much progress for 2010 in the area of Performance Management in your Decision Management initiatives.  I believe that progress was made but Carlos did not give me full credit for the right prediction…  Okay, I am a little optimistic on timeline…  I admit it…  If it did not fully happen in 2010, can I predict it again in 2011?  I think that companies want to better track their business performance in order to correct the trajectory of course but also to improve their projections.  I see that it is turning into reality already here and there.  I expect it to become a trend in 2011!
  2. Insight: Big Data being available all around us with new technologies and algorithms will continue to propagate in 2011 leading to more widely spread Analytics capabilities.  The buzz at Analytics shows on Social Network Analysis (SNA) is a sign that there is interest in those kinds of things.  There is tremendous information that can be leveraged for smart decision-making.  I think there will be more of that in 2011 as initiatives launches in 2010 will mature into material results.
    5 Ways to Cultivate an Active Social Network
    Image by Intersection Consulting via Flickr
  3. Collaboration:  Social Media for the Enterprise is a discipline in the making.  Social Media was initially seen for the most part as a Marketing channel.  Over the years, companies have started experimenting with external communities and ideation capabilities with moderate success.  The few strategic initiatives started in 2010 by “old fashion” companies seem to be an indication that we are past the early adopters.  This discipline may very well materialize in 2011 as a core capability, well, or at least a new trend.  I believe that capabilities such Chatter, offered by Salesforce, will transform (slowly) how people interact in the workplace and leverage the volumes of social data captured in LinkedIn and other Social Media sites.  Collaboration is of course a topic of interest for me personally.  I even signed up for Kare Anderson’s collaboration collaboration site – yes, twice the word “collaboration”: it is really about collaborating on collaboration techniques.  Even though collaboration does not require Social Media, this medium offers perspectives not available until now.

Brief Bio-

Carole-Ann is a renowned guru in the Decision Management space. She created the vision for Decision Management that is widely adopted now in the industry. Her claim to fame is the strategy and direction of Blaze Advisor, the then-leading BRMS product, while she also managed all the Decision Management tools at FICO (business rules, predictive analytics and optimization). She has a vision for Decision Management both as a technology and a discipline that can revolutionize the way corporations do business, and will never get tired of painting that vision for her audience. She speaks often at Industry conferences and has conducted university classes in France and Washington DC.

Leveraging her Masters degree in Applied Mathematics / Computer Science from a “Grande Ecole” in France, she started her career building advanced systems using all kinds of technologies — expert systems, rules, optimization, dashboarding and cubes, web search, and beta version of database replication – as well as conducting strategic consulting gigs around change management.

She now tweets as @CMatignon, blogs at blog.sparklinglogic.com and interacts at community.sparklinglogic.com.

She started her career building advanced systems using all kinds of technologies — expert systems, rules, optimization, dashboarding and cubes, web search, and beta version of database replication.  At Cleversys (acquired by Kurt Salmon & Associates), she also conducted strategic consulting gigs mostly around change management.

While playing with advanced software components, she found a passion for technology and joined ILOG (acquired by IBM).  She developed a growing interest in Optimization as well as Business Rules.  At ILOG, she coined the term BRMS while brainstorming with her Sales counterpart.  She led the Presales organization for Telecom in the Americas up until 2000 when she joined Blaze Software (acquired by Brokat Technologies, HNC Software and finally FICO).

Her 360-degree experience allowed her to gain appreciation for all aspects of a software company, giving her a unique perspective on the business.  Her technical background kept her very much in touch with technology as she advanced.

She also became addicted to Twitter in the process.  She is active on all kinds of social media, always looking for new digital experience!

Outside of work, Carole-Ann loves spending time with her two boys.  They grow fruits in their Northern California home and cook all together in the French tradition.

profile on LinkedIn

TwitterFollow me on Twitter

Filtering to Gain Social Network Value
Image by Intersection Consulting via Flickr
Social Networks Hype Cycle
Image by fredcavazza via Flickr