Visual Guides to CRISP-DM ,KDD and SEMMA

UPDATED- Here are three great examples of a visualization making a process easy to understand. Please click on the images to read them clearly.

1) It visualizes CRISP-DM and is made by Nicole Leaper (http://exde.wordpress.com/2009/03/13/a-visual-guide-to-crisp-dm-methodology/)

12345

2) KDD -Knowledge Discovery in Databases -visualization by Fayyad whom I have interviewed here at http://www.decisionstats.com/interview-dr-usama-fayyad-founder-open-insights-llc/

and work By Gregory Piatetsky Shapiro interviewed by this website here

https://decisionstats.com/2009/08/13/interview-gregory-piatetsky-kdnuggets-com/

kdd

3) I am also attaching a visual representation of SEMMA from http://www.dataprix.net/en/blogs/respinosamilla/theory-data-mining

metodo-semma

 

Timo Elliott on 2012

Continuing the DecisionStats series on  trends for 2012, Timo Elliott , Technology Evangelist  at SAP Business Objects, looks at the predictions he made in the beginning of  2011 and follows up with the things that surprised him in 2011, and what he foresees in 2012.

You can read last year’s predictions by Mr Elliott at http://www.decisionstats.com/brief-interview-timo-elliott/

Timo- Here are my comments on the “top three analytics trends” predictions I made last year:

(1) Analytics, reinvented. New DW techniques make it possible to do sub-second, interactive analytics directly against row-level operational data. Now BI processes and interfaces need to be rethought and redesigned to make best use of this — notably by blurring the distinctions between the “design” and “consumption” phases of BI.

I spent most of 2011 talking about this theme at various conferences: how existing BI technology israpidly becoming obsolete and how the changes are akin to the move from film to digital photography. Technology that has been around for many years (in-memory, column stores, datawarehouse appliances, etc.) came together to create exciting new opportunities and even generally-skeptical industry analysts put out press releases such as “Gartner Says Data Warehousing Reaching Its Most Significant Inflection Point Since Its Inception.” Some of the smaller BI vendors had been pushing in-memory analytics for years, but the general market started paying more attention when megavendors like SAP started painting a long-term vision of in-memory becoming a core platform for applications, not just analytics. Database leader Oracle was forced to upgrade their in-memory messaging from “It’s a complete fantasy” to “we have that too”.

(2) Corporate and personal BI come together. The ability to mix corporate and personal data for quick, pragmatic analysis is a common business need. The typical solution to the problem — extracting and combining the data into a local data store (either Excel or a departmental data mart) — pleases users, but introduces duplication and extra costs and makes a mockery of information governance. 2011 will see the rise of systems that let individuals and departments load their data into personal spaces in the corporate environment, allowing pragmatic analytic flexibility without compromising security and governance.

The number of departmental “data discovery” initiatives continued to rise through 2011, but new tools do make it easier for business people to upload and manipulate their own information while using the corporate standards. 2012 will see more development of “enterprise data discovery” interfaces for casual users.

(3) The next generation of business applications. Where are the business applications designed to support what people really do all day, such as implementing this year’s strategy, launching new products, or acquiring another company? 2011 will see the first prototypes of people-focused, flexible, information-centric, and collaborative applications, bringing together the best of business intelligence, “enterprise 2.0”, and existing operational applications.

2011 saw the rise of sophisticated, user-centric mobile applications that combine data from corporate systems with GPS mapping and the ability to “take action”, such as mobile medical analytics for doctors or mobile beauty advisor applications, and collaborative BI started becoming a standard part of enterprise platforms.

And one that should happen, but probably won’t: (4) Intelligence = Information + PEOPLE. Successful analytics isn’t about technology — it’s about people, process, and culture. The biggest trend in 2011 should be organizations spending the majority of their efforts on user adoption rather than technical implementation.

Unsurprisingly, there was still high demand for presentations on why BI projects fail and how to implement BI competency centers.  The new architectures probably resulted in even more emphasis on technology than ever, while business peoples’ expectations skyrocketed, fueled by advances in the consumer world. The result was probably even more dissatisfaction in the past, but the benefits of the new architectures should start becoming clearer during 2012.

What surprised me the most:

The rapid rise of Hadoop / NoSQL. The potentials of the technology have always been impressive, but I was surprised just how quickly these technology has been used to address real-life business problems (beyond the “big web” vendors where it originated), and how quickly it is becoming part of mainstream enterprise analytic architectures (e.g. Sybase IQ 15.4 includes native MapReduce APIs, Hadoop integration and federation, etc.)

Prediction for 2012:

As I sat down to gather my thoughts about BI in 2012, I quickly came up with the same long laundry list of BI topics as everybody else: in-memory, mobile, predictive, social, collaborative decision-making, data discovery, real-time, etc. etc.  All of these things are clearly important, and where going to continue to see great improvements this year. But I think that the real “next big thing” in BI is what I’m seeing when I talk to customers: they’re using these new opportunities not only to “improve analytics” but also fundamentally rethink some of their key business processes.

Instead of analytics being something that is used to monitor and eventually improve a business process, analytics is becoming a more fundamental part of the business process itself. One example is a large telco company that has transformed the way they attract customers. Instead of laboriously creating a range of rate plans, promoting them, and analyzing the results, they now use analytics to automatically create hundreds of more complex, personalized rate plans. They then throw them out into the market, monitor in real time, and quickly cull any that aren’t successful. It’s a way of doing business that would have been inconceivable in the past, and a lot more common in the future.

 

About

 

Timo Elliott

Timo Elliott is a 20-year veteran of SAP BusinessObjects, and has spent the last quarter-century working with customers around the world on information strategy.

He works closely with SAP research and innovation centers around the world to evangelize new technology prototypes.

His popular Business Analytics blog tracks innovation in analytics and social media, including topics such as augmented corporate reality, collaborative decision-making, and social network analysis.

His PowerPoint Twitter Tools lets presenters see and react to tweets in real time, embedded directly within their slides.

A popular and engaging speaker, Elliott presents regularly to IT and business audiences at international conferences, on subjects such as why BI projects fail and what to do about it, and the intersection of BI and enterprise 2.0.

Prior to Business Objects, Elliott was a computer consultant in Hong Kong and led analytics projects for Shell in New Zealand. He holds a first-class honors degree in Economics with Statistics from Bristol University, England

Timo can be contacted via Twitter at https://twitter.com/timoelliott

 Part 1 of this series was from James Kobielus, Forrestor at http://www.decisionstats.com/jim-kobielus-on-2012/

Business Analytics Projects

As per me, Analytics Projects get into these four  broad phases-

  • Business Problem  PhaseWhat needs to be done?
  1. Increase Revenues
  2. Cut Costs
  3. Investigate Unusual Events
  4. Project Timelines
  • Technical Problem PhaseTechnical Problems in Project Execution 
  1. Data Availability /Data Quality/Data Augmentation Costs
  2. Statistical -(Technique based approach) , Hypothesis Formulation,Sampling, Iterations
  3. Programming-(Tool based approach) Analytics Platform Coding (Input, Formats,Processing)
  • Technical Solution PhaseProblem Solving using the Tools and Skills Available 
  1. Data Cleaning /Outlier Treatment/Missing Value Imputation
  2. Statistical -(Technique based approach) Error Minimization, Model Validation, Confidence Levels
  3. Programming-(Tool based approach) Analytics Platform Coding (Output, Display,Graphs)
  • Business Solution PhasePut it all together in a word document, presentation and/or spreadsheet
  1. Finalized- Forecasts  , Models and Data Strategies
  2. Improvements  in existing processes
  3.  Control and Monitoring of Analytical Results post Implementation
  4. Legal and Compliance  guidelines to execution
  5. (Internal or External) Client Satisfaction and Expectation Management
  6. Audience Feedback based on presenting final deliverable to broader audience

How to make an analytics project?

Some of the process methodologies I have used and been exposed to while making analytics projects are-1) DMAIC/Six Sigma

While Six Sigma was initially a quality control system, it has also been very succesful in managing projects. The various stages of an analytical project can be divided using the DMAIC methodology.

DMAIC stands for

  • Define
  • Measure
  • Analyze
  • Improve
  • Control

Related to this is DMADV, ( “Design For Six Sigma”)

  • Define
  • Measure and identify CTQs
  • Analyze
  • Design
  • Verify

2) CRISP
CRISP-DM stands for Cross Industry Standard Process for Data Mining

CRISP-DM breaks the process of data mining into six major phases- and these can be used for business analytics projects as well.

  • Business Understanding
  • Data Understanding
  • Data Preparation
  • Modeling
  • Evaluation
  • Deployment

3) SEMMA
SEMMA  stands for

  • sample
  • explore
  • modify
  • model
  • assess

4) ISO 9001

ISO 9001 is a certification as well as a philosophy for making a Quality Management System to measure , reduce and eliminate error and customer complaints. Any customer complaint or followup has to be treated as an error, logged, and investigated for control.

5) LEAN
LEAN is a philosophy to eliminate Wastage in a process. Applying LEAN principles to analytics projects helps a lot in eliminating project bottlenecks, technology compatibility issues and data quality resolution. I think LEAN would be great in data quality issues, and IT infrastructure design because that is where the maximum waste is observed in analytics projects.

6) Demings Plan Do Check Act cycle.

Analytics 2011 Conference

From http://www.sas.com/events/analytics/us/

The Analytics 2011 Conference Series combines the power of SAS’s M2010 Data Mining Conference and F2010 Business Forecasting Conference into one conference covering the latest trends and techniques in the field of analytics. Analytics 2011 Conference Series brings the brightest minds in the field of analytics together with hundreds of analytics practitioners. Join us as these leading conferences change names and locations. At Analytics 2011, you’ll learn through a series of case studies, technical presentations and hands-on training. If you are in the field of analytics, this is one conference you can’t afford to miss.

Conference Details

October 24-25, 2011
Grande Lakes Resort
Orlando, FL

Analytics 2011 topic areas include:

Interview- Top Data Mining Blogger on Earth , Sandro Saitta

Surajustement Modèle 2
Image via Wikipedia

If you do a Google search for Data Mining Blog- for the past several years one Blog will come on top. data mining blog – Google Search http://bit.ly/kEdPlE

To honor 5 years of Sandro Saitta’s blog (yes thats 5 years!) , we cover an exclusive interview with him where he reveals his unique sauce for cool techie blogging.

Ajay- Describe your journey as a scientist and data miner, from early experiences, to schooling to your work/research/blogging.

Sandro- My first experience with data mining was my master project. I used decision tree to predict pollen concentration for the following week using input data such as wind, temperature and rain. The fact that an algorithm can make a computer learn from experience was really amazing to me. I found it so interesting that I started a PhD in data mining. This time, the field of application was civil engineering. Civil engineers put a lot of sensors on their structure in order to understand how they behave. With all these sensors they generate a lot of data. To interpret these data, I used data mining techniques such as feature selection and clustering. I started my blog, Data Mining Research, during my PhD, to share with other researchers.

I then started applying data mining in the stock market as my first job in industry. I realized the difference between image recognition, where 99% correct classification rate is state of the art, and stock market, where you’re happy with 55%. However, the company ambiance was not as good as I thought, so I moved to consulting. There, I applied data mining in behavioral targeting to increase click-through rates. When you compare the number of customers who click with the ones who don’t, then you really understand what class imbalance mean. A few months ago, I accepted a very good opportunity at SICPA. I’m looking forward to resolving new challenges there.

Ajay- Your blog is the top ranked blog for “data mining blog”. Could you share some tips on better blogging for analytics and technical people

Sandro- It’s always difficult to start a blog, since at the beginning you have no reader. Writing for nobody may seem stupid, but it is not. By writing my first posts during my PhD I was reorganizing my ideas. I was expressing concepts which were not always clear to me. I thus learned a lot and also improved my English level. Of course, it’s still not perfect, but I hope most people can understand me.

Next come the readers. A few dozen each week first. To increase this number, I then started to learn SEO (Search Engine Optimization) by reading books and blogs. I tested many techniques that increased Data Mining Research visibility in the blogosphere. I think SEO is interesting when you already have some content published (which means not at the very beginning of your blog). After a while, once your blog is nicely ranked, the main task is to work on the content of the blog. To be of interest, your content must be particular: original, informative or provocative for example. I also had the chance to have a good visibility thanks to well-known people in the field like Kevin Hillstrom, Gregory Piatetsky-Shapiro, Will Dwinnell / Dean Abbott, Vincent Granville, Matthew Hurst and many others.

Ajay- Whats your favorite statistical software and what are the various softwares that you have worked with.
Could you compare and contrast these software as well.

Sandro- My favorite software at this point is SAS. I worked with it for two years. Once you know the language, you can perform ETL and data mining so easily. It’s also very fast compared to others. There are a lot of tools for data mining, but I cannot think of a tool that is as powerful as SAS and, in the same time, has a high-level programming language behind it.

I also worked with R and Matlab. R is very nice since you have all the up-to-date data mining algorithms implemented. However, working in the memory is not always a good choice, especially for ETL. Matlab is an excellent tool for prototyping. It’s not so fast and certainly not done for ETL, but the price is low regarding all the possibilities for data mining. According to me, SAS is the best choice for ETL and a good choice for data mining. Of course, there is the price.

Ajay- What are your favorite techniques and training resources for learning basics of data mining to say statisticians or business management graduates.

Sandro- I’m the kind of guy who likes to read books. I read data mining books one after the other. The fact that the same concepts are explained differently (and by different people) helps a lot in learning a topic like data mining. Of course, nothing replaces experience in the field. You can read hundreds of books, you will still not be a good practitioner until you really apply data mining in specific fields. My second choice after books is blogs. By reading data mining blogs, you will really see the issues and challenges in the field. It’s still not experience, but we are closer. Finally, web resources and networks such as KDnuggets of course, but also AnalyticBridge and LinkedIn.

Ajay- Describe your hobbies and how they help you ,if at all in your professional life.

Sandro- One of my hobbies is reading. I read a lot of books about data mining, SEO, Google as well as Sci-Fi and Fantasy. I’m a big fan of Asimov by the way. My other hobby is playing tennis. I think I simply use my hobbies as a way to find equilibrium in my life. I always try to find the best balance between work, family, friends and sport.

Ajay- What are your plans for your website for 2011-2012.

Sandro- I will continue to publish guest posts and interviews. I think it is important to let other people express themselves about data mining topics. I will not write about my current applications due to the policies of my current employer. But don’t worry, I still have a lot to write, whether it is technical or not. I will also emphasis more on my experience with data mining, advices for data miners, tips and tricks, and of course book reviews!

Standard Disclosure of Blogging- Sandro awarded me the Peoples Choice award for his blog for 2010 and carried out my interview. There is a lot of love between our respective wordpress blogs, but to reassure our puritan American readers- it is platonic and intellectual.

About Sandro S-



Sandro Saitta is a Data Mining Research Engineer at SICPA Security Solutions. He is also a blogger at Data Mining Research (www.dataminingblog.com). His interests include data mining, machine learning, search engine optimization and website marketing.

You can contact Mr Saitta at his Twitter address- 

https://twitter.com/#!/dataminingblog

Brief Interview Timo Elliott

Here is a brief interview with Timo Elliott.Timo Elliott is a 19-year veteran of SAP Business Objects.

Ajay- What are the top 5 events in Business Integration and Data Visualization services you saw in 2010 and what are the top three trends you see in these in 2011.


Timo-

Top five events in 2010:

(1) Back to strong market growth. IT spending plummeted last year (BI continued to grow, but more slowly than previous years). This year, organizations reopened their wallets and funded new analytics initiatives — all the signs indicate that BI market growth will be double that of 2009.

(2) The launch of the iPad. Mobile BI has been around for years, but the iPad opened the floodgates of organizations taking a serious look at mobile analytics — and the easy-to-use, executive-friendly iPad dashboards have considerably raised the profile of analytics projects inside organizations.

(3) Data warehousing got exciting again. Decades of incremental improvements (column databases, massively parallel processing, appliances, in-memory processing…) all came together with robust commercial offers that challenged existing data storage and calculation methods. And new “NoSQL” approaches, designed for the new problems of massive amounts of less-structured web data, started moving into the mainstream.

(4) The end of Google Wave, the start of social BI.Google Wave was launched as a rethink of how we could bring together email, instant messaging, and social networks. While Google decided to close down the technology this year, it has left its mark, notably by influencing the future of “social BI”, with several major vendors bringing out commercial products this year.

(5) The start of the big BI merge. While several small independent BI vendors reported strong growth, the major trend of the year was consolidation and integration: the BI megavendors (SAP, Oracle, IBM, Microsoft) increased their market share (sometimes by acquiring smaller vendors, e.g. IBM/SPSS and SAP/Sybase) and integrated analytics with their existing products, blurring the line between BI and other technology areas.

Top three trends next year:

(1) Analytics, reinvented. New DW techniques make it possible to do sub-second, interactive analytics directly against row-level operational data. Now BI processes and interfaces need to be rethought and redesigned to make best use of this — notably by blurring the distinctions between the “design” and “consumption” phases of BI.

(2) Corporate and personal BI come together. The ability to mix corporate and personal data for quick, pragmatic analysis is a common business need. The typical solution to the problem — extracting and combining the data into a local data store (either Excel or a departmental data mart) — pleases users, but introduces duplication and extra costs and makes a mockery of information governance. 2011 will see the rise of systems that let individuals and departments load their data into personal spaces in the corporate environment, allowing pragmatic analytic flexibility without compromising security and governance.

(3) The next generation of business applications. Where are the business applications designed to support what people really do all day, such as implementing this year’s strategy, launching new products, or acquiring another company? 2011 will see the first prototypes of people-focused, flexible, information-centric, and collaborative applications, bringing together the best of business intelligence, “enterprise 2.0”, and existing operational applications.

And one that should happen, but probably won’t:

(4) Intelligence = Information + PEOPLE. Successful analytics isn’t about technology — it’s about people, process, and culture. The biggest trend in 2011 should be organizations spending the majority of their efforts on user adoption rather than technical implementation.                 About- http://timoelliott.com/blog/about

Timo Elliott is a 19-year veteran of SAP BusinessObjects, and has spent the last twenty years working with customers around the world on information strategy.

He works closely with SAP research and innovation centers around the world to evangelize new technology prototypes.

His popular Business Analytics and SAPWeb20 blogs track innovation in analytics and social media, including topics such as augmented corporate reality, collaborative decision-making, and social network analysis.

His PowerPoint Twitter Tools lets presenters see and react to tweets in real time, embedded directly within their slides.

A popular and engaging speaker, Elliott presents regularly to IT and business audiences at international conferences, on subjects such as why BI projects fail and what to do about it, and the intersection of BI and enterprise 2.0.

Prior to Business Objects, Elliott was a computer consultant in Hong Kong and led analytics projects for Shell in New Zealand. He holds a first-class honors degree in Economics with Statistics from Bristol University, England. He blogs on http://timoelliott.com/blog/ (one of the best designed blogs in BI) . You can see more about him personal web site here and photo/sketch blog here. You should follow Timo at http://twitter.com/timoelliott

Art Credit- Timo Elliott

Related Articles