karl rexer – DECISION STATS

The Popularity of Data Analysis Software

Here is a nice page by Bob Muenchen (author of the books “R for SAS and SPSS Users” and “R for Stata Users”).

It is available at http://r4stats.com/popularity and uses a variety of methods, including Google Insights, Page Rank, Link analysis, as well as information from Rexer Analytics and KDNuggets.

I believe the following two graphs sum it all up:

1 Number of Jobs at Monster.com using keywords

2 Google Scholar’s analysis of academic papers

Despite R’s rapid growth, which is clearly evident in both jobs and publications, it still lags behind SAS and SPSS. So whether you are a corporate user or an academic user, it makes sense to have more than one skill just to be sure. What do you think? Is learning R mutually exclusive of learning SAS or SPSS? See http://r4stats.com/popularity for the complete analysis by Bob Muenchen.

It also shows the tremendous opportunity for companies like Revolution Analytics and XL Solutions (http://www.experience-rplus.com/), since the potential for growth is clear.

Rexer Analytics Annual Data Miner Survey

HIGHLIGHTS from the 3rd Annual Data Miner Survey:

40-item survey of data miners, conducted on-line in early 2009.

710 participants from 58 countries.

Data miners’ most commonly used algorithms are regression, decision trees, and cluster analysis.

Data mining is playing an important role in organizations.

Half of data miners say their results are helping to drive strategic decisions and operational processes.

58% say they are adding to the knowledge base in the field.

60% of respondents say the results of their modeling are deployed always or most of the time.

Most data miners feel that the economy will not negatively impact them.

Almost half of industry data miners rate the analytic capabilities of their company as above average or excellent.  But 19% feel their company has minimal or no analytic capabilities.

The top challenges facing data miners are dirty data, explaining data mining to others, and difficult access to data.  However, in 2009 fewer data miners listed data quality and data access as challenges than in the previous year.

IBM SPSS Modeler (SPSS Clementine), Statistica, and IBM SPSS Statistics (SPSS Statistics) are identified as the “primary tools” used by the most data miners.

Open-source tools Weka and R made substantial movement up data miners’ tool rankings this year, and are now used by large numbers of both academic and for-profit data miners.

SAS Enterprise Miner dropped in data miners’ tool rankings this year.

Users of IBM SPSS Modeler, Statistica, and Rapid Miner are the most satisfied with their software.

Fields & Industries:  Data mining is everywhere.  The most cited areas are CRM / Marketing, Academic, Financial Services, & IT / Telecom.  And in the for-profit sector, the departments data miners most frequently work in are Marketing & Sales and Research & Development.

Additional information is available on the Rexer Analytics website. I find their annual survey one of the most useful in summarizing the entire data mining and analytics landscape.

SPSS Directions : Rexer Survey Results

Here are some results shared by Dr Karl Rexer of Rexer Analytics. They were presented at SPSS Directions.

When asked to select all of the software packages they use for data mining, each person selected an average of 5 tools. More data miners reported using SPSS Statistics than any other tool. And when we asked people to indicate their primary data mining tool, the tool selected by the most data miners was SPSS Modeler (Clementine). The SPSS people were also thrilled to see that Clementine was #1 in customer satisfaction: everyone (N=78) who identified it as their primary tool was satisfied or very satisfied. It’s pretty amazing that not even one person was neutral (it was a 5-point scale).

For a detailed poster on the results, contact www.RexerAnalytics.com. More than 710 data mining professionals completed the survey.

SAS Data Mining 2009 Las Vegas

I am going to Las Vegas as a guest of SAS Institute for the Data Mining 2009 Conference. (Note: FCC regulations on bloggers come into effect in December, but my current policies are on the ADVERTISE page and have been unchanged for some months now.)

SAS Institute, the big heavyweight of analytics, showcases events at both the SAS Global Forum and the Data Mining 2009 conference, which has a virtual who’s who of partners. This includes my friends at Aster Data and Shawn Rogers, Beye Network, in addition to Anne Milley, Senior Product Director. Anne is a frequent speaker for SAS Institute and has shrugged off the beginning-of-the-year NY Times spat with R / open source. True to their word, they did go ahead and launch SAS/IML with the interface to R, mindful of the GPL as well as open source sentiments.

While SPSS does have a data mining product, there is considerable discussion on that help list today about the direction in which IBM will allow the data mining product to evolve.

Charlie Berger, from Oracle Data Mining, also announced at Oracle World that he is going to launch a GUI-based data mining product for free (or probably under a Software as a Service model). Thanks to Karl Rexer from Rexer Analytics for this tip.

While this is my first trip to Las Vegas (a change from cold TN weather), I hope to pick up new material on data mining, including sessions on blog and text mining and statistical usage of the same. Data mining continues to be an enduring passion for me, even though I may need a divine miracle to get my PhD funded on that topic.

Also I may have some tweets at #M2009 for you and some video interviews/ photos. Ok- Watch this space.

PS: We lost to Alabama, #2 in the country, by two points because 2 punts were blocked by hand, which is as close as it gets.

Next week I hope to watch the South Carolina match in Orange Country.

How to use Oracle for Data Mining

Oracle for Data Mining!!!! That’s right, I am talking about the same database company that made waves by acquiring Sun (and the beloved Java) and has been stealing market share left and right.

Here is some techie-specific help: if you know SQL (or even Proc SQL), you can learn Oracle Data Mining in less than an hour, good enough to clear that job shortlist.

Check out the attached sample code examples. They are designed to run on the ODM demo data, but you could change that easily. They are posted on OTN here:

Sample Code Demonstrating Oracle 11.1 Data Mining (230KB)
These files include sample programs in PL/SQL and Java illustrating each of the algorithms supported by Oracle Data Mining 11.1. There are examples of automatic data preparation and data transformations appropriate for each algorithm. Several programs illustrate the text transformation and text mining process.

Oracle Data Mining PL/SQL Sample Programs

The PL/SQL sample programs illustrate each algorithm supported by Oracle Data Mining as well as text transformation and text mining using NMF and SVM classification. Transformations that prepare the data for mining are included in the programs. Execute the PL/SQL sample programs.

Mining Function	Algorithm	Sample Program
Anomaly Detection	One-Class Support Vector Machine	`dmsvodem.sql`
Association Rules	Apriori	`dmardemo.sql`
Attribute Importance	Minimum Description Length	`dmaidemo.sql`
Classification	Adaptive Bayes Network (deprecated)	`dmabdemo.sql`
Classification	Decision Tree	`dmdtdemo.sql`
Classification	Decision Tree (cross validation)	`dmdtxvlddemo.sql`
Classification	Logistic Regression	`dmglcdem.sql`
Classification	Naive Bayes	`dmnbdemo.sql`
Classification	Support Vector Machine	`dmsvcdem.sql`
Clustering	k-Means	`dmkmdemo.sql`
Clustering	O-Cluster	`dmocdemo.sql`
Feature Extraction	Non-Negative Matrix Factorization	`dmnmdemo.sql`
Regression	Linear Regression	`dmglrdem.sql`
Regression	Support Vector Machine	`dmsvrdem.sql`
Text Mining	Text transformation using Oracle Text	`dmtxtfe.sql`
Text Mining	Non-Negative Matrix Factorization	`dmtxtnmf.sql`
Text Mining	Support Vector Machine (Classification)	`dmtxtsvm.sql`

And a particularly cute and nifty example of fraud (as in fraud detection 😉):

-- clean up any earlier run, then define the model settings
drop table CLAIMS_SET;
exec dbms_data_mining.drop_model('CLAIMSMODEL');
create table CLAIMS_SET (setting_name varchar2(30), setting_value varchar2(4000));
insert into CLAIMS_SET values ('ALGO_NAME','ALGO_SUPPORT_VECTOR_MACHINES');
insert into CLAIMS_SET values ('PREP_AUTO','ON');
commit;

-- build the model on the CLAIMS table, using POLICYNUMBER as the case id
begin
  dbms_data_mining.create_model('CLAIMSMODEL', 'CLASSIFICATION',
    'CLAIMS', 'POLICYNUMBER', null, 'CLAIMS_SET');
end;
/

-- accuracy (per-class and overall)
col actual format a6
select actual, round(corr*100/total,2) percent, corr, total-corr incorr, total from
  (select actual, sum(decode(actual,predicted,1,0)) corr, count(*) total from
    (select CLAIMS actual, prediction(CLAIMSMODEL using *) predicted
     from CLAIMS_APPLY)
   group by rollup(actual));

-- top 5 most suspicious claims where the number of previous claims is 2 or more:
select * from
  (select POLICYNUMBER, round(prob_fraud*100,2) percent_fraud,
          rank() over (order by prob_fraud desc) rnk from
    (select POLICYNUMBER, prediction_probability(CLAIMSMODEL, '0' using *) prob_fraud
     from CLAIMS_APPLY
     where PASTNUMBEROFCLAIMS in ('2 to 4', 'more than 4')))
where rnk <= 5
order by percent_fraud desc;
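If you do not have an Oracle instance handy, the logic of the two reporting queries above can be tried out in plain Python. This is only a rough sketch of what the SQL computes, not Oracle's implementation: the function names `accuracy_rollup` and `top_suspicious` and the sample data are made up for illustration, and the scores stand in for whatever `prediction_probability` would return for class '0'.

```python
from collections import Counter


def accuracy_rollup(pairs):
    """Per-class and overall accuracy from (actual, predicted) pairs.

    Mirrors the decode(...) / group by rollup(actual) query: one row per
    class, then a final overall row whose label is None.
    Each row is (label, percent, correct, incorrect, total).
    """
    total = Counter()
    correct = Counter()
    for actual, predicted in pairs:
        total[actual] += 1
        if actual == predicted:  # same test as decode(actual, predicted, 1, 0)
            correct[actual] += 1

    def row(label, corr, tot):
        return (label, round(corr * 100 / tot, 2), corr, tot - corr, tot)

    rows = [row(label, correct[label], total[label]) for label in sorted(total)]
    rows.append(row(None, sum(correct.values()), sum(total.values())))
    return rows


def top_suspicious(scored, k=5):
    """Keep rows whose rank by descending score is <= k.

    Mirrors rank() over (order by prob_fraud desc): tied scores share a
    rank, so more than k rows can come back when there are ties.
    `scored` is a list of (policy_number, prob_fraud) tuples.
    """
    ordered = sorted(scored, key=lambda item: item[1], reverse=True)
    result = []
    rank = 0
    previous = None
    for position, (policy, prob) in enumerate(ordered, start=1):
        if prob != previous:  # a new score value starts a new rank
            rank = position
            previous = prob
        if rank > k:
            break
        result.append((policy, round(prob * 100, 2), rank))
    return result


# Hypothetical data, only to exercise the two functions.
sample_pairs = [("0", "0"), ("0", "1"), ("1", "1"), ("1", "1"), ("1", "0")]
sample_scores = [("P1", 0.9), ("P2", 0.8), ("P3", 0.8), ("P4", 0.1)]

for line in accuracy_rollup(sample_pairs):
    print(line)
for line in top_suspicious(sample_scores, k=2):
    print(line)
```

The tie handling is worth noticing: because the SQL uses rank() rather than row_number(), a filter like rnk <= 5 can return more than five policies when scores are tied, and the sketch behaves the same way.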

Coming up- a series of tutorials on learning the skills by just sitting in your home.

Hat tip: Karl Rexer, Rexer Analytics, and Charlie Berger, Oracle.

Interview Karl Rexer -Rexer Analytics

Here is an interview with Karl Rexer of Rexer Analytics. His annual survey is considered a benchmark in the data mining and analytics industry. Here Karl talks of his career, his annual survey and his views on the industry direction and trends.

Almost 20% of data miners report that their company/organizations have only minimal analytic capabilities – Karl Rexer

Ajay- Describe your career in science. What advice would you give to young science graduates in this recession? What advice would you give to high school students choosing between science and non-science careers?

Karl- My interests in science began as a child. My father has multiple science degrees, and I grew up listening to his descriptions of the cool things he was building, or the cool investigative tools he was using, in his lab. He worked in an industrial setting, so visiting was difficult. But when I could, I loved going in to see the high-temperature furnaces he was designing, the carbon-fiber production processes he was developing, and the electron microscope that allowed him to look at his samples. Both of my parents encouraged me to ask why, and to think critically about both scientific and social issues. It was also the time of the Apollo moon landings, and I was totally absorbed in watching and thinking about them. Together these things motivated me and shaped my world-view.

I have also had the good fortune to work across many diverse areas and with some truly outstanding people. In graduate school I focused on applied statistics and the use of scientific methods in the social sciences. As a grad student and young academic, I applied those skills to researching how our brains process language. But on the side, I pursued a passion for using the scientific method and analytics to address ... well, anything I could. We called it “statistical consulting” then, but it often extended to research design and many other parts of the scientific process. Some early projects included assisting people with AIDS outcome studies, psycholinguistic research, and studies of adolescent adjustment.

My first taste of applying these skills outside of an academic environment was with my mentor Len Katz. The US Navy hired us to help assess the new recruits that were entering the submarine school. Early identification of sailors who would excel in this unusual and stressful environment was critical. Perhaps even more important was identifying sailors who would not perform well in that environment. Luckily, the Navy had years of academic and psychological testing on many sailors, and this data proved quite useful in predicting later job performance onboard the submarines. Even though we never got the promised submarine ride, I was hooked on applying measurement, scientific methods, and analytics in non-academic settings.

And that’s basically what I have continued to do – apply those skills and methods in diverse scientific and business settings. I worked for two banks and two consulting firms before founding Rexer Analytics in 2002. Last year we supported 30 clients. I’ve got great staff and they have great quant skills. Importantly, we also don’t hesitate to challenge each other, and we’re continually learning from each other and from each client engagement. We share a love of project diversity, and we seek it out in our engagements. We’ve forecasted sales for medical devices, measured B2B customer loyalty, identified manufacturing problems by analyzing product returns, predicted which customers will close their bank accounts, analyzed millions of tax returns, helped identify the dimensions of business team cohesion that result in better performance, found millions of dollars of B2B and B2C fraud, and helped many companies understand their customers better with segmentations, surveys, and analyses of sales and customer behavior.

The advice I would give to young science grads in this recession is to expand your view of where you can apply your scientific training. This applies to high school students considering science careers too. Not all science happens in universities, labs and other traditional science locations. Think about applying scientific methods everywhere! Sometimes our projects at Rexer Analytics seem far away from what most people would consider “science.” But we’re always asking “what data is available that can be brought to bear on the business issue we’re addressing.” Sometimes the best solution is to go out and collect more data – so we frequently help our clients improve their measurement processes or design surveys to collect the necessary data. I think there are enormous opportunities for science grads to apply their scientific training in the business world. The opportunities are not limited to physics whiz kids making models for Wall Street trading or computer science students moving to Silicon Valley. One of the best analytic teams I ever worked on was at Fleet Bank in the late 90s. We had an economist, two physicists, a sociologist, a psychologist, an operations research guy, and a person with a degree in marketing science. We were all very focused on data, measurement, and analytic methods.

I recommend that all science grads read Tom Davenport’s book Competing on Analytics. It illustrates, with compelling examples, how businesses can benefit from using science and analytics. Several examples in Tom’s book come from Gary Loveman, CEO of Harrah’s Entertainment. I think that Gary also serves as a great example of how scientific methods can be applied in every industry. Gary has a PhD in economics from MIT, he’s worked at the Federal Reserve Bank, he’s been a professor at Harvard, but more recently he runs the world’s largest casino and gaming company. And he’s famously said many times that there are three ways to get fired at Harrah’s: steal, harass women, or not use a control group. Business leaders across all industries are increasingly wanting data, analytics and scientific decision-making. Science grads have great training that enables them to take on these roles and to demonstrate the success of these methods.

Ajay- One more survey- How does the Rexer survey differentiate itself from other surveys out there?

Karl- The Annual Rexer Analytics Data Miner Survey is the only broad-reaching research that investigates the analytic behaviors, views and preferences of data mining professionals. Each year our sample grows — in 2009 we had over 700 people around the globe complete our survey. Our participants include large numbers of both academic and business people.

Another way our survey is differentiated from other surveys is that each year we ask our participants to provide suggestions on ways to improve the survey. Incorporating participants’ suggestions improves our survey. For example, in 2008 several people suggested adding questions about model deployment and off-shoring. We asked about both of these topics in the 2009 survey.

Ajay -Could you please share some sneak previews of the survey results? What impact is the recession likely to have on IT spending?

Karl- We’re just starting to analyze the 2009 survey data. But, yes, here’s a peek at some of the findings that relate to the impact of the recession:

* Many data miners report that funding for data mining projects can sometimes be a problem.
* However, when asked what will happen in 2009 if the economic downturn continues, many data miners still anticipate that their company/organization will conduct more data mining projects in 2009 than in previous years (41% anticipate more projects in 2009; 27% anticipate fewer projects).
* The vast majority of companies conduct their data mining internally, and very few are sending data mining off-shore.

I don’t have a crystal ball that tells me about the trends in overall corporate spending on IT, Business Intelligence, or Data Mining. It’s my personal experience that many budgets are tight this year, but that key projects are still getting funded. And it is my strong opinion that in the coming years many companies will increase their focus on analytics, and I think that increasingly analytics will be a source of competitive advantage for these companies.

There are other people and other surveys that provide better insight into the trends in IT spending. For example, Gartner’s recent survey of over 1,500 CIOs (http://www.gartner.com/it/page.jsp?id=855612 ) suggests that 2009 IT spending is likely to be flat. I’m personally happy to see that in the Gartner survey, Business Intelligence is again CIOs’ top technology priority, and that “increasing the use of information/analytics” is the #5 business priority.

Ajay- I noticed you advise SPSS among others. Describe what an advisory role is for an analytics company and how can small open source companies get renowned advisors?

Karl- We have advised Oracle, SPSS, Hewlett-Packard and several smaller companies. We find that advisory roles vary greatly. The biggest source of variation is what the company wants advice about. Examples include:

* assessing opportunity areas for the application of analytics
* strategic data assessments
* analytic strategy
* product strategy
* reviewing software

Both large and small companies that look to apply analytics to their businesses can benefit from analytic advisors. So can open source companies that sell analytic software. Companies can find analytic advisors in several ways. One way is to look around for analytic experts whose advice you trust, and hire them. Networking in your own industry and in the analytic communities can identify potential advisors. Don’t forget to look in both academia and the business world. Many skilled people cross back and forth between these two worlds. Another way for these companies to obtain analytic advice is to look in their business networks and user communities for analytic specialists who share some of the goals of the company – they will be motivated for your company to succeed. Especially if focused topic areas or time-constrained tasks can be identified, outside experts may be willing to donate their time, and they may be flattered that you asked.

Ajay- What made you decide to begin the Rexer Surveys? Describe some results of last year’s surveys and any trends from the last three years that you have seen.

Karl- I’ve been involved on the organizing committees of several data mining workshops and conferences. At these conferences I talk with a lot of data miners and companies involved in data mining. I found that many people were interested in hearing about what other data miners were doing: what algorithms, what types of data, what challenges were being faced, what they liked and disliked about their data mining tools, etc. Since we conduct online surveys for several of our clients, and my network of data miners is pretty large, I realized that we could easily do a survey of data miners, and share the results with the data mining community. In the first year, 314 data miners participated, and it’s just grown from there. In 2009 over 700 people completed the survey. The interest we’ve seen in our research summaries has also been astounding – we’ve had thousands of requests. Overall, this just confirms what we originally thought: people are hungry for information about data mining.

Here is a preview of findings from the initial analyses of the 2009 survey data:

* Each year we’ve seen that the most commonly used algorithms are decision trees, regression, and cluster analysis.
* Consistently, some of the top challenges data miners report are dirty data and explaining data mining to others. Previously, data access issues were also reported as a big challenge, but in 2009 fewer data miners reported facing this challenge.
* The most prevalent concerns with how data mining is being utilized are: insufficient training of some data miners, and resistance to using data mining in contexts where it would be beneficial.
* Data mining is playing an important role in organizations. Half of data miners indicate their results are helping to drive strategic decisions and operational processes.
* But there’s room for data mining to grow – almost 20% of data miners report that their company/organizations have only minimal analytic capabilities.

Bio-

Karl Rexer, PhD is President of Rexer Analytics, a small Boston-based consulting firm. Rexer Analytics provides analytic and CRM consulting to help clients use their data to make better strategic and tactical decisions. Recent projects include fraud detection, sales forecasting, customer segmentation, loyalty analyses, predictive modeling for cross-sell and attrition, and survey research. Rexer Analytics also conducts an annual survey of data miners and freely distributes research summaries to the data mining community. Karl has been on the organizing committees of several international data mining conferences, including 3 KDD conferences, and BIWA-2008. Karl is on the SPSS Customer Advisory Board and on the Board of Directors of the Oracle Business Intelligence, Warehousing, & Analytics (BIWA) Special Interest Group. Karl and other Rexer Analytics staff are frequent invited speakers at MBA data mining classes and conferences.

To learn more, check out the website at www.rexeranalytics.com
