Data Mining Survey Results: Tools and Offshoring

Here are some survey results from Rexer Analytics.

The graphics seem self-explanatory: terrific data visualization.

1) The field of data mining seems ripe either for more offshoring to cut costs, or for price pressure on software (read: more R and SaaS) and hardware (more cloud/time-sharing?).

2) Satisfaction with R and SAS seems similar, but R seems to score higher than the other flavors.

3) An added dimension of utility, say (satisfaction in terms of analyst comfort + functionality in terms of business benefit) divided by (license + training + installation + transition costs), would add a further layer of analysis.

But these are not final results; for those, you need to see Dr Karl at Rexer Analytics.

SPSS Directions: Rexer Survey Results


Here are some results shared by Dr Karl Rexer of Rexer Analytics - they were presented at SPSS Directions.

When asked to select all of the software packages they use for data mining, each person selected an average of 5 tools. More data miners reported using SPSS Statistics than any other tool. And when we asked people to indicate their primary data mining tool, the tool selected by the most data miners was SPSS Modeler (Clementine). The SPSS people were also thrilled to see that Clementine was #1 in customer satisfaction: everyone (N=78) who identified it as their primary tool was satisfied or very satisfied. It’s pretty amazing that not even one person was neutral (it was a 5-point scale).

For a detailed poster of the results, contact Rexer Analytics at www.RexerAnalytics.com. More than 710 data mining professionals completed the survey.

Webcasts: Oracle Data Mining

How do you do data mining for free? See these webcasts, unless you are too busy writing cheques for other software.

Wednesday, November 4, 12 noon Eastern
BIWA SIG Wednesday TechCast Series – 11th Event!
Oracle Data Mining for Text, Clustering, and Classification: Case Study of the Oracle OpenWorld Recommendation Engine
Mark Hornick –  Oracle Data Mining Development – Oracle

Coming out of Oracle OpenWorld 09, some of the hot topics for data mining include text mining, document clustering, and recommendation engines. Learn how Oracle Data Mining was used in the Schedule Builder application at Oracle OpenWorld 2008 and 2009 to generate session recommendations and find similar and related sessions. This session includes a demonstration using the advanced text mining features of the Oracle Data Miner user interface. We also characterize recommendation effectiveness using objective test metrics and summarize recommendation usage.
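
For readers who want a feel for what this looks like in code, here is a minimal sketch, with hypothetical table, column, and model names, of building a k-Means clustering model in-database with the DBMS_DATA_MINING package; the webcast itself covers the real Schedule Builder implementation.

-- Minimal sketch (hypothetical names): cluster conference session records in-database.
-- Assumes a SESSIONS table keyed by SESSION_ID with attribute columns already prepared.
create table session_cluster_set (setting_name varchar2(30), setting_value varchar2(4000));
insert into session_cluster_set values ('ALGO_NAME', 'ALGO_KMEANS');
insert into session_cluster_set values ('CLUS_NUM_CLUSTERS', '10');
insert into session_cluster_set values ('PREP_AUTO', 'ON');
commit;

begin
  dbms_data_mining.create_model('SESSION_CLUSTERS', 'CLUSTERING',
    'SESSIONS', 'SESSION_ID', null, 'SESSION_CLUSTER_SET');
end;
/

-- Assign each session to a cluster so "similar sessions" can be surfaced together.
select SESSION_ID, cluster_id(SESSION_CLUSTERS using *) as cluster_assignment
from SESSIONS;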

Presenter:
Mark Hornick is Senior Manager, Data Mining Development, in Oracle’s Server Technology organization.  He has many years of experience in data mining and is a leader of the JSR-73 standards body for data mining.  Mark wrote the book Java Data Mining: Strategy, Standard, and Practice.

Audio Dial-In: 866 682 4770  Audio Meeting ID:  1683901  Audio Meeting Passcode:  334451
Web Conference:
https://conference.oracle.com/imtapp/app/cmn_jm_hub.uix?mID=161895628
Compatibility Check:  If you have not used Oracle’s web conference system before, please ensure your system compatibility by going to https://conference.oracle.com/imtapp/app/nuf_sys.uix.

Other Wednesday TechCasts Coming Soon (always at noon Eastern time)
– Mark Your Calendars!

  • November 18, 2009 – OBIEE Case Study:  A High Performance Reporting Platform at Gallup
    Swapan Golla and Tom Ruhlman – Analytics and Gallup Net Analytics, Gallup Inc.
  • December 9, 2009 – Best practices for deploying a Data Warehouse on Oracle Database 11g
    Maria Colgan – Oracle Data Warehousing Product Management, Oracle

SAS Data Mining 2009 Las Vegas

    I am going to Las Vegas as a guest of SAS Institute for the Data Mining 2009 conference. (Note: FCC regulations on bloggers come into effect in December, but my current policies are on the ADVERTISE page, unchanged for some months now.)

    SAS Institute, the big heavyweight of analytics, showcases events at both SAS Global Forum and the Data Mining 2009 conference, which has a virtual who's-who of partners there. This includes my friends at Aster Data and Shawn Rogers of the Beye Network, in addition to Anne Milley, Senior Product Director. Anne is a frequent speaker for SAS Institute and has shrugged off the beginning-of-the-year NY Times spat over R/open source. True to their word, they did go ahead and launch SAS/IML with the interface to R, mindful of the GPL as well as open-source sentiments.

    While SPSS does have a data mining product, there is considerable discussion on its help list today about what direction IBM will allow the data mining product to evolve in.

    Charlie Berger of Oracle Data Mining also announced at Oracle OpenWorld that he is going to launch a GUI-based data mining product for free (or probably under a Software as a Service model). Thanks to Karl Rexer of Rexer Analytics for this tip.

    While this is my first trip to Las Vegas (a change from cold TN weather), I hope to pick up new stuff on data mining, including sessions on blog and text mining and the statistical usage of the same. Data mining continues to be an enduring passion for me, even though I may need a divine miracle for my PhD on that topic to get funded.

    Also, I may have some tweets at #M2009 for you, and some video interviews/photos. Ok, watch this space.

    PS: We lost to Alabama, #2 in the country, by two points because two punts were blocked by hand; it was as close as it gets.

    Next week I hope to watch the South Carolina match in Orange Country.


    Interview Michael Zeller, CEO Zementis, on PMML

    Here is a topic-specific interview with Michael Zeller of Zementis on PMML, the de facto standard for data mining models.


    Ajay- What is PMML?

    Mike- The Predictive Model Markup Language (PMML) is the leading standard for statistical and data mining models and is supported by all leading analytics vendors and organizations. With PMML, it is straightforward to develop a model on one system using one application and deploy the model on another system using another application. PMML reduces complexity and bridges the gap between development and production deployment of predictive analytics.

    PMML is governed by the Data Mining Group (DMG), an independent, vendor-led consortium that develops data mining standards.

    Ajay- How can PMML help any business?

    Mike– PMML ensures business agility with respect to data mining, predictive analytics, and enterprise decision management. It provides one standard, one deployment process, across all applications, projects and business divisions. In this way, business stakeholders, analytic scientists, and IT are finally speaking the same language.

    In the current global economic crisis more than ever, a company must become more efficient and optimize business processes to remain competitive. Predictive analytics is widely regarded as the next logical step, implementing more intelligent, real-time decisions across the enterprise.

    However, the deployment of decisions based on predictive models and statistical algorithms has been a hurdle for many companies. Typically, it has been a complex, costly process to get such models integrated into operational systems. With the PMML standard, this no longer is the case. PMML simply eliminates the deployment complexity for predictive models.

    A standard also provides choices among vendors, allowing us to implement best-of-breed solutions, and creating a common knowledge framework for internal teams – analytics, IT, and business – as well as external vendors and consultants. In general, having a solid standard is a sign of a mature analytics industry, creating more options for users and, most importantly, propelling the total analytics market to the next level.

    Ajay- Can PMML help your existing software in analytics and BI?

    Mike- PMML has been widely accepted among vendors; almost all major analytics and business intelligence vendors already support the standard. If you have any such software package in-house, you most likely have PMML at your disposal already.

    For example, you can develop your models in any of the tools that support PMML, e.g., SPSS, SAS, Microstrategy, or IBM, and then deploy that model in ADAPA, which is the Zementis decision engine. Or you can even choose from various open source tools, like R and KNIME.


    Ajay- How do Zementis, ADAPA, and PMML fit together?

    Mike- Zementis has been an avid supporter of the PMML standard and is very active in the development of the standard. We contributed to the PMML package for the open source R Project. Furthermore, we created a free PMML Converter tool which helps users validate and correct PMML files from various vendors and convert legacy PMML files to the latest version of the standard.

    Most prominently with ADAPA, Zementis launched the first cloud-computing scoring engine on the Amazon EC2 cloud. ADAPA is a highly scalable deployment, integration and execution platform for PMML-based predictive models. Not only does it give you all the benefits of being fully standards-based, using PMML and web services, but it also leverages the cloud for scalability and cost-effectiveness.

    By being a Software as a Service (SaaS) application on Amazon EC2, ADAPA provides extreme flexibility, from casual usage which only costs a few dollars a month all the way to high-volume mission critical enterprise decision management which users can seamlessly launch in the United States or in European data centers.

    Ajay- What are some examples where PMML helped companies save money?

    Mike- For any consulting company focused on developing predictive analytics models for clients, PMML provides tremendous benefits, both for the clients and the service provider. Standardizing on PMML defines a clear deliverable – a PMML model – which clients can deploy instantly. There are no fixed requirements on which specific tools to choose for development or deployment; it is only important that the model adheres to the PMML standard, which becomes the common interface between the business partners. This eliminates miscommunication and lowers the overall project cost. Another example is where a company has taken advantage of the capability to move models instantly from development to operational deployment. It allows them to quickly update models based on market conditions, say in the area of risk management and fraud detection, or to roll out new marketing campaigns.

    Personally, I think the biggest opportunities are still ahead of us as more and more businesses embrace operational predictive analytics. The true value of PMML is to facilitate a real-time decision environment where we leverage predictive models in every business process, at every customer touch point, and on demand to maximize value.

    Ajay- Where can I find more information about PMML?

    Mike- First there is the Data Mining Group (DMG) web site at http://www.dmg.org

    I strongly encourage any company that has a significant interest in predictive analytics to become a member and help drive the development of the standard.

    We also created a knowledge base of PMML-related information at http://www.predictive-analytics.info and there is a PMML interest group on LinkedIn at http://www.linkedin.com/groupRegistration?gid=2328634

    This group is more geared toward a general discussion forum for business benefits and end-user questions, and it is a great way to get started with PMML.

    Last but not least, the Zementis web site at http://www.zementis.com

    It contains various PMML example files, the PMML Converter tool, as well as links to PMML resource pages on the web.

    For more on Michael Zeller and Zementis read his earlier interview at https://decisionstats.wordpress.com/2009/02/03/interview-michael-zeller-ceozementis-2/

    How to use Oracle for Data Mining

    Oracle for Data Mining!!!! That's right, I am talking about the same database company that made waves by acquiring Sun (and the beloved Java) and has been stealing market share left and right.

    Here is some techie-specific help: if you know SQL (or even PROC SQL), you can learn Oracle Data Mining in less than an hour, good enough to clear that job shortlist.

    Check out the attached sample code examples. They are designed to run on the ODM demo data, but you could change that easily. They are posted on OTN here.

    Sample Code Demonstrating Oracle 11.1 Data Mining (230KB)
    These files include sample programs in PL/SQL and Java illustrating each of the algorithms supported by Oracle Data Mining 11.1. There are examples of automatic data preparation and data transformations appropriate for each algorithm. Several programs illustrate the text transformation and text mining process.

    Oracle Data Mining PL/SQL Sample Programs

    The PL/SQL sample programs illustrate each algorithm supported by Oracle Data Mining, as well as text transformation and text mining using NMF and SVM classification. Transformations that prepare the data for mining are included in the programs. Execute the PL/SQL sample programs listed in the table below; a short example of running one of them follows the table.

    Mining Function | Algorithm | Sample Program
    Anomaly Detection | One-Class Support Vector Machine | dmsvodem.sql
    Association Rules | Apriori | dmardemo.sql
    Attribute Importance | Minimum Description Length | dmaidemo.sql
    Classification | Adaptive Bayes Network (deprecated) | dmabdemo.sql
    Classification | Decision Tree | dmdtdemo.sql
    Classification | Decision Tree (cross validation) | dmdtxvlddemo.sql
    Classification | Logistic Regression | dmglcdem.sql
    Classification | Naive Bayes | dmnbdemo.sql
    Classification | Support Vector Machine | dmsvcdem.sql
    Clustering | k-Means | dmkmdemo.sql
    Clustering | O-Cluster | dmocdemo.sql
    Feature Extraction | Non-Negative Matrix Factorization | dmnmdemo.sql
    Regression | Linear Regression | dmglrdem.sql
    Regression | Support Vector Machine | dmsvrdem.sql
    Text Mining | Text transformation using Oracle Text | dmtxtfe.sql
    Text Mining | Non-Negative Matrix Factorization | dmtxtnmf.sql
    Text Mining | Support Vector Machine (Classification) | dmtxtsvm.sql
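
    These demo scripts come with the sample code download above. As a minimal sketch (assuming SQL*Plus, a schema where the ODM sample data is installed, and the scripts saved in your working directory; the user name and password below are placeholders), running one of them looks like this:

    -- Hypothetical SQL*Plus session: run the Naive Bayes classification demo.
    -- dmnbdemo.sql is one of the sample programs listed in the table above.
    SQL> connect dmuser/your_password
    SQL> @dmnbdemo.sql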

    And a particularly cute and nifty example of fraud (as in fraud detection 😉):

    -- build an SVM classification model to flag suspicious claims (runs against the ODM demo CLAIMS data)
    drop table CLAIMS_SET;
    exec dbms_data_mining.drop_model('CLAIMSMODEL');
    create table CLAIMS_SET (setting_name varchar2(30), setting_value varchar2(4000));
    insert into CLAIMS_SET values ('ALGO_NAME','ALGO_SUPPORT_VECTOR_MACHINES');
    insert into CLAIMS_SET values ('PREP_AUTO','ON');
    commit;
    begin
      dbms_data_mining.create_model('CLAIMSMODEL', 'CLASSIFICATION',
        'CLAIMS', 'POLICYNUMBER', null, 'CLAIMS_SET');
    end;
    /
    -- accuracy (per-class and overall)
    col actual format a6
    select actual, round(corr*100/total,2) percent, corr, total-corr incorr, total from
      (select actual, sum(decode(actual,predicted,1,0)) corr, count(*) total from
        (select CLAIMS actual, prediction(CLAIMSMODEL using *) predicted
         from CLAIMS_APPLY)
       group by rollup(actual));
    -- top 5 most suspicious claims where the number of previous claims is 2 or more:
    select * from
      (select POLICYNUMBER, round(prob_fraud*100,2) percent_fraud,
              rank() over (order by prob_fraud desc) rnk from
        (select POLICYNUMBER, prediction_probability(CLAIMSMODEL, '0' using *) prob_fraud
         from CLAIMS_APPLY
         where PASTNUMBEROFCLAIMS in ('2 to 4', 'more than 4')))
    where rnk <= 5
    order by percent_fraud desc;

    Coming up: a series of tutorials on learning these skills by just sitting at home.

    Hat tip: Karl Rexer, Rexer Analytics, and Charlie Berger, Oracle.

    Interview Charlie Berger Oracle Data Mining

    Here is an interview with Charlie Berger, Oracle Data Mining Product Management. Oracle is a company much respected for its ability to handle and manage data, and with its recent acquisition of Sun, it now has considerable software and financial muscle to take the world of data mining to the next generation.

    Ajay- Describe your career in data mining so far from college, jobs, assignments and projects. How would you convince high school students to take up science careers?

    Charlie- In my family, we were all encouraged to pursue science and technical fields. My Dad was a Mechanical Engineer and all my siblings are in scientific and medical fields. Early on, I had narrowed my career choices to engineering or medicine; the question when I left for college was which kind. My freshman engineering program exposed students to 6 weeks of the curriculum for each of the engineering disciplines. I found myself drawn to the field of Operations Research and Industrial Engineering. I liked the applied math and problem solving aspects. While not everyone has an aptitude or an interest in Math or the Sciences, if you do, it can be a fascinating field.

    Ajay- Please tell us some technical stuff about the Oracle Data Mining and Oracle Data Miner products. How do they compare with other products, notably from SAS and SPSS? What is unique in Oracle's suite of data mining products, and can you share some market share numbers to back this up, please?

    Charlie- Oracle doesn’t share product-level revenue numbers. I can say that Oracle is changing the analytics industry. Ten years ago, when Oracle acquired the assets of Thinking Machines, we shared a vision that over time, as the volumes of data expand, you reach a point where you have to ask whether it makes more sense to “move the data to the algorithms” or to “move the algorithms to the data”. Obviously, you can see the direction that Oracle pursued. Now, after 10 years of investing in in-database analytics, we have 50+ statistical techniques and 12 machine learning algorithms running natively inside the kernel of the Oracle Database. Essentially, we have transformed the database into an analytical database. Today, you see the traditional statistical software vendors announcing partnering initiatives for in-database processing or, in the case of IBM, acquiring SPSS. Oracle pioneered the concept of using a relational database to not only store data, but to analyze it too. Moving forward, I think that we are close to the tipping point where in-database analytics are accepted as the winning IT architecture.

    This trend towards moving the analytics to where the data are stored makes a lot of sense for many reasons. First, you don’t have to move the data. You don’t have to have copies of the data in external analytical sandboxes, where it is open to security risks and, over time, becomes aged and irrelevant.
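
    (To make the "move the algorithms to the data" idea concrete, here is a minimal sketch of in-database scoring. The model and table names are hypothetical; PREDICTION and PREDICTION_PROBABILITY are the same SQL functions used in the claims example earlier in this post.)

    -- Score customers where they live, inside the database: no extract, no sandbox copy.
    -- CHURN_MODEL and CUSTOMERS are illustrative names.
    select cust_id,
           prediction(CHURN_MODEL using *)             as predicted_churn,
           prediction_probability(CHURN_MODEL using *) as churn_probability
    from   CUSTOMERS
    where  prediction_probability(CHURN_MODEL using *) > 0.8;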

    I know of one major e-tailer who constantly experiments by randomly showing web visitors either offer “A” or a new experimental offer “B”. They would export massive amounts of data to SAS afterwards to perform simple statistical analyses. First, they would calculate the median purchase amounts for the duration of the experiment for the customers shown each offer. Then, they would perform a t-test hypothesis test to determine whether a statistically valid monetary advantage could be gained. If offer “B” were outperforming offer “A”, the e-tailer would…
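
    (As a footnote to that example: the same pre-check can also be run where the data sits. Here is a minimal sketch, assuming a hypothetical WEB_EXPERIMENT table with an OFFER column holding 'A' or 'B' and a PURCHASE_AMOUNT column; MEDIAN and STATS_T_TEST_INDEP are built-in Oracle SQL functions.)

    -- Median purchase amount per offer, computed in-database.
    select offer, median(purchase_amount) as median_purchase
    from   WEB_EXPERIMENT
    group  by offer;

    -- Independent-samples t-test on purchase amounts between the two offers.
    select stats_t_test_indep(offer, purchase_amount, 'STATISTIC')     as t_statistic,
           stats_t_test_indep(offer, purchase_amount, 'TWO_SIDED_SIG') as p_value
    from   WEB_EXPERIMENT;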