Interview Gregory Piatetsky KDNuggets.com

Here is an interview with Gregory Piatetsky, founder and editor of KDnuggets (www.KDnuggets.com), one of the oldest and biggest independent industry websites for data mining and analytics.


Ajay- Please describe your career in science and the challenges and rewards that came with it. Mention any scientific research, degrees, teaching, etc.


Gregory-
I was born in Moscow, Russia and went to a top math high school in Moscow. A unique challenge for me was that my father was one of the leading mathematicians in the Soviet Union. While I liked math (and still do), I quickly realized while still in high school that I would never be as good as my father, and that a math career was not for me.

Fortunately, I discovered computers and really liked the process of programming and solving applied problems. At that time (the late 1970s) computers were not very popular and it was not clear that one could make a career in computers. However, I was very lucky that I was able to pursue what I liked and find demand for my skills.

I got my MS in 1979 and PhD in 1984 in Computer Science from New York University.
I was interested in AI (perhaps thanks to a lot of science fiction I read as a kid), but found a job in databases, so I was looking for ways to combine them.

In 1984 I joined GTE Labs, where I worked on research in databases and AI, and in 1989 I started the first project on knowledge discovery in data. To help convince my management that there would be a demand for this thing called “data mining” (GTE management did not see much future for it), I also organized an AAAI workshop on the topic.

I thought “data mining” was not a sexy enough name, so I called it “Knowledge Discovery in Data”, or KDD. Since 1989, I have been working on KDD and data mining in all its aspects – more on my page www.kdnuggets.com/gps.html

Ajay- How would you encourage a young science entrepreneur in this recession?

Gregory- Many great companies were started or grew in a recession, e.g.
http://www.insidecrm.com/features/businesses-started-slump-111108/

A recession may be compared to a brush fire that removes dead wood and allows new trees to grow.

Ajay- What prompted you to set up KDnuggets? Any reasons for the name (Knowledge Discovery Nuggets)? Describe some key milestones of this iconic website for data mining people.

Gregory- After the third KDD workshop in 1993, I started a newsletter to connect the roughly 50 people who attended the workshop, and possibly others interested in data mining and KDD. The idea was that it would have short items or “nuggets” of information. Also, at that time a popular metaphor for data miners was gold miners looking for gold “nuggets”. So, I wanted a newsletter with “nuggets” – short, valuable items about knowledge discovery. Thus, the name KDnuggets.

In 1994 I created a website on data mining at GTE, and in 1997, after I left GTE, I moved it to the current domain name www.kdnuggets.com.

In 1999, I was working for a startup which provided data mining services to the financial industry. However, because of Y2K issues, all the banks froze their systems in the second half of 1999, and we had very little work (and our salaries were reduced as well). I decided that I would try to get some ads, and was able to get companies like SPSS and Megaputer to advertise.

Since 2001, I have been an independent consultant, and KDnuggets is only part of what I do. I also do data mining consulting, and actively participate in SIGKDD (Director 1998-2005, Chair 2005-2009).

Some people think that KDnuggets is a large company, with a publisher, webmaster, editor, ad salesperson, billing dept, etc. KDnuggets indeed has all these functions, but it is all me and my two cats.

Ajay- I am impressed by the fact that KDnuggets is almost a dictionary or encyclopedia for data mining. But apart from advertising you have not gone totally commercial – many features of your newsletter remain ad-free, you still maintain a minimalistic look, and you do not take sponsorship aligned with one big vendor. What is your vision for KDnuggets in the years to come to keep it truly independent?

Gregory- My vision for KDnuggets is to be a comprehensive resource for the data mining community, and I really enjoyed maintaining such a resource for the first 7-8 years completely non-commercially. However, when I became self-employed, I could not do KDnuggets without any income, so I selectively introduced ads, and only those which are relevant to data mining.

I like to think of KDnuggets as a Craigslist for the data mining community.

I certainly realize the importance of social media and Web 2.0 (interested people can follow my tweets at twitter.com/kdnuggets) and plan to add more social features to KDnuggets.

Still, just as Wikipedia and Facebook do not make the New York Times obsolete, I think there is room and need for an edited website, especially for such a nerdy and not very social group as data miners.

Ajay- What is the worst mistake/error you made in writing or publishing? What was the biggest triumph or high moment in KDnuggets history?

Gregory- My biggest mistake is probably in choosing the name KDnuggets – in retrospect, I could have used a shorter and easier-to-spell domain name, but in 1997 I never expected that I would still be publishing www.KDnuggets.com 12 years later.

Ajay- Who are your favourite data mining students (having known so many people)? What qualities do you think set a data mining person apart from those in other sciences?

Gregory- I was only an adjunct professor for a short time, so I did not really have data mining students, but I was privileged enough to know many current data mining leaders when they were students.  Among more recent students, I am very impressed with Jure Leskovec, who just finished his PhD and got the best KDD dissertation award.

Ajay- What does Gregory Piatetsky do for fun when he is not informing the world about analytics and knowledge discovery?

Gregory- I enjoy travelling with my family, and in the summer I like biking and windsurfing.
I also read a lot, and am currently in the middle of reading Proust (which I periodically dilute with other, lighter books).

Ajay- What are your favourite blogs and websites to read? Any plans to visit India?

Gregory- I visit many of the blogs listed on www.kdnuggets.com/websites/blogs.html

and I especially like
– Matthew Hurst's blog: Data Mining: Text Mining, Visualization, and Social Media
– Occam's Razor by Avinash Kaushik, examining web analytics
– Juice Analytics, blogging about analytics and visualization
– Geeking with Greg, exploring the future of personalized information

I also like your website decisionstats.com and plan to visit it more frequently.

I have visited many countries, but not yet India – waiting for the right occasion!

Biography

(http://www.kdnuggets.com/gps.html)

Gregory Piatetsky-Shapiro, Ph.D. is the President of KDnuggets, which provides research and consulting services in the areas of data mining, web mining, and business analytics. Gregory is considered to be one of the founders of the data mining and knowledge discovery field. Gregory edited or co-edited many collections on data mining and knowledge discovery, including two best-selling books: Knowledge Discovery in Databases (AAAI/MIT Press, 1991) and Advances in Knowledge Discovery and Data Mining (AAAI/MIT Press, 1996), and has over 60 publications in the areas of data mining, artificial intelligence, and database research.

Gregory is the founder of the Knowledge Discovery in Databases (KDD) conference series. He organized and chaired the first three KDD workshops in 1989, 1991, and 1993. He then served as the Chair of the KDD Steering Committee and guided the conversion of the KDD workshops into the leading international conferences on data mining. He was also the General Chair of the KDD-98 conference.

Interview Dr Usama Fayyad Founder Open Insights LLC

Here is an interview with Dr Usama Fayyad, founder of Open Insights LLC (www.open-insights.com). Prior to this he was Yahoo!'s Chief Data Officer, in which role he built the data teams and infrastructure to manage the 25 terabytes of data per day that resulted from the company's operations.

 


Ajay- Describe your career in science. How would you motivate young people today to take up science careers rather than other careers?
Dr Fayyad-
My career started out in science and engineering. My original plan was to be in research and to become a university professor. Indeed, my first few jobs were strictly in basic research. After doing summer internships at places like GM Research Labs and JPL, my first full-time position was at the NASA Jet Propulsion Laboratory, California Institute of Technology.

I started in research in artificial intelligence for autonomous monitoring and control, and in machine learning and data mining. The first major success was with Caltech astronomers, on using machine learning classification techniques to automatically recognize objects in a large sky survey (POSS-II – the 2nd Palomar Observatory Sky Survey). The survey consisted of taking high-resolution images of the entire northern sky. The images, when digitized, contain over 2 billion sky objects. The main problem is to recognize whether an object is a star or a galaxy. For “faint objects” – which constitute the majority of objects – this was an exceedingly hard problem that people had wrestled with for 30 years. I was surprised how well the algorithms could do at solving it.

This was a real example of data sets where the dimensionality is so high that algorithms are better suited to solving the problem than humans – even well-trained astronomers. Our methods had over 94% accuracy on faint objects that no one before could reliably classify at better than 75% accuracy. This additional accuracy made all the difference in enabling all sorts of new science, discoveries, and theories about the formation of large-scale structure in the Universe.
The success of this work and its wide recognition in the scientific and engineering communities led to the creation of a new group – I founded and managed the Machine Learning Systems group at JPL, which went on to address hard problems in object recognition in scientific data – mostly from remote sensing instruments – like Magellan images of the planet Venus (we recognized and classified over a million small volcanoes on the planet in collaboration with geologists at Brown University) and Earth Observing System data, including atmospherics and storm data.
At the time, Microsoft was interested in figuring out data mining applications in the corporate world, and after a long recruiting cycle they got me to join the newly formed Microsoft Research as a Senior Researcher in late 1995. My work there focused on algorithms, database systems, and basic science issues in the newly formed field of data mining and knowledge discovery. We had just finished publishing a nice edited collection of chapters in a book that became very popular, and I had agreed to become the founding Editor-in-Chief of a brand new journal called Data Mining and Knowledge Discovery. This journal is today the premier scientific journal in the field. My research work at Microsoft led to several applications – especially in databases. I founded the Data Mining & Exploration group at MSR, and later a product group in SQL Server that built and shipped the first integrated data mining product in a large-scale commercial DBMS – SQL Server 2000 (Analysis Services). We created extensions to the SQL language (which we called DMX) and tried to make data mining mainstream. I really enjoyed the life of doing basic research as well as having a real product group that built and shipped components in a major DBMS.
That's when I learned that the really challenging problems in the real world were not in data mining but in getting the data ready and available for analysis. Data warehousing was a field littered with failures and data stores that were write-only (meaning data never came out!) — I used to call these Data Tombs at the time, and I likened them to the pyramids of Ancient Egypt: great engineering feats to build, but really just tombs.

In 2000 I decided to leave the world of research at Microsoft to do my first venture-backed start-up company – digiMine. The company wanted to solve the problem of managing the data and performing data mining and analysis over data sets, and we targeted a model of hosted data warehouses and mining applications as an ASP – one of the first Software as a Service (SaaS) firms in that arena. This began my transition from the world of research and science to business and technology. We focused on on-line data and web analytics, since the data volumes there were about 10x the size of transactional databases and most companies did not know how to deal with all that data. The business grew fast and so did the company – reaching 120 employees in about a year.

After 3 years of running a high-growth start-up and raising some $50 million in venture capital for the company, I was beginning to feel the itch again to do technical work.
In June 2003, we had a chance to spin off the part of the business that was focused on difficult high-end data mining problems. This opportunity was exactly what I needed, and we formed DMX Group as a spinoff company that had a solid business from its first day. At DMX Group I got to work on some of the hardest data mining problems: predicting sales of automobiles, churn of wireless users, financial scoring and credit risk analysis, and many related deep business intelligence problems.

Our client list included many of the Fortune 500 companies. One of these clients was Yahoo! After 6 months of working with Yahoo! as a client, they decided to acquire DMX Group and use the people to build a serious data team for Yahoo! We negotiated a deal that got about half the employees into Yahoo!, and we spun off the rest of DMX Group to continue focusing on consulting work in data mining and BI. I thus became the industry's first Chief Data Officer.

The original plan was to spend 2 years or so to help Yahoo! form the right data teams and build the data processing and targeting technology to deliver high value from its inventory of ads.
Yahoo! proved to be a wonderful experience, and I learned so much about the Internet. I also learned that even someone like me, who had worked on Internet data from the early days of MSN (in 1996) and who ran a web analytics firm, still had not scratched the surface of the depth of the area. I learned a lot about the Internet from Jerry Yang (Yahoo! co-founder), much about the advertising/media business from Dan Rosensweig (COO) and Terry Semel (then CEO), and lots about technology management and strategic deal-making from Farzad (Zod) Nazem, who was the CTO.

As Executive VP at Yahoo!, I built one of the industry's largest and best data teams, and we were able to process over 25 terabytes of data per day and power several hundred million dollars of new revenue for Yahoo! resulting from these data systems. A year after joining Yahoo! I was asked to form a new research lab to study much of what we did not understand about the Internet. This was yet another return of basic research into my life. I founded Yahoo! Research to invent the new sciences of the Internet, and I wanted it to be focused on only 4 areas (the idea of focus came from my exposure to Caltech and its philosophy of picking a few areas of excellence). The goal was to become the best research lab in the world in these new focused areas. Surprisingly, we did it within 2 years. I hired Prabhakar Raghavan to run Research, and he did a phenomenal job in building out the research organization. The four areas we chose were: search and information navigation, community systems, micro-economics of the Web, and computational advertising. We were able to attract the top talent in the world to lead or work on these emerging areas. Yahoo! Research was a success in basic research but also in influencing product. The chief scientists for all the major areas of company products all came from Yahoo! Research and all owned the product development agendas and plans: Raghu Ramakrishnan (CS for Audience), Andrew Tomkins (CS for Search), Andrei Broder (CS for Monetization) and Preston McAfee (CS for Marketplaces/Exchanges). I consider this an unprecedented achievement in the world of research in general: excellence in basic research and huge impact on company products, all within 3-4 years.
I have recently left Yahoo! and started Open Insights (www.open-insights.com) to focus on data strategy and helping enterprises realize the value of data, develop the right data strategies, and create new business models – sort of an “outsourced version” of my Chief Data Officer job at Yahoo!
Finally, on my advice to young people: it is not just about science careers; I would call it engineering careers. My advice to any young person, in fact, whether they plan to become a business person, a medical doctor, an artist, a lawyer, or a scientist: basic training in engineering and abstract problem solving will be a huge asset. Some of the best lawyers, doctors, and even CEOs started out with engineering training.
For those young people who want to become scientists, my advice is to always look for real-world applications in whose context the research can be conducted. The reason for that is both technical and sociological. From a technical perspective, the reality of an application and the fact that things have to work force a regimen of technical discipline and ensure that the new ideas are tested and challenged. Socially, working on a real application forces interactions with people who care about the problem and provides continuous feedback, which is really crucial in guiding good work (even if scientists deny this, social pressure is a big factor) – it also ensures that your work will be relevant and will evolve in relevant directions. I always tell people who are seeking basic research: “some of the deepest fundamental science problems can often be found lurking in the most mundane of applications”. So embrace applied work but always look for the abstract deep problems – that summarizes my advice.
Ajay- What are the challenges of running data mining for a very big website?
Dr Fayyad-
There are many challenges. Most algorithms will not work due to scale. Also, most of the problems have an unusually high dimensionality – so simple tricks like sampling won't work. You need to be very clever about how to sample and how to reduce dimensionality by applying the right variable transformations.

The variety of problems is huge, and the fact that the Internet is evolving and growing rapidly, means that the problems are not fixed or stationary. A solution that works well today will likely fail in a few months – so you need to always innovate and always look at new approaches. Also, you need to build automated tools to help detect changes and address them as soon as they arise. 

Problems with 1,000, 10,000, or millions of variables are very common in web challenges. Finally, whatever you do needs to work fast, or else you will not be able to keep up with the data flux. Imagine falling behind on processing 25 terabytes of data per day: if you fall behind by two days, you will never be able to catch up again – not within any reasonable budget constraint. So you try never to go down.
Ajay- What are the 5 most important things that a data miner should avoid in doing analysis?

Dr Fayyad- I never thought about this in terms of a top 5, but here are the big ones that come to mind, not necessarily in any order:
a.       The algorithm knows nothing about the data; the knowledge of the domain is in the heads of the domain experts. As I always say, an ounce of knowledge is worth a ton of data – so seek out and model what the experts know, or your results will look silly.
b.      Don't let an algorithm fish blindly when you have lots of data. Use what you know to reduce the dimensionality quickly. The curse of dimensionality is never to be underestimated.
c.       Resist the temptation to cheat: selecting training and test sets can easily fool you into thinking you have something that works. Test it honestly against new data, and never “peek” at the test data – what you see will force you to cheat without knowing it.
d.      Business rules typically dominate data mining accuracy, so be sure to incorporate the business and legal constraints into your mining.
e.       I have never seen a large database in my life that came from a static distribution that was sampled independently. Real databases grow to be big through lots of systematic changes and biases, and they are collected over years from changing underlying distributions, while most algorithms assume that the data is IID (independent and identically distributed): segmentation is a pre-requisite to any analysis.

Ajay- Do you think software like Hadoop and MapReduce will change online databases permanently? What further developments do you see in this area?


Dr Fayyad-
I think they will change (and already have changed) the landscape dramatically, but they do not address everything. Many problems lend themselves naturally to Map-Reduce and many new approaches are enabled by Map-Reduce. However, there are many problems where M-R does not do much. I see a lot of problems being addressed by a large grid nowadays when they don't need it. This is often a huge waste of computational resources. We need to learn how to deal with a mix of tools and platforms. I think M-R will be with us for a long time and will be a staple tool – but not a universal one.
Ajay- I look forward to the day when I have just a low-priced netbook and a fast internet connection, and can upload a gigabyte of data and run advanced analytics in the browser. How soon do you think that is possible?
Dr Fayyad- Well, I think the day is already here. In fact, much of our web search today is conducted in exactly that model. A lot of web analysis, and much scientific analysis, is done like this today.
Ajay- Describe some of the conferences you are currently involved with and the research areas that excite you the most.
Dr Fayyad-
I am still very involved in knowledge discovery and data mining conferences (especially the KDD series), machine learning, some statistics, and some conferences on search and the Internet. The most exciting conferences for me are ones that cover a mix of topics but address real problems. Examples include understanding how social networks evolve and behave, understanding dimensionality reductions (like random projections in very high-D spaces), and generally any work that gives us insight into why a particular technique works better and where the open challenges are.
Ajay- What are the next breakthrough areas in data mining? Can we have a Google or Yahoo in the field of business intelligence as well, given its huge market potential and uncertain ROI?
Dr Fayyad- We already have some large and healthy businesses in BI and quite a huge industry in consulting. If you are asking particularly about the tools market then I think that market is very limited. The users of analysis tools are always going to be small in number. However, once the BI and Data Mining tools are embedded in vertical applications, then the number of users will be tremendous. That’s where you will see success.
Consider the examples of Google or Yahoo! – and now Microsoft with the BING search engine. Search engines today would not be good without machine learning/data mining technology. In fact, MLR (Machine Learned Ranking) is at the core of the ranking methodology that decides which search results bubble to the top of the list. The typical web query is 2.6 keywords long and has about a billion matches. What matters are the top 10. The function that determines these is a relevance ranking algorithm that uses machine learning to tune a formula considering hundreds or thousands of variables about each document. So in many ways, you have a great example of this technology being used by hundreds of millions of people every day – without them knowing it!
Success will be in applications where the technology becomes invisible – much like the internal combustion engine in your car or the electric motor in your coffee grinder or power supply fan. I think once people start building verticalized solutions that embed data mining and BI, we will hit success. This already has happened in web search, in direct marketing, in advertising targeting, in credit scoring, in fraud detection, and so on…

Ajay- What do you do to relax? What publications would you recommend for staying up to date, for data mining people and especially the younger analysts?
Dr Fayyad-
My favorite activity is sleep, when I can get it :) But more seriously, I enjoy reading books, playing chess, skiing (on water or snow, downhill or x-country), or any activities with my kids. I swim a lot and that gives me much time to think and sort things out.
For keeping up with the technical advances in data mining, I recommend the KDD conferences, some of the applied analytics conferences, the WWW conferences, and the data mining journals. The ACM SIGKDD publishes a nice newsletter called SIGKDD Explorations. It comes free with a very low membership fee, and it has a lot of announcements and survey papers on new topics and important areas (www.kdd.org). Also, a good list to keep up with is an email newsletter called KDnuggets, edited by Gregory Piatetsky-Shapiro.
 

Biography (www.fayyad.com/usama)-

Usama Fayyad founded Open Insights (www.open-insights.com) to deliver on the vision of bridging the gap between data and insights, and to help companies develop strategies and solutions not only to turn data into working business assets, but to turn the insights available from the growing amounts of data into critical components of an enterprise's strategy for approaching markets, dealing with competitors, and acquiring and retaining customers.

In his prior role as Chief Data Officer of Yahoo! he built the data teams and infrastructure to manage the 25 terabytes of data per day that resulted from the company’s operations. He also built up the targeting systems and the data strategy for how to utilize data to enhance revenue and to create new revenue sources for the company.

In addition, he was the founding executive for Yahoo! Research, a scientific research organization that became the top research place in the world working on inventing the new sciences of the Internet.

Growing Rapidly: RapidMiner 4.5

The Europe-based RapidMiner (formerly known as YALE) has come out with version 4.5 of its data mining tool, with a very promising new “Script” operator.

Also, RapidMiner came in first among open source data mining tools in a poll by the industry benchmark www.kdnuggets.com

They have a brilliant video here for people who just want to have a look at the new RapidMiner:

http://rapid-i.com/videos/rapidminer_tour_3_4_en.html

Citation-

http://rapid-i.com/content/view/147/1/

New Operators:

  • FormulaExtractor
  • Trend
  • LagSeries
  • VectorLinearRegression
  • ExampleSetMinus
  • ExampleSetIntersect
  • Partition
  • Script
  • ForwardSelection
  • NeuralNetImproved
  • KernelNaiveBayes
  • ExhaustiveSubgroupDiscovery
  • URLExampleSource
  • NonDominatedSorting

More Features:

  • The new Script operator allows for arbitrary user-defined operations based on Groovy script combined with a simplified RapidMiner syntax (see the sketch after this list)
  • Improved the join operator and added options for left and right outer joins
  • New notification mail mechanism at the end of processes
  • Most file based data input operators now provide an option to skip error lines
  • Most file based example source operators as well as the IOObjectReader and the new URLExampleSource now accept URLs instead of a filename for the input source location
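
To make the Script operator concrete, here is a minimal sketch of how it might sit inside a RapidMiner 4.x process file. This is an illustration only: the operator class names, parameter keys, and surrounding process structure are assumptions based on the release notes and the general shape of RapidMiner 4.x process XML, not copied from the Rapid-I documentation.

  <!-- Hypothetical RapidMiner 4.x process: load an example set, then hand it
       to the new Script operator. Operator and parameter names are assumed. -->
  <operator name="Root" class="Process" expanded="yes">
    <operator name="Input" class="ExampleSource">
      <parameter key="attributes" value="mydata.aml"/>
    </operator>
    <operator name="MyScript" class="Script">
      <!-- the Groovy body is carried as a parameter value; it can use the
           simplified RapidMiner syntax mentioned in the release notes -->
      <parameter key="script" value="// arbitrary Groovy code over the example set"/>
    </operator>
  </operator>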

PMML 4.0

There are some nice changes in the PMML 4.0 version. PMML is the XML standard for representing data mining models; or, specifically quoting the DMG group itself:

PMML uses XML to represent mining models. The structure of the models is described by an XML Schema. One or more mining models can be contained in a PMML document. A PMML document is an XML document with a root element of type PMML. The general structure of a PMML document is:

  <?xml version="1.0"?>
  <PMML version="4.0"
    xmlns="http://www.dmg.org/PMML-4_0"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" >

    <Header copyright="Example.com"/>
    <DataDictionary> ... </DataDictionary>

    ... a model ...

  </PMML>

So what is new in version 4.0? Here are some powerful modeling changes. For anyone with any XML knowledge, PMML is the way to go.

PMML 4.0 – Changes from PMML 3.2

Associations

  • Itemset and AssociationRule elements are no longer enclosed within a “Choice” element (see the sketch after this list)
  • Added different scoring procedures: recommendation, exclusiveRecommendation and ruleAssociation with explanation and example
  • Changed version to “4.0” from “3.2” in the example(s)
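
To visualize the flattened structure, here is a small hand-made sketch of a PMML 4.0 association model, with items, itemsets, and a rule now appearing as plain sibling elements rather than inside a “Choice”. The field names, items, and numbers are invented for illustration; consult the DMG schema for the full set of required attributes.

  <AssociationModel functionName="associationRules"
      numberOfTransactions="1000" numberOfItems="2"
      numberOfItemsets="2" numberOfRules="1"
      minimumSupport="0.1" minimumConfidence="0.5">
    <MiningSchema>
      <MiningField name="transaction" usageType="group"/>
      <MiningField name="item" usageType="active"/>
    </MiningSchema>
    <Item id="1" value="beer"/>
    <Item id="2" value="chips"/>
    <Itemset id="1"><ItemRef itemRef="1"/></Itemset>
    <Itemset id="2"><ItemRef itemRef="2"/></Itemset>
    <!-- itemsets and rules are now plain siblings, not wrapped in a Choice -->
    <AssociationRule support="0.2" confidence="0.7"
        antecedent="1" consequent="2"/>
  </AssociationModel>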

BuiltinFunctions

Added the following functions (a small usage example follows the list):
  • isMissing
  • isNotMissing
  • equal
  • notEqual
  • lessThan
  • lessOrEqual
  • greaterThan
  • greaterOrEqual
  • isIn
  • isNotIn
  • and
  • or
  • not
  • if

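As a small illustration of how these functions combine, here is a DerivedField (field names invented) that uses the new if and isMissing functions to substitute 0 for a missing income value:

  <DerivedField name="cleanIncome" dataType="double" optype="continuous">
    <!-- if income is missing, return 0, otherwise pass income through -->
    <Apply function="if">
      <Apply function="isMissing">
        <FieldRef field="income"/>
      </Apply>
      <Constant dataType="double">0</Constant>
      <FieldRef field="income"/>
    </Apply>
  </DerivedField>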

ClusteringModel

  • Changed version to “4.0” from “3.2” in the example(s)
  • Added reference to ModelExplanation element in the model XSD

Conformance

  • Changed all version references from “3.2” to “4.0”

DataDictionary

  • No changes

Functions

  • No changes

GeneralRegression

  • Changed to allow for Cox survival models and model ensembles (see the sketch after this list)
    • Add new model type: CoxRegression.
    • Allow empty regression model when model type is CoxRegression, so that baseline-only model could be represented.
    • Add new optional model attributes: endTimeVariable, startTimeVariable, subjectIDVariable, statusVariable, baselineStrataVariable, modelDF.
    • Add optional Matrix in Predictor to specify a contrast matrix, optional attribute referencePoint in Parameter.
    • Add new elements: BaseCumHazardTables, EventValues, BaselineStratum, BaselineCell.
    • Add examples of scoring for Cox Regression and contrast matrices.
    • Add new type of distribution: tweedie.
    • Add new attribute in model: targetReferenceCategory, so that the model can be used in MiningModel.
    • Changed version to “4.0” from “3.2” in the example(s)
    • Added reference to ModelExplanation element in the model XSD
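
Here is a deliberately skeletal sketch of what a baseline-only Cox model might look like, using only the new attribute and element names listed above. Field names and numbers are invented, and required children are omitted; check the DMG schema for the exact structure.

  <GeneralRegressionModel modelType="CoxRegression"
      functionName="regression"
      endTimeVariable="followUpTime" statusVariable="eventStatus">
    <MiningSchema>
      <MiningField name="followUpTime"/>
      <MiningField name="eventStatus"/>
    </MiningSchema>
    <!-- new in 4.0: baseline cumulative hazard as a step function over time -->
    <BaseCumHazardTables>
      <BaselineCell time="12" cumHazard="0.05"/>
      <BaselineCell time="24" cumHazard="0.11"/>
    </BaseCumHazardTables>
  </GeneralRegressionModel>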

GeneralStructure

Header

  • No changes

Interoperability

  • Changed: “As a result, a new approach for interoperability was required and is being introduced in PMML version 3.2.” to “As a result, a new approach for interoperability was introduced in PMML version 3.2.”

MiningSchema

  • Added frequencyWeight and analysisWeight as new options for usageType. They will not affect scoring, but will make model information more complete (example below).
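
In schema terms this is just another allowed value of a MiningField's usageType attribute; for example (field names invented):

  <MiningSchema>
    <MiningField name="income" usageType="active"/>
    <!-- a per-case weight, recorded for completeness but not used in scoring -->
    <MiningField name="caseWeight" usageType="analysisWeight"/>
  </MiningSchema>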

ModelComposition

  • No longer used, replaced by MultipleModels

ModelExplanation

  • New addition to PMML 4.0 that contains information to explain the models, model fit statistics, and visualization information.

ModelVerification

  • No changes

MultipleModels

  • Replaces ModelComposition. Important additions are segmentation and ensembles.
  • Added reference to ModelExplanation element in the model XSD

NaïveBayes

  • Changed version to “4.0” from “3.2” in the example(s)
  • Added reference to ModelExplanation element in the model XSD

NeuralNetwork

  • Changed version to “4.0” from “3.2” in the example(s)
  • Added reference to ModelExplanation element in the model XSD

Output

  • Extended the output type to include association rule models. The changes add a number of new attributes: “ruleFeature”, “algorithm”, “rank”, “rankBasis”, “rankOrder” and “isMultiValued”. A new value “ruleValue” is added to the RESULT-FEATURE enumeration.

Regression

  • Changed version to “4.0” from “3.2” in the example(s)
  • Added reference to ModelExplanation element in the model XSD

RuleSet

  • Changed version to “4.0” from “3.2” in the example(s)
  • Added reference to ModelExplanation element in the model XSD

Sequence

  • Changed version to “4.0” from “3.2” in the example(s)

Statistics

  • Accommodate weighted counts by replacing INT-ARRAY with NUM-ARRAY in DiscrStats and ContStats
  • Change xs:nonNegativeInteger to xs:double in several places
  • Add new boolean attribute ‘weighted’ to UnivariateStats and PartitionFieldStats elements
  • Add new attribute cardinality in Counts (see the sketch after this list)
  • Also, some very long lines in this document are now wrapped.
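
For instance, per-field counts can now carry the number of distinct values alongside the usual frequencies. A hedged sketch (field name and numbers invented):

  <UnivariateStats field="income" weighted="1">
    <!-- cardinality (number of distinct values) is new in 4.0 -->
    <Counts totalFreq="10000" missingFreq="120" invalidFreq="3" cardinality="812"/>
  </UnivariateStats>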

SupportVectorMachine

  • Added optional attribute threshold
  • Added optional attribute classificationMethod
  • Attribute alternateTargetCategory removed from SupportVectorMachineModel element and moved to SupportVectorMachine element
  • Changed the example slightly
  • Changed version to “4.0” from “3.2” in the example(s)
  • Added reference to ModelExplanation element in the model XSD

Targets

  • No changes

Taxonomy

  • Changed: “A TableLocator may contain any description which helps an application to locate a certain table. PMML 3.2 does not yet define the content. PMML users have to use their own extensions. The same applies to InlineTable.” to “A TableLocator may contain any description which helps an application to locate a certain table. PMML standard does not yet define the content. PMML users have to use their own extensions. The same applies to InlineTable.”

Text

  • Changed version to “4.0” from “3.2” in the example(s)
  • Added reference to ModelExplanation element in the model XSD

TimeSeriesModel

  • New addition to PMML 4.0 to support Time series models

Transformations

  • No changes

TreeModel

  • Changed version to “4.0” from “3.2” in the example(s)
  • Added reference to ModelExplanation element in the model XSD

Sources

http://www.dmg.org/v4-0/GeneralStructure.html

http://www.dmg.org/v4-0/Changes.html

and here are some companies using PMML already

http://www.dmg.org/products.html

I found the tool at http://www.dmg.org/coverage/ much more interesting, though.


Zementis, whom we have covered in the interviews, has played a stellar role in bringing together this common standard for data mining. Note that the KXEN model is also highlighted there.

The best PMML converter tutorial is here:

http://www.zementis.com/videos/PMML_Converter_iGoogle_gadget_2_demo.htm

Interesting Times

Probably for the first time, I am reproducing a comment from a reader in its entirety. As an ex-GE Finance and ex-Citi man, the following parable struck close to my heart.

Here are some nice views from Randall Stross of http://enhilex.com/ on the current economic crisis. If his views touch a chord, do write back.

Hello Everyone, There is a lot of talk about what made Wall St. fail. Having spent some time inside firms that played pivotal roles, I must say that I believe there are as many reasons for the failure as there are ways to manifest lack of integrity, diligence, responsibility, and accountability. As recent failures demonstrate, all attempts to legislate those characteristics have failed – for the same reasons. The manifestations I found were like the following examples:

1. Here’s a short conversation in a hallway. I would later realize that we were standing in front of the manager’s office… Me: “So , why don’t your models account for employment as a factor that influences the borrower’s ability to pay his mortgage?” Him, looking at me like I had 2 heads: “Er, uh… Well, that is because… very long pause… there is no reliable data on that. Yes, that’s right. So we leave it out…” That was the reply from a very intelligent person who knew he would be shown the door if he’d told me the truth. Lay persons understand that models are useful, though imperfect. But the example given above is something else. This model’s design was intended to mislead. People who do things like this may have been “good” employees, but they were not good citizens. The “good” employees remain… Anyone with an IQ over 50 knows that past performance does not predict future performance when you don’t account for the differences between past and present in the entirety of the nexus where the question belongs.

2. Whilst sitting next to me in a development lab, a young reporting analyst was directed to “hard code” the sum of a column of figures headed for regulators – because the bottom line “wasn’t good enough.” The young man hesitated. Relieved, I put my hand on his wrist and quietly suggested that he ask for our client’s manager to send that request to him in an email. The manager left the lab, calling us obscene names. Of course, the email never came.

3. A friend of mine working as a funder for a large mortgage firm was fired for refusing to fund a loan she knew was fraudulent. I’m proud to know the people in #2 and #3. These people have integrity and loyalty to principles the investing public can count on. But they both lost their jobs and have moved on into other fields, away from Wall St-ish areas. So, it is the people who are left that will work to rebuild confidence in the market. OK, now — who’s left? I wish we could let it all fall down and stand back up without the bums who caused it all. We do know who they are…

ps- I hope he is neither a spammer nor joking. This brings back a lot of old memories from when I worked for the big hot financial companies. Do you have a personal story like that?

Decisionstats Interviews

Here is a list of interviews that I have published – these are specific to analytics and data mining and include only the most recent interviews. If I have missed any notable recent interview related to analytics and data mining, kindly do let me know. Hat tip to Karl Rexer for this suggestion.

Date    Name of Interviewee    Designation and Organization

09-Jun    Karl Rexer                          President, Rexer Analytics
05-Jun    Jim Davis                           CMO, SAS Institute
04-Jun    Paul van Eikeren                 President and CEO, Blue Reference
29-May    David Smith                      Director of Community, REvolution Computing
17-May    Dominic Pouzin                 CEO, Data Applied
11-May    Bruno Delahaye                 VP, KXEN
04-May    Ron Ramos                        Director, Zementis
30-Apr    Olivier Jouve                      VP, SPSS Inc
21-Apr    Fabian Dill                         Co-Founder, Knime.com
18-Apr    Alicia Mcgreevey                 Head of Marketing, Visual Numerics
27-Mar    Francoise Soulie Fogelman    VP, KXEN
17-Mar    Jon Peck                            Principal Software Engineer, SPSS Inc
06-Mar    Anne Milley                        Director of Product Marketing, SAS Institute
04-Mar    Anne Milley                        Director of Product Marketing, SAS Institute
03-Feb    Phil Rack                            Creator, Bridge to R, and CEO, MineQuest
03-Feb    Michael Zeller                     CEO, Zementis
31-Jan    Richard Schultz                   CEO, REvolution Computing
21-Jan    Bob Muenchen                    Author, R for SAS and SPSS Users
13-Jan    Dr Graham Williams           Creator, Rattle GUI for R
05-Jan    Roger Haddad                    CEO, KXEN
26-Sep    June Dershewitz                  VP, Semphonic
04-Sep    Vincent Granville                 Head, Analyticbridge

The URLs to the specific interviews are also in this sheet:

http://spreadsheets.google.com/pub?key=rWTqcMe9mqwHeFv1e4GS_yg&single=true&gid=0&range=a1%3Ae24&output=html

Interview Karl Rexer -Rexer Analytics

Here is an interview with Karl Rexer of Rexer Analytics. His annual survey is considered a benchmark in the data mining and analytics industry. Here Karl talks about his career, his annual survey, and his views on the industry's direction and trends.

Almost 20% of data miners report that their company/organizations have only minimal analytic capabilities – Karl Rexer


Ajay- Describe your career in science. What advice would you give to young science graduates in this recession? What advice would you give to high school students choosing between science and non-science careers?

Karl- My interests in science began as a child. My father has multiple science degrees, and I grew up listening to his descriptions of the cool things he was building, or the cool investigative tools he was using, in his lab. He worked in an industrial setting, so visiting was difficult. But when I could, I loved going in to see the high-temperature furnaces he was designing, the carbon-fiber production processes he was developing, and the electron microscope that allowed him to look at his samples. Both of my parents encouraged me to ask why, and to think critically about both scientific and social issues. It was also the time of the Apollo moon landings, and I was totally absorbed in watching and thinking about them. Together these things motivated me and shaped my world-view.

I have also had the good fortune to work across many diverse areas and with some truly outstanding people. In graduate school I focused on applied statistics and the use of scientific methods in the social sciences. As a grad student and young academic, I applied those skills to researching how our brains process language. But on the side, I pursued a passion for using the scientific method and analytics to address… well, anything I could. We called it “statistical consulting” then, but it often extended to research design and many other parts of the scientific process. Some early projects included assisting people with AIDS outcome studies, psycholinguistic research, and studies of adolescent adjustment.

My first taste of applying these skills outside of an academic environment was with my mentor Len Katz. The US Navy hired us to help assess the new recruits that were entering the submarine school. Early identification of sailors who would excel in this unusual and stressful environment was critical. Perhaps even more important was identifying sailors who would not perform well in that environment. Luckily, the Navy had years of academic and psychological testing on many sailors, and this data proved quite useful in predicting later job performance onboard the submarines. Even though we never got the promised submarine ride, I was hooked on applying measurement, scientific methods, and analytics in non-academic settings.

And that’s basically what I have continued to do – apply those skills and methods in diverse scientific and business settings. I worked for two banks and two consulting firms before founding Rexer Analytics in 2002. Last year we supported 30 clients. I’ve got great staff and they have great quant skills. Importantly, we also don’t hesitate to challenge each other, and we’re continually learning from each other and from each client engagement. We share a love of project diversity, and we seek it out in our engagements. We’ve forecasted sales for medical devices, measured B2B customer loyalty, identified manufacturing problems by analyzing product returns, predicted which customers will close their bank accounts, analyzed millions of tax returns, helped identify the dimensions of business team cohesion that result in better performance, found millions of dollars of B2B and B2C fraud, and helped many companies understand their customers better with segmentations, surveys, and analyses of sales and customer behavior.

The advice I would give to young science grads in this recession is to expand your view of where you can apply your scientific training. This applies to high school students considering science careers too. Not all science happens in universities, labs, and other traditional science locations. Think about applying scientific methods everywhere! Sometimes our projects at Rexer Analytics seem far away from what most people would consider “science.” But we're always asking “what data is available that can be brought to bear on the business issue we're addressing.” Sometimes the best solution is to go out and collect more data – so we frequently help our clients improve their measurement processes or design surveys to collect the necessary data. I think there are enormous opportunities for science grads to apply their scientific training in the business world. The opportunities are not limited to physics whiz-kids making models for Wall Street trading or computer science students moving to Silicon Valley. One of the best analytic teams I ever worked on was at Fleet Bank in the late 90s. We had an economist, two physicists, a sociologist, a psychologist, an operations research guy, and a person with a degree in marketing science. We were all very focused on data, measurement, and analytic methods.

I recommend that all science grads read Tom Davenport's book Competing on Analytics. It illustrates, with compelling examples, how businesses can benefit from using science and analytics. Several examples in Tom's book come from Gary Loveman, CEO of Harrah's Entertainment. I think that Gary also serves as a great example of how scientific methods can be applied in every industry. Gary has a PhD in economics from MIT, he's worked at the Federal Reserve Bank, he's been a professor at Harvard, but more recently he runs the world's largest casino and gaming company. And he's famously said many times that there are three ways to get fired at Harrah's: steal, harass women, or not use a control group. Business leaders across all industries increasingly want data, analytics, and scientific decision-making. Science grads have great training that enables them to take on these roles and to demonstrate the success of these methods.

Ajay- One more survey- How does the Rexer survey differentiate itself from other surveys out there?

Karl- The Annual Rexer Analytics Data Miner Survey is the only broad-reaching research that investigates the analytic behaviors, views and preferences of data mining professionals. Each year our sample grows — in 2009 we had over 700 people around the globe complete our survey. Our participants include large numbers of both academic and business people.

Another way our survey is differentiated from other surveys is that each year we ask our participants to provide suggestions on ways to improve the survey. Incorporating participants’ suggestions improves our survey. For example, in 2008 several people suggested adding questions about model deployment and off-shoring. We asked about both of these topics in the 2009 survey.

Ajay- Could you please share some sneak previews of the survey results? What impact is the recession likely to have on IT spending?

Karl- We’re just starting to analyze the 2009 survey data. But, yes, here’s a peek at some of the findings that relate to the impact of the recession:

* Many data miners report that funding for data mining projects can sometimes be a problem.
* However, when asked what will happen in 2009 if the economic downturn continues, many data miners still anticipate that their company/organization will conduct more data mining projects in 2009 than in previous years (41% anticipate more projects in 2009; 27% anticipate fewer projects).
* The vast majority of companies conduct their data mining internally, and very few are sending data mining off-shore.

I don’t have a crystal ball that tells me about the trends in overall corporate spending on IT, Business Intelligence, or Data Mining. It’s my personal experience that many budgets are tight this year, but that key projects are still getting funded. And it is my strong opinion that in the coming years many companies will increase their focus on analytics, and I think that increasingly analytics will be a source of competitive advantage for these companies.

There are other people and other surveys that provide better insight into the trends in IT spending. For example, Gartner’s recent survey of over 1,500 CIOs (http://www.gartner.com/it/page.jsp?id=855612 ) suggests that 2009 IT spending is likely to be flat. I’m personally happy to see that in the Gartner survey, Business Intelligence is again CIOs’ top technology priority, and that “increasing the use of information/analytics” is the #5 business priority.

Ajay- I noticed you advise SPSS among others. Describe what an advisory role for an analytics company involves, and how can small open source companies get renowned advisors?

Karl- We have advised Oracle, SPSS, Hewlett-Packard and several smaller companies. We find that advisory roles vary greatly. The biggest source of variation is what the company wants advice about. Examples include:

* assessing opportunity areas for the application of analytics
* strategic data assessments
* analytic strategy
* product strategy
* reviewing software

Both large and small companies that look to apply analytics to their businesses can benefit from analytic advisors. So can open source companies that sell analytic software. Companies can find analytic advisors in several ways. One way is to look around for analytic experts whose advice you trust, and hire them. Networking in your own industry and in the analytic communities can identify potential advisors. Don’t forget to look in both academia and the business world. Many skilled people cross back and forth between these two worlds. Another way for these companies to obtain analytic advice is to look in their business networks and user communities for analytic specialists who share some of the goals of the company – they will be motivated for your company to succeed. Especially if focused topic areas or time-constrained tasks can be identified, outside experts may be willing to donate their time, and they may be flattered that you asked.

Ajay- What made you decide to begin the Rexer Surveys? Describe some results of last year’s surveys and any trends from the last three years that you have seen.

Karl- I’ve been involved on the organizing committees of several data mining workshops and conferences. At these conferences I talk with a lot of data miners and companies involved in data mining. I found that many people were interested in hearing about what other data miners were doing: what algorithms, what types of data, what challenges were being faced, what they liked and disliked about their data mining tools, etc. Since we conduct online surveys for several of our clients, and my network of data miners is pretty large, I realized that we could easily do a survey of data miners, and share the results with the data mining community. In the first year, 314 data miners participated, and it’s just grown from there. In 2009 over 700 people completed the survey. The interest we’ve seen in our research summaries has also been astounding – we’ve had thousands of requests. Overall, this just confirms what we originally thought: people are hungry for information about data mining.

Here is a preview of findings from the initial analyses of the 2009 survey data:

* Each year we’ve seen that the most commonly used algorithms are decision trees, regression, and cluster analysis.
* Consistently, some of the top challenges data miners report are dirty data and explaining data mining to others. Previously, data access issues were also reported as a big challenge, but in 2009 fewer data miners reported facing this challenge.
* The most prevalent concerns with how data mining is being utilized are: insufficient training of some data miners, and resistance to using data mining in contexts where it would be beneficial.
* Data mining is playing an important role in organizations. Half of data miners indicate their results are helping to drive strategic decisions and operational processes.
* But there’s room for data mining to grow – almost 20% of data miners report that their company/organizations have only minimal analytic capabilities.

Bio-

Karl Rexer, PhD is President of Rexer Analytics, a small Boston-based consulting firm. Rexer Analytics provides analytic and CRM consulting to help clients use their data to make better strategic and tactical decisions. Recent projects include fraud detection, sales forecasting, customer segmentation, loyalty analyses, predictive modeling for cross-sell and attrition, and survey research. Rexer Analytics also conducts an annual survey of data miners and freely distributes research summaries to the data mining community. Karl has been on the organizing committees of several international data mining conferences, including three KDD conferences and BIWA-2008. Karl is on the SPSS Customer Advisory Board and on the Board of Directors of the Oracle Business Intelligence, Warehousing, & Analytics (BIWA) Special Interest Group. Karl and other Rexer Analytics staff are frequent invited speakers at MBA data mining classes and conferences.

To know more, do check out the website at www.rexeranalytics.com
