Rattle Re-Introduced

The latest version of Rattle, the R GUI devoted to data mining, just went online.

The change log is below. Dr Graham Williams is also coming out with a book on using Rattle.

Source: http://cran.r-project.org/web/packages/rattle/index.html

rattle (2.5.42) unstable; urgency=low

  * Update rattle.info() to recursively identify all dependencies, report
    their version number and any updates available from CRAN and generate
    command to update packages that have updates available. See
    ?rattle.info for the options.

  * Fix bug causing R Dataset option of the Evaluate window to always
    revert to the first named dataset.

  * Fix bug in transforms where weights were not being handled in
    refreshing of the Data tab.

  * Fix a bug in box plots when trying to label outliers when there aren't
    any.

 -- Graham Williams <Graham.Williams@togaware.com>  Sun, 19 Sep 2010 05:01:51 +1000

rattle (2.5.41) unstable; urgency=low

  * Use GtkBuilder for Export dialog.

  * Test use of glade vs GtkBuilder on multiple platforms.

  * Rename rattle.info to rattle.version.

  * Add weight column to data tab.

  * Support weights for nnet, multinom, survival.

  * Add weights information to PMML as a PMML Extension.

  * Ensure GtkFrame is available as a data type whilst waiting for updated
    RGtk2.

  * Bug fix to packageIsAvailable not returning any result.

  * Replace destroy with withdraw for plot window as the former has
    started crashing R.

  * Improve Log formatting for various model build commands.

  * Be sure to include the car package for Anova for multinom models.

  * Release pmml 1.2.24: Bug fix glm binomial regression - note as
    classification model.

 -- Graham Williams <Graham.Williams@togaware.com>  Wed, 15 Sep 2010 14:56:09 +1000

I also recorded a video exploring various Rattle options using Camtasia, a very useful tool for screen capture and video tutorials, available from http://www.techsmith.com/download/camtasiatrial.asp

Update: my video skills being quite bad, I replaced it with another video. Camtasia is nevertheless the best screen capture video tool.

Also, an update: Analyticdroid is on hold for now. For more details, see http://rattle.togaware.com/

Event: Predictive analytics with R, PMML and ADAPA

From http://www.meetup.com/R-Users/calendar/14405407/

The September meeting is at the Oracle campus. (This is next door to the Oracle towers, so there is plenty of free parking.) The featured talk is from Alex Guazzelli (Vice President – Analytics, Zementis Inc.) who will talk about “Predictive analytics with R, PMML and ADAPA”.

Agenda:
* 6:15 – 7:00 Networking and Pizza (with thanks to Revolution Analytics)
* 7:00 – 8:00 Talk: Predictive analytics with R, PMML and ADAPA
* 8:00 – 8:30 General discussion

Talk overview:

The rule in the past was that whenever a model was built in a particular development environment, it remained in that environment forever, unless it was manually recoded to work somewhere else. This rule has been shattered with the advent of PMML (Predictive Modeling Markup Language). By providing a uniform standard to represent predictive models, PMML allows for the exchange of predictive solutions between different applications and various vendors.

Once exported as PMML files, models are readily available for deployment into an execution engine for scoring or classification. ADAPA is one example of such an engine. It takes in models expressed in PMML and transforms them into web-services. Models can be executed either remotely by using web-services calls, or via a web console. Users can also use an Excel add-in to score data from inside Excel using models built in R.

R models have been exported into PMML and uploaded in ADAPA for many different purposes. Use cases where clients have used the flexibility of R to develop and the PMML standard combined with ADAPA to deploy range from financial applications (e.g., risk, compliance, fraud) to energy applications for the smart grid. The ability to easily transition solutions developed in R to the operational IT production environment helps eliminate the traditional limitations of R, e.g. performance for high volume or real-time transactional systems and memory constraints associated with large data sets.

Speaker Bio:

Dr. Alex Guazzelli has co-authored the first book on PMML, the Predictive Model Markup Language which is the de facto standard used to represent predictive models. The book, entitled PMML in Action: Unleashing the Power of Open Standards for Data Mining and Predictive Analytics, is available on Amazon.com. As the Vice President of Analytics at Zementis, Inc., Dr. Guazzelli is responsible for developing core technology and analytical solutions under ADAPA, a PMML-based predictive decisioning platform that combines predictive analytics and business rules. ADAPA is the first system of its kind to be offered as a service on the cloud.
Prior to joining Zementis, Dr. Guazzelli was involved in not only building but also deploying predictive solutions for large financial and telecommunication institutions around the globe. In academia, Dr. Guazzelli worked with data mining, neural networks, expert systems and brain theory. His work in brain theory and computational neuroscience has appeared in many peer reviewed publications. At Zementis, Dr. Guazzelli and his team have been involved in a myriad of modeling projects for financial, health-care, gaming, chemical, and manufacturing industries.

Dr. Guazzelli holds a Ph.D. in Computer Science from the University of Southern California and an M.S. and a B.S. in Computer Science from the Federal University of Rio Grande do Sul, Brazil.

PMML 4.0

There are some nice changes in the PMML 4.0 version. PMML is the XML-based standard for representing data mining models or, quoting the Data Mining Group (DMG) itself:

PMML uses XML to represent mining models. The structure of the models is described by an XML Schema. One or more mining models can be contained in a PMML document. A PMML document is an XML document with a root element of type PMML. The general structure of a PMML document is:

  <?xml version="1.0"?>
  <PMML version="4.0"
    xmlns="http://www.dmg.org/PMML-4_0"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" >

    <Header copyright="Example.com"/>
    <DataDictionary> ... </DataDictionary>

    ... a model ...

  </PMML>
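
As a rough illustration of the structure just quoted, the sketch below parses a skeleton PMML document in R and reads back the root element name, the version attribute, and the names of the top-level children. It assumes the XML package is installed. The skeleton is made up for illustration: it contains no model and an empty DataDictionary, so it is not a schema-valid PMML file.

```r
library(XML)  # the same package used in the XML parsing example later in this post

# A made-up PMML skeleton (no model inside, not schema-valid).
pmml_text <- '<?xml version="1.0"?>
<PMML version="4.0"
  xmlns="http://www.dmg.org/PMML-4_0"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <Header copyright="Example.com"/>
  <DataDictionary/>
</PMML>'

doc  <- xmlParse(pmml_text, asText = TRUE)
root <- xmlRoot(doc)

root_name    <- xmlName(root)                # element name of the root
pmml_version <- xmlGetAttr(root, "version")  # version attribute, as a string
# names of the direct children, ignoring any whitespace-only text nodes
child_names  <- setdiff(names(xmlChildren(root)), "text")

print(root_name)
print(pmml_version)
print(child_names)
```

A check like this is only a quick sanity test on the outer structure; it says nothing about whether the model inside a real PMML file is valid.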

So what is new in version 4.0? Here are some powerful modeling changes. For anyone with any XML knowledge, PMML is the way to go.

PMML 4.0 – Changes from PMML 3.2

Associations

  • Itemset and AssociationRule elements are no longer enclosed within a “Choice” element
  • Added different scoring procedures: recommendation, exclusiveRecommendation and ruleAssociation with explanation and example
  • Changed version to “4.0” from “3.2” in the example(s)

BuiltinFunctions

Added the following functions:
  • isMissing
  • isNotMissing
  • equal
  • notEqual
  • lessThan
  • lessOrEqual
  • greaterThan
  • greaterOrEqual
  • isIn
  • isNotIn
  • and
  • or
  • not
  • if

ClusteringModel

  • Changed version to “4.0” from “3.2” in the example(s)
  • Added reference to ModelExplanation element in the model XSD

Conformance

  • Changed all version references from “3.2” to “4.0”

DataDictionary

  • No changes

Functions

  • No changes

GeneralRegression

  • Changed to allow for Cox survival models and model ensembles
    • Add new model type: CoxRegression.
    • Allow empty regression model when model type is CoxRegression, so that baseline-only model could be represented.
    • Add new optional model attributes: endTimeVariable, startTimeVariable, subjectIDVariable, statusVariable, baselineStrataVariable, modelDF.
    • Add optional Matrix in Predictor to specify a contrast matrix, optional attribute referencePoint in Parameter.
    • Add new elements: BaseCumHazardTables, EventValues, BaselineStratum, BaselineCell.
    • Add examples of scoring for Cox Regression and contrast matrices.
    • Add new type of distribution: tweedie.
    • Add new attribute in model: targetReferenceCategory, so that the model can be used in MiningModel.
    • Changed version to “4.0” from “3.2” in the example(s)
    • Added reference to ModelExplanation element in the model XSD

GeneralStructure

Header

  • No changes

Interoperability

  • Changed: “As a result, a new approach for interoperability was required and is being introduced in PMML version 3.2.” to “As a result, a new approach for interoperability was introduced in PMML version 3.2.”

MiningSchema

  • Added frequencyWeight and analysisWeight as new options for usageType. They will not affect scoring, but will make model information more complete.

ModelComposition — No longer used, replaced by MultipleModels

ModelExplanation

  • New addition to PMML 4.0 that contains information to explain the models, model fit statistics, and visualization information.

ModelVerification

  • No changes

MultipleModels

  • Replaces ModelComposition. Important additions are segmentation and ensembles.
  • Added reference to ModelExplanation element in the model XSD

NaïveBayes

  • Changed version to “4.0” from “3.2” in the example(s)
  • Added reference to ModelExplanation element in the model XSD

NeuralNetwork

  • Changed version to “4.0” from “3.2” in the example(s)
  • Added reference to ModelExplanation element in the model XSD

Output

  • Extended output type to include Association rule models. The changes add a number of new attributes: “ruleFeature”, “algorithm”, “rank”, “rankBasis”, “rankOrder” and “isMultiValued”. A new enumeration type “ruleValue” is added to the RESULT-FEATURE

Regression

  • Changed version to “4.0” from “3.2” in the example(s)
  • Added reference to ModelExplanation element in the model XSD

RuleSet

  • Changed version to “4.0” from “3.2” in the example(s)
  • Added reference to ModelExplanation element in the model XSD

Sequence

  • Changed version to “4.0” from “3.2” in the example(s)

Statistics

  • accommodate weighted counts by replacing INT-ARRAY with NUM-ARRAY in DiscrStats and ContStats
  • change xs:nonNegativeInteger to xs:double in several places
  • add new boolean attribute ‘weighted’ to UnivariateStats and PartitionFieldStats elements
  • add new attribute cardinality in Counts
  • Also some very long lines in this document are now wrapped.

SupportVectorMachine

  • Added optional attribute threshold
  • Added optional attribute classificationMethod
  • Attribute alternateTargetCategory removed from SupportVectorMachineModel element and moved to SupportVectorMachine element
  • Changed the example slightly
  • Changed version to “4.0” from “3.2” in the example(s)
  • Added reference to ModelExplanation element in the model XSD

Targets

  • No changes

Taxonomy

  • Changed: “A TableLocator may contain any description which helps an application to locate a certain table. PMML 3.2 does not yet define the content. PMML users have to use their own extensions. The same applies to InlineTable.” to “A TableLocator may contain any description which helps an application to locate a certain table. PMML standard does not yet define the content. PMML users have to use their own extensions. The same applies to InlineTable.”

Text

  • Changed version to “4.0” from “3.2” in the example(s)
  • Added reference to ModelExplanation element in the model XSD

TimeSeriesModel

  • New addition to PMML 4.0 to support Time series models

Transformations

  • No changes

TreeModel

  • Changed version to “4.0” from “3.2” in the example(s)
  • Added reference to ModelExplanation element in the model XSD

Sources

http://www.dmg.org/v4-0/GeneralStructure.html

http://www.dmg.org/v4-0/Changes.html

And here are some companies already using PMML:

http://www.dmg.org/products.html

I found the tool at http://www.dmg.org/coverage/ much more interesting though (see screenshot).

[Screenshot: the DMG coverage page in Mozilla Firefox]

Zementis, which we have covered in earlier interviews, has played a stellar role in bringing together this common standard for data mining. Note that the KXEN model is also highlighted there.

The best PMML converter tutorial is here:

http://www.zementis.com/videos/PMML_Converter_iGoogle_gadget_2_demo.htm

KNIME and Zementis shake hands

Two very good and very customer-centric (and open source) companies shook hands on a strategic partnership today:

KNIME (www.knime.org) and Zementis (www.zementis.com).

Decision Stats has been covering both companies. Their products are amazingly good, sync very well thanks to their support for the PMML standard, and lower costs considerably for the customer (see http://www.decisionstats.com/2009/02/knime/ and http://www.decisionstats.com/2009/02/interview-michael-zeller-ceozementis/).

KNIME is available under both a free personal license and a commercial license, and it supports R very well, in part through the PMML standard (a www.dmg.org initiative).

See http://www.knime.org/blog/export-and-convert-r-models-pmml-within-knime

The following example R script learns a decision tree based on the Iris-Data and exports this as PMML and as an R model which is understood by the R Predictor node:

# load the library for learning a tree model
library(rpart)
# load the pmml export library
library(pmml)
# R is the input data frame that KNIME passes to the script;
# use its class column as the predicted column to build the decision tree
dt <- rpart(class ~ ., data = R)
# export to PMML
r_pmml <- pmml(dt)
# write the PMML model to an export file
write(toString(r_pmml), file = "C:/R.pmml")
# provide the native R model at the out-port
R <- dt
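
The script above only runs inside KNIME, where the variable R is supplied by the node. For readers who want to try the same export in a plain R session, here is a minimal sketch under a few assumptions: the rpart, pmml and XML packages are installed, the built-in iris data frame (whose class column is called Species) stands in for the KNIME input, and the output path is an arbitrary temporary file.

```r
library(rpart)  # decision trees
library(pmml)   # PMML export
library(XML)    # provides saveXML()

# Fit a classification tree on the iris data that ships with R.
fit <- rpart(Species ~ ., data = iris)

# Convert the fitted tree to a PMML document (an XML node object).
fit_pmml <- pmml(fit)

# Write the PMML to a temporary file; the file name is an arbitrary choice.
out_file <- file.path(tempdir(), "iris_tree.pmml")
saveXML(fit_pmml, file = out_file)
```

The resulting file can then be opened in any text editor to see the Header, DataDictionary and TreeModel elements described in the PMML section above.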

 

Zementis brings the total cost of ownership, and the total pain, of creating scored models down to something close to US$1 per hour, thanks to its proprietary ADAPA engine.

As mentioned before, Zementis is at the forefront of using cloud computing (Amazon EC2) for open source analytics. Recently I came in contact with Michael Zeller for a business problem, and Mike, being the gentleman he is, not only helped me out but also agreed to an extensive and exclusive interview.

Ajay- What are the traditional rivals to the scoring solutions you offer, and how does ADAPA compare to each of them? As a case study, assume I have 50,000 leads daily on a car-buying website. How would ADAPA help me score them with a model created in, say, KXEN, R, SAS, or SPSS? What would my approximate cost advantages be if I intend to mail, say, the top 5 deciles every day?

Michael- Some of the traditional scoring solutions used today are based on SAS, in-database scoring like Oracle, MS SQL Server, or very often even custom code.  ADAPA is able to import the models from all tools that support the PMML standard, so any of the above tools, open source or commercial, could serve as an excellent development environment.

The key differentiators for ADAPA are simple and focus on cost-effective deployment:

1) Open Standards – PMML & SOA:

Freedom to select best-of-breed development tools without being locked into a specific vendor;  integrate easily with other systems.

2) SaaS-based Cloud Computing:

Delivers a quantum leap in cost-effectiveness without compromising on scalability.

In your example, I assume that you’d be able to score your 50,000 leads in one hour using one ADAPA engine on Amazon.  Therefore, you could choose to either spend US$100,000 or more on hardware, software, maintenance, IT services, etc., write a project proposal, get it approved by management, and be ready to score your model in 6-12 months

OR, you could use ADAPA at something around US$1-$2 per day for the scenario above and get started today!  To get my point across here, I am of course simplifying the scenario a little bit, but in essence these are your choices.

Sounds too good to be true?  We often get this response, so please feel free to contact us today [http://www.zementis.com/contact.htm] and we will be happy to show you how easy it can be to deploy predictive models with ADAPA!

 

Ajay- The ADAPA solution seems to save money on both hardware and software costs. Please comment. Have you also run any benchmarking tests of a traditional scoring configuration versus ADAPA?

Michael-Absolutely, the ADAPA Predictive Analytics Edition [http://www.zementis.com/predictive_analytics_edition.htm] on Amazon’s cloud computing infrastructure (Amazon EC2) eliminates the upfront investment in hardware and software.  It is a true Software as a Service (SaaS) offering on Amazon EC2 [http://www.zementis.com/howtobuy.htm] whereby users only pay for the actual machine time starting at less than US$1 per machine hour.  The ADAPA SaaS model is extremely dynamic, e.g., a user is able to select an instance type most appropriate for the job at hand (small, large, x-large) or launch one or even 100 instances within minutes.

In addition to the above savings in hardware/software, ADAPA also cuts the time-to-market for new models (priceless!) which adds to business agility, something truly critical for the current economic climate.

Regarding a benchmark comparison, it really depends on what is most important to the business.  Business agility, time-to-market, open standards for integration, or pure scoring performance?  ADAPA addresses all of the above.  At its core, it is a highly scalable scoring engine which is able to process thousands of transactions per second.  To tackle even the largest problems, it is easy to scale ADAPA via more CPUs, clustering, or parallel execution on multiple independent instances. 

Need to score lots of data once a month which would take 100 hours on one computer?  Simply launch 10 instances and complete the job in 10 hours overnight.  No extra software licenses, no extra hardware to buy: that’s capacity truly on-demand, whenever needed, and cost-effective.
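
The arithmetic behind that last example can be sketched in a few lines of R. The job length and instance count come from the answer above; the hourly rate is an assumed round number for illustration (the interview only says pricing starts at less than US$1 per machine hour), and the sketch assumes the work splits perfectly evenly across instances.

```r
# Illustrative figures only: rate_per_hour is an assumption, not a quoted price.
total_hours   <- 100  # run time of the monthly job on a single machine
n_instances   <- 10   # machines launched in parallel
rate_per_hour <- 1    # assumed US$ per machine hour

wall_clock_hours <- total_hours / n_instances       # elapsed time
machine_hours    <- wall_clock_hours * n_instances  # billed time
total_cost       <- machine_hours * rate_per_hour   # US$

cat("Elapsed hours:", wall_clock_hours, "\n")
cat("Machine hours billed:", machine_hours, "\n")
cat("Cost in US$:", total_cost, "\n")
```

Under these assumptions, adding instances shortens the elapsed time while the number of billed machine hours, and therefore the cost, stays the same.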

Ajay- What has been your vision for Zementis? What exciting products are we going to see from it next?

Michael – Our vision at Zementis [http://www.zementis.com] has been to make it easier for users to leverage analytics.  The primary focus of our products is on the deployment side, i.e., how to integrate predictive models into the business process and leverage them in real-time.  The complexity of deployment and the cost associated with it has been the main hurdle for a more widespread adoption of predictive analytics. 

Adhering to open standards like the Predictive Model Markup Language (PMML) [http://www.dmg.org/] and SOA-based integration, our ADAPA engine [http://www.zementis.com/products.htm] paves the way for new use cases of predictive analytics — wherever a painless, fast production deployment of models is critical or where the cost of real-time scoring has been prohibitive to date.

We will continue to contribute to the R/PMML export package [http://www.zementis.com/pmml_exporters.htm] and extend our free PMML converter [http://www.zementis.com/pmml_converters.htm] to support the adoption of the standard.  We believe that the analytics industry will benefit from open standards and we are just beginning to grasp what data-driven decision technology can do for us.  Without giving away much of our roadmap, please stay tuned for more exciting products that will make it easier for businesses to leverage the power of predictive analytics!

Ajay- Any India- or Asia-specific plans for Zementis?

Michael-Zementis already serves customers in the Asia/Pacific region from its office in Hong Kong.  We expect rapid growth for predictive analytics in the region and we think our cost-effective SaaS solution on Amazon EC2 will be of great service to this market.  I could see various analytics outsourcing and consulting firms benefit from using ADAPA as their primary delivery mechanism to provide clients with predictive  models that are ready to be executed on-demand.

Ajay- What do you believe will be the biggest challenges for analytics in 2009? What are the biggest opportunities?

Michael-The biggest challenge for analytics will most likely be the reduction in technology spending in a deep, global recession.  At the same time, companies must take advantage of analytics to cut cost, optimize processes, and to become more competitive.  Therefore, the biggest opportunity for analytics will be in the SaaS field, enabling clients to employ analytics without upfront capital expenditures.

Ajay- What made you choose a career in science? Describe your journey so far. What would your advice be to young science graduates in these recessionary times?

Michael- As a physicist, my research focused on neural networks and intelligent systems.  Predictive analytics is a great way for me to stay close to science while applying such complex algorithms to solve real business problems.  Even in a recession, there is always a need for good people with the desire to excel in their profession.  Starting your career, I’d say the best way is to remain broad in expertise rather than being too specialized on one particular industry or proficient in a single analytics tool.  A good foundation of math and computer science, combined with curiosity in how to apply analytics to specific business problems will provide opportunities, even in the current economic climate.

About Zementis

Zementis, Inc. is a software company focused on predictive analytics and advanced Enterprise Decision Management technology. We combine science and software to create superior business and industrial solutions for our clients. Our scientific expertise includes statistical algorithms, machine learning, neural networks, and intelligent systems and our scientists have a proven record in producing effective predictive models to extract hidden patterns from a variety of data types. It is complemented by our product offering ADAPA, a decision engine framework for real-time execution of predictive models and rules. For more information please visit www.zementis.com

Ajay- If you have a lot of data (GBs and GBs) and an existing model (in SAS, SPSS, or R) that you have converted to PMML, and you are facing a choice between spending more money to upgrade your hardware or renewing your software licenses, take a look instead at ADAPA from www.zementis.com and score models for as little as US$1 per hour. Check it out (test and control!).

Do you have any additional questions for Michael? Use the comments page to ask.

Parsing XML files easily

To parse an XML (or KML or PMML) file easily without any complicated software, here is a piece of code that fits right into your Excel sheet.

Just import this file using Excel, paste the XML code into one cell, and then use the function getElement.

The function simply reads the XML/KML code as a text string. Paste all the XML code into one cell, then pass the start and end markers (for example, start=<constraints> and end=</constraints> to get the value of constraints in the XML code).

The function returns that value, which you can read into another cell.

Here is the code if you ever need it. Just paste it into the VB editor of Excel to create the getElement function (if it is not there already), or simply import the file in the link above.

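Attribute VB_Name = "Module1"
' The Attribute line above belongs only in an imported .bas file; omit it when pasting into the VB editor.
' Returns the text between the first occurrence of start and the next occurrence of finish.
Public Function getElement(xml As String, start As String, finish As String) As String
    Dim i As Long, j As Long
    For i = 1 To Len(xml)
        If Mid(xml, i, Len(start)) = start Then
            For j = i + Len(start) To Len(xml)
                If Mid(xml, j, Len(finish)) = finish Then
                    getElement = Mid(xml, i + Len(start), j - i - Len(start))
                    Exit Function
                End If
            Next j
        End If
    Next i
End Function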
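
For readers who work in R rather than Excel, the same start/end marker idea can be sketched in base R. This is a rough equivalent rather than a port: the function name get_element and the sample string are made up for illustration, and, like the VBA version, it treats the XML as plain text instead of parsing it.

```r
# Return the text between the first occurrence of `start` and the next
# occurrence of `finish`; NA if either marker is missing.
get_element <- function(xml, start, finish) {
  s <- regexpr(start, xml, fixed = TRUE)
  if (s == -1) return(NA_character_)
  rest <- substring(xml, s + nchar(start))  # everything after the start marker
  e <- regexpr(finish, rest, fixed = TRUE)
  if (e == -1) return(NA_character_)
  substr(rest, 1, e - 1)
}

# Hypothetical sample input
xml_text <- "<model><constraints>max=10</constraints></model>"
result <- get_element(xml_text, "<constraints>", "</constraints>")
print(result)
```

Plain string matching like this is fine for quick lookups of a single, non-nested element; for anything more structured, a real XML parser is the safer choice.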

For parsing XML with the R package XML, see this site:

http://www.omegahat.org/RSXML/Overview.html

or this thread from R-help:

> Lines <- '
+ <root>
+  <data loc="1">
+    <val i="t1"> 22 </val>
+    <val i="t2"> 45 </val>
+  </data>
+  <data loc="2">
+    <val i="t1"> 44 </val>
+    <val i="t2"> 11 </val>
+  </data>
+ </root>
+ '
>
> library(XML)
> doc <- xmlTreeParse(Lines, asText = TRUE, trim = TRUE, useInternalNodes = TRUE)
> root <- xmlRoot(doc)
>
> data1 <- getNodeSet(root, "//data")[[1]]
> xmlValue(getNodeSet(data1, ".//val")[[1]])
[1] " 22 "