Here is a new-old system in open source for building and scoring statistical models, designed to work with data sets that are too large to fit into memory.
Augustus is an open source software toolkit for building and scoring statistical models. It is written in Python and its
most distinctive features are:
• The ability to work on big data: data sets that exceed either memory capacity or disk capacity, so that existing solutions like R or SAS cannot be used. Augustus is also perfectly capable of handling problems that fit on one computer.
• PMML compliance and the ability to both:
– produce models in a PMML-compliant format (saved with the extension .pmml).
– consume models from files in the PMML format.
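The streaming, one-pass style of computation that makes out-of-memory data sets tractable can be sketched with the standard library alone. This is an illustration of the general technique, not Augustus's own API; the file name and column layout are hypothetical.

```python
import csv

def running_mean(path, field):
    """Compute the mean of one column in a single streaming pass,
    holding only one row in memory at a time."""
    total, count = 0.0, 0
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle):
            total += float(row[field])
            count += 1
    return total / count if count else float("nan")
```

Because the pass never materializes the whole file, the same loop works whether the input is a few kilobytes or larger than RAM.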
Augustus has been tested and deployed on several operating systems. It is intended for developers who work in the financial or insurance industry, information technology, or in the science and research communities.
Augustus produces and consumes Baseline, Cluster, Tree, and Ruleset models. Currently, it uses a non-standard, event-based approach to building Tree, Cluster, and Ruleset models.
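A PMML model file is ordinary XML, so any consumer can read it with standard tools. The fragment below is a hand-written toy, not a model produced by Augustus, and it uses the PMML 4.1 namespace for illustration:

```python
import xml.etree.ElementTree as ET

# A minimal, illustrative PMML document: a header plus a one-field
# data dictionary. Real models carry much more (mining schema,
# model parameters, statistics, ...).
PMML = """\
<PMML version="4.1" xmlns="http://www.dmg.org/PMML-4_1">
  <Header description="toy example"/>
  <DataDictionary numberOfFields="1">
    <DataField name="temperature" optype="continuous" dataType="double"/>
  </DataDictionary>
</PMML>
"""

NS = {"pmml": "http://www.dmg.org/PMML-4_1"}
root = ET.fromstring(PMML)
# A consumer recovers the field definitions directly from the XML.
fields = [f.get("name") for f in root.findall(".//pmml:DataField", NS)]
print(fields)  # ['temperature']
```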
New to PMML?
The Predictive Model Markup Language (PMML) is a vendor-driven XML markup language for specifying statistical and data mining models. In other words, it is an XML language in which analytic models can be expressed in a platform- and application-independent fashion.
Without PMML, it is often the case that:
- Models are deployed in proprietary formats.
- Models are application dependent.
- Models are system dependent.
- Models are architecture dependent.
- The time required to deploy models is long.
PMML aims for portability and safe deployments.
PMML’s approach to developing and deploying analytical applications is based upon a few key concepts:
- View analytic models as first class objects. With PMML, statistical and data mining models can be thought of as first class objects described using XML. Applications or services can be thought of as producing PMML or consuming PMML. A PMML XML file contains enough information so that an application can process and score a data stream with a statistical or data mining model using only the information in the PMML file.
- Provide an interface between model producers and model consumers. Broadly speaking most analytic applications consist of a learning phase that creates a (PMML) model and a scoring phase that employs the (PMML) model to score a data stream or batch of records. The learning phase usually consists of the following sub-stages: exploratory data analysis, data preparation, event shaping, data modeling, & model validation. The scoring phase is typically simpler and either a stream or batch of data is scored using a model. PMML is designed so that different systems and applications can be used for producing models (PMML Producers) and for consuming models (PMML Consumers).
- View data as event based. Many analytic applications can be naturally thought of as event based. Event based data presents itself as a stream of events that are transformed, integrated, or aggregated to produce the state vectors that are inputs to statistical or data mining models. The current version of PMML provides implicit support for event based processing of data; future versions are expected to provide explicit support.
- Support data preparation. As mentioned above, data preparation is often the most time consuming part of the data mining process. PMML provides explicit support for many common data transformations and aggregations used when preparing data. Once encapsulated in this way, data preparation can more easily be re-used and leveraged by different components and applications.
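The event-based view and the data-preparation stage described above can be sketched together: a stream of raw events is aggregated per key into the state vectors a model would consume. The event shape (a (sensor, reading) pair) is an assumption made for illustration:

```python
from collections import defaultdict

def aggregate(events):
    """Fold a stream of (key, value) events into a per-key state
    vector of count and mean -- a typical aggregation used when
    preparing event data for a model."""
    state = defaultdict(lambda: {"count": 0, "sum": 0.0})
    for key, value in events:
        s = state[key]
        s["count"] += 1
        s["sum"] += value
    return {k: {"count": s["count"], "mean": s["sum"] / s["count"]}
            for k, s in state.items()}

events = [("sensorA", 1.0), ("sensorA", 3.0), ("sensorB", 5.0)]
print(aggregate(events))
```

Encapsulating the aggregation this way is what lets the same preparation step be reused by different producers and consumers.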
PMML consists of the following components:
- Data Dictionary. The data dictionary defines the fields which are the inputs to models and specifies the type and value range for each field.
- Mining Schema. Each model contains one mining schema which lists the fields used in the model. These fields are a subset of the fields in the Data Dictionary. The mining schema contains information that is specific to a certain model, while the data dictionary contains data definitions which do not vary with the model. For example, the Mining Schema specifies the usage type of an attribute, which may be active (an input of the model), predicted (an output of the model), or supplementary (holding descriptive information and ignored by the model).
- Transformation Dictionary. The Transformation Dictionary defines derived fields. Derived fields may be defined by normalization, which maps continuous or discrete values to numbers; by discretization, which maps continuous values to discrete values; by value mapping, which maps discrete values to discrete values; or by aggregation, which summarizes or collects groups of values, for example by computing averages.
- Model Statistics. The Model Statistics component contains basic univariate statistics about the model, such as the minimum, maximum, mean, standard deviation, median, etc. of numerical attributes.
- Model Parameters. PMML also specifies the actual parameters defining the statistical and data mining models per se. Models in PMML include regression models, clustering models, trees, neural networks, Bayesian models, association rules, and sequence models.
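To show how these components fit together, here is a toy PMML tree model whose Mining Schema marks one field active and one predicted, parsed with the standard library. The XML is illustrative only (the namespace is omitted for brevity):

```python
import xml.etree.ElementTree as ET

PMML = """\
<PMML version="4.1">
  <DataDictionary numberOfFields="2">
    <DataField name="height" optype="continuous" dataType="double"/>
    <DataField name="class" optype="categorical" dataType="string"/>
  </DataDictionary>
  <TreeModel functionName="classification">
    <MiningSchema>
      <MiningField name="height" usageType="active"/>
      <MiningField name="class" usageType="predicted"/>
    </MiningSchema>
  </TreeModel>
</PMML>
"""

root = ET.fromstring(PMML)
# Data Dictionary: model-independent field definitions.
dictionary = {f.get("name"): f.get("dataType")
              for f in root.findall("./DataDictionary/DataField")}
# Mining Schema: the model-specific subset and each field's usage type.
usage = {f.get("name"): f.get("usageType")
         for f in root.findall("./TreeModel/MiningSchema/MiningField")}
print(dictionary)  # {'height': 'double', 'class': 'string'}
print(usage)       # {'height': 'active', 'class': 'predicted'}
```

Note how the mining schema fields are a subset of the data dictionary, as described above: the dictionary defines the data, while the schema says how this particular model uses it.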
The diagram below shows how input files to PMML models can be defined.
Data attributes are defined using the PMML Data Dictionary. The data attributes used in a model are defined using the PMML Mining Schema. In addition, derived attributes that are inputs to a model can be defined using the PMML Transformation Dictionary or using PMML-defined local transformations.
The Predictive Model Markup Language (PMML), http://www.dmg.org