, the Universal PMML Plug-in for in-database scoring. Available now for the EMC Greenplum Database, a high-performance massively parallel processing (MPP) database, the plug-in leverages the Predictive Model Markup Language (PMML) to execute predictive models directly within EMC Greenplum, for highly optimized in-database scoring.
Developed by the Data Mining Group (DMG), PMML is supported by all major data mining vendors, e.g., IBM SPSS, SAS, Teradata, FICO, STASTICA, Microstrategy, TIBCO and Revolution Analytics as well as open source tools like R, KNIME and RapidMiner. With PMML, models built in any of these data mining tools can now instantly be deployed in the EMC Greenplum database. The net result is the ability to leverage the power of standards-based predictive analytics on a massive scale, right where the data resides.
“By partnering with Zementis, a true PMML innovator, we are able to offer a vendor-agnostic solution for moving enterprise-level predictive analytics into the database execution environment,” said Dr. Steven Hillion, Vice President of Analytics at EMC Greenplum. “With Zementis and PMML, the de-facto standard for representing data mining models, we are eliminating the need to recode predictive analytic models in order to deploy them within our database. In turn, this enables an analyst to reduce the time to insight required in most businesses today.”
Want to learn more?
To learn more about how the EMC Greenplum Database and the Universal PMML Plug-in work together, feel free to:
Another aspect of tool usefulness is how much does it help with the entire data mining process from data preparation and cleaning, modeling, evaluation, visualization and presentation (excluding deployment).
New KDnuggets Poll is asking: What part of your analytics / data mining work in the past 12 months was done in R?
More than 71 Million individuals in the United States are admitted to
hospitals each year, according to the latest survey from the American
Hospital Association. Studies have concluded that in 2006 well over
$30 billion was spent on unnecessary hospital admissions. Each of
these unnecessary admissions took away one hospital bed from someone
else who needed it more.
The goal of the prize is to develop a predictive algorithm that can identify patients who will be admitted to the hospital within the next year, using historical claims data.
Official registration will open in 2011, after the launch of the prize. At that time, pre-registered teams will be notified to officially register for the competition. Teams must consent to be bound by final competition rules.
Registered teams will develop and test their algorithms. The winning algorithm will be able to predict patients at risk for an unplanned hospital admission with a high rate of accuracy. The first team to reach the accuracy threshold will have their algorithms confirmed by a judging panel. If confirmed, a winner will be declared.
The competition is expected to run for approximately two years. Registration will be open throughout the competition.
Registered teams will be granted access to two separate datasets of de-identified patient claims data for developing and testing algorithms: a training dataset and a quiz/test dataset. The datasets will be comprised of de-identified patient data. The datasets will include:
Outpatient encounter data
Hospitalization encounter data
Medication dispensing claims data, including medications
Outpatient laboratory data, including test outcome values
The data for each de-identified patient will be organized into two sections: “Historical Data” and “Admission Data.” Historical Data will represent three years of past claims data. This section of the dataset will be used to predict if that patient is going to be admitted during the Admission Data period. Admission Data represents previous claims data and will contain whether or not a hospital admission occurred for that patient; it will be a binary flag.
The training dataset includes several thousand anonymized patients and will be made available, securely and in full, to any registered team for the purpose of developing effective screening algorithms.
The quiz/test dataset is a smaller set of anonymized patients. Teams will only receive the Historical Data section of these datasets and the two datasets will be mixed together so that teams will not be aware of which de-identified patients are in which set. Teams will make predictions based on these data sets and submit their predictions to HPN through the official Heritage Health Prize web site. HPN will use the Quiz Dataset for the initial assessment of the Team’s algorithms. HPN will evaluate and report back scores to the teams through the prize website’s leader board.
Scores from the final Test Dataset will not be made available to teams until the accuracy thresholds are passed. The test dataset will be used in the final judging and results will be kept hidden. These scores are used to preserve the integrity of scoring and to help validate the predictive algorithms.
Teams can begin developing and testing their algorithms as soon as they are registered and ready. Teams will log onto the official Heritage Health Prize website and submit their predictions online. Comparisons will be run automatically and team accuracy scores will be posted on the leader board. This score will be only on a portion of the predictions submitted (the Quiz Dataset), the additional results will be kept back (the Test Dataset).
Once a team successfully scores above the accuracy thresholds on the online testing (quiz dataset), final judging will occur. There will be three parts to this judging. First, the judges will confirm that the potential winning team’s algorithm accurately predicts patient admissions in the Test Dataset (again, above the thresholds for accuracy).
Next, the judging panel will confirm that the algorithm does not identify patients and use external data sources to derive its predictions. Lastly, the panel will confirm that the team’s algorithm is authentic and derives its predictive power from the datasets, not from hand-coding results to improve scores. If the algorithm meets these three criteria, it will be declared the winner.
Failure to meet any one of these three parts will disqualify the team and the contest will continue. The judges reserve the right to award second and third place prizes if deemed applicable.