Heritage Health Prize: Data Mining Contest for 3 Million USD


If the Netflix Prize offered 1 million USD for better online video recommendations, here is a chance to earn serious money, write great code, and save lives!

From http://www.heritagehealthprize.com/

Heritage Health Prize
Launching April 4


More than 71 million individuals in the United States are admitted to
hospitals each year, according to the latest survey from the American
Hospital Association. Studies have concluded that in 2006 well over
$30 billion was spent on unnecessary hospital admissions. Each of
these unnecessary admissions took away one hospital bed from someone
else who needed it more.

http://www.heritagehealthprize.com/competition.php

Prize Goal & Participation

The goal of the prize is to develop a predictive algorithm that can identify patients who will be admitted to the hospital within the next year, using historical claims data.

Official registration will open in 2011, after the launch of the prize. At that time, pre-registered teams will be notified to officially register for the competition. Teams must consent to be bound by final competition rules.

Registered teams will develop and test their algorithms. The winning algorithm will be able to predict patients at risk for an unplanned hospital admission with a high rate of accuracy. The first team to reach the accuracy threshold will have their algorithms confirmed by a judging panel. If confirmed, a winner will be declared.

The competition is expected to run for approximately two years. Registration will be open throughout the competition.

Data Sets

Registered teams will be granted access to two separate datasets of de-identified patient claims data for developing and testing algorithms: a training dataset and a quiz/test dataset. The datasets will be comprised of de-identified patient data. The datasets will include:

Outpatient encounter data
Hospitalization encounter data
Medication dispensing claims data, including medications
Outpatient laboratory data, including test outcome values

The data for each de-identified patient will be organized into two sections: “Historical Data” and “Admission Data.” Historical Data will represent three years of past claims data. This section of the dataset will be used to predict if that patient is going to be admitted during the Admission Data period. Admission Data represents previous claims data and will contain whether or not a hospital admission occurred for that patient; it will be a binary flag.

The training dataset includes several thousand anonymized patients and will be made available, securely and in full, to any registered team for the purpose of developing effective screening algorithms.

The quiz/test dataset is a smaller set of anonymized patients. Teams will only receive the Historical Data section of these datasets and the two datasets will be mixed together so that teams will not be aware of which de-identified patients are in which set. Teams will make predictions based on these data sets and submit their predictions to HPN through the official Heritage Health Prize web site. HPN will use the Quiz Dataset for the initial assessment of the Team’s algorithms. HPN will evaluate and report back scores to the teams through the prize website’s leader board.

Scores from the final Test Dataset will not be made available to teams until the accuracy thresholds are passed. The test dataset will be used in the final judging and results will be kept hidden. These scores are used to preserve the integrity of scoring and to help validate the predictive algorithms.

Teams can begin developing and testing their algorithms as soon as they are registered and ready. Teams will log onto the official Heritage Health Prize website and submit their predictions online. Comparisons will be run automatically and team accuracy scores will be posted on the leader board. This score will be based on only a portion of the predictions submitted (the Quiz Dataset); the remaining results will be held back (the Test Dataset).
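The quiz/test split described above can be sketched in a few lines of Python. This is only an illustration, not HPN's scoring code: the function name `score_submission`, the patient ids, and the use of plain accuracy as the metric are all assumptions, since the announcement does not say how accuracy will be measured.

```python
def score_submission(predictions, actual, quiz_ids):
    """Return (quiz_accuracy, test_accuracy) for one submission.

    predictions, actual: dicts mapping patient id -> 0/1 admission flag.
    quiz_ids: ids scored publicly; all remaining ids form the hidden test set.
    """
    def accuracy(ids):
        ids = list(ids)
        hits = sum(1 for pid in ids if predictions[pid] == actual[pid])
        return hits / len(ids)

    test_ids = set(actual) - set(quiz_ids)
    return accuracy(quiz_ids), accuracy(test_ids)


# Hypothetical data: six de-identified patients, 1 = admitted.
actual = {"p1": 1, "p2": 0, "p3": 0, "p4": 1, "p5": 0, "p6": 1}
predictions = {"p1": 1, "p2": 0, "p3": 1, "p4": 1, "p5": 0, "p6": 0}
quiz_ids = {"p1", "p2", "p3"}

quiz_score, test_score = score_submission(predictions, actual, quiz_ids)

# Only the quiz score would reach the leader board; the test score stays hidden.
leaderboard_entry = {"team": "example", "score": quiz_score}
```

Because a team submits predictions for every patient without knowing which ids are in the quiz set, tuning against the leader board gives no direct view of the test score.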


Once a team successfully scores above the accuracy thresholds on the online testing (quiz dataset), final judging will occur. There will be three parts to this judging. First, the judges will confirm that the potential winning team’s algorithm accurately predicts patient admissions in the Test Dataset (again, above the thresholds for accuracy).

Next, the judging panel will confirm that the algorithm does not identify patients or use external data sources to derive its predictions. Lastly, the panel will confirm that the team’s algorithm is authentic and derives its predictive power from the datasets, not from hand-coding results to improve scores. If the algorithm meets these three criteria, it will be declared the winner.

Failure to meet any one of these three parts will disqualify the team and the contest will continue. The judges reserve the right to award second and third place prizes if deemed applicable.


Google Refine

An interesting data cleaning tool from Google is available at

https://code.google.com/p/google-refine/

From the page at

https://code.google.com/p/google-refine/wiki/UserGuide

The Basics

First, although Google Refine might start out looking like a spreadsheet program (Microsoft Excel, Google Spreadsheets, etc.), don’t expect it to work like a spreadsheet program. That’s almost like expecting a database to work like a text editor.

Google Refine is NOT for entering new data one cell at a time. It is NOT for doing accounting.

Google Refine is for applying transformations over many existing cells in bulk, for the purpose of cleaning up the data, extending it with more data from other sources, and getting it to some form that other tools can consume.

To use Google Refine, think in big patterns. For example, to spot errors, think

Show me every row where the string length of the customer’s name is longer than 50 characters (because I suspect that the customer’s address is mistakenly included in the name field)
Show me every row where the contract fee is less than 1 (because I suspect the fee was entered in unit of thousand dollars rather than dollars)
Show me every row where the description field (scraped from some web site) contains “&amp;” (because I suspect it wasn’t decoded properly)

To edit data, think

For every row where the contract fee is less than 1, multiply the fee by 1000.
For every row where the customer name contains a comma (it has been entered as “last_name, first_name”), split the name by the comma, reverse the array, and join it back with a space (producing “first_name last_name”)
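The "think in big patterns" idea above can be illustrated outside Refine. The following is a plain Python sketch, not Refine's own expression language; the sample rows and the helper name `first_last` are made up for illustration.

```python
# Hypothetical sample data standing in for a Refine project.
rows = [
    {"name": "Smith, Jane", "fee": 0.5},
    {"name": "Bob Jones", "fee": 1200.0},
    {"name": "Lee, Ann", "fee": 300.0},
]


def first_last(name):
    """Turn 'last_name, first_name' into 'first_name last_name'."""
    parts = [p.strip() for p in name.split(",")]
    return " ".join(reversed(parts))


# Spot errors: every row where the fee is suspiciously small.
suspect = [r for r in rows if r["fee"] < 1]

# Edit data: for every row whose name contains a comma, reorder the name.
for r in rows:
    if "," in r["name"]:
        r["name"] = first_last(r["name"])

names = [r["name"] for r in rows]
```

In both steps a condition picks the rows and a single expression is applied to all of them, which is the habit the guide is asking for.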

To specify patterns, use filters and facets. Typically, you create a filter or facet on a particular column. For example, you can create a numeric facet on the “contract fee” column and adjust its range selector to select values less than 1. If the default facet doesn’t do what you want, you can configure it (by clicking “change” on the facet’s header). For example, you can create a text facet on the same “contract fee” column with this expression:

  value < 1

It will show 2 choices: true and false. Just select true. Then, invoke the Transform command on that same column and enter the expression

  value * 1000

That Transform command affects only rows where the “contract fee” cell contains a value less than 1.

You can use several filters and facets together. Only rows that are selected by all facets and filters will be shown in the data table. For example, say you have two text facets, one on the “contract fee” column with the expression

  value < 1

and another on the “state” column (with the default expression). If you select “true” in the first facet and “Nevada” in the second, then you will only see rows for contracts in Nevada with fees less than 1.
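The way several facets narrow the table, and the way Transform then touches only the selected rows, might be sketched like this in Python. The row data and the helper `selected` are hypothetical; Refine itself does this through its UI rather than through code like this.

```python
# Hypothetical contract rows.
contracts = [
    {"state": "Nevada", "fee": 0.75},
    {"state": "Nevada", "fee": 900.0},
    {"state": "Utah", "fee": 0.5},
]

# Each selected facet choice acts as a predicate on a row.
facets = [
    lambda row: row["fee"] < 1,            # "true" selected in the value < 1 facet
    lambda row: row["state"] == "Nevada",  # "Nevada" selected in the state facet
]


def selected(rows, predicates):
    """Rows that satisfy every active facet or filter."""
    return [row for row in rows if all(p(row) for p in predicates)]


# Transform applies value * 1000 only to the rows currently selected.
for row in selected(contracts, facets):
    row["fee"] = row["fee"] * 1000

fees = [row["fee"] for row in contracts]
```

The Utah row also has a fee below 1 but is left alone, because it fails the second facet.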

Analogies

Databases

If you have programmed databases before (performing SQL queries), then how Google Refine works should be quite familiar to you. Creating filters and facets and selecting something in them is like performing this SELECT statement:

  SELECT * FROM whole_table
  WHERE ... constraints determined by selection in facets and filters ...

And invoking the Transform command on a column while having some filters and facets selected is like performing this UPDATE statement

  UPDATE whole_table SET column_X = ... expression ...
  WHERE ... constraints determined by selection in facets and filters ...

The difference between Google Refine and databases is that the facets show you choices that you can select, whereas databases assume that you already know what’s in the data.
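That difference can be made concrete: a text facet is roughly a list of the distinct values in a column together with their row counts, computed before you have chosen anything. A small Python sketch, with a made-up `text_facet` helper and sample rows:

```python
from collections import Counter

# Hypothetical rows; note the inconsistent capitalisation in the last one.
rows = [
    {"state": "Nevada"},
    {"state": "Utah"},
    {"state": "Nevada"},
    {"state": "nevada"},
]


def text_facet(rows, column):
    """Distinct values in a column with their row counts, most common first."""
    return Counter(row[column] for row in rows).most_common()


choices = text_facet(rows, "state")
```

Seeing "nevada" listed as its own choice next to "Nevada" is how a facet surfaces a data problem that a hand-written WHERE clause would only find if you already suspected it.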


Zementis partners with R Analytics Vendor Revo


Just got a PR email from Michael Zeller, CEO of Zementis, announcing that Zementis (ADAPA) and Revolution Analytics have just partnered up.

Is this something substantial, just time-sharing (http://bi.cbronline.com/news/sas-ceo-says-cep-open-source-and-cloud-bi-have-limited-appeal), or a Barney partnership (http://www.dbms2.com/2008/05/08/database-blades-are-not-what-they-used-to-be/)?

Summary: that's cloud-based scoring of models on EC2 (Zementis) partnering with the actual modeling software in R (Revolution Analytics RevoDeployR).

See previous interviews with both Zementis (https://decisionstats.com/2009/02/03/interview-michael-zeller-ceozementis/, https://decisionstats.com/2009/05/07/interview-ron-ramos-zementis/ and https://decisionstats.com/2009/10/05/interview-michael-zellerceo-zementis-on-pmml/)

and the Revolution guys (https://decisionstats.com/2010/08/03/q-a-with-david-smith-revolution-analytics/ and https://decisionstats.com/2009/05/29/interview-david-smith-revolution-computing/).

–

strategic partnership with Revolution Analytics, the leading commercial provider of software and support for the popular open source R statistics language. With this partnership, predictive models developed on Revolution R Enterprise are now accessible for real-time scoring through the ADAPA Decisioning Engine by Zementis.

ADAPA is an extremely fast and scalable predictive platform. Models deployed in ADAPA are automatically available for execution in real-time and batch-mode as Web Services. ADAPA allows Revolution R Enterprise to leverage the Predictive Model Markup Language (PMML) for better decision management. With PMML, models built in R can be used in a wide variety of real-world scenarios without requiring laborious or expensive proprietary processes to convert them into applications capable of running on an execution system.
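To make the PMML point concrete: the model travels as an XML document, so any system that can read the document can score new records without the original R code. The sketch below is a deliberately simplified, namespace-free fragment in the style of a PMML regression model, with made-up field names and coefficients. A real PMML file also carries a data dictionary and mining schema, and this is not how ADAPA itself is implemented.

```python
import xml.etree.ElementTree as ET

# Simplified PMML-style fragment (hypothetical model, fields and coefficients).
PMML_DOC = """
<PMML version="4.0">
  <RegressionModel functionName="regression">
    <RegressionTable intercept="1.5">
      <NumericPredictor name="age" coefficient="0.02"/>
      <NumericPredictor name="prior_visits" coefficient="0.5"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
"""


def score(pmml_text, record):
    """Evaluate the linear regression table against one input record."""
    table = ET.fromstring(pmml_text).find("./RegressionModel/RegressionTable")
    total = float(table.get("intercept"))
    for pred in table.findall("NumericPredictor"):
        exponent = float(pred.get("exponent", "1"))  # exponent defaults to 1
        value = record[pred.get("name")]
        total += float(pred.get("coefficient")) * value ** exponent
    return total


result = score(PMML_DOC, {"age": 50, "prior_visits": 2})
```

The scoring side needs only the document and the input record, which is the decoupling between model building and deployment that the press release is describing.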

“By partnering with Zementis, Revolution Analytics is building an end-to-end solution for moving enterprise-level predictive R models into the execution environment,” said Jeff Erhardt, Revolution Analytics Chief Operation Officer. “With Zementis, we are eliminating the need to take R applications apart and recode, retest and redeploy them in order to obtain desirable results.”

Got demo?

Yes, we do! Revolution Analytics and Zementis have put together a demo which combines the building of models in R with automatic deployment and execution in ADAPA. It uses Revolution Analytics’ RevoDeployR, a new Web Services framework that allows for data analysts working in R to publish R scripts to a server-based installation of Revolution R Enterprise.

Action Items:

Try our INTERACTIVE DEMO

DOWNLOAD the white paper

Try the ADAPA FREE TRIAL

RevoDeployR & ADAPA allow for real-time analysis and predictions from R to be effectively used by existing Excel spreadsheets, BI dashboards and Web-based applications, all in real-time.

Predictive analytics with RevoDeployR from Revolution Analytics and ADAPA from Zementis put model building and real-time scoring into a league of their own. Seriously!


Using R from other Software

Bridge to R for WPS

http://www.minequest.com/Bridge2R.html

SAS/IML Interface to R

http://www.sas.com/technologies/analytics/statistics/iml/index.html


RapidMiner Extension to R

https://rapid-i.com/content/view/202/206/lang,en/#r


IBM SPSS plugin for R

https://www.spss.com/software/statistics/developer/

and

https://www.spss.com/devcentral/index.cfm?pg=rresources

Tutorial-

https://sites.google.com/site/r4statistics/running-r-from-spss

http://rwiki.sciviews.org/doku.php?id=tips:callingr:spss


Knime

http://www.knime.org/downloads/extensions


Oracle Data Miner

http://www.oracle.com/technetwork/database/options/odm/odm-r-integration-089013.html


JMP

http://jmp.com/software/jmp9/keyfeatures.shtml

and

http://www.jmp.com/applications/analytical_apps/

Tutorial

http://blogs.sas.com/jmp/index.php?/archives/298-JMP-Into-R!.html


PSPP – SPSS's Open Source Counterpart

New website for Windows installers for PSPP: try it in your own time if you are dedicated to either SPSS or free statistical computing.

http://pspp.awardspace.com/

This page is intended to give a stable root for downloading the PSPP-for-Windows setup from free mirrors.

Highlights of the current PSPP-for-Windows setup

PSPP info:

Current version:	Master version = 0.7.6
Release date:	See filenames
Information about PSPP:	http://www.gnu.org/software/pspp
PSPP Manual:	PDF or HTML (current version will be installed on your PC by the installer package)

Package info:

Windows version:	Windows XP and newer
Package Size:	15 MB
Size on disk:	34 MB
Technical:	MinGW based Cross-compiled on openSUSE 11.3

Downloads:
There are issues with the latest build. Some users report crashes on their systems, while on other systems it works fine.

Version	Multi-user installer (administrator privileges required; recommended)	Single-user installer (no administrator privileges required)
0.7.6-g38ba1e-blp-build20101116	PSPP-Master-2010-11-16	PSPP-Master-single-user-2010-11-16
0.7.5-g805e7e-blp-build20100908	PSPP-Master-2010-09-08	PSPP-Master-single-user-2010-09-08
0.7.5-g7803d3-blp-build20100820	PSPP-Master-2010-08-20	PSPP-Master-single-user-2010-08-20
0.7.5-g333ac4-blp-build20100727	PSPP-Master-2010-07-27	PSPP-Master-single-user-2010-07-27

Sources can be found here.

Also see http://en.wikipedia.org/wiki/PSPP

At the user’s choice, statistical output and graphics are done in ASCII, PDF, PostScript or HTML formats. A limited range of statistical graphs can be produced, such as histograms, pie-charts and np-charts.

PSPP can import Gnumeric, OpenDocument and Excel spreadsheets, Postgres databases, comma-separated values and ASCII files. It can export files in the SPSS ‘portable’ and ‘system’ file formats and to ASCII files. Some of the libraries used by PSPP can be accessed programmatically; PSPP-Perl provides an interface to the libraries used by PSPP.

and

http://www.gnu.org/software/pspp/

A brief list of some of the features of PSPP follows:

Supports over 1 billion cases.
Supports over 1 billion variables.
Syntax and data files are compatible with SPSS.
Choice of terminal or graphical user interface.
Choice of text, postscript or html output formats.
Inter-operates with Gnumeric, OpenOffice.Org and other free software.
Easy data import from spreadsheets, text files and database sources.
Fast statistical procedures, even on very large data sets.
No license fees.
No expiration period.
No unethical “end user license agreements”.
Fully indexed user manual.
Free Software; licensed under GPLv3 or later.
Cross platform; Runs on many different computers and many different operating systems.


R Node and other Web Interfaces to R


R Node is a great web interface to R.

http://squirelove.net/r-node/doku.php

Features

Access to a R server backend via a web browser UI
The web browser UI works in all modern browsers, including IE 7 and 8 (excluding SVG based graphs).
Username/password login (both from the browser to the R-Node server, and from the R-Node server to Rserve and R).
Per-user R sessions. Each user can have their own R workspace, or they can share.
Support for most R commands that perform statistical analysis and provide textual feedback.
Support for most standard R commands that provide graphical feedback via server side generation of the graphs. Some graphs (e.g. plot()) can be plotted via SVG client-side as well.
Downloading of generated graphs.
Accessing R help files using help() and ? commands (Note R v2.10 altered how help is provided, so this currently is not working in R v2.10)
Uploading files to work with their data in R.

Many commands will work. Try a command; if it does not work, use the feedback button in the application to let us know.

Limitations

Various R functions are not supported. These include:
- Installation of new R packages.
- Searching of help via ??.
- Example calls (via example()).

Of course, other web interfaces to R are listed at http://cran.r-project.org/doc/FAQ/R-FAQ.html#R-Web-Interfaces

The first, and now not so up to date, is Rweb: Web-based Statistical Analysis (last modified 25-Jun-1999; JSS paper at http://www.jstatsoft.org/v04/i01/)

R-Online https://user.cs.tu-berlin.de/~ulfi/cgi-bin/r-online/r-online.cgi (the official FAQ seems outdated)

Rcgi (it is not clear from the official FAQ whether the project is still active) http://www.ms.uky.edu/~statweb/testR3.html

Rphp

http://dssm.unipa.it/R-php/

RWui

http://sysbio.mrc-bsu.cam.ac.uk/Rwui/

R.Rsp

http://cran.r-project.org/web/packages/R.rsp/index.html

RServe

http://www.rforge.net/Rserve/

http://www.rforge.net/doc/packages/Rserve/00Index.html

RPad

http://rpad.googlecode.com/svn-history/r76/Rpad_homepage/index.html

CGIwithR

JSS paper Citation. CGIwithR: Facilities for processing Web forms using R. Journal of Statistical Software, 8(10), pp. 1-8, 2003.

A lecture on aspects of using CGI

R Apache

http://biostat.mc.vanderbilt.edu/rapache/

Open Infrastructure for Outcomes with a live reporting module using RSessionDA
Free statistics software– Wessa server using R (see http://www.wessa.net/rwasp_arimaforecasting.wasp)

Wessa, P. (2011), Free Statistics Software, Office for Research Development and Education,
version 1.1.23-r6, URL http://www.wessa.net/

You can even see my results here

http://www.freestatistics.org/blog/index.php?v=date/2011/Feb/14/t12976948805e8vh3v1e680a0z.htm/

An impressive implementation of time series analysis based on R and Javascript. This web server creates separate browser windows for data entry, graphics, and procedure selection. These windows are dynamic. For example, after entering data there is no submit button to submit the data. The procedure selection window is used to start the analysis, which uses the current values in the data window.

http://alpha1.ism.ac.jp/inets2/parent_sample.html?

Online multivariate analysis and graphical displays from PBIL, Lyon

http://pbil.univ-lyon1.fr/Rweb/

An R web server for robust rank-based linear models

http://www.stat.wmich.edu/slab/RGLM/ (impressive except for the red font)

gWidgetsWWW

http://www.math.csi.cuny.edu/gWidgetsWWWrun/ex-index.R

To make an interactive GUI in gWidgets can be as easy as creating the following script:

  w <- gwindow("simple interactive GUI with one button", visible=FALSE)
  g <- ggroup(cont=w)
  b <- gbutton("click me", cont=g, handler=function(h,...) {
    gmessage("hello world", parent=b)
  })
  visible(w) <- TRUE

A big and slightly outdated resource page (which I used to hunt down some of these resources)

http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/StatCompCourse

AND

The famous site at http://www.yeroon.net/ggplot2/ (but no sharing of this site’s source code, sigh!)

That's all for now, but watch this space: it's exciting (to watch AND code).

Code Enhancers for R

This page lists code editors (or IDEs)

https://rforanalytics.wordpress.com/code-enhancers-for-r/

Graphical User Interfaces for R

https://rforanalytics.wordpress.com/graphical-user-interfaces-for-r/

ODBC /Databases for R

https://rforanalytics.wordpress.com/odbc-databases-for-r/


Protovis a graphical toolkit for visualization

I just found out about a new data visualization tool called Protovis http://vis.stanford.edu/protovis/ex/

Protovis composes custom views of data with simple marks such as bars and dots. Unlike low-level graphics libraries that quickly become tedious for visualization, Protovis defines marks through dynamic properties that encode data, allowing inheritance, scales and layouts to simplify construction.
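Protovis itself is a JavaScript library, so the following is only a Python analogy of the "marks with dynamic properties" idea, not Protovis's API. The function `bar_marks` and its parameters are invented for illustration: each visual property is a function of the datum and its index, and one SVG rectangle is emitted per datum.

```python
def bar_marks(data, height, left, bottom=0, width=20, canvas_h=150):
    """Build one SVG <rect> per datum.

    `height` and `left` are functions of (datum, index), so the data
    drives the geometry instead of hand-written drawing calls.
    """
    rects = []
    for i, d in enumerate(data):
        h = height(d, i)
        rects.append(
            '<rect x="{x}" y="{y}" width="{w}" height="{h}"/>'.format(
                x=left(d, i), y=canvas_h - bottom - h, w=width, h=h
            )
        )
    return '<svg xmlns="http://www.w3.org/2000/svg" width="150" height="{}">{}</svg>'.format(
        canvas_h, "".join(rects)
    )


svg = bar_marks(
    [1, 1.2, 1.7, 1.5, 0.7],
    height=lambda d, i: d * 80,  # data value drives bar height
    left=lambda d, i: i * 25,    # position in the array drives x offset
)
```

Changing the chart means changing the property functions or the data, not rewriting a drawing loop, which is the contrast with low-level graphics libraries that the description draws.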

Protovis is free and open-source and is a Stanford project. It has been used in the web interface R Node (which I will talk about later)

http://squirelove.net/r-node/doku.php

Conventional

While Protovis is designed for custom visualization, it is still easy to create many standard chart types. These simpler examples serve as an introduction to the language, demonstrating key abstractions such as quantitative and ordinal scales, while hinting at more advanced features, including stack layout.

Custom

Many charting libraries provide stock chart designs, but offer only limited customization; Protovis excels at custom visualization design through a concise representation and precise control over graphical marks. These examples, including a few recreations of unusual historical designs, demonstrate the language’s expressiveness.

Try Protovis today 🙂 http://vis.stanford.edu/protovis/

It uses JavaScript and SVG for web-native visualizations; no plugin required (though you will need a modern web browser)! Although programming experience is helpful, Protovis is mostly declarative and designed to be learned by example.

