Building a Regression Model in R – Use #Rstats

One of the most common uses of statistical software is building models, particularly logistic regression models of propensity in the marketing of goods and services.

 

If building models is what you do, here is a brief, easy essay on how to build a model in R.

1) Packages to be used-

For smaller datasets, use these:

  1. car package http://cran.r-project.org/web/packages/car/index.html
  2. gvlma package http://cran.r-project.org/web/packages/gvlma/index.html
  3. ROCR package http://rocr.bioinf.mpi-sb.mpg.de/
  4. relaimpo package
  5. DAAG package
  6. MASS package
  7. bootstrap package
  8. leaps package

Also see the rms package: http://cran.r-project.org/web/packages/rms/index.html

rms works with almost any regression model, but it was especially written to work with binary or ordinal logistic regression, Cox regression, accelerated failure time models, ordinary linear models, the Buckley-James model, generalized least squares for serially or spatially correlated observations, generalized linear models, and quantile regression.

For bigger datasets, also see the biglm (http://cran.r-project.org/web/packages/biglm/index.html) and RevoScaleR packages.

http://www.revolutionanalytics.com/products/enterprise-big-data.php

2) Syntax

  1. outp = lm(y ~ x1 + x2 + xn, data = dataset) Model equation
  2. summary(outp) Model summary
  3. par(mfrow = c(2, 2)); plot(outp) Model diagnostic graphs
  4. vif(outp) Multicollinearity (car package)
  5. gvlma(outp) Global validation of linear model assumptions, including heteroscedasticity (gvlma package)
  6. outlierTest(outp) Outliers (car package)
  7. predict(outp) Scoring the dataset with fitted values
  8. anova(outp) Analysis of variance table
  9. predict(lm.result, data.frame(conc = newconc), level = 0.9, interval = "confidence") Scoring new data with 90% confidence intervals
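The steps in the list above can be run end to end as a minimal sketch. The built-in mtcars data stands in for `dataset`, and mpg, wt and hp stand in for y, x1 and x2; these choices are assumptions for illustration, not part of the original list. The car diagnostics run only if that package is installed.

```r
# Illustrative only: mtcars replaces `dataset`; mpg ~ wt + hp replaces y ~ x1 + x2.
outp <- lm(mpg ~ wt + hp, data = mtcars)    # model equation
print(summary(outp))                         # coefficients, R-squared, F-test

# Diagnostic plots in a 2 x 2 grid; only drawn in an interactive session.
if (interactive()) {
  old_par <- par(mfrow = c(2, 2))
  plot(outp)
  par(old_par)                               # restore the previous layout
}

print(anova(outp))                           # sequential analysis of variance table
scores <- fitted(outp)                       # fitted values for the training rows

# Optional diagnostics from the car package, skipped if it is not installed.
if (requireNamespace("car", quietly = TRUE)) {
  print(car::vif(outp))                      # variance inflation factors
  print(car::outlierTest(outp))              # Bonferroni outlier test
}

# Score two hypothetical new cars with 90% confidence intervals.
newcars <- data.frame(wt = c(2.5, 3.5), hp = c(100, 150))
ci <- predict(outp, newdata = newcars, level = 0.9, interval = "confidence")
print(ci)
```

The result `ci` is a matrix with one row per new observation and the columns fit, lwr and upr.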

 

For a reference card (cheat sheet), see

http://cran.r-project.org/doc/contrib/Ricci-refcard-regression.pdf
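The introduction mentions logistic regression propensity models, and the same workflow carries over from lm() to glm(). The sketch below is only an illustration: the 0/1 column `am` in mtcars stands in for a response flag, wt and hp stand in for customer attributes, and the small `auc` helper is a hypothetical base R stand-in for what the ROCR package would compute.

```r
# Illustrative only: `am` plays the role of a responded / did-not-respond flag.
fit <- glm(am ~ wt + hp, data = mtcars, family = binomial)
print(summary(fit))

# Propensity scores: predicted probabilities between 0 and 1.
propensity <- predict(fit, type = "response")

# Hypothetical helper: area under the ROC curve via the rank-sum
# (Mann-Whitney) identity, using base R only.
auc <- function(score, label) {
  r  <- rank(score)
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

model_auc <- auc(propensity, mtcars$am)
print(model_auc)
```

New records would be scored the same way with predict(fit, newdata = ..., type = "response").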

3) Also read-

http://cran.r-project.org/web/views/Econometrics.html

http://cran.r-project.org/web/views/Robust.html

 

Augustus- a PMML model producer and consumer. Scoring engine.


I just checked out this new software for making PMML models. It is called Augustus and is created by the Open Data Group (http://opendatagroup.com/), which is headed by Robert Grossman, who was the first proponent of using R on Amazon EC2.

Probably someone like Zementis (http://adapasupport.zementis.com/) can use this to further test, enhance or benchmark on EC2. They did have a joint webinar with Revolution Analytics recently.

https://code.google.com/p/augustus/

Recent News

  • Augustus v 0.4.3.1 has been released
  • Added a guide (pdf) for including Augustus in the Windows System Properties.
  • Updated the install documentation.
  • Augustus 2010.II (Summer) release is available. This is v 0.4.2.0. More information is here.
  • Added performance discussion concerning the optional cyclic garbage collection.

See Recent News for more details and all recent news.

Augustus

Augustus is a PMML 4-compliant scoring engine that works with segmented models. Augustus is designed for use with statistical and data mining models. The new release provides Baseline, Tree and Naive-Bayes producers and consumers.

There is also a version for use with PMML 3 models. It is able to produce and consume models with 10,000s of segments and conforms to a PMML draft RFC for segmented models and ensembles of models. It supports Baseline, Regression, Tree and Naive-Bayes.

Augustus is written in Python and is freely available under the GNU General Public License, version 2.

See the page Which version is right for me for more details regarding the different versions.

PMML

Predictive Model Markup Language (PMML) is an XML mark up language to describe statistical and data mining models. PMML describes the inputs to data mining models, the transformations used to prepare data for data mining, and the parameters which define the models themselves. It is used for a wide variety of applications, including applications in finance, e-business, direct marketing, manufacturing, and defense. PMML is often used so that systems which create statistical and data mining models (“PMML Producers”) can easily inter-operate with systems which deploy PMML models for scoring or other operational purposes (“PMML Consumers”).
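To make the "PMML producer" idea concrete on the R side, here is a minimal sketch of exporting a fitted model as PMML. It assumes the pmml and XML packages from CRAN are installed (neither is mentioned above), and the helper name `export_pmml` and the output file name are hypothetical. If the packages are missing, the helper skips the export.

```r
# Illustrative model; mtcars is a stand-in dataset.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Hypothetical helper: write `model` as a PMML file at `path`.
# Returns the path on success, or NULL if the packages are unavailable.
export_pmml <- function(model, path) {
  if (!requireNamespace("pmml", quietly = TRUE) ||
      !requireNamespace("XML", quietly = TRUE)) {
    message("pmml/XML packages not installed; skipping export")
    return(invisible(NULL))
  }
  doc <- pmml::pmml(model)         # build the PMML document as an XML node
  XML::saveXML(doc, file = path)   # write the XML to disk
  invisible(path)
}

out <- export_pmml(fit, file.path(tempdir(), "mtcars_lm.pmml"))
```

A PMML consumer such as a scoring engine would then read the resulting file. Whether a given consumer accepts the PMML version the package emits is something to check, not something this sketch guarantees.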

Change Detection using Augustus

For information regarding using Augustus with Change Detection and Health and Status Monitoring, please see change-detection.

Open Data

Open Data Group provides management consulting services, outsourced analytical services, analytic staffing, and expert witnesses broadly related to data and analytics. It has experience with customer data, supplier data, financial and trading data, and data from internal business processes.

It has staff in Chicago and San Francisco and clients throughout the U.S. Open Data Group began operations in 2002.


Overview

The above example contains plots generated in R of scoring results from Augustus. Each point on the graph represents a use of the scoring engine and a chart is an aggregation of multiple Augustus runs. A Baseline (Change Detection) model was used to score data with multiple segments.

Typical Use

Augustus is typically used to construct models and score data with models. Augustus includes a dedicated application for creating, or producing, predictive models rendered as PMML-compliant files. Scoring is accomplished by consuming PMML-compliant files describing an appropriate model. Augustus provides a dedicated application for scoring data with four classes of models: Baseline (Change Detection) Models, Tree Models, Regression Models and Naive Bayes Models. The typical model development and use cycle with Augustus is as follows:

  1. Identify suitable data with which to construct a new model.
  2. Provide a model schema which prescribes the requirements for the model.
  3. Run the Augustus producer to obtain a new model.
  4. Run the Augustus consumer on new data to effect scoring.
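The produce-then-consume cycle above can be mimicked in plain R as a rough analogy. This is not Augustus and does not use PMML: the fitted model is serialized with saveRDS() instead, and the function names `produce_model` and `consume_model`, the data split and the model formula are all hypothetical.

```r
# "Producer": fit a model on training data and persist it to a file.
produce_model <- function(train, path) {
  fit <- lm(mpg ~ wt + hp, data = train)
  saveRDS(fit, path)
  invisible(path)
}

# "Consumer": load the persisted model and score new rows with it.
consume_model <- function(path, newdata) {
  fit <- readRDS(path)
  cbind(newdata, score = predict(fit, newdata = newdata))
}

train <- mtcars[1:24, ]                   # step 1: data to build the model
fresh <- mtcars[25:32, c("wt", "hp")]     # new data to be scored

model_path <- produce_model(train, tempfile(fileext = ".rds"))  # step 3
scored <- consume_model(model_path, fresh)                      # step 4
print(scored)
```

The point of the split is the same as with PMML: the scoring side needs only the model file and the new data, not the training data or the fitting code.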

Separate consumer and producer applications are supplied for Baseline (Change Detection) models, Tree models, Regression models and for Naive Bayes models. The producer and consumer applications require configuration with XML-formatted files. The specification of the configuration files and model schema are detailed below. The consumers provide for some configurability of the output but users will often provide additional post-processing to render the output according to their needs. A variety of mechanisms exist for transmitting data but users may need to provide their own preprocessing to accommodate their particular data source.

In addition to the producer and consumer applications, Augustus is conceptually structured and provided with libraries which are relevant to the development and use of Predictive Models. Broadly speaking, these consist of components that address the use of PMML and components that are specific to Augustus.

Post Processing

Augustus can accommodate a post-processing step. While not necessary, it is often useful to

  • Re-normalize the scoring results or perform an additional transformation.
  • Supplement the results with global meta-data such as timestamps.
  • Format the results.
  • Select certain interesting values from the results.
  • Restructure the data for use with other applications.
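The post-processing steps listed above can be sketched in R on a small table of scores. The column names, the min-max normalization, the 0.5 threshold and the function name `post_process` are assumptions chosen for illustration; the actual layout of Augustus output is not shown here.

```r
# Hypothetical scoring output: one row per scored record.
raw <- data.frame(segment = c("a", "a", "b", "b"),
                  score   = c(2, 4, 1, 3))

post_process <- function(scores, threshold = 0.5) {
  rng <- range(scores$score)
  stopifnot(rng[2] > rng[1])               # min-max scaling needs some spread

  # Re-normalize the scores to the 0..1 range.
  scores$score_norm <- (scores$score - rng[1]) / (rng[2] - rng[1])
  # Supplement with global meta-data (a timestamp).
  scores$scored_at <- format(Sys.time(), "%Y-%m-%d %H:%M:%S")
  # Format the results for display.
  scores$score_fmt <- sprintf("%.2f", scores$score_norm)
  # Select the interesting values.
  interesting <- scores[scores$score_norm >= threshold, ]
  # Restructure for another application: mean normalized score per segment.
  by_segment <- aggregate(score_norm ~ segment, data = interesting, FUN = mean)

  list(rows = scores, interesting = interesting, by_segment = by_segment)
}

result <- post_process(raw)
print(result$by_segment)
```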

Redlining in Internet Access and notes on Regression Models

This is the definition of redlining (citation: the ad-free Wikipedia):

Redlining is the practice of denying, or increasing the cost of, services such as banking, insurance, access to jobs,[2] access to health care,[3] or even supermarkets[4] to residents in certain, often racially determined,[5] areas. The term “redlining” was coined in the late 1960s by community activists in Chicago.[citation needed] It describes the practice of marking a red line on a map to delineate the area where banks would not invest; later the term was applied to discrimination against a particular group of people (usually by race or sex) no matter the geography.

As of today, redlining in financial services is outlawed by fair lending laws such as the Equal Credit Opportunity Act and the Fair Housing Act, which in practice prohibit using variables in regression models that end up redlining districts. However, as recently as 2005, redlining was used in auto insurance through suitably disguised zip9 variables (I carried data for 55 million American citizens and 88 million accounts for a major North American automotive insurance provider as part of an offshoring contract from Atlanta, GA in 2005).

Redlining exists today through informal arrangements between internet service providers, who carve up territories and districts. Internet access redlining is still not illegal. This is especially true in Austin (I traveled there as a consultant last year) and Knoxville, Tennessee, where I still study as a grad student.

Nor are the suitably proprietary insurance and health care claim denial models used for minimizing litigation risk illegal. Litigation risk minimization is the next level of retail logistic regression modeling, much like the predictive modeling used by political consultants during elections.