Another Cool and By Now Quite Familiar Widget (example bloglines etc)
Note this is an example only. Decision Stats site is non commercial.
But if you slog and blog, earn some more peanuts here.
Enjoy!!
More fun than plain ol' Adsense… say what?
Creating segmented models in SAS is quite easy with BY-group processing. It is less easy in other software, but that is understandable given that
the generic rules of segmentation are:
1) each segment has statistically similar characteristics, and
2) different segments have statistically different characteristics.

This means that just using Proc Freq to check the response rate against an independent variable is not a good way to check the level of difference. Proc Univariate with the plot option and BY-group processing is actually a better way to test this, because it combines means and medians (measures of central tendency) with box plots, normal probability plots and standard deviations (measures of dispersion).
Proc Freq with a cross-tab is incredibly powerful for deciding whether to create a model in the first place, but fine-tuning decisions on segments is better done with Proc Univariate. The SAS equivalent for clustering of course remains Proc Fastclus and family, which will be dealt with in a separate post.
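To make that concrete, here is a minimal sketch of the checks described above. The dataset work.customers and the variables segment, response and spend are hypothetical names used only for illustration.

proc sort data=work.customers;
   by segment;                     /* BY-group processing needs sorted data */
run;

/* Cross-tab of response rate by segment - a first check on whether
   a segmented model is worth building at all */
proc freq data=work.customers;
   tables segment*response / chisq;
run;

/* Means, medians, standard deviations and box plots per segment -
   compares both central tendency and dispersion across segments */
proc univariate data=work.customers plot;
   by segment;
   var spend;
run;

If the segments really are different, both the centres and the spreads should separate visibly in the BY-group output.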
(Note: the lovely image that explains the above is from Dr Ariel Shamir's home page; he is a research expert on Visual Succinct Representation of Information, from Israel, land of the brave and intelligent. A picture is truly worth a thousand words (or posts!).)
Logistic regression is a widely used technique in database marketing for creating scoring models and in risk classification. It helps develop propensity-to-buy and propensity-to-default scores (and even propensity-to-fraud scores).
This is more of a practical approach to making the model than a theory-based approach. (I was never good at the theory 😉 )
If you need to do logistic regression using SPSS, a very good tutorial is available here:
http://www2.chass.ncsu.edu/garson/PA765/logistic.htm
(Note: Copyright 1998, 2008 by G. David Garson. Last update 5/21/08.)

For SAS, a very good tutorial is here:
SAS Annotated Output: Ordered Logistic Regression. UCLA: Academic Technology Services, Statistical Consulting Group, http://www.ats.ucla.edu/stat/sas/output/sas_ologit_output.htm (accessed July 23, 2007).
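In SAS itself the basic call is short. Here is a minimal sketch of a propensity-to-buy score; the dataset work.campaign, the 0/1 response buy_flag and the predictors age, income and tenure are hypothetical names.

/* Minimal sketch only: dataset and variable names are made up for illustration */
proc logistic data=work.campaign descending;
   model buy_flag = age income tenure;   /* descending so P(buy_flag=1) is modeled */
   output out=work.scored p=prob_buy;    /* propensity-to-buy score for each record */
run;

The OUTPUT statement writes the predicted probability for each record, which is the score you would then use for ranking and deciling.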
For R, the documentation (note: still searching for a good R logistic regression tutorial) is here:
http://lib.stat.cmu.edu/S/Harrell/help/Design/html/lrm.html
lrm(formula, data, subset, na.action=na.delete, method="lrm.fit", model=FALSE, x=FALSE, y=FALSE, linear.predictors=TRUE, se.fit=FALSE, penalty=0, penalty.matrix, tol=1e-7, strata.penalty=0, var.penalty=c('simple','sandwich'), weights, normwt, ...)
For linear models in R, see:
http://datamining.togaware.com/survivor/Linear_Model0.html
If you want to work with R and do not have time to learn it, an extremely good approach is to use the GUI
rattle and look at this book:
http://datamining.togaware.com/survivor/Contents.html

Here is a SAS program to help you beat others at Sudoku, and impress people. It was written by a chap named Ryan Howard in 2006, and I am thankful to him for allowing me to share it. Let us know if you find a puzzle it could not solve, or if you tweak the program a bit. The code is pasted below.
Have fun!

The SAS paper on this was at SAS Global Forum 2007; the resulting
paper, "SAS and Sudoku", was written by Richard DeVenezia, John Gerlach,
Larry Hoyle, Talbot Katz and Rick Langston, and can be viewed at
http://www2.sas.com/proceedings/forum2007/011-2007.pdf.
(P.S. I haven't tested this on WPS; they still don't have the SAS macro language. But let me know if you have any equivalent in SPSS or R 🙂 )
*=============================================================================;
* sudoku.sas ;
* Written by: Ryan Howard ;
* Date: Sept. 2006 ;
*-----------------------------------------------------------------------------;
* Summary: This program solves sudoku puzzles consisting of a 9X9 matrix. ;
*-----------------------------------------------------------------------------;
* Upgrade Ideas: 1. Add a GUI to collect the input numbers and display output;
* 2. Expand logic to work for 16X16 matrices ;
*=============================================================================;
title;
options nodate nonumber;
data _null_;
*-----------------------------------------------------------------------------;
* input inital values for each cell from puzzle ;
*-----------------------------------------------------------------------------;
_1111=9; _1112=.; _1113=.; _1211=.; _1212=.; _1213=.; _1311=1; _1312=.; _1313=.;
_1121=5; _1122=.; _1123=.; _1221=.; _1222=6; _1223=.; _1321=.; _1322=4; _1323=2;
_1131=.; _1132=.; _1133=.; _1231=7; _1232=1; _1233=.; _1331=5; _1332=.; _1333=.;
_2111=.; _2112=.; _2113=2; _2211=.; _2212=.; _2213=.; _2311=.; _2312=1; _2313=.;
_2121=.; _2122=3; _2123=.; _2221=.; _2222=.; _2223=.; _2321=2; _2322=9; _2323=.;
_2131=.; _2132=7; _2133=.; _2231=.; _2232=.; _2233=6; _2331=.; _2332=.; _2333=3;
_3111=.; _3112=2; _3113=.; _3211=.; _3212=.; _3213=8; _3311=.; _3312=.; _3313=.;
_3121=.; _3122=.; _3123=4; _3221=5; _3222=.; _3223=.; _3321=.; _3322=.; _3323=.;
_3131=.; _3132=.; _3133=.; _3231=.; _3232=3; _3233=.; _3331=8; _3332=.; _3333=9;
%macro printmatrix;
*----------------------------------------------------------------;
* print the result matrix ;
*----------------------------------------------------------------;
*---------------------------------------------------;
* Assign column positions for printing matrix ;
*---------------------------------------------------;
c1=1;
c2=10;
c3=20;
* ... (the rest of the program continues in the full post "SAS Fun: Sudoku") ;
A special thanks to Peter Flom (www.peterflom.com) for suggesting the following:
5) Proc NLMIXED
PROC NLMIXED can be viewed as a generalization of the random coefficient models fit by the MIXED procedure. This generalization allows the random coefficients to enter the model nonlinearly, whereas in PROC MIXED they enter linearly. With PROC MIXED you can perform both maximum likelihood and restricted maximum likelihood (REML) estimation, whereas PROC NLMIXED only implements maximum likelihood. This is because the analog to the REML method in PROC NLMIXED would involve a high dimensional integral over all of the fixed-effects parameters, and this integral is typically not available in closed form. Finally, PROC MIXED assumes the data to be normally distributed, whereas PROC NLMIXED enables you to analyze data that are normal, binomial, or Poisson or that have any likelihood programmable with SAS statements.
http://aerg.canberra.edu.au/envirostats/bm/SASHelp/stat/chap46/sect4.htm
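As a rough illustration of the syntax (a sketch, not taken from the documentation above), here is a random-intercept logistic model in PROC NLMIXED; the dataset work.panel, the binary outcome y, the covariate x and the subject identifier id are hypothetical names.

/* Minimal sketch: random-intercept logistic model fit by maximum likelihood */
proc nlmixed data=work.panel;
   parms b0=0 b1=0 s2u=1;                /* starting values for the parameters */
   eta = b0 + b1*x + u;                  /* random intercept enters the linear predictor */
   p   = 1 / (1 + exp(-eta));
   model y ~ binary(p);
   random u ~ normal(0, s2u) subject=id; /* normal random effect per subject */
run;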
6) Proc Glimmix
PROC GLIMMIX fits statistical models to data with correlations or nonconstant variability and where the response is not necessarily normally distributed. These generalized linear mixed models (GLMM), like linear mixed models, assume normal (Gaussian) random effects. Conditional on these random effects, data can have any distribution in the exponential family. The binary, binomial, Poisson, and negative binomial distributions, for example, are discrete members of this family. The normal, beta, gamma, and chi-square distributions are representatives of the continuous distributions in this family.
The links below describe the full set of PROC GLIMMIX features, including the additional features in the production release.
www2.sas.com/proceedings/sugi30/196-30.pdf
http://support.sas.com/rnd/app/da/glimmix.html
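For a feel of the syntax, here is a minimal sketch; the dataset work.trials, the binary outcome y, the treatment trt and the clustering variable clinic are hypothetical names.

/* Minimal sketch: binary GLMM with a Gaussian random intercept per clinic */
proc glimmix data=work.trials;
   class trt clinic;
   model y = trt / dist=binary link=logit solution;
   random intercept / subject=clinic;
run;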
3) Proc QUANTREG
www.stat.uiuc.edu/~x-he/ENAR-Tutorial.pdf
Ordinary least squares regression models the relationship between one or more covariates X and the conditional mean of the response variable Y given X=x. Quantile regression extends the regression model to conditional quantiles of the response variable, such as the 90th percentile. Quantile regression is particularly useful when the rate of change in the conditional quantile, expressed by the regression coefficients, depends on the quantile. The main advantage of quantile regression over least squares regression is its flexibility for modeling data with heterogeneous conditional distributions. Data of this type occur in many fields, including biomedicine, econometrics, and ecology.
The SAS/STAT documentation lists the full set of PROC QUANTREG features.
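A minimal sketch of the call, assuming a hypothetical dataset work.claims with a response cost and covariates age and bmi:

/* Minimal sketch: fit the median and the 90th percentile */
proc quantreg data=work.claims;
   model cost = age bmi / quantile=0.5 0.9;
run;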
4) Proc Catmod-
http://www.uidaho.edu/ag/statprog/sas/workshops/catmod/outline.html
Categorical data with more than two factors are referred to as multi-dimensional distributions. PROC CATMOD will be used for analyses concerning such data. PROC CATMOD may also be used to analyze one- and two-way data structures; however, it is most effective as a means of approaching more complex data structures.
PROC CATMOD utilizes a different technique for categorical analysis than the 'Pearson type' chi-square. The analysis is based on a transformation of the cell probabilities, called the response function. The exact form of the response function depends on the data type and is normally motivated by certain theoretical considerations. SAS offers many different forms of response functions and even allows the user to specify their own; however, the most common (default) is the Generalized Logit. This function is defined as:
Generalized Logit = LOG(pi/pk),
where pi is the ith cell probability and pk is the last cell probability. The ratio pi/pk is the odds of the ith category relative to the last, and the log of this ratio compares the ith category to the last on a log scale. The logit can be rewritten as:
Generalized Logit = LOG(pi) – LOG(pk).
It should be noted that if there are k categories, then there will be only k-1 response functions since the kth one will be zero.
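Here is a minimal sketch of a generalized-logit fit with PROC CATMOD; the dataset work.survey, the categorical response choice and the factors region and gender are hypothetical names.

/* Minimal sketch: ML fit of the default generalized-logit response functions */
proc catmod data=work.survey;
   model choice = region gender;
run;
quit;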
Well, so you want to be a SAS Modeler. Or at least get a job as a junior one, and then learn on the job (we all did). Here are some SAS procs you need to brush up on (a minimal syntax sketch follows the list):
1) Proc Reg – regression for continuous dependent variables (ordinary least squares).
2) Proc Logistic – logistic regression.
3) Proc Probit – probit regression; categorical regressors can also be included.
4) Proc GLM – General Linear Models based on OLS. PROC GLM handles models relating one or several continuous dependent variables to one or several independent variables. The independent variables may be either classification variables, which divide the observations into discrete groups, or continuous variables. Proc GLM is the preferred procedure for doing univariate analysis of variance, multivariate analysis of variance, and most types of regression. (Note: there is also a Proc Anova; see item 8.)
5) Proc Mixed – PROC MIXED was specifically designed to fit mixed effect models; it can model random and mixed effect data. PROC MIXED has three options for the method of estimation: ML (maximum likelihood), REML (restricted or residual maximum likelihood, which is the default), and MIVQUE0 (minimum variance quadratic unbiased estimation). ML and REML are based on a maximum likelihood estimation approach; they require the assumption that the distribution of the dependent variable (error term and the random effects) is normal. ML is just the regular maximum likelihood method, that is, the parameter estimates it produces are the values of the model parameters that maximize the likelihood function. REML is a variant of maximum likelihood estimation; REML estimators are obtained not by maximizing the whole likelihood function, but only the part that is invariant to the fixed effects part of the linear model. In other words, if y = Xb + Zu + e, where Xb is the fixed effects part, Zu is the random effects part and e is the error term, then the REML estimates are obtained by maximizing the likelihood function of K'y, where K is a full rank matrix with columns orthogonal to the columns of the X matrix, that is, K'X = 0.
6) Proc Genmod – PROC GENMOD uses a CLASS statement for specifying categorical (classification) variables, so indicator variables do not have to be constructed in advance, as is the case with, for example, PROC LOGISTIC. Interactions can be fitted by specifying, for example, age*sex. The response variable or the explanatory variables can be character, while PROC LOGISTIC requires explanatory variables to be numeric.
7) Proc Corr – the CORR procedure computes correlation coefficients between variables. It can also produce covariances.
8) Proc Anova – PROC ANOVA handles only balanced ANOVA designs.
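As promised above, here is a minimal syntax sketch for a few of these procs; the dataset work.modeling, the response sales, the continuous predictors price and adspend, and the class variables region and store are all hypothetical names used only to show the calling pattern.

proc reg data=work.modeling;                 /* 1) OLS regression */
   model sales = price adspend;
run; quit;

proc glm data=work.modeling;                 /* 4) GLM with a classification variable */
   class region;
   model sales = region price / solution;
run; quit;

proc mixed data=work.modeling method=reml;   /* 5) mixed model, REML (the default method) */
   class region store;
   model sales = region price / solution;    /* fixed effects */
   random intercept / subject=store;         /* random store intercept */
run;

proc corr data=work.modeling cov;            /* 7) correlations and covariances */
   var sales price adspend;
run;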
Required reading – http://en.wikipedia.org/wiki/Regression_analysis and the SAS Online Doc.
Some resources for getting the PMP certification (based on a LinkedIn question). This is a useful, not too expensive and not very tough certification for professionals who manage projects (and don't we all!).
Online Websites-
Providers- http://tel.occe.ou.edu/cgi-bin/PMI_Provider/repsearch.cgi
The main website – http://www.pmi.org/Pages/default.aspx
Credentials – http://www.pmi.org/CareerDevelopment/Pages/Obtaining-Credential.aspx
Some white papers – http://www.globalknowledge.com/training/whitepaperlist.asp?pageid=502&wpcat=7&sort=&country=United+States
An additional book – http://www.amazon.com/PMP-Exam-Prep-Fifth-Passing/dp/1932735003
The main book – PMBOK

The detailed answers on the LinkedIn site are much more helpful – try it:
http://www.linkedin.com/answers/management/business-analytics/MGM_ANA/207457-6691344