10 Steps to create a Model

Models predict specific events in specific time frames in response to stimuli, for example:

1) How many people will buy credit cards if we mail them

2) How many people will buy credit cards if we call them

3) How many people are likely to buy a credit card if we target only 50% of the total population

Modeling consists of the following steps:

1) Select a pre-event, e.g. a previous credit card campaign.

2) Divide the data into a building sample, a validation sample, and a third (holdout) sample. Sampling should be random.

3) Remove outliers with capping (maximum, minimum) and replace invalid/missing values with central tendencies (mean, median).
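Step 3 can be sketched in a few lines of Python (the thresholds and values below are invented for illustration): cap values at a chosen minimum/maximum, and fill missing values with the median.

```python
import statistics

def clean(values, floor, ceiling):
    """Fill missing values (None) with the median, then cap at floor/ceiling."""
    present = [v for v in values if v is not None]
    fill = statistics.median(present)
    filled = [v if v is not None else fill for v in values]
    return [min(max(v, floor), ceiling) for v in filled]

# Invented example: one missing value, one high outlier, one negative outlier.
cleaned = clean([5, None, 12, 250, -3], floor=0, ceiling=100)
```

In practice the floor/ceiling would come from percentiles of the building sample rather than fixed numbers.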

4) Choose the modeling method based on the data:

Regression

Segmentation

Time Series

5) Choose modeling software depending on resource availability (budget, people, software): you can do it in SAS, SPSS, or Excel, with different skill sets, data-handling capabilities, and complexity.

6) For regression modeling (the most commonly used method), run iterative tests on the building sample to remove multicollinearity (e.g. via the variance inflation factor), and use statistical measures to test the goodness of the model (concordance, p-values of individual variables, gain or lift across deciles).
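The multicollinearity check in step 6 can be sketched quickly: with just two predictors, the variance inflation factor reduces to 1/(1 - r^2), where r is their correlation. A minimal pure-Python illustration on made-up data:

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sx = math.sqrt(sum((a - mx) ** 2 for a in xs))
    sy = math.sqrt(sum((b - my) ** 2 for b in ys))
    return cov / (sx * sy)

income = [21, 34, 45, 58, 70, 82, 95, 110]      # hypothetical predictor 1
spend = [40, 70, 88, 115, 142, 160, 195, 218]   # roughly 2x income: collinear

r = pearson(income, spend)
vif = 1 / (1 - r ** 2)   # VIF well above 5-10 flags multicollinearity
```

Here the two predictors are nearly proportional, so the VIF explodes and one of them should be dropped or combined with the other.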

7) Score the validation sample with the same equation and variables; check lift, signs, and parameter estimates, which should not change radically.

8) Cross-validate on the third sample.

9) Additionally, try transformations of variables (x, 1/x, x^2, log x, e^x) to get incremental lift.
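One quick way to see whether a transformation in step 9 is worth it is to check which transform of a predictor correlates best with the target. A small pure-Python sketch on invented numbers:

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sx = math.sqrt(sum((a - mx) ** 2 for a in xs))
    sy = math.sqrt(sum((b - my) ** 2 for b in ys))
    return cov / (sx * sy)

income = [10, 20, 40, 80, 160, 320]              # invented predictor
response = [0.10, 0.18, 0.25, 0.31, 0.36, 0.40]  # invented response rate

results = {
    "x": pearson(income, response),
    "log(x)": pearson([math.log(v) for v in income], response),
    "1/x": pearson([1.0 / v for v in income], response),
    "x^2": pearson([v ** 2 for v in income], response),
}
```

With this (deliberately log-shaped) data, log(x) correlates with the response noticeably better than raw x, which is exactly the kind of incremental lift the step describes.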

10) Go for binning of variables: do the variables behave in the same manner across deciles as their sign in the regression equation suggests?
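The lift-across-deciles check mentioned in steps 6 and 10 takes only a few lines; here is a pure-Python sketch on synthetic scores (all numbers invented):

```python
def decile_lift(scores, responses, n_bins=10):
    """Sort by score (best first), split into equal bins, and return each
    bin's response rate divided by the overall response rate."""
    pairs = sorted(zip(scores, responses), key=lambda p: -p[0])
    overall = sum(responses) / len(responses)
    size = len(pairs) // n_bins
    lifts = []
    for d in range(n_bins):
        chunk = pairs[d * size:(d + 1) * size]
        rate = sum(r for _, r in chunk) / len(chunk)
        lifts.append(rate / overall)
    return lifts

scores = list(range(100))                          # invented model scores
responses = [1 if s >= 90 else 0 for s in scores]  # only top scorers respond
lifts = decile_lift(scores, responses)
```

A healthy model shows lift declining monotonically from the top decile down, consistent with the variable signs in the equation.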

p.s. Search Wikipedia for any statistical definition you are not familiar with.

p.p.s. This will be followed by ways to present a model, which is a different ball game altogether.

Comparing Base SAS and SPSS

Comparing Base SAS and SPSS is an age-old question among analytics professionals, as both are among the longest-running statistical software packages in the world.

While Base SAS is on version 9+ and has greatly improved its visual appeal to counter SPSS's click-and-get-results interface, SPSS has moved beyond version 15.0 and started adding modules as SAS has done.

Here I will compare specific SAS and SPSS components: SAS ETS with SPSS Trends, and SAS Base/Stat with SPSS Base.

Base SAS is almost 1.75 times as expensive in upfront cost for a single installation as SPSS.

SAS ETS is better than SPSS Trends for time series analysis on bad data, but SPSS Trends can run huge numbers of time series analyses more easily than SAS ETS.

SAS is tougher to learn than SPSS's point-and-click interface.

SPSS documentation is much better and gives better clarity on the algorithms used for statistical procedures.

Base SAS is much more powerful for crunching huge volumes of data (such as sorting or splicing datasets); for data smaller than, say, 100 MB, the difference between SAS and SPSS is small.

SPSS is a perpetual license, while SAS is licensed year on year, which eventually makes SAS 2-3 times more expensive.

Modeling is easier in SPSS, but SAS provides more control thanks to its command-line interface and advanced editor coding. SAS Enterprise is not as good a visual interface as SPSS.

For a startup analytics outfit, the best installation for both SAS and SPSS is network licenses, preferably over a Linux network. Ideally you should have a mix of both SAS and SPSS to optimize both cost and analytical flexibility.

Other comparisons with Base SAS (a SAS Institute copyrighted product) can be found at http://www.ats.ucla.edu/stat/technicalreports/

or by searching packages at http://finzi.psych.upenn.edu/search.html


10 SAS Procs: Learning SAS Easily

when-i-was-beginning-in-analytics-and-learning-sas.doc

When I was beginning in analytics and learning SAS, one of my team members pulled me aside and said: man, just learn 10 types of procs and you have learnt it. You can download actual cut-and-paste code from the link above.

Proc Freq, Proc Datasets

Proc Means, Proc Print

Proc Contents, Proc Univariate

Proc Summary, Proc Sort

Proc Rank, Proc Tabulate

Additional: merging data pulled from multiple locations. Sort all datasets by the key (here var1) before merging.
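The "sort by key, then merge" step above can be illustrated outside SAS too. Here is a minimal Python sketch (the dataset names, keys, and values are made up) of joining two datasets on a common key var1, the same idea as PROC SORT followed by MERGE with a BY statement:

```python
# Two hypothetical datasets keyed by var1.
sales = {"A01": 250, "A02": 400}            # key -> amount
regions = {"A01": "North", "A03": "South"}  # key -> region

# Walk the sorted union of keys, pulling from each dataset; missing
# matches come through as None (the SAS equivalent of a missing value).
merged = [
    {"var1": key, "amount": sales.get(key), "region": regions.get(key)}
    for key in sorted(set(sales) | set(regions))
]
```

As in SAS, unmatched keys from either side survive the merge with missing values filled in.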

p.s. It proved to be more complex than that, but learning the different kinds of procs is an easy way for a SAS beginner to learn SAS faster and clear SAS skills interviews.

Analytics Training Websites

This post deals with books, certifications, and tutorials for data miners and analysts, from beginner to intermediate level.

Excel Modeling

Regression –
http://academic.brooklyn.cuny.edu/economic/friedman/Regressnxce.html

Optimization-
http://office.microsoft.com/en-us/help/HA011245951033.aspx

Free Tutorials from SAS

http://www.sas.com/apps/elearning/elearning_courses.jsp?cat=Free+Tutoria…

SAS certification

http://support.sas.com/certify/

SPSS tutorials

http://calcnet.mth.cmich.edu/org/spss/toc.htm

http://www.ats.ucla.edu/STAT/spss/webbooks/reg/default.htm

Using Batch Files to Automate

Batch files may be old-fashioned, but they can save you a lot of time on repeated tasks. Here is an example of a batch file that:

1) Sets some global variables (inputs)

set SALES_GRP=s1
set ID=f2
set SALES_DATE=nov0107

REM BATCH FILE COMMENT (batch files use REM for comments); order_date format is monddyy

2) Opens files in Notepad for making changes

call c:\software\Notepad.exe C:\folder1\input1.sas
call c:\software\Notepad.exe C:\folder2\input2.sas

3) Creates a directory to store results

call mkdir C:\folder2\%SALES_GRP%\%ID%\%SALES_DATE%

4) Executes another batch file or code using the relevant software (e.g. sas.exe for SAS code)

call C:\software\sas8.exe C:\folder1\input1.sas

call C:\software\sas8.exe C:\folder2\input2.sas


5) Changes the active folder

call cd C:\folder1

This can particularly help you schedule SAS code to run at specific times using the Windows Task Scheduler.

A Simple SAS Code to Automate Reports

Organizations use an increasing number of reports. Countless hours are spent manipulating spreadsheets one after another, into and out of MS Access databases, and finally emailing them to a select audience.

The following tries to present a simple and integrated approach to report generation and presentation. It uses macros for repeatability and flexibility. You can copy and paste the code directly into your SAS editor and make the relevant changes.

Steps in preparing a report

  • Import data from data sources (mostly spreadsheets and databases; here, assuming Base SAS only, csv files are considered).
  • Manipulate data for analysis (in the example below, the only manipulation is removal of duplicates; this section can be changed as needed for:
    • frequency counts using Proc Freq,
    • mean, maximum and minimum counts using Proc Means,
    • customized reports using Proc Report).
  • Present and format the data.
  • Output the data in the form of a report/spreadsheet.
  • Email internal customers of the report.

———————————————————————————————————

Using BASE SAS, the following can be coded as:

     /*PROGRAM TO AUTOMATE REPORTING */

    /*DECLARING THE PATH OF THE INPUT/OUTPUT FOLDER */

%let pathfile = 'X:\Auto\AUG06-AUTOS\' ;

     /*CREATING LIBRARY NAME */

libname auto &pathfile;

run;

   /*TO CONSERVE SPACE*/

options compress=yes;

   /*TO MAKE LOG READABLE */

options macrogen symbolgen;

/* STARTING MACRO FOR REPEATABLE EXECUTION OF REPORT*/

%macro impmth(mth,num,emailid);

 /*MANIPULATING VARIABLES TO REDUCE CHANGES */

data _null_ ;

call symput('filepath',"'"||&pathfile||&mth||'.csv'||"'" );

call symput('unqpath',"'"||&pathfile||"unq"||&mth||'.csv'||"'" );

call symput('unqxls',"'"||&pathfile||"unq"||&mth||'.xls'||"'" );

run;

/*IMPORT*/

/*IMPORTING DATA FROM CSV FILES STORED IN FOLDER DECLARED ABOVE*/

PROC IMPORT OUT= auto.&num

DATAFILE= &filepath

DBMS=CSV REPLACE;

GETNAMES=YES;

DATAROW=2;

quit;

/*CREATING VARIABLE FOR DISTINGUISHING INPUT*/

data auto.&num.;

set auto.&num.;

&num = 1;

run;

———————————————————————–

/*DATA MANIPULATION HERE REMOVING DUPLICATES*/

proc sort data=auto.&num nodupkey;

by REFAGREEMENTID;

run;

———————————————————————–

/*EXPORT*/

/*EXPORTING DATA TO A CSV FILE*/

PROC EXPORT DATA= AUTO.&num

OUTFILE= &unqpath

DBMS=CSV REPLACE;

RUN;

/*EXPORTING DATA TO A XLS FILE*/

ODS HTML FILE=&unqxls;

proc print data=auto.&num.;

run;

ODS HTML CLOSE;

/*EMAILING THE RESULTS*/

filename outbox email &emailid;

data _null_ ;

file outbox

to=(&emailid)/* Overrides value in filename statement */

cc=(&emailid)

subject='File from'

attach=(&unqxls)

;

put 'Hi,';

put 'This is an integrated approach to automating reports';

put 'It works great!';

run;

%mend;

/*CALLING THE MACRO*/

/*HERE 'Ahmedabad' IS THE INPUT FILE WHILE Ahd IS THE OUTPUT FILE*/

/*HERE 'aohri@hotmail.com' IS THE EMAIL ID TO BE MAILED THE REPORT*/

%impmth('Ahmedabad',Ahd,'aohri@hotmail.com');

Note: this was written on version 8 of SAS, so let me know of any issues.

Ajay 

Basic Text Mining: 3 Simple Paths


Text mining, in which you search alphanumeric data for meaningful patterns, is relatively more complex than crunching plain numeric data. The reason is that the human eye can scan only a few hundred rows of data before tiring, and analytics software algorithms need to be properly programmed or they will miss the relevant solution or text. An example: how many Punjabis live in Delhi (the stats needed)? Suppose you have a dataset with all the names in Delhi and want to send an SMS contest (the marketing decision) on Lohri (a Punjabi-specific festival).

Text manipulation can be done with the TRIM and LOWER functions in Excel and the corresponding functions in SAS. For mining, use the following options:
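As a cross-language illustration of the Lohri example above, here is a Python sketch (the surname list and names are purely hypothetical): trim and lowercase each name, as TRIM and LOWER would in Excel or SAS, then flag names containing a surname from a lookup list.

```python
# Hypothetical lookup list; real marketing use would need a far larger one.
punjabi_surnames = {"singh", "kaur", "ahluwalia"}
names = ["  Harpreet SINGH ", "Rahul Sharma", "Simran Kaur"]

def is_punjabi_name(name):
    cleaned = name.strip().lower()   # TRIM + LOWER equivalent
    return any(word in punjabi_surnames for word in cleaned.split())

matches = [n.strip() for n in names if is_punjabi_name(n)]
```

Normalizing before matching is the key step; without it, "SINGH " would never match "singh".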

1) SAS Basic Text Mining - Using Only Base SAS

In SAS you can use the INDEXW function for text mining.

As per the SAS online documentation:

INDEXW(source, excerpt)

Arguments

source
specifies the character expression to search.
excerpt
specifies the string of characters to search for in the character expression. SAS removes the leading and trailing blanks from excerpt.

The INDEXW function searches source, from left to right, for the first occurrence of excerpt and returns the position in source of the substring's first character. If the substring is not found in source, INDEXW returns a value of 0. If there are multiple occurrences of the string, INDEXW returns only the position of the first occurrence."
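For comparison, here is a rough Python analogue of INDEXW. It is an approximation: it uses regex word boundaries rather than SAS's blank-delimited word rules, but it shows the same contract of returning the 1-based position of the first whole-word match, or 0 when there is none.

```python
import re

def indexw(source, excerpt):
    """Rough analogue of SAS INDEXW: 1-based position of the first
    whole-word occurrence of excerpt in source, or 0 if absent."""
    m = re.search(r"\b" + re.escape(excerpt.strip()) + r"\b", source)
    return m.start() + 1 if m else 0

pos = indexw("see the cat sit", "cat")   # whole word "cat" starts at position 9
miss = indexw("catalog", "cat")          # "cat" inside "catalog" does not count
```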

2) MS EXCEL

You can use MS Excel for text mining too. I recommend Office 2007 simply because it can handle more rows.

The function in Excel is SEARCH


3) MS ACCESS

In MS Access you can use LIKE queries to create a different table or append a value to certain columns. Example:

Some problems can't be solved with comparisons, e.g. "His name begins with Mc or Mac." In a case like this, wildcards are required; they are represented in SQL with the % sign and the LIKE keyword.

e.g.

SELECT au_lname, city

FROM authors

WHERE au_lname LIKE 'Mc%' OR au_lname LIKE 'Mac%'
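The same wildcard idea can be tried end to end with Python's built-in sqlite3 module (the authors table and names below are made up, echoing the classic pubs sample database); % matches any run of characters.

```python
import sqlite3

# In-memory database with a tiny hypothetical authors table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE authors (au_lname TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO authors VALUES (?, ?)",
    [("McBadden", "Vacaville"), ("MacFeather", "Oakland"), ("Greene", "Nashville")],
)

# LIKE with % picks up both the Mc and Mac spellings.
rows = conn.execute(
    "SELECT au_lname, city FROM authors "
    "WHERE au_lname LIKE 'Mc%' OR au_lname LIKE 'Mac%' "
    "ORDER BY au_lname"
).fetchall()
```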

UPDATE: the above post is now dated; there are easier and better ways to do text mining, including Weka and R.