Learning R for SAS and SPSS Users

So you decided to cut down on your Statistical software expenses and decided to get R.

but the problem is you know SAS /SPSS and you need to learn R fast enough to justify switching over …….

the ideal book for you is  http://oit.utk.edu/scc/RforSAS&SPSSusers.pdf

Thanks to the guys who pointed me here. Its a really easy book, you have the SAS Syntax, the corresponding SPSS Syntax and the R Syntax.

 That’s useful for learners in R who got projects to execute, and need to learn either SPSS or R or even switch from SPSS to SAS.

The Nastiness Index

Reading about the latest back and forth and then back again rounds between not that conservative Republicans, conservative Republicans, liberal Democrats, not so liberal Democrats makes you wonder if there is some way of quantifying the daily dose of political news. If we can quantify the level of terror threat in color coding, can we create a blue and red nastiness index for quantifying the political debates in a country.

The Nastiness Index can be used to decribe political equilibrium in a country. eg in USA if the Index shows high blue , and less red it means the Democrats are pummeling the Republicans. If the Index shows low blue and low red, it means all is quiet. If it shows high red, and low blue , it means the opposition is down and the administration is up.Stock markets can use this index to quantify the level of political activity, inactivity and stability of elected governments.

If the nastiness index shows high blue and high red, it means its election time folks.

Model Presentation

Presenting a model is different from making a model, as the end audience is non technical and business minded. These are some thumb rules I use for making model presentation templates

1) Model Lift- How good is the model vs current effort.This is best shown by lift curves or KS statistics where you plot % Responders on X Axis and % Population on Y axis. Maximum separation between goods and bads is the KS statistic.

2) Model Robustness- What facts back up statistical validity of model output/equation ? Is there a way to test the model without executing it fully?

3) Model Assumptions- This deals with historical assumptions like which event is the model based on, data assumptions for validation and missing value treatment, capping of outliers.

The best way to convince business audiences is splitting the dataset into three random samples of 60 %,30 % and 10 % for model building, validation and testing again.

Then rerun the model equation on another random sample ,using a different seed in the RANUNI function. The KS should be similar and so should be the stats.

Ultimately models get validated or battered in the marketplace. A 1 % difference in response rates can make or lose hundreds of thousands of dollars especially in mass marketing or credit modeling. Business perspective and buy in is thus essential and so is continuous model performance feedback to avoid deterioration of  model, as it will eventually deteriorate over a period of time.

Politics in Analytics

Observers of American Electoral politics ,including the current Presidential Campaign would be struck by the sophisticated degree of analytics being involved. This includes the following –

1) Segmentation of likely percentage Response Rates (vote yes(1) , vote opponent (0)

based on

history of voting

response to stimuli (experience vs change)

ethnicity (black, white ,latino)

income groups (<40,000 USD ,>100000 USD)

education (college educated)

gender (male,female)

geography (rural ,urban,college town)

union affliation

and even coffee (latte drinkers etc 🙂 )

What is striking is that most of these variables like race, gender cannot be used for marketing anything else like credit cards, or financial services on charges of discrimination.

What could be really interesting is if they add credit bureau variables and create logistic models (and not just segmentation). Maybe by 2016, there will be a different category of analytics called Quantitative Political analytics.

Another note – What is similar between Ralph Nader , Chaos Theory and Butterfly effect.

Chaos theory states that future results can vary a lot based on slight changes in initial differences.

Butterfly Effect uses this to say a small event like butterfly fluttering in china can cause a big event like typhoon in the US.

Ralph Nader entering the race in 2000 got 90000 votes in Florida, mostly to be siphoned away from Al Gore , who lost Florida and the elections by less than 1000 votes.

Al Gore was against Iraq war since the beginning and had he been President maybe the world would have been greener and no war in Iraq. maybe. No offense meant to anyone.

Howard Dean screaming or Bill Clinton calling Obama as similar to Jesse Jackson’s wins (which is analytically and quantitatively true) or Hillary crying in Iowa , can be similar butterfly effects.No offense meant to anyone.

Comparing Big SpreadSheet A to Big SpreadSheet B

Many organizations have pre-fixed formats for their reporting needs.  These formats or Management Information Reports are updated at monthly and quarterly intervals at exactly the same format. However when the spreadsheets become big, analysis becomes tedious in comparing two big spreadsheets due to the sheer number of cells involved.

Using SAS , we can automate this process almost instantly.

We will use proc import to import data from the spreadsheets in such a manner that top row imported consist column headings (sas dataset variables).Note both spreadsheets are exactly in same format.

We will then use proc compare to compare these two datasets.

We can then use the integrated approach to automated reporting in SAS (See Archives- Category Analytics) to further reduce this to a simple batch process.

The relevant codes are –

%let pathfile = “C:\Documents and Settings\” ;
run;

/*CREATING LIBRARY NAME */

libname auto &pathfile;

run;

/*TO CONSERVE SPACE*/

options compress=yes;

/*TO MAKE LOG READABLE */

options macrogen symbolgen;

PROC IMPORT OUT= auto.TEST1
DATAFILE= “C:\Documents and Settings\excel1-full.xls”
DBMS=EXCEL2000 REPLACE;
SHEET=”‘Sales$'”;

/*SPECIFYING WORKSHEET FOR MULTIPLE SHEETS */
GETNAMES=YES;

/*TO TAKE VARIABLE NAMES FROM TOP ROW */

   RANGE=”A4:AB2000″;

/*SPECIFYING RANGE OF CELLS  IN SPREADSHEET TO BE READ */

RUN;

PROC IMPORT OUT= auto.TEST2
DATAFILE= “C:\Documents and Settings\excel2-full.xls”
DBMS=EXCEL2000 REPLACE;
SHEET=”‘Sales$'”;

/*SPECIFYING WORKSHEET FOR MULTIPLE SHEETS */
GETNAMES=YES;

/*TO TAKE VARIABLE NAMES FROM TOP ROW */

   RANGE=”A4:AB2000″;

/*SPECIFYING RANGE OF CELLS  IN SPREADSHEET TO BE READ */

RUN;

/* COMPARING THE TWO SPREADSHEETS */

proc compare base=auto.test1 compare=auto.test2;
var

/*SPECIFYING WHAT VARIABLES TO BE COMPARED */
Applications

Approvals

Disbursals

30dayplus

60dayplus

90dayplus

;
with Branch;

/*SPECIFYING VARIABLE FOR COMPARISON

FOR SAME BRANCH IN THIS CASE */
 run;
The output will simply compare and compute the cell by cell difference.

You can then use ods to ouput this in another big spreadsheet 🙂

This is particularly relevant in telecommunications and banks, where they need to compare a lot of metrics across timely intervals.