Compression Tips

1) Stuck with Huge Datasets in SAS.

Use the SAS code:

options compress=yes;

2) Stuck with huge datasets in UNIX space.

Use compress "filename.extension"

3) Huge data in Windows: use the following utility.

Use 7-Zip. It is open source.

You don’t need to register or pay for 7-Zip.

www.7zip.org/
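
For the first tip, here is a minimal sketch of the two places the option can be set, assuming a SAS 9 session. The dataset name work.big_copy is hypothetical, and sashelp.class only stands in for a genuinely large table.

```sas
/* Session-wide: datasets created after this line are compressed */
options compress=yes;

/* Per dataset: the COMPRESS= dataset option overrides the session setting.
   BINARY is usually suggested for wide, mostly numeric tables.
   work.big_copy is a hypothetical name; sashelp.class stands in for real data. */
data work.big_copy (compress=binary);
    set sashelp.class;
run;
```

Check the log after the step runs for the note on the size change. On a very small table compression may not help at all.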

SAVE SPACE ON YOUR SYSTEMS 🙂

Learning R for SAS and SPSS Users

So you decided to cut down on your statistical software expenses and get R.

But the problem is that you know SAS/SPSS, and you need to learn R fast enough to justify switching over.

The ideal book for you is http://oit.utk.edu/scc/RforSAS&SPSSusers.pdf

Thanks to the guys who pointed me here. It's a really easy book: you have the SAS syntax, the corresponding SPSS syntax and the R syntax side by side.

That's useful for learners of R who have projects to execute, and also for anyone who needs to learn SPSS or even switch from SPSS to SAS.

Model Presentation

Presenting a model is different from making a model, as the end audience is non-technical and business minded. These are some rules of thumb I use for making model presentation templates.

1) Model Lift- How good is the model vs current effort? This is best shown by lift curves or the KS statistic. For the lift curve, plot cumulative % Population on the X axis and cumulative % Responders on the Y axis. The KS statistic is the maximum separation between the cumulative distributions of goods and bads.

2) Model Robustness- What facts back up the statistical validity of the model output/equation? Is there a way to test the model without executing it fully?

3) Model Assumptions- This deals with historical assumptions like which event is the model based on, data assumptions for validation and missing value treatment, capping of outliers.

The best way to convince business audiences is to split the dataset into three random samples of 60%, 30% and 10% for model building, validation and final testing.

Then rerun the model equation on another random sample, using a different seed in the RANUNI function. The KS should be similar, and so should the other statistics.
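
The split and the KS check can be sketched as follows. This is an illustration under stated assumptions, not a production template: work.scored, score, responder and the seed value are hypothetical, the scores are simulated rather than produced by a fitted model, and PROC NPAR1WAY needs SAS/STAT.

```sas
/* Synthetic scored file standing in for real model output (hypothetical names) */
data work.scored;
    call streaminit(2024);
    do id = 1 to 10000;
        score = rand('uniform');                      /* simulated score between 0 and 1 */
        responder = (rand('uniform') < 0.3 * score);  /* higher scores respond more often */
        output;
    end;
run;

/* 60/30/10 random split; change the seed to draw a different random sample */
%let seed = 12345;

data work.build work.validate work.test;
    set work.scored;
    r = ranuni(&seed);
    if r < 0.60 then output work.build;
    else if r < 0.90 then output work.validate;
    else output work.test;
    drop r;
run;

/* Two-sample KS between responders and non-responders on each sample.
   The EDF option asks PROC NPAR1WAY for the Kolmogorov-Smirnov statistics. */
proc npar1way data=work.build edf;
    class responder;
    var score;
run;

proc npar1way data=work.validate edf;
    class responder;
    var score;
run;
```

If the KS on the validation sample is far from the KS on the build sample, or moves a lot when the seed changes, that is worth explaining before the presentation rather than during it.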

Ultimately models get validated or battered in the marketplace. A 1% difference in response rates can make or lose hundreds of thousands of dollars, especially in mass marketing or credit modeling. Business perspective and buy-in are thus essential, and so is continuous feedback on model performance, because any model will eventually deteriorate over time.

Comparing Big SpreadSheet A to Big SpreadSheet B

Many organizations have fixed formats for their reporting needs. These formats, or Management Information Reports, are updated at monthly and quarterly intervals in exactly the same format. However, when the spreadsheets become big, comparing two of them becomes tedious due to the sheer number of cells involved.

Using SAS, we can automate this process almost instantly.

We will use proc import to import data from the spreadsheets in such a manner that the top row imported consists of the column headings (the SAS dataset variables). Note that both spreadsheets are in exactly the same format.

We will then use proc compare to compare these two datasets.

We can then use the integrated approach to automated reporting in SAS (See Archives- Category Analytics) to further reduce this to a simple batch process.

The relevant code is:

%let pathfile = "C:\Documents and Settings\";

/*CREATING LIBRARY NAME */

libname auto &pathfile;

/*TO CONSERVE SPACE*/

options compress=yes;

/*TO MAKE LOG READABLE */

options macrogen symbolgen;

PROC IMPORT OUT= auto.TEST1
DATAFILE= "C:\Documents and Settings\excel1-full.xls"
DBMS=EXCEL2000 REPLACE;

/*SPECIFYING WORKSHEET FOR MULTIPLE SHEETS */
SHEET="'Sales$'";

/*TO TAKE VARIABLE NAMES FROM TOP ROW */
GETNAMES=YES;

/*SPECIFYING RANGE OF CELLS IN SPREADSHEET TO BE READ */
RANGE="A4:AB2000";

RUN;

PROC IMPORT OUT= auto.TEST2
DATAFILE= "C:\Documents and Settings\excel2-full.xls"
DBMS=EXCEL2000 REPLACE;

/*SPECIFYING WORKSHEET FOR MULTIPLE SHEETS */
SHEET="'Sales$'";

/*TO TAKE VARIABLE NAMES FROM TOP ROW */
GETNAMES=YES;

/*SPECIFYING RANGE OF CELLS IN SPREADSHEET TO BE READ */
RANGE="A4:AB2000";

RUN;

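/* COMPARING THE TWO SPREADSHEETS */

/*MATCHING ROWS WITH ID NEEDS BOTH DATASETS SORTED BY BRANCH */
proc sort data=auto.test1; by Branch; run;
proc sort data=auto.test2; by Branch; run;

proc compare base=auto.test1 compare=auto.test2;

/*SPECIFYING WHAT VARIABLES TO BE COMPARED */
/*UNDER THE DEFAULT VALIDVARNAME=V7, HEADINGS STARTING WITH A DIGIT
  ARE IMPORTED WITH A LEADING UNDERSCORE */
var Applications Approvals Disbursals _30dayplus _60dayplus _90dayplus;

/*SPECIFYING VARIABLE FOR MATCHING ROWS,
  THE SAME BRANCH IN THIS CASE */
id Branch;

run;
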
The output will simply compare and compute the cell by cell difference.

You can then use ODS to output this to another big spreadsheet 🙂

This is particularly relevant in telecommunications and banking, where a lot of metrics need to be compared across regular time intervals.
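
One way to do the ODS step, as a rough sketch rather than the only route: write just the differing rows to a dataset and print that through ODS HTML into a file that Excel can open. The output path, work.base, work.comp and work.diffs are hypothetical, and sashelp.class stands in for the two imported spreadsheets.

```sas
/* Two small demo tables standing in for auto.test1 and auto.test2 */
data work.base;
    set sashelp.class;
run;

data work.comp;
    set sashelp.class;
    if name = 'Alfred' then height = height + 1;  /* one deliberate difference */
run;

/* Hypothetical output path; point it at a folder you can write to */
ods html file="C:\temp\compare_report.xls";

/* OUT= with OUTNOEQUAL keeps only the observations that differ;
   OUTBASE, OUTCOMP and OUTDIF add the base, compare and difference rows.
   sashelp.class is already in Name order, which the ID statement expects. */
proc compare base=work.base compare=work.comp
             out=work.diffs outnoequal outbase outcomp outdif noprint;
    id name;
run;

proc print data=work.diffs noobs;
run;

ods html close;
```

Keeping only the unequal rows matters on big sheets, since the report then lists the handful of cells that changed instead of every cell.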

Comparing Base SAS and SPSS

Comparing Base SAS and SPSS is an age-old question among analytics professionals, as both are among the longest-running statistical software packages in the world.

While Base SAS is on version 9+ and has greatly improved its visual appeal to counter SPSS's click-and-get-results interface, SPSS has moved beyond version 15.0 and started adding modules like SAS has done.

Here I will be comparing specific SAS and SPSS components like SAS ETS with SPSS Trends, and SAS Base /Stat with SPSS Base.

Base SAS is almost 1.75 times as expensive as SPSS in upfront cost for a single installation.

SAS ETS is better than SPSS Trends for time series analysis on bad data, but SPSS Trends makes it easier than SAS ETS to run huge numbers of time series analyses.

SAS is tougher to learn than the point-and-click interface of SPSS.

SPSS documentation is much better and gives better clarity on the algorithms used for statistical procedures.

Base SAS is much more powerful for crunching huge volumes of data (like sorting or splicing data). For data smaller than, say, 100 MB, there is not much difference between SAS and SPSS.

SPSS is a perpetual license, while SAS has a year-on-year license. This eventually makes SAS 2-3 times more expensive.

Modeling is easier in SPSS, but SAS can provide more control thanks to its command line interface and advanced editor coding. The SAS Enterprise interface is not as good visually as that of SPSS.

For a startup analytics body, the best installation for both SAS and SPSS is network licenses preferably over a Linux network. You should ideally have a mix of both SAS and SPSS to optimize both costs and analytical flexibility.

Other Comparisons with Base SAS (a SAS Institute Copyrighted Product ) can be found at http://www.ats.ucla.edu/stat/technicalreports/

or by searching packages at http://finzi.psych.upenn.edu/search.html