Common Analytical Tasks

WorldWarII-DeathsByCountry-Barchart
Image via Wikipedia

 

Some common analytical tasks from the diary of the glamorous life of a business analyst-

1) removing duplicates from a dataset based on certain key values/variables
2) merging two datasets based on a common key/variable/s
3) creating a subset based on a conditional value of a variable
4) creating a subset based on a conditional value of a time-date variable
5) changing format from one date time variable to another
6) doing a means grouped or classified at a level of aggregation
7) creating a new variable based on if then condition
8) creating a macro to run same program with different parameters
9) creating a logistic regression model, scoring dataset,
10) transforming variables
11) checking roc curves of model
12) splitting a dataset for a random sample (repeatable with random seed)
13) creating a cross tab of all variables in a dataset with one response variable
14) creating bins or ranks from a certain variable value
15) graphically examine cross tabs
16) histograms
17) plot(density())
18)creating a pie chart
19) creating a line graph, creating a bar graph
20) creating a bubbles chart
21) running a goal seek kind of simulation/optimization
22) creating a tabular report for multiple metrics grouped for one time/variable
23) creating a basic time series forecast

and some case studies I could think of-

 

As the Director, Analytics you have to examine current marketing efficiency as well as help optimize sales force efficiency across various channels. In addition you have to examine multiple sales channels including inbound telephone, outgoing direct mail, internet email campaigns. The datawarehouse is an RDBMS but it has multiple data quality issues to be checked for. In addition you need to submit your budget estimates for next year’s annual marketing budget to maximize sales return on investment.

As the Director, Risk you have to examine the overdue mortgages book that your predecessor left you. You need to optimize collections and minimize fraud and write-offs, and your efforts would be measured in maximizing profits from your department.

As a social media consultant you have been asked to maximize social media analytics and social media exposure to your client. You need to create a mechanism to report particular brand keywords, as well as automated triggers between unusual web activity, and statistical analysis of the website analytics metrics. Above all it needs to be set up in an automated reporting dashboard .

As a consultant to a telecommunication company you are asked to monitor churn and review the existing churn models. Also you need to maximize advertising spend on various channels. The problem is there are a large number of promotions always going on, some of the data is either incorrectly coded or there are interaction effects between the various promotions.

As a modeller you need to do the following-
1) Check ROC and H-L curves for existing model
2) Divide dataset in random splits of 40:60
3) Create multiple aggregated variables from the basic variables

4) run regression again and again
5) evaluate statistical robustness and fit of model
6) display results graphically
All these steps can be broken down in little little pieces of code- something which i am putting down a list of.
Are there any common data analysis tasks that you think I am missing out- any common case studies ? let me know.

 

 

 

Redlining in Internet Access and notes on Regression Models

This is the definition of Redlining Citation- The AD FREE Wikepedia-

Redlining is the practice of denying, or increasing the cost of, services such as bankinginsuranceaccess to jobs,[2]access to health care,[3] or even supermarkets[4] to residents in certain, often racially determined,[5] areas. The term “redlining” was coined in the late 1960s by community activists in Chicago.[citation needed] It describes the practice of marking a red line on a map to delineate the area where banks would not invest; later the term was applied todiscrimination against a particular group of people (usually by race or sex) no matter the geography.

As of today, redlining in financial services is outlawed by the Fair Credit Lending Act which prohibits using variables in regression models which end up red-lining districts. However as far as 2005, redlining was used in Auto Insurance by using suitably disguised zip9 variables ( I carried data for 55 million American Citizens and 88 million Accounts for a major North American Automotive Insurance provider as part of an offshoring contract from Atlanta, GA  in 2005).

It exists today by informal arrangements between internet service providers who carve up territories and districts. Internet access redlining is still not illegal. This is especially true in Austin ( I traveled there as a consultant last year) and Knoxville, Tennessee where I still study as a grad student.

Neither are suitably proprietary insurance and health care claim denial models used for minimizing litigation risk. Litigation risk minimization is the next level of retail logistic regression model just as predictive modeling used by political consultants during elections.

How to do Logistic Regression

Logistic regression is a widely used technique in database marketing for creating scoring models and in risk classification . It helps develop propensity to buy, and propensity to default scores (and even propensity to fraud ) .

This is more of a practical approach to make the model than a theory based approach.(I was never good at the theory 😉 )

If you need to do Logistic Regression using SPSS, a very good tutorial ia available here

http://www2.chass.ncsu.edu/garson/PA765/logistic.htm

(Note -Copyright 1998, 2008 by G. David Garson.
Last update 5/21/08.)

For SAS a very good tutorial is here –

SAS Annotated Output
Ordered Logistic Regression. UCLA: Academic Technology Services, Statistical Consulting Group.

from http://www.ats.ucla.edu/stat/sas/output/sas_ologit_output.htm (accessed July 23, 2007).

For R the documentation (note :Still searching for R ‘s Logistic Regression ) is here
http://lib.stat.cmu.edu/S/Harrell/help/Design/html/lrm.html

lrm(formula, data, subset, na.action=na.delete, method=”lrm.fit”, model=FALSE, x=FALSE, y=FALSE, linear.predictors=TRUE, se.fit=FALSE, penalty=0, penalty.matrix, tol=1e-7, strata.penalty=0, var.penalty=c(‘simple’,’sandwich’), weights, normwt, …)

For linear models in R –
http://datamining.togaware.com/survivor/Linear_Model0.html

An extremely good book if you want to work with R , and do not have time to learn it is to use the GUI
rattle and look at this book

http://datamining.togaware.com/survivor/Contents.html