More Ways to get a Scoring Model wrong

I got the following answer from LinkedIn groups (http://www.linkedin.com/groupAnswers?viewQuestionAndAnswers=&gid=53432&discussionID=1946379&commentID=2213879&goback=.mgr_false_0_DATE.mgr_true_1_DATE.mid_1066685320#commentID_2213879) on my Ten Ways to get a Scoring Model Wrong:

Typo.
Refuse to use central tendency to patch missing values. Instead, assign the highest response rate because WOE says so.
Marketing people tell me to force the variable into the model.
Selection bias.
Forgot to segment.
Rely solely on data to segment without consulting the business side.
Just delete observations with missing values, OK, without studying geometrical boundaries.
Using oversampling, but refuse to weight it back. That boosts lift, right? Let us do 50-50.
Insist random sampling is sufficient when stratified sampling is critical.
Binning too much, or too little.
Selecting variables without repeated sampling.
Forgot to exclude the numeric customer ID from the candidate variables. AND, it pops... Well, both Unica and Kxen accepted it, so I see no problem.
When the same variable is sourced from different vendors, did not look up the scales under the same name. Just combine them.
Well, SAS Enterprise Miner gave me this model yesterday.
The binary variable is statistically significant, but there are only 27 with event=1 out of ~1mm, since only 27 made any purchases.
Well, I only have 250 events=1. But I think I can use exact logistic to make it up, all right? I got a PhD in Statistics. Trust me, my professor is OK with it. I just called her.
Build a two-stage model without Heckman adjustment.
Use the global mean over the WHOLE customer base to replace missing values on a much smaller universe/subset. So 22% of a high-net-worth client group ends up with a net worth of only 225K.
I just spent the past two days boosting R-square. Now it is 92. Great.
Forgot to set the descending option in PROC LOGISTIC in SAS.
I think we should hold out missing values when conducting EDA.
Without proper separation of ‘treatment’ and ‘control’.
Treat business entities and individuals as equal and mix them in the same universe.
Running clustering without validation.
Running a discriminant model without validation. So the correct classification rate on development is 89% and that on validation is ... 35%. (No wonder you finished it in two hours and came here to ask me for a raise.)
Disregard the link function in multinomial models.
I think this is a better variable: xnew=y*y*y. It is the top variable dominating others.
Use standardized coefficients to calculate relative importance, because many people are doing it and marketing loves it.
I tried Google Analytics last Friday. It recommends this variable: click stream density over Thanksgiving weekend, on my web portal, on this item.
Let us treat this matrix as unary so we can apply Euclidean, since that runs faster and has a lot of optimal properties. It makes our life easier.
Let us use the score from that model to boost this model and use the score from this model to boost it back. Is that what they call neural nets, Jia?

Enough?
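
To make the oversampling item concrete: if a logistic model is built on a 50-50 sample and never weighted back, its scores sit on the wrong scale. Here is a minimal sketch of the usual odds-scale correction, assuming a logistic-type score. The function name and the example rates are made up for illustration and are not from Jia's list.

```python
def correct_oversampled_probability(p_sample, sample_rate, population_rate):
    """Map a probability scored on an oversampled sample back to the population scale.

    p_sample        -- predicted event probability from the model built on the oversampled data
    sample_rate     -- event rate in the oversampled development sample (hypothetical, e.g. 0.5)
    population_rate -- event rate in the real population (hypothetical, e.g. 0.02)
    """
    # Oversampling events multiplies the odds by a constant factor,
    # so the correction is done on the odds scale.
    sample_odds = sample_rate / (1.0 - sample_rate)
    population_odds = population_rate / (1.0 - population_rate)
    odds = p_sample / (1.0 - p_sample)
    corrected_odds = odds * population_odds / sample_odds
    return corrected_odds / (1.0 + corrected_odds)


# A score of 0.50 on a 50-50 sample corresponds to the 2% base rate.
base_case = correct_oversampled_probability(0.5, 0.5, 0.02)
high_case = correct_oversampled_probability(0.8, 0.5, 0.02)
```

The rank order of the scores does not change, which is why the lift chart still looks fine. What goes wrong is any use of the score as an actual response rate.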


31 Ways to get a model wrong – and Hats off to a fellow mate in suffering -Jia
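
The global-mean item is easy to reproduce. Below is a small sketch with invented net worth figures (in thousands), comparing a fill value taken from the whole customer base with one taken from the segment being modelled. The helper `impute_with_mean` and all numbers are hypothetical.

```python
from statistics import mean


def impute_with_mean(values, fill_source):
    """Replace None entries in `values` with the mean of the non-missing entries of `fill_source`."""
    fill = mean(v for v in fill_source if v is not None)
    return [fill if v is None else v for v in values]


# Hypothetical net worth in thousands; the whole base averages 225.
whole_base = [100, 150, 200, 250, 300, 350]
high_net_worth = [5000, None, 8000, None]

# Fill from the whole customer base: the missing clients suddenly look "poor".
wrong = impute_with_mean(high_net_worth, whole_base)

# Fill from the segment itself: the missing clients look like their peers.
better = impute_with_mean(high_net_worth, high_net_worth)
```

With these numbers the first version drops 225 into a segment whose known values are 5000 and 8000, which is the pattern described in the list. The segment mean is only one possible fix; it is used here because it is the simplest contrast.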

Coming up – One Way to get a scoring model correct

Author: Ajay Ohri

http://about.me/ajayohri

2 thoughts on “More Ways to get a Scoring Model wrong”

You can leave a question at the LinkedIn link I have put at the top; that's where they come from.

Hello Ajay

Many thanks for this list. A fine mixture of “haha”, “umm.. I did this wrong, did I” and many “yes, yes”. However, I must admit I did not understand point 30. What is a unary matrix? English is not my native tongue, I have no mathematics book in English available, and the web search results confused me. Can you point me to an explanation?

kind regards,

Steffen