More Ways to get a Scoring Model wrong

I got the following answer from a LinkedIn group (http://www.linkedin.com/groupAnswers?viewQuestionAndAnswers=&gid=53432&discussionID=1946379&commentID=2213879&goback=.mgr_false_0_DATE.mgr_true_1_DATE.mid_1066685320#commentID_2213879) on my Ten Ways to get a Scoring Model Wrong.

  1.  Typo 
  2. Refuse to use central tendency to patch missing values. Instead, assign highest response rate because WOE says so 
  3. Marketing people tell me to force the variable into the model 
  4.  Selection bias 
  5.  Forgot to segment 
  6. Solely rely on data to segment without consulting the biz side 
  7.  Just delete observations with missing values, OK, without studying geometrical boundaries 
  8.  Using oversampling, but refuse to weight it back. That boosts lift, right? Let us do 50-50 
  9. Insist random sampling is sufficient, while stratified sampling is critical 
  10. Binning too much, or too little 
  11. Selecting variables without repeated sampling 
  12. Forgot to exclude numeric customer id from the candidate variables. AND, it pops… Well, both Unica and Kxen accepted it, so I see no problem 
  13. When the same variable is sourced by different vendors, did not look up the scales under the same name. Just combine them 
  14.  Well, SAS Enterprise Miner gave me this model yesterday 
  15. The binary variable is statistically significant, but there are only 27 event=1, out of ~1mm, since only 27 made some purchases. 
  16. Well, I only have 250 events=1. But I think I can use exact logistic to make it up, all right? I got a PhD in Statistics, trust me, my professor is OK with it. I just called her. 
  17.  Build two-stage model without Heckman adjustment 
  18. Use global mean over the WHOLE customer base to replace missing value on a much smaller universe/subset. So 22% of a high net worth client group end up with a net worth of only 225K 
  19. I just spent the past two days boosting R-square. Now it is 92. Great. 
  20. Forgot to set descending option in proc logistic in SAS 
  21. I think we should hold out missing values when conducting EDA. 
  22. No proper separation of treatment and control 
  23. Treat business entities and individuals as equal and mix them in the same universe
  24. Running clustering without validation 
  25. Running discriminant model without validation. So correct classification rate on development is 89% and that over validation is … 35%. (No wonder you finished it in two hours and came here to ask me for a raise.) 
  26. Disregard link function in multinomial models 
  27. I think this is a better variable: xnew=y*y*y*. It is the top variable dominating others. 
  28. Use standardized coefficient to calculate relative importance, because many people are doing it and marketing loves it. 
  29. I tried Google Analytics last Friday. It recommends this variable: click stream density over Thanksgiving weekend, on my web portal, on this item 
  30.  Let us treat this matrix as unary so we can apply Euclidean, since that runs faster and has a lot of optimal properties. It makes our life easier 
  31. Let us use score from that model to boost this model and use score from this model to boost it back. Is that what they call neural nets, Jia? 
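
Item 8 is one of the few on this list with a closed-form fix. Here is a minimal sketch in Python, assuming the model was fit on an oversampled development sample and that the true population response rate is known. The function name and the 50-50 / 2% figures are made up for illustration, and this standard prior correction is only one way to weight scores back:

```python
def correct_oversampled_score(p_sample, sample_rate, population_rate):
    """Map a probability scored on an oversampled sample back to the
    population scale (prior correction).

    p_sample        -- probability produced by the model built on the oversample
    sample_rate     -- share of events in the oversampled development data
    population_rate -- share of events in the real population (assumed known)
    """
    # Re-weight the event and non-event parts by how much each class
    # was over- or under-represented in the development sample.
    event = p_sample * (population_rate / sample_rate)
    non_event = (1.0 - p_sample) * ((1.0 - population_rate) / (1.0 - sample_rate))
    return event / (event + non_event)


# Hypothetical case: model built on a 50-50 sample, true response rate 2%.
for p in (0.2, 0.5, 0.8):
    print(p, round(correct_oversampled_score(p, 0.5, 0.02), 4))
```

A score of 0.5 on the 50-50 sample maps back to the 2% base rate, which is the sanity check to run first. The correction does not change the rank order of the scores, so it cannot boost lift. It only stops the predicted response rates from being wildly overstated.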

Enough?

 

31 Ways to get a model wrong – and Hats off to a fellow mate in suffering -Jia

Coming up – One Way to get a scoring model correct

Ten ways to build a wrong scoring model

 

Some ways to build a wrong scoring model are below. The author offers no guarantee if your modeling team is using one of these and still getting a correct model.

1) Overfit the model to the sample. Overfitting can be checked by drawing a fresh random sample, applying the scoring equation, and comparing predicted conversion rates with actual conversion rates. An overfit model does not rank order: deciles with lower average probability may show equal or more conversions than deciles with higher probability scores.

2) Choose non-random samples for building and validating the scoring equation. See overfitting above.

3) Use multicollinearity (http://en.wikipedia.org/wiki/Multicollinearity) without business judgment to remove variables that may make business sense. This usually happens a few years after you studied and forgot multicollinearity.

If you don’t know the difference between multicollinearity and heteroskedasticity (http://en.wikipedia.org/wiki/Heteroskedasticity), this could be the real deal breaker for you.

4) Using legacy code to run scoring, usually with stepwise forward and backward regression. This usually happens on Fridays and when in a hurry to make models.

5) Ignoring the signs or magnitude of parameter estimates (that’s the output, or the weightage of the variable in the equation).

6) Not knowing the difference between Type 1 and Type 2 error, especially when rejecting variables based on p-value. (Not knowing what a p-value is means you may kindly stop reading and click the YouTube video in the right margin.)

7) Excessive zeal in removing variables. Why? Ask yourself this question every time you remove a variable.

8) Using the wrong causal event (like mailings for loans) for predicting the future with a scoring model (for mailings of deposit accounts), or using the right causal event in the wrong environment (rapid decline/rise of sales due to factors not present in the model, like competitor entry/going out of business, oil prices, credit shocks, sob sob sigh).

9) Overfitting

10) Learning about creating models from blogs and not reading and refreshing your old statistics textbooks
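
The check described in point 1 can be sketched in a few lines of Python. This is an illustration, not a prescribed method: the function names, the toy scores and the outcomes are all hypothetical, and it flags only strict reversals between adjacent deciles, leaving ties to judgment.

```python
def decile_table(scores, outcomes, n_bins=10):
    """Sort by score (highest first), cut into n_bins roughly equal groups,
    and return a list of (mean predicted probability, actual conversion rate)."""
    ranked = sorted(zip(scores, outcomes), key=lambda pair: pair[0], reverse=True)
    n = len(ranked)
    table = []
    for b in range(n_bins):
        chunk = ranked[b * n // n_bins:(b + 1) * n // n_bins]
        if not chunk:
            continue  # fewer observations than bins
        predicted = sum(score for score, _ in chunk) / len(chunk)
        actual = sum(outcome for _, outcome in chunk) / len(chunk)
        table.append((predicted, actual))
    return table


def rank_order_breaks(table):
    """Indices of bins whose actual rate is higher than that of the
    better-scored bin just above them."""
    return [i for i in range(1, len(table)) if table[i][1] > table[i - 1][1]]


# Toy validation sample: 100 scored customers (made-up data).
scores = [i / 100 for i in range(100)]
good = [1 if i >= 70 else 0 for i in range(100)]  # converters are the high scorers
bad = [1 if i < 30 else 0 for i in range(100)]    # converters are the low scorers

print(rank_order_breaks(decile_table(scores, good)))
print(rank_order_breaks(decile_table(scores, bad)))
```

An empty list on a fresh random sample means the deciles rank order. Any index in the list points to a decile that converts better than the one scored above it, which is the symptom point 1 warns about.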

Guns and No Glory

Henceforth, here is the policy on comments and posts. It is in response to the comments on my “A Farewell to Guns” post, in which I projected my Gandhian non-violence too far by suggesting the remote possibility of a ban on guns based on the Alabama and Germany events.

  1. No more political posts (including on India) or poems (including funny) will be imposed on this blog or unsuspecting readers.*
  2. I use an offline blogging system called Windows Live Writer and do not go online to approve comments (that requires me to log in to WordPress and do it manually). All comments are read via email settings and feedback is incorporated. Comments sometimes get deleted within 7 days because of auto settings (for example, when I have not logged in to WordPress for 7 days), not because I am ignoring anything.
  3. You didn’t like the “Guns” article – you have the right to say so on the comments page. You didn’t like the R article or the package–comments page please.
  4. Read the page “Fine Print”. I use a professional analytics tracking system which I pay for every month. It tracks IP address, IP provider, organization, location, time, and country, with integrated Google Maps that lets me see which block the material entered the net from. Writing offensive comments from your work computer is not a great idea, not with the angry one. Not if you are……
  5. I never use the analytics system for individuals unless the comments are ghastly. Then I delete the comments and don’t use the analytics system for individuals.
  6. Akismet for spam will catch, and has caught, multiple attempts at malware linking and spamming. It will continue to do so.
  7. Anonymous comments are not anonymous as explained above.

A blog on Decision Stats should quote political statements only when accompanied by statistics, or better still, nothing but the statistics. So it will be.

*They will be posted on www.iwannacrib.com

Interview- BI Dashboards dMINE Sanjay Patel

If you have ever felt frustrated trying to know the business metrics in your or your client’s organization, or negotiated with a host of either legacy applications that don’t talk to each other or good solutions that cost more than the benefit they bring, a young man from India has a solution for you. With a total implementation time of 1-6 weeks and costs as low as 10,000 USD for an enterprise-wide implementation, dMine promises to shake things up. Here is an interview with the co-founder of this startup.

 

Ajay- Describe your career journey. What advice would you give to new entrepreneurs in this recession?

Sanjay- Armed with an M.Tech from BITS Pilani and an MBA from Jamnalal Bajaj Institute of Management, Mumbai, I teamed up with Praveen Wicliff in 2000 and ventured to start our own product company with a focus on delivering critical enterprise solutions. The company has now grown to a strong team of 100 innovative minds. My current mission is to make Icicle a strong leader in business-driven software products across all segments, with the best of delivery capabilities.
Recession is a trying time for most people, and it is a phase that brings out the best in everyone, as many of the most innovative solutions are floated during it. My only suggestion would be to move from an emotional connect with customers to direct tangible benefits, stay focused on cash flow, aim high, set your goals and targets, and never give up on any of these. Even in the darkest moments, find the faith to keep going.

Ajay- One more dashboard solution. How is dMine different from its competitors? Who are the principal competitors?

Sanjay- At first glance dMine looks like just another dashboard solution, but what makes it different from competing products is its intuitiveness and user-friendly features, which help even business users use the product effectively.

dMine is positioned as a product for business users and not for the IT team. Unlike other dashboard or BI products, dMine can create dashboards in just 3 easy steps:

1- IT team connects to Data-sources & Creates Business Views

2- Users can create Dashboards & Charts with dMine’s Intuitive interface

3- Users can share Dashboards & schedule Emails in PDF or PPT

The potent combination of best-of-class comprehensive graphical and analytical reports, easier representation and interpretation of key business data, and integration of data from multiple systems on a single chart or dashboard for real-time analysis, all with minimal IT overhead, is a unique proposition from us. See dMine in action at www.dminebi.com and you will know the difference.

Ajay- What is the area where dMine would not be suited for dashboards? Suppose I have data of 200,000 rows x 40 columns; would dMine work for me?

Sanjay- dMine is positioned as a pure dashboard product. It does not implement a complete BI stack, which would require working on transactional data to create cubes and universes.

We look beyond data warehouses and data marts and emphasize summarized data to deliver key business performance metrics, with a high focus on data visualization.

The idea is to target business executives, who are interested not in transactional data but in overall performance, hence the summarized data.

Here the summarized Data could be in the form of any RDBMS Database, Flat Files, Spreadsheets, Analytics output, Cubes or Universe. We Support almost 16 Database vendors in the market starting from as small as MS Access to as big as DB2.

Ajay- What is the pricing strategy of dMine? Any other products or complements that you are thinking about? Name some customer case studies or big wins.

Sanjay- dMine’s pricing strategy is very simple and is designed keeping in mind that the product can be used by customers in the SME/SMB segment or even at the enterprise level.

Currently we offer dMine in two forms:

  • On-premise
  • Hosted service

In an on-premise version, dMine has a Product License fee and per user license fees issued separately.

At additional cost you can have loads of add-on goodies catering to various customer needs. All the costs mentioned above are one-time.

The hosted version is on a monthly subscription model, where the cost is decided based on parameters like bandwidth usage, server configuration, disk space etc. Very soon we will be enabled on the Amazon cloud service.

Typically, for an on-premise version, a smaller implementation at the corporate level with dashboard access for only a few top executives would cost anywhere between 10,000 and 16,000 USD, plus the implementation cost (billed at actuals) and applicable taxes.

For a larger implementation, for example in the BFSI segment, where you need to roll out user licenses to all the branch managers in addition to the top executives, the total goes to a few hundred user licenses. Usually the implementation period is as low as a week and not more than 6 weeks.

A few of the customers using dMine are marketing analytics companies that use the product to submit the final report of their marketing / customer analysis to their end customers.

Case Study – Summary

We recently implemented dMine at a leading FM Radio channel to monitor the performance across its radio stations spread all over India.

The client, being a major player in the media and communication space, has to constantly monitor all its stations across their entire operations: revenues from sponsors, peak and non-peak time inventory and sales, market share and channel ranking, P&L, forecasts and other critical information.

Currently the client uses multiple Applications for supporting these business functions which capture data in different databases and sources. The MIS-reports were manually created by extracting data from multiple sources in spreadsheets. These spreadsheets are distributed among the management, with a turn around time of about 15 days.

The dMine solution implemented collates information from multiple data sources like ERP and sales systems, including lots of spreadsheets. More than 70 metrics (KPIs) and analytics are defined for the client, and the management now has access to this information whenever they require it.

All these metrics are identified as critical and are categorized under 5 dashboards: Organizational KPI, Financial Dashboard, Sales Dashboard, Market Share & Metrics Dashboard and Operational Dashboard. The metrics are parameterized, and drill-downs allow the management to get to the source of an issue rapidly.

The implementation of the dMine Business Dashboard product helped the client effectively monitor business operations, KPIs, and organizational performance. Making actionable, real-time information available on demand for decision makers and operational managers has also helped in taking timely, critical business decisions, all while minimizing IT overhead cost. The short implementation timelines also allow users to see the benefits quickly and achieve ROI within 1 to 3 months. The detailed case study is available for your reference at our website www.dminebi.com .

Ajay- Do you read or write blogs? What do you think about the Web 2.0 paradigm for social and community marketing?

Sanjay- I do read and write lots of blogs and am a member of quite a few groups that share interests in the virtual community. Web 2.0 provides a platform for many-to-many communication and, in its social sense, is based on the principles of collaboration and sharing of information and content, putting social interaction at the heart of it all.

A recent study by Fox Interactive Media reveals that 40% of social network users rely on social media outlets to learn more about brands and products. Whether you’re a freelancer promoting your own brand or part of a company, social media marketing is an essential component of an integrated campaign. If you are looking to start up a new business, launch a new product or service, or even expand your presence, you cannot miss out on the eMarketing process, focusing on a three-pronged strategy: social networking sites, your own website, and the blogosphere. These will help empower your brand and positively convey your message.

Ajay – Sanjay Patel is an experienced entrepreneur with Icicle Technologies, and the dMine dashboard is currently winning rave reviews (see http://www.dminebi.com/ibm-nominates-icicle-as-isv-on-ibm-smart-business-platform/ ). Here’s wishing luck to Mr Patel for the summarized data dashboard that can be a game changer at www.dminebi.com.

Interview with Anne Milley, SAS II

Anne Milley is director of product marketing at SAS Institute. In part 2 of the interview Anne talks about immigration in technology areas, open source networks, how she misses coding, and software as a service, especially SAS Institute’s offering. She also gives a preview of SAS’s involvement with R and mentions cloud computing.

Ajay – Labor arbitrage outsourcing versus virtual teams located globally: what is the SAS Institute position, and your opinion, on this? What do you feel about the recent debate on H-1B visas and job cuts? How many jobs, if any, is SAS planning to cut in 2009-2010?

Anne – SAS is a global company, with customers in more than 100 countries around the world.  We hire employees in these countries to help us better serve our global customers.  Our workforce decisions are based on our business needs.  We also employ virtual teams–the feedback and insights from our global workforce help us improve and develop new products to meet the evolving needs of our customers.  (As someone who works from her home office in Connecticut, I am a fan of virtual teaming!)  We see these approaches as complementary.

The issue of the H-1B visa is a different discussion entirely.  H-1B visas, although capped, permit US employers to bring foreign employees in “specialty occupations” into this country.   The better question, though, is what is necessitating the need for H-1B visas.  We would submit that the reason the U.S. has to look outside its borders for highly qualified technical workers is because we are not producing a sufficient number of workers with the right skill sets to meet U.S. demand.  In turn, that means that our educational system is not producing students interested or qualified to pursue the STEM (science, technology, engineering or mathematics) professions (either at a K-12 or post-secondary level), or developing the workforce improvement programs that may allow workers to pursue these “specialty occupations.”  Further, any discussion about H-1B visas (or any other type of visa) should include a more comprehensive review of our nation’s immigration policies—are they working, are they not working, how or why are they, are we able to limit illegal immigration and if not, why not, etc.

I am not aware of any planned job cuts at SAS.  In fact, I am aware of a few groups which are actively hiring.

Ajay- Which open source software has SAS Institute worked with in the past, and which does it continue to support financially as well as technologically? Any exciting product releases in 2009-2010 that you can tell us about?

Anne- Open source software provides many options and benefits.  We see many (SAS included) embracing open source for different things.  Our software runs on Linux and we use some open-source tools in development. There are different aspects of open source software in developing SAS software:

-Development with open source tools such as Eclipse, Ant, NAnt, JUnit, etc. to build, test, and package our software

-Using open source software in our products; examples include Apache/Jakarta products such as the Apache Web Server.

-Developing open source software, making changes to an open source codebase, and optionally contributing that source back to the open source project, to adapt an open source project for use in a SAS product or for internal use. Example: Eclipse.

And we plan to do more with open source in the future.  The first step of SAS integrating with R will be shown at SAS Global Forum coming up in DC later this month.  Other announcements for new offerings are also planned at this event. 

Ajay- What do you feel about adopting software as a service for any of SAS Institute’s products? Any new initiatives from SAS on the cloud computing front, especially in terms of helping customers cut down on hardware costs?

Anne- SAS Solutions OnDemand, the division which oversees the infrastructure and support of all our hosted offerings, is expanding in this rapidly growing market.  SAS Solutions OnDemand Drug Development was our first SaaS offering announced in January.  Additional news on new hosted offerings will be announced at SAS Global Forum later this month.  SAS doesn’t currently offer any external cloud computing options, but we’re actively looking at this area.

Ajay- Which software do you personally find best to write code in, and why? Do you miss writing code, and if so, why?

Anne- In my current role, I have limited opportunity to write code.  At times, I do miss the logical thought process coding forces you to adopt (to do the job as elegantly as possible).  I had the opportunity to do a long-term assignment at a major financial services company in the UK last year and did get to use some SAS and JMP, including a little JSL (JMP scripting language).  There’s nothing like real-world, noisy, messy data to make you thankful for the power of writing code!  Even though I don’t write code on a regular basis, I am happy to see continued investment in the languages SAS provides—among the most recent, the addition of an algebraic optimization modeling language in our SAS/OR module contained within the SAS language as “PROC OPTMODEL.”

I have great respect for people who invest in learning (or even getting exposure to) more than one language and who appreciate the strengths of different languages for certain tasks and applications.

Ajay- It is great to see passionate people at work on both the open source and the packaged software teams, and even better for them to collaborate once in a while. Most of our work is based on scientists who came before us (especially in math theory).

Ultimately we are all just students of science anyway.

SAS Global Forum –http://support.sas.com/events/sasglobalforum/2009/

Annual event of SAS language practitioners. The SAS language consists of data steps and proc steps for input and output, thus simplifying syntax for users.

SAS Institute – The leader in analytics software since the 1970s, it grew out of North Carolina State University and provides jobs to thousands of people. The world’s largest privately held software company, admired for its huge investments in research and development and criticized for the premium price of its packaged software solutions. A recent entrant among corporate users willing to support the R language.

Weathering the Stormy Economy

Here is a conference you may want to visit. At first glance it may look like one of those self-help “free” webinars but it is a very relevant topic with a great speaker. Plus it is on the web.

 

Free Seminar Hosted by SAP Business Objects
Thursday, March 12, 2009
11 a.m. PST / 2 p.m. EST
Robin Fray Carey, CEO of Social Media Today will discuss the best ideas gathered from MyVenturePad.com, SMT’s online community for growth companies. Plus, two fast-growing companies, Fresh Direct, and The Life is good Company, will share their practical recommendations on how to manage business and IT priorities in these challenging times. Register today.
http://events.businessobjects.com/forms/Q109/ideas/?source=SMtoday1

Social Media Today builds Wordframe-based communities like Smart Data Collective (for data, BI and analytics people), Best of the Blogs (for progressive bloggers), Energy Collective (for green energy enthusiasts, thinkers and researchers), Social Media Today (for understanding and leveraging social media and networks) and My VenturePad (for entrepreneurs).

These communities basically work as online newspapers by aggregating and moderating the RSS feeds of thousands of bloggers (for some sites) and their sites. I have written on Wordframe’s concept of content driven communities and Ning’s concept of community driven content earlier.

Disclaimer- I have worked as an evangelist for SMT and have been awarded Blogger of the Week once (for my article on R).

For other conferences you may also want to see AnalyticBridge’s page on conferences.

http://www.analyticbridge.com/group/conferences/forum/topics/best-ideas-for-weathering-the

Disclaimer -I have been awarded the Member of the Month twice by them.

I like the third-party apps of Ning better than the old outdated format and themes. One Ning application can actually serve as a competitor to Wordframe: the RSS application (see the feed on my page). Wordframe has capabilities for even category-level filters, so my Analytics category feed goes to Smart Data, my Internet category feed gets published on Social Media (when I am lucky) and my attempts at poetry go to Best of the Blogs.

The Decision Stats group (on Linkedin) also has a group on AnalyticBridge.

But why join so many communities and go to webinars? Because knowledge is useful and productive and fun, and I have a personal motto of learning one new thing a day.

Where do you get the time? Just sleep one hour less and devote that hour purely to your own learning. 8 hours to the boss, 4-5 hours to the family.

1 hour to yourself?

Sounds reasonable, eh  🙂

So try this one –

http://events.businessobjects.com/forms/Q109/ideas/?source=SMtoday1