Does the Internet need its own version of credit bureaus

Data Miners love data. The more data they have the better model they can build. Consumers do not love data so much and find sharing data generally a cumbersome task. They need to be incentivize for filling out survey forms , and for signing to loyalty programs. Lawyers, and privacy advocates love to use examples of improper data collection and usage as the harbinger of an ominous scenario. George Orwell’s 1984 never “mentioned” anything about Big Brother trying to sell you one more loan, credit card or product.

Data generated by customers is now growing without their needing to fill out forms and surveys. This data is about their preferences , tastes and choices and is growing in size and depth because it is generated from social media channels on the Internet.It is this data that can be and is captured by social media analytics.

Mobile data is also growing, including usage of location based applications and usage of Internet from the mobile phone is leading to further increases in data about consumers.Increasingly , location based applications help to provide a much more relevant context to the data generated. Just mobile data is expected to grow to 15 exabytes by 2015.

People want to have more and more conversations online publicly , share pictures , activity and interact with a large number of people whom  they have never met. But resent that information being used or abused without their knowledge.

Also the Internet is increasingly being consolidated into a few players like Microsoft, Amazon, Google  and Facebook, who are unable to agree on agreements to share that data between themselves. Interestingly you can use Yahoo as a data middleman between Google and Facebook.

At the same time, more and more purchases are being done online by customers and Internet advertising has grown much above the rate of growth of other mediums of communication.
Internet retail sales have the advantage that better demand predictability can lead to lower inventories as retailers need not stock up displays to look good. An Amazon warehouse need not keep material to simply stock up it shelves like a K-Mart does.

Our Hypothesis – An Analogy with how Financial Data Marketing is managed offline

  1. Financial information regarding spending and saving is much more sensitive yet the presence of credit bureaus alleviates these concerns.
  2. Credit bureaus collect information from all sources, aggregate and anonymize the individual components accordingly.They use SSN as a unique identifier.
  3. The Internet has a unique number too , called the Internet Protocol Address (I.P) 
  4. Should there be a unique identifier like Internet Security Number for the Internet to ensure adequate balance between the need for privacy as well as the need for appropriate targeting? 

After all, no one complains about privacy intrusions if their credit bureau data is aggregated , rolled up, and anonymized and turned into a propensity model for sending them direct mailers.

Advertising using Social Media and Internet

1. A business creates an ad
Let’s say a gym opens in your neighborhood. The owner creates an ad to get people to come in for a free workout.
2. Facebook gets paid to deliver the ad
The owner sends the ad to Facebook and describes who should see it: people who live nearby and like running.
The right people see the ad
3. Facebook only shows you the ad if you live in town and like to run. That’s how advertisers reach you without knowing who you are.

Adding in credit bureau data and legislative regulation for anonymizing  and handling privacy data can expand the internet selling market, which is much more efficient from a supply chain perspective than the offline display and shop models.

Privacy Regulations on Marketing using Internet data
Should laws on opt out and do not mail, do not call, lists be extended to do not show ads , do not collect information on social media. In the offline world, you can choose to be part of direct marketing or opt out of direct marketing by enrolling yourself in various do not solicit lists. On the internet the only option from advertisements is to use the Adblock plugin if you are Google Chrome or Firefox browser user. Even Facebook gives you many more ads than you need to see.

One reason for so many ads on the Internet is lack of central anonymize data repositories for giving high quality data to these marketing companies.Software that can be used for social media analytics is already available off the shelf.

The growth of the Internet has helped carved out a big industry for Internet web analytics so it is a matter of time before social media analytics becomes a multi billion dollar business as well. What new developments would be unleashed in this brave new world is just a matter of time, and of course of the social media data!

SAS Blogs gets a makeover

SAS blogs gets a much needed makeover. Now if only they share some of the social media analytics

some more rather than just social media on analytics 🙂 One of the more professionally designed , managed and

passionate corporate blog series in my opinion

Seriously 27 blogs and not one blog on social media analytics despite

being the leading maker of such software!





Common Analytical Tasks

Image via Wikipedia


Some common analytical tasks from the diary of the glamorous life of a business analyst-

1) removing duplicates from a dataset based on certain key values/variables
2) merging two datasets based on a common key/variable/s
3) creating a subset based on a conditional value of a variable
4) creating a subset based on a conditional value of a time-date variable
5) changing format from one date time variable to another
6) doing a means grouped or classified at a level of aggregation
7) creating a new variable based on if then condition
8) creating a macro to run same program with different parameters
9) creating a logistic regression model, scoring dataset,
10) transforming variables
11) checking roc curves of model
12) splitting a dataset for a random sample (repeatable with random seed)
13) creating a cross tab of all variables in a dataset with one response variable
14) creating bins or ranks from a certain variable value
15) graphically examine cross tabs
16) histograms
17) plot(density())
18)creating a pie chart
19) creating a line graph, creating a bar graph
20) creating a bubbles chart
21) running a goal seek kind of simulation/optimization
22) creating a tabular report for multiple metrics grouped for one time/variable
23) creating a basic time series forecast

and some case studies I could think of-


As the Director, Analytics you have to examine current marketing efficiency as well as help optimize sales force efficiency across various channels. In addition you have to examine multiple sales channels including inbound telephone, outgoing direct mail, internet email campaigns. The datawarehouse is an RDBMS but it has multiple data quality issues to be checked for. In addition you need to submit your budget estimates for next year’s annual marketing budget to maximize sales return on investment.

As the Director, Risk you have to examine the overdue mortgages book that your predecessor left you. You need to optimize collections and minimize fraud and write-offs, and your efforts would be measured in maximizing profits from your department.

As a social media consultant you have been asked to maximize social media analytics and social media exposure to your client. You need to create a mechanism to report particular brand keywords, as well as automated triggers between unusual web activity, and statistical analysis of the website analytics metrics. Above all it needs to be set up in an automated reporting dashboard .

As a consultant to a telecommunication company you are asked to monitor churn and review the existing churn models. Also you need to maximize advertising spend on various channels. The problem is there are a large number of promotions always going on, some of the data is either incorrectly coded or there are interaction effects between the various promotions.

As a modeller you need to do the following-
1) Check ROC and H-L curves for existing model
2) Divide dataset in random splits of 40:60
3) Create multiple aggregated variables from the basic variables

4) run regression again and again
5) evaluate statistical robustness and fit of model
6) display results graphically
All these steps can be broken down in little little pieces of code- something which i am putting down a list of.
Are there any common data analysis tasks that you think I am missing out- any common case studies ? let me know.




An Introduction to Data Mining-online book

I was reading David Smith’s blog

where he mentioned this interview of Norman Nie, at TDWI

where I saw this link (its great if you want to study Data Mining btw)

and I c/liked the U Toronto link

Best of All- I really liked this online book created by Professor S. Sayad

Its succinct and beautiful and describes all of the Data Mining you want to read in one Map (actually 4 images painstakingly assembled with perfection)

The best thing is- in the original map- even the sub items are click-able for specifics like Pie Chart and Stacked Column chart are not in one simple drop down like Charts- but rather by nature of the kind of variables that lead to these charts. For doing that- you would need to go to the site itself- ( see


Again- there is no mention of the data visualization software used to create the images but I think I can take a hint from the Software Page which says software used are-


See it on your own-online book (c)Professor S. Sayad

Really good DIY tutorial

Quantifying Analytics ROI

Japanese House Crest “Go-Shichi no Kiri”
Image via Wikipedia

I had a brief twitter exchange with Jim Davis, Chief Marketing Officer, SAS Institute on Return of Investment on Business Analytics Projects for customers. I have interviewed Jim Davis before last year

Now Jim Davis is a big guy, and he is rushing from the launch of SAS Institute’s Social Media Analytics in Japan- to some arguably difficult flying conditions in time to be home in America for Thanksgiving. That and and I have not been much of a good Blog Boy recently, more swayed by love of open source, than love of software per se. I love equally, given I am bad at both equally.

Anyways, Jim’s contention  ( ) was customers should go in business analytics only if there is Positive Return on Investment.  I am quoting him here-

What is important is that there be a positive ROI on each and every BA project. Otherwise don’t do it.

That’s not the marketing I was taught in my business school- basically it was sell, sell, sell.

However I see most BI sales vendors also go through -let me meet my sales quota for this quarter- and quantifying customer ROI is simple maths than predictive analytics but there seems to be some information assymetry in it.

Here is a paper from North Western University on ROI in IT projects-.

but overall it would be in the interest of customers and Business Analytics Vendors to publish aggregated ROI.

The opponents to this transparency in ROI would be market leaders in market share, who have trapped their customers by high migration costs (due to complexity) or contractually.

A recent study listed Oracle having a large percentage of unhappy customers who would still renew!, SAP had problems when it raised prices for licensing arbitrarily (that CEO is now CEO of HP and dodging legal notices from Oracle).

Indeed Jim Davis’s famous unsettling call for focusing on Business Analytics,as Business Intelligence is dead- that call has been implemented more aggressively by IBM in analytical acquisitions than even SAS itself which has been conservative about inorganic growth. Quantifying ROI, should theoretically aid open source software the most (since they are cheapest in up front licensing) or newer technologies like MapReduce /Hadoop (since they are quite so fast)- but I think that market has a way of factoring in these things- and customers are not as foolish neither as unaware of costs versus benefits of migration.

The contrary to this is Business Analytics and Business Intelligence are imperfect markets with duo-poly  or big players thriving in absence of customer regulation.

You get more protection as a customer of $20 bag of potato chips, than as a customer of a $200,000 software. Regulators are wary to step in to ensure ROI fairness (since most bright techies are qither working for private sector, have their own startup or invested in startups)- who in Govt understands Analytics and Intelligence strong enough to ensure vendor lock-ins are not done, and market flexibility is done. It is also a lower choice for embattled regulators to ensure ROI on enterprise software unlike the aggressiveness they have showed in retail or online software.

Who will Analyze the Analysts and who can quantify the value of quants (or penalize them for shoddy quantitative analytics)- is an interesting phenomenon we expect to see more of.