Timo Elliott on 2012

Continuing the DecisionStats series on trends for 2012, Timo Elliott, Technology Evangelist at SAP Business Objects, looks at the predictions he made at the beginning of 2011 and follows up with the things that surprised him in 2011 and what he foresees in 2012.

You can read last year’s predictions by Mr Elliott at http://www.decisionstats.com/brief-interview-timo-elliott/

Timo- Here are my comments on the “top three analytics trends” predictions I made last year:

(1) Analytics, reinvented. New DW techniques make it possible to do sub-second, interactive analytics directly against row-level operational data. Now BI processes and interfaces need to be rethought and redesigned to make best use of this — notably by blurring the distinctions between the “design” and “consumption” phases of BI.

I spent most of 2011 talking about this theme at various conferences: how existing BI technology is rapidly becoming obsolete and how the changes are akin to the move from film to digital photography. Technology that has been around for many years (in-memory, column stores, data warehouse appliances, etc.) came together to create exciting new opportunities, and even generally skeptical industry analysts put out press releases such as “Gartner Says Data Warehousing Reaching Its Most Significant Inflection Point Since Its Inception.” Some of the smaller BI vendors had been pushing in-memory analytics for years, but the general market started paying more attention when megavendors like SAP started painting a long-term vision of in-memory becoming a core platform for applications, not just analytics. Database leader Oracle was forced to upgrade their in-memory messaging from “It’s a complete fantasy” to “we have that too”.

(2) Corporate and personal BI come together. The ability to mix corporate and personal data for quick, pragmatic analysis is a common business need. The typical solution to the problem — extracting and combining the data into a local data store (either Excel or a departmental data mart) — pleases users, but introduces duplication and extra costs and makes a mockery of information governance. 2011 will see the rise of systems that let individuals and departments load their data into personal spaces in the corporate environment, allowing pragmatic analytic flexibility without compromising security and governance.

The number of departmental “data discovery” initiatives continued to rise through 2011, but new tools do make it easier for business people to upload and manipulate their own information while using the corporate standards. 2012 will see more development of “enterprise data discovery” interfaces for casual users.

(3) The next generation of business applications. Where are the business applications designed to support what people really do all day, such as implementing this year’s strategy, launching new products, or acquiring another company? 2011 will see the first prototypes of people-focused, flexible, information-centric, and collaborative applications, bringing together the best of business intelligence, “enterprise 2.0”, and existing operational applications.

2011 saw the rise of sophisticated, user-centric mobile applications that combine data from corporate systems with GPS mapping and the ability to “take action”, such as mobile medical analytics for doctors or mobile beauty advisor applications, and collaborative BI started becoming a standard part of enterprise platforms.

And one that should happen, but probably won’t: (4) Intelligence = Information + PEOPLE. Successful analytics isn’t about technology — it’s about people, process, and culture. The biggest trend in 2011 should be organizations spending the majority of their efforts on user adoption rather than technical implementation.

Unsurprisingly, there was still high demand for presentations on why BI projects fail and how to implement BI competency centers. The new architectures probably resulted in even more emphasis on technology than ever, while business people’s expectations skyrocketed, fueled by advances in the consumer world. The result was probably even more dissatisfaction than in the past, but the benefits of the new architectures should start becoming clearer during 2012.

What surprised me the most:

The rapid rise of Hadoop / NoSQL. The potential of the technology has always been impressive, but I was surprised by just how quickly it has been used to address real-life business problems (beyond the “big web” vendors where it originated), and how quickly it is becoming part of mainstream enterprise analytic architectures (e.g. Sybase IQ 15.4 includes native MapReduce APIs, Hadoop integration and federation, etc.).

Prediction for 2012:

As I sat down to gather my thoughts about BI in 2012, I quickly came up with the same long laundry list of BI topics as everybody else: in-memory, mobile, predictive, social, collaborative decision-making, data discovery, real-time, etc. All of these things are clearly important, and we’re going to continue to see great improvements this year. But I think that the real “next big thing” in BI is what I’m seeing when I talk to customers: they’re using these new opportunities not only to “improve analytics” but also to fundamentally rethink some of their key business processes.

Instead of analytics being something that is used to monitor and eventually improve a business process, analytics is becoming a more fundamental part of the business process itself. One example is a large telco that has transformed the way it attracts customers. Instead of laboriously creating a range of rate plans, promoting them, and analyzing the results, they now use analytics to automatically create hundreds of more complex, personalized rate plans. They then throw them out into the market, monitor in real time, and quickly cull any that aren’t successful. It’s a way of doing business that would have been inconceivable in the past, and will be a lot more common in the future.

 

About Timo Elliott

Timo Elliott is a 20-year veteran of SAP BusinessObjects, and has spent the last quarter-century working with customers around the world on information strategy.

He works closely with SAP research and innovation centers around the world to evangelize new technology prototypes.

His popular Business Analytics blog tracks innovation in analytics and social media, including topics such as augmented corporate reality, collaborative decision-making, and social network analysis.

His PowerPoint Twitter Tools lets presenters see and react to tweets in real time, embedded directly within their slides.

A popular and engaging speaker, Elliott presents regularly to IT and business audiences at international conferences, on subjects such as why BI projects fail and what to do about it, and the intersection of BI and enterprise 2.0.

Prior to Business Objects, Elliott was a computer consultant in Hong Kong and led analytics projects for Shell in New Zealand. He holds a first-class honors degree in Economics with Statistics from Bristol University, England.

Timo can be contacted via Twitter at https://twitter.com/timoelliott

Part 1 of this series was from James Kobielus, Forrester, at http://www.decisionstats.com/jim-kobielus-on-2012/

PMML Augustus

Here is a new-old open source system for building and scoring statistical models, designed to work with data sets that are too large to fit into memory.

http://code.google.com/p/augustus/

Augustus is an open source software toolkit for building and scoring statistical models. It is written in Python and its most distinctive features are:
• Ability to be used on sets of big data; these are data sets that exceed either memory capacity or disk capacity, so that existing solutions like R or SAS cannot be used. Augustus is also perfectly capable of handling problems that can fit on one computer.
• PMML compliance and the ability to both:
– produce models with PMML-compliant formats (saved with extension .pmml).
– consume models from files with the PMML format.
Augustus has been tested and deployed on several operating systems. It is intended for developers who work in the financial or insurance industry, information technology, or in the science and research communities.

Usage

Augustus produces and consumes Baseline, Cluster, Tree, and Ruleset models. Currently, it uses an event-based approach to building Tree, Cluster and Ruleset models that is non-standard.

New to PMML?

Read on http://code.google.com/p/augustus/wiki/PMML

The Predictive Model Markup Language or PMML is a vendor driven XML markup language for specifying statistical and data mining models. In other words, it is an XML language so that Continue reading “PMML Augustus”
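Because PMML is plain XML, a model file can be inspected with any XML parser. Here is a minimal sketch in Python using only the standard library. It is not the Augustus API; the tiny tree model is hand-written for illustration (the field names `age` and `risk` are made up), and the namespace is the one I believe PMML 4.1 uses, so adjust it to whatever version your file declares.

```python
import xml.etree.ElementTree as ET

# A hand-written, illustrative PMML document (not produced by Augustus).
PMML_TEXT = """<PMML version="4.1" xmlns="http://www.dmg.org/PMML-4_1">
  <Header description="toy decision tree"/>
  <DataDictionary numberOfFields="2">
    <DataField name="age" optype="continuous" dataType="double"/>
    <DataField name="risk" optype="categorical" dataType="string"/>
  </DataDictionary>
  <TreeModel functionName="classification">
    <MiningSchema>
      <MiningField name="age"/>
      <MiningField name="risk" usageType="predicted"/>
    </MiningSchema>
    <Node score="low">
      <True/>
      <Node score="high">
        <SimplePredicate field="age" operator="lessThan" value="25"/>
      </Node>
    </Node>
  </TreeModel>
</PMML>"""

# Assumed namespace; it must match the xmlns declared in the file being read.
NS = {"pmml": "http://www.dmg.org/PMML-4_1"}


def summarize_pmml(text):
    """Return the PMML version, the declared data fields, and the model types."""
    root = ET.fromstring(text)
    fields = {
        field.get("name"): field.get("dataType")
        for field in root.findall("pmml:DataDictionary/pmml:DataField", NS)
    }
    # Model elements are direct children of the root whose tag ends in "Model".
    models = [
        child.tag.split("}")[-1] for child in root if child.tag.endswith("Model")
    ]
    return {"version": root.get("version"), "fields": fields, "models": models}


summary = summarize_pmml(PMML_TEXT)
print(summary)
```

The point is only that a model written by one tool can be read by another without sharing any code, which is what makes PMML useful as an exchange format.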

Using Opera Unite to defeat SOPA?

Let’s assume that the big bad world of American electoral politics forces some kind of modified SOPA to be passed, and the big American companies have to abide by that law (just as they share data for national security under the Patriot Act, but quietly).

I believe Opera Unite is the way forward to sharing content on the Internet.

From-

http://dev.opera.com/articles/view/opera-unite-developer-primer-revisited/

Opera Unite features a Web server running inside the Opera browser, which allows you to do some amazing things. At the touch of a button, you can share images, documents, video, music, games, collaborative applications and all manner of other things with your friends and colleagues.

I can share music and files, and the web server is actually my own laptop. Try beating 2 billion new web servers that sprout! File system sharing is totally secure: you can create private, public, or password-protected files. There is a messaging system that can be used for drop messages (called Fridge), a secure messaging system, and your own web server is ready to start at a click. The open web may just use Opera instead of Chromium, and US regulation would be solely to blame. Even URL blocking is of limited appeal thanks to software like the MAFIAAFire extension.

Throw in Adblock, embedded BitTorrent sharing and some Tor-level encryption within the browser, and sorry, Senator, but the Internet belongs to the planet, not to your lobbyist.

See- http://dev.opera.com/web

Topic Models

Some stuff on Topic Models-

http://en.wikipedia.org/wiki/Topic_model

In machine learning and natural language processing, a topic model is a type of statistical model for discovering the abstract “topics” that occur in a collection of documents. An early topic model was probabilistic latent semantic indexing (PLSI), created by Thomas Hofmann in 1999.[1] Latent Dirichlet allocation (LDA), perhaps the most common topic model currently in use, is a generalization of PLSI developed by David Blei, Andrew Ng, and Michael Jordan in 2002, allowing documents to have a mixture of topics.[2] Other topic models are generally extensions on LDA, such as Pachinko allocation, which improves on LDA by modeling correlations between topics in addition to the word correlations which constitute topics. Although topic models were first described and implemented in the context of natural language processing, they have applications in other fields such as bioinformatics.

http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation

In statistics, latent Dirichlet allocation (LDA) is a generative model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. For example, if observations are words collected into documents, it posits that each document is a mixture of a small number of topics and that each word’s creation is attributable to one of the document’s topics. LDA is an example of a topic model.
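The generative story in that definition can be sketched in a few lines of Python. This is only the "forward" direction (generating a document from known topics), not the inference step that real LDA software performs. The two toy topics and the vocabulary are made up, and the topic-word probabilities are fixed by hand here, whereas full LDA also draws them from a Dirichlet prior.

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw one sample from a Dirichlet distribution via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def sample_index(probs, rng):
    """Draw an index with probability proportional to the given weights."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_document(topic_word_probs, vocab, alpha, n_words, rng):
    """Generate one document following the LDA generative process."""
    n_topics = len(topic_word_probs)
    # 1. Each document gets its own mixture over topics.
    theta = sample_dirichlet([alpha] * n_topics, rng)
    topics, words = [], []
    for _ in range(n_words):
        # 2. Each word first picks one of the document's topics...
        z = sample_index(theta, rng)
        # 3. ...and is then drawn from that topic's word distribution.
        w = sample_index(topic_word_probs[z], rng)
        topics.append(z)
        words.append(vocab[w])
    return theta, topics, words


# Hypothetical vocabulary and two hand-made topics ("biology" and "sports").
VOCAB = ["gene", "dna", "cell", "ball", "team", "goal"]
TOPIC_WORD_PROBS = [
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.3, 0.4, 0.3],
]

rng = random.Random(0)
theta, topics, words = generate_document(
    TOPIC_WORD_PROBS, VOCAB, alpha=0.5, n_words=8, rng=rng
)
print("topic mixture:", theta)
print("words:", words)
```

Fitting LDA means running this story in reverse: given only the words, estimate the topic mixtures and the topic-word distributions that most plausibly produced them.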

David M Blei’s page on Topic Models-

http://www.cs.princeton.edu/~blei/topicmodeling.html

The topic models mailing list is a good forum for discussing topic modeling.

In R,

Some resources I compiled on Slideshare based on the above- Continue reading “Topic Models”

Does the Internet need its own version of credit bureaus?

Data miners love data. The more data they have, the better the models they can build. Consumers do not love data so much, and generally find sharing data a cumbersome task. They need to be incentivized to fill out survey forms and to sign up for loyalty programs. Lawyers and privacy advocates love to use examples of improper data collection and usage as the harbinger of an ominous scenario. George Orwell’s 1984 never “mentioned” anything about Big Brother trying to sell you one more loan, credit card or product.

Data generated by customers is now growing without their needing to fill out forms and surveys. This data is about their preferences, tastes and choices, and it is growing in size and depth because it is generated from social media channels on the Internet. It is this data that can be, and is, captured by social media analytics.

Mobile data is also growing: the use of location-based applications and of the Internet from mobile phones is leading to further increases in data about consumers. Increasingly, location-based applications help to provide a much more relevant context to the data generated. Mobile data alone is expected to grow to 15 exabytes by 2015.

People want to have more and more conversations publicly online, share pictures and activities, and interact with a large number of people whom they have never met, but they resent that information being used or abused without their knowledge.

Also, the Internet is increasingly being consolidated into a few players like Microsoft, Amazon, Google and Facebook, who are unable to agree on how to share that data among themselves. Interestingly, you can use Yahoo as a data middleman between Google and Facebook.

At the same time, more and more purchases are being made online by customers, and Internet advertising has grown much faster than other media.
Internet retail sales have the advantage that better demand predictability can lead to lower inventories, as retailers need not stock up displays to look good. An Amazon warehouse need not keep material simply to stock up its shelves like a K-Mart does.

Our Hypothesis – An Analogy with how Financial Data Marketing is managed offline

  1. Financial information regarding spending and saving is much more sensitive, yet the presence of credit bureaus alleviates these concerns.
  2. Credit bureaus collect information from all sources, then aggregate and anonymize the individual components accordingly. They use the SSN as a unique identifier.
  3. The Internet has a unique number too, called the Internet Protocol (IP) address.
  4. Should there be a unique identifier, like an Internet Security Number, for the Internet to ensure an adequate balance between the need for privacy and the need for appropriate targeting?

After all, no one complains about privacy intrusions if their credit bureau data is aggregated, rolled up, anonymized and turned into a propensity model for sending them direct mailers.
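To make "aggregated, rolled up, and anonymized" a little more concrete, here is a rough sketch in Python of what such a roll-up could look like. Everything in it is an assumption for illustration: the record layout, the key, the segment definitions and the minimum group size are made up, and it does not describe how real credit bureaus work. Note also that a keyed hash only pseudonymizes an identifier; it is the aggregation and the suppression of small groups that keep individuals out of the output.

```python
import hashlib
import hmac
from collections import defaultdict

SECRET_KEY = b"demo-key-not-for-production"  # hypothetical key, for illustration only
MIN_GROUP_SIZE = 2  # hypothetical threshold: smaller segments are suppressed


def pseudonymize(identifier):
    """Replace a raw identifier with a keyed hash (pseudonymization, not anonymization)."""
    digest = hmac.new(SECRET_KEY, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def roll_up(records, min_group_size=MIN_GROUP_SIZE):
    """Aggregate spend by (region, age_band), dropping segments with too few people."""
    groups = defaultdict(list)
    for rec in records:
        groups[(rec["region"], rec["age_band"])].append(rec)

    summary = {}
    for segment, members in groups.items():
        people = {pseudonymize(m["id"]) for m in members}
        if len(people) < min_group_size:
            continue  # a tiny segment could single out one person, so skip it
        total_spend = sum(m["spend"] for m in members)
        summary[segment] = {
            "people": len(people),
            "avg_spend": total_spend / len(people),
        }
    return summary


# Made-up sample records; the "id" values stand in for a unique identifier.
RECORDS = [
    {"id": "user-001", "region": "north", "age_band": "25-34", "spend": 120.0},
    {"id": "user-002", "region": "north", "age_band": "25-34", "spend": 80.0},
    {"id": "user-003", "region": "south", "age_band": "35-44", "spend": 300.0},
]

summary = roll_up(RECORDS)
print(summary)
```

Only segment-level numbers leave the function, and the single-person "south" segment is dropped. That segment-level output is the kind of input a propensity model for direct mailers would be built on.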

Advertising using Social Media and Internet

https://www.facebook.com/about/ads/#stories

1. A business creates an ad
Let’s say a gym opens in your neighborhood. The owner creates an ad to get people to come in for a free workout.
2. Facebook gets paid to deliver the ad
The owner sends the ad to Facebook and describes who should see it: people who live nearby and like running.
3. The right people see the ad
Facebook only shows you the ad if you live in town and like to run. That’s how advertisers reach you without knowing who you are.
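The three steps above can be sketched as a toy matching function in Python. This is my own illustration of the idea, not Facebook's actual system or API: the `Profile` and `Ad` classes, the town names and the report format are all hypothetical. The point is that matching happens on the platform side and the advertiser only gets an aggregate count back.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    """What the platform knows about a user (hypothetical layout)."""
    user_id: str
    town: str
    interests: frozenset


@dataclass(frozen=True)
class Ad:
    """What the advertiser submits: the ad text plus targeting criteria."""
    text: str
    target_town: str
    target_interest: str


def is_match(ad, profile):
    """True if the profile satisfies the ad's targeting criteria."""
    return profile.town == ad.target_town and ad.target_interest in profile.interests


def deliver(ad, profiles):
    """Platform-side step: decide which users see the ad."""
    return [p.user_id for p in profiles if is_match(ad, p)]


def advertiser_report(ad, profiles):
    """What goes back to the advertiser: a count, never the user identities."""
    return {"ad": ad.text, "people_reached": len(deliver(ad, profiles))}


# Made-up users and the gym ad from the example above.
PROFILES = [
    Profile("u1", "Springfield", frozenset({"running", "cooking"})),
    Profile("u2", "Springfield", frozenset({"chess"})),
    Profile("u3", "Shelbyville", frozenset({"running"})),
]
GYM_AD = Ad("Free workout at the new gym", "Springfield", "running")

shown_to = deliver(GYM_AD, PROFILES)
report = advertiser_report(GYM_AD, PROFILES)
print(report)
```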

Adding in credit bureau data, and legislative regulation for anonymizing and handling private data, can expand the Internet selling market, which is much more efficient from a supply chain perspective than the offline display-and-shop models.

Privacy Regulations on Marketing using Internet data
Should laws on opt-out lists (do not mail, do not call) be extended to “do not show ads” and “do not collect information” on social media? In the offline world, you can choose to be part of direct marketing or opt out of it by enrolling yourself in various do-not-solicit lists. On the Internet, the only escape from advertisements is to use the Adblock plugin if you are a Google Chrome or Firefox user. Even Facebook gives you many more ads than you need to see.

One reason for so many ads on the Internet is the lack of central anonymized data repositories for giving high-quality data to these marketing companies. Software that can be used for social media analytics is already available off the shelf.

The growth of the Internet has helped carve out a big industry for web analytics, so it is only a matter of time before social media analytics becomes a multibillion-dollar business as well. What new developments will be unleashed in this brave new world is just a matter of time and, of course, of the social media data!

Ads Alliance on Internet

Just saw the Digital Advertising Alliance’s (DAA) Self-Regulatory Program for Online Behavioral Advertising.

Multi-Site Data Collection Principles Broaden Self Regulation Beyond Online Behavioral Advertising
WASHINGTON, D.C., NOVEMBER 7, 2011

The new Principles consist of the following specific requirements:

  1. Transparency and consumer control for purposes other than OBA – The Multi-Site Data Principles call for organizations that collect Multi-Site Data for purposes other than OBA to provide transparency and control regarding Internet surfing across unrelated Websites.
  2. Collection / use of data for eligibility determination – The Multi-Site Data Principles prohibit the collection, use or transfer of Internet surfing data across Websites for determination of a consumer’s eligibility for employment, credit standing, healthcare treatment and insurance.
  3. Collection / use of children’s data – The Multi-Site Data Principles state that organizations must comply with the Children’s Online Privacy Protection Act (COPPA).
  4. Meaningful accountability – The Multi-Site Data Principles are subject to enforcement through strong accountability mechanisms.

http://www.aboutads.info/principles

The DAA Self-Regulatory Principles

 

The cross-industry Self-Regulatory Principles for Multi-Site Data augment the Self-Regulatory Principles for Online Behavioral Advertising (OBA) by covering the prospective collection of Web site data beyond that collected for OBA purposes. The existing OBA Principles and definitions remain in full force and effect and are not limited by the new principles.

The cross-industry Self-Regulatory Principles for Online Behavioral Advertising were developed by leading industry associations to apply consumer-friendly standards to online behavioral advertising across the Internet. Online behavioral advertising increasingly supports the convenient access to content, services, and applications over the Internet that consumers have come to expect at no cost to them.

The Education Principle calls for organizations to participate in efforts to educate individuals and businesses about online behavioral advertising and the Principles.

The Transparency Principle calls for clearer and easily accessible disclosures to consumers about data collection and use practices associated with online behavioral advertising. It will result in new, enhanced notice on the page where data is collected through links embedded in or around advertisements, or on the Web page itself.

The Consumer Control Principle provides consumers with an expanded ability to choose whether data is collected and used for online behavioral advertising purposes. This choice will be available through a link from the notice provided on the Web page where data is collected.

The Consumer Control Principle requires “service providers”, a term that includes Internet access service providers and providers of desktop applications software such as Web browser “tool bars” to obtain the consent of users before engaging in online behavioral advertising, and take steps to de-identify the data used for such purposes.

The Data Security Principle calls for organizations to provide appropriate security for, and limited retention of data, collected and used for online behavioral advertising purposes.

The Material Changes Principle calls for obtaining consumer consent before a Material Change is made to an entity’s Online Behavioral Advertising data collection and use policies unless that change will result in less collection or use of data.

The Sensitive Data Principle recognizes that data collected from children and used for online behavioral advertising merits heightened protection, and requires parental consent for behavioral advertising to consumers known to be under 13 on child-directed Web sites. This Principle also provides heightened protections to certain health and financial data when attributable to a specific individual.

The Accountability Principle calls for development of programs to further advance these Principles, including programs to monitor and report instances of uncorrected non-compliance with these Principles to appropriate government agencies. The CBBB and DMA have been asked and agreed to work cooperatively to establish accountability mechanisms under the Principles.

 

Ajay- So why the self regulations?

Answer- Shoddy maths in behaviorally targeted ads is leading to a glut of targeted ads, more than consumers can reasonably be expected to click on given their spending. On the Internet, unlike on television, cost is less of a barrier to OVER-ADVERTISING.

 

Data Documentation Initiative

Here is a nice initiative for standardizing data documentation in the social sciences (which can be quite a relief to legions of analysts).

http://www.ddialliance.org/what

 

 

 

 

Benefits of DDI

The DDI facilitates:

  • Interoperability. Codebooks marked up using the DDI specification can be exchanged and transported seamlessly, and applications can be written to work with these homogeneous documents.
  • Richer content. The DDI was designed to encourage the use of a comprehensive set of elements to describe social science datasets as completely and as thoroughly as possible, thereby providing the potential data analyst with broader knowledge about a given collection.
  • Single document – multiple purposes. A DDI codebook contains all of the information necessary to produce several different types of output, including, for example, a traditional social science codebook, a bibliographic record, or SAS/SPSS/Stata data definition statements. Thus, the document may be repurposed for different needs and applications. Changes made to the core document will be passed along to any output generated.
  • On-line subsetting and analysis. Because the DDI markup extends down to the variable level and provides a standard uniform structure and content for variables, DDI documents are easily imported into on-line analysis systems, rendering datasets more readily usable for a wider audience.
  • Precision in searching. Since each of the elements in a DDI-compliant codebook is tagged in a specific way, field-specific searches across documents and studies are enabled. For example, a library of DDI codebooks could be searched to identify datasets covering protest demonstrations during the 1960s in specific states or countries.
Also see-
  1. http://www.ddialliance.org/Specification/DDI-Codebook/2.1/DTD/Documentation/DDI2-1-tree.html
  2. http://www.ddialliance.org/Specification/DDI-Lifecycle/3.1/
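To illustrate the "precision in searching" benefit listed above, here is a small Python sketch that searches variable labels across two codebooks. The XML fragments are hand-written and heavily simplified. The element names (`codeBook`, `stdyDscr`, `dataDscr`, `var`, `labl`) are modeled on DDI Codebook 2.x as I understand it, but the fragments are not validated against the DTD, and the study titles and variables are made up.

```python
import xml.etree.ElementTree as ET

# Two simplified, hand-written codebooks (illustrative only, not valid DDI files).
CODEBOOKS = {
    "study_a.xml": """<codeBook>
  <stdyDscr><citation><titlStmt>
    <titl>Protest Events Survey</titl>
  </titlStmt></citation></stdyDscr>
  <dataDscr>
    <var name="V1"><labl>Number of protest demonstrations</labl></var>
    <var name="V2"><labl>State</labl></var>
  </dataDscr>
</codeBook>""",
    "study_b.xml": """<codeBook>
  <stdyDscr><citation><titlStmt>
    <titl>Household Income Panel</titl>
  </titlStmt></citation></stdyDscr>
  <dataDscr>
    <var name="INC"><labl>Annual household income</labl></var>
  </dataDscr>
</codeBook>""",
}


def search_variable_labels(codebooks, keyword):
    """Return (file, study title, variable name, label) for labels containing keyword."""
    keyword = keyword.lower()
    hits = []
    for filename, xml_text in codebooks.items():
        root = ET.fromstring(xml_text)
        title = root.findtext("stdyDscr/citation/titlStmt/titl", default="(untitled)")
        # Because every variable is tagged the same way, the search can be
        # restricted to variable labels rather than scanning free text.
        for var in root.iterfind("dataDscr/var"):
            label = var.findtext("labl", default="")
            if keyword in label.lower():
                hits.append((filename, title.strip(), var.get("name"), label))
    return hits


hits = search_variable_labels(CODEBOOKS, "protest")
print(hits)
```

A real library of DDI codebooks would be searched the same way, only with the full schema (and its namespaces) instead of these toy fragments.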