An open source and better Search Engine

Search engines are a difficult subject to talk about. There are multiple experts and multiple vendors (Microsoft, Yahoo, Google, Cuil), plus newer innovations like Cosmix's blended search and wiki search (including the Digg bar, whose features Google has now introduced as well). Content itself has exploded from the plain websites of 1999 to websites, blog posts, RSS feeds, tweets, Facebook profiles, online communities, voice, podcasts and video. Building quantitative measures to index and rank these new types of content argues for making the algorithms behind search open source, under strict Creative Commons licensing, with third-party developers creating search-algorithm extensions.

This idea seems difficult to implement, but it has been done before. No one creates Palo Alto-style research labs anymore; all scientists and researchers must first sign away their copyrights before beginning their research.

The year 2009 is different from the year 1999, and PageRank is no longer a math-based algorithm; it is a marketing brand. Time for the Stanford dropouts to go back to school and put some more math, and some less marketing (and fewer pranks on Wolfram, please), into their search engine. And Paul Allen, who gave Stanford the building where the Google algorithm was first thought up, needs to spend some of his billions to venture-fund a new wave of innovation in search engines. Is this wishful thinking? Maybe. I just need a better search engine than Google right now. Perhaps Herr Schmidt could take some time off from viewing the mountains in Mountain View and measure customer satisfaction instead of just market share, in a market that is non-competitive and likely to face antitrust scrutiny in the US and Europe very soon. Better to open up some of the ranking algorithm's features so that all websites can implement the SEO tactics thus revealed and create a better world wide web, negating the information asymmetry of a closed-source search engine.

An E-Book Review

Here is a nice e-book I got from my colleagues at The Customer Collective. I really loved the friendly design, which makes this a very easy e-book to read, unlike other self-help books. It has tips from 11 top sales experts on how to sell in their specific sectors during a recession, and these are not essays but bullet-pointed, specific action items. Hat tip to the editor and the authors here.

You can download the e-book here.


Working Together - Yuuguu

Here is a nice piece of software called Yuuguu. I really liked it because it enables screen sharing for secure virtual meetings, it is quite light on processor and memory, and best of all it works across Ubuntu 64-bit, Windows XP and Mac OS.

So if you want to work on a project team that sits across the seven seas and the big pond, and you feel the best way to communicate is to talk through a demonstration rather than hand over documentation, then Yuuguu is the right software for you. The worst part of this software is probably the name. And yes, it is free, with a paid version as well.

See www.Yuuguu.com


KXEN and a Data Mining Survey

Recently KXEN, the data mining and modeling automation company that has also pioneered social network analytics software, came in for a bit of customer love in a data mining survey.


As per the site:

KXEN’s next generation automated data mining software is a strategic solution for 90% of user organizations and has won their support and praise in a new customer satisfaction survey, the findings of which are revealed today. 

Of almost 2,000 users polled, 90% of those responding said the company’s advanced analytics software was strategic to their activity, 87% were highly or very highly satisfied and 85% agreed the software had met or exceeded all of their expectations. The results underscore KXEN’s growing importance in a market traditionally dominated by more costly, harder to use first generation offerings.

KXEN’s analytic software was also highly rated for its simple, clear interface with all respondents agreeing that KXEN solutions were easy to use, and 90% stating its new graphical front end had brought yet more usability benefits.  Confirming these findings, users responding included sales, marketing and other line of business staff as well as specialist analysts, data miners, academics and statisticians.

Turning to the results of using KXEN’s software, 98% of all those responding stated it had improved their overall business with the same number agreeing it had speeded up their data modeling activities. 96% said KXEN had increased the value of predictive analytics in their companies.

Of course there are numerous surveys (probably the best is from KDnuggets), and I am trying to find the raw data and samples for this survey as I write. But it is a promising step up for a company I have admired since 2004, when I first tested it; as late as last year I was building online models with it. Predictably, Roger Haddad, whom we interviewed in January 2009, was all praise for his team and its splendid product. Well done, guys, take a bow; it is about time! A great example of a company that quietly builds innovative analytics without getting into any tangles over open source or business intelligence sentiments.

Ajay- I am a consultant to KXEN for Social Networks Analysis.

Interview SPSS Olivier Jouve

SPSS recently launched a major series of products in its text mining and data mining portfolio and rebranded the data mining line as the PASW series. In an exclusive and extensive interview, Olivier Jouve, Vice President of Corporate Development at SPSS Inc., talks of science careers, the recent launches, SPSS's open source support for R, cloud computing and business intelligence.

Ajay: Describe your career in Science. Are careers in science less lucrative than careers in business development? What advice would you give to people re-skilling in the current recession on learning analytical skills?

Olivier: I have a Master of Science in Geophysics and a Master of Science in Computer Science, both from Paris VI University. I have always tried to combine science and business development in my career, as I like to experience all aspects, from idea to concept to business plan to funding to development to marketing to sales.

There was a study published earlier this year that said two of the three best jobs are related to math and statistics. This is reinforced by three converging societal forces: better uses of mathematics to drive decision making, the tremendous growth and storage of data, and, especially in this economy, the ability to deliver ROI. With more and more commercial and government organizations realizing the value of Predictive Analytics to solve business problems, being equipped with analytical skills can only enhance your career and provide job security.

Ajay: So SPSS has launched new products within its Predictive Analytics Software (PASW) portfolio, Modeler 13 and Text Analytics 13. Is this old wine in a new bottle? What is new in technical terms? What is new for customers looking to mine textual information?

Olivier: Our two new products, PASW Modeler 13 (formerly Clementine) and PASW Text Analytics 13 (formerly Text Mining for Clementine), extend and automate the power of data mining and text analytics for the business user, while significantly enhancing the productivity, flexibility and performance of the expert analyst.

The PASW Modeler 13 data mining workbench has new and enhanced functionality that quickly takes users through the entire data mining process, from data access and preparation to model deployment. Some of the newest features include Automated Data Preparation, which conditions data in a single step by automatically detecting and correcting quality errors; Auto Cluster, which gives users a simple way to determine the best clustering algorithm for a particular data set; and full integration with PASW Statistics (formerly SPSS Statistics).

With PASW Text Analytics 13, SPSS provides the most complete view of the customer through the combined analysis of text, web and survey data.   While other companies only provide the text component, SPSS couples text with existing structured data, permitting more accurate results and better predictive modeling. The new version includes pre-built categories for satisfaction surveys, advanced natural language processing techniques, and it supports more than 30 different languages.

Ajay: SPSS supported open source platforms (Python and R) before it became fashionable to do so. How has this helped your company?

Olivier: Open source software helps the democratization of the analytics movement and SPSS is keen on supporting that democratization while welcoming open source users (and their creativity) into the analytics framework.

Ajay: What are the differences and similarities between Text Analytics and Search Engines? Can we mix the two as well using APIs?

Olivier: Search Engines are fundamentally top-down in that you know what you are looking for when launching a query. Text Analytics, however, is bottom-up, uncovering hidden patterns, relationships and trends locked in unstructured data, including call center notes, open-ended survey responses, blogs and social networks. Now businesses have a way of pulling out key concepts, extracting customer sentiments such as emotional responses, preferences and opinions, and grouping them into categories.

For instance, a call center manager will have a hard time extracting why customers are unhappy and churn by using a search engine over millions of call center notes. What would be the query? But by using Text Analytics, that same call center manager will discover the main reasons why customers are unhappy and be able to predict whether they are going to churn.
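As a toy illustration of that bottom-up idea (this is not SPSS's technology, and the category names and keywords below are invented for the example), even a simple keyword-dictionary pass over call-center notes can surface complaint themes without anyone formulating a query:

```python
from collections import Counter

# Illustrative category dictionaries -- invented keywords, not SPSS's lexicons.
CATEGORIES = {
    "billing": {"bill", "charge", "overcharged", "refund"},
    "service_quality": {"rude", "slow", "unhelpful", "waited"},
    "churn_risk": {"cancel", "switch", "competitor", "leaving"},
}

def categorize(note):
    """Return the set of categories whose keywords appear in a note."""
    words = set(note.lower().split())
    return {cat for cat, keys in CATEGORIES.items() if words & keys}

notes = [
    "customer was overcharged on last bill and wants a refund",
    "caller waited 40 minutes and says she will switch to a competitor",
    "routine address change",
]

# Bottom-up aggregation: count which themes dominate across all notes.
theme_counts = Counter(cat for note in notes for cat in categorize(note))
print(theme_counts.most_common())
```

Real text analytics layers linguistics (negation, stemming, sentiment) on top of this idea, but even a dictionary pass turns a pile of free-text notes into countable categories that can feed a churn model.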

Ajay: Why is Text Analytics so important?  How will companies use it now and into the future?
Olivier: Actually, the question you should ask is, "Why is unstructured data so important?" Today, more than ever, people love to share their opinions: through the estimated 183 billion emails sent, the 1.6 million blog posts, millions of inquiries captured in call center notes, and thousands of comments on diverse social networking sites and community message boards. And let's not forget all the data that flows through Twitter. Companies today would be short-sighted to ignore what their customers are saying about their products and services, in their own words. Those opinions, likes and dislikes, are essential nuggets and bear much more insight than demographic or transactional data for reducing customer churn, improving satisfaction, fighting crime, detecting fraud and increasing marketing campaign results.

Ajay: How is SPSS venturing into cloud computing and SaaS?

Olivier: SPSS has been at the origin of the PMML standard, which allows organizations to provision their computing power in a very flexible manner, just like provisioning computing power through cloud computing. SPSS strongly believes in the benefits of a cloud computing environment, which is why all of our applications are designed with Service Oriented Architecture components. This enables SPSS to be flexible enough to meet the demands of the market as they change with respect to delivery mode. We are currently analyzing business and technical issues related to SPSS technologies in the cloud, such as the scoring and delivery of analytics. In regards to SaaS, we currently offer hosted services for our PASW Data Collection (formerly Dimensions) survey research suite of products.

Ajay: Do you think business intelligence is an overused term? Why do you think BI and Predictive Analytics failed in mortgage delinquency forecasting and reporting, despite the financial sector being a big spender on BI tools?

Olivier: There is a big difference between business intelligence (BI) and Predictive Analytics. Traditional BI technologies focus on what's happening now or what's happened in the past, primarily using financial or product data. For organizations to take the most effective action, they need to know and plan for what may happen in the future by using people data, and that's harnessed through Predictive Analytics.

Another way to look at it: Predictive Analytics covers the entire capture, predict and act continuum, from the use of survey research software to capture customer feedback (attitudinal data), to creating models to predict customer behaviors, to acting on the results to improve business processes. Predictive Analytics, unlike BI, provides the secret ingredient and answers the question, "What will the customer do next?"

That being said, financial institutions didn't need Predictive Analytics to see that some lenders sold mortgages to unqualified individuals likely to default. Predictive Analytics is an incredible application for detecting fraud, waste and abuse. Companies in the financial services industry can focus on mitigating their overall risk by creating better predictive models that encompass not only richer data sets but also better rules-based automation.

Ajay: What do people do at SPSS to have fun when they are not making complex mathematical algorithms?
Olivier: SPSS employees love our casual, friendly atmosphere, our professional and talented colleagues, and our cool, cutting-edge technology. The fun part comes from doing meaningful work with great people, across different groups and geographies. Of course, being French, I have ensured that my colleagues are fully educated on the best wine and cuisine. And being based in Chicago, there is always a spirited baseball debate between the Cubs and the White Sox. However, I have yet to convince anyone that rugby is a better sport.

Biography

Olivier Jouve is Vice President, Corporate Development, at SPSS Inc. He is responsible for defining SPSS's strategic directions and growth opportunities through internal development, mergers and acquisitions, and tactical alliances. A pioneer in the field of data and text mining for the last 20 years, he created the foundation of the Text Analytics technology for analyzing customer interactions at SPSS. Jouve is a successful serial entrepreneur and has been published internationally in the areas of analytical CRM, text mining, search engines, competitive intelligence and knowledge management.

terrific Tr.im trims Tweet time

Okay, the title of this post was a bad attempt at a haiku. But the tr.im plugin for Firefox is incredible and helps you tweet interesting reading in a matter of seconds. More importantly, it shows you the analytics behind how many actual users went to a particular tr.im URL. While Tr.im is yet another URL-shortening service like tinyurl.com and bit.ly, what makes it stand out in a terrific manner are the following innovations:

1) User friendly Firefox Plugin that can be downloaded from https://addons.mozilla.org/en-US/firefox/addon/10232/

See the screenshot of the Tr.im panel, which conveniently opens on the left. The statistics can be seen in a separate window (note the Twitterfox application open on the right; that is a separate application).

2) Analytics tracking the locations of the people who click on the URL, and whether they were human or a bot.

3) Seamless Twitter integration even for multiple accounts

So it seems you will soon run out of excuses to stay away from Twitter, and all the additional social network data being generated could really help the next generation of response and online propensity models.

Tr.im that!!


KXEN – Automated Regression Modeling

I have used KXEN many times for building and testing propensity models. The regression modeling feature of KXEN is awesome in the sense that it makes models very easy to build and deliver.

The KXEN module K2R is responsible for this and uses robust regression. A word on the basic mathematical theory behind KXEN's automated modeling: the technique is called Structural Risk Minimization. You can read more on the basic mathematical technique at http://www.svms.org/srm/. The following is an extract from the same source.

Structural risk minimization (SRM) (Vapnik and Chervonenkis, 1974) is an inductive principle for model selection used for learning from finite training data sets. It describes a general model of capacity control and provides a trade-off between hypothesis space complexity (the VC dimension of approximating functions) and the quality of fitting the training data (empirical error). The procedure is outlined below.

  1. Using a priori knowledge of the domain, choose a class of functions, such as polynomials of degree n, neural networks having n hidden layer neurons, a set of splines with n nodes or fuzzy logic models having n rules.
  2. Divide the class of functions into a hierarchy of nested subsets in order of increasing complexity. For example, polynomials of increasing degree.
  3. Perform empirical risk minimization on each subset (this is essentially parameter selection).
  4. Select the model in the series whose sum of empirical risk and VC confidence is minimal.
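The "VC confidence" in step 4 is the capacity term of Vapnik's generalization bound: with probability at least 1 − η over a training sample of size n, a function f drawn from a class of VC dimension h satisfies

```latex
R(f) \;\le\; R_{\mathrm{emp}}(f) \;+\; \sqrt{\frac{h\left(\ln\frac{2n}{h} + 1\right) - \ln\frac{\eta}{4}}{n}}
```

so SRM selects the nested subset whose minimizer best balances the empirical risk against this square-root capacity term.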

Sewell (2006) notes that SVMs use the spirit of the SRM principle.

“Structural risk minimization (SRM) (Vapnik 1995) uses a set of models ordered in terms of their complexities. An example is polynomials of increasing order. The complexity is generally given by the number of free parameters. VC dimension is another measure of model complexity. In equation 4.37, we can have a set of decreasing λi to get a set of models ordered in increasing complexity. Model selection by SRM then corresponds to finding the model simplest in terms of order and best in terms of empirical error on the data.”
Alpaydin (2004), pages 80-81
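The four-step procedure above can be sketched in plain Python. This is a toy stand-in, not KXEN's patented algorithm: the nested subsets are polynomials of increasing degree, empirical risk minimization is ordinary least squares, and a penalty with the same square-root shape as the VC confidence (using the number of free parameters as a capacity proxy and dropping the η term) substitutes for the true bound:

```python
import math
import random

random.seed(0)

# Synthetic data: a quadratic signal plus Gaussian noise.
xs = [i / 10 for i in range(40)]
ys = [1.0 + 2.0 * x - 0.5 * x * x + random.gauss(0, 0.3) for x in xs]

def fit_poly(xs, ys, d):
    """Least-squares fit of a degree-d polynomial via the normal equations."""
    n = d + 1
    A = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    for col in range(n):                     # Gaussian elimination, partial pivoting
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    coef = [0.0] * n
    for i in reversed(range(n)):             # back substitution
        coef[i] = (b[i] - sum(A[i][j] * coef[j] for j in range(i + 1, n))) / A[i][i]
    return coef

def emp_risk(coef, xs, ys):
    """Mean squared training error (the empirical risk)."""
    return sum((sum(c * x ** i for i, c in enumerate(coef)) - y) ** 2
               for x, y in zip(xs, ys)) / len(xs)

# SRM-style selection over nested subsets (degrees 1..7): empirical risk
# plus a capacity penalty that grows with the number of free parameters.
n = len(xs)
scores = {}
for d in range(1, 8):
    h = d + 1                                # free parameters as a capacity proxy
    penalty = math.sqrt(h * (math.log(2 * n / h) + 1) / n)
    scores[d] = emp_risk(fit_poly(xs, ys, d), xs, ys) + penalty

best = min(scores, key=scores.get)
print("selected degree:", best)
```

Note that the selection never touches a holdout set: the capacity penalty alone keeps the higher-degree polynomials from winning on training error.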

Now back to the automated regression modeling.

Robust Regression

K2R is a universal solution for classification, regression and attribute importance. It enables the prediction of behaviors (nominal targets) or quantities (continuous targets).

Unlike traditional regression algorithms, K2R can safely handle a very high number of input attributes (over 10,000) in an automated fashion. K2R provides indicators and graphs so that the quality and robustness of trained models can be easily assessed. K2R graphically displays attribute importance, which gives the relative importance of each attribute in explaining a given business question. At the same time it gives a clear indication of which attributes either contain no relevant information or are redundant with other attributes.

Benefits: The business value of a data mining project is increased by either training more models or completing the project faster. The ability to train more models allows a larger number of scenarios to be tested at a higher level of granularity. For example, if a direct marketing campaign benefits from separate models trained per region, per customer segment and per month, the automation of K2R allows all of these models to be trained and safely deployed using the same amount or fewer resources than with traditional tools.

What: K2R is a regression algorithm that allows building models to predict categories or continuous variables.

Why: Traditionally, building robust predictive models required a lot of time and expertise, which prevented companies from using data mining as part of their everyday business decisions. K2R makes it easy to build and deploy predictive models in a fraction of the time it takes using classical statistical tools.

How: K2R maps a set of descriptive attributes (model inputs) to target attributes (model outputs). It uses an algorithm patented by KXEN, a derivation of the principle described by V. Vapnik as "Structural Risk Minimization." Instead of looking for the best performance on a known dataset, K2R automatically finds the best compromise between quality and robustness. The resulting models are expressed as a polynomial of the inputs; the only element specified by the user is the polynomial degree. To improve modeling speed, K2R can also build multi-target models.

Benefits for the business user: K2R allows the business user to easily build and understand advanced predictive models without statistical knowledge. A model can be created in a matter of minutes. Two performance indicators describe model quality (Ki) and model reliability, the ability to produce similar results on new data (Kr).
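KXEN does not publish closed-form definitions of Ki and Kr here, so the following is only a loose analogy under my own assumptions: read quality off a holdout AUC and robustness off the agreement between training and holdout performance. The names `ki_like` and `kr_like` are mine, not KXEN's:

```python
import random

random.seed(1)

def make_data(n):
    """Simulate (score, response) pairs: a higher score means a likelier 'yes'."""
    out = []
    for _ in range(n):
        x = random.random()
        y = 1 if random.random() < x else 0
        out.append((x, y))
    return out

data = make_data(2000)
train, holdout = data[:1000], data[1000:]

def auc(pairs):
    """Area under the ROC curve for a single score column (a rank statistic)."""
    pos = [x for x, y in pairs if y == 1]
    neg = [x for x, y in pairs if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Stand-ins for KXEN's indicators (my naming, not KXEN's definitions):
ki_like = 2 * auc(holdout) - 1                        # predictive quality in [0, 1]
kr_like = 1 - abs(auc(train) - auc(holdout))          # train/holdout agreement
print(round(ki_like, 3), round(kr_like, 3))
```

A model that scores well on training data but poorly on the holdout keeps a high quality number while the robustness number collapses, which is exactly the failure mode a Kr-style indicator is meant to flag.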

K2R graphically displays the individual variable contribution to the model, which helps to select the most important variables explaining a given business question. At the same time it avoids focusing on data that contains no information.

Models can directly be applied in a simulation mode for a single input dataset predicting the score for an individual business question in real time.

Benefits for the Data Mining expert: K2R frees time for Data Mining professionals to apply their expertise in areas where they add more value instead of spending several days to tune a model. K2R produces results within minutes (less than 15 seconds on a laptop with 50,000 lines and 20 variables).

Here is a case study from the company itself.

Marketing campaign usage scenario

* Send a “Test mailing” to 5,000 customers to offer them a new product
* Collect the results of your test mailing to build a “Training” data set that associates things you know about customers prior to the mailing with the answers to your business question
* Train a model to “predict” the Yes/No answer
* Check the quality and robustness of your model (Ki, Kr)
* Apply the model to the 1,000,000 other customers in your database: this model associates each individual customer with a probability for answering Yes. Because you are using a robust model, the sum of probabilities is a good indicator of how many people will answer yes to this mail
* Send your mailing only to those customers with a high probability to respond positively, or use our built-in profit curves to optimize your return on the campaign
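Numerically, the steps above can be mimicked with the simplest possible model, observed response rates per customer segment. The segments and rates below are made up for the illustration, and this is not K2R's algorithm:

```python
import random
from collections import defaultdict

random.seed(2)

# Hypothetical response rates per segment -- illustrative numbers only.
TRUE_RATE = {"gold": 0.20, "silver": 0.08, "bronze": 0.02}
SEGMENTS = list(TRUE_RATE)

def draw(n):
    """Simulate n customers as (segment, responded) pairs."""
    return [(seg, random.random() < TRUE_RATE[seg])
            for seg in (random.choice(SEGMENTS) for _ in range(n))]

# Steps 1-2: the test mailing to 5,000 customers becomes the training set.
test_mailing = draw(5000)

# Step 3: "train" the model as the observed response rate per segment.
hits, counts = defaultdict(int), defaultdict(int)
for seg, responded in test_mailing:
    counts[seg] += 1
    hits[seg] += responded
model = {seg: hits[seg] / counts[seg] for seg in counts}

# Step 5: apply to the full base; each customer gets a probability, and the
# sum of probabilities estimates the total number of "yes" answers.
base = [seg for seg, _ in draw(100000)]
expected_yes = sum(model[seg] for seg in base)

# Step 6: mail only customers whose predicted probability clears a threshold.
targeted = [seg for seg in base if model[seg] >= 0.12]
print(round(expected_yes), len(targeted))
```

A real propensity model replaces the three-segment lookup with a regression over thousands of attributes, but the logic of step 5, summing per-customer probabilities to forecast total response, is unchanged.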

Example: Regression: Dealer evaluation usage scenario

* Collect information about the past performance of your dealers two years ago and associate it with how much of your product they sold one year ago
* Train a model to predict how much a dealer will sell based on the available information
* Check the quality and robustness of the model (Ki, Kr)
* Apply the model to all of your dealers today: the model associates each dealer with an estimate of how many products he will sell
* Sum up the estimates to predict how much you will sell next year. This is the base line for your sales forecast.

In my next post I will include screenshots showing how to build an automated regression model using KXEN.

Ajay Disclaimer- I am a consultant to KXEN for social networks.