Best of Decision Stats: Modeling and Text Mining, Part 3

Here are some of the top articles by views in an area I love: modeling and text mining.

1) Karl Rexer – Rexer Analytics

http://www.decisionstats.com/2009/06/09/interview-karl-rexer-rexer-analytics/

Karl produces one of the most respected surveys capturing emerging trends in data mining and technology. Karl was also one of the most enthusiastic people I have interviewed, and I am thankful for his help in getting me some more interviews.

2) Gregory Piatetsky-Shapiro

One of the earliest and easily the best knowledge discoverer of all time, Gregory produces http://www.kdnuggets.com, and its newsletter is easily the must-read newsletter to be on. Gregory was doing data mining while the Google boys were still debating whether or not to drop out of Stanford.

Facebook Text Analytics

Here is a great presentation on Facebook Analytics using text mining.

Citation: Text Analytics Summit 2009 – Roddy Lindsay – “Social Media, Happiness, Petabytes and LOLs”
And here is a presentation on Hive and Hadoop:
HIVE: Data Warehousing & Analytics on Hadoop

Facebook sure looks like a surprisingly nice analytics company to work for! No wonder they have all but swamped the competition.

Interview SPSS Olivier Jouve

SPSS recently launched a major series of products in its text mining and data mining product portfolio and rebranded data mining to the PASW series. In an exclusive and extensive interview, Olivier Jouve, Vice President, Corporate Development at SPSS Inc, talks of science careers, the recent launches, open source support to R by SPSS, Cloud Computing and Business Intelligence.

Ajay: Describe your career in Science. Are careers in science less lucrative than careers in business development? What advice would you give to people re-skilling in the current recession on learning analytical skills?

Olivier: I have a Master of Science in Geophysics and Master of Science in Computer Sciences, both from Paris VI University. I have always tried to combine science and business development in my career as I like to experience all aspects – from idea to concept to business plan to funding to development to marketing to sales.

There was a study published earlier this year that said two of the three best jobs are related to math and statistics. This is reinforced by three societal forces that are converging – better uses of mathematics to drive decision making, the tremendous growth and storage of data, and especially in this economy, the ability to deliver ROI. With more and more commercial and government organizations realizing the value of Predictive Analytics to solve business problems, being equipped with analytical skills can only enhance your career and provide job security.

Ajay: So SPSS has launched new products within its Predictive Analytics Software (PASW) portfolio – Modeler 13 and Text Analytics 13? Is this old wine in a new bottle? What is new in technical terms? What is new for customers looking to mine textual information?

Olivier: Our two new products, PASW Modeler 13 (formerly Clementine) and PASW Text Analytics 13 (formerly Text Mining for Clementine), extend and automate the power of data mining and text analytics to the business user, while significantly enhancing the productivity, flexibility and performance of the expert analyst.

PASW Modeler 13 data mining workbench has new and enhanced functionality that quickly takes users through the entire data mining process – from data access and preparation to model deployment. Some of the newest features include Automated Data Preparation that conditions data in a single step by automatically detecting and correcting quality errors; Auto Cluster that gives users a simple way to determine the best cluster algorithm for a particular data set; and full integration with PASW Statistics (formerly SPSS Statistics).

With PASW Text Analytics 13, SPSS provides the most complete view of the customer through the combined analysis of text, web and survey data. While other companies only provide the text component, SPSS couples text with existing structured data, permitting more accurate results and better predictive modeling. The new version includes pre-built categories for satisfaction surveys, advanced natural language processing techniques, and it supports more than 30 different languages.

Ajay: SPSS has supported open source platforms – Python and R – before it became fashionable to do so. How has this helped your company?

Olivier: Open source software helps the democratization of the analytics movement and SPSS is keen on supporting that democratization while welcoming open source users (and their creativity) into the analytics framework.

Ajay: What are the differences and similarities between Text Analytics and Search Engines? Can we mix the two as well using APIs?

Olivier: Search Engines are fundamentally top-down in that you know what you are looking for when launching a query. However, Text Analytics is bottom-up, uncovering hidden patterns, relationships and trends locked in unstructured data – including call center notes, open-ended survey responses, blogs and social networks. Now businesses have a way of pulling key concepts and extracting customer sentiments, such as emotional responses, preferences and opinions, and grouping them into categories.

For instance, a call center manager will have a hard time working out why customers are unhappy and churn by running a search engine over millions of call center notes. What would be the query? But, by using Text Analytics, that same call center manager will discover the main reasons why customers are unhappy, and be able to predict if they are going to churn.

Ajay: Why is Text Analytics so important? How will companies use it now and into the future?

Olivier: Actually, the question you should ask is, “Why is unstructured data so important?” Today, more than ever, people love to share their opinions, through the estimated 183 billion emails sent, the 1.6 million blog posts, millions of inquiries captured in call center notes, and thousands of comments on diverse social networking sites and community message boards. And, let’s not forget all the data that flows through Twitter. Companies today would be short-sighted to ignore what their customers are saying about their products and services, in their own words. Those opinions – likes and dislikes – are essential nuggets and offer much more insight than demographic or transactional data for reducing customer churn, improving satisfaction, fighting crime, detecting fraud and increasing marketing campaign results.

Ajay: How is SPSS venturing into cloud computing and SaaS?

Olivier: SPSS has been at the origin of the PMML standard to allow organizations to provision their computing power in a very flexible manner – just like provisioning computing power through cloud computing. SPSS strongly believes in the benefits of a cloud computing environment, which is why all of our applications are designed with Service Oriented Architecture components. This enables SPSS to be flexible enough to meet the demands of the market as they change with respect to delivery mode. We are currently analyzing business and technical issues related to SPSS technologies in the cloud, such as the scoring and delivery of analytics. With regard to SaaS, we currently offer hosted services for our PASW Data Collection (formerly Dimensions) survey research suite of products.

Ajay: Do you think business intelligence is an overused term? Why do you think BI and Predictive Analytics failed in mortgage delinquency forecasting and reporting despite the financial sector being a big spender on BI tools?

Olivier: There is a big difference between business intelligence (BI) and Predictive Analytics. Traditional BI technologies focus on what’s happening now or what’s happened in the past by primarily using financial or product data. For organizations to take the most effective action, they need to know and plan for what may happen in the future by using people data – and that’s harnessed through Predictive Analytics.

Another way to look at it – Predictive covers the entire capture, predict and act continuum – from the use of survey research software to capture customer feedback (attitudinal data), to creating models to predict customer behaviors, and then acting on the results to improve business processes. Predictive Analytics, unlike BI, provides the secret ingredient and answers the question, “What will the customer do next?”

That being said, financial institutions didn’t need to use Predictive Analytics to see that some lenders sold mortgages to unqualified individuals likely to default. Predictive Analytics is an incredible application used to detect fraud, waste and abuse. Companies in the financial services industry can focus on mitigating their overall risk by creating better predictive models that not only encompass richer data sets, but also better rules-based automation.

Ajay: What do people do at SPSS to have fun when they are not making complex mathematical algorithms?

Olivier: SPSS employees love our casual, friendly atmosphere, our professional and talented colleagues, and our cool, cutting-edge technology. The fun part comes from doing meaningful work with great people, across different groups and geographies. Of course being French, I have ensured that my colleagues are fully educated on the best wine and cuisine. And being based in Chicago, there is always a spirited baseball debate between the Cubs and White Sox. However, I have yet to convince anyone that rugby is a better sport.

Biography

Olivier Jouve is Vice President, Corporate Development, at SPSS Inc. He is responsible for defining SPSS strategic directions and growth opportunities through internal development, mergers and acquisitions and/or tactical alliances. As a pioneer in the field of data and text mining for the last 20 years, he has created the foundation of Text Analytics technology for analyzing customer interactions at SPSS. Jouve is a successful serial entrepreneur and has had his works published internationally in the areas of Analytical CRM, text mining, search engines, competitive intelligence and knowledge management.

Basic Text Mining: 3 Simple Paths


Text mining, in which you search alphanumeric data for meaningful patterns, is relatively more complex than plain numeric data crunching. The reason is that the human eye can scan only a few hundred rows of data before getting tired, and analytics software algorithms need to be programmed properly or else they miss the relevant solution or text. As an example, consider how many Punjabis live in Delhi (the stats needed): suppose you have a dataset of all the names in Delhi and want to send an SMS contest (the marketing decision) on Lohri (a Punjabi festival).

Text manipulation can be done with the TRIM and LOWER functions in Excel and the corresponding functions in SAS. For mining, use the following options:

1) SAS Basic Text Mining - Using Only Base SAS

In SAS you can use the INDEXW function for text mining.

As per the SAS OnlineDoc:

INDEXW(source, excerpt)

Arguments

source
specifies the character expression to search.
excerpt
specifies the string of characters to search for in the character expression. SAS removes the leading and trailing blanks from excerpt.

The INDEXW function searches source, from left to right, for the first occurrence of excerpt and returns the position in source of the substring’s first character. If the substring is not found in source, INDEXW returns a value of 0. If there are multiple occurrences of the string, INDEXW returns only the position of the first occurrence.
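As a rough illustration of the word-level search that INDEXW performs, here is a minimal Python sketch. It is not the SAS implementation: the function name `indexw`, the sample names, and the choice of whitespace as the only word delimiter are all assumptions made for this example.

```python
import re


def indexw(source: str, excerpt: str) -> int:
    """Approximate SAS INDEXW: 1-based position of the first occurrence of
    `excerpt` as a whole word in `source`, or 0 if it is not found.

    Assumption: words are delimited by whitespace or the ends of the string.
    """
    word = excerpt.strip()  # leading/trailing blanks in excerpt are ignored
    if not word:
        return 0  # simplification: a blank excerpt is treated as "not found"
    # The excerpt must not be glued to other non-space characters on either side.
    pattern = r"(?<!\S)" + re.escape(word) + r"(?!\S)"
    match = re.search(pattern, source)
    return match.start() + 1 if match else 0


# Hypothetical sample data: flag names that contain the word "Singh".
names = ["Harpreet Singh Gill", "Singhania Traders", "Anita Sharma"]
flagged = [name for name in names if indexw(name, "Singh") > 0]
```

A plain substring search would also pick up "Singhania Traders"; requiring a whole word avoids that kind of false positive, which matters when the match feeds a marketing list.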

2) MS EXCEL

You can use MS Excel for text mining too. I recommend Office 2007 simply because it can handle more rows.

The function in Excel is SEARCH
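To show what a SEARCH-style lookup does, here is a small Python sketch that mirrors the idea rather than Excel itself. It assumes a cleanup step roughly like TRIM and LOWER, followed by a case-insensitive substring search that returns a 1-based position. The helper names and sample rows are hypothetical, and the wildcard support that Excel's SEARCH offers is left out.

```python
from typing import Optional


def clean(text: str) -> str:
    """Rough analogue of LOWER(TRIM(text)): strip the ends, collapse runs of
    whitespace to single spaces, and lower-case the result."""
    return " ".join(text.split()).lower()


def search(find_text: str, within_text: str) -> Optional[int]:
    """Case-insensitive substring search on the cleaned text.

    Returns a 1-based position, or None when there is no match
    (Excel would show an error value instead).
    """
    position = clean(within_text).find(clean(find_text))
    return position + 1 if position >= 0 else None


# Hypothetical rows from a names column, with messy spacing and casing.
rows = ["  Gurpreet   KAUR ", "Rohit Verma", "Manpreet Kaur Sandhu"]
matches = [row for row in rows if search("kaur", row) is not None]
```

Cleaning both sides before searching means stray spaces and inconsistent capitalization in the source column do not hide a match.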


3) MS ACCESS

In MS Access you can use LIKE queries to create a different table or append a value to certain columns.

Example:

Some problems can’t be solved with comparisons, e.g. “His name begins with Mc or Mac.” In a case like this, wildcards are required, and are represented in SQL with the % sign and the LIKE keyword.

e.g.

SELECT au_lname, city
FROM authors
WHERE au_lname LIKE 'Mc%' OR au_lname LIKE 'Mac%'
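To try a LIKE query of this kind without Access, here is a minimal sketch using Python's built-in sqlite3 module. SQLite stands in for Access here, and the sample rows are invented for illustration. Note that Access's default query mode uses * as the wildcard rather than %, so the pattern may need adjusting there.

```python
import sqlite3


def find_mc_names(rows):
    """Load (au_lname, city) rows into an in-memory SQLite table and return
    those whose last name starts with 'Mc' or 'Mac', sorted by last name."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE authors (au_lname TEXT, city TEXT)")
        conn.executemany("INSERT INTO authors VALUES (?, ?)", rows)
        cursor = conn.execute(
            "SELECT au_lname, city FROM authors "
            "WHERE au_lname LIKE 'Mc%' OR au_lname LIKE 'Mac%' "
            "ORDER BY au_lname"
        )
        return cursor.fetchall()
    finally:
        conn.close()


# Hypothetical sample rows, invented for this example.
sample = [
    ("McBadden", "Vacaville"),
    ("MacFeather", "Oakland"),
    ("Smith", "Berkeley"),
    ("Ringer", "Salt Lake City"),
]
result = find_mc_names(sample)
```

In SQLite, LIKE is case-insensitive for ASCII letters by default, so a lower-case "mcbadden" would match as well. A prefix pattern like 'Mac%' will also catch unrelated surnames that merely begin with those letters, so the output still needs a quick review.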

UPDATED: The above post is now obsolete; there are easier and better ways to do text mining, including Weka and R.