The Big Data Event- Why am I here?

I am here braving New York’s cold weather as I prepare for this evening’s events. If you follow this blog closely (including the poems), it is a welcome change. New York is a nice city: people are friendly if you ask them nicely, and the bus is a great way to watch the city. Best of all, I like the crowds, which I have grown used to while living in India.

Why Am I here?

Because the topics discussed here are so cutting edge that I cannot find anyone at university willing to teach me Hadoop and MapReduce and, at the same time, teach me how to do statistics on them (as in, how do we run a k-means clustering on a 1 terabyte dataset?).

I asked the organizers what makes the event special (every event promises special mojo, after all).

This is what they said-

What is the unique value proposition of the event that will help developers and both current and potential customers?

The essence of the event is to explore new innovations in massively-parallel processing data warehousing technology and how it can help companies gain more insight from their data. Applications include fraud detection, behavioral targeting, social network analysis, better predictions/forecasting, bioinformatics, etc. We are exploring how MapReduce and Hadoop can be integrated into the enterprise IT system to help evolve data warehousing/BI/data mining.

And to put it even more nicely:

“The industry’s first big data event, Big Data Summit ’09, being held this evening in New York City, will showcase Hadoop’s fit with MPP data warehouses. Aster Data will be presenting alongside Colin White, President and Founder of BI Research, Mike Brown of comScore Inc., and Jonathan Goldman, who represents LinkedIn.”

That’s good enough for me to drop into the Roosevelt Hotel on East 45th Street at around 6 pm for some reluctant networking (read: beers). Five years ago, while working for GE, I used to run queries using SAS on a 147 million row database (the size of the DB) and wait 3 hours for the results to come back. Today that much data fits very snugly on my laptop. How soon will we have terabyte-level personal computing and petabyte-level business computing? And what challenges will that pose to standard statistical assumptions and to the syncing of hardware and software? Big, big data is an interesting area to watch.

SPSS gets Directions

Here is a link to the Predictive Analytic Conference by SPSS (the first after the Big Blue announcement): http://www.spss.com/spssdirections/na/index.htm

It should be interesting for existing clients and SPSS watchers.

Analyzing Monkeys

I once promised a reader a long time back that I would not get into politics, but something unexpected hit me like a big truck.

At what point do you decide your boss is a racist? How do you analyze the difference between jokes and racial insults?

Another interesting analysis

Citation Emerald

Red R- A new beginning

Check out an interesting new interface to R.

Note: I haven’t tested it yet but plan to do so shortly, as I am using Ubuntu 9 almost exclusively nowadays.

R fans who are not quite overjoyed with the wonderful beauty and charm of the traditional R GUI may want to give it a try.

Citation-

http://code.google.com/p/r-orange/

Note- This website does not assume responsibility for any software glitches, as R comes with no warranty, unlike other software that comes loaded with both a warranty and then bug-fix patches.

Losing a Million Bucks: Netflix Prize Interview

I (and collective pseudo-geeks across the world) lost a potential million dollars when the following team won the Netflix Prize. In disgust, I just renewed my Netflix subscription and noticed a 10% increase in the way I liked them.

Jokes apart, here is an excerpt (perhaps one of the few ever) of an interview with the Netflix winners done by the great Eric Siegel, PhD.

Eric is conference chair of the Predictive Analytics Conference (a King Arthur’s round table conference of all the shining knights of the data analytics world).

Citation- http://www.predictiveanalyticsworld.com/layman-netflix-leader.php

[ES] With no relevant background in statistics — let alone product recommendations specifically — what capabilities or background did make your success possible? Do you consider yourselves mathematicians, or at least strong with math?

[MC] I am certainly not a mathematician – I have engineering level skill. I consider Martin Piotte to have an exceptional mathematical mind (he participated successfully in international math contests when he was a student) even though he never formally studied in that field. In the end, the mathematics used in this contest seem very complex, but are really rather simple. Compared to what most people think, this was more of an engineering contest than a mathematical contest [See Martin’s response below for elaboration on this central point. -Ed]. Also, I think that having a perhaps less in-depth but wider array of skills and knowledge helped us.

[ES] You’ve said, when first getting started, you learned many core strategies/techniques from the Netflix Prize discussion board. Did you do much reading or research elsewhere to ramp up?

[MC] Having started late in the competition, the forum was a good starting point as many avenues had already been explored and links had been posted to many interesting papers. In the end though, reading and getting a good understanding of the actual research papers was a very important step. The forum was also a place where people proposed new (sometimes far fetched) ideas; these ideas often inspired us to come up with our own creative innovations.

PAWS is a great place to meet, greet and do business, and though it is 5 hours away, I have too much homework to do and grade while at the University of Tennessee (for now).

Here is a very interesting poll that they are running. It is good to see conferences take feedback in such a transparent manner.

[Image: PAWS poll]

A comment on OffShoring

A reader posted a comment on offshoring. I am re-posting it in its entirety.

When you use the phrase “labor shortage” or “skills shortage” you’re speaking in a sentence fragment.  What you actually mean to say is:  “There is a labor shortage at the salary level I’m willing to pay.”  That statement is the correct phrase; the complete sentence and the intellectually honest statement.

Employers speak about shortages as though they represent some absolute, readily identifiable lack of desirable services. Price is rarely accorded its proper importance in their discussion.

If you start raising wages and improving working conditions, and continue doing so, you’ll solve your shortage and will have people lining up around the block to work for you even if you need to have huge piles of steaming manure hand-scooped on a blazing summer afternoon.

Re:  Shortage caused by employees retiring out of the workforce:  With the majority of retirement accounts down about 50% or more, most people entering retirement age are working well into their sunset years.  So, you won’t be getting a worker shortage anytime soon due to retirees exiting the workforce.

Okay, fine.  Some specialized jobs require training and/or certification, again, the solution is higher wages and improved benefits. People will self-fund their re-education so that they can enter the industry in a work-ready state.  The attractive wages, working conditions and career prospects of technology during the 1980’s and 1990’s was a prime example of people’s willingness to self-fund their own career re-education.

There is never enough of any good or service to satisfy all wants or desires. A buyer, or employer, must give up something to get something. They must pay the market price and forego whatever else he could have for the same price. The forces of supply and demand determine these prices — and the price of a skilled workman is no exception. The buyer can take it or leave it. However, those who choose to leave it (because of lack of funds or personal preference) must not cry shortage. The good is available at the market price. All goods and services are scarce, but scarcity and shortages are by no means synonymous. Scarcity is a regrettable and unavoidable fact.

Shortages are purely a function of price. The only way in which a shortage has existed, or ever will exist, is in cases where the “going price” has been held below the market-clearing price.

How to use Oracle for Data Mining

Oracle for Data Mining!!!! That’s right, I am talking about the same database company that made waves by acquiring Sun (and the beloved Java) and has been stealing market share left and right.

Here is some techie-specific help: if you know SQL (or even PROC SQL), you can learn Oracle Data Mining in less than an hour, which is good enough to clear that job shortlist.

Check out the attached sample code examples.  They are designed to run on the ODM demo data, but you could change that easily.  They are posted on OTN here

Sample Code Demonstrating Oracle 11.1 Data Mining (230KB)
These files include sample programs in PL/SQL and Java illustrating each of the algorithms supported by Oracle Data Mining 11.1. There are examples of automatic data preparation and data transformations appropriate for each algorithm. Several programs illustrate the text transformation and text mining process.

Oracle Data Mining PL/SQL Sample Programs

The PL/SQL sample programs illustrate each algorithm supported by Oracle Data Mining as well as text transformation and text mining using NMF and SVM classification. Transformations that prepare the data for mining are included in the programs. Execute the PL/SQL sample programs.

Mining Function | Algorithm | Sample Program
Anomaly Detection | One-Class Support Vector Machine | dmsvodem.sql
Association Rules | Apriori | dmardemo.sql
Attribute Importance | Minimum Description Length | dmaidemo.sql
Classification | Adaptive Bayes Network (deprecated) | dmabdemo.sql
Classification | Decision Tree | dmdtdemo.sql
Classification | Decision Tree (cross validation) | dmdtxvlddemo.sql
Classification | Logistic Regression | dmglcdem.sql
Classification | Naive Bayes | dmnbdemo.sql
Classification | Support Vector Machine | dmsvcdem.sql
Clustering | k-Means | dmkmdemo.sql
Clustering | O-Cluster | dmocdemo.sql
Feature Extraction | Non-Negative Matrix Factorization | dmnmdemo.sql
Regression | Linear Regression | dmglrdem.sql
Regression | Support Vector Machine | dmsvrdem.sql
Text Mining | Text transformation using Oracle Text | dmtxtfe.sql
Text Mining | Non-Negative Matrix Factorization | dmtxtnmf.sql
Text Mining | Support Vector Machine (Classification) | dmtxtsvm.sql

And a particularly cute and nifty example of Fraud (as in Fraud Detection 😉):

-- clean up any previous run
drop table CLAIMS_SET;
exec dbms_data_mining.drop_model('CLAIMSMODEL');
-- settings table: choose the SVM algorithm and turn on automatic data preparation
create table CLAIMS_SET (setting_name varchar2(30), setting_value varchar2(4000));
insert into CLAIMS_SET values ('ALGO_NAME','ALGO_SUPPORT_VECTOR_MACHINES');
insert into CLAIMS_SET values ('PREP_AUTO','ON');
commit;
-- build the model on the CLAIMS table, using POLICYNUMBER as the case id
begin
  dbms_data_mining.create_model('CLAIMSMODEL', 'CLASSIFICATION',
    'CLAIMS', 'POLICYNUMBER', null, 'CLAIMS_SET');
end;
/
-- accuracy (per-class and overall)
col actual format a6
select actual, round(corr*100/total,2) percent, corr, total-corr incorr, total from
  (select actual, sum(decode(actual,predicted,1,0)) corr, count(*) total from
    (select CLAIMS actual, prediction(CLAIMSMODEL using *) predicted
     from CLAIMS_APPLY)
   group by rollup(actual));
-- top 5 most suspicious claims where the number of previous claims is 2 or more:
select * from
  (select POLICYNUMBER, round(prob_fraud*100,2) percent_fraud,
          rank() over (order by prob_fraud desc) rnk from
    (select POLICYNUMBER, prediction_probability(CLAIMSMODEL, '0' using *) prob_fraud
     from CLAIMS_APPLY
     where PASTNUMBEROFCLAIMS in ('2 to 4', 'more than 4')))
where rnk <= 5
order by percent_fraud desc;
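
If you think in Python rather than SQL, here is a rough sketch of what the two reporting queries above compute: a per-class and overall accuracy table, and a top-5 list of the most suspicious policies among repeat claimants. Everything in it is an assumption for illustration: the rows, labels and probabilities are made up (they do not come from the ODM demo data or from any model), and the function names are my own, not part of Oracle Data Mining.

```python
from collections import Counter

# Hypothetical scored rows standing in for CLAIMS_APPLY after scoring:
# (policy_number, actual_label, predicted_label, prob_fraud, past_claims_bucket)
scored = [
    (101, "1", "1", 0.05, "none"),
    (102, "1", "0", 0.81, "2 to 4"),
    (103, "0", "0", 0.93, "more than 4"),
    (104, "1", "1", 0.10, "1"),
    (105, "0", "1", 0.40, "2 to 4"),
    (106, "0", "0", 0.77, "2 to 4"),
    (107, "1", "1", 0.22, "more than 4"),
    (108, "1", "1", 0.64, "2 to 4"),
]


def accuracy_rollup(rows):
    """Per-class and overall accuracy, in the spirit of GROUP BY ROLLUP(actual)."""
    total = Counter()
    correct = Counter()
    for _, actual, predicted, _, _ in rows:
        total[actual] += 1
        if actual == predicted:  # same test as decode(actual, predicted, 1, 0)
            correct[actual] += 1
    # One entry per class: (percent correct, correct count, total count).
    report = {
        label: (round(correct[label] * 100 / total[label], 2), correct[label], total[label])
        for label in sorted(total)
    }
    # The rollup's grand-total row, keyed by None like SQL's NULL group.
    all_total = sum(total.values())
    all_correct = sum(correct.values())
    report[None] = (round(all_correct * 100 / all_total, 2), all_correct, all_total)
    return report


def top_suspicious(rows, buckets=("2 to 4", "more than 4"), n=5):
    """Highest-probability rows among policies with repeated past claims.

    Unlike SQL's rank(), which can return more than n rows when there are
    ties, this simply cuts the sorted list off at n.
    """
    eligible = [r for r in rows if r[4] in buckets]
    ranked = sorted(eligible, key=lambda r: r[3], reverse=True)
    return [(r[0], round(r[3] * 100, 2)) for r in ranked[:n]]


if __name__ == "__main__":
    for label, (percent, corr, tot) in accuracy_rollup(scored).items():
        print(label, percent, corr, tot - corr, tot)
    for policy, percent_fraud in top_suspicious(scored):
        print(policy, percent_fraud)
```

The point of the sketch is only to make the shape of the two reports obvious. In the real thing the database does the scoring in place, which is the whole appeal when the table is too big to pull out.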

Coming up- a series of tutorials on learning these skills just by sitting at home.

Hat Tip- Karl Rexer, Rexer Analytics, and Charlie Berger, Oracle.