SPSS bought by Big Blue

SPSS Inc maker of PASW series of analytics softwares is being bought by IBM ( unless Oracle spikes this deal too). IBM is seeking a play in the rapidly growing analytics market and is also a strategic partner to WPS ( who makes the Base SAS alternative SAS language software).

In a personal note- I just entered University of Tennessee as a statistics student.

Interesting community event by R/Statistical community

Citation-
http://en.oreilly.com/oscon2009/public/schedule/detail/10432

StackOverflow Flash Mob for the R User Community
Moderated by: Michael E. Driscoll
7:00pm Wednesday, 07/22/2009
Location: Ballroom A2

In concert with users online across the country, this session will lead a flashmob to populate StackOverflow with R language content.

R, the open source statistical language, has a notoriously steep learning curve. The same technical questions tend be asked repeatedly on the R-help mailing lists, to the detriment of both R experts (who tire of repeating themselves) and the learners (who often receive a technically correct, but terse response).

We have developed a list of the most common 100 technical R questions, based on an analysis of (i) queries sent to the RSeek.org web portal, and (ii) an examination of the R-help list archives, and (iii) a survey of members of R Users Groups in San Francisco, LA, and New York City.

In the first hour, participants will pair up to claim a question, formulate it on StackOverflow, and provide a comprehensive answer. In the second hour, participants will rate, review, and comment on the set of submitted questions and answers.

While Stackoverflow currently lacks content for the R language, we believe this effort will provide the spark to attract more R users, and emerge as a valuable resource to the growing R community.

This is an interesting example of a statistical software community using twitter for a tech help event. I hope this trend/ event gets replicated again and again-

Statisticians worldwide unite in the language of maths !!!

Please follow @rstatsmob to participate. See you at 7 PM PST!

twitter.com/Rstatsmob

Growing Rapidly: Rapid Miner 4.5

The Europe based Rapid Miner came out with version 4.5 of their data mining tool ( also known as Yale) with a much promising “Script” tool.

Also, Rapid Miner came in 1st in open source data mining tools in a poll by Industry benchmark www.kdnuggets.com

They have a brilliant video here for people who just want to have a look at the new Rapid Miner

http://rapid-i.com/videos/rapidminer_tour_3_4_en.html

Citation-

http://rapid-i.com/content/view/147/1/

New Operators:

  • FormulaExtractor
  • Trend
  • LagSeries
  • VectorLinearRegression
  • ExampleSetMinus
  • ExampleSetIntersect
  • Partition
  • Script
  • ForwardSelection
  • NeuralNetImproved
  • KernelNaiveBayes
  • ExhaustiveSubgroupDiscovery
  • URLExampleSource
  • NonDominatedSorting
Image

More Features:

  • The new Script operator allows for arbitrary user defined operations based on Groovy script combined with a simplified RapidMiner syntax
  • Improved the join operator and added options for left and right outer joins
  • New notification mail mechanism at the end of processes
  • Most file based data input operators now provide an option to skip error lines
  • Most file based example source operators as well as the IOObjectReader and the new URLExampleSource now accept URLs instead of a filename for the input source location

Managing Twitter Relationships:Refollow.com

If you have more than 100 followers, or people following you on Twitter and want to have some kind of Outlook like manager for managing so much info- this is a great tool from www.refollow.com

Added benefits-
1) Secure Login using Twitter Authorization
2) Visual Click and Easy Follow- Unfollow Blocking based on activities
3) Segmenting groups of people basede on behavior
4) Hidden insights on who all suddenly have un-followed me ( after I accidentally revealed the spoiler end of latest Harry Potter movie- Dumbeldore will sleep with the fishes)

The Screenshot (of my refollow page) below shows you most of the properties-

a1

Personal- Google's bADSENSE

If you notice I removed the ads from this site, the Goggle Ad Sense ads. The reason for that was I found it no corelation at all between what I was writing and what kind of ads I saw.

Maybe it is my location- India, but after watching ads for career job sites, video games, marketing networks and computer training alongside some of my writing- I decided to split with the big G and call it an end to Google’s bad Adsense.

The irony of a data mining blog failing to get relevant data mining ads from a data mining search engine.

(NOT coming up- Decisionstats T Shirts with quotes from interviews)

Now back to coding and research.

R language on the GPU

Here are some nice articles on using R on Graphical Processing Units (GPU) mainly made by NVidia. Think of a GPU as a customized desktop with specialized computing equivalent to much faster computing. i.e. Matlab users can read the webinars here http://www.nvidia.com/object/webinar.html

Now a slightly better definition of GPU computing is from http://www.nvidia.com/object/GPU_Computing.html

GPU computing is the use of a GPU (graphics processing unit) to do general purpose scientific and engineering computing.
The model for GPU computing is to use a CPU and GPU together in a heterogeneous computing model. The sequential part of the application runs on the CPU and the computationally-intensive part runs on the GPU. From the user’s perspective, the application just runs faster because it is using the high-performance of the GPU to boost performance.

rgpu

Citation:

http://brainarray.mbni.med.umich.edu/brainarray/rgpgpu/

R is the most popular open source statistical environment in the biomedical research community. However, most of the popular R function implementations involve no parallelism and they can only be executed as separate instances on multicore or cluster hardware for large data-parallel analysis tasks. The arrival of modern graphics processing units (GPUs) with user friendly programming tools, such as nVidia’s CUDA toolkit (http://www.nvidia.com/cuda), provides a possibility of increasing the computational efficiency of many common tasks by more than one order of magnitude (http://gpgpu.org/). However, most R users are not trained to program a GPU, a key obstacle for the widespread adoption of GPUs in biomedical research.

The research project at the page mentioned above has developed special packages for the above need- R on a GPU.

he initial package is hosted by CRAN as gputools a sorce package for UNIX and Linux systems. Be sure to set the environment variable CUDA_HOME to the root of your CUDA toolkit installation. Then install the package in the usual R manner. The installation process will automatically make use of nVidia’s nvcc compiler and CUBLAS shared library.

and some figures

speedupFigure 1 provides performance comparisons between original R functions assuming a four thread data parallel solution on Intel Core i7 920 and our GPU enabled R functions for a GTX 295 GPU. The speedup test consisted of testing each of three algorithms with five randomly generated data sets. The Granger causality algorithm was tested with a lag of 2 for 200, 400, 600, 800, and 1000 random variables with 10 observations each. Complete hierarchical clustering was tested with 1000, 2000, 4000, 6000, and 8000 points. Calculation of Kendall’s correlation coefficient was tested with 20, 30, 40, 50, and 60 random variables with 10000 observations each

Ajay- For hard core data mining people ,customized GPU’s for accelerated analytics and data mining sounds like fun and common sense. Are there other packages for customization on a GPU – let me know.

Citation:

http://brainarray.mbni.med.umich.edu/brainarray/rgpgpu/

Download

Download the gputools package for R on a Linux platform here: version 0.01.

Interview Jim Harris Data Quality Expert OCDQ Blog

Here is an interview with one of the chief evangelists to data quality in the field of Business Intelligence, Jim Harris who has a renowned blog at http://www.ocdqblog.com/. I asked Jim about his experiences in the field on data quality messing up big budget BI projects, and some tips and methodologies to avoid them.

No one likes to feel blamed for causing or failing to fix the data quality problems- Jim Harris, Data Quality Expert.

Jim Harris Large Photo

Ajay- Why the name OCDQ? What drives your passion for data quality? Name any anecdotes where bad data quality really messed up a big BI project.

Jim Harris – Ever since I was a child, I have had an obsessive-compulsive personality. If you asked my professional colleagues to describe my work ethic, many would immediately respond: “Jim is obsessive-compulsive about data quality…but in a good way!” Therefore, when evaluating the short list of what to name my blog, it was not surprising to anyone that Obsessive-Compulsive Data Quality (OCDQ) was what I chose.

On a project for a financial services company, a critical data source was applications received by mail or phone for a variety of insurance products. These applications were manually entered by data entry clerks. Social security number was a required field and the data entry application had been designed to only allow valid values. Therefore, no one was concerned about the data quality of this field – it had to be populated and only valid values were accepted.

When a report was generated to estimate how many customers were interested in multiple insurance products by looking at the count of applications per social security number, it appeared as if a small number of customers were interested in not only every insurance product the company offered, but also thousands of policies within the same product type. More confusion was introduced when the report added the customer name field, which showed that this small number of highly interested customers had hundreds of different names. The problem was finally traced back to data entry.

Many insurance applications were received without a social security number. The data entry clerks were compensated, in part, based on the number of applications they entered per hour. In order to process the incomplete applications, the data entry clerks entered their own social security number.

On a project for a telecommunications company, multiple data sources were being consolidated into a new billing system. Concerns about postal address quality required the use of validation software to cleanse the billing address. No one was concerned about the telephone number field – after all, how could a telecommunications company have a data quality problem with telephone number?

However, when reports were run against the new billing system, a high percentage of records had a missing telephone number. The problem was that many of the data sources originated from legacy systems that only recently added a telephone number field. Previously, the telephone number was entered into the last line of the billing address.

New records entered into these legacy systems did start using the telephone number field, but the older records already in the system were not updated. During the consolidation process, the telephone number field was mapped directly from source to target and the postal validation software deleted the telephone number from the cleansed billing address.

Ajay- Data Quality – Garbage in, Garbage out for a project. What percentage of a BI project do you think gets allocated to input data quality? What percentage of final output is affected by the normalized errors?

Jim Harris- I know that Gartner has reported that 25% of critical data within large businesses is somehow inaccurate or incomplete and that 50% of implementations fail due to a lack of attention to data quality issues.

The most common reason is that people doubt that data quality problems could be prevalent in their systems. This “data denial” is not necessarily a matter of blissful ignorance, but is often a natural self-defense mechanism from the data owners on the business side and/or the application owners on the technical side.

No one likes to feel blamed for causing or failing to fix the data quality problems.

All projects should allocate time and resources for performing a data quality assessment, which provides a much needed reality check for the perceptions and assumptions about the quality of the data. A data quality assessment can help with many tasks including verifying metadata, preparing meaningful questions for subject matter experts, understanding how data is being used, and most importantly – evaluating the ROI of data quality improvements. Building data quality monitoring functionality into the applications that support business processes provides the ability to measure the effect that poor data quality can have on decision-critical information.

Ajay- Companies talk of paradigms like Kaizen, Six Sigma and LEAN for eliminating waste and defects. What technique would you recommend for a company just about to start a major BI project for a standard ETL and reporting project to keep data aligned and clean?

Jim Harris- I am a big advocate for methodology and best practices and the paradigms you mentioned do provide excellent frameworks that can be helpful. However, I freely admit that I have never been formally trained or certified in any of them. I have worked on projects where they have been attempted and have seen varying degrees of success in their implementation. Six Sigma is the one that I am most familiar with, especially the DMAIC framework.

However, a general problem that I have with most frameworks is their tendency to adopt a one-size-fits-all strategy, which I believe is an approach that is doomed to fail. Any implemented framework must be customized to adapt to an organization’s unique culture. In part, this is necessary because implementing changes of any kind will be met with initial resistance, but an attempt at forcing a one-size-fits-all approach almost sends a message to the organization that everything they are currently doing is wrong, which will of course only increase the resistance to change.

Starting with a framework as a reference provides best practices and recommended options of what has worked for other organizations. The framework should be reviewed to determine what can best be learned from it and to select what will work in the current environment and what simply won’t. This doesn’t mean that the selected components of the framework will be implemented simultaneously. All change comes gradually and the selected components will most likely be implemented in phases.

Fundamentally, all change starts with changing people’s minds. And to do that effectively, the starting point has to be improving communication and encouraging open dialogue. This means more of listening to what people throughout the organization have to say and less of just telling them what to do. Keeping data aligned and clean requires getting people aligned and communicating.

Ajay- What methods and habits would you recommend to young analysts starting in the BI field for a quality checklist?

Jim Harris- I always make two recommendations.

First, never make assumptions about the data. I don’t care how well the business requirements document is written or how pretty the data model looks or how narrowly your particular role on the project has been defined. There is simply no substitute for looking at the data.

Second, don’t be afraid to ask questions or admit when you don’t know the answers. The only difference between a young analyst just starting out and an expert is that the expert has already made and learned from all the mistakes caused by being afraid to ask questions or admitting when you don’t know the answers.

Ajay- What does Jim Harris do to have quality time when not at work?

Jim- Since I enjoy what I do for a living so much, it sometimes seems impossible to disengage from work and make quality time for myself. I have also become hopelessly addicted to social media and spend far too much time on Twitter and Facebook. I have also always spent too much of my free time watching television and movies. I do try to read as much as I can, but I have so many stacks of unread books in my house that I could probably open my own book store. True quality time typically requires the elimination of all technology by going walking, hiking or mountain biking. I do bring my mobile phone in case of emergencies, but I turn it off before I leave.

Biography-

Jim Harris Small PhotoJim Harris is the Blogger-in-Chief here at Obsessive-Compulsive Data Quality (OCDQ), which is an independent blog offering a vendor-neutral perspective on data quality.

He is an independent consultant, speaker, writer and blogger with over 15 years of professional services and application development experience in data quality (DQ), and business intelligence (BI),

Jim has worked with Global 500 companies in finance, brokerage, banking, insurance, healthcare, pharmaceuticals, manufacturing, retail, telecommunications, and utilities. Jim also has a long history with the product that is now known as IBM InfoSphere QualityStage. Additionally, he has some experience with Informatica Data Quality and DataFlux dfPower Studio.

Jim can be followed at twitter.com/ocdqblog and contacted at http://www.ocdqblog.com/contact/