July 2009 – Page 2 – DECISION STATS

Managing Twitter Relationships: Refollow.com

If you have more than 100 followers on Twitter (or follow that many people) and want an Outlook-like manager for handling that much information, www.refollow.com is a great tool.

Added benefits-
1) Secure login using Twitter authorization
2) Visual, click-and-easy follow, unfollow and blocking based on activity
3) Segmenting groups of people based on behavior
4) Hidden insights on who all have suddenly un-followed me (after I accidentally revealed the spoiler ending of the latest Harry Potter movie- Dumbledore will sleep with the fishes)

The Screenshot (of my refollow page) below shows you most of the properties-

R language on the GPU

Here are some nice articles on using R on Graphics Processing Units (GPUs), mainly made by NVidia. Think of a GPU as specialized hardware that performs highly parallel computations much faster than a general-purpose CPU. Matlab users, for example, can read the webinars here: http://www.nvidia.com/object/webinar.html

Now a slightly better definition of GPU computing is from http://www.nvidia.com/object/GPU_Computing.html

GPU computing is the use of a GPU (graphics processing unit) to do general purpose scientific and engineering computing.
The model for GPU computing is to use a CPU and GPU together in a heterogeneous computing model. The sequential part of the application runs on the CPU and the computationally-intensive part runs on the GPU. From the user’s perspective, the application just runs faster because it is using the high-performance of the GPU to boost performance.


Citation:

http://brainarray.mbni.med.umich.edu/brainarray/rgpgpu/

R is the most popular open source statistical environment in the biomedical research community. However, most of the popular R function implementations involve no parallelism and they can only be executed as separate instances on multicore or cluster hardware for large data-parallel analysis tasks. The arrival of modern graphics processing units (GPUs) with user friendly programming tools, such as nVidia’s CUDA toolkit (http://www.nvidia.com/cuda), provides a possibility of increasing the computational efficiency of many common tasks by more than one order of magnitude (http://gpgpu.org/). However, most R users are not trained to program a GPU, a key obstacle for the widespread adoption of GPUs in biomedical research.

The research project at the page cited above has developed special packages to meet this need: running R on a GPU.

The initial package is hosted by CRAN as gputools, a source package for UNIX and Linux systems. Be sure to set the environment variable CUDA_HOME to the root of your CUDA toolkit installation. Then install the package in the usual R manner. The installation process will automatically make use of nVidia’s nvcc compiler and CUBLAS shared library.

and some figures

Figure 1 provides performance comparisons between original R functions assuming a four thread data parallel solution on Intel Core i7 920 and our GPU enabled R functions for a GTX 295 GPU. The speedup test consisted of testing each of three algorithms with five randomly generated data sets. The Granger causality algorithm was tested with a lag of 2 for 200, 400, 600, 800, and 1000 random variables with 10 observations each. Complete hierarchical clustering was tested with 1000, 2000, 4000, 6000, and 8000 points. Calculation of Kendall’s correlation coefficient was tested with 20, 30, 40, 50, and 60 random variables with 10000 observations each.
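To give a feel for why a calculation like Kendall’s correlation is a natural candidate for a GPU, here is a minimal CPU-only sketch in Python. It is not gputools and not the project’s implementation; the function names `kendall_tau_a` and `pair_count` are made up for illustration, and the simple tau-a variant is assumed (ties count as neither concordant nor discordant).

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs.

    Illustrative only: ties count as neither concordant nor discordant.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    if n < 2:
        raise ValueError("need at least two observations")
    concordant = discordant = 0
    # Each pair of observations is compared independently of all others,
    # which is why the work can be spread across many GPU threads.
    for i, j in combinations(range(n), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


def pair_count(n):
    """Number of observation pairs compared for n observations."""
    return n * (n - 1) // 2
```

With 10000 observations, as in the test described above, `pair_count(10000)` gives 49,995,000 comparisons for every pair of variables, all of them independent of one another.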

Ajay- For hard-core data mining people, customized GPUs for accelerated analytics and data mining sound like fun and common sense. Are there other packages for customization on a GPU? Let me know.


Download

Download the gputools package for R on a Linux platform here: version 0.01.

Interview Jim Harris Data Quality Expert OCDQ Blog

Here is an interview with one of the chief evangelists for data quality in the field of Business Intelligence, Jim Harris, who has a renowned blog at http://www.ocdqblog.com/. I asked Jim about his experiences in the field of poor data quality messing up big-budget BI projects, and for some tips and methodologies to avoid such problems.

No one likes to feel blamed for causing or failing to fix the data quality problems- Jim Harris, Data Quality Expert.


Ajay- Why the name OCDQ? What drives your passion for data quality? Name any anecdotes where bad data quality really messed up a big BI project.

Jim Harris – Ever since I was a child, I have had an obsessive-compulsive personality. If you asked my professional colleagues to describe my work ethic, many would immediately respond: “Jim is obsessive-compulsive about data quality…but in a good way!” Therefore, when evaluating the short list of what to name my blog, it was not surprising to anyone that Obsessive-Compulsive Data Quality (OCDQ) was what I chose.

On a project for a financial services company, a critical data source was applications received by mail or phone for a variety of insurance products. These applications were manually entered by data entry clerks. Social security number was a required field and the data entry application had been designed to only allow valid values. Therefore, no one was concerned about the data quality of this field – it had to be populated and only valid values were accepted.

When a report was generated to estimate how many customers were interested in multiple insurance products by looking at the count of applications per social security number, it appeared as if a small number of customers were interested in not only every insurance product the company offered, but also thousands of policies within the same product type. More confusion was introduced when the report added the customer name field, which showed that this small number of highly interested customers had hundreds of different names. The problem was finally traced back to data entry.

Many insurance applications were received without a social security number. The data entry clerks were compensated, in part, based on the number of applications they entered per hour. In order to process the incomplete applications, the data entry clerks entered their own social security number.
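The check that exposed this problem (counting applications and distinct customer names per social security number) can be sketched in a few lines of Python. This is only an illustration: the function name `suspicious_ssns`, the record layout and the sample records are all made up, and the SSN values are placeholders.

```python
from collections import defaultdict


def suspicious_ssns(applications, max_names=1):
    """Return {ssn: (application_count, distinct_name_count)} for SSNs
    that appear with more than `max_names` distinct customer names."""
    counts = defaultdict(int)
    names = defaultdict(set)
    for app in applications:
        counts[app["ssn"]] += 1
        # Normalise case and whitespace so "Ann Lee" and "ann lee " match.
        names[app["ssn"]].add(" ".join(app["name"].lower().split()))
    return {
        ssn: (counts[ssn], len(names[ssn]))
        for ssn in counts
        if len(names[ssn]) > max_names
    }


# Made-up records: the SSN values are placeholders, not real numbers.
applications = [
    {"ssn": "000-00-0001", "name": "Ann Lee"},
    {"ssn": "000-00-0001", "name": "ann lee "},
    {"ssn": "000-00-0002", "name": "Bob Ray"},
    {"ssn": "000-00-0002", "name": "Cy Dunn"},
    {"ssn": "000-00-0002", "name": "Di Fox"},
]
flagged = suspicious_ssns(applications)
```

Here only the second SSN is flagged, because it is shared by three different names. A field that passes format validation can still fail this kind of profiling.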

On a project for a telecommunications company, multiple data sources were being consolidated into a new billing system. Concerns about postal address quality required the use of validation software to cleanse the billing address. No one was concerned about the telephone number field – after all, how could a telecommunications company have a data quality problem with telephone number?

However, when reports were run against the new billing system, a high percentage of records had a missing telephone number. The problem was that many of the data sources originated from legacy systems that only recently added a telephone number field. Previously, the telephone number was entered into the last line of the billing address.

New records entered into these legacy systems did start using the telephone number field, but the older records already in the system were not updated. During the consolidation process, the telephone number field was mapped directly from source to target and the postal validation software deleted the telephone number from the cleansed billing address.
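One way to avoid losing the number is to look for it on the last address line before the postal validation step runs. The Python sketch below is a hypothetical illustration, not what the project used: the function name `split_phone_from_address` is made up, and the pattern assumes 10-digit North American style numbers only.

```python
import re

# Assumed pattern for 10-digit North American style numbers, e.g.
# "555-123-4567" or "(555) 123-4567"; real data would need a broader pattern.
PHONE_RE = re.compile(r"^\(?(\d{3})\)?[\s.-]?(\d{3})[\s.-]?(\d{4})$")


def split_phone_from_address(address_lines, phone=None):
    """Return (address_lines, phone), moving a phone number found on the
    last address line into the phone field when that field is empty."""
    lines = [line.strip() for line in address_lines if line.strip()]
    if not phone and lines:
        match = PHONE_RE.match(lines[-1])
        if match:
            phone = "-".join(match.groups())
            lines = lines[:-1]
    return lines, phone
```

Run before address cleansing, this keeps the older records’ telephone numbers instead of letting the validation software discard them as invalid address text.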

Ajay- Data Quality – Garbage in, Garbage out for a project. What percentage of a BI project do you think gets allocated to input data quality? What percentage of final output is affected by the normalized errors?

Jim Harris- I know that Gartner has reported that 25% of critical data within large businesses is somehow inaccurate or incomplete and that 50% of implementations fail due to a lack of attention to data quality issues.

The most common reason is that people doubt that data quality problems could be prevalent in their systems. This “data denial” is not necessarily a matter of blissful ignorance, but is often a natural self-defense mechanism from the data owners on the business side and/or the application owners on the technical side.

No one likes to feel blamed for causing or failing to fix the data quality problems.

All projects should allocate time and resources for performing a data quality assessment, which provides a much needed reality check for the perceptions and assumptions about the quality of the data. A data quality assessment can help with many tasks including verifying metadata, preparing meaningful questions for subject matter experts, understanding how data is being used, and most importantly – evaluating the ROI of data quality improvements. Building data quality monitoring functionality into the applications that support business processes provides the ability to measure the effect that poor data quality can have on decision-critical information.

Ajay- Companies talk of paradigms like Kaizen, Six Sigma and LEAN for eliminating waste and defects. What technique would you recommend for a company just about to start a major BI project for a standard ETL and reporting project to keep data aligned and clean?

Jim Harris- I am a big advocate for methodology and best practices and the paradigms you mentioned do provide excellent frameworks that can be helpful. However, I freely admit that I have never been formally trained or certified in any of them. I have worked on projects where they have been attempted and have seen varying degrees of success in their implementation. Six Sigma is the one that I am most familiar with, especially the DMAIC framework.

However, a general problem that I have with most frameworks is their tendency to adopt a one-size-fits-all strategy, which I believe is an approach that is doomed to fail. Any implemented framework must be customized to adapt to an organization’s unique culture. In part, this is necessary because implementing changes of any kind will be met with initial resistance, but an attempt at forcing a one-size-fits-all approach almost sends a message to the organization that everything they are currently doing is wrong, which will of course only increase the resistance to change.

Starting with a framework as a reference provides best practices and recommended options of what has worked for other organizations. The framework should be reviewed to determine what can best be learned from it and to select what will work in the current environment and what simply won’t. This doesn’t mean that the selected components of the framework will be implemented simultaneously. All change comes gradually and the selected components will most likely be implemented in phases.

Fundamentally, all change starts with changing people’s minds. And to do that effectively, the starting point has to be improving communication and encouraging open dialogue. This means more of listening to what people throughout the organization have to say and less of just telling them what to do. Keeping data aligned and clean requires getting people aligned and communicating.

Ajay- What methods and habits would you recommend to young analysts starting in the BI field for a quality checklist?

Jim Harris- I always make two recommendations.

First, never make assumptions about the data. I don’t care how well the business requirements document is written or how pretty the data model looks or how narrowly your particular role on the project has been defined. There is simply no substitute for looking at the data.

Second, don’t be afraid to ask questions or admit when you don’t know the answers. The only difference between a young analyst just starting out and an expert is that the expert has already made and learned from all the mistakes caused by being afraid to ask questions or admitting when you don’t know the answers.

Ajay- What does Jim Harris do to have quality time when not at work?

Jim- Since I enjoy what I do for a living so much, it sometimes seems impossible to disengage from work and make quality time for myself. I have also become hopelessly addicted to social media and spend far too much time on Twitter and Facebook. I have also always spent too much of my free time watching television and movies. I do try to read as much as I can, but I have so many stacks of unread books in my house that I could probably open my own book store. True quality time typically requires the elimination of all technology by going walking, hiking or mountain biking. I do bring my mobile phone in case of emergencies, but I turn it off before I leave.

Biography-

Jim Harris is the Blogger-in-Chief here at Obsessive-Compulsive Data Quality (OCDQ), which is an independent blog offering a vendor-neutral perspective on data quality.

He is an independent consultant, speaker, writer and blogger with over 15 years of professional services and application development experience in data quality (DQ) and business intelligence (BI).

Jim has worked with Global 500 companies in finance, brokerage, banking, insurance, healthcare, pharmaceuticals, manufacturing, retail, telecommunications, and utilities. Jim also has a long history with the product that is now known as IBM InfoSphere QualityStage. Additionally, he has some experience with Informatica Data Quality and DataFlux dfPower Studio.

Jim can be followed at twitter.com/ocdqblog and contacted at http://www.ocdqblog.com/contact/

Interview Eric Siegel, Phd President Prediction Impact

An interview with Eric Siegel, Ph.D., President of Prediction Impact, Inc. and founding chair of Predictive Analytics World.

Ajay- What does this round of Predictive Analytics World have that was not there in the edition earlier in the year?

Eric- Predictive Analytics World (pawcon.com) – Oct 20-21 in DC delivers a fresh set of 25 vendor-neutral presentations across verticals employing predictive analytics, such as banking, financial services, e-commerce, education, healthcare, high technology, insurance, non-profits, publishing, retail and telecommunications.

PAW features keynote speaker, Stephen Baker, author of The Numerati and Senior writer at BusinessWeek. His keynote is described at www.predictiveanalyticsworld.com/dc/2009/agenda.php#day2-2

A strong representation of leading enterprises have signed up to tell their stories — speakers will present how predictive analytics is applied at Aflac, AT&T Bell South, Amway, The Coca-Cola Company, Financial Times, Hewlett-Packard, IRS, National Center for Dropout Prevention, The National Rifle Association, The New York Times, Optus (Australian telecom), PREMIER Bankcard, Reed Elsevier, Sprint-Nextel, Sunrise Communications (Switzerland), Target, US Bank, U.S. Department of Defense, Zurich — plus special examples from Anheuser-Busch, Disney, HSBC, Pfizer, Social Security Administration, WestWind Foundation and others.

To see the entire agenda at a glance: www.predictiveanalyticsworld.com/dc/2009/agenda_overview.php

We’ve added a third workshop, offered the day before (Oct 19), “Hands-On Predictive Analytics”. There’s no better way to dive in than operating real predictive modeling software yourself, hands-on. For more info: www.predictiveanalyticsworld.com/dc/2009/handson_predictive_analytics.php

Ajay- What do academics, corporations and data miners gain from this conference? List 4 bullet points for the specific gains.

Eric- A. First, PAW’s experienced speakers provide the “how to” of predictive analytics. PAW is a unique conference in its focus on the commercial deployment of predictive analytics, rather than research and development. The core analytical technology is established and proven, valuable as-is without additional R&D — but that doesn’t mean it’s a “cakewalk” to employ it successfully to ensure value is attained. Challenges include data requirements and preparation, integration of predictive models and their scores into existing organizational systems and processes, tracking and evaluating performance, etc. There’s no better way to hone your skills and cultivate an informed plan for your organization’s efforts than hearing how other organizations did it.

B. Second, PAW covers the latest state-of-the-art methods produced by research labs, and how they provide value in commercial deployment. This October, almost all sessions in Track 2 are at the Expert/Practitioner-level. Advanced topics include ensemble models, uplift modeling (incremental modeling), model scoring with cloud computing, predictive text analytics, social network analysis, and more.

PAW’s pre- and post-conference workshops round out the learning opportunities. In addition to the hands-on workshop mentioned above, there is a course covering core methods, “The Best and the Worst of Predictive Analytics: Predictive Modeling Methods and Common Data Mining Mistakes” (www.predictiveanalyticsworld.com/dc/2009/predictive_modeling_methods.php) and a business-level seminar on decision automation and support, “Putting Predictive Analytics to Work” (www.predictiveanalyticsworld.com/dc/2009/predictive_analytics_work.php).

C. Third, the leading predictive analytics software vendors and consulting firms are present at PAW as sponsors and exhibitors, available to provide demos and answer your questions. What do the predictive analytics solutions do, how do they compare, and which is best for you? PAW is the one-stop-shop for selecting the tool or solution most suited to address your needs.

D. Fourth, PAW provides a unique, focused opportunity to network with colleagues and establish valuable contacts in the predictive analytics industry. Mingle, connect and hang out with professionals facing similar challenges (coffee breaks, meals, and at the reception).

Ajay- How do you balance the interests of the various competing software vendors and companies that sponsor such an event?

Eric- As a vendor-neutral event, PAW’s core program of 25 sessions is booked exclusively with enterprise practitioners, thought leaders and adopters, with no predictive analytics software vendors speaking or co-presenting. These sessions provide substantive content with take-aways which provide value that’s independent of any particular software solution — no product pitches! Beyond these 25 sessions are three short sponsor sessions that are demarcated as such, and, despite being branded, generally prove to be quite substantive as well. In this way, PAW delivers a high quality, unbiased program.

Supplementing this vendor-neutral program, the room right next door has an expo where attendees have all the access to software and solution vendors they could want (cf. my answer to the prior question regarding software vendors, above).

Here are a couple more PAW links:

For informative PAW event updates:
www.predictiveanalyticsworld.com/notifications.php

To sign up for the PAW group on LinkedIn, see:
www.linkedin.com/e/gis/1005097

Ajay- Describe your career in science, including the research that you specialize in. How would you motivate students today to go for science careers?

Eric- Well, first off, my work as a predictive analytics consultant, instructor and conference chair is in the application of established technology, rather than the research and development of new or improved methods.

But the Ph.D. next to my name reveals my secret past as an “academic”. Pure research is something I really enjoyed and I kind of feel like I had a brain transplant in order to change to “real world work”. I’m glad I made the change, although I see good sides to both types of work (really, they’re like two entirely different lifestyles).

In my research I focused on core predictive modeling methods. The ability for a computer to automatically learn from experience (data really is recorded experience, after all), is the best thing since sliced bread. Ever since I realized, as a kid, that space travel would in fact be a huge pain in the neck, nothing in science has ever seemed nearly as exciting.

Predictive analytics is an endeavor in machine learning. A predictive model is the encoding of a set of rules or patterns or regularities at some level. The model is the thing output by automated, number-crunchin’ analysis and, therefore, is the thing “learned” from the “experience” (data). The “magic” here is the ability of these methods to find a model that performs not only over the historical data on your disk drive, but that will perform equally well for tomorrow’s new situations. That ability to generalize from the data at hand means the system has actually learned something.

And indeed the ability to learn and apply what’s been learned turns out to provide plenty of business value, as I imagined back in the lab. The output of a predictive model is a predictive score for each individual customer or prospect. The score in turn directly informs the business decision to be taken with that individual customer (to contact or not to contact; with which offer to contact, etc.) – business intelligence just doesn’t get more actionable than that.
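The idea of learning from historical data, checking that the result holds up on data the model has not seen, and then turning the output into a contact decision can be sketched as follows. This is a deliberately tiny, hypothetical illustration in Python, not a method described in the interview: the “model” is a single learned cut-off on one numeric feature, that feature value stands in for the predictive score, and all data and function names are made up.

```python
def fit_threshold(rows):
    """Pick the cut-off on a single numeric feature that best separates
    responders (label 1) from non-responders (label 0) in `rows`."""
    best_cut, best_correct = None, -1
    for cut in sorted({x for x, _ in rows}):
        correct = sum((x >= cut) == bool(label) for x, label in rows)
        if correct > best_correct:
            best_cut, best_correct = cut, correct
    return best_cut


def accuracy(cut, rows):
    """Share of rows where the cut-off predicts the recorded outcome."""
    return sum((x >= cut) == bool(label) for x, label in rows) / len(rows)


def decide(x, cut):
    """Turn the model's output into the business action."""
    return "contact" if x >= cut else "do not contact"


# Made-up (feature, responded) pairs; the feature could be e.g. past purchases.
historical = [(1, 0), (2, 0), (3, 0), (4, 1), (5, 1), (6, 1), (2, 1), (5, 0)]
holdout = [(1, 0), (3, 0), (4, 1), (6, 1), (5, 0)]

cut = fit_threshold(historical)            # learned from "experience"
train_acc = accuracy(cut, historical)      # performance on data already seen
holdout_acc = accuracy(cut, holdout)       # performance on new situations
```

The holdout accuracy is the number that matters: it estimates how the model will do on tomorrow’s customers, and `decide` is where that model output becomes an action for each individual.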

For the impending student, I’d first point out the difference between applied science and research science. Research science is fun in that you have the luxury of abstraction and are usually fairly removed from the need to prove near-term industrial applicability. Applied science is fun for the opposite reason: The tangle of challenges, although less abstract and in that sense more mundane, are the only thing between you and getting the great ideas of the world to actually work, come to fruition, and have an irrefutable impact.

Ajay- What, in your opinion, are the top five conferences in analytics and data mining in the world, including PAW?

Eric- KDD – The leading event for research and development of the core methods behind the commercial deployments covered at PAW (“Knowledge Discovery and Data Mining”).

ICML – Another long-standing research conference on machine learning (core data mining).

eMetrics.org – For online marketing optimization and web analytics

Text Analytics Summit – Text mining can leverage “unstructured data” (text) to augment predictive analytics; the chair of this conference is speaking at PAW on just that topic: www.predictiveanalyticsworld.com/dc/2009/agenda.php#day2-15

Predictive Analytics World, the business-focused event for predictive analytics professionals, managers and commercial practitioners – focused on the commercial deployment of predictive analytics: pawcon.com

Ajay- Will PAW 2009 have video archives or podcasts for people not able to attend on site?

Eric- While the PAW conferences emphasize in-person participation, we are in the planning stages for future webcasts and other online content. PAW’s “Predictive Analytics Guide” has a growing list of online resources: www.predictiveanalyticsworld.com/predictive_analytics.php

Ajay- How do you think social media marketing can help in these conferences?

Eric- Like most events, PAW leverages social media to spread the word.

But perhaps most pertinent is the other way around: predictive analytics can help social media by increasing relevancy, dynamically selecting the content to which each reader or viewer is most likely to respond.

Ajay- Do you have any plans to take PAW more international? Any plans for a PAW journal for trainings and papers?

Eric- We’re in discussions on these topics, but for now I can only say, stay tuned!

Biography

The president of Prediction Impact, Inc., Eric Siegel is an expert in predictive analytics and data mining and a former computer science professor at Columbia University, where he won awards for teaching, including graduate-level courses in machine learning and intelligent systems – the academic terms for predictive analytics. He has published 13 papers in data mining research and computer science education, has served on 10 conference program committees, has chaired a AAAI Symposium held at MIT, and is the founding chair of Predictive Analytics World.

For more on Predictive Analytics World-

Predictive Analytics World Conference
October 20-21, 2009, Washington, DC
www.predictiveanalyticsworld.com
LinkedIn Group: www.linkedin.com/e/gis/1005097

Interview Gary D. Miner Author and Professor

Here is an interview with Gary Miner, Ph.D., who has been in the data mining business for almost 30 years and is a pioneer in healthcare studies pertaining to Alzheimer’s disease. He is also co-author of the “Handbook of Statistical Analysis and Data Mining Applications”. Gary writes on how he has seen data mining change over the years, on health care applications, and on his book, with quotes from his experience.


Ajay- Describe your career in science, starting from college till today. How would you interest young students in science careers today, in the midst of the recession?

Gary – I knew that I wanted to be in “Science” even before college days, taking all the science and math courses I could in high school. This continued in undergraduate college years at a private college [Hamline University, St. Paul, Minnesota……..older than the State of Minnesota, founded in 1854, and had the first Medical School, later “sold” to the University of Minnesota] as a Biology and Chemistry major, with a minor in education. From there I did an M.S., conducting a “Physiological genetics research project”, and then a Ph.D. at another institution where I worked on Genetic Polymorphisms of Mouse blood enzymes. So through all of this, I had to use statistics to analyze the data. My M.S. was analyzed before the time of even “electronic calculators”, so I used, if you can believe this, a “hand cranked calculator”, rented, one summer to analyze my M.S. dataset. By the time my Ph.D. thesis data was being analyzed, electronic calculators were available, but the big main-frame computers were on college campuses, so I punched the data into CARDS, walked down the hill to the computing center, dropped off the stack of cards, to come back the next day to get “reams of output” on large paper [about 15” by 18”, folded in a stack, if anyone remembers those days …]. I then spent about 30 years doing medical research in academic environments with the emphasis on genetics, biochemistry, and proteomics in the areas of mental illness and Alzheimer’s Disease, which became my main area of study, publishing the first book in 1989 on the GENETICS OF ALZHEIMER’S DISEASE.

Today, in my “semi-retirement careers”, one side-line outreach is working with medical residents on their research projects, which I’ve been doing for about 7 or 8 years now. This involves design of the research project, data collection, and most importantly “effective and accurate” analysis of the datasets. I find this a way I can reach out to the younger generation to interest them not only in “science”, but in doing “science correctly”. As you probably know, we are in the arena of the “Dumbing of America”; anti-science, if you wish. I’ve seen this happening for at least 30 years, during the 1980’s, 1990’s, and continuing into this Century. Even the medical residents I get to work with each year have been going “downhill” yearly in their ability to “problem solve”. I believe this is an effect of this “dumbing of America”.

There are several books coming out on this Dumbing of America this summer; one the first week of June, another on July 12, and another in September [see the attached PPT for slides with the covers of these 3 books]. It is a real problem, as Americans over the past few decades have moved towards “wanting simple answers”, and most things in the “real world”, i.e. reality, are not simple………..that’s where Science comes in.

A recent 2008 study done by the School of Public Health at Ohio University showed that up to 88% of the published scientific papers in a top respected cancer journal either used statistics INCORRECTLY, and/or the CONCLUSION was INCORRECT. When my wife and I both did Post-Docs in Psychiatric Epidemiology in 1980-82, basically doing an MPH, the first words out of the mouth of the “Biostats – Epidemiology” professor in the first lecture to the incoming MPH students were “We might as well throw out most of the medical research literature of the past 25 years, as it has either not been designed correctly or statistics have been used incorrectly”!!! ……That caught my attention. And following medical research [and medicine in general] I can tell you that “not much has changed in the past 25 years since then”, and thus that puts us “50 years behind in medical research” and medicine. ANALOGY: If some of our major companies, that are successfully using predictive analytics to organize and efficiently run their organizations, took on the “mode of operation” of medicine and medical research, they’d be “bankrupt” in 6 months …. That’s what I tell my students.

Ajay- Describe some of the exciting things data mining can do to lower health care costs and provide more people with coverage.

Gary- As mentioned above, my personal feeling is that “medicine / health care” is 50 years “behind the times”, compared to the efficiency needed to successfully survive in this Global Economy; corporations and organizations like Wal-Mart, INTEL, and many of our Pharmaceutical Companies have used data mining / predictive analytics to survive successfully. Wal-Mart especially: Wal-Mart has its own set of data miners, who were writing their own procedures in the early 1990’s ………..before most of us ever heard of data mining; that is why Wal-Mart can go into China today, and open a store in any location, and know almost to 99% accuracy 1) how many checkout stands are needed, 2) what products to stock, 3) where in the store to stock them, and 4) what their profit margin will be. They have done this through very accurate “Predictive Analytics” modeling.

Other “ingrained” USA corporations have NOT grabbed onto this “most accurate” technology [i.e. predictive analytics modeling], and are reaping the “rewards” of impending bankruptcy and disappearance today. Examples in the news, of course, are our 3 big automakers in Detroit. If they had engaged effective data mining / modeling in the late 1990’s they could have avoided their current problems. I see the same for many of our oldest and largest USA Insurance Companies………..they are “middle management fat”, and I’ve seen their ratings go down over the past 10 years from an A rating to even a C rating [for the company in which I have my auto insurance; you might ask me why I stay? …. An agent who is a friend, BUT it is frustrating, and this company’s “mode of operation” is completely “customer un-friendly”.], while new insurance companies have “grabbed” onto modern technology, and are rising stars.

So my influence on the younger generation is to have my students do research and DATA ANALYSIS correctly.

Ajay- Describe your book “HANDBOOK OF STATISTICAL ANALYSIS & DATA MINING APPLICATIONS”. Who would be the target audience of this, and can corporate data miners gain from it as well?

Gary- There are several target audiences: The main audience we were writing for, after our Publisher looked at what “niches” had been un-met in data mining literature, was the professional in smaller and middle sized businesses and organizations that needed to learn about “data mining / predictive analytics” “fast”…..e.g. maybe situations where the company did not have a data analysis group using predictive analytics, but the CEO’s and Professionals in the company knew they needed to learn and start using predictive analytics to “stay alive”. This seemed like potentially a very large audience. The book is oriented so that one does NOT have to start at chapter 1, and read sequentially, but instead can START WITH A TUTORIAL. Working through a tutorial, I’ve found in my 40 years of being in education, is the fastest way for a person to learn something new. And this has been confirmed………..I’ve had newcomers to data mining, who have already gotten the HANDBOOK, write me and say: “I’ve gone through a bunch of tutorials, and am finding that I am really learning ‘how to do this’……..I’ve read other books on ‘theory’, but just didn’t get the ‘hang of it’ from those”. My data mining consultants at StatSoft, who travel and work in “real world” situations every day, and who wrote maybe 1/3 of the tutorials in the HANDBOOK, tell me: “A person can go through the TUTORIALS in the HANDBOOK, and know 70% of what we who are doing predictive analytics consulting every day know !!!”

But there are other audiences: Corporate data miners can find it very useful also, as a “way of thinking as a data miner” can be gained from reading the book, as was expressed by one of the Amazon.com 5-STAR reviews: “What I like about this book is that it embeds those methods in a broader context, that of the philosophy and structure of data mining, especially as the methods are used in the corporate world. To me, it was really helpful in thinking like a data miner, especially as it involves the mix of science and art.”

But we’ve had others who have told us they will use it as an extra textbook in their Business Intelligence and Data Mining courses, because of the “richness” of the tutorials. Here’s a comment on the Amazon reviews from a Head of a Business School who has maybe over 100 graduate students doing data mining:

“5.0 out of 5 stars. At last, a useable data mining book”

This is one of the few, of many, data mining books that delivers what it promises. It promises many detailed examples and cases. The companion DVD has detailed cases and also has a real 90 day trial copy of Statistica. I have taught data mining for over 10 years and I know it is very difficult to find comprehensive cases that can be used for classroom examples and for students to actually mine data. The price of the book is also very reasonable especially when you compare the quantity and quality of the material to the typical intro stat book that usually costs twice as much as this data mining book.

The book also addresses new areas of data mining that are under development. Anyone that really wants to understand what data mining is about will find this book infinitely useful.”

So, I think the HANDBOOK will see use in many college classrooms.

Ajay- A question I never get the answer to is which data mining tool is good for what and not so good for what. Could you help me out with this one? What in your opinion, among the data mining and statistical tools used by you in your 40 years in this profession, would you recommend for some uses, and what would you not recommend for other uses (e.g. SAS, SPSS, KXEN, StatSoft, R, etc.)?

Gary- This is a question I can’t answer well; but my book co-author, Robert Nisbet, Ph.D. can. He has used most of these software packages, and in fact has written 2 reviews over the past 6 years in which most of these have been discussed. I like “cutting edge endeavors”, that has been the modus operandi of my ‘career’, so when I took this “semi-retirement position” as a data mining consultant at StatSoft, I was introduced to DATA MINING, as we started developing STATISTICA Data Miner shortly after I arrived. So most of my experience is with STATISTICA Data Miner, which of course has always been rated No. 1 in all the reviews on data miner software done by Dr. Nisbet. I believe this is primarily due to the fact that STATISTICA was written for the PC from the beginning, thus does not have any legacy “main frame computer” coding in its history, and secondly StatSoft has been able to move rapidly to make changes as business and government data analysis needs change, and thirdly and most importantly, STATISTICA products have very “open architecture”, “flexibility”, and “customization” with everything “built together / workable together” as one package. And of course the graphical output is second to none; that is how STATISTICA originally got its reputation. So I find no need of any other software, as if I need a new algorithm, I can program it to work with the “off the shelf” STATISTICA Data Miner algorithms, and thus get anything I need with the full graphical and other outputs seamlessly available.

Ask Bob Nisbet to answer this question, as he has the background to do so.

Ajay- What are the most interesting trends to watch out for in 2009-2010 in data mining, in your opinion?

Gary- Things move so rapidly in this 21st century world, that this is difficult to say. Let me answer this with “hindsight”:

In late October, 2008 I wrote the first draft of Chapter 21 for the HANDBOOK. This was the “future directions of data mining”. You can look in that chapter yourself to find the 4 main areas I decided to focus on. One was on “social networking”, and one of the new examples used was TWITTER. At that time, less than one year ago, no one knew if TWITTER was going to amount to much or not ??? big question? Well, on Jan 15 when the US-AIRWAYS A320 Airbus made an emergency landing in the Hudson River, I got an EMAIL automatic message from CNN [that I subscribe to] telling me that a “plane was down in the Hudson, watch it live” …………I clicked on the live video: The voice from the Helicopter overhead was saying: “We see a plane, half sunk into the water, but no people? What has happened to the people? Are they all dead?………” Well, as it turned out, the CNN Helicopters had spent nearly one hour searching the river for the plane, as had other news agencies. BUT THE “ENTIRE” WORLD ALREADY KNEW !!! … Why? A person on a ferry that was crossing the river close to the crash landing used his iPhone, snapped a photo, uploaded it to TWITPIC and sent a TWITTER message, and this was re-tweeted around the world. The world knew in “seconds to minutes”, whereas the traditional NEWS MEDIA was 1 hour late on the scene, when ALL the PEOPLE had been rescued and were on-shore in a warm building within 45 minutes of the landing. THE TRADITIONAL NEWS MEDIA ARRIVED 15 MINUTES AFTER EVERYTHING HAD HAPPENED !!!! ………AT THIS POINT we ALL KNEW that TWITTER was a new phenomenon ……….and it started growing, with 10,00 people an hour joining at one point in the spring of this year, and who knows what the rate is today. TWITTER has become a most important part not only of “social networking” among friends, but for BUSINESS: companies even sending out “Parts Availability” lists to their dealers, etc.

TWITTER affected Chapter 21…………..I immediately re-wrote Chapter 21, including this first photo of the Hudson Plane crash-landing with all the people standing on the wings. BUT, that was not the end of this story: By the time the book was about to go to press, TWITTER had decided that “ownership” of uploaded photos resided with the photographer, and the person who took this original US-AIRWAYS PEOPLE ON THE WINGS photo wanted $600 for us to publish it in the HANDBOOK. So, I re-wrote again [the chapter was already “set” in page proofs……….so we had to make the changes directly at the printer]………this time finding another photo uploaded to social media, but in this case the person had “checked” the box to put the photo in public domain.

So TWITTER is one that I predicted would become important, but I’d thought it would be months AFTER the HANDBOOK was released in May, not last January!!!

Other things we presented in Chapter 21 about the “future of data mining” involved “photo / image recognition”, among others. The “Image Recognition”, and more importantly “movement recognition / analysis” for things like Physical Therapy and other medical areas, may be slower to evolve and fully develop, but are immensely important. The ability to analyze such “Three-dimensional movement data” is already available in rudimentary form in our version 9 of STATISTICA [just released in June], and anyone could implement it fully with MACROS, but it probably will be some time before it is fully feasible from a business standpoint to develop it with fully automatic “point and click” functionality to make it readily accessible for anyone’s use.

Ajay- What would your advice be to a young statistician just starting his research career?

Gary- Make sure you delve in / grab in FULLY to the subject areas……….you need to know BOTH the “domain” of the data you are working with, and “correct” methods of data analysis, especially when using the traditional p-value statistics. Today’s culture is too much on “superficiality”………..good data analysis requires “depth” of understanding. One needs to FOCUS ………good FOCUS can’t be done with elaborate “multi-tasking”. Granted, today’s youth [the “Technology-Inherited”] probably have their brains “wired differently” than the “Technology-Immigrants” like myself [e.g. the older generations], but never-the-less, I see ERRORS all over the place in today’s world, from “typos” in magazines and newspapers, to web page paragraphs, links that don’t work, etc etc ……….and I conclude that this is all due to NON-FOCUSED / MULTI-TASKING people. You can’t drive a car / bus / train and TEXT MESSAGE at the same time ……….the scientific tests that have been conducted show that it takes 20-times as long for a TEXT MESSAGING driver to stop, than a driver fully focused on the road, when given a “danger” warning. [Now, maybe this scientific experiment used ALL TECHNOLOGY-IMMIGRANTS as drivers?? If so, the scientific design was “flawed” ……..they should have used BOTH Technology-Immigrants and Technology-Inheritants as participants in the study. Then we’d have 2 independent, or predictor, variables: Age and TEXT MESSAGING…..]

Short Bio-

Professor, 30 years medical research in genetics, DNA, Proteins, Neuropsychology of Schizophrenia and Alzheimer’s Disease……….now semi-retired position as DATA MINING CONSULTANT – SENIOR STATISTICIAN