Here is an interview with Gary Miner, Phd who has been in the data mining business for almost 30 years and a pioneer in healthcare studies pertaining to Alzheimer’s diseases. He is also co author of “the Handbook of Statistical Analysis and Data Mining Applications”. Gary writes on how he has seen data mining change over the years, health care applications as well as his book and quotes from his experience.
Ajay- Describe your career in science starting from college till today. How would you interest young students in science careers today in the mid of the recession
Gary – I knew that I wanted to be in “Science” even before college days, taking all the science and math courses I could in high school. This continued in undergraduate college years at a private college [Hamline University, St. Paul, Minnesota……..older than the State of Minnesota, founded in 1854, and had the first Medical School, later “sold” to the University of Minnesota] as a Biology and Chemistry major, with a minor in education. From there is did a M.S. conducting a “Physiological genetics research project”, and then a Ph.D. at another institution where I worked on Genetic Polymorphisms of Mouse blood enzymes. So through all of this, I had to use statistics to analyze the data. My M.S. was analyzed before the time of even “electronic calculators”, so I used, if you can believe this, a “hand cranked calculator”, rented, one summer to analyze my M.S. dataset. By the time my Ph.D. thesis data was being analyzed, electronic calculators were available, but the big main-frame computers were on college campuses, so I punched the data into CARDS, walked down the hill to the computing center, dropped off the stack of cards, to come back the next day to get “reams of output” on large paper [about 15” by 18”, folded in a stack, if anyone remembers those days …]. I then spent about 30 years doing medical research in academic environments with the emphasis on genetics, biochemistry, and proteonomics in the areas of mental illness and Alzheimer’s Disease, which became my main area of study, publishing the first book in 1989 on the GENETICS OF ALZHEIMER’S DISEASE.
Today, in my “semi-retirement careers”, one side-line outreach is working with medical residents on their research projects, which I’ve been doing for about 7 or 8 years now. This involves design of the research project, data collection, and most importantly “effective and accurate” analysis of the datasets. I find this a way I can reach out to the younger generation to interest them not only in “science”, but in doing “science correctly”. As you probably know, we are in the arena of the “Duming of America”; anti-science, if you wish. I’ve seen this happening for at least 30 years, during the 1980’s, 1990’s, and continuing into this Century. Even the medical residents I get to work with each year have been going “downhill” yearly in their ability to “problem solve”. I believe this is an effect of this “dumning of America”.
There are several books coming out on this Dumning of America this summer; one the first week of June, another on July 12, and another in September [see the attached PPT for slides with the covers of these 3 books}. It is a real problem, as Americans over the past few decades have moved towards “wanting simple answers”, and most things in the “real world”, e.g. reality are not simple………..that’s where Science comes in.
A recent 2008 study done by the School of Public Health at Ohio University showed that up to 88% of the published scientific papers in a top respected cancer journal either used statistics INCORRECTLY, and/or the CONCLUSION was INCORRECT. When I and my wife both did Post-Docs in Psychiatric Epidemiology in 1980-82, basically doing an MPH, the first words out of the mouth of the “Biostats – Epidemiology” professor in the first lecture to the incoming MPH students was “We might as well through out most of the medical research literature of the past 25 years, as it has either not been designed correctly or statistics have been used incorrectly”!!! ……That caught my attention. And following medical research [and medicine in general] I can tell you that “not much has changed in the past 25 years since then”, and thus that puts us “50 years behind in medical research” and medicine. ANALOGY: If some of our major companies, that are successfully using predictive analytics to organize and efficiently run their organizations, took on the “mode of operation” of medicine and medical research, they’d be “bankrupt” in 6 months” …. That’s what I tell my students.
Ajay- Describe some of the exciting things data mining can do to lower health care costs and provide more people with coverage.
Gary- As mentioned above, my personal feeling is that “medicine / health care” is 50 years “behind the times”, compared to the efficiency needed to successfully survive in this Global Economy; corporations and organizations like Wal-Mart, INTEL, many of our Pharmaceutical Companies, have used data mining / predictive analytics to survive successfully. Wal-Mart especially: Wal-Mart has it’s own set of data miners, and were writing their own procedures in the early 1990’s ………..before most of us ever heard of data mining; that is why Wal-Mart can go into China today, and open a store in any location, and know almost to 99% accuracy 1) how many check out stand needed, 2) what products to stock, 3) where in the store to stock them, and 4) what their profit margin will be. They have done this through very accurate “Predictive Analytics” modeling.
Other “ingrained” USA corporations have NOT grabbed onto this “most accurate” technology [e.g. predictive analytics modeling], and reaping the “rewards” of impending bankruptcy and disappearance today. Examples in the news, of course, our our 3 – big automakers in Detroit. If they had engaged effective data mining / modeling in the late 1990’s they could have avoided their current problems. I see the same for many of our oldest and larges USA Insurance Companies………..they are “middle management fat”, and I’ve seen their ratings go down over the past 10 years from an A rating to even a C rating [for the company in which I have my auto insurance ? you might ask me why I stay? …. An agent who is a friend, BUT it is frustrating, and this companies “mode of operation” is completely “customer un-friendly”.], while new insurance companies have “grabbed” onto modern technology, and are rising stars.
So my influence on the younger generation is to have my students do research and DATA ANALYSIS correctly.
Ajay- Describe your book ” HANDBOOK OF STATISTICAL ANALYSIS & DATA MINING APPLICATIONS”. Who would be the target audience of this and can corporate data miners gain from it as well.
Gary- There are several target audiences: The main audience we were writing for, after our Publisher looked at what “niches” had been un-met in data mining literature, was for the professional in smaller and middle sized businesses and organizations that needed to learn about “data mining / predictive analytics” “fast”…..e.g. maybe situations where the company did not have a data anlaysis group using predictive analytics, but the CEO’s and Professionals in the company knew they needed to learn and start using predictive analytics to “stay alive”. This seemed like potentially a very large audience. The book is oriented so that one does NOT have to start at chapter 1, and read sequentially, but instead can START WITH A TUTORIAL. Working through a tutorial, I’ve found in my 40 years of being in education, is the fastest way for a person to learn something new. And this has been confirmed………..I;ve had newcomers to data mining, who have already gotten the HANDBOOK, write me and say: “I’ve gone through a bunch of tutorials, and finding that I am really learning ‘how to do this’……..I’ve ready other books on ‘theory’, but just didn’t get the ‘hang of it’ from those”. My data mining consultants at StatSoft, who travel and work in “real world” situations every day, and who wrote maybe 1/3 of the tutorials in the HANDBOOK, tell me: “A person can go through the TUTORIALS in the HANDBOOK, and know 70% of what we who are doing predictive analytics consulting every day know !!!”
But there are other audiences: Corporate data miners can find it very useful also, as a “way of thinking as a data miner” can be gained from reading the book, as was expressed by one of the Amazon.com 5-STAR reviews: “What I like about this book is that it embeds those methods in a broader context, that of the philosophy and structure of data mining, especially as the methods are used in the corporate world. To me, it was really helpful in thinking like a data miner, especially as it involves the mix of science and art.”
But we’ve had others who have told us they will use is as an extra textbook in their Business Intelligence and Data Minng courses, because of the “richness” of the tutorials. Here’s a comment on the Amazon reviews from a Head of Business School who has maybe over 100 graduate students doing data mining:
“5.0 out of 5 stars. At last, a useable data mining book”
This is one of the few, of many, data mining books that delivers what it promises. It promises many detailed examples and cases. The companion DVD has detailed cases and also has a real 90 day trial copy of Statistica. I have taught data mining for over 10 years and I know it is very difficult to find comprehensive cases that can be used for classroom examples and for students to actually mine data. The price of the book is also very reasonable expecially when you compare the quantity and quality of the material to the typical intro stat book that usually costs twice as much as this data mining book.
The book also addresses new areas of data mining that are under development. Anyone that really wants to understand what data mining is about will find this book infinitively useful.”
So, I think the HANDBOOK will see use in many college classrooms.
Ajay- A question I never get the answer to is which data mining tool is good for what and not so good for what. Could you help me out with this one? What in your opinion, among the data mining and statistical tools used by you in your 40 years in this profession would you recommend for some uses, and what would you not recommend for other uses ( eg SAS,SPSS,KXEN,Statsoft,R etc etc)
Gary- This is a question I can’t answer well; but my book co-author, Robert Nisbet, Ph.D. can. He has used most of these softwares, and in fact has written 2 reviews over the past 6 years in which most of these have been discussed. I like “cutting edge endeavors”, that has been the modus operandi of my ‘career’, so when I took this “semi-retirement postion” as a data mining consultant at StatSoft, I was introduced to DATA MINING, as we started developing STATISTICA Data Miner shortly after I arrived. So most of my experience is with STATISTICA Data Miner, which of course has always been rated NO 1 in all the reviews on data miner software done by Dr. Nisbet – I believe this is primarily due to the fact that STATISTICA was written for the PC from the beginning, thus dos not have any legacy “main frame computer” coding in its history, and secondly StatSoft has been able to move rapidly to make changes as business and government data analysis needs change, and thirdly and most importantly, STATISTICA products have very “open architecture”, “flexibility”, and “customization” with every “built together / workable together” as one package. And of course the graphical output is second to none – that is how STATISTICA originally got its reputation. So I find no need of any other software, as if I need a new algorithm, I can program it to work with the “off the shelf” STATISTICA Data Miner algorithms, and thus get anything I need with the full graphical and other outputs seamlessly available.
Ask Bob Nisbet to answer this question, as he has the background to do so.
Ajay- What are the most interesting trends to watch out for in 2009-2010 in data mining in your opinion.
Gary- Things move so rapidly in this 21st century world, that this is difficult to say. Let me answer this with “hindsight”:
In late October, 2008 I wrote the first draft of Chapter 21 for the HANDBOOK. This was the “future directions of data mining”. You can look in that chapter yourself to find the 4 main areas I decided to focus on. One was on “social networking”, and one of the new examples used was TWITTER. At that time, less than one year ago, no one knew if TWITTER was going to amount to much or not ??? big question? Well, on Jan 14 when the US-AIRWAYS A320 Airbus made an emergency landing in the Hudson River, I got an EMAIL automatic message from CNN [that I subscribe to] telling me that a “plane was down in the Hudson, watch it live” …………I click on the live video: The voice form the Helicopter overhead was saying: “We see a plane, half sunk into the water, but no people? What has happened to the people? Are they all dead?………” Well, as it turned out, the CNN Helicopters had spend nearly one hour searching the river for the plane, as had other news agencies. BUT THE “ENTIRE” WORLD ALREADY KNEW !!! … Why? A person on a ferry that was crossing the river close to the crash landing used his I-Phone, snaped a photo, uploaded it to TWIT-PIX and sent a TWITTER message, and this was re-tweeted around the world. The world knew in “seconds to minutes” to which the traditional NEWS MEDIA was 1 hour late on the scene, when ALL the PEOPLE had been rescued and were on-shore in a warm building within 45 minutes of the landing. THE TRADITIONAL NEWS MEDIA ARRIVED 15 MINUTES AFTER EVERYTHING HAD HAPPENED !!!! ………AT THIS POINT we ALL KNEW that TWITTER was a new phenomenon ……….and it started growing, with 10,00 people an hour joining at one point in last spring of this year, and who knows what the rate is today. TWITTER has become a most important part not only of “social networking” among friends, but for BUSINESS —- companies even sending out ‘Parts Availability” lists to their dealers, etc.
TWITTER affected Chapter 21…………..I immediately re-wrote Chapter 21, including this first photo of the Hudson Plane crash-landing with all the people standing on the wings. BUT, not the end of this story: By the time the book was about to go to press, TWITTER had decided that “ownership” of uploaded photos resided with the photographer, and the person who took this original US-AIRBUS – PEOPLE ON THE WINGS photo wanted $600 for us to publish it in the HANDBOOK. So, I re-wrote again [the chapter was already “set” in page proofs……….so we had to make the changes directly at the printer]………this time finding another photo uploaded to social media, but in this case the person had “checked” the box to put the photo in public domain.
So TWITTER is one that I predicted would become important, but I’d thought it would be months AFTER the HANDBOOK was released in May, not last January!!!
Other things we presented in Chapter 21 about the “future of data mining” involved “photo / image recognition”, among others. The “Image Recognition”, and more importantly “movement recognition / analysis” for things like Physical Therapy and other medical areas may be more slow to evolve and fully develop, but are immensely important. The ability to analyze such “Three-dimensional movement data” is already available in rudimentary form in our version 9 of STATISTICA [just released in June], and anyone could implement it fully with MACROS, but it probably will be some time before it is fully feasible from a business standpoint to develop it with fully automatic “point and click” functionality to make it readily accessible for anyone’s use.
Ajay What would your advice be to a young statistician just starting his research career.
Gary- Make sure you delve in / grab in FULLY to the subject areas……….you need to know BOTH the “domain” of the data you are working with, and “correct” methods of data analysis, especially when using the traditional p-value statistics. Today’s culture is too much on “superficiality”………..good data analysis requires “depth” of understanding. One needs to FOCUS ………good FOCUS can’t be done with elaborate “multi-tasking”. Granted, today’s youth [the “Technology-Inherited”] probably have their brains “wired differently” than the “Technology-Immigrants” like myself [e.g. the older generations], but never-the-less, I see ERRORS all over the place in today’s world, from “typos” in magazine and newspaper, to web page paragraphs, links that don’t work, etc etc ……….and I conclude that this is all do to NON-FOCUSED / MULTI-TASKING people. You can’t drive a car / bus / train and TEXT MESSAGE at the same time ……….the scientific tests that have been conducted show that it takes 20-times as long for a TEXT MESSAGING driver to stop, than a driver fully focused on the road, when given a “danger” warning. [Now, maybe this scientific experiment used ALL TECHNOLOGY-IMMIGRANTS as drivers?? If so, the scientific design was “flawed” ……..they should have used BOTH Technology-Immigrants and Technology-Inheritants as participants in the study. Then we’d have 2 dependent, or target variables: Age and TEXT MESSAGING…..]
Professor, 30 years medical research in genetics, DNA, Proteins, Neuropsychology of Schizophrenia and Alzheimer’s Disease……….now semi-retired position as DATA MINING CONSULTANT – SENIOR STATISTICIAN