Interview Dr Usama Fayyad Founder Open Insights LLC

Here is an interview with Dr Usama Fayyad, founder of Open Insights LLC (www.open-insights.com). Prior to this he was Yahoo!'s Chief Data Officer, where he built the data teams and infrastructure to manage the 25 terabytes of data per day that resulted from the company's operations.

 


Ajay- Describe your career in science. How would you motivate young people today to take up science careers rather than other careers?
Dr Fayyad-
My career started out in science and engineering. My original plan was to be in research and to become a university professor. Indeed, my first few jobs were strictly in basic research. After doing summer internships at places like GM Research Labs and JPL, my first full-time position was at the NASA Jet Propulsion Laboratory, California Institute of Technology.

I started in research in Artificial Intelligence for autonomous monitoring and control, and in Machine Learning and data mining. The first major success was with Caltech astronomers on using machine learning classification techniques to automatically recognize objects in a large sky survey (POSS-II – the 2nd Palomar Observatory Sky Survey). The survey consisted of taking high-resolution images of the entire northern sky. The images, when digitized, contain over 2 billion sky objects. The main problem is to recognize whether an object is a star or a galaxy. For "faint objects" – which constitute the majority of objects – this was an exceedingly hard problem that people wrestled with for 30 years. I was surprised how well the algorithms could do at solving it.

This was a real example of data sets where the dimensionality is so high that algorithms are better suited at solving the problem than humans – even well-trained astronomers. Our methods had over 94% accuracy on faint objects that no one could previously classify at better than 75% accuracy. This additional accuracy made all the difference in enabling all sorts of new science, discoveries, and theories about the formation of large-scale structure in the Universe.
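
To make the classification idea concrete, here is a minimal sketch in Python using scikit-learn. It only illustrates training and honestly evaluating a star/galaxy classifier on per-object image features; the data is synthetic and the features, model choice, and accuracy are placeholders, not the actual survey pipeline described above.

```python
# Toy sketch of star/galaxy classification on per-object image features.
# Synthetic data only; feature meanings (brightness, ellipticity, etc.) are hypothetical.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n = 5000
X = rng.normal(size=(n, 8))   # stand-in for measured features: brightness, area, moments...
# synthetic label: 0 = star, 1 = galaxy, driven by a couple of the features plus noise
y = (X[:, 0] + 0.5 * X[:, 3] + rng.normal(scale=0.5, size=n) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
clf = DecisionTreeClassifier(max_depth=6).fit(X_train, y_train)
print("held-out accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```
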
The success of this work and its wide recognition in the scientific and engineering communities led to the creation of a new group – I founded and managed the Machine Learning Systems group at JPL, which went on to address hard problems in object recognition in scientific data – mostly from remote sensing instruments – like Magellan images of the planet Venus (we recognized and classified over a million small volcanoes on the planet in collaboration with geologists at Brown University) and Earth Observing System data, including atmospherics and storm data.
At the time, Microsoft was interested in figuring out data mining applications in the corporate world, and after a long recruiting cycle they got me to join the newly formed Microsoft Research as a Senior Researcher in late 1995. My work there focused on algorithms, database systems, and basic science issues in the newly formed field of Data Mining and Knowledge Discovery. We had just finished publishing a nice edited collection of chapters in a book that became very popular, and I had agreed to become the founding Editor-in-Chief of a brand new journal called Data Mining and Knowledge Discovery. This journal today is the premier scientific journal in the field. My research work at Microsoft led to several applications – especially in databases. I founded the Data Mining & Exploration group at MSR and later a product group in SQL Server that built and shipped the first integrated data mining product in a large-scale commercial DBMS – SQL Server 2000 (Analysis Services). We created extensions to the SQL language (that we called DMX) and tried to make data mining mainstream. I really enjoyed the life of doing basic research as well as having a real product group that built and shipped components in a major DBMS.
That's when I learned that the really challenging problems in the real world were not in data mining but in getting the data ready and available for analysis – Data Warehousing was a field littered with failures and data stores that were write-only (meaning data never came out!) – I used to call these Data Tombs at the time, and I likened them to the pyramids of Ancient Egypt: great engineering feats to build, but really just tombs.

In 2000 I decided to leave the world of research at Microsoft to do my first venture-backed start-up company – digiMine. The company wanted to solve the problem of managing the data and performing data mining and analysis over data sets, and we targeted a model of hosted data warehouses and mining applications as an ASP – one of the first Software as a Service (SaaS) firms in that arena. This began my transition from the world of research and science to business and technology. We focused on online data and web analytics since the data volumes there were about 10x the size of transactional databases and most companies did not know how to deal with all that data. The business grew fast and so did the company – reaching 120 employees in about 1 year.

After 3 years of running a high-growth start-up and raising some $50 million in venture capital for the company, I was beginning to feel the itch again to do technical work.
In June 2003, we had a chance to spin off part of the business that was focused on difficult high-end data mining problems. This opportunity was exactly what I needed, and we formed DMX Group as a spin-off company that had a solid business from its first day. At DMX Group I got to work on some of the hardest data mining problems in predicting sales of automobiles, churn of wireless users, financial scoring and credit risk analysis, and many related deep business intelligence problems.

Our client list included many of the Fortune 500 companies. One of these clients was Yahoo! – after 6 months of working with Yahoo! as a client, they decided to acquire DMX Group and use the people to build a serious data team for Yahoo!. We negotiated a deal that got about half the employees into Yahoo! and we spun off the rest of DMX Group to continue focusing on consulting work in data mining and BI. I thus became the industry's first Chief Data Officer.

The original plan was to spend 2 years or so to help Yahoo! form the right data teams and build the data processing and targeting technology to deliver high value from its inventory of ads.
Yahoo! proved to be a wonderful experience and I learned so much about the Internet. I also learned that even someone like me, who worked on Internet data from the early days of MSN (in 1996) and who ran a web analytics firm, still had not scratched the surface of the depth of the area. I learned a lot about the Internet from Jerry Yang (Yahoo! co-founder), much about the advertising/media business from Dan Rosensweig (COO) and Terry Semel (then CEO), and lots about technology management and strategic deal-making from Farzad (Zod) Nazem, who was the CTO.

As Executive VP at Yahoo!, I built one of the industry's largest and best data teams, and we were able to process over 25 terabytes of data per day and power several hundred million dollars of new revenue for Yahoo! resulting from these data systems. A year after joining Yahoo! I was asked to form a new research lab to study much of what we did not understand about the Internet. This was yet another return of basic research into my life. I founded Yahoo! Research to invent the new sciences of the Internet, and I wanted it to be focused on only four areas (the idea of focus came from my exposure to Caltech and its philosophy of picking a few areas of excellence). The goal was to become the best research lab in the world in these new focused areas. Surprisingly, we did it within 2 years. I hired Prabhakar Raghavan to run Research and he did a phenomenal job in building out the Research organization. The four areas we chose were: Search and Information Navigation, Community Systems, Micro-economics of the Web, and Computational Advertising. We were able to attract the top talent in the world to lead or work on these emerging areas. Yahoo! Research was a success in basic research but also in influencing product. The chief scientists for all the major areas of company products all came from Yahoo! Research and all owned the product development agenda and plans: Raghu Ramakrishnan (CS for Audience), Andrew Tomkins (CS for Search), Andrei Broder (CS for Monetization) and Preston McAfee (CS for Marketplaces/Exchanges). I consider this an unprecedented achievement in the world of research in general: excellence in basic research and huge impact on company products, all within 3-4 years.
I have recently left Yahoo! and started Open Insights (www.open-insights.com) to focus on data strategy and helping enterprises realize the value of data, develop the right data strategies, and create new business models. Sort of an "outsourced version" of my Chief Data Officer job at Yahoo!
Finally, on my advice to young people: it is not just about science careers, I would call it engineering careers. My advice to any young person, in fact, whether they plan to become a business person, a medical doctor, an artist, a lawyer, or a scientist – basic training in engineering and abstract problem solving will be a huge asset. Some of the best lawyers, doctors, and even CEOs started out with engineering training.
For those young people who want to become scientists, my advice is to always look for real-world applications where the research can be conducted in their context. The reason for that is technical and sociological. From a technical perspective, the reality of an application and the fact that things have to work force a regimen of technical discipline and make sure that the new ideas are tested and challenged. Socially, working on a real application forces interactions with people who care about the problem and provides continuous feedback, which is really crucial in guiding good work (even if scientists deny this, social pressure is a big factor) – it also ensures that your work will be relevant and will evolve in relevant directions. I always tell people who are seeking basic research: "some of the deepest fundamental science problems can often be found lurking in the most mundane of applications". So embrace applied work but always look for the abstract deep problems – that summarizes my advice.
Ajay- What are the challenges of running data mining for a big, big website?
Dr Fayyad-
There are many challenges. Most algorithms will not work due to scale. Also, most of the problems have an unusually high dimensionality – so simple tricks like sampling won't work. You need to be very clever about how you sample and how you reduce dimensionality by applying the right variable transformations.
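
One common transformation for taming very wide, sparse web data is random projection, which approximately preserves pairwise distances (the Johnson-Lindenstrauss idea) while cutting the number of variables drastically. The sketch below is a generic illustration on synthetic sparse data with an arbitrary target dimension, not a description of Yahoo!'s actual pipeline.

```python
# Reduce a 100,000-dimensional sparse matrix to 512 dimensions with a sparse random projection.
import numpy as np
from scipy import sparse
from sklearn.random_projection import SparseRandomProjection

X = sparse.random(10_000, 100_000, density=1e-4, random_state=1, format="csr")

proj = SparseRandomProjection(n_components=512, random_state=1)
X_small = proj.fit_transform(X)      # 100,000 -> 512 dimensions, distances roughly preserved
print(X_small.shape)
```
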

The variety of problems is huge, and the fact that the Internet is evolving and growing rapidly means that the problems are not fixed or stationary. A solution that works well today will likely fail in a few months – so you need to always innovate and always look at new approaches. Also, you need to build automated tools to help detect changes and address them as soon as they arise.
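
A very simple version of such automated change detection is to compare a feature's recent distribution against a reference window and raise a flag when they diverge. The sketch below uses a two-sample Kolmogorov-Smirnov test on synthetic data; real web-scale monitoring would be considerably more elaborate.

```python
# Flag a distribution shift in one feature by comparing a recent window to a reference window.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(2)
reference = rng.normal(loc=0.0, scale=1.0, size=50_000)   # e.g. last month's feature values
current = rng.normal(loc=0.3, scale=1.1, size=50_000)     # this week's values (drifted on purpose)

stat, p_value = ks_2samp(reference, current)
if p_value < 0.01:
    print(f"distribution shift detected (KS={stat:.3f}); re-examine or retrain the model")
```
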

Problems with 1,000, 10,000 or even millions of variables are very common in web challenges. Finally, whatever you do needs to work fast or else you will not be able to keep up with the data flux. Imagine falling behind on processing 25 terabytes of data per day. If you fall behind by two days, you will never be able to catch up again! Not within any reasonable budget constraint. So you try never to go down.
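
A quick back-of-the-envelope calculation shows why falling behind is so dangerous. The 25 TB/day figure comes from the interview; the processing headroom below is a hypothetical assumption.

```python
# Catch-up time = backlog / (capacity - arrival rate); numbers other than 25 TB/day are illustrative.
arrival_tb_per_day = 25.0
capacity_tb_per_day = 27.5             # assume only ~10% processing headroom
backlog_tb = 2 * arrival_tb_per_day    # two days behind

days_to_catch_up = backlog_tb / (capacity_tb_per_day - arrival_tb_per_day)
print(days_to_catch_up)                # 20 days of flawless operation just to clear the backlog
```
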
Ajay- What are the 5 most important things that the data miner should avoid in doing analysis?

Dr Fayyad- I never thought about this in terms of a top 5, but here are the big ones that come to mind, not necessarily in any order:
a. The algorithm knows nothing about the data, and the knowledge of the domain is in the heads of the domain experts. As I always say, an ounce of knowledge is worth a ton of data – so seek and model what the experts know or your results will look silly.
b. Don't let an algorithm fish blindly when you have lots of data. Use what you know to reduce the dimensionality quickly. The curse of dimensionality is never to be underestimated.
c. Resist the temptation to cheat: selecting training and test sets can easily fool you into thinking you have something that works. Test it honestly against new data, and never "peek" at the test data – what you see will force you to cheat without knowing it (see the sketch after this list).
d. Business rules typically dominate data mining accuracy, so be sure to incorporate the business and legal constraints into your mining.
e. I have never seen a large database in my life that came from a static distribution that was sampled independently. Real databases grow to be big through lots of systematic changes and biases, and they are collected over years from a changing underlying distribution: segmentation is a prerequisite to any analysis. Most algorithms assume that data is IID (independent and identically distributed).
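
As a small illustration of point (c), the sketch below (generic scikit-learn code on synthetic data, not anything specific to this interview) holds out a test set once and touches it only for the final estimate; all tuning happens via cross-validation on the training portion.

```python
# Honest evaluation: one up-front holdout, tuning only on the training split, a single final test score.
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
X = rng.normal(size=(2000, 20))
y = (X[:, 0] - X[:, 5] + rng.normal(scale=1.0, size=2000) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=3)

search = GridSearchCV(LogisticRegression(max_iter=1000), {"C": [0.01, 0.1, 1, 10]}, cv=5)
search.fit(X_train, y_train)                              # all model selection happens here
print("honest estimate:", search.score(X_test, y_test))   # the test data is used exactly once
```
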

Ajay- Do you think software like Hadoop and MapReduce will change online databases permanently? What further developments do you see in this area?


Dr Fayyad-
I think they will change (and already have changed) the landscape dramatically, but they do not address everything. Many problems lend themselves naturally to Map-Reduce and many new approaches are enabled by Map-Reduce. However, there are many problems where M-R does not do much. I see a lot of problems being addressed by a large grid nowadays when they don't need it. This is often a huge waste of computational resources. We need to learn how to deal with a mix of tools and platforms. I think M-R will be with us for a long time and will be a staple tool – but not a universal one.
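
For readers new to the model, here is a toy, single-machine illustration of the Map-Reduce pattern being discussed: an independent per-record map step followed by a grouped reduce. It is plain Python, not Hadoop, and is only meant to show which shape of problem fits the paradigm naturally.

```python
# Word count as map (emit key-value pairs per record) + reduce (aggregate per key).
from collections import defaultdict

documents = ["the quick brown fox", "the lazy dog", "the quick dog"]

# map: emit (word, 1) pairs independently for each document
mapped = [(word, 1) for doc in documents for word in doc.split()]

# shuffle + reduce: group by key and sum the counts
counts = defaultdict(int)
for word, n in mapped:
    counts[word] += n

print(dict(counts))   # e.g. {'the': 3, 'quick': 2, 'brown': 1, ...}
```
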
Ajay- I look forward to the day when I have just a low-priced netbook and a fast internet connection, can upload a gigabyte of data, and run advanced analytics in the browser. How soon do you think this will be possible?
Dr Fayyad- Well, I think the day is already here. In fact, much of our web search today is conducted exactly in that model. A lot of web analysis, and much of scientific analysis, is done like this today.
Ajay- Describe some of the conferences you are currently involved with and the research areas that excite you the most.
Dr Fayyad-
I am still very involved in knowledge discovery and data mining conferences (especially the KDD series), machine learning, some statistics, and some conferences on search and the Internet. The most exciting conferences for me are the ones that cover a mix of topics but that address real problems. Examples include understanding how social networks evolve and behave, understanding dimensionality reductions (like random projections in very high-D spaces), and generally any work that gives us insight into why a particular technique works better and where the open challenges are.
Ajay- What are the next breakthrough areas in data mining? Can we have a Google or Yahoo! in the field of business intelligence as well, given its huge market potential and uncertain ROI?
Dr Fayyad- We already have some large and healthy businesses in BI and quite a huge industry in consulting. If you are asking particularly about the tools market then I think that market is very limited. The users of analysis tools are always going to be small in number. However, once the BI and Data Mining tools are embedded in vertical applications, then the number of users will be tremendous. That’s where you will see success.
Consider the examples of Google or Yahoo! – and now Microsoft with the Bing search engine. Search engines today would not be good without machine learning/data mining technology. In fact MLR (Machine Learned Ranking) is at the core of the ranking methodology that decides which search results bubble to the top of the list. The typical web query is 2.6 keywords long and has about a billion matches. What matters are the top 10. The function that determines these is a relevance ranking algorithm that uses machine learning to tune a formula that considers hundreds or thousands of variables about each document. So in many ways, you have a great example of this technology being used by hundreds of millions of people every day – without knowing it!
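
As a toy illustration of machine-learned ranking (and emphatically not the actual Yahoo! or Bing ranking system), here is a minimal pointwise sketch: a regression model is trained on per-(query, document) features against relevance labels, and candidate documents for a new query are sorted by predicted score. All data and features here are synthetic placeholders.

```python
# Pointwise learning-to-rank sketch: fit a scorer on (query, document) features, then sort candidates.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(4)
X = rng.normal(size=(5000, 12))                  # stand-ins for text match, link scores, freshness...
relevance = X[:, 0] + 0.3 * X[:, 1] + rng.normal(scale=0.2, size=5000)

ranker = GradientBoostingRegressor().fit(X, relevance)

candidates = rng.normal(size=(1000, 12))         # documents matching one query
top10 = np.argsort(-ranker.predict(candidates))[:10]
print("indices of top-10 documents:", top10)
```
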
Success will be in applications where the technology becomes invisible – much like the internal combustion engine in your car or the electric motor in your coffee grinder or power supply fan. I think once people start building verticalized solutions that embed data mining and BI, we will hit success. This already has happened in web search, in direct marketing, in advertising targeting, in credit scoring, in fraud detection, and so on…

Ajay- What do you do to relax? What publications would you recommend for data mining people, especially the younger analysts, to stay up to date?
Dr Fayyad-
My favorite activity is sleep, when I can get it :) But more seriously, I enjoy reading books, playing chess, skiing (on water or snow – downhill or cross-country), or any activities with my kids. I swim a lot and that gives me much time to think and sort things out.
I think for keeping up with the technical advances in data mining: the KDD conferences, some of the applied analytics conferences, the WWW conferences, and the data mining journals. The ACM SIGKDD publishes a nice newsletter called SIGKDD Explorations. It comes with a very low membership fee and has a lot of announcements and survey papers on new topics and important areas (www.kdd.org). Also, a good list to keep up with is an email list called KDnuggets, edited by Gregory Piatetsky-Shapiro.
 

Biography (www.fayyad.com/usama)-

Usama Fayyad founded Open Insights (www.open-insights.com) to deliver on the vision of bridging the gap between data and insights and to help companies develop strategies and solutions not only to turn data into working business assets, but to turn the insights available from the growing amounts of data into critical components of an enterprise's strategy for approaching markets, dealing with competitors, and acquiring and retaining customers.

In his prior role as Chief Data Officer of Yahoo! he built the data teams and infrastructure to manage the 25 terabytes of data per day that resulted from the company’s operations. He also built up the targeting systems and the data strategy for how to utilize data to enhance revenue and to create new revenue sources for the company.

In addition, he was the founding executive for Yahoo! Research, a scientific research organization that became the top research place in the world working on inventing the new sciences of the Internet.

Interview Karen Lopez Data Modeling Expert


Here is an interview with Karen Lopez who has worked in data modeling for almost three decades and is a renowned data management expert in her field.

Data professionals need to know about the data domain in addition to the data structure domain – Karen Lopez

Ajay- Describe your career in science. How would you persuade younger students to take more science courses?

Karen- I've always had an interest in science and I attribute that to the great science teachers I had. I studied information systems at Purdue University through a unique program that focuses on systems analysis and computer technologies. I'm one of the few who studied data and process modeling in an undergraduate program 25+ years ago.

I believe that it is very important that we find a way of attracting more scientists to teach. In both the natural and computer sciences, it’s difficult for institutions to tempt scientists away from professional positions that offer much greater compensation. So I support programs that find ways to make that happen.

Ajay- If you had to give advice to a young person starting their career in BI, in just three points – what would they be?

Karen- Wow. It’s tough to think of just three things, but these are recommendations that I make often:

– Remember that every design decision should be made based on cost, benefit, and risk. If you can’t clearly describe these for every side of a decision, then you aren’t doing design; you are guessing.

– No one besides you is responsible for advancing your skills and keeping an eye on emerging practices. Don't expect your employer to lay out a career plan that is in your best interest. That's not their job. Data professionals need to know about the data domain in addition to the data structure domain. The best database or data warehouse design in the world is worse than useless if how the data is processed is wrong. Remember to expand your knowledge about data, not just the data structures and tools.

– All real-world work involves collaboration and negotiation. There is no one right answer that works for every situation. Building your skills in these areas will pay off significantly.

Ajay- What do you think is the best way for a technical consultant and client to be on the same page regarding requirements? Which methodology or template have you used, and which has given you the most success?

Karen- While I’m a huge fan of modeling (data modeling and other modeling), I still think that giving clients a prototype or mockup of something that looks real to them goes a long way. We need to build tools and competencies to develop these prototypes quickly. It’s a lost art in the data world.

Ajay- What are the special incentives that make Canada a great place for tech entrepreneurs rather than, say, the United States? (Disclaimer: I have family in Canada and study in the US.)

Karen- I prefer not to think of this as an either-or decision. I immigrated to Canada from the US about 15 years ago, but most of our business is outside of Canada. I have enjoyed special incentives here in Canada for small businesses as well as special programs that allowed me to work in Canada as a technical professional before I moved here permanently.

Overall, I have found Canadian employers more open to sponsoring foreign workers and it is easier for them to do so than what my US clients experience. Having said that, a significant portion of my work over the last few years has been on global projects where we leverage online collaboration tools to meet our goals. The advent of these tools has made it much easier to work from wherever I am and to work with others regardless of their visa statuses.

Where a company forms is less tied to where one lives or works these days.

Ajay- Could you tell us more about the Zachman Framework (apart from the Wikipedia reference)? A practical example of how you used it on an actual project would be great.

Karen- Of course the best resource for finding out about the Zachman Framework is John Zachman himself: http://www.zachmaninternational.com/index.php/home-article/13. He offers some excellent courses and does a great deal of public speaking at government and DAMA events. I highly recommend that anyone interested in the Framework hear about it directly from him.

There are many misunderstandings about John’s intent, such as the myth that he requires big upfront modeling (he doesn’t), that the Framework is a methodology (it isn’t), or that it can only be used to build computer systems (it can be used for more than that).

I have used the Zachman Framework to develop a joint Business-IT Strategic Information Systems Plan as well as to inventory and track progress of multi-project programs. One interesting use was a paper I authored for the Canadian Information Processing Society (CIPS) on how various educational programs, specializations, and certifications map to the Zachman Framework. I later developed a presentation about this mapping for a Zachman conference.

For a specific project, the Zachman Framework allows the business to understand where their enterprise assets are being managed – and how well they are managed. It's not an IT thing; it's an enterprise architecture thing.

Ajay- What does Karen Lopez do for fun when not at work, traveling, speaking or blogging?

Karen- Sometimes it seems that’s all I do. I enjoy volunteering for IT-related organizations such as DAMA and CIPS. I participate in the accreditation of college and university educational programs in Canada and abroad. As a member of data-related standards bodies, namely the Association for Retail Technology Standards and the American Dental Association, I help develop industry standard data models. I’ve also been a spokesperson for a CIPS program to encourage girls to take more math and science courses throughout their student careers so that they may have access to great opportunities in the future.

I like to think of myself as a runner; last year I completed my first half marathon, which I'd never thought was possible. I am studying Hindi and Sanskrit. I'm also addicted to reading and am thankful that some of it I actually get paid to do.

Biography

Karen López is a Senior Project Manager at InfoAdvisors, Inc. Karen is a frequent speaker at DAMA conferences and DAMA Chapters. She has 20+ years of experience in project and data management on large, multi-project programs. Karen specializes in the practical application of data management principles. Karen is also the ListMistress and moderator of the InfoAdvisors Discussion Groups at http://www.infoadvisors.com. You can reach her at www.twitter.com/datachick

Interview John Sall Founder JMP/SAS Institute

Here is an interview with John Sall, inventor of SAS and JMP and co-founder and co-owner of SAS Institute, the largest independent business intelligence and analytics software firm. In a freewheeling and exclusive interview, John talks of the long journey within SAS and his experiences in helping make JMP the data visualization software of choice.
JMP is perfect for anyone who wants to do exploratory data analysis and modeling in a visual and interactive way – John Sall


Ajay- Describe your early science career. How would you encourage today’s generation to take up science and math careers?

John- I was a history major in college, but I graduated into a weak job market. So I went to graduate school and discovered statistics and computer science to be very captivating. Of course, I grew up in the moon-race science generation and was always a science enthusiast.

Ajay- Archimedes leapt out of the bath shouting "Eureka" when he discovered his principle. Could you describe a "Eureka" moment while creating the SAS language when you and Jim Goodnight were working on it?

John- I think that the moments of discovery were more like “Oh, we were idiots” as we kept having to rewrite much of the product to handle emerging environments, like CMS, minicomputers, bitmap workstations, personal computers, Windows, client-server, and now the cloud. Several of the rewrites were even changing the language we implemented it in. But making the commitment to evolve led to an amazing sequence of growth that is still going on after 35 years.

Ajay- Describe the origins of JMP. What specific market segments does the latest release of JMP target?

John- JMP emerged from a recognition of two things: size and GUI. SAS’ enterprise footprint was too big a commitment for some potential users, and we needed a product to really take advantage of graphical interactivity. It was a little later that JMP started being dedicated more to the needs of engineering and science users, who are most of our current customers.

Ajay- What other non-SAS Institute software do you admire or have you worked with? Which areas is JMP best suited for? For which areas would you recommend software other than JMP to customers?

John- My favorite software was the Metrowerks CodeWarrior development environment. Sadly, it was abandoned among various Macintosh transitions, and now we are stuck with the open-source GCC and Xcode. It’s free, but it’s not as good.

JMP is perfect for anyone who wants to do exploratory data analysis and modeling in a visual and interactive way. This is something organizations of all kinds want to do. For analytics beyond what JMP can do, I recommend SAS, which has unparalleled breadth, depth and power in its analytic methods.

Ajay- I have yet to hear of a big academic push for JMP distribution in Asia. Are there any plans to distribute JMP for free or at very discounted prices in academic institutions in countries like India, China or even the rest of the USA?

John- We are increasing our investment in supporting academic institutions, but it has not been an area of strength for us. Professors seem to want the package they learned long ago, the language that is free or the spreadsheet program their business students already have. JMP’s customers do tell us that they wish the universities would train their prospective future employees in JMP, but the universities haven’t been hearing them. Fortunately, JMP is easy enough to pick up after you enter the work world. JMP does substantially discount prices for academic users.

Ajay- What are your views on tech offshoring, given the recession in the United States?

John- As you know, our products are mostly made in the USA, but we do have growing R&D operations in Pune and Beijing that have been performing very well. Even when the software is authored in the US, considerable work happens in each country to localize, customize and support our local users, and this will only increase as we become more service-oriented. In this recession, JMP has still been growing steadily.

Ajay-  What advice would you give to young graduates in this recession? How does learning JMP enhance their prospect of getting a job?

John- Quantitative fields have been fairly resistant to the recession. North Carolina State University, near the SAS campus, even has a Master of Science in Analytics < http://analytics.ncsu.edu/ > to get people job-ready. JMP experience certainly helps get jobs at our major customers.

Ajay- What does John Sall do in his free time, when not creating world-class companies or groovy statistical discovery software?

John- I lead the JMP division, which has been a fairly small part of a large software company (SAS), but JMP is becoming bigger than the whole company was when JMP was started. In my spare time, I go to meetings and travel with the Nature Conservancy <http://www.nature.org/>, North Carolina State University <http://ncsu.edu/>, WWF <http://wwf.org/>, CARE <http://www.care.org/> and several other nonprofit organizations that my wife or I work with.

Official Biography

John Sall is a co-founder and Executive Vice President of SAS, the world’s largest privately held software company. He also leads the JMP business division, which creates interactive and highly visual data analysis software for the desktop.

Sall joined Jim Goodnight and two others in 1976 to establish SAS. He designed, developed and documented many of the earliest analytical procedures for Base SAS® software and was the initial author of SAS/ETS® software and SAS/IML®. He also led the R&D effort that produced SAS/OR®, SAS/QC® and Version 6 of Base SAS.

Sall was elected a Fellow of the American Statistical Association in 1998 and has held several positions in the association’s Statistical Computing section. He serves on the board of The Nature Conservancy, reflecting his strong interest in international conservation and environmental issues. He also is a member of the North Carolina State University (NCSU) Board of Trustees. In 1997, Sall and his wife, Ginger, contributed to the founding of Cary Academy, an independent college preparatory day school for students in grades 6 through 12.

Sall received a bachelor’s degree in history from Beloit College in Beloit, WI, and a master’s degree in economics from Northern Illinois University in DeKalb, IL. He studied graduate-level statistics at NCSU, which awarded him an honorary doctorate in 2003.

About JMP-

Originally nicknamed John's Macintosh Program, JMP is a leading software program for statistical data visualization. Researchers and engineers – whose jobs didn't revolve solely around statistical analysis – needed an easy-to-use and affordable stats program. A new software product, today known as JMP®, was launched in 1989 to dynamically link statistical analysis with the graphical capabilities of Macintosh computers. Now running on all platforms, JMP continues to play an important role in modeling processes across industries as a desktop data visualization tool. It also provides a visual interface to SAS in an expanding line of solutions that includes SAS Visual BI and SAS Visual Data Discovery. Sall remains the lead architect for JMP.

Citation- http://www.sas.com/presscenter/bios/jsall.html

Ajay- I am thankful to John and his marketing communication specialist Arati for this interview. With an increasing focus on data to drive more rational decision making, SAS remains an interesting company to watch in the era of mega-vendors, and any SAS Institute deal or alliance will make potential investment bankers as well as newer customers drool. For previous interviews and coverage of SAS please use www.decisionstats.com/tag/sas

Interview Jim Harris Data Quality Expert OCDQ Blog

Here is an interview with one of the chief evangelists for data quality in the field of Business Intelligence, Jim Harris, who has a renowned blog at http://www.ocdqblog.com/. I asked Jim about his experiences in the field with data quality messing up big-budget BI projects, and some tips and methodologies to avoid them.

No one likes to feel blamed for causing or failing to fix the data quality problems- Jim Harris, Data Quality Expert.


Ajay- Why the name OCDQ? What drives your passion for data quality? Name any anecdotes where bad data quality really messed up a big BI project.

Jim Harris – Ever since I was a child, I have had an obsessive-compulsive personality. If you asked my professional colleagues to describe my work ethic, many would immediately respond: “Jim is obsessive-compulsive about data quality…but in a good way!” Therefore, when evaluating the short list of what to name my blog, it was not surprising to anyone that Obsessive-Compulsive Data Quality (OCDQ) was what I chose.

On a project for a financial services company, a critical data source was applications received by mail or phone for a variety of insurance products. These applications were manually entered by data entry clerks. Social security number was a required field and the data entry application had been designed to only allow valid values. Therefore, no one was concerned about the data quality of this field – it had to be populated and only valid values were accepted.

When a report was generated to estimate how many customers were interested in multiple insurance products by looking at the count of applications per social security number, it appeared as if a small number of customers were interested in not only every insurance product the company offered, but also thousands of policies within the same product type. More confusion was introduced when the report added the customer name field, which showed that this small number of highly interested customers had hundreds of different names. The problem was finally traced back to data entry.

Many insurance applications were received without a social security number. The data entry clerks were compensated, in part, based on the number of applications they entered per hour. In order to process the incomplete applications, the data entry clerks entered their own social security number.

On a project for a telecommunications company, multiple data sources were being consolidated into a new billing system. Concerns about postal address quality required the use of validation software to cleanse the billing address. No one was concerned about the telephone number field – after all, how could a telecommunications company have a data quality problem with telephone number?

However, when reports were run against the new billing system, a high percentage of records had a missing telephone number. The problem was that many of the data sources originated from legacy systems that only recently added a telephone number field. Previously, the telephone number was entered into the last line of the billing address.

New records entered into these legacy systems did start using the telephone number field, but the older records already in the system were not updated. During the consolidation process, the telephone number field was mapped directly from source to target and the postal validation software deleted the telephone number from the cleansed billing address.

Ajay- Data Quality – Garbage in, Garbage out for a project. What percentage of a BI project do you think gets allocated to input data quality? What percentage of final output is affected by the normalized errors?

Jim Harris- I know that Gartner has reported that 25% of critical data within large businesses is somehow inaccurate or incomplete and that 50% of implementations fail due to a lack of attention to data quality issues.

The most common reason is that people doubt that data quality problems could be prevalent in their systems. This “data denial” is not necessarily a matter of blissful ignorance, but is often a natural self-defense mechanism from the data owners on the business side and/or the application owners on the technical side.

No one likes to feel blamed for causing or failing to fix the data quality problems.

All projects should allocate time and resources for performing a data quality assessment, which provides a much needed reality check for the perceptions and assumptions about the quality of the data. A data quality assessment can help with many tasks including verifying metadata, preparing meaningful questions for subject matter experts, understanding how data is being used, and most importantly – evaluating the ROI of data quality improvements. Building data quality monitoring functionality into the applications that support business processes provides the ability to measure the effect that poor data quality can have on decision-critical information.
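
To make the idea of a data quality assessment concrete, here is a small, generic profiling sketch in Python with pandas: it measures per-column completeness and flags suspiciously over-frequent values in a field that should be near-unique (exactly the pattern behind the social security number anecdote above). The column names and data are hypothetical.

```python
# Minimal data quality profiling: completeness per column, plus over-frequent values in a key field.
import pandas as pd

df = pd.DataFrame({
    "ssn": ["111-11-1111", "111-11-1111", "222-22-2222", None, "333-33-3333"],
    "phone": [None, "555-0101", None, "555-0102", None],
})

completeness = df.notna().mean()                 # share of populated values per column
print(completeness)

ssn_counts = df["ssn"].value_counts()
suspicious = ssn_counts[ssn_counts > 1]          # "valid" values reused across many records
print(suspicious)
```
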

Ajay- Companies talk of paradigms like Kaizen, Six Sigma and LEAN for eliminating waste and defects. What technique would you recommend for a company just about to start a major BI project for a standard ETL and reporting project to keep data aligned and clean?

Jim Harris- I am a big advocate for methodology and best practices and the paradigms you mentioned do provide excellent frameworks that can be helpful. However, I freely admit that I have never been formally trained or certified in any of them. I have worked on projects where they have been attempted and have seen varying degrees of success in their implementation. Six Sigma is the one that I am most familiar with, especially the DMAIC framework.

However, a general problem that I have with most frameworks is their tendency to adopt a one-size-fits-all strategy, which I believe is an approach that is doomed to fail. Any implemented framework must be customized to adapt to an organization’s unique culture. In part, this is necessary because implementing changes of any kind will be met with initial resistance, but an attempt at forcing a one-size-fits-all approach almost sends a message to the organization that everything they are currently doing is wrong, which will of course only increase the resistance to change.

Starting with a framework as a reference provides best practices and recommended options of what has worked for other organizations. The framework should be reviewed to determine what can best be learned from it and to select what will work in the current environment and what simply won’t. This doesn’t mean that the selected components of the framework will be implemented simultaneously. All change comes gradually and the selected components will most likely be implemented in phases.

Fundamentally, all change starts with changing people’s minds. And to do that effectively, the starting point has to be improving communication and encouraging open dialogue. This means more of listening to what people throughout the organization have to say and less of just telling them what to do. Keeping data aligned and clean requires getting people aligned and communicating.

Ajay- What methods and habits would you recommend to young analysts starting in the BI field for a quality checklist?

Jim Harris- I always make two recommendations.

First, never make assumptions about the data. I don’t care how well the business requirements document is written or how pretty the data model looks or how narrowly your particular role on the project has been defined. There is simply no substitute for looking at the data.

Second, don’t be afraid to ask questions or admit when you don’t know the answers. The only difference between a young analyst just starting out and an expert is that the expert has already made and learned from all the mistakes caused by being afraid to ask questions or admitting when you don’t know the answers.

Ajay- What does Jim Harris do to have quality time when not at work?

Jim- Since I enjoy what I do for a living so much, it sometimes seems impossible to disengage from work and make quality time for myself. I have also become hopelessly addicted to social media and spend far too much time on Twitter and Facebook. I have also always spent too much of my free time watching television and movies. I do try to read as much as I can, but I have so many stacks of unread books in my house that I could probably open my own book store. True quality time typically requires the elimination of all technology by going walking, hiking or mountain biking. I do bring my mobile phone in case of emergencies, but I turn it off before I leave.

Biography-

Jim Harris is the Blogger-in-Chief at Obsessive-Compulsive Data Quality (OCDQ), an independent blog offering a vendor-neutral perspective on data quality.

He is an independent consultant, speaker, writer and blogger with over 15 years of professional services and application development experience in data quality (DQ) and business intelligence (BI).

Jim has worked with Global 500 companies in finance, brokerage, banking, insurance, healthcare, pharmaceuticals, manufacturing, retail, telecommunications, and utilities. Jim also has a long history with the product that is now known as IBM InfoSphere QualityStage. Additionally, he has some experience with Informatica Data Quality and DataFlux dfPower Studio.

Jim can be followed at twitter.com/ocdqblog and contacted at http://www.ocdqblog.com/contact/


Interview Eric Siegel PhD President Prediction Impact

An interview with Eric Siegel, Ph.D., President of Prediction Impact, Inc. and founding chair of Predictive Analytics World.

Ajay- What does this round of Predictive Analytics World have which was not there in the edition earlier in the year?

Eric- Predictive Analytics World (pawcon.com) – Oct 20-21 in DC delivers a fresh set of 25 vendor-neutral presentations across verticals employing predictive analytics, such as banking, financial services, e-commerce, education, healthcare, high technology, insurance, non-profits, publishing, retail and telecommunications.

PAW features keynote speaker, Stephen Baker, author of The Numerati and Senior writer at BusinessWeek.  His keynote is described at www.predictiveanalyticsworld.com/dc/2009/agenda.php#day2-2

A strong representation of leading enterprises have signed up to tell their stories — speakers will present how predictive analytics is applied at Aflac, AT&T Bell South, Amway, The Coca-Cola Company, Financial Times, Hewlett-Packard, IRS, National Center for Dropout Prevention, The National Rifle Association, The New York Times, Optus (Australian telecom), PREMIER Bankcard, Reed Elsevier, Sprint-Nextel, Sunrise Communications (Switzerland), Target, US Bank, U.S. Department of Defense, Zurich — plus special examples from Anheuser-Busch, Disney, HSBC, Pfizer, Social Security Administration, WestWind Foundation and others.

To see the entire agenda at a glance: www.predictiveanalyticsworld.com/dc/2009/agenda_overview.php

We've added a third workshop, offered the day before (Oct 19), "Hands-On Predictive Analytics." There's no better way to dive in than operating real predictive modeling software yourself – hands-on. For more info: www.predictiveanalyticsworld.com/dc/2009/handson_predictive_analytics.php

Ajay- What do academics, corporations and data miners gain from this conference? List 4 bullet points for the specific gains.

Eric- A. First, PAW’s experienced speakers provide the “how to” of predictive analytics. PAW is a unique conference in its focus on the commercial deployment of predictive analytics, rather than research and development. The core analytical technology is established and proven, valuable as-is without additional R&D — but that doesn’t mean it’s a “cakewalk” to employ it successfully to ensure value is attained.  Challenges include data requirements and preparation, integration of predictive models and their scores into existing organizational systems and processes, tracking and evaluating performance, etc. There’s no better way to hone your skills and cultivate an informed plan for your organization’s efforts than hearing how other organizations did it.

B. Second, PAW covers the latest state-of-the-art methods produced by research labs, and how they provide value in commercial deployment. This October, almost all sessions in Track 2 are at the Expert/Practitioner-level.  Advanced topics include ensemble models, uplift modeling (incremental modeling), model scoring with cloud computing, predictive text analytics, social network analysis, and more.

PAW’s pre- and post-conference workshops round out the learning opportunities. In addition to the hands-on workshop mentioned above, there is a course covering core methods, “The Best and the Worst of Predictive Analytics: Predictive Modeling Methods and Common Data Mining Mistakes” (www.predictiveanalyticsworld.com/dc/2009/predictive_modeling_methods.php) and a business-level seminar on decision automation and support, “Putting Predictive Analytics to Work” (www.predictiveanalyticsworld.com/dc/2009/predictive_analytics_work.php).

C. Third, the leading predictive analytics software vendors and consulting firms are present at PAW as sponsors and exhibitors, available to provide demos and answer your questions. What do the predictive analytics solutions do, how do they compare, and which is best for you? PAW is the one-stop shop for selecting the tool or solution most suited to address your needs.

D. Fourth, PAW provides a unique, focused opportunity to network with colleagues and establish valuable contacts in the predictive analytics industry.  Mingle, connect and hang out with professionals facing similar challenges (coffee breaks, meals, and at the reception).

Ajay- How do you balance the interests of the various competing software vendors and companies who sponsor such an event?

Eric- As a vendor-neutral event, PAW’s core program of 25 sessions is booked exclusively with enterprise practitioners, thought leaders and adopters, with no predictive analytics software vendors speaking or co-presenting. These sessions provide substantive content with take-aways which provide value that’s independent of any particular software solution — no product pitches!  Beyond these 25 sessions are three short sponsor sessions that are demarcated as such, and, despite being branded, generally prove to be quite substantive as well.  In this way, PAW delivers a high quality, unbiased program.

Supplementing this vendor-neutral program, the room right next door has an expo where attendees have all the access to software and solution vendors they could want (cf. my answer to the prior question regarding software vendors, above).

Here are a couple more PAW links:

For informative PAW event updates:
www.predictiveanalyticsworld.com/notifications.php

To sign up for the PAW group on LinkedIn, see:
www.linkedin.com/e/gis/1005097

Ajay- Describe your career in science, including the research that you specialize in. How would you motivate students today to go for science careers?

Eric- Well, first off, my work as a predictive analytics consultant, instructor and conference chair is in the application of established technology, rather than the research and development of new or improved methods.

But the Ph.D. next to my name reveals my secret past as an “academic”. Pure research is something I really enjoyed and I kind of feel like I had a brain transplant in order to change to “real world work”. I’m glad I made the change, although I see good sides to both types of work (really, they’re like two entirely different lifestyles).

In my research I focused on core predictive modeling methods. The ability for a computer to automatically learn from experience (data really is recorded experience, after all), is the best thing since sliced bread. Ever since I realized, as a kid, that space travel would in fact be a huge pain in the neck, nothing in science has ever seemed nearly as exciting.

Predictive analytics is an endeavor in machine learning. A predictive model is the encoding of a set of rules or patterns or regularities at some level. The model is the thing output by automated, number-crunchin’ analysis and, therefore, is the thing “learned” from the “experience” (data).  The “magic” here is the ability of these methods to find a model that performs not only over the historical data on your disk drive, but that will perform equally well for tomorrow’s new situations. That ability to generalize from the data at hand means the system has actually learned something.

And indeed the ability to learn and apply what’s been learned turns out to provide plenty of business value, as I imagined back in the lab.  The output of a predictive model is a predictive score for each individual customer or prospect.  The score in turn directly informs the business decision to be taken with that individual customer (to contact or not to contact; with which offer to contact, etc.) – business intelligence just doesn’t get more actionable than that.
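
As a generic illustration of the score-to-decision flow Eric describes (not his or any client's actual system), the sketch below fits a model on historical outcomes, scores new prospects, and thresholds the score into a contact/skip decision. The data and threshold are synthetic placeholders.

```python
# Predictive score per individual -> simple business decision (contact or skip).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
X_hist = rng.normal(size=(10_000, 6))                     # historical customer features
responded = (X_hist[:, 0] + rng.normal(scale=1.0, size=10_000) > 1).astype(int)

model = LogisticRegression(max_iter=1000).fit(X_hist, responded)

X_new = rng.normal(size=(5, 6))                           # prospects to be scored
scores = model.predict_proba(X_new)[:, 1]
decisions = ["contact" if s > 0.25 else "skip" for s in scores]
print(list(zip(np.round(scores, 2), decisions)))
```
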

For the impending student, I’d first point out the difference between applied science and research science. Research science is fun in that you have the luxury of abstraction and are usually fairly removed from the need to prove near-term industrial applicability. Applied science is fun for the opposite reason: The tangle of challenges, although less abstract and in that sense more mundane, are the only thing between you and getting the great ideas of the world to actually work, come to fruition, and have an irrefutable impact.

Ajay- What are, in your opinion, the top five conferences in analytics and data mining in the world, including PAW?

Eric- KDD – The leading event for research and development of the core methods behind the commercial deployments covered at PAW (“Knowledge Discovery and Data Mining”).

ICML – Another long-standing research conference on machine learning (core data mining).

eMetrics.org – For online marketing optimization and web analytics

Text Analytics Summit – Text mining can leverage “unstructured data” (text) to augment predictive analytics; the chair of this conference is speaking at PAW on just that topic: www.predictiveanalyticsworld.com/dc/2009/agenda.php#day2-15

Predictive Analytics World, the business-focused event for predictive analytics professionals, managers and commercial practitioners – focused on the commercial deployment of predictive analytics: pawcon.com

Ajay- Will PAW 2009 have video archives, videos, or podcasts for people not able to attend on site?

Eric- While the PAW conferences emphasize in-person participation, we are in the planning stages for future webcasts and other online content. PAW’s “Predictive Analytics Guide” has a growing list of online resources: www.predictiveanalyticsworld.com/predictive_analytics.php

Ajay- How do you think social media marketing can help these conferences?

Eric- Like most events, PAW leverages social media to spread the word.

But perhaps most pertinent is the other way around: predictive analytics can help social media by increasing relevancy, dynamically selecting the content to which each reader or viewer is most likely to respond.

Ajay- Do you have any plans to take PAW more international? Any plans for a PAW journal for trainings and papers.

Eric- We’re in discussions on these topics, but for now I can only say, stay tuned!

Biography

The president of Prediction Impact, Inc., Eric Siegel is an expert in predictive analytics and data mining and a former computer science professor at Columbia University, where he won awards for teaching, including graduate-level courses in machine learning and intelligent systems – the academic terms for predictive analytics. He has published 13 papers in data mining research and computer science education, has served on 10 conference program committees, has chaired an AAAI Symposium held at MIT, and is the founding chair of Predictive Analytics World.

For more on Predictive Analytics World-

Predictive Analytics World Conference
October 20-21, 2009, Washington, DC
www.predictiveanalyticsworld.com
LinkedIn Group: www.linkedin.com/e/gis/1005097

Interview Gary D. Miner Author and Professor

Here is an interview with Gary Miner, PhD, who has been in the data mining business for almost 30 years and is a pioneer in healthcare studies pertaining to Alzheimer's disease. He is also co-author of the "Handbook of Statistical Analysis and Data Mining Applications". Gary writes about how he has seen data mining change over the years, healthcare applications, and his book, and quotes from his experience.


Ajay- Describe your career in science, starting from college till today. How would you interest young students in science careers today, in the middle of the recession?

Gary – I knew that I wanted to be in "Science" even before college days, taking all the science and math courses I could in high school. This continued in my undergraduate college years at a private college [Hamline University, St. Paul, Minnesota – older than the State of Minnesota, founded in 1854, and home to the first medical school, later "sold" to the University of Minnesota] as a Biology and Chemistry major, with a minor in education. From there I did an M.S. conducting a physiological genetics research project, and then a Ph.D. at another institution where I worked on genetic polymorphisms of mouse blood enzymes. So through all of this, I had to use statistics to analyze the data. My M.S. was analyzed before the time of even "electronic calculators", so I used, if you can believe this, a rented "hand-cranked calculator" one summer to analyze my M.S. dataset. By the time my Ph.D. thesis data was being analyzed, electronic calculators were available, but the big mainframe computers were on college campuses, so I punched the data onto CARDS, walked down the hill to the computing center, dropped off the stack of cards, and came back the next day to get "reams of output" on large paper [about 15" by 18", folded in a stack, if anyone remembers those days…]. I then spent about 30 years doing medical research in academic environments with the emphasis on genetics, biochemistry, and proteomics in the areas of mental illness and Alzheimer's Disease, which became my main area of study, publishing the first book in 1989 on the GENETICS OF ALZHEIMER'S DISEASE.

Today, in my "semi-retirement careers", one sideline outreach is working with medical residents on their research projects, which I've been doing for about 7 or 8 years now. This involves design of the research project, data collection, and most importantly "effective and accurate" analysis of the datasets. I find this a way I can reach out to the younger generation to interest them not only in "science", but in doing "science correctly". As you probably know, we are in the era of the "Dumbing of America"; anti-science, if you wish. I've seen this happening for at least 30 years, during the 1980's, 1990's, and continuing into this century. Even the medical residents I get to work with each year have been going "downhill" yearly in their ability to "problem solve". I believe this is an effect of this "dumbing of America".

There are several books coming out on this Dumbing of America this summer; one the first week of June, another on July 12, and another in September [see the attached PPT for slides with the covers of these 3 books]. It is a real problem, as Americans over the past few decades have moved towards "wanting simple answers", and most things in the "real world", e.g. reality, are not simple… that's where Science comes in.

A recent 2008 study done by the School of Public Health at Ohio University showed that up to 88% of the published scientific papers in a top, respected cancer journal either used statistics INCORRECTLY and/or drew an INCORRECT conclusion. When my wife and I both did post-docs in Psychiatric Epidemiology in 1980-82, basically doing an MPH, the first words out of the mouth of the “Biostats – Epidemiology” professor in the first lecture to the incoming MPH students were: “We might as well throw out most of the medical research literature of the past 25 years, as it has either not been designed correctly or statistics have been used incorrectly”!!! That caught my attention. And following medical research [and medicine in general] since then, I can tell you that “not much has changed in the past 25 years”, which puts us “50 years behind” in medical research and medicine. ANALOGY: If some of our major companies that are successfully using predictive analytics to organize and efficiently run their organizations took on the “mode of operation” of medicine and medical research, they’d be “bankrupt” in 6 months. That’s what I tell my students.
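To make the “statistics used incorrectly” point concrete, here is a minimal, purely illustrative Python sketch; it is a toy example, not anything taken from the Ohio University study or from the Handbook. It shows one of the most common mistakes: screening many outcome measures with uncorrected significance tests. Even when the “treatment” and “control” groups are drawn from the same distribution, roughly 5% of endpoints come out “significant” by chance, so a study that tests dozens of endpoints without correction will almost always find something to report.

# Toy illustration (hypothetical data): uncorrected multiple testing on pure noise.
# Both "treatment" and "control" come from the same distribution, yet roughly
# 5% of the 100 endpoints will still show p < 0.05 by chance alone.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_endpoints, n_patients = 100, 50

false_positives = 0
for _ in range(n_endpoints):
    treatment = rng.normal(loc=0.0, scale=1.0, size=n_patients)
    control = rng.normal(loc=0.0, scale=1.0, size=n_patients)
    _, p = stats.ttest_ind(treatment, control)
    if p < 0.05:
        false_positives += 1

print(f"'Significant' endpoints out of {n_endpoints}: {false_positives}")
# A Bonferroni correction (require p < 0.05 / n_endpoints) would remove nearly all of them.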

Ajay- Describe some of the exciting things data mining can do to lower health care costs and provide more people with coverage.

Gary- As mentioned above, my personal feeling is that “medicine / health care” is 50 years “behind the times”, compared to the efficiency needed to successfully survive in this Global Economy; corporations and organizations like Wal-Mart, INTEL, and many of our pharmaceutical companies have used data mining / predictive analytics to survive successfully. Wal-Mart especially: Wal-Mart has its own set of data miners, and was writing its own procedures in the early 1990’s … before most of us had ever heard of data mining. That is why Wal-Mart can go into China today, open a store in any location, and know with almost 99% accuracy 1) how many checkout stands are needed, 2) what products to stock, 3) where in the store to stock them, and 4) what their profit margin will be. They have done this through very accurate “Predictive Analytics” modeling.
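As a rough, hedged illustration of what “Predictive Analytics” modeling of this kind looks like in miniature, here is a generic Python / scikit-learn sketch with invented store data; it is not Wal-Mart’s proprietary methodology, just the shape of the idea: fit a model on existing stores, then score a proposed new location.

# Hypothetical sketch: estimate how many checkout lanes a new store location needs
# from simple site characteristics. All data and features are invented for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

# Columns: local population (thousands), competing stores nearby, floor area (thousand sq ft)
X_train = np.array([
    [120, 3, 80],
    [450, 8, 150],
    [60, 1, 40],
    [300, 5, 120],
    [900, 12, 200],
])
y_train = np.array([12, 28, 6, 20, 40])  # checkout lanes that proved adequate at existing stores

model = LinearRegression().fit(X_train, y_train)

new_site = np.array([[500, 6, 160]])  # a proposed location
print(f"Estimated checkout lanes needed: {model.predict(new_site)[0]:.0f}")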

Other “ingrained” USA corporations have NOT grabbed onto this “most accurate” technology [i.e. predictive analytics modeling], and are reaping the “rewards” of impending bankruptcy and disappearance today. Examples in the news, of course, are our 3 big automakers in Detroit. If they had engaged in effective data mining / modeling in the late 1990’s they could have avoided their current problems. I see the same for many of our oldest and largest USA insurance companies … they are “middle-management fat”, and I’ve seen their ratings go down over the past 10 years from an A rating to even a C rating [for the company with which I have my auto insurance; you might ask me why I stay … an agent who is a friend, BUT it is frustrating, and this company’s “mode of operation” is completely “customer un-friendly”], while new insurance companies have “grabbed” onto modern technology and are rising stars.

So my influence on the younger generation is to have my students do research and DATA ANALYSIS correctly.

Ajay- Describe your book “HANDBOOK OF STATISTICAL ANALYSIS & DATA MINING APPLICATIONS”. Who would be the target audience of this, and can corporate data miners gain from it as well?

Gary- There are several target audiences. The main audience we were writing for, after our Publisher looked at what “niches” had been un-met in the data mining literature, was the professional in smaller and middle-sized businesses and organizations who needs to learn about “data mining / predictive analytics” fast … e.g. situations where the company does not have a data analysis group using predictive analytics, but the CEOs and professionals in the company know they need to learn and start using predictive analytics to “stay alive”. This seemed like potentially a very large audience. The book is oriented so that one does NOT have to start at chapter 1 and read sequentially, but instead can START WITH A TUTORIAL. Working through a tutorial, I’ve found in my 40 years of being in education, is the fastest way for a person to learn something new. And this has been confirmed … I’ve had newcomers to data mining, who have already gotten the HANDBOOK, write me and say: “I’ve gone through a bunch of tutorials, and I find that I am really learning ‘how to do this’ … I’ve read other books on ‘theory’, but just didn’t get the ‘hang of it’ from those”. My data mining consultants at StatSoft, who travel and work in “real world” situations every day, and who wrote maybe 1/3 of the tutorials in the HANDBOOK, tell me: “A person can go through the TUTORIALS in the HANDBOOK and know 70% of what we who are doing predictive analytics consulting every day know!!!”

But there are other audiences. Corporate data miners can find it very useful also, as a “way of thinking as a data miner” can be gained from reading the book, as was expressed by one of the Amazon.com 5-STAR reviews: “What I like about this book is that it embeds those methods in a broader context, that of the philosophy and structure of data mining, especially as the methods are used in the corporate world. To me, it was really helpful in thinking like a data miner, especially as it involves the mix of science and art.”

But we’ve had others who have told us they will use it as an extra textbook in their Business Intelligence and Data Mining courses, because of the “richness” of the tutorials. Here’s a comment from the Amazon reviews, by the head of a business school who has maybe over 100 graduate students doing data mining:

“5.0 out of 5 stars. At last, a usable data mining book.

This is one of the few, of many, data mining books that delivers what it promises. It promises many detailed examples and cases. The companion DVD has detailed cases and also has a real 90-day trial copy of Statistica. I have taught data mining for over 10 years and I know it is very difficult to find comprehensive cases that can be used for classroom examples and for students to actually mine data. The price of the book is also very reasonable, especially when you compare the quantity and quality of the material to the typical intro stat book that usually costs twice as much as this data mining book.

The book also addresses new areas of data mining that are under development. Anyone that really wants to understand what data mining is about will find this book infinitely useful.”

So, I think the HANDBOOK will see use in many college classrooms.

Ajay- A question I never get the answer to is which data mining tool is good for what, and not so good for what. Could you help me out with this one? In your opinion, among the data mining and statistical tools you have used in your 40 years in this profession, which would you recommend for some uses, and which would you not recommend for other uses (e.g. SAS, SPSS, KXEN, StatSoft, R, etc.)?

Gary- This is a question I can’t answer well; but my book co-author, Robert Nisbet, Ph.D., can. He has used most of these software packages, and in fact has written 2 reviews over the past 6 years in which most of them have been discussed. I like “cutting-edge endeavors”; that has been the modus operandi of my career. So when I took this “semi-retirement position” as a data mining consultant at StatSoft, I was introduced to DATA MINING, as we started developing STATISTICA Data Miner shortly after I arrived. So most of my experience is with STATISTICA Data Miner, which of course has always been rated NO. 1 in all the reviews of data mining software done by Dr. Nisbet. I believe this is primarily due to the fact that STATISTICA was written for the PC from the beginning, and thus does not have any legacy “mainframe computer” code in its history; secondly, StatSoft has been able to move rapidly to make changes as business and government data analysis needs change; and thirdly, and most importantly, STATISTICA products have a very “open architecture”, “flexibility”, and “customization”, with everything “built together / workable together” as one package. And of course the graphical output is second to none – that is how STATISTICA originally got its reputation. So I find no need for any other software: if I need a new algorithm, I can program it to work with the “off the shelf” STATISTICA Data Miner algorithms, and thus get anything I need with the full graphical and other outputs seamlessly available.
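The “open architecture” point, plugging a custom algorithm in alongside off-the-shelf components so that everything runs as one workflow, is not specific to any one product. As a generic analogue in Python with scikit-learn (not the STATISTICA macro facility being described here), a hand-written transformation can be chained with a standard classifier and evaluated as a single unit:

# Generic sketch (not STATISTICA-specific): a custom preprocessing step combined
# with an off-the-shelf classifier in one reusable workflow.
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score

class LogScaler(BaseEstimator, TransformerMixin):
    """Custom step: log-transform all features (illustrative only)."""
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return np.log1p(X)

workflow = Pipeline([
    ("custom_step", LogScaler()),                             # your own algorithm slots in here
    ("classifier", DecisionTreeClassifier(random_state=0)),   # off-the-shelf component
])

X, y = load_iris(return_X_y=True)
print("Cross-validated accuracy:", cross_val_score(workflow, X, y, cv=5).mean())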

Ask Bob Nisbet to answer this question, as he has the background to do so.

Ajay- What are the most interesting trends to watch out for in 2009-2010 in data mining, in your opinion?

Gary- Things move so rapidly in this 21st-century world that this is difficult to say. Let me answer this with “hindsight”:

In late October 2008 I wrote the first draft of Chapter 21 for the HANDBOOK. This was the “future directions of data mining” chapter. You can look in that chapter yourself to find the 4 main areas I decided to focus on. One was “social networking”, and one of the new examples used was TWITTER. At that time, less than one year ago, no one knew if TWITTER was going to amount to much or not; big question. Well, on Jan 15 when the US AIRWAYS A320 Airbus made an emergency landing in the Hudson River, I got an automatic EMAIL message from CNN [which I subscribe to] telling me that a “plane was down in the Hudson, watch it live”. I clicked on the live video: the voice from the helicopter overhead was saying, “We see a plane, half sunk into the water, but no people? What has happened to the people? Are they all dead?” Well, as it turned out, the CNN helicopters had spent nearly one hour searching the river for the plane, as had other news agencies. BUT THE “ENTIRE” WORLD ALREADY KNEW!!! Why? A person on a ferry that was crossing the river close to the crash landing used his iPhone, snapped a photo, uploaded it to TwitPic and sent a TWITTER message, and this was re-tweeted around the world. The world knew in “seconds to minutes”, while the traditional NEWS MEDIA was an hour late to the scene; ALL the PEOPLE had been rescued and were on shore in a warm building within 45 minutes of the landing. THE TRADITIONAL NEWS MEDIA ARRIVED 15 MINUTES AFTER EVERYTHING HAD HAPPENED!!!! AT THIS POINT we ALL KNEW that TWITTER was a new phenomenon, and it started growing, with 10,000 people an hour joining at one point last spring, and who knows what the rate is today. TWITTER has become a most important part not only of “social networking” among friends, but of BUSINESS; companies are even sending out “Parts Availability” lists to their dealers, etc.

TWITTER affected Chapter 21 … I immediately re-wrote Chapter 21, including this first photo of the Hudson plane crash-landing with all the people standing on the wings. BUT that is not the end of the story: by the time the book was about to go to press, TWITTER had decided that “ownership” of uploaded photos resided with the photographer, and the person who took this original US AIRWAYS “people on the wings” photo wanted $600 for us to publish it in the HANDBOOK. So I re-wrote again [the chapter was already “set” in page proofs, so we had to make the changes directly at the printer], this time finding another photo uploaded to social media, but in this case the person had “checked” the box to put the photo in the public domain.

So TWITTER is one that I predicted would become important, but I’d thought it would be months AFTER the HANDBOOK was released in May, not last January!!!

Other things we presented in Chapter 21 about the “future of data mining” involved “photo / image recognition”, among others. “Image recognition”, and more importantly “movement recognition / analysis” for things like physical therapy and other medical areas, may be slower to evolve and fully develop, but they are immensely important. The ability to analyze such “three-dimensional movement data” is already available in rudimentary form in our version 9 of STATISTICA [just released in June], and anyone could implement it fully with MACROS, but it will probably be some time before it is fully feasible from a business standpoint to develop it with fully automatic “point and click” functionality that makes it readily accessible for anyone’s use.
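To give a flavor of what analyzing “three-dimensional movement data” involves at its most rudimentary level, here is a generic Python sketch with invented marker coordinates; it is not the STATISTICA version 9 functionality being described. One basic building block is computing a joint angle, frame by frame, from three tracked markers such as hip, knee, and ankle:

# Hypothetical sketch: compute a knee angle from three 3D motion-capture markers.
# Marker coordinates are invented; real systems stream many markers per frame.
import numpy as np

def joint_angle(a, b, c):
    """Angle at point b (in degrees) formed by the segments b->a and b->c."""
    ba, bc = a - b, c - b
    cos_angle = np.dot(ba, bc) / (np.linalg.norm(ba) * np.linalg.norm(bc))
    return np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))

hip = np.array([0.0, 1.0, 0.9])
knee = np.array([0.0, 1.0, 0.5])
ankle = np.array([0.1, 1.2, 0.1])

print(f"Knee flexion angle: {joint_angle(hip, knee, ankle):.1f} degrees")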

Ajay- What would your advice be to a young statistician just starting his or her research career?

Gary- Make sure you delve fully into the subject areas … you need to know BOTH the “domain” of the data you are working with and the “correct” methods of data analysis, especially when using the traditional p-value statistics. Today’s culture is too much about “superficiality” … good data analysis requires “depth” of understanding. One needs to FOCUS, and good FOCUS can’t be achieved with elaborate “multi-tasking”. Granted, today’s youth [the “Technology-Inherited”] probably have their brains “wired differently” than the “Technology-Immigrants” like myself [e.g. the older generations], but nevertheless I see ERRORS all over the place in today’s world, from “typos” in magazines and newspapers, to web page paragraphs, links that don’t work, etc. etc. … and I conclude that this is all due to NON-FOCUSED / MULTI-TASKING people. You can’t drive a car / bus / train and TEXT MESSAGE at the same time … the scientific tests that have been conducted show that it takes 20 times as long for a TEXT-MESSAGING driver to stop as for a driver fully focused on the road, when given a “danger” warning. [Now, maybe this scientific experiment used ALL Technology-Immigrants as drivers?? If so, the scientific design was “flawed” … they should have used BOTH Technology-Immigrants and Technology-Inherited as participants in the study. Then we’d have two independent variables, or factors: age group and TEXT MESSAGING, with stopping time as the dependent variable.]

Short Bio-

Professor; 30 years of medical research in genetics, DNA, proteins, and the neuropsychology of schizophrenia and Alzheimer’s Disease; now in a semi-retired position as a DATA MINING CONSULTANT – SENIOR STATISTICIAN.

Interview John Moore CTO, Swimfish

Here is an interview with John F Moore, VP Engineering and Chief Technology Officer of Swimfish, a provider of business solutions and CRM. A well-known figure in Technology and CRM circles, John talks of Social CRM, Technology Offshoring, Community Initiatives and his own career.

Too many CRM systems are not usable. They are built by engineers who think of the system as a large database, and the systems often look like a database, making them difficult for the sales, support, and marketing people to use.

-John F Moore

JohnMoore

Ajay – Describe your career journey from college to CTO. What changes in mindset did you undergo along the journey? What advice would you give to young students to take up science careers ?

John- First, I wanted to take time to thank you for the interview offer. I graduated from Boston University in 1988 with a degree in Electrical Engineering. At the time of my graduation I found myself very interested in the advances taking place on the personal computing front by companies like Lotus with their 1-2-3 product. I knew that I wanted to be involved with these efforts and landed my first job in the software space as a Software Quality Engineer working on 1-2-3 for DOS.

I spent the first few years of my career working at Lotus as a developer, a quality engineer, and manager, on products such as Lotus 1-2-3 and Lotus Notes. Throughout those early career years I learned a lot and focused on taking as many classes as possible.

From Lotus I sought out the start-up environment, and in early 2000 I joined a startup named Brainshark (http://www.brainshark.com). Brainshark was, and is, focused on delivering an asynchronous communication platform on the web and was one of the early providers of SAAS. In my seven years at Brainshark I learned a lot about delivering an Enterprise-class SAAS solution on top of the Microsoft technology stack. The requirements to pass security audits for Fortune 500 companies and the need to match the performance of in-house solutions resulted in all of us learning a great deal. Those were very fun times.

I now work as the VP of Engineering and CTO at Swimfish, a services and software provider of business solutions. We focus on the financial marketplace, where the founder has a very deep background, but we work within other verticals as well. Our products are focused on the CRM, document management, and mobile product space and are built on the Microsoft technology stack. Our customers leverage both our SAAS and on-premise solutions, which requires us to build our products to be more flexible than is generally required for a SAAS-only solution.

The exciting thing for me is the sheer amount of opportunities I see available for science/engineering students graduating in the near future. To be prepared for these opportunities, however, it will be important to not just be technically savvy.

Engineering students should also be looking at:

* Business classes. If you want to build cool products they must deliver business value.

* Writing and speaking classes. You must be able to articulate your ideas or no one will be willing to invest in them.

I would also encourage people to take chances, to get in over your head as often as possible. You may fail, you may succeed. Either way you will gain experiences that make it all worthwhile.

Ajay- How do you think social media can help with CRM? What are the basic do’s and don’ts for social media CRM in your opinion?

John- You touch upon a subject that I am very passionate about. When I think of Social CRM I think about a system of processes and products that enable businesses to actively engage with customers in a manner that delivers maximum value to all. Customers should be able to find answers to their questions with minimal friction or effort; companies should find the right customers for their products.

Social CRM should deliver on some of these fronts:

* Analyze the web of relationships that exists to define optimal pathways (a minimal sketch of this idea follows this list). These pathways will define relationships that businesses can leverage for finding their customers. These pathways will enable customers to quickly find answers to their questions. For example, I needed an answer to a question about SharePoint and project management. I asked the question on Twitter and within 3 minutes had answers from two different people. Not only did I get the answer I needed but I made two new friends who I still talk to today.

* Monitor conversations to gauge brand awareness, identify customers having problems or asking questions. This monitoring should not be stalking; however, it should be used to provide quick responses to customers to benefit the greater community.

* Usability. Too many CRM systems are not usable. They are built by engineers who think of the system as a large database, and the systems often look like a database, making them difficult for the sales, support, and marketing people to use.
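As mentioned in the first bullet above, here is a minimal sketch of what “analyzing the web of relationships to define optimal pathways” could mean in practice. It is a generic Python / networkx illustration with a made-up relationship network, not a description of any particular Social CRM product: find the shortest chain of existing relationships that connects a business to a prospect.

# Hypothetical sketch: find the shortest "introduction path" from a company
# to a prospect through an invented network of existing relationships.
import networkx as nx

g = nx.Graph()
g.add_edges_from([
    ("Swimfish", "PartnerA"),
    ("Swimfish", "CustomerB"),
    ("PartnerA", "ProspectX"),
    ("CustomerB", "AnalystC"),
    ("AnalystC", "ProspectX"),
])

# The shortest path suggests who can make the warmest introduction.
path = nx.shortest_path(g, source="Swimfish", target="ProspectX")
print(" -> ".join(path))   # e.g. Swimfish -> PartnerA -> ProspectX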

Finally, when I think of social media I think of these properties:

* Social is about relationship building.

* You should always add more value to the community than you take in return.

* Be transparent and honest. People can tell when you’re not.

Ajay- You are involved in some noble causes, like using blog space for out-of-work techies and, separately, for Alzheimer’s disease. How important do you think it is for people, especially younger people, to be dedicated to community causes?

John- My mother-in-law was diagnosed with Alzheimer’s disease at the age of 57. My wife and I moved into their two-family house to help her through the final years of her life. It is a horrible disease and one that is easy to be passionate about if you have seen it in action.

My motivation on the job front is very similar. I have seen too many people suffer through these poor economic times and I simply want to do what I can to help people get back to work.

It probably sounds corny, but I firmly believe that we must all do what we can for each other. Business is competitive, but it does not mean that we cannot, or should not, help each other out. I think it’s important for everyone to have causes they believe in. You have to find your passions in life and follow them. Be a whole person and help change the world for the better.

Ajay- Describe your daily challenges as head of Engineering at Swimfish, Inc. How important is it for the tech team to be integrated with the business and understand it as well?

John- The engineering team at Swimfish works very closely with the business teams. It is important for the team to understand the challenges our customers are encountering and to build products that help the customer succeed. I am not satisfied with the lack of success that many companies encounter when deploying a CRM solution.

We go as deep as possible to understand the business, the processes currently in use, the disparate systems being utilized, and then the underlying technologies currently in use. Only then do we focus on the solutions and deliver the right solution for that company.

On the product front it is the same. We work closely with customers on the features we are planning to add, trying to ensure that the solutions meet their needs as well as the needs of the other customers in the market that we are hoping to serve.

I do expect my engineers to be great at their core job; that goes without question. However, if they cannot understand the business needs they will not work for me very long. My weeks at Swimfish always provide me with interesting challenges and opportunities.

My typical day involves:

* Checking in with our support team to understand if there are any major issues being encountered by any of our customers.

* Challenging the sales team to hit their targets. I love sales, as without them I cannot deliver products.

* Checking in with my developers and test teams to determine how each of our projects is doing. We have a daily standup as well, but I try and personally check-in with as many people as possible.

* Most days I spend some time developing, mostly in C#. My current focus area is on our next release of our Milestone Tracking Matrix where I have made major revisions to our user interface.

I also spend time interacting on various social platforms, such as Twitter, as it is critical for me to understand the challenges that people are encountering in their businesses, to keep up with the rapid pace of technology, and just to check-in with friends. Keep it real.

Ajay- What are your views on offshoring work, especially science jobs, which ultimately made science careers less attractive in the US, while at the same time outsourcing companies (in India) generally pay out only one-third of billing fees as salaries? Do you think concepts like oDesk can help change the paradigm of tech outsourcing?

John- I have mixed opinions on offshoring. You should not offshore because of perceived cost savings only. On net you will generally break even; you will not save as much as you might originally think.

I am, however, close to starting a relationship with a good development provider in Costa Rica. The reason for this relationship is not cost based, it is knowledge based. This company has a lot of experience with the primary CRM system that we sell to customers and I have not been successful in finding this experience locally. I will save a lot of money in upfront training on this skill-set; they have done a lot of work in this area already (and have great references). There is real value to our business, and theirs.

Note that Swimfish is already working with a geographically dispersed team as part of the engineering team is in California and part is in Massachusetts. This arrangement has already helped us to better prepare for an offshore relationship and I know we will be successful when we begin.

Ajay- What does John Moore do to have fun when he is not in front of his computer or working on a cause?

John- As the father of two teenage daughters I spend a lot of time going to soccer, basketball, and softball games. I also enjoy spending time running, having completed a couple of marathons, and relaxing with a good book. My next challenge will be skydiving, as my 17-year-old daughter and I are going skydiving when she turns 18.

Brief Bio:

For the last decade I have worked as a senior engineering manager for SAAS applications built upon the Microsoft technology stack. I have established the processes and hired the teams that delivered hundreds of updates, ranging from weekly patches to longer-running full feature releases. My background as a hands-on developer, combined with my strong QA background, has enabled me to deliver high-quality software on time.

You can learn more about me, and my opinions, by reading my blog at http://johnfmoore.wordpress.com/ or joining me on Twitter at http://twitter.com/JohnFMoore