The Top DecisionStats Articles -Part 2 Business Intelligence and Data Quality

I am a self-confessed novice at business intelligence. I understand the broad concepts, reporting tools, and certainly forecasting tools, but the whole systems view still baffles me. Fortunately, I have been learning from some of the best writers in this field. Here, in order of circulation, are the top Business Intelligence articles.

Business Intelligence


1) Jill Dyche

http://www.decisionstats.com/2009/06/30/interview-jill-dyche-baseline-consulting/

Jill is a fabulously wise and experienced person with a great writing style. Her answers were some of the most educative I have seen in BI writing.

2) Peter Thomas

http://www.decisionstats.com/2009/07/02/peter-james-thomas-bi/

The best of British BI is epitomized by Peter Thomas, and he is truly a European giant in the field. His one weakness is a tendency to disappear when Test cricket is on, but that is eminently understandable. I can relate to the cricket as well.

3) Karen Lopez

http://www.decisionstats.com/2009/07/28/interview-karen-lopez/

Karen gives excellent insight into creating mock-ups and data models before actual implementation. She has worked in this area for almost three decades and her wisdom is clearly visible here.

Data Quality

Data quality is such an overlooked and easy-to-fix issue that I believe any BI vendor that builds the best, most robust data quality architecture will gain the maximum Pareto-like benefit from the results. Curiously, competing BI vendors will often compete on price, graphics appeal, and so on, but the simple Garbage In, Garbage Out rule is something they should consider. The data quality interviews gave me an important tutorial in these aspects of data management.

1) Jim Harris

http://www.decisionstats.com/tag/jim-harris/

Jim is a one-man army when it comes to evangelizing data quality, and his OCDQ blog is widely read and cited.

2) Steve Sarsfield

http://www.decisionstats.com/2009/08/13/interview-steve-sarsfield-author-the-data-governance-imperative/

His excellent book is the one must-read item that people in cost-cutting corporations should buy, especially if they are considering going down the Davenport "competing on analytics" path.

( To be continued- Part 3 Modeling and Text Mining

Part 4 Social Media

Part 5 Humour and Poetry )

The Top DecisionStats Articles -Part 1 Analytics

I was just looking at my web analytics numbers and we seem to have crossed some milestones.

The site has now gotten more than 50,000 views since being launched in Dec 2007.

Thank you everyone for your help with this. More importantly, the quality of comments has been fabulous. Since I am out of ideas for the rest of the week, here is a best-of collection of posts.
Here are some of the most popular articles as measured by number of page views. I have personal favourites as well, but these are simply the ranks by page views and how they measure up.

Top 5 Interviews

1) Interviews with SAS Institute leaders- I have generally found great professionalism from SAS Institute people. This is surprising because, coming from an open source background, SAS is often looked at as a big brother. I find that is more perception than reality as the company continues to innovate.

a) With John Sall, co-founder of SAS Institute- This is really the biggest interview I did in terms of the person involved. To my surprise (I wasn't expecting John to say yes) the interview was very frank, and it came very fast. The answers seemed to be written by John himself.

Quote- Quantitative fields can be fairly resistant to recession- John Sall.

http://www.decisionstats.com/2009/07/28/interview-john-sall-jmp/

b) Interview with Anne Milley, Director of Product Marketing, SAS Institute- This is a favourite because it came very soon after the New York Times article on R. One of my personal opinions is that the difference between great and good leaders is often that great leaders are humble enough to learn and then build on their strengths. It ran in two parts, and I was really appreciative of the in-depth answers that Anne wrote.

Quotes-

Analytics continues to be our middle name.

Customers vote with the cheque book.

Continue reading “The Top DecisionStats Articles -Part 1 Analytics”

Interview Gregory Piatetsky KDNuggets.com

Here is an interview with Gregory Piatetsky, founder and editor of KDnuggets (www.KDnuggets.com), one of the oldest and biggest independent industry websites for data mining and analytics.


Ajay- Please describe your career in science and the many challenges and rewards that came with it. Mention any scientific research, degrees, teaching, etc.


Gregory-
I was born in Moscow, Russia and went to a top math high school in Moscow. A unique challenge for me was that my father was one of the leading mathematicians in the Soviet Union. While I liked math (and still do), I quickly realized while still in high school that I would never be as good as my father, and a math career was not for me.

Fortunately, I discovered computers and really liked the process of programming and solving applied problems. At that time (the late 1970s) computers were not very popular and it was not clear that one could make a career in computers. However, I was very lucky that I was able to pursue what I liked and find demand for my skills.

I got my MS in 1979 and PhD in 1984 in Computer Science from New York University.
I was interested in AI (perhaps thanks to a lot of science fiction I read as a kid), but found a job in databases, so I was looking for ways to combine them.

In 1984 I joined GTE Labs where I worked on research in databases and AI, and in 1989 started the first project on Knowledge Discovery in Data. To help convince my management that there would be a demand for this thing called "data mining" (GTE management did not see much future for it), I also organized an AAAI workshop on the topic.

I thought "data mining" was not a sexy enough name, so I called it "Knowledge Discovery in Data", or KDD. Since 1989, I have been working on KDD and data mining in all its aspects – more on my page www.kdnuggets.com/gps.html

Ajay- How would you encourage a young science entrepreneur in this recession?

Gregory- Many great companies were started or grew in a recession, e.g.
http://www.insidecrm.com/features/businesses-started-slump-111108/

Recession may be compared to a brush fire which removes dead wood and allows new trees to grow.

Ajay- What prompted you to set up KDnuggets? Any reason for the name (Knowledge Discovery Nuggets)? Describe some key milestones in this iconic website for data mining people.

Gregory- After the third KDD workshop in 1993, I started a newsletter to connect the roughly 50 people who attended the workshop, and possibly others who were interested in data mining and KDD. The idea was that it would have short items or "nuggets" of information. Also, at that time a popular metaphor for data miners was gold miners looking for gold "nuggets". So I wanted a newsletter with "nuggets" – short, valuable items about Knowledge Discovery. Thus, the name KDnuggets.

In 1994 I created a website on data mining at GTE and in 1997, after I left GTE, I moved it to the current domain name www.kdnuggets.com.

In 1999, I was working for a startup which provided data mining services to the financial industry. However, because of Y2K issues, all the banks froze their systems in the second half of 1999, and we had very little work (and our salaries were reduced as well). I decided that I would try to get some ads and was able to get companies like SPSS and Megaputer to advertise.

Since 2001, I have been an independent consultant, and KDnuggets is only part of what I do. I also do data mining consulting, and actively participate in SIGKDD (Director 1998-2005, Chair 2005-2009).

Some people think that KDnuggets is a large company, with a publisher, webmaster, editor, ad salesperson, billing department, etc. KDnuggets indeed has all these functions, but it is all me and my two cats.

Ajay- I am impressed by the fact that KDnuggets is almost a dictionary or encyclopedia for data mining. But apart from advertising you have not been totally commercial – many features of your newsletter remain ad-free, you still maintain a minimalistic look, and you do not take sponsorship aligned with one big vendor. What is your vision for KDnuggets in the years to come, to keep it truly independent?

Gregory- My vision for KDnuggets is to be a comprehensive resource for the data mining community, and I really enjoyed maintaining such a resource for the first 7-8 years completely non-commercially. However, when I became self-employed, I could not do KDnuggets without any income, so I selectively introduced ads, and only those which are relevant to data mining.

I like to think of KDnuggets as a Craigslist for the data mining community.

I certainly realize the importance of social media and Web 2.0 (and interested people can follow my tweets at twitter.com/kdnuggets) and plan to add more social features to KDnuggets.

Still, just as Wikipedia and Facebook do not make the New York Times obsolete, I think there is room and need for an edited website, especially for such a nerdy and not very social group as data miners.

Ajay- What is the worst mistake or error in writing and publishing that you have made? What is the biggest triumph or high moment in KDnuggets' history?

Gregory- My biggest mistake is probably in choosing the name KDnuggets – in retrospect, I could have used a shorter and easier-to-spell domain name, but in 1997 I never expected that I would still be publishing www.KDnuggets.com 12 years later.

Ajay- Who are your favourite data mining students (having known so many people)? What qualities do you think set a data mining person apart from people in other sciences?

Gregory- I was only an adjunct professor for a short time, so I did not really have data mining students, but I was privileged enough to know many current data mining leaders when they were students.  Among more recent students, I am very impressed with Jure Leskovec, who just finished his PhD and got the best KDD dissertation award.

Ajay- What does Gregory Piatetsky do for fun when he is not informing the world on analytics and knowledge discovery.

Gregory- I enjoy travelling with my family, and in the summer I like biking and windsurfing.
I also read a lot, and am currently in the middle of reading Proust (which I periodically dilute with other, lighter books).

Ajay- What are your favourite blogs and websites to read? Any plans to visit India?

Gregory- I visit many of the blogs listed on www.kdnuggets.com/websites/blogs.html

and I especially like:
– Matthew Hurst's blog, Data Mining: Text Mining, Visualization, and Social Media
– Occam's Razor by Avinash Kaushik, examining web analytics
– Juice Analytics, blogging about analytics and visualization
– Geeking with Greg, exploring the future of personalized information

I also like your website decisionstats.com and plan to visit it more frequently.

I have visited many countries, but not yet India – I am waiting for the right occasion!

Biography

(http://www.kdnuggets.com/gps.html)

Gregory Piatetsky-Shapiro, Ph.D. is the President of KDnuggets, which provides research and consulting services in the areas of data mining, web mining, and business analytics. Gregory is considered to be one of the founders of the data mining and knowledge discovery field. Gregory edited or co-edited many collections on data mining and knowledge discovery, including two best-selling books: Knowledge Discovery in Databases (AAAI/MIT Press, 1991) and Advances in Knowledge Discovery and Data Mining (AAAI/MIT Press, 1996), and has over 60 publications in the areas of data mining, artificial intelligence, and database research.

Gregory is the founder of the Knowledge Discovery in Databases (KDD) conference series. He organized and chaired the first three Knowledge Discovery in Databases (KDD) workshops in 1989, 1991, and 1993. He then served as the Chair of the KDD Steering Committee and guided the conversion of the KDD workshops into the leading international conferences on data mining. He also was the General Chair of the KDD-98 conference.

Interview Tasso Argyros CTO Aster Data Systems

Here is an interview with Tasso Argyros, the CTO and co-founder of Aster Data Systems (www.asterdata.com). Aster Data Systems makes one of the first DBMSs to tightly integrate SQL with MapReduce.


Ajay- Maths and science enrolments the world over are facing a major decline. What would you recommend to young students considering careers in science?

[TA]- My father is a professor of Mathematics and I spent a lot of my college time studying advanced math. What I would say to new students is that math is not a way to get a job; it's a way to learn how to think. As such, a math education can lead to success in any discipline that requires intellectual ability. As long as they take the time to specialize at some point – via postgraduate education or a job where they can learn a new discipline from smart people – they won't regret the investment.

Ajay- Describe your career in science, particularly your time at Stanford. What made you think of starting Aster Data? How important is it for a team, rather than an individual, to begin a startup? Could you describe the startup moment when your team came together?

[TA]- While at Stanford I became very familiar with the world of startups through my advisor, David Cheriton (who was an angel investor in VMWare and Google, and founder of two successful companies). My research was about processing large amounts of data on large, low-cost computer farms. A year into my research it became obvious that this approach had huge processing-power advantages and was superior to anything else I could see in the marketplace. I then happened to meet my other two co-founders, Mayank Bawa and George Candea, who were looking at a similar technical problem from the database and reliability perspectives, respectively.

I distinctly remember George walking into my office one day (I barely knew him back then) and saying "I want to talk to you about startups and the future" – the rest has become history.

Ajay- How would you describe your product, Aster nCluster Cloud Edition, to somebody who does not know anything beyond traditional server/data warehouse technologies? Could you rate it against some known vendors, and give a sense of the usage level at which the total cost of ownership of Aster Data becomes cheaper than, say, an Oracle, SAP, or Microsoft data warehousing solution?

[TA]- Aster allows businesses to reduce their data analytics TCO in two interesting ways. First, it has a much lower hardware cost than any traditional DW technology because of its use of commodity servers or cloud infrastructure like Amazon EC2. Secondly, Aster has implemented a lot of innovations that simplify the (previously tedious and expensive) management of the system, which includes scaling the system elastically up or down as needed – so customers are not paying for capacity they don't need at a given point in time.

But cutting costs is one side of the equation; what makes me even more excited is the ability to make a business more profitable, competitive and efficient through analyzing more data at greater depth. We have customers that have cut their costs and increased their customers and revenue by using Aster to analyze their valuable (and usually underutilized) data. If you have data – and you think you're not taking full advantage of it – Aster can help.

Ajay- I always have this one favourite question. When can I analyze 100 gigabytes of data using just a browser and some statistical software like R, or the advanced forecasting software that is available? Describe some of Aster Data's work in enhancing the analytical capabilities of big data.

Can I run R (free, open source) on an on-demand basis on an Aster Data solution? How much would it cost me to crunch 100 GB of data and build segmentations and models with, say, 50 hours of processing time per month?

[TA]- One of Aster's big innovations is allowing analytical applications like R to be embedded in the database via our SQL/MapReduce framework. We actually have customers right now that are using R to do advanced analytics over terabytes of data. 100 GB is actually on the lower end of what our software can handle, and as such the cost would not be significant.
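For readers new to the idea, here is a rough conceptual sketch, in Python, of the pattern Tasso is describing – run an analytic function independently over each data partition and union the results. The data, column names, and analytic function below are invented for illustration; this is not Aster's actual SQL/MapReduce syntax.

```python
# Conceptual sketch of "push the analytic function to the data partitions".
# Everything here (columns, the analyze() routine) is hypothetical; in Aster's
# case the per-partition function could be an embedded R routine.
from collections import defaultdict
from statistics import mean

rows = [
    {"region": "east", "spend": 120.0},
    {"region": "east", "spend": 80.0},
    {"region": "west", "spend": 200.0},
    {"region": "west", "spend": 150.0},
]

def partition(rows, key):
    """Route each row to the partition that owns its key (the 'map' side)."""
    parts = defaultdict(list)
    for row in rows:
        parts[row[key]].append(row)
    return parts

def analyze(part):
    """Per-partition analytic function (a stand-in for an embedded R model)."""
    return {"n": len(part), "avg_spend": mean(r["spend"] for r in part)}

# Each partition is analyzed independently and the results are unioned,
# much as a SQL/MapReduce table function returns rows back into a SELECT.
results = {key: analyze(part) for key, part in partition(rows, "region").items()}
print(results)
```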

Ajay- What do people at Aster Data do when not making complex software?

[TA]- A lot of Asterites love to travel around the world – we are, after all, a very diverse company. We also love coffee, Indian food, and international and US sports like soccer, cricket, cycling, and football!

Ajay- Name some products that compete with Aster Data, and the areas where Aster Data products are more suitable from a TCO viewpoint. Name specific areas where you would not recommend your own products.

[TA]- We go up against products like Oracle Database, Teradata, and IBM DB2. If you need to do analytics over hundreds of GBs or terabytes of data, our price/performance ratio would be orders of magnitude better.

Ajay- How do you convince a well-known and experienced VC like Sequoia Capital to invest in a start-up? (For example, I could do with some financing for my own server costs.)

[TA]- You need to convince Sequoia of three things. (a) that the market you're going after is very large (in the billions of dollars, if you're successful). (b) that your team is the best set of people that could ever come together to solve the particular problem you're trying to solve. And (c) that the technology you've developed gives you an "unfair advantage" over incumbents or new market entrants. Most importantly, you have to smile a lot!

Biography

About Tasso:

Tasso (Tassos) Argyros is the CTO and co-founder of Aster Data Systems, where he is responsible for all product and engineering operations of the company. Tasso was recently recognized as one of BusinessWeek's Best Young Tech Entrepreneurs for 2009 and was an SAP fellow at the Stanford Computer Science department. Prior to Aster, Tasso was pursuing a Ph.D. in the Stanford Distributed Systems Group with a focus on designing cluster architectures for fast, parallel data processing using large farms of commodity servers. He holds an MSc in Computer Science from Stanford University and a Diploma in Computer and Electrical Engineering from the Technical University of Athens.

About Aster:

Aster Data Systems is a proven leader in high-performance database systems for data warehousing and analytics – the first DBMS to tightly integrate SQL with MapReduce – providing deep insights on data analyzed on clusters of low-cost commodity hardware.

The Aster nCluster database cost-effectively powers frontline analytic applications for companies such as MySpace, aCerno (an Akamai company), and ShareThis. Running on low-cost off-the-shelf hardware, and providing ‘hands-free’ administration, Aster enables enterprises to meet their data warehousing needs within their budget.

Aster is headquartered in San Carlos, California and is backed by Sequoia Capital, JAFCO Ventures, IVP, Cambrian Ventures, and First-Round Capital, as well as industry visionaries including David Cheriton, Rajeev Motwani and Ron Conway.


Interview Steve Sarsfield Author The Data Governance Imperative

Here is an interview with Steve Sarsfield, data quality evangelist and author of The Data Governance Imperative.


Ajay- Describe your early career to the present point. At what point did you decide to specialize or focus on data quality and data governance? What were the causes for it?


Steve- When I was growing up, not many normal people had aspirations of becoming data management professionals. Back in those days, we had aspirations to be NFL wide receivers, writers, engineers, and lawyers. Data management careers tend to find you.

My career path has wandered through technical support, technical writing and managing editing, consulting, and product management at Lotus Development. I've been working for the past nine years at a major data quality vendor – the longest job I've had to date. The good news is that this latest gig has given me a chance to meet with a LOT of people who have been implementing data quality and data governance projects.

When you get involved with these projects, you begin to realize the power they have. You begin to love data governance for the efficiencies it brings, and for the impact it has on your organization as it becomes more competitive.


Ajay- Some people think data quality is a boring job and data governance is an abstract philosophy. How would you interest a young high school or college student, with the right aptitude, in taking up a business intelligence career and staying focused on it?


Steve- In my opinion, if you promote a geeky view of data governance the message will tend to fall flat. If there's one thing I have written most about, it is bridging the gap between technology and business. Those who succeed in this field now and in the future will be people who are a bit of a jack-of-all-trades.

You need to be a good technologist, critical thinker, marketer, and strategist, and you need to use those skills every day to succeed. Leadership skills are also important, especially if you are trying to bootstrap a data governance program at your corporation. Those job attributes are not boring, they are challenging and exciting.

In terms of being persuasive about getting involved in a data career, it’s clear that data is not likely to decrease in volume in the coming years, quite the contrary, so your job will have a reasonable amount of security.  Nor will there be less of a need in the future for developing accurate business metrics from the data.

In my book, I talk about the fact that the decision of a corporation to move toward data governance is really a choice between optimism and fear. Your company must decide either to be haunted by a never-ending vision that there will only be more data, more mergers and more complexity in the years to come, or to take charge for a more hopeful future that will bring more opportunity, more efficiency and a more agile working environment. When you choose data governance as a career, you choose to provide that optimism for your employer.


Ajay- What are the salient points in your book, The Data Governance Imperative? Do you think data governance is an idea whose time has come?


Steve-The book is about the increasing importance of data to a business. As your company collects more and more data about customers, products, suppliers, transactions and billing, it becomes more difficult to accurately maintain that information without a centralized approach and a team devoted to the data management mission.

The book comes from discussions with folks in the business who are trying to get a data governance program started in their corporation.  They are the data champions who “get it”, but are yet to convince their management that data is crucial to the success of the company.

The fact is, there are metrics you can follow, processes that you can put in place, conversations that you can have, and technology that you can implement in order to make your managers and co-workers see the importance of data governance.  We know this because it has worked for so many companies who are far more advanced in managing their data than most.

The most evolved companies will have support from executive management and the entire company to define reusable processes for data governance and a center of excellence is formed around it. Much of the book is about garnering support and setting up the processes to prove enterprise data’s importance.  Only when you do that will your company evolve its data governance strategy.


Ajay- Garbage data in, garbage data analysis out. What percentage of a BI installation budget goes to input data quality at the point of data entry? What kind of budget would you like it to be?


Steve- I’m sure this varies depending upon many factors, including the number of sources, age and quality of the source data, etc. Anecdotally, the percentage of budget five years ago was near zero. You really only saw realization of the problem LATE in the project, after the first data warehouse loading occurred. What has happened over the years is that we’ve gotten a lot smarter about this, perhaps as a result of our past failures. In the past, if the data worked well in the source systems it was assumed that it would work in the target.

A lot of those projects failed because the team incorrectly scoped the project with regard to the data integration. Today we have the wisdom and experience to know that this assumption is not true. In order to really assess our needs for data quality, we know we need to profile the data as one of the first tasks in the process. This will help us create a more accurate timeline and budget, and assure management that we know what we're doing with regard to data integration and business intelligence.
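As a rough illustration of what that first profiling pass might look like, here is a small Python sketch using pandas; the file name, columns, and thresholds are made up for the example.

```python
# A quick profiling pass of the kind Steve describes, run before scoping
# the integration work. The CSV path and the flagging thresholds are hypothetical.
import pandas as pd

df = pd.read_csv("customers_source_extract.csv")

profile = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "null_count": df.isnull().sum(),
    "null_pct": (df.isnull().mean() * 100).round(1),
    "distinct_values": df.nunique(),
})
print(profile)

# Flag fields that are mostly empty or suspiciously constant before
# committing to a timeline and budget.
print(profile[(profile["null_pct"] > 50) | (profile["distinct_values"] <= 1)])
```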


Ajay- Do you think Federal Governments can focus stimulus spending smarter with better input data quality?


Steve- Believe it or not, I'm encouraged by the US Government's plan on data quality. To varying degrees, Presidents Clinton, Bush and Obama have all supported plans for greater transparency and openness. To accomplish that, you have to govern data. In Washington, many government agencies now have a Chief Information Officer. The government is recruiting leading universities like MIT to work toward better data governance in government. The sheer number of databases even within a single US government agency will be a huge challenge, but the direction is good.

This year's MIT Information Quality Symposium, for example, had a very solid government track with speakers from the Army, Air Force, Department of Defense, EPA, HUD, and the National Institutes of Health, to name just a few.

Other than the US, it gets even cloudier. There are governments ahead of the US, like the UK and Germany, and those who still need to catch up.


Ajay- Name some actual anecdotes in which 1) bad data quality led to disaster 2) good data quality gave great insights


Steve- There are certainly plenty of typical examples, but I always like the unusual ones, like:

A major motorcycle manufacturer used data quality tools to pull out nicknames from their customer records. Many of the names they had acquired for their prospect list were from motorcycle events and contests where the entries were, shall we say, colorful. The name fields contained data like “John the Mad Dog Smith” or “Frank Motor-head Jones”. The client used the tool to separate the name from the nickname, making it a more valuable marketing list.

One major utility company used data quality tools to identify and record notations on meter-reader records that were important to keep for operational uses, but not in the customer billing record. Upon analysis of the data, the company noticed random text like "LDIY" and "MOR" along with the customer records. After some work with the business users, they figured out that LDIY meant "Large Dog in Yard", which was particularly important for meter readers. MOR meant "Meter in Right", which was also valuable. The readers were given their own notes field, so that they could maintain the integrity of the name and address while also keeping this valuable data. It probably saved a lot of meter readers from dog bite situations.

Financial organizations have used data quality tools to separate items like “John and Judy Smith/221453789 ITF George Smith”. The organization wanted to consider this type of record as three separate records “John Smith” and “Judy Smith” and “George Smith” with obvious linkage between the individuals. This type of data is actually quite common on mainframe migrations.

A food manufacturer standardizes and cleanses ingredient names to get better control of manufacturing costs. In data from their worldwide manufacturing plants, an ingredient might be "carrots", "chopped frozen carrots", "frozen carrots, chopped", "chopped carrots, frozen", and so on (not to mention all the possible abbreviations for the words carrots, chopped and frozen). Without standardization of these ingredients, there was really no way to tell how many carrots the company purchased worldwide.

There was no bargaining leverage with the carrot supplier, or with any of the other ingredient suppliers, until the data was fixed. In terms of disasters, I'd recommend the IAIDQ's web site, IQ Trainwrecks (http://www.iqtrainwrecks.com/). The IAIDQ does a great job and I contribute when I can.
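A toy sketch of the kind of standardization the carrot example describes might look like the following in Python; the synonym table and rules are invented for illustration, and real data quality tools rely on much richer reference data.

```python
# Toy standardization of free-text ingredient names, in the spirit of the
# carrot example above. The abbreviation table is invented for illustration.
import re

SYNONYMS = {"frzn": "frozen", "chpd": "chopped", "carrot": "carrots"}

def standardize(raw: str) -> str:
    tokens = re.findall(r"[a-z]+", raw.lower())
    tokens = [SYNONYMS.get(t, t) for t in tokens]   # expand known abbreviations
    return " ".join(sorted(set(tokens)))            # canonical word order

variants = ["chopped frozen carrots", "frozen carrots, chopped",
            "chopped carrots, frozen", "Chpd Frzn Carrots"]

# All four variants collapse to the same key, so purchases can finally be totalled.
print({v: standardize(v) for v in variants})
```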


Ajay- What are the five essential things a CEO should ask his or her CTO to ensure good data quality in an enterprise?


Steve- What a great question. I can think of more than five, but let’s start with:


1) What is poor quality data costing us?
This should inspire your CTO to go out and seek problem areas in partnership with the business and ways to improve processes.

2) Do I have to make decisions on gut-feel, or should I trust the business intelligence you give our employees?  What confidence level do you have in our BI?

The CEO should be confident in the metrics delivered with BI and he should make sure the CTO has the same concerns.

3) Are we in compliance with all laws regarding our governance of data?

CEOs are often culpable for non-compliance, so he/she should be concerned about any laws that govern the company’s industry. Even in unregulated industries, organizations must comply with spam laws and “do not mail” laws for marketing.

4) Are you working across business units to work towards data governance, or is data quality done in silos?

When possible, data quality should be a reusable process that can be implemented in a similar manner across business units.

5) Do you have the access to data you need?

The CEO should understand if any office politics are getting in the way of ensuring data quality and this question opens the door to that discussion.

Ajay- What does Steve Sarsfield do when not writing blogs and books.


Steve-These days, when I’m not thinking about data or my blog, I’m thinking about my fantasy football team and the upcoming season. I’ve got a ticket to the New England Patriots opening game vs the Buffalo Bills and I’m looking forward to it. On the weekends, you may find me playing a game of mafia wars on Facebook or cooking up a big pot of chili for the family.


Biography-


Steve Sarsfield is a data governance business expert, speaker, author of The Data Governance Imperative (http://www.itgovernance.co.uk/products/2446), and blogger at http://data-governance.blogspot.com/. He is a product marketing professional at a major data quality vendor. He was a guest speaker at the MIT Information Quality Symposium (July 2007 and July 2008), at the International Association for Information and Data Quality (IAIDQ) Symposium (December 2006), and at the SAP CRM 2006 summit.

Interview Dr Usama Fayyad Founder Open Insights LLC

Here is an interview with Dr Usama Fayyad, founder of Open Insights LLC (www.open-insights.com). In his prior role as Chief Data Officer of Yahoo!, he built the data teams and infrastructure to manage the 25 terabytes of data per day that resulted from the company's operations.

 


Ajay- Describe your career in science. How would you motivate young people today to take up science careers rather than other careers?
Dr Fayyad-
My career started out in science and engineering. My original plan was to be in research and to become a university professor. Indeed, my first few jobs were strictly in basic research. After doing summer internships at places like GM Research Labs and JPL, my first full-time position was at the NASA Jet Propulsion Laboratory, California Institute of Technology.

I started in research in Artificial Intelligence for autonomous monitoring and control, and in Machine Learning and data mining. The first major success was with Caltech astronomers, using machine learning classification techniques to automatically recognize objects in a large sky survey (POSS-II – the Second Palomar Observatory Sky Survey). The survey consists of high-resolution images of the entire northern sky. The images, when digitized, contain over 2 billion sky objects. The main problem is to recognize whether an object is a star or a galaxy. For "faint objects" – which constitute the majority of objects – this was an exceedingly hard problem that people had wrestled with for 30 years. I was surprised how well the algorithms could do at solving it.

This was a real example of data sets where the dimensionality is so high that algorithms are better suited to solving the problem than humans – even well-trained astronomers. Our methods had over 94% accuracy on faint objects that no one could previously classify reliably at better than 75% accuracy. This additional accuracy made all the difference in enabling all sorts of new science, discoveries and theories about the formation of large-scale structure in the Universe.
The success of this work and its wide recognition in the scientific and engineering communities led to the creation of a new group – I founded and managed the Machine Learning Systems group at JPL, which went on to address hard problems in object recognition in scientific data – mostly from remote sensing instruments – such as Magellan images of the planet Venus (we recognized and classified over a million small volcanoes on the planet in collaboration with geologists at Brown University) and Earth Observing System data, including atmospheric and storm data.
At the time, Microsoft was interested in figuring out data mining applications in the corporate world, and after a long recruiting cycle they got me to join the newly formed Microsoft Research as a Senior Researcher in late 1995. My work there focused on algorithms, database systems, and basic science issues in the newly formed field of Data Mining and Knowledge Discovery. We had just finished publishing a nice edited collection of chapters in a book that became very popular, and I had agreed to become the founding Editor-in-Chief of a brand new journal called Data Mining and Knowledge Discovery. This journal today is the premier scientific journal in the field. My research work at Microsoft led to several applications – especially in databases. I founded the Data Mining & Exploration group at MSR and later a product group in SQL Server that built and shipped the first integrated data mining product in a large-scale commercial DBMS – SQL Server 2000 (Analysis Services). We created extensions to the SQL language (which we called DMX) and tried to make data mining mainstream. I really enjoyed the life of doing basic research as well as having a real product group that built and shipped components in a major DBMS.
That's when I learned that the really challenging problems in the real world were not in data mining but in getting the data ready and available for analysis – data warehousing was a field littered with failures and data stores that were write-only (meaning data never came out!). I used to call these Data Tombs at the time, and I likened them to the pyramids in Ancient Egypt: great engineering feats to build, but really just tombs.

In 2000 I decided to leave the world of research at Microsoft to do my first venture-backed start-up company – digiMine. The company wanted to solve the problem of managing the data and performing data mining and analysis over data sets, and we targeted a model of hosted data warehouses and mining applications as an ASP – one of the first Software as a Service (SaaS) firms in that arena. This began my transition from the world of research and science to business and technology. We focused on online data and web analytics since the data volumes there were about 10x the size of transactional databases, and most companies did not know how to deal with all that data. The business grew fast and so did the company – reaching 120 employees in about a year.

After 3 years of doing a high-growth start-up and raising some $50 million in venture capital for the company, I was beginning to feel the itch again to do technical work.
In June 2003, we had a chance to spin off the part of the business that was focused on difficult high-end data mining problems. This opportunity was exactly what I needed, and we formed DMX Group as a spinoff company that had a solid business from its first day. At DMX Group I got to work on some of the hardest data mining problems in predicting sales of automobiles, churn of wireless users, financial scoring and credit risk analysis, and many related deep business intelligence problems.

Our client list included many of the Fortune 500 companies. One of these clients was Yahoo! After 6 months of working with Yahoo! as a client, they decided to acquire DMX Group and use the people to build a serious data team for Yahoo! We negotiated a deal that got about half the employees into Yahoo!, and we spun off the rest of DMX Group to continue focusing on consulting work in data mining and BI. I thus became the industry's first Chief Data Officer.

The original plan was to spend 2 years or so to help Yahoo! form the right data teams and build the data processing and targeting technology to deliver high value from its inventory of ads.
Yahoo! proved to be a wonderful experience and I learned so much about the Internet. I also learned that even someone like me, who had worked on Internet data from the early days of MSN (in 1996) and who had run a web analytics firm, still had not scratched the surface of the depth of the area. I learned a lot about the Internet from Jerry Yang (Yahoo! co-founder), much about the advertising/media business from Dan Rosensweig (COO) and Terry Semel (then CEO), and lots about technology management and strategic deal-making from Farzad (Zod) Nazem, who was the CTO.

As Executive VP at Yahoo!, I built one of the industry's largest and best data teams, and we were able to process over 25 terabytes of data per day and power several hundred million dollars of new revenue for Yahoo! resulting from these data systems. A year after joining Yahoo! I was asked to form a new research lab to study much of what we did not understand about the Internet. This was yet another return of basic research into my life. I founded Yahoo! Research to invent the new sciences of the Internet, and I wanted it to be focused on only 4 areas (the idea of focus came from my exposure to Caltech and its philosophy of picking a few areas of excellence). The goal was to become the best research lab in the world in these new focused areas. Surprisingly, we did it within 2 years. I hired Prabhakar Raghavan to run Research and he did a phenomenal job in building out the Research organization. The four areas we chose were: Search and Information Navigation, Community Systems, Micro-economics of the Web, and Computational Advertising. We were able to attract the top talent in the world to lead or work on these emerging areas. Yahoo! Research was a success in basic research but also in influencing product. The chief scientists for all the major areas of company products all came from Yahoo! Research and all owned the product development agenda and plans: Raghu Ramakrishnan (CS for Audience), Andrew Tomkins (CS for Search), Andrei Broder (CS for Monetization) and Preston McAfee (CS for Marketplaces/Exchanges). I consider this an unprecedented achievement in the world of research in general: excellence in basic research and huge impact on company products, all within 3-4 years.
I have recently left Yahoo! and started Open Insights (www.open-insights.com) to focus on data strategy and on helping enterprises realize the value of data, develop the right data strategies, and create new business models – sort of an "outsourced version" of my Chief Data Officer job at Yahoo!
Finally, on my advice to young people: it is not just about science careers, I would call it engineering careers. My advice to any young person, in fact, whether they plan to become a business person, a medical doctor, an artist, a lawyer, or a scientist, is that basic training in engineering and abstract problem solving will be a huge asset. Some of the best lawyers, doctors, and even CEOs started out with engineering training.
For those young people who want to become scientists, my advice is to always look for real-world applications in whose context the research can be conducted. The reason for that is both technical and sociological. From a technical perspective, the reality of an application and the fact that things have to work enforce a regimen of technical discipline and make sure that the new ideas are tested and challenged. Socially, working on a real application forces interactions with people who care about the problem and provides continuous feedback, which is really crucial in guiding good work (even if scientists deny this, social pressure is a big factor) – it also ensures that your work will be relevant and will evolve in relevant directions. I always tell people who are seeking basic research: "some of the deepest fundamental science problems can often be found lurking in the most mundane of applications". So embrace applied work but always look for the abstract deep problems – that summarizes my advice.
Ajay- What are the challenges of running data mining for a very big website?
Dr Fayyad-
There are many challenges. Most algorithms will not work due to scale. Also, most of the problems have an unusually high dimensionality – so simple tricks like sampling won’t work. You need to be very clever on how to sample and how to reduce dimensionality by applying the right variable transformations.

The variety of problems is huge, and the fact that the Internet is evolving and growing rapidly, means that the problems are not fixed or stationary. A solution that works well today will likely fail in a few months – so you need to always innovate and always look at new approaches. Also, you need to build automated tools to help detect changes and address them as soon as they arise. 

Problems with 1,000, 10,000, or millions of variables are very common in web challenges. Finally, whatever you do needs to work fast, or else you will not be able to keep up with the data flux. Imagine falling behind on processing 25 terabytes of data per day. If you fall behind by two days, you will never be able to catch up again – not within any reasonable budget constraint. So you try never to go down.
Ajay- What are the 5 most important things that a data miner should avoid in doing analysis?

Dr Fayyad- I never thought about this in terms of a top 5, but here are the big ones that come to mind, not necessarily in any order:
a. The algorithm knows nothing about the data, and the knowledge of the domain is in the heads of the domain experts. As I always say, an ounce of knowledge is worth a ton of data – so seek out and model what the experts know, or your results will look silly.
b. Don't let an algorithm fish blindly when you have lots of data. Use what you know to reduce the dimensionality quickly. The curse of dimensionality is never to be underestimated.
c. Resist the temptation to cheat: selecting training and test sets can easily fool you into thinking you have something that works. Test it honestly against new data, and never "peek" at the test data – what you see will force you to cheat without knowing it (see the short sketch after this list).
d. Business rules typically dominate data mining accuracy, so be sure to incorporate the business and legal constraints into your mining.
e. I have never seen a large database in my life that came from a static distribution that was sampled independently. Real databases grow to be big through lots of systematic changes and biases, and they are collected over years from a changing underlying distribution: segmentation is a pre-requisite to any analysis. Most algorithms assume that data is IID (independent and identically distributed).
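As a small illustration of the "no peeking" discipline in point (c), here is a sketch using scikit-learn on synthetic data; the model choice and numbers are purely illustrative.

```python
# Hold out a test set once, up front, and touch it only for the final score.
# All tuning and comparison happens via cross-validation on the training data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=50, random_state=0)

# One split, fixed before any modelling decisions are made.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=1000)
print("cv accuracy:", cross_val_score(model, X_train, y_train, cv=5).mean())

# The test set is consulted exactly once, at the very end.
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```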

Ajay- Do you think frameworks like Hadoop and MapReduce will change the online database landscape permanently? What further developments do you see in this area?


Dr Fayyad-
I think they will change (and have changed) the landscape dramatically, but they do not address everything. Many problems lend themselves naturally to Map-Reduce, and many new approaches are enabled by Map-Reduce. However, there are many problems where M-R does not do much. I see a lot of problems being addressed by a large grid nowadays when they don't need it. This is often a huge waste of computational resources. We need to learn how to deal with a mix of tools and platforms. I think M-R will be with us for a long time and will be a staple tool – but not a universal one.
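For readers new to the model, here is a minimal word-count sketch in plain Python showing the map / shuffle / reduce shape that such problems share; anything that does not decompose into independent per-key work – tightly coupled iterative algorithms, for example – fits this pattern far less naturally, which is part of the point above.

```python
# The classic shape MapReduce handles well: a map step emits (key, value) pairs,
# a shuffle groups them by key, and a reduce step aggregates each group independently.
from collections import defaultdict

documents = ["data mining and knowledge discovery",
             "knowledge discovery in databases",
             "data mining at web scale"]

def map_phase(doc):
    return [(word, 1) for word in doc.split()]

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    return key, sum(values)

pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)
```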
Ajay- I look forward to the day when I have just a low-priced netbook and a fast internet connection, and can upload a gigabyte of data and run advanced analytics in the browser. How far off, or how soon, do you think that is?
Dr Fayyad- Well, I think the day is already here. In fact, much of our web search today is conducted in exactly that model. A lot of web analysis, and much of scientific analysis, is done like this today.
Ajay- Describe some of the conferences you are currently involved with and the research areas that excite you the most.
Dr Fayyad-
I am still very involved in knowledge discovery and data mining conferences (especially the KDD series), machine learning, some statistics, and some conferences on search and internet.  Most exciting conferences for me are ones that cover a mix of topics but that address real problems. Examples include understanding how social networks evolve and behave, understanding dimensionality reductions (like random projections in very high-D spaces) and generally any work that gives us insight into why a particular technique works better and where the open challenges are.
Ajay- What are the next breakthrough areas in data mining? Can we have a Google or Yahoo in the field of business intelligence as well, given its huge market potential and uncertain ROI?
Dr Fayyad- We already have some large and healthy businesses in BI and quite a huge industry in consulting. If you are asking particularly about the tools market then I think that market is very limited. The users of analysis tools are always going to be small in number. However, once the BI and Data Mining tools are embedded in vertical applications, then the number of users will be tremendous. That’s where you will see success.
Consider the examples of Google or Yahoo! – and now Microsoft with the Bing search engine. Search engines today would not be good without machine learning/data mining technology. In fact, MLR (Machine Learned Ranking) is at the core of the ranking methodology that decides which search results bubble to the top of the list. The typical web query is 2.6 keywords long and has about a billion matches. What matters are the top 10. The function that determines these is a relevance ranking algorithm that uses machine learning to tune a formula that considers hundreds or thousands of variables about each document. So in many ways, you have a great example of this technology being used by hundreds of millions of people every day – without them knowing it!
Success will be in applications where the technology becomes invisible – much like the internal combustion engine in your car or the electric motor in your coffee grinder or power supply fan. I think once people start building verticalized solutions that embed data mining and BI, we will hit success. This already has happened in web search, in direct marketing, in advertising targeting, in credit scoring, in fraud detection, and so on…

Ajay- What do you do to relax? What publications would you recommend for staying up to date, especially for younger analysts in data mining?
Dr Fayyad-
My favorite activity is sleep, when I can get it. But more seriously, I enjoy reading books, playing chess, skiing (on water or snow – downhill or cross-country), or any activities with my kids. I swim a lot and that gives me much time to think and sort things out.
I think for keeping up with the technical advances in data mining: the KDD conferences, some of the applied analytics conferences, the WWW conferences, and the data mining journals. The ACM SIGKDD publishes a nice newsletter called SIGKDD Explorations. It comes with a very low membership fee and it has a lot of announcements and survey papers on new topics and important areas (www.kdd.org). Also, a good list to keep up with is an email newsletter called KDnuggets, edited by Gregory Piatetsky-Shapiro.
 

Biography (www.fayyad.com/usama )-

Usama Fayyad founded Open Insights (www.open-insights.com) to deliver on the vision of bridging the gap between data and insights, and to help companies develop strategies and solutions not only to turn data into working business assets, but to turn the insights available from growing amounts of data into critical components of an enterprise's strategy for approaching markets, dealing with competitors, and acquiring and retaining customers.

In his prior role as Chief Data Officer of Yahoo! he built the data teams and infrastructure to manage the 25 terabytes of data per day that resulted from the company’s operations. He also built up the targeting systems and the data strategy for how to utilize data to enhance revenue and to create new revenue sources for the company.

In addition, he was the founding executive for Yahoo! Research, a scientific research organization that became the top research place in the world working on inventing the new sciences of the Internet.

Interview Karen Lopez Data Modeling Expert

[Image: Zachman Framework, via Wikipedia]

Here is an interview with Karen Lopez, who has worked in data modeling for almost three decades and is a renowned expert in data management.

Data professionals need to know about the data domain in addition to the data structure domain – Karen Lopez

Ajay- Describe your career in science. How would you persuade younger students to take more science courses.

Karen- I've always had an interest in science and I attribute that to the great science teachers I had. I studied information systems at Purdue University through a unique program that focuses on systems analysis and computer technologies. I'm one of the few who studied data and process modeling in an undergraduate program 25+ years ago.

I believe that it is very important that we find a way of attracting more scientists to teach. In both the natural and computer sciences, it’s difficult for institutions to tempt scientists away from professional positions that offer much greater compensation. So I support programs that find ways to make that happen.

Ajay- If you had to give advice to a young person starting their career in BI, in just three points, what would those be?

Karen- Wow. It’s tough to think of just three things, but these are recommendations that I make often:

– Remember that every design decision should be made based on cost, benefit, and risk. If you can’t clearly describe these for every side of a decision, then you aren’t doing design; you are guessing.

– No one besides you is responsible for advancing your skills and keeping an eye on emerging practices. Don't expect your employer to lay out a career plan that is in your best interest. That's not their job. Data professionals need to know about the data domain in addition to the data structure domain. The best database or data warehouse design in the world is worse than useless if how the data is processed is wrong. Remember to expand your knowledge about data, not just the data structures and tools.

– All real-world work involves collaboration and negotiation. There is no one right answer that works for every situation. Building your skills in these areas will pay off significantly.

Ajay- What do you think is the best way for a technical consultant and a client to get on the same page regarding requirements? Which methodology or template have you used, and which has given you the most success?

Karen- While I’m a huge fan of modeling (data modeling and other modeling), I still think that giving clients a prototype or mockup of something that looks real to them goes a long way. We need to build tools and competencies to develop these prototypes quickly. It’s a lost art in the data world.

Ajay- What are the special incentives that make Canada a great place for tech entrepreneurs, rather than, say, going to the United States? (Disclaimer: I have family in Canada and study in the US.)

Karen- I prefer not to think of this as an either-or decision. I immigrated to Canada from the US about 15 years ago, but most of our business is outside of Canada. I have enjoyed special incentives here in Canada for small businesses as well as special programs that allowed me to work in Canada as a technical professional before I moved here permanently.

Overall, I have found Canadian employers more open to sponsoring foreign workers and it is easier for them to do so than what my US clients experience. Having said that, a significant portion of my work over the last few years has been on global projects where we leverage online collaboration tools to meet our goals. The advent of these tools has made it much easier to work from wherever I am and to work with others regardless of their visa statuses.

Where a company forms is less tied to where one lives or works these days.

Ajay- Could you tell us more about the Zachman Framework (apart from the Wikipedia reference)? A practical example of how you used it on an actual project would be great.

Karen- Of course the best resource for finding out about the Zachman framework is from John Zachman himself http://www.zachmaninternational.com/index.php/home-article/13 . He offers some excellent courses and does a great deal of public speaking at government and DAMA events. I highly recommend anyone interested in the Framework to hear about it directly from him.

There are many misunderstandings about John’s intent, such as the myth that he requires big upfront modeling (he doesn’t), that the Framework is a methodology (it isn’t), or that it can only be used to build computer systems (it can be used for more than that).

I have used the Zachman Framework to develop a joint Business-IT Strategic Information Systems Plan as well as to inventory and track progress of multi-project programs. One interesting use was a paper I authored for the Canadian Information Processing Society (CIPS) on how various educational programs, specializations, and certifications map to the Zachman Framework. I later developed a presentation about this mapping for a Zachman conference.

For a specific project, the Zachman Framework allows business to understand where their enterprise assets are being managed – and how well they are managed. It’s not an IT thing; it’s an enterprise architecture thing.

Ajay- What does Karen Lopez do for fun when not at work, traveling, speaking or blogging.

Karen- Sometimes it seems that’s all I do. I enjoy volunteering for IT-related organizations such as DAMA and CIPS. I participate in the accreditation of college and university educational programs in Canada and abroad. As a member of data-related standards bodies, namely the Association for Retail Technology Standards and the American Dental Association, I help develop industry standard data models. I’ve also been a spokesperson for a CIPS program to encourage girls to take more math and science courses throughout their student careers so that they may have access to great opportunities in the future.

I like to think of myself as a runner; last year I completed my first half marathon, which I'd never thought was possible. I am studying Hindi and Sanskrit. I'm also addicted to reading and am thankful that some of it I actually get paid to do.

Biography

Karen López is a Senior Project Manager at InfoAdvisors, Inc. Karen is a frequent speaker at DAMA conferences and DAMA Chapters. She has 20+ years of experience in project and data management on large, multi-project programs. Karen specializes in the practical application of data management principles. Karen is also the ListMistress and moderator of the InfoAdvisors Discussion Groups at http://www.infoadvisors.com. You can reach her at www.twitter.com/datachick