
Common Analytical Tasks


Some common analytical tasks from the diary of the glamorous life of a business analyst (a few of these are sketched in R right after the list)-

1) removing duplicates from a dataset based on certain key values/variables
2) merging two datasets based on a common key variable (or variables)
3) creating a subset based on a conditional value of a variable
4) creating a subset based on a conditional value of a date-time variable
5) converting a date-time variable from one format to another
6) computing means grouped (classified) at a level of aggregation
7) creating a new variable based on an if-then condition
8) creating a macro to run the same program with different parameters
9) creating a logistic regression model and scoring a dataset
10) transforming variables
11) checking the ROC curve of a model
12) splitting a dataset into a random sample (repeatable with a random seed)
13) creating a cross-tab of all variables in a dataset with one response variable
14) creating bins or ranks from a certain variable's values
15) examining cross-tabs graphically
16) creating histograms
17) creating density plots, e.g. plot(density()) in R
18) creating a pie chart
19) creating a line graph, creating a bar graph
20) creating a bubble chart
21) running a goal-seek kind of simulation/optimization
22) creating a tabular report of multiple metrics grouped by one time/class variable
23) creating a basic time series forecast
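To make this concrete, here is a minimal R sketch covering several of the items above (1-8 and 14). The data frame df and its columns (id, region, sales, txn_date) are made up purely for illustration:

# a hypothetical transactions data frame
df <- data.frame(
  id       = c(1, 1, 2, 3, 3),
  region   = c("N", "N", "S", "E", "E"),
  sales    = c(100, 100, 250, 80, 90),
  txn_date = as.Date(c("2009-01-15", "2009-01-15", "2009-03-02",
                       "2009-06-10", "2009-07-01"))
)

# 1) remove duplicates based on key variables
dedup <- df[!duplicated(df[, c("id", "txn_date")]), ]

# 2) merge two datasets on a common key
regions <- data.frame(region = c("N", "S", "E"),
                      manager = c("Alice", "Bob", "Carol"))
merged <- merge(dedup, regions, by = "region")

# 3) subset on a conditional value of a variable
big <- subset(merged, sales > 90)

# 4) subset on a conditional value of a date variable
recent <- subset(merged, txn_date >= as.Date("2009-03-01"))

# 5) change a date variable from one display format to another
merged$txn_month <- format(merged$txn_date, "%b %Y")

# 6) means grouped at a level of aggregation
aggregate(sales ~ region, data = merged, FUN = mean)

# 7) new variable from an if-then condition
merged$size <- ifelse(merged$sales > 90, "large", "small")

# 8) a parameterized function, R's analogue of a reusable macro
top_n_by <- function(data, var, n = 2) {
  data[order(-data[[var]]), ][seq_len(min(n, nrow(data))), ]
}
top_n_by(merged, "sales", 2)

# 14) bins and ranks from a variable's values
merged$sales_bin  <- cut(merged$sales, breaks = 3)
merged$sales_rank <- rank(merged$sales)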

and some case studies I could think of-


As the Director, Analytics you have to examine current marketing efficiency as well as help optimize sales force efficiency across various channels. In addition you have to examine multiple sales channels, including inbound telephone, outbound direct mail, and internet email campaigns. The data warehouse is an RDBMS, but it has multiple data quality issues to be checked for. Finally, you need to submit your estimates for next year's annual marketing budget to maximize sales return on investment.

As the Director, Risk you have to examine the book of overdue mortgages that your predecessor left you. You need to optimize collections and minimize fraud and write-offs, and your efforts will be measured by the profits of your department.

As a social media consultant you have been asked to maximize social media analytics and social media exposure for your client. You need to create a mechanism to report on particular brand keywords, automated triggers for unusual web activity, and statistical analysis of website analytics metrics. Above all, it needs to be set up as an automated reporting dashboard.

As a consultant to a telecommunications company you are asked to monitor churn and review the existing churn models. You also need to optimize advertising spend across various channels. The problem is that there are always a large number of promotions going on, and some of the data is either incorrectly coded or there are interaction effects between the various promotions.

As a modeller you need to do the following-
1) Check the ROC curve and Hosmer-Lemeshow (H-L) statistic for the existing model
2) Divide the dataset into random 40:60 splits
3) Create multiple aggregated variables from the basic variables
4) Run the regression again and again
5) Evaluate the statistical robustness and fit of the model
6) Display the results graphically
All these steps can be broken down into little pieces of code - something I am putting together a list of; a minimal sketch follows below.
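Here is a minimal R sketch of steps 1-6 on simulated data, with no packages assumed; the AUC and ROC curve are computed from first principles rather than with a dedicated library, and the Hosmer-Lemeshow test is omitted for brevity:

# simulated data, repeatable via a random seed
set.seed(42)
n  <- 1000
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
df$y <- rbinom(n, 1, plogis(-0.5 + 1.2 * df$x1 - 0.8 * df$x2))

# 2) divide the dataset into a random 40:60 split
idx   <- sample(seq_len(n), size = floor(0.4 * n))
train <- df[idx, ]
test  <- df[-idx, ]

# 3) create an aggregated/transformed variable from the basics
train$x1sq <- train$x1^2
test$x1sq  <- test$x1^2

# 4) run the (logistic) regression and score the holdout
fit   <- glm(y ~ x1 + x2 + x1sq, data = train, family = binomial)
score <- predict(fit, newdata = test, type = "response")

# 5) evaluate fit: a hand-rolled AUC, i.e. the probability that a
#    random positive outscores a random negative
pos <- score[test$y == 1]
neg <- score[test$y == 0]
auc <- mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))

# 1, 6) trace the ROC curve and display it graphically
thr <- sort(unique(score), decreasing = TRUE)
tpr <- sapply(thr, function(t) mean(pos >= t))
fpr <- sapply(thr, function(t) mean(neg >= t))
plot(fpr, tpr, type = "l", main = "ROC curve")
abline(0, 1, lty = 2)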
Are there any common data analysis tasks that you think I am missing, or any common case studies? Let me know.


Interview Thomas C. Redman Author Data Driven

Here is an interview with Tom Redman, author of Data Driven. Among the first to recognize the need for high-quality data in the information age, Dr. Redman established the AT&T Bell Laboratories Data Quality Lab in 1987 and led it until 1995. He is the author of four books, holds two patents, and leads his own consulting group. In many respects "the Data Doc", as he is nicknamed, is also the father of data quality evangelism.


Ajay- Describe your career from science student to author of science and strategy books.
Redman: I took the usual biology, chemistry, and physics classes in college.  And I worked closely with oceanographers in graduate school.  More importantly, I learned directly from two masters.  First was Dr. Basu, who was at Florida State when I was.  He thought more deeply and clearly about the nature of data and what we can learn from them than anyone I've met since.  And second was the people in the Bell Labs community who were passionate about making communications better. What I learned there was you don't always need "scientific proof" to move forward.


Ajay- What kind of bailout do you think the government can give to science education in this country?

Redman: I don't think the government should bail out science education per se. Science departments should compete for students just like the English and anthropology departments do.  At the same time, I do think the government should support some audacious goals, such as slowing global warming or achieving energy independence.  These could well have the effect of increasing demand for scientists and science education.

Ajay- Describe your motivations for writing your book Data Driven: Profiting from Your Most Important Business Asset.

Redman: Frankly I was frustrated.  I've spent the last twenty years on data quality, and organizations that improve it gain enormous benefit.  But so few do.  I set out to figure out why that was and what to do about it.

Ajay- What can various segments of readers learn from this book- a college student, a manager, a CTO, a financial investor and a business intelligence vendor?

Redman: I narrowed my focus to the business leader and I want him or her to take away three points.  First, data should be managed as aggressively and professionally as your other assets.  Second, they are unlike other assets in some really important ways and you’ll have to learn how to manage them.  Third, improving quality is a great place to start.

Ajay- Garbage in, garbage out- how much money and time do you believe is given to data quality in data projects?

Redman:   By this I assume you mean data warehouse, BI, and other tech projects.  And the answer is “not near enough.”  And it shows in the low success rate of those projects.

Ajay- Consider a hypothetical scenario- instead of creating and selling fancy algorithms, a business intelligence vendor uses the simple Pareto principle to focus on data quality and design during data projects. How successful do you think that would be?

Redman: I can't speak to the market, but I do know that organizations are loaded with problems and opportunities.  They could make great progress on the most important ones if they could clearly state the problem and bring high-quality data and simple techniques to bear.  But there are a few that require high-powered algorithms.  Unfortunately those require high-quality data as well.

Ajay- How and when did you first earn the nickname "Data Doc"? Who gave it to you, and would you rather be known by some other name?

Redman: One of my clients started calling me that about a dozen years ago.  But I felt uncomfortable and didn’t put it on my business card until about five years ago.  I’ve grown to really like it.

Ajay- The pioneering work at AT&T Bell Laboratories and the Palo Alto laboratories- who do you think are the 21st century successors of these laboratories? Do you think lab work has become too commercialized, even in respected laboratories like Microsoft Research and Google's research in mathematics?

Redman: I don't know.  It may be that the circumstances of the 20th century were conducive to such labs and they'll never happen again.  You have to remember two things about Bell Labs.  First was the cross-fertilization that stemmed from having leading-edge work in dozens of areas.  Second, the goal was not just invention, but innovation, the end-to-end process which starts with invention and ends with products in the market.  AT&T, Bell Labs' parent, was quite good at turning invention into product.  These points lead me to think that the commercial aspect of laboratory work is for the better.

Ajay- What does "the Data Doc" do to relax and maintain a work-life balance? How important do you think work-life balance is for creative people and researchers?

Redman: I think everyone needs a balance, not just creative people.  Two things have made this easier for me.  First, I like what I do.  A lot of days it is hard to distinguish “work” from “play.”  Second is my bride of thirty-three years, Nancy.  She doesn’t let me go overboard too often.

Biography-

Dr. Thomas C. Redman is President of Navesink Consulting Group, based in Little Silver, NJ.  Known by many as “the Data Doc” (though “Tom” works too), Dr. Redman was the first to extend quality principles to data and information.  By advancing the body of knowledge, his innovations have raised the standard of data quality in today’s information-based economy.

Dr. Redman conceived the Data Quality Lab at AT&T Bell Laboratories in 1987 and led it until 1995.  There he and his team developed the first methods for improving data quality and applied them to important business problems, saving AT&T tens of millions of dollars. He started Navesink Consulting Group in 1996 to help other organizations improve their data, while simultaneously lowering operating costs, increasing revenues, and improving customer satisfaction and business relationships.

Since then – armed with proven, repeatable tools, techniques and practical advice – Dr. Redman has helped clients in fields ranging from telecommunications, financial services, and dot-coms, to logistics, consumer goods, and government agencies. His work has helped organizations understand the importance of high-quality data, start their data quality programs, and save millions of dollars per year.

Dr. Redman holds a Ph.D. in statistics from Florida State University.  He is an internationally renowned lecturer and the author of numerous papers, including “Data Quality for Competitive Advantage” (Sloan Management Review, Winter 1995) and “Data as a Resource: Properties, Implications, and Prescriptions” (Sloan Management Review, Fall 1998). He has written four books: Data Driven (Harvard Business School Press, 2008), Data Quality: The Field Guide (Butterworth-Heinemann, 2001), Data Quality for the Information Age (Artech, 1996) and Data Quality: Management and Technology (Bantam, 1992). He was also invited to contribute two chapters to Juran’s Quality Handbook, Fifth Edition (McGraw Hill, 1999). Dr. Redman holds two patents.

About Navesink Consulting Group (http://www.dataqualitysolutions.com/)

Navesink Consulting Group was formed in 1996 and was the first company to focus on data quality.  Led by Dr. Thomas Redman, “the Data Doc” and former AT&T Bell Labs director, we have helped clients understand the importance of high-quality data, start their data quality programs, and save millions of dollars per year.

Our approach is not a cobbling together of ill-fitting ideas and assertions – it is based on rigorous scientific principles that have been field-tested in many industries, including financial services (see more under “Our clients”).  We offer no silver bullets; we don’t even offer shortcuts. Improving data quality is hard work.

But with a dedicated effort, you should expect order-of-magnitude improvements and, as a direct result, an enormous boost in your ability to manage risk, steer a course through the crisis, and get back on the growth curve.

Ultimately, Navesink Consulting brings tangible, sustainable improvement in your business performance as a result of superior quality data.

Interview Dylan Jones DataQualityPro.com

Here is an interview with Dylan Jones, the founder/editor of Dataqualitypro.com, the site to go to for anything related to data quality discussions. Dylan is a charming person and in this interview talks candidly about his views.

Ajay- Describe your career in science and in business intelligence. How would you convince young students to take more maths and science courses for scientific careers?

Dylan- My main education for the profession was a degree in Information Technology and Software Development. No surprises what my first job entailed – software development for an IT company!

That role took me straight into the trials and tribulations of business intelligence and data quality. After a couple of years I went freelance and have pretty much worked for myself ever since. There has been a constant thread of data quality, business intelligence and data migration throughout my career which culminated in me setting up the more recent social media initiatives to try and pull professionals together in this space.

In all honesty, I’m probably the worst person to give career advice Ajay as I’m a hopeless dreamer. I’ve never really structured my career. I fell into data quality early on and it has led me to work in some wonderful places and with some great people, largely by accident and fate.

I have a simple philosophy, do what you love doing. I’m incredibly lucky to wake up every day with an absolute passion for what I do. In the past, whenever I have found myself working in a situation that I find soul destroying (and in our profession that can happen regularly) I move on to something new.

So, my advice for people starting out would be to first question what makes them happy in life. Don’t simply follow the herd. The internet has totally transformed the rules of the game in terms of finding an outlet for your skills so follow your heart, not conventional wisdom.

That said, I think there are some core skills that will always provide a springboard. Maths is obviously one of those skills that can open many doors but I would also advise people to learn about marketing, sales and other business fundamentals. From a business intelligence perspective it really adds an attractive dimension to your skills if you can link technical ability with a deeper understanding of how businesses operate.

Ajay- You are a top expert and publisher on BI topics. Tell us something about:

a) http://www.datamigrationpro.com/

b) http://www.dataqualitypro.com/

c) Involvement with the DataFlux community of experts

d) Your latest venture http://www.dqvote.com

Dylan- Data Migration Pro was my first foray into the social media space. I realised that very few people were talking about the challenges and techniques of data migration. On average, large organisations implement around 4 migration projects a year and most end in failure. A lot of this is due to a lack of awareness. Having worked for so long in this space I felt it was time to create a social media site to bring the wider community together. So we now have forums, regular articles, tools and techniques on the site with about 1400 members worldwide plus lots of plans in the pipeline for 2010.

Data Quality Pro followed on from the success of Data Migration Pro and our speed of growth really demonstrates how important data quality is right now. Again, awareness of the basic techniques and best-practices is key. I think many organisations are really starting to recognise the importance of better data quality management practices so a lot of our focus is on giving people practical advice and tools to get started. We are a community publishing platform, I do write regularly but we’ve always had a significant community contribution from expert practitioners and authors.

I didn't just want to take a corporate viewpoint with these communities. As a result they are very much focused on the individual. That is why we post so many features on how to promote your skills, search for work, gain personal skills and generally get ahead in the profession. Data Quality Pro has just under 2,000 members and about 6,000 regular visitors a month, which demonstrates just how many people are committed to learning about this discipline, as it impacts practically every part of the business. I also think it is an excellent career choice: with so many projects dependent on good quality data, there will always be demand.

The DataFlux community of experts is a great resource that I’ve actually admired for some time. I am a big fan of Jill Dyche who used to write on the community and of course there is a great line-up on there now with experts like David Loshin, Joyce Norris-Montanari and Mike Ferguson so I was delighted to be invited to participate. DataFlux have sponsored our sites from the very beginning and without their support we wouldn’t have grown to our current size. So although I’m vendor independent, it’s great to be sharing my thoughts and ideas with people who visit their site.

DQVote.com is a relatively new initiative. I noticed that there was some great data quality content being linked through platforms like Twitter but it would essentially become hard to find after several days. Also, there was no way for the community to vote on what content they found especially useful. DQVote.com allows people to promote their own content but also to vote and share other useful data quality articles, blogs, presentations, videos, tutorials – anything that adds value to the data quality community. It is also a great springboard for emerging data quality bloggers and publishers of useful content.

Ajay- Do you think BI projects can be more successful if we reward data entry people, or at least pay more for better quality data rather than ask them to fill in database tables as fast as they can? Especially in offshore call centres.

Dylan- Data entry is a pet frustration of mine. I regularly visit companies who are investing hundreds of thousands of pounds in data quality technology and consultants but nothing in grass-roots education and cultural change. They would rather create cleansing factories than resolve the issues at source.

So, yes I completely agree, the reward system has to change. I personally suffer from this all the time – call centre staff record incorrect or incomplete information about my service or account and it leads to billing errors, service problems, annoyance and eventually lost business. Call centre staff are not to blame, they are simply rewarded on the volume of customer service calls they can make, they are not encouraged to enter good quality data. The fault ultimately lies with the corporations that use these services and I don’t think offshore or onshore makes a difference. I’ve witnessed terrible data quality in-house also. The key is to have service level agreements on what quality of data is acceptable. I also think a reward structure as opposed to a penalty structure can be a much more progressive way of improving the quality of call-centre data.

Ajay- What are the top 5 things that summarize your views on business intelligence? Assume you are speaking to a class of freshman statisticians.

Dylan- Business intelligence is wholly dependent on data quality. Accessibility, timeliness, accuracy, completeness, duplication – data quality dimensions like these can dramatically change the value of business intelligence to the organisation. Take nothing for granted with data, assume nothing. I have never, ever, assessed a dataset in a large business that did not have some serious data defects that were impacting decision making.

As statisticians, they therefore possess the tools to help organisations discover and measure these defects. They can find ways to continuously improve and ensure that future decisions are based on reliable data.

I would also add that business intelligence is not just about technology, it is about interpreting data to determine trends that will enable a company to improve their competitive advantage. Statistics are important but freshmen must also understand how organisations really create value for their customers.

My advice is to therefore step away from the tools and learn how the business operates on the ground. Really listen to workers and customers as they can bring the data to life. You will be able to create far more accurate dashboards and reports of where the issues and opportunities lie within a business if you immerse yourself with the people who create the data and the senior management who depend on the quality of your business intelligence platforms.

Ajay- Which software have you personally coded or implemented? Which one did you like the best and why?

Dylan- I’ve used most of the BI and DQ tools out there, all have strengths and weaknesses so it is very subjective. I have my favourites but I try to remain vendor neutral so I’ll have to gracefully decline on this one Ajay!

However, I did build a data profiling and data quality assessment tool several years ago. To be honest, that is the tool I like best because it had a range of features I still haven't seen implemented in any other tools. If I ever get the chance, and if no other vendor comes up with the same concept, I may yet take it to market. For now though, two young kids, two communities and a 12-hour day mean it is something of a pipe dream.

Ajay- What does Dylan Jones do when not helping the data quality of the world get better?

Dylan- I’ve recently had another baby boy so kids take up most of whatever free time I have left. When we do get a break though I like to head to my home town and just hang out on the beach or go up into the mountains. I love travelling and as I effectively work completely online now, we’re really trying to figure out a way of combining travel and work.

Biography-

Dylan Jones is the founder and editor of Data Quality Pro and Data Migration Pro, the leading online expert community resources. Since the early nineties he has been helping large organisations tackle major information management challenges. He now devotes his time to fostering greater awareness, community and education in the fields of data quality and data migration via the use of social media channels. Dylan can be contacted via his profile page at http://www.dataqualitypro.com/data-quality-dylan-jones/ or at http://www.twitter.com/dataqualitypro

Book Review (short): Data Driven- Profiting from Your Most Important Business Asset, by Tom Redman

Once in a while comes a book that squeezes a lot of common sense into easy-to-execute paradigms, adds some flavours of anecdotes, and adds penetrating insights as the topping. Data Driven by Tom Redman is such a book, and it may rightly be called the successor to Davenport's now-epic tome Competing on Analytics.

Data Driven is divided into 3 parts:

1) Data Quality - including the opportunity costs of bad data management

2) Putting Data and Information to Work

3) Creating a Management System for Data and Information

At 218 pages, not including the appendix, this is an easy read for someone who needs to refresh their mental batteries with data hygiene perspectives. With terrific wisdom and easy-to-communicate language and paradigms, it should mark another important chapter in bringing data quality to the forefront, rather than the back burner, of business intelligence and business analytics. All the trillion-dollar algorithms and software in the world are useless without data quality. Read this book and it will show you how to use the most important, valuable and underused asset- data.

Interview Steve Sarsfield Author The Data Governance Imperative

Here is an interview with Steve Sarsfield, data quality evangelist and author of The Data Governance Imperative.


Ajay- Describe your early career to the present point. At what point did you decide to specialize or focus on data quality and data governance? What were the causes for it?


Steve- When I was growing up, not many normal people had aspirations of becoming data management professionals. Back in those days, we had aspirations to be NFL wide receivers, writers, engineers, and lawyers.  Data management careers tend to find you.

My career path has wandered through technical support, technical writer and managing editor, consulting, and product management for Lotus Development. I've been working for the past nine years at a major data quality vendor - the longest job I've had to date. The good news is that this latest gig has given me a chance to meet with a LOT of people who have been implementing data quality and data governance projects.

When you get involved with these projects, you begin to realize the power data governance has. You begin to love it for the efficiencies it brings, and for the impact it has on your organization as it becomes more competitive.


Ajay- Some people think data quality is a boring job and data governance an abstract philosophy. How would you interest a young high school or college student, with the right aptitude, in taking up a business intelligence career and staying focused on it?


Steve- In my opinion, if you promote a geeky view of data governance the message will tend to fall flat. If there's one thing I have written most about, it is bridging the gap between technology and business. Those who succeed in this field now and in the future will be people who are a bit of a jack-of-all-trades.

You need to be a good technologist, critical thinker, marketer, and strategist, and you need to use those skills every day to succeed. Leadership skills are also important, especially if you are trying to bootstrap a data governance program at your corporation. Those job attributes are not boring, they are challenging and exciting.

In terms of being persuasive about getting involved in a data career, it’s clear that data is not likely to decrease in volume in the coming years, quite the contrary, so your job will have a reasonable amount of security.  Nor will there be less of a need in the future for developing accurate business metrics from the data.

In my book, I talk about the fact that the decision of a corporation to move toward data governance is really a choice between optimism and fear. Your company must decide either to be haunted by a never-ending vision that there will only be more data, more mergers and more complexity in the years to come, or to take charge for a more hopeful future that will bring more opportunity, more efficiency and a more agile working environment. When you choose data governance as a career, you choose to provide that optimism for your employer.


Ajay- What are the salient points in your book The Data Governance Imperative? Do you think data governance is an idea whose time has come?


Steve- The book is about the increasing importance of data to a business. As your company collects more and more data about customers, products, suppliers, transactions and billing, it becomes more difficult to accurately maintain that information without a centralized approach and a team devoted to the data management mission.

The book comes from discussions with folks in the business who are trying to get a data governance program started in their corporation.  They are the data champions who "get it", but have yet to convince their management that data is crucial to the success of the company.

The fact is, there are metrics you can follow, processes that you can put in place, conversations that you can have, and technology that you can implement in order to make your managers and co-workers see the importance of data governance.  We know this because it has worked for so many companies who are far more advanced in managing their data than most.

The most evolved companies will have support from executive management and the entire company to define reusable processes for data governance and a center of excellence is formed around it. Much of the book is about garnering support and setting up the processes to prove enterprise data’s importance.  Only when you do that will your company evolve its data governance strategy.


Ajay- Garbage data in, garbage data analysis out. What percentage of a BI installation budget goes to input data quality at the data entry center? What kind of budget would you like it to be?


Steve- I’m sure this varies depending upon many factors, including the number of sources, age and quality of the source data, etc. Anecdotally, the percentage of budget five years ago was near zero. You really only saw realization of the problem LATE in the project, after the first data warehouse loading occurred. What has happened over the years is that we’ve gotten a lot smarter about this, perhaps as a result of our past failures. In the past, if the data worked well in the source systems it was assumed that it would work in the target.

A lot of those projects failed because the team incorrectly scoped the project with regard to the data integration. Today we have the wisdom and experience to know that this is not true.  In order to really assess our needs for data quality, we know we need to profile the data as one of the first tasks in the process.  This will help us create a more accurate timeline and budget, and ensure management that we know what we're doing with regard to data integration and business intelligence.


Ajay- Do you think Federal Governments can focus stimulus spending smarter with better input data quality?


Steve- Believe it or not, I'm encouraged by the US government's plans on data quality. To varying degrees, Presidents Clinton, Bush and Obama have all supported plans for greater transparency and openness. To accomplish that, you have to govern data. In Washington, many government agencies now have a Chief Information Officer. The government is recruiting leading universities like MIT to work toward better data governance in government.  The sheer number of databases even within a single US government agency will be a huge challenge, but the direction is good.

This year's MIT Information Quality Symposium, for example, had a very solid government track, with speakers from the Army, Air Force, Department of Defense, EPA, HUD, and the National Institutes of Health, to name just a few.

Outside the US, it gets even cloudier.  There are governments ahead of the US, like the UK and Germany, and those who still need to catch up.


Ajay- Name some actual anecdotes in which 1) bad data quality led to disaster and 2) good data quality gave great insights.


Steve- There are certainly plenty of typical examples, but I always like the unusual ones, like these:

A major motorcycle manufacturer used data quality tools to pull out nicknames from their customer records. Many of the names they had acquired for their prospect list were from motorcycle events and contests where the entries were, shall we say, colorful. The name fields contained data like “John the Mad Dog Smith” or “Frank Motor-head Jones”. The client used the tool to separate the name from the nickname, making it a more valuable marketing list.

One major utility company used data quality tools to identify and record notations on meter-reader records that were important to keep for operational uses, but not in the customer billing record. Upon analysis of the data, the company noticed random text like "LDIY" and "MOR" along with the customer records. After some work with the business users, they figured out that LDIY meant "Large Dog in Yard", which was particularly important for meter readers. MOR meant "Meter on Right", which was also valuable. The readers were given their own notes field, so that they could maintain the integrity of the name and address while also keeping this valuable data. It probably saved a lot of meter readers from dog bite situations.

Financial organizations have used data quality tools to separate items like “John and Judy Smith/221453789 ITF George Smith”. The organization wanted to consider this type of record as three separate records “John Smith” and “Judy Smith” and “George Smith” with obvious linkage between the individuals. This type of data is actually quite common on mainframe migrations.

A food manufacturer standardizes and cleanses ingredient names to get better control of manufacturing costs. In data from their worldwide manufacturing plants, an ingredient might be "carrots", "chopped frozen carrots", "frozen carrots, chopped", "chopped carrots, frozen" and so on (not to mention all the possible abbreviations for the words carrots, chopped and frozen). Without standardization of these ingredients, there was really no way to tell how many carrots the company purchased worldwide, and there was no bargaining leverage with the carrot supplier, or any of the other ingredient suppliers, until the data was fixed.
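To illustrate the kind of standardization involved, here is a minimal R sketch; the raw strings and abbreviation rules are hypothetical, and sorting the tokens is just one simple way to make word order irrelevant:

# hypothetical raw ingredient strings from different plants
raw <- c("chopped frozen carrots", "frozen carrots, chopped",
         "chopped carrots, frozen", "Frozen Chopped Carrots",
         "CARROTS, CHPD FRZ")

standardize <- function(x) {
  x <- tolower(x)
  x <- gsub("chpd", "chopped", x)        # expand known abbreviations
  x <- gsub("frz", "frozen", x)
  x <- gsub("[[:punct:]]", " ", x)       # drop punctuation
  tokens <- strsplit(x, "[[:space:]]+")  # tokenize on whitespace
  # sort the tokens so word order no longer matters
  sapply(tokens, function(t) paste(sort(t[t != ""]), collapse = " "))
}

standardize(raw)
# all five variants collapse to "carrots chopped frozen",
# so worldwide carrot purchases can finally be totalled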

In terms of disasters, I'd recommend the IAIDQ's web site, IQ Trainwrecks (http://www.iqtrainwrecks.com/). The IAIDQ does a great job and I contribute when I can.


Ajay- What are the essential 5 things a CEO should ask his CTO to ensure good data quality in an enterprise?


Steve- What a great question. I can think of more than five, but let’s start with:


1) What is poor quality data costing us?
This should inspire your CTO to go out and seek problem areas in partnership with the business and ways to improve processes.

2) Do I have to make decisions on gut-feel, or should I trust the business intelligence you give our employees?  What confidence level do you have in our BI?

The CEO should be confident in the metrics delivered with BI and he should make sure the CTO has the same concerns.

3) Are we in compliance with all laws regarding our governance of data?

CEOs are often culpable for non-compliance, so he/she should be concerned about any laws that govern the company’s industry. Even in unregulated industries, organizations must comply with spam laws and “do not mail” laws for marketing.

4) Are you working across business units towards data governance, or is data quality done in silos?

When possible, data quality should be a reusable process that can be implemented in a similar manner across business units.

5) Do you have the access to data you need?

The CEO should understand if any office politics are getting in the way of ensuring data quality and this question opens the door to that discussion.

Ajay- What does Steve Sarsfield do when not writing blogs and books?


Steve- These days, when I'm not thinking about data or my blog, I'm thinking about my fantasy football team and the upcoming season. I've got a ticket to the New England Patriots opening game vs the Buffalo Bills and I'm looking forward to it. On the weekends, you may find me playing a game of Mafia Wars on Facebook or cooking up a big pot of chili for the family.


Biography-


Steve Sarsfield is a data governance expert, speaker, author of The Data Governance Imperative (http://www.itgovernance.co.uk/products/2446) and blogger at http://data-governance.blogspot.com/. He is a product marketing professional at a major data quality vendor. He was a guest speaker at the MIT Information Quality Symposium (July 2007 and July 2008), at the International Association for Information and Data Quality (IAIDQ) Symposium (December 2006) and at the SAP CRM 2006 summit.

Interview Jim Harris Data Quality Expert OCDQ Blog

Here is an interview with one of the chief evangelists for data quality in the field of business intelligence, Jim Harris, who has a renowned blog at http://www.ocdqblog.com/. I asked Jim about his experiences in the field with data quality messing up big-budget BI projects, and some tips and methodologies to avoid that.

"No one likes to feel blamed for causing or failing to fix the data quality problems" - Jim Harris, Data Quality Expert.


Ajay- Why the name OCDQ? What drives your passion for data quality? Name any anecdotes where bad data quality really messed up a big BI project.

Jim Harris – Ever since I was a child, I have had an obsessive-compulsive personality. If you asked my professional colleagues to describe my work ethic, many would immediately respond: “Jim is obsessive-compulsive about data quality…but in a good way!” Therefore, when evaluating the short list of what to name my blog, it was not surprising to anyone that Obsessive-Compulsive Data Quality (OCDQ) was what I chose.

On a project for a financial services company, a critical data source was applications received by mail or phone for a variety of insurance products. These applications were manually entered by data entry clerks. Social security number was a required field and the data entry application had been designed to only allow valid values. Therefore, no one was concerned about the data quality of this field – it had to be populated and only valid values were accepted.

When a report was generated to estimate how many customers were interested in multiple insurance products by looking at the count of applications per social security number, it appeared as if a small number of customers were interested in not only every insurance product the company offered, but also thousands of policies within the same product type. More confusion was introduced when the report added the customer name field, which showed that this small number of highly interested customers had hundreds of different names. The problem was finally traced back to data entry.

Many insurance applications were received without a social security number. The data entry clerks were compensated, in part, based on the number of applications they entered per hour. In order to process the incomplete applications, the data entry clerks entered their own social security number.
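The profiling check that surfaces this kind of problem is simple to express. Here is a minimal R sketch with fabricated records (the SSNs below are fake):

# fabricated application records for illustration
apps <- data.frame(
  ssn  = c("111-22-3333", "111-22-3333", "444-55-6666",
           "111-22-3333", "777-88-9999"),
  name = c("A Jones", "B Smith", "C Lee", "D Patel", "C Lee"),
  stringsAsFactors = FALSE
)

# count applications per SSN and distinct names per SSN; an SSN
# with many applications AND many distinct names is the signature
# of a default value keyed in by data entry staff
profile <- aggregate(name ~ ssn, data = apps,
                     FUN = function(x) length(unique(x)))
names(profile)[2] <- "distinct_names"
profile$n_apps <- as.vector(table(apps$ssn)[profile$ssn])

subset(profile, n_apps > 1 & distinct_names > 1)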

On a project for a telecommunications company, multiple data sources were being consolidated into a new billing system. Concerns about postal address quality required the use of validation software to cleanse the billing address. No one was concerned about the telephone number field – after all, how could a telecommunications company have a data quality problem with telephone number?

However, when reports were run against the new billing system, a high percentage of records had a missing telephone number. The problem was that many of the data sources originated from legacy systems that only recently added a telephone number field. Previously, the telephone number was entered into the last line of the billing address.

New records entered into these legacy systems did start using the telephone number field, but the older records already in the system were not updated. During the consolidation process, the telephone number field was mapped directly from source to target and the postal validation software deleted the telephone number from the cleansed billing address.
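A minimal R sketch of the kind of repair involved, again with hypothetical records: find anything that looks like a phone number in the last address line, move it to its own field, and blank it out of the address before the postal cleansing step:

# hypothetical last address lines; some hold a phone number
addr_last <- c("617-555-0142", "SPRINGFIELD MA 01101", "508 555 0199")

# locate US-style phone numbers in the last address line
m <- regexpr("\\d{3}[- ]\\d{3}[- ]\\d{4}", addr_last)

# move the matches into their own telephone field
telephone <- rep(NA_character_, length(addr_last))
telephone[m > 0] <- regmatches(addr_last, m)

# blank them out of the address so the postal validation
# software has nothing left to delete
addr_last[m > 0] <- ""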

Ajay- Data Quality – Garbage in, Garbage out for a project. What percentage of a BI project do you think gets allocated to input data quality? What percentage of final output is affected by the normalized errors?

Jim Harris- I know that Gartner has reported that 25% of critical data within large businesses is somehow inaccurate or incomplete and that 50% of implementations fail due to a lack of attention to data quality issues.

The most common reason is that people doubt that data quality problems could be prevalent in their systems. This “data denial” is not necessarily a matter of blissful ignorance, but is often a natural self-defense mechanism from the data owners on the business side and/or the application owners on the technical side.

No one likes to feel blamed for causing or failing to fix the data quality problems.

All projects should allocate time and resources for performing a data quality assessment, which provides a much needed reality check for the perceptions and assumptions about the quality of the data. A data quality assessment can help with many tasks including verifying metadata, preparing meaningful questions for subject matter experts, understanding how data is being used, and most importantly – evaluating the ROI of data quality improvements. Building data quality monitoring functionality into the applications that support business processes provides the ability to measure the effect that poor data quality can have on decision-critical information.

Ajay- Companies talk of paradigms like Kaizen, Six Sigma and LEAN for eliminating waste and defects. What technique would you recommend for a company just about to start a major BI project for a standard ETL and reporting project to keep data aligned and clean?

Jim Harris- I am a big advocate for methodology and best practices and the paradigms you mentioned do provide excellent frameworks that can be helpful. However, I freely admit that I have never been formally trained or certified in any of them. I have worked on projects where they have been attempted and have seen varying degrees of success in their implementation. Six Sigma is the one that I am most familiar with, especially the DMAIC framework.

However, a general problem that I have with most frameworks is their tendency to adopt a one-size-fits-all strategy, which I believe is an approach that is doomed to fail. Any implemented framework must be customized to adapt to an organization’s unique culture. In part, this is necessary because implementing changes of any kind will be met with initial resistance, but an attempt at forcing a one-size-fits-all approach almost sends a message to the organization that everything they are currently doing is wrong, which will of course only increase the resistance to change.

Starting with a framework as a reference provides best practices and recommended options of what has worked for other organizations. The framework should be reviewed to determine what can best be learned from it and to select what will work in the current environment and what simply won’t. This doesn’t mean that the selected components of the framework will be implemented simultaneously. All change comes gradually and the selected components will most likely be implemented in phases.

Fundamentally, all change starts with changing people’s minds. And to do that effectively, the starting point has to be improving communication and encouraging open dialogue. This means more of listening to what people throughout the organization have to say and less of just telling them what to do. Keeping data aligned and clean requires getting people aligned and communicating.

Ajay- What methods and habits would you recommend to young analysts starting in the BI field for a quality checklist?

Jim Harris- I always make two recommendations.

First, never make assumptions about the data. I don’t care how well the business requirements document is written or how pretty the data model looks or how narrowly your particular role on the project has been defined. There is simply no substitute for looking at the data.

Second, don't be afraid to ask questions or admit when you don't know the answers. The only difference between a young analyst just starting out and an expert is that the expert has already made and learned from all the mistakes caused by being afraid to ask questions or to admit to not knowing the answers.

Ajay- What does Jim Harris do to have quality time when not at work?

Jim- Since I enjoy what I do for a living so much, it sometimes seems impossible to disengage from work and make quality time for myself. I have also become hopelessly addicted to social media and spend far too much time on Twitter and Facebook. I have also always spent too much of my free time watching television and movies. I do try to read as much as I can, but I have so many stacks of unread books in my house that I could probably open my own book store. True quality time typically requires the elimination of all technology by going walking, hiking or mountain biking. I do bring my mobile phone in case of emergencies, but I turn it off before I leave.

Biography-

Jim Harris is the Blogger-in-Chief at Obsessive-Compulsive Data Quality (OCDQ), an independent blog offering a vendor-neutral perspective on data quality.

He is an independent consultant, speaker, writer and blogger with over 15 years of professional services and application development experience in data quality (DQ) and business intelligence (BI).

Jim has worked with Global 500 companies in finance, brokerage, banking, insurance, healthcare, pharmaceuticals, manufacturing, retail, telecommunications, and utilities. Jim also has a long history with the product that is now known as IBM InfoSphere QualityStage. Additionally, he has some experience with Informatica Data Quality and DataFlux dfPower Studio.

Jim can be followed at twitter.com/ocdqblog and contacted at http://www.ocdqblog.com/contact/