Data Munging using #rstats Part 1 -Understanding Data Quality

This is a series of posts on Data Munging using R.

we will examine the various ways to input data and examine errors in the data input stage. We will accordingly study ways to detect errors and rectify them using the R language. People estimate that almost 60-70% of a project’s time goes in the data input, data quality and data validation stage. By the principle of Garbage-In -Garbage -Out, we believe that an analysis is only as good as the input quality of data. Thus data quality is both an integral part as well as one of the first stages in a project before we move to comprehensive statistical analysis.

Data Quality is an important part of studying data manipulation. How do we define Data Quality?

In this chapter, Data quality is defined as manipulating data in the desired shape, size and format. We further elaborate that as follows-

Data that is useful for analysis without any errors is high quality data.

Data that is problematic for accurate analysis because of any errors is low quality data.

Data Quality errors are defined as deviations from actual data, due to systematic, computing or human mistakes.

Rectifying data quality errors involves the steps of error detection, missing value imputation. It also involves using the feedback from these steps to design better data input mechanisms.

The major types of Data Quality errors are-

Missing Data- This is defined as when data is simply missing. It may be represented by a “. “or a blank space or by special notation like NA (not available) . In R , missing data is represented by NA. Missing data is the easiest to detect but it is tough to rectify since most of the time we deal with data collected in real time in the past time and it is difficult and expensive to replace it with actual data. Some methods of replacing missing data is by imputing or inferring what the missing values could be , by looking at measures of central tendency like median , or mean, or by checking correlation with other variables or data points with better data population or by looking at historic data for a particular sub-set. Accordingly missing values for a particular data variable can be divided into sub sets for imputation by various means (like for different Geographic Values, or Different Time Values)

Invalid Data (too high or too low numeric (and date-time) data, character data in invalid format).

Incorrect Data (due to input errors including invalid or obsolete business rules, human input, low quality OCR scans)

The major causes of Data Quality Errors are-

Human Error (due to input, typing )

Machine Error ( due to invalid input readable eg. like by low resolution scanning device)

Syntax Error ( due to invalid logic or assumptions)

Data Format Error (due to a format that is not readable by software reading in data)

Steps for Diagnosis-

Missing Value Detection (using functions related to is.NA) and Missing Value Imputation

Distribution Analysis (using functions like summary,describe, and visualizations like boxplot, histogram)

Outliers (Bonferroni) Detection and Outlier Capping ( Minimum- Maximum)

Correlation with other variables ( using correlation statistics)

Diagnosis of Data Quality


The following functions in R will help us evaluate the quality of data in our data object.

str– gives structure of object for a data frame including class, dimensions, variable names, variable types, first few observations of each variable)

names– gives variable names.

dim– dimensions of object.

length– gives length of data object.

nrow– gives number of rows of data object.

ncol – gives number of columns of data object.

class– gives data class of object. This can be list, matrix or data.frame or other classes.

We use the famous iris dataset and attach it or load it in our R session using the command

data(iris). We then try out each of the functions given above.

> data(iris)

> str(iris)

data.frame': 150 obs. of 5 variables:

$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...

$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...

$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...

$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...

$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

> names(iris)

[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"

> dim(iris)

[1] 150 5

> length(iris)

[1] 5

> nrow(iris)

[1] 150

> ncol(iris)

[1] 5

> class(iris)

[1] "data.frame"

It is quite clear that the str function by itself is enough for the first step data quality as it contains all the other parameters.

We now and try and print out a part of the object to check what is stored there. By default we can print the entire object by just writing it’s name. However this may be inconvenient in some cases when there are a large number of rows.

Accordingly we use the head and tail functions to look at the beginning and last rows in a data object.

head – gives first few observations in a data object as specified by parameter in head (objectname, number of rows)

tail -gives last few observations in a data object as specified by parameter in tail (objectname, number of rows)

Here we take the first 7 rows and the last 3 rows of dataset iris. Note that the first column in the output below is the row.number.
> head(iris,7)

Sepal.Length Sepal.Width Petal.Length Petal.Width Species

1 5.1 3.5 1.4 0.2 setosa

2 4.9 3.0 1.4 0.2 setosa

3 4.7 3.2 1.3 0.2 setosa

4 4.6 3.1 1.5 0.2 setosa

5 5.0 3.6 1.4 0.2 setosa

6 5.4 3.9 1.7 0.4 setosa

7 4.6 3.4 1.4 0.3 setosa

> tail(iris,3)

Sepal.Length Sepal.Width Petal.Length Petal.Width Species

148 6.5 3.0 5.2 2.0 virginica

149 6.2 3.4 5.4 2.3 virginica

150 5.9 3.0 5.1 1.8 virginica

We can also pass negative numbers as parameters to head and tail. Here we are trying to take the first and last 7 rows ( or numbers of rows in object -143 rows). Since the object iris has 150 rows , -143 evaluates to 7 in head and tail functions.

> head(iris,-143)

Sepal.Length Sepal.Width Petal.Length Petal.Width Species

1 5.1 3.5 1.4 0.2 setosa

2 4.9 3.0 1.4 0.2 setosa

3 4.7 3.2 1.3 0.2 setosa

4 4.6 3.1 1.5 0.2 setosa

5 5.0 3.6 1.4 0.2 setosa

6 5.4 3.9 1.7 0.4 setosa

7 4.6 3.4 1.4 0.3 setosa

> tail(iris,-143)

Sepal.Length Sepal.Width Petal.Length Petal.Width Species

144 6.8 3.2 5.9 2.3 virginica

145 6.7 3.3 5.7 2.5 virginica

146 6.7 3.0 5.2 2.3 virginica

147 6.3 2.5 5.0 1.9 virginica

148 6.5 3.0 5.2 2.0 virginica

149 6.2 3.4 5.4 2.3 virginica

150 5.9 3.0 5.1 1.8 virginica

1.2 Strings

One of the most common errors in data analytics is mismatch in string variables . String variables also known as character variables are non-numeric text, and even a single misplacement in white space, or upper case, lower case can cause discrepancies in the data. One of the most common types of data for which this error attains criticality is address data and name data.

From the perspective of R, the data “virginica” is a different data (or factor-level) from “ virginica” and from “Virginica”.“1600 Penn Avenue” is a different address from “1600 Pennsylvania Avenue” and from “1600 PA”. This can lead to escalation of costs especially since users of business analytics try and create unique and accurate contact details ( names and addresses). This attains even more importance for running credit checks and financial data, since an inaccurate data mismatch can lead to a wrong credit score to a person, leading to liability of the credit provider.

For changing case we use the functions toupper and tolower

> a=c("ajay","vijay","ravi","rahul","bharat")

> toupper(a)


> b=c("Jane","JILL","AMY","NaNCY")

> tolower(b)

[1] "jane" "jill" "amy" "nancy"



grepl can be used to find a part of a string . For example, in cricket we denote a not out score of 250 runs by a star, .i.e. 250* but denote a score of 250 out as 250. This can create a problem if we are trying to read in data. It will either treat it as character level data, or if we coerce it to return numeric values, it will show the not out scores by missing values.

We want to find all instance of “*” in address field and see if they are not out. grepl returns a logical vector (match or not for each element of x). We will further expand on this example in our Case Study for Cricket Analytics


We use sub and gsub to substitute parts of string. While the sub function replaces the first occurrence, the gsub function replaces all occurrences of the matching pattern with the parameter supplied.

Here we are trying to replace white space in a sentence. Notice the sub function seems to work better than gsub in this case.

> newstring=" Hello World We are Experts in Learning R"

> sub(" ","",newstring)

[1] "Hello World We are Experts in Learning R"

> gsub(" ","",newstring)

[1] "HelloWorldWeareExpertsinLearningR"

Let us try to convert currency data into numeric data.For the sake of learning we are using a small data object , a list called “money” with three different inputs.

> money=c("$10,000","20000","32,000")

> money

[1] "$10,000" "20000" "32,000"

We replace a comma (used mainly for thousands in currency data) using gsub as shown before.

> money2=gsub(",","",money)

> money2

[1] "$10000" "20000" "32000"

$ indicates the end of a line in regular expressions. \$ is a dollar sign. So we have to use \\$ as an input in the gsub expression.

> money3=gsub("\\$","",money2)

> money3

[1] "10000" "20000" "32000"

At this point we may be satisfied that we have got the format we wanted. However this is an error, as these are still strings- as we find out by running the mean function

> mean(money3)

[1] NA

Warning message:

In mean.default(money3) : argument is not numeric or logical: returning NA

We then use the as operator to convert one data type (character) into another ( numeric).The as operator is generally used in syntax as.outputdataobject.class. Accordingly we will use as.numeric for the conversion.


> money4=as.numeric(money3)

> money4

[1] 10000 20000 32000

> mean(money4)

[1] 20666.67

Please note , we used many intermediate steps to do the multiple steps of data manipulation and used the = sign to assign this to new objects. We can combine two steps into one by putting them within successive brackets. This is illustrated below, when we are trying to convert character data containing (% Percentages) into Numeric data.

> mean(as.numeric(gsub("%","",percentages)))

[1] 35

> percentages

[1] "%20" "%30" "%40" "50"

Note we have found the mean but the original object is not changed.


Do gsub only one variable at a time

Slight problem is suppose there is data like 1,504 – it will be converted to NA instead of 1504.The way to solve this is use the nice gsub function ONLY on that variable. Since the comma is also the most commonly used delimiter , you dont want to replace all the commas, just only the one in that variable.




Additional– The function setAs creates methods for the as function to use. This is an advanced usage.





Data Quality in R #rstats

Many Data Quality Formats give problems when importing in your statistical software.A statistical software is quite unable to distingush between $1,000, 1000% and 1,000 and 1000 and will treat the former three as character variables while the third as a numeric variable by default. This issue is further compounded by the numerous ways we can represent date-time variables.

The good thing is for specific domains like finance and web analytics, even these weird data input formats are fixed, so we can fix up a list of handy data quality conversion functions in R for reference.


After much muddling about with coverting internet formats (or data used in web analytics) (mostly time formats without date like 00:35:23)  into data frame numeric formats, I found that the way to handle Date-Time conversions in R is

Dataset$Var2= strptime(as.character(Dataset$Var1),”%M:%S”)

The problem with this approach is you will get the value as a Date Time format (02/31/2012 04:00:45-  By default R will add today’s date to it.)  while you are interested in only Time Durations (4:00:45 or actually just the equivalent in seconds).

this can be handled using the as.difftime function


or to get purely numeric values so we can do numeric analysis (like summary)


(#Maybe there is  a more elegant way here- but I dont know)

The kind of data is usually one we get in web analytics for average time on site , etc.







for factor variables

Dataset$Var2= as.numeric(as.character(Dataset$Var1))



Dataset$Var2= as.numeric(paste(Dataset$Var1))


Slight problem is suppose there is data like 1,504 – it will be converted to NA instead of 1504

The way to solve this is use the nice gsub function ONLy on that variable. Since the comma is also the most commonly used delimiter , you dont want to replace all the commas, just only the one in that variable.




Now lets assume we have data in the form of % like 0.00% , 1.23%, 3.5%

again we use the gsub function to replace the % value in the string with  (nothing).





If you simply do the following for a factor variable, it will show you the level not the value. This can create an error when you are reading in CSV data which may be read as character or factor data type.

Dataset$Var2= as.numeric(Dataset$Var1)

An additional way is to use substr (using substr( and concatenate (using paste) for manipulating string /character variables.


iris$sp=substr(iris$Species,1,3) –will reduce the famous Iris species into three digits , without losing any analytical value.

The other issue is with missing values, and na.rm=T helps with getting summaries of numeric variables with missing values, we need to further investigate how suitable, na.omit functions are for domains which have large amounts of missing data and need to be treated.



Jill Dyche on 2012

In part 3 of the series for predictions for 2012, here is Jill Dyche, Baseline Consulting/DataFlux.

Part 2 was Timo Elliot, SAP at and Part 1 was Jim Kobielus, Forrester at

Ajay: What are the top trends you saw happening in 2011?


Well, I hate to say I saw them coming, but I did. A lot of managers committed some pretty predictable mistakes in 2011. Here are a few we witnessed in 2011 live and up close:


1.       In the spirit of “size matters,” data warehouse teams continued to trumpet the volumes of stored data on their enterprise data warehouses. But a peek under the covers of these warehouses reveals that the data isn’t integrated. Essentially this means a variety of heterogeneous virtual data marts co-located on a single server. Neat. Big. Maybe even worthy of a magazine article about how many petabytes you’ve got. But it’s not efficient, and hardly the example of data standardization and re-use that everyone expects from analytical platforms these days.


2.       Development teams still didn’t factor data integration and provisioning into their project plans in 2011. So we saw multiple projects spawn duplicate efforts around data profiling, cleansing, and standardization, not to mention conflicting policies and business rules for the same information. Bummer, since IT managers should know better by now. The problem is that no one owns the problem. Which brings me to the next mistake…


3.       No one’s accountable for data governance. Yeah, there’s a council. And they meet. And they talk. Sometimes there’s lunch. And then nothing happens because no one’s really rewarded—or penalized for that matter—on data quality improvements or new policies. And so the reports spewing from the data mart are still fraught and no one trusts the resulting decisions.


But all is not lost since we’re seeing some encouraging signs already in 2012. And yes, I’d classify some of them as bona-fide trends.


Ajay: What are some of those trends?


Job descriptions for data stewards, data architects, Chief Data Officers, and other information-enabling roles are becoming crisper, and the KPIs for these roles are becoming more specific. Data management organizations are being divorced from specific lines of business and from IT, becoming specialty organizations—okay, COEs if you must—in their own rights. The value proposition for master data management now includes not just the reconciliation of heterogeneous data elements but the support of key business strategies. And C-level executives are holding the data people accountable for improving speed to market and driving down costs—not just delivering cleaner data. In short, data is becoming a business enabler. Which, I have to just say editorially, is better late than never!


Ajay: Anything surprise you, Jill?


I have to say that Obama mentioning data management in his State of the Union speech was an unexpected but pretty powerful endorsement of the importance of information in both the private and public sector.


I’m also sort of surprised that data governance isn’t being driven more frequently by the need for internal and external privacy policies. Our clients are constantly asking us about how to tightly-couple privacy policies into their applications and data sources. The need to protect PCI data and other highly-sensitive data elements has made executives twitchy. But they’re still not linking that need to data governance.


I should also mention that I’ve been impressed with the people who call me who’ve had their “aha!” moment and realize that data transcends analytic systems. It’s operational, it’s pervasive, and it’s dynamic. I figured this epiphany would happen in a few years once data quality tools became a commodity (they’re far from it). But it’s happening now. And that’s good for all types of businesses.



Jill Dyché has written three books and numerous articles on the business value of information technology. She advises clients and executive teams on leveraging technology and information to enable strategic business initiatives. Last year her company Baseline Consulting was acquired by DataFlux Corporation, where she is currently Vice President of Thought Leadership. Find her blog posts on

How to make an analytics project?

Some of the process methodologies I have used and been exposed to while making analytics projects are-1) DMAIC/Six Sigma

While Six Sigma was initially a quality control system, it has also been very succesful in managing projects. The various stages of an analytical project can be divided using the DMAIC methodology.

DMAIC stands for

  • Define
  • Measure
  • Analyze
  • Improve
  • Control

Related to this is DMADV, ( “Design For Six Sigma”)

  • Define
  • Measure and identify CTQs
  • Analyze
  • Design
  • Verify

CRISP-DM stands for Cross Industry Standard Process for Data Mining

CRISP-DM breaks the process of data mining into six major phases- and these can be used for business analytics projects as well.

  • Business Understanding
  • Data Understanding
  • Data Preparation
  • Modeling
  • Evaluation
  • Deployment

SEMMA  stands for

  • sample
  • explore
  • modify
  • model
  • assess

4) ISO 9001

ISO 9001 is a certification as well as a philosophy for making a Quality Management System to measure , reduce and eliminate error and customer complaints. Any customer complaint or followup has to be treated as an error, logged, and investigated for control.

LEAN is a philosophy to eliminate Wastage in a process. Applying LEAN principles to analytics projects helps a lot in eliminating project bottlenecks, technology compatibility issues and data quality resolution. I think LEAN would be great in data quality issues, and IT infrastructure design because that is where the maximum waste is observed in analytics projects.

6) Demings Plan Do Check Act cycle.

Interview Scott Gidley CTO and Founder, DataFlux

Here is an interview with Scott Gidley, CTO and co-founder of leading data quality ccompany DataFlux . DataFlux is a part of SAS Institute and in 2011 acquired Baseline Consulting besides launching the latest version of their Master Data Management  product. Continue reading “Interview Scott Gidley CTO and Founder, DataFlux”

The Amazing Google MapMaker

Literally helping make the world a better place by making it- see real time edits of people

Map Maker’s review process allows users to easily edit content, while ensuring that data quality remains intact.

The first time a Map Maker user makes edits to a map, the edits may require review and approval before the edits will be published. Once a Map Maker user has made a few approved edits, most of the subsequent edits will go live automatically. However, some types of edits or edits in specific regions will always require review, regardless of how experienced the mapper is. In addition, some edits may require multiple reviews before the edits appear on Google Maps.


Why do edits need to be reviewed?


There are several reasons why edits are sent for review:

  • To provide feedback to new editors, and to help them understand the mapping process.
  • To provide feedback on more complex edits.
  • To ensure that sensitive edits meet the Reviewing Guidelines.


Learn how you can review edits in Map Maker.




Review some edits and get your edits reviewed faster!

When you review the edits made by your co-mappers, your edits are prioritized in the reviewing queue so that they appear above other edits. This enables other mappers to review them and your edits get published on Maps faster! from-


HIGHLIGHTS from REXER Survey :R gives best satisfaction

Simple graph showing hierarchical clustering. ...
Image via Wikipedia

A Summary report from Rexer Analytics Annual Survey


HIGHLIGHTS from the 4th Annual Data Miner Survey (2010):


•   FIELDS & GOALS: Data miners work in a diverse set of fields.  CRM / Marketing has been the #1 field in each of the past four years.  Fittingly, “improving the understanding of customers”, “retaining customers” and other CRM goals are also the goals identified by the most data miners surveyed.


•   ALGORITHMS: Decision trees, regression, and cluster analysis continue to form a triad of core algorithms for most data miners.  However, a wide variety of algorithms are being used.  This year, for the first time, the survey asked about Ensemble Models, and 22% of data miners report using them.
A third of data miners currently use text mining and another third plan to in the future.


•   MODELS: About one-third of data miners typically build final models with 10 or fewer variables, while about 28% generally construct models with more than 45 variables.


•   TOOLS: After a steady rise across the past few years, the open source data mining software R overtook other tools to become the tool used by more data miners (43%) than any other.  STATISTICA, which has also been climbing in the rankings, is selected as the primary data mining tool by the most data miners (18%).  Data miners report using an average of 4.6 software tools overall.  STATISTICA, IBM SPSS Modeler, and R received the strongest satisfaction ratings in both 2010 and 2009.


•   TECHNOLOGY: Data Mining most often occurs on a desktop or laptop computer, and frequently the data is stored locally.  Model scoring typically happens using the same software used to develop models.  STATISTICA users are more likely than other tool users to deploy models using PMML.


•   CHALLENGES: As in previous years, dirty data, explaining data mining to others, and difficult access to data are the top challenges data miners face.  This year data miners also shared best practices for overcoming these challenges.  The best practices are available online.


•   FUTURE: Data miners are optimistic about continued growth in the number of projects they will be conducting, and growth in data mining adoption is the number one “future trend” identified.  There is room to improve:  only 13% of data miners rate their company’s analytic capabilities as “excellent” and only 8% rate their data quality as “very strong”.


Please contact us if you have any questions about the attached report or this annual research program.  The 5th Annual Data Miner Survey will be launching next month.  We will email you an invitation to participate.


Information about Rexer Analytics is available at Rexer Analytics continues their impressive journey see

|My only thought- since most data miners are using multiple tools including free tools as well as paid software, Perhaps a pie chart of market share by revenue and volume would be handy.

Also some ideas on comparing diverse data mining projects by data size, or complexity.