Home » Posts tagged 'data'

Tag Archives: data

Top 7 Business Strategy Models

UPDATED POST- Some Models I use for Business Strategy- to analyze the huge reams of qualitative and uncertain data that business generates. I have added a bonus the Business canvas

  1. Porters 5 forces Model-To analyze industries
  2. Business Canvas
  3. BCG Matrix- To analyze Product Portfolios
  4. Porters Diamond Model- To analyze locations
  5. McKinsey 7 S Model-To analyze teams
  6. Gernier Theory- To analyze growth of organization
  7. Herzberg Hygiene Theory- To analyze soft aspects of individuals
  8. Marketing Mix Model- To analyze marketing mix.


Data Munging using #rstats Part 1 -Understanding Data Quality

This is a series of posts on Data Munging using R.

we will examine the various ways to input data and examine errors in the data input stage. We will accordingly study ways to detect errors and rectify them using the R language. People estimate that almost 60-70% of a project’s time goes in the data input, data quality and data validation stage. By the principle of Garbage-In -Garbage -Out, we believe that an analysis is only as good as the input quality of data. Thus data quality is both an integral part as well as one of the first stages in a project before we move to comprehensive statistical analysis.

Data Quality is an important part of studying data manipulation. How do we define Data Quality?

In this chapter, Data quality is defined as manipulating data in the desired shape, size and format. We further elaborate that as follows-

Data that is useful for analysis without any errors is high quality data.

Data that is problematic for accurate analysis because of any errors is low quality data.

Data Quality errors are defined as deviations from actual data, due to systematic, computing or human mistakes.

Rectifying data quality errors involves the steps of error detection, missing value imputation. It also involves using the feedback from these steps to design better data input mechanisms.

The major types of Data Quality errors are-

Missing Data- This is defined as when data is simply missing. It may be represented by a “. “or a blank space or by special notation like NA (not available) . In R , missing data is represented by NA. Missing data is the easiest to detect but it is tough to rectify since most of the time we deal with data collected in real time in the past time and it is difficult and expensive to replace it with actual data. Some methods of replacing missing data is by imputing or inferring what the missing values could be , by looking at measures of central tendency like median , or mean, or by checking correlation with other variables or data points with better data population or by looking at historic data for a particular sub-set. Accordingly missing values for a particular data variable can be divided into sub sets for imputation by various means (like for different Geographic Values, or Different Time Values)

Invalid Data (too high or too low numeric (and date-time) data, character data in invalid format).

Incorrect Data (due to input errors including invalid or obsolete business rules, human input, low quality OCR scans)

The major causes of Data Quality Errors are-

Human Error (due to input, typing )

Machine Error ( due to invalid input readable eg. like by low resolution scanning device)

Syntax Error ( due to invalid logic or assumptions)

Data Format Error (due to a format that is not readable by software reading in data)

Steps for Diagnosis-

Missing Value Detection (using functions related to is.NA) and Missing Value Imputation

Distribution Analysis (using functions like summary,describe, and visualizations like boxplot, histogram)

Outliers (Bonferroni) Detection and Outlier Capping ( Minimum- Maximum)

Correlation with other variables ( using correlation statistics)

Diagnosis of Data Quality


The following functions in R will help us evaluate the quality of data in our data object.

str- gives structure of object for a data frame including class, dimensions, variable names, variable types, first few observations of each variable)

names- gives variable names.

dim- dimensions of object.

length- gives length of data object.

nrow- gives number of rows of data object.

ncol – gives number of columns of data object.

class- gives data class of object. This can be list, matrix or data.frame or other classes.

We use the famous iris dataset and attach it or load it in our R session using the command

data(iris). We then try out each of the functions given above.

> data(iris)

> str(iris)

data.frame': 150 obs. of 5 variables:

$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...

$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...

$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...

$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...

$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

> names(iris)

[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"

> dim(iris)

[1] 150 5

> length(iris)

[1] 5

> nrow(iris)

[1] 150

> ncol(iris)

[1] 5

> class(iris)

[1] "data.frame"

It is quite clear that the str function by itself is enough for the first step data quality as it contains all the other parameters.

We now and try and print out a part of the object to check what is stored there. By default we can print the entire object by just writing it’s name. However this may be inconvenient in some cases when there are a large number of rows.

Accordingly we use the head and tail functions to look at the beginning and last rows in a data object.

head – gives first few observations in a data object as specified by parameter in head (objectname, number of rows)

tail -gives last few observations in a data object as specified by parameter in tail (objectname, number of rows)

Here we take the first 7 rows and the last 3 rows of dataset iris. Note that the first column in the output below is the row.number.
> head(iris,7)

Sepal.Length Sepal.Width Petal.Length Petal.Width Species

1 5.1 3.5 1.4 0.2 setosa

2 4.9 3.0 1.4 0.2 setosa

3 4.7 3.2 1.3 0.2 setosa

4 4.6 3.1 1.5 0.2 setosa

5 5.0 3.6 1.4 0.2 setosa

6 5.4 3.9 1.7 0.4 setosa

7 4.6 3.4 1.4 0.3 setosa

> tail(iris,3)

Sepal.Length Sepal.Width Petal.Length Petal.Width Species

148 6.5 3.0 5.2 2.0 virginica

149 6.2 3.4 5.4 2.3 virginica

150 5.9 3.0 5.1 1.8 virginica

We can also pass negative numbers as parameters to head and tail. Here we are trying to take the first and last 7 rows ( or numbers of rows in object -143 rows). Since the object iris has 150 rows , -143 evaluates to 7 in head and tail functions.

> head(iris,-143)

Sepal.Length Sepal.Width Petal.Length Petal.Width Species

1 5.1 3.5 1.4 0.2 setosa

2 4.9 3.0 1.4 0.2 setosa

3 4.7 3.2 1.3 0.2 setosa

4 4.6 3.1 1.5 0.2 setosa

5 5.0 3.6 1.4 0.2 setosa

6 5.4 3.9 1.7 0.4 setosa

7 4.6 3.4 1.4 0.3 setosa

> tail(iris,-143)

Sepal.Length Sepal.Width Petal.Length Petal.Width Species

144 6.8 3.2 5.9 2.3 virginica

145 6.7 3.3 5.7 2.5 virginica

146 6.7 3.0 5.2 2.3 virginica

147 6.3 2.5 5.0 1.9 virginica

148 6.5 3.0 5.2 2.0 virginica

149 6.2 3.4 5.4 2.3 virginica

150 5.9 3.0 5.1 1.8 virginica

1.2 Strings

One of the most common errors in data analytics is mismatch in string variables . String variables also known as character variables are non-numeric text, and even a single misplacement in white space, or upper case, lower case can cause discrepancies in the data. One of the most common types of data for which this error attains criticality is address data and name data.

From the perspective of R, the data “virginica” is a different data (or factor-level) from “ virginica” and from “Virginica”.“1600 Penn Avenue” is a different address from “1600 Pennsylvania Avenue” and from “1600 PA”. This can lead to escalation of costs especially since users of business analytics try and create unique and accurate contact details ( names and addresses). This attains even more importance for running credit checks and financial data, since an inaccurate data mismatch can lead to a wrong credit score to a person, leading to liability of the credit provider.

For changing case we use the functions toupper and tolower

> a=c("ajay","vijay","ravi","rahul","bharat")

> toupper(a)


> b=c("Jane","JILL","AMY","NaNCY")

> tolower(b)

[1] "jane" "jill" "amy" "nancy"



grepl can be used to find a part of a string . For example, in cricket we denote a not out score of 250 runs by a star, .i.e. 250* but denote a score of 250 out as 250. This can create a problem if we are trying to read in data. It will either treat it as character level data, or if we coerce it to return numeric values, it will show the not out scores by missing values.

We want to find all instance of “*” in address field and see if they are not out. grepl returns a logical vector (match or not for each element of x). We will further expand on this example in our Case Study for Cricket Analytics


We use sub and gsub to substitute parts of string. While the sub function replaces the first occurrence, the gsub function replaces all occurrences of the matching pattern with the parameter supplied.

Here we are trying to replace white space in a sentence. Notice the sub function seems to work better than gsub in this case.

> newstring=" Hello World We are Experts in Learning R"

> sub(" ","",newstring)

[1] "Hello World We are Experts in Learning R"

> gsub(" ","",newstring)

[1] "HelloWorldWeareExpertsinLearningR"

Let us try to convert currency data into numeric data.For the sake of learning we are using a small data object , a list called “money” with three different inputs.

> money=c("$10,000","20000","32,000")

> money

[1] "$10,000" "20000" "32,000"

We replace a comma (used mainly for thousands in currency data) using gsub as shown before.

> money2=gsub(",","",money)

> money2

[1] "$10000" "20000" "32000"

$ indicates the end of a line in regular expressions. \$ is a dollar sign. So we have to use \\$ as an input in the gsub expression.

> money3=gsub("\\$","",money2)

> money3

[1] "10000" "20000" "32000"

At this point we may be satisfied that we have got the format we wanted. However this is an error, as these are still strings- as we find out by running the mean function

> mean(money3)

[1] NA

Warning message:

In mean.default(money3) : argument is not numeric or logical: returning NA

We then use the as operator to convert one data type (character) into another ( numeric).The as operator is generally used in syntax as.outputdataobject.class. Accordingly we will use as.numeric for the conversion.


> money4=as.numeric(money3)

> money4

[1] 10000 20000 32000

> mean(money4)

[1] 20666.67

Please note , we used many intermediate steps to do the multiple steps of data manipulation and used the = sign to assign this to new objects. We can combine two steps into one by putting them within successive brackets. This is illustrated below, when we are trying to convert character data containing (% Percentages) into Numeric data.

> mean(as.numeric(gsub("%","",percentages)))

[1] 35

> percentages

[1] "%20" "%30" "%40" "50"

Note we have found the mean but the original object is not changed.


Do gsub only one variable at a time

Slight problem is suppose there is data like 1,504 – it will be converted to NA instead of 1504.The way to solve this is use the nice gsub function ONLY on that variable. Since the comma is also the most commonly used delimiter , you dont want to replace all the commas, just only the one in that variable.




Additional- The function setAs creates methods for the as function to use. This is an advanced usage.





Download all your tweets

Now that the Government of the United States of America has the legal power to request your information without a warrant  (The Chinese love this!)

Anyways- you can also download your own twitter data. Liberate your data.

Have you looked at your own data? Go there at https://twitter.com/settings/account and review the changes.

t  t2



WordPress.com Analytics

The Analytics (or stats) dashboard at WordPress.com continues to disappoint, and is a major reason for people to move out of WordPress.com hosting (since they need better analytics like that by Google Analytics which cant be enabled on the default mode)

Its not really beautiful unlike the rest of WordPress Universe!

It can be made better if people try harder! Analytics matters

Here are some points

1) Bar charts and Histograms are not really the best way to visualize trends across time

2) Location Analytics is limited to just country level analysis and the heatmap (?) is aweful in terms of distinguishing gradients 

3) Referrers Tab needs to do a better job on distinguishing between mobile and non mobile traffic, social and non social traffic (and there are better ways to visualize than just a simple list)!

4)  I cant even export my traffic stats (and forget an api !) so I am stuck with the bad data viz here

Data Frame in Python

Exploring some Python Packages and R packages to move /work with both Python and R without melting your brain or exceeding your project deadline


If you liked the data.frame structure in R, you have some way to work with them at a faster processing speed in Python.

Here are three packages that enable you to do so-

(1) pydataframe http://code.google.com/p/pydataframe/

An implemention of an almost R like DataFrame object. (install via Pypi/Pip: “pip install pydataframe”)


        u = DataFrame( { "Field1": [1, 2, 3],
                        "Field2": ['abc', 'def', 'hgi']},
                         ['Field1', 'Field2']
                         ["rowOne", "rowTwo", "thirdRow"])

A DataFrame is basically a table with rows and columns.

Columns are named, rows are numbered (but can be named) and can be easily selected and calculated upon. Internally, columns are stored as 1d numpy arrays. If you set row names, they’re converted into a dictionary for fast access. There is a rich subselection/slicing API, see help(DataFrame.get_item) (it also works for setting values). Please note that any slice get’s you another DataFrame, to access individual entries use get_row(), get_column(), get_value().

DataFrames also understand basic arithmetic and you can either add (multiply,…) a constant value, or another DataFrame of the same size / with the same column names, like this:

#multiply every value in ColumnA that is smaller than 5 by 6.
my_df[my_df[:,'ColumnA'] < 5, 'ColumnA'] *= 6

#you always need to specify both row and column selectors, use : to mean everything
my_df[:, 'ColumnB'] = my_df[:,'ColumnA'] + my_df[:, 'ColumnC']

#let's take every row that starts with Shu in ColumnA and replace it with a new list (comprehension)
select = my_df.where(lambda row: row['ColumnA'].startswith('Shu'))
my_df[select, 'ColumnA'] = [row['ColumnA'].replace('Shu', 'Sha') for row in my_df[select,:].iter_rows()]

Dataframes talk directly to R via rpy2 (rpy2 is not a prerequiste for the library!)


(2) pandas http://pandas.pydata.org/

Library Highlights

  • A fast and efficient DataFrame object for data manipulation with integrated indexing;
  • Tools for reading and writing data between in-memory data structures and different formats: CSV and text files, Microsoft Excel, SQL databases, and the fast HDF5 format;
  • Intelligent data alignment and integrated handling of missing data: gain automatic label-based alignment in computations and easily manipulate messy data into an orderly form;
  • Flexible reshaping and pivoting of data sets;
  • Intelligent label-based slicing, fancy indexing, and subsetting of large data sets;
  • Columns can be inserted and deleted from data structures for size mutability;
  • Aggregating or transforming data with a powerful group by engine allowing split-apply-combine operations on data sets;
  • High performance merging and joining of data sets;
  • Hierarchical axis indexing provides an intuitive way of working with high-dimensional data in a lower-dimensional data structure;
  • Time series-functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging. Even create domain-specific time offsets and join time series without losing data;
  • The library has been ruthlessly optimized for performance, with critical code paths compiled to C;
  • Python with pandas is in use in a wide variety of academic and commercial domains, including Finance, Neuroscience, Economics, Statistics, Advertising, Web Analytics, and more.

Why not R?

First of all, we love open source R! It is the most widely-used open source environment for statistical modeling and graphics, and it provided some early inspiration for pandas features. R users will be pleased to find this library adopts some of the best concepts of R, like the foundational DataFrame (one user familiar with R has described pandas as “R data.frame on steroids”). But pandas also seeks to solve some frustrations common to R users:

  • R has barebones data alignment and indexing functionality, leaving much work to the user. pandas makes it easy and intuitive to work with messy, irregularly indexed data, like time series data. pandas also provides rich tools, like hierarchical indexing, not found in R;
  • R is not well-suited to general purpose programming and system development. pandas enables you to do large-scale data processing seamlessly when developing your production applications;
  • Hybrid systems connecting R to a low-productivity systems language like Java, C++, or C# suffer from significantly reduced agility and maintainability, and you’re still stuck developing the system components in a low-productivity language;
  • The “copyleft” GPL license of R can create concerns for commercial software vendors who want to distribute R with their software under another license. Python and pandas use more permissive licenses.

(3) datamatrix http://pypi.python.org/pypi/datamatrix/0.8

datamatrix 0.8

A Pythonic implementation of R’s data.frame structure.

Latest Version: 0.9

This module allows access to comma- or other delimiter separated files as if they were tables, using a dictionary-like syntax. DataMatrix objects can be manipulated, rows and columns added and removed, or even transposed


Modeling in Python


R and Hadoop #rstats

Lovely ppt from the formidable Jeffrey Bean, whose lucid style in explaining R has made me a big fan of his awesome work!

Take at look at his extensive collection of Big Data with R slides  at http://jeffreybreen.wordpress.com/2012/03/10/big-data-step-by-step-slides/ – they are both very comprehensive and a delightful addition to anyone wishing to go the cloud, hadoop, R  route
His blog at http://jeffreybreen.wordpress.com/ talks of lots of very relevant topics.

Analytics 2012 Conference

from http://www.sas.com/events/analytics/us/index.html

Analytics 2012 Conference

SAS and more than 1,000 analytics experts gather at

Caesars Palace
Caesars Palace

Analytics 2012 Conference Details

Pre-Conference Workshops – Oct 7
Conference – Oct 8-9
Post-Conference Training – Oct 10-12
Caesars Palace, Las Vegas

Keynote Speakers

The following are confirmed keynote speakers for Analytics 2012. Jim Goodnight Since he co-founded SAS in 1976, Jim Goodnight has served as the company’s Chief Executive Officer.

William Hakes Dr. William Hakes is the CEO and co-founder of Link Analytics, an analytical technology company focused on mobile, energy and government verticals.

Tim Rey Tim Rey  has written over 100 internal papers, published 21 external papers, and delivered numerous keynote presentations and technical talks at various quantitative methods forums. Recently he has co-chaired both forecasting and data mining conferences. He is currently in the process of co-writing a book, Applied Data Mining for Forecasting.



Plan to come to Analytics 2012 a day early and participate in one of the pre-conference workshops or take a SAS Certification exam. Prices for all of the preconference workshops, except for SAS Sentiment Analysis Studio: Introduction to Building Models and the Business Analytics Consulting Workshops, are included in the conference package pricing. You will be prompted to select your pre-conference training options when you register.

Sunday Morning Workshop

SAS Sentiment Analysis Studio: Introduction to Building Models

This course provides an introduction to SAS Sentiment Analysis Studio. It is designed for system designers, developers, analytical consultants and managers who want to understand techniques and approaches for identifying sentiment in textual documents.
View outline
Sunday, Oct. 7, 8:30a.m.-12p.m. – $250

Sunday Afternoon Workshops

Business Analytics Consulting Workshops

This workshop is designed for the analyst, statistician, or executive who wants to discuss best-practice approaches to solving specific business problems, in the context of analytics. The two-hour workshop will be customized to discuss your specific analytical needs and will be designed as a one-on-one session for you, including up to five individuals within your company sharing your analytical goal. This workshop is specifically geared for an expert tasked with solving a critical business problem who needs consultation for developing the analytical approach required. The workshop can be customized to meet your needs, from a deep-dive into modeling methods to a strategic plan for analytic initiatives. In addition to the two hours at the conference location, this workshop includes some advanced consulting time over the phone, making it a valuable investment at a bargain price.
View outline
Sunday, Oct. 7; 1-3 p.m. or 3:30-5:30 p.m. – $200

Demand-Driven Forecasting: Sensing Demand Signals, Shaping and Predicting Demand

This half-day lecture teaches students how to integrate demand-driven forecasting into the consensus forecasting process and how to make the current demand forecasting process more demand-driven.
View outline
Sunday, Oct. 7; 1-5 p.m.

Forecast Value Added Analysis

Forecast Value Added (FVA) is the change in a forecasting performance metric (such as MAPE or bias) that can be attributed to a particular step or participant in the forecasting process. FVA analysis is used to identify those process activities that are failing to make the forecast any better (or might even be making it worse). This course provides step-by-step guidelines for conducting FVA analysis – to identify and eliminate the waste, inefficiency, and worst practices from your forecasting process. The result can be better forecasts, with fewer resources and less management time spent on forecasting.
View outline
Sunday, Oct. 7; 1-5 p.m.

SAS Enterprise Content Categorization: An Introduction

This course gives an introduction to methods of unstructured data analysis, document classification and document content identification. The course also uses examples as the basis for constructing parse expressions and resulting entities.
View outline
Sunday, Oct. 7; 1-5 p.m.

Introduction to Data Mining and SAS Enterprise Miner

This course serves as an introduction to data mining and SAS Enterprise Miner for Desktop software. It is designed for data analysts and qualitative experts as well as those with less of a technical background who want a general understanding of data mining.
View outline
Sunday, Oct. 7, 1-5 p.m.

Modeling Trend, Cycles, and Seasonality in Time Series Data Using PROC UCM

This half-day lecture teaches students how to model, interpret, and predict time series data using UCMs. The UCM procedure analyzes and forecasts equally spaced univariate time series data using the unobserved components models (UCM). This course is designed for business analysts who want to analyze time series data to uncover patterns such as trend, seasonal effects, and cycles using the latest techniques.
View outline
Sunday, Oct. 7, 1-5 p.m.

SAS Rapid Predictive Modeler

This seminar will provide a brief introduction to the use of SAS Enterprise Guide for graphical and data analysis. However, the focus will be on using SAS Enterprise Guide and SAS Enterprise Miner along with the Rapid Predictive Modeling component to build predictive models. Predictive modeling will be introduced using the SEMMA process developed with the introduction of SAS Enterprise Miner. Several examples will be used to illustrate the use of the Rapid Predictive Modeling component, and interpretations of the model results will be provided.
View outline
Sunday, Oct. 7, 1-5 p.m


Get every new post delivered to your Inbox.

Join 735 other followers