Data Quality in R #rstats

Many Data Quality Formats give problems when importing in your statistical software.A statistical software is quite unable to distingush between $1,000, 1000% and 1,000 and 1000 and will treat the former three as character variables while the third as a numeric variable by default. This issue is further compounded by the numerous ways we can represent date-time variables.

The good thing is for specific domains like finance and web analytics, even these weird data input formats are fixed, so we can fix up a list of handy data quality conversion functions in R for reference.


After much muddling about with coverting internet formats (or data used in web analytics) (mostly time formats without date like 00:35:23)  into data frame numeric formats, I found that the way to handle Date-Time conversions in R is

Dataset$Var2= strptime(as.character(Dataset$Var1),”%M:%S”)

The problem with this approach is you will get the value as a Date Time format (02/31/2012 04:00:45-  By default R will add today’s date to it.)  while you are interested in only Time Durations (4:00:45 or actually just the equivalent in seconds).

this can be handled using the as.difftime function


or to get purely numeric values so we can do numeric analysis (like summary)


(#Maybe there is  a more elegant way here- but I dont know)

The kind of data is usually one we get in web analytics for average time on site , etc.







for factor variables

Dataset$Var2= as.numeric(as.character(Dataset$Var1))



Dataset$Var2= as.numeric(paste(Dataset$Var1))


Slight problem is suppose there is data like 1,504 – it will be converted to NA instead of 1504

The way to solve this is use the nice gsub function ONLy on that variable. Since the comma is also the most commonly used delimiter , you dont want to replace all the commas, just only the one in that variable.




Now lets assume we have data in the form of % like 0.00% , 1.23%, 3.5%

again we use the gsub function to replace the % value in the string with  (nothing).





If you simply do the following for a factor variable, it will show you the level not the value. This can create an error when you are reading in CSV data which may be read as character or factor data type.

Dataset$Var2= as.numeric(Dataset$Var1)

An additional way is to use substr (using substr( and concatenate (using paste) for manipulating string /character variables.


iris$sp=substr(iris$Species,1,3) –will reduce the famous Iris species into three digits , without losing any analytical value.

The other issue is with missing values, and na.rm=T helps with getting summaries of numeric variables with missing values, we need to further investigate how suitable, na.omit functions are for domains which have large amounts of missing data and need to be treated.



Analytics 2012 Conference

A nice conference from the grand old institution of Analytics,  SAS  Institute’s annual analytic pow-wow.

I especially like some of the trainings- and wonder if they could be stored as e-learning modules for students/academics to review

in SAS’s extensive and generous Online Education Program.

Sunday Morning Workshop

SAS Sentiment Analysis Studio: Introduction to Building Models

This course provides an introduction to SAS Sentiment Analysis Studio. It is designed for system designers, developers, analytical consultants and managers who want to understand techniques and approaches for identifying sentiment in textual documents.
View outline
Sunday, Oct. 7, 8:30a.m.-12p.m. – $250

Sunday Afternoon Workshops

Business Analytics Consulting Workshops

This workshop is designed for the analyst, statistician, or executive who wants to discuss best-practice approaches to solving specific business problems, in the context of analytics. The two-hour workshop will be customized to discuss your specific analytical needs and will be designed as a one-on-one session for you, including up to five individuals within your company sharing your analytical goal. This workshop is specifically geared for an expert tasked with solving a critical business problem who needs consultation for developing the analytical approach required. The workshop can be customized to meet your needs, from a deep-dive into modeling methods to a strategic plan for analytic initiatives. In addition to the two hours at the conference location, this workshop includes some advanced consulting time over the phone, making it a valuable investment at a bargain price.
View outline
Sunday, Oct. 7; 1-3 p.m. or 3:30-5:30 p.m. – $200

Demand-Driven Forecasting: Sensing Demand Signals, Shaping and Predicting Demand

This half-day lecture teaches students how to integrate demand-driven forecasting into the consensus forecasting process and how to make the current demand forecasting process more demand-driven.
View outline
Sunday, Oct. 7; 1-5 p.m.

Forecast Value Added Analysis

Forecast Value Added (FVA) is the change in a forecasting performance metric (such as MAPE or bias) that can be attributed to a particular step or participant in the forecasting process. FVA analysis is used to identify those process activities that are failing to make the forecast any better (or might even be making it worse). This course provides step-by-step guidelines for conducting FVA analysis – to identify and eliminate the waste, inefficiency, and worst practices from your forecasting process. The result can be better forecasts, with fewer resources and less management time spent on forecasting.
View outline
Sunday, Oct. 7; 1-5 p.m.

SAS Enterprise Content Categorization: An Introduction

This course gives an introduction to methods of unstructured data analysis, document classification and document content identification. The course also uses examples as the basis for constructing parse expressions and resulting entities.
View outline
Sunday, Oct. 7; 1-5 p.m.


You can see more on this yourself at –












Webscraping using iMacros

The noted Diamonds dataset in the ggplot2 package of R is actually culled from the website

However it has ~55000 diamonds, while the whole Diamonds search engine has almost ten times that number. Using iMacros – a Google Chrome Plugin, we can scrape that data (or almost any data). The iMacros chrome plugin is available at while notes on coding are at

Imacros makes coding as easy as recording macro and the code is automatcially generated for whatever actions you do. You can set parameters to extract only specific parts of the website, and code can be run into a loop (of 9999 times!)

Here is the iMacros code-Note you need to navigate to the web site before running it

TAG POS=1 TYPE=DIV ATTR=CLASS:paginate_enabled_next










and voila- all the diamonds you need to analyze!

The returning data can be read using the standard delimiter data munging in the language of SAS or R.

More on IMacros from


Automate your web browser. Record and replay repetitious work

If you encounter any problems with iMacros for Chrome, please let us know in our Chrome user forum at

Our forum is also the best place for new feature suggestions :-)

iMacros was designed to automate the most repetitious tasks on the web. If there’s an activity you have to do repeatedly, just record it in iMacros. The next time you need to do it, the entire macro will run at the click of a button! With iMacros, you can quickly and easily fill out web forms, remember passwords, create a webmail notifier, and more. You can keep the macros on your computer for your own use, use them within bookmark sync / Xmarks or share them with others by embedding them on your homepage, blog, company Intranet or any social bookmarking service as bookmarklet. The uses are limited only by your imagination!

Popular uses are as web macro recorder, form filler on steroids and highly-secure password manager (256-bit AES encryption).

FaceBook IPO- Who hacked whom?

Some thoughts on the FB IPO-

1) Is Zuck reading emails on his honeymoon? Where is he?

2) In 3 days FB lost 34 billion USD in market valuation. Thats enough to buy AOL,Yahoo, LinkedIn and Twitter (combined)

3) People are now shorting FB based on 3-4 days of trading performance. Maybe they know more ARIMA !

4) Who made money on the over-pricing in terms on employees who sold on 1 st day, financial bankers who did the same?

5) Who lost money on the first three days due to Nasdaq’s problems?

6) What is the exact technical problem that Nasdaq had?

7) The much deplored FaceBook Price/Earnings ratio (99) is still comparable to AOL’s (85) and much less than LI (620!). see

8) Maybe FB can stop copying Google’s ad model (which Google invented) and go back to the drawing table. Like a FB kind of Paypal

9) There are more experts on the blogosphere than experts in Wall Street.

10) No blogger is willing to admit that they erred in the optimism on the great white IPO hope.

I did. Mea culpa. I thought FB is a good stock. I would buy it still- but the rupee tanked by 10% since past 1 week against the dollar.


I am now waiting for Chinese social network market to open with IPO’s. Thats walled gardens within walled gardens of Jade and Bamboo.

Related- Art Work of Another 100 billion dollar company (2006)

Visualizing Bigger Data in R using Tabplot

The amazing tabplot package creates the tableplot feature for visualizing huge chunks of data. This is a great example of creative data visualization that is resource lite and extremely fast in a first look at the data. (note- The tabplot package is being used and table plot function is being used . The TABLEPLOT package is different and is NOT being used here).


visualizing a 50000 row by 10 variable dataset in 0.7 s is fast !!

click on screenshot to see it

and some say R is slow 😉


Note I used a free Windows Amazon EC2 Instance for it-

See screenshot for hardware configuration


the best thing is there is a handy GTK GUI for this package. You can check it out at



New RCommander with ggplot #rstats


My favorite GUI (or one of them) R Commander has a relatively new plugin called KMGGplot2. Until now Deducer was the only GUI with ggplot features , but the much lighter and more popular R Commander has been a long champion in people wanting to pick up R quickly.

RcmdrPlugin.KMggplot2: Rcmdr Plug-In for Kaplan-Meier Plot and Other Plots by Using the ggplot2 Package


As you can see by the screenshot- it makes ggplot even easier for people (like R  newbies and experienced folks alike)


This package is an R Commander plug-in for Kaplan-Meier plot and other plots by using the ggplot2 package.

Version: 0.1-0
Depends: R (≥ 2.15.0), stats, methods, grid, Rcmdr (≥ 1.8-4), ggplot2 (≥ 0.9.1)
Imports: tcltk2 (≥ 1.2-3), RColorBrewer (≥ 1.0-5), scales (≥ 0.2.1), survival (≥ 2.36-14)
Published: 2012-05-18
Author: Triad sou. and Kengo NAGASHIMA
Maintainer: Triad sou. <triadsou at>
License: GPL-2
CRAN checks: RcmdrPlugin.KMggplot2 results


NEWS file for the RcmdrPlugin.KMggplot2 package


Changes in version 0.1-0 (2012-05-18)

 o Restructuring implementation approach for efficient
 o Added options() for storing package specific options (e.g.,
   font size, font family, ...).
 o Added a theme: theme_simple().
 o Added a theme element: theme_rect2().
 o Added a list box for facet_xx() functions in some menus
   (Thanks to Professor Murtaza Haider).
 o Kaplan-Meier plot: added confidence intervals.
 o Box plot: added violin plots.
 o Bar chart for discrete variables: deleted dynamite plots.
 o Bar chart for discrete variables: added stacked bar charts.
 o Scatter plot matrix: added univariate plots at diagonal
   positions (ggplot2::plotmatrix).
 o Deleted the dummy data for histograms, which is large in


Changes in version 0.0-4 (2011-07-28)

 o Fixed "scale_y_continuous(formatter = "percent")" to
   "scale_y_continuous(labels = percent)" for ggplot2
   (>= 0.9.0).
 o Fixed "legend = FALSE" to "show_guide = FALSE" for
   ggplot2 (>= 0.9.0).
 o Fixed the DESCRIPTION file for ggplot2 (>= 0.9.0) dependency.


Changes in version 0.0-3 (2011-07-28; FIRST RELEASE VERSION)

 o Kaplan-Meier plot: Show no. at risk table on outside.
 o Histogram: Color coding.
 o Histogram: Density estimation.
 o Q-Q plot: Create plots based on a maximum likelihood estimate
   for the parameters of the selected theoretical distribution.
 o Q-Q plot: Create plots based on a user-specified theoretical
 o Box plot / Errorbar plot: Box plot.
 o Box plot / Errorbar plot: Mean plus/minus S.D.
 o Box plot / Errorbar plot: Mean plus/minus S.D. (Bar plot).
 o Box plot / Errorbar plot: 95 percent Confidence interval
   (t distribution).
 o Box plot / Errorbar plot: 95 percent Confidence interval
 o Scatter plot: Fitting a linear regression.
 o Scatter plot: Smoothing with LOESS for small datasets or GAM
   with a cubic regression basis for large data.
 o Scatter plot matrix: Fitting a linear regression.
 o Scatter plot matrix: Smoothing with LOESS for small datasets
   or GAM with a cubic regression basis for large data.
 o Line chart: Normal line chart.
 o Line chart: Line char with a step function.
 o Line chart: Area plot.
 o Pie chart: Pie chart.
 o Bar chart for discrete variables: Bar chart for discrete
 o Contour plot: Color coding.
 o Contour plot: Heat map.
 o Distribution plot: Normal distribution.
 o Distribution plot: t distribution.
 o Distribution plot: Chi-square distribution.
 o Distribution plot: F distribution.
 o Distribution plot: Exponential distribution.
 o Distribution plot: Uniform distribution.
 o Distribution plot: Beta distribution.
 o Distribution plot: Cauchy distribution.
 o Distribution plot: Logistic distribution.
 o Distribution plot: Log-normal distribution.
 o Distribution plot: Gamma distribution.
 o Distribution plot: Weibull distribution.
 o Distribution plot: Binomial distribution.
 o Distribution plot: Poisson distribution.
 o Distribution plot: Geometric distribution.
 o Distribution plot: Hypergeometric distribution.
 o Distribution plot: Negative binomial distribution.