Data Quality in R #rstats

Many data formats cause problems when you import them into your statistical software. By default, a statistical package cannot distinguish between $1,000, 1000%, 1,000 and 1000: it will treat the first three as character variables and only the last as a numeric variable. This issue is further compounded by the numerous ways we can represent date-time variables.

The good thing is that for specific domains like finance and web analytics, even these weird input formats are fixed, so we can put together a list of handy data quality conversion functions in R for reference.

 

After much muddling about with converting internet formats (or data used in web analytics), mostly time formats without a date like 00:35:23, into numeric data frame columns, I found that the way to handle date-time conversions in R is

Dataset$Var2= strptime(as.character(Dataset$Var1),"%H:%M:%S")

The problem with this approach is that you get the value in a date-time format (a full date followed by 04:00:45, because by default R adds today's date to it), while you are interested only in time durations (4:00:45, or actually just the equivalent in seconds).

This can be handled using the as.difftime function

dataset$Var2=as.difftime(paste(dataset$Var1))

or, to get purely numeric values in seconds so we can do numeric analysis (like summary)

dataset$Var2=as.numeric(as.difftime(paste(dataset$Var1),units="secs"))

(#Maybe there is a more elegant way here, but I don't know it)

This is the kind of data we usually get in web analytics, for example average time on site.
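As a small illustration of the duration conversion described above, here is a sketch that assumes the times arrive as HH:MM:SS strings. The vector time_on_site is made-up sample data, not taken from any real export.

```r
# Hypothetical "time on site" values in HH:MM:SS form
time_on_site <- c("00:35:23", "00:00:45", "01:02:03")

# Parse the strings as durations and express them in seconds
secs <- as.numeric(
  as.difftime(time_on_site, format = "%H:%M:%S", units = "secs")
)

secs           # 2123 45 3723
summary(secs)  # ordinary numeric summary now works
```

Passing units = "secs" explicitly avoids the default "auto" unit, which may otherwise pick minutes or hours depending on the data.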

 

 

 

 

 

For factor variables, convert to character first and then to numeric:

Dataset$Var2= as.numeric(as.character(Dataset$Var1))

 

or

Dataset$Var2= as.numeric(paste(Dataset$Var1))

 

A slight problem: if the data contains a value like 1,504, it will be converted to NA instead of 1504.

The way to solve this is to use the nice gsub function ONLY on that variable. Since the comma is also the most commonly used delimiter, you don't want to replace all the commas in the file, only the ones in that variable.

 

dataset$Variable2=as.numeric(paste(gsub(",","",dataset$Variable)))

 

Now let's assume we have data in percentage form, like 0.00%, 1.23%, 3.5%.

Again we use the gsub function to replace the % sign in the string with "" (nothing).

 

dataset$Variable2=as.numeric(paste(gsub("%","",dataset$Variable)))
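The comma and percent clean-ups above can be folded into one helper. This is only a sketch: clean_numeric is a hypothetical name, and the choice to strip the dollar sign as well is my assumption based on the $1,000 example at the top of the post.

```r
# Hypothetical helper: strip "$", "," and "%" from ONE column, then convert
clean_numeric <- function(x) {
  x <- as.character(x)              # also safe when x is a factor
  as.numeric(gsub("[$,%]", "", x))  # inside [] the $ is a literal character
}

raw <- c("$1,000", "1,504", "3.5%", "1000")
cleaned <- clean_numeric(raw)
cleaned  # 1000.0 1504.0 3.5 1000.0
```

Note that a percentage such as 3.5% comes back as 3.5, not 0.035; divide by 100 afterwards if you need a proportion.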

 

 

If you simply do the following for a factor variable, you get the internal level codes, not the values. This can cause errors when you are reading in CSV data, which may be read as either character or factor data type.

Dataset$Var2= as.numeric(Dataset$Var1)
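A tiny made-up factor shows the difference between the level codes and the actual values:

```r
# Illustrative factor holding numbers stored as text
f <- factor(c("10", "20", "20", "5"))

codes  <- as.numeric(f)                # internal level codes: 1 2 2 3
values <- as.numeric(as.character(f))  # actual values: 10 20 20 5

levels(f)  # "10" "20" "5" (sorted as text, not as numbers)
```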

Another useful approach for manipulating string/character variables is to take substrings (using substr) and concatenate (using paste).

 

iris$sp=substr(iris$Species,1,3) will reduce the famous iris species names to three characters, without losing any analytical value.

The other issue is missing values. na.rm=T helps with getting summaries of numeric variables that have missing values, but we need to investigate further how suitable the na.omit functions are for domains that have large amounts of missing data needing treatment.
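For reference, a short sketch of substr and paste together on the built-in iris data. The "_" separator and the row-number suffix are arbitrary choices for illustration.

```r
# Abbreviate species names to their first three characters
sp <- substr(as.character(iris$Species), 1, 3)
unique(sp)  # "set" "ver" "vir"

# Build a label by concatenating the abbreviation with the row number
label <- paste(sp, seq_along(sp), sep = "_")
head(label, 3)  # "set_1" "set_2" "set_3"
```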
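To make the missing-value trade-off concrete, here is a small sketch on invented data. It compares na.rm with na.omit and measures the share of missing values, which is one simple way to judge whether dropping rows is acceptable.

```r
# Made-up numeric variable with missing values
x  <- c(4, NA, 10, 7, NA)
df <- data.frame(id = 1:5, x = x)

mean(x)                             # NA, because of the missing values
mean_x <- mean(x, na.rm = TRUE)     # 7, missing values ignored

complete <- na.omit(df)             # drops every row containing an NA
nrow(complete)                      # 3

share_missing <- mean(is.na(df$x))  # 0.4, i.e. 40% of rows would be lost
```

When share_missing is large, na.omit throws away a lot of data, and imputation may be worth considering instead.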

 

 

Book Review- Machine Learning for Hackers

This is a review of the fashionably named book Machine Learning for Hackers by Drew Conway and John Myles White (O'Reilly). The book is about hacking code in R.

 

The preface introduces the reader to the authors' conception of what machine learning and hacking are all about. If the book had been named Machine Learning for Business Analysts or Data Miners, I am sure the content would have been unchanged, though the popularity (and ambiguity) of the word hacker can often substitute for its usefulness. Indeed, the many wise and learned professors of statistics departments throughout the civilized world would be mildly surprised and bemused to hear their day-to-day activities described as hacking or teaching hackers. The book follows a case study and example based approach and uses the ggplot2 package within R almost to the point of ignoring any other native graphics system in R. It can be quite useful for the aspiring reader who wishes to understand and join the booming market for skilled talent in statistical computing.

Chapter 1 has a very useful set of functions for data cleansing and formatting. It walks you through the basics of formatting based on dates and conditions, missing value and outlier treatment, and using the ggplot2 package in R for graphical analysis. The case study used is an Infochimps dataset with 60,000 recordings of UFO sightings. The case study is lucid and done at an extremely helpful pace, illustrating the powerful and flexible nature of the R functions that can be used for data cleansing. The chapter mentions text editors and IDEs but fails to list them in a tabular format, while listing several other tables like the packages used in the book. It also jumps straight from installation instructions to functions in R without getting into the various kinds of data types within R or specifying where these can be referenced from. It thus assumes a higher level of basic programming understanding from the reader than the average R book.

Chapter 2 discusses data exploration and has a very clear set of diagrams that explain the various data summary operations that are performed routinely. This is an innovative approach and will help students or newcomers to the field of data analysis. It introduces the reader to type determination functions, as well as different kinds of encoding. The introduction to creating functions is quite elegant and simple, and numerical summary methods are explained adequately. While the chapter explains data exploration with the help of various histogram options in ggplot2, it fails to create a more generic framework for data exploration or rules to assist the reader in visual data exploration in non-standard data situations. While the examples are very helpful for a reader, there needs to be slightly more depth to step out of the example and into a framework for visual data exploration (or references for the same). A couple of case studies, however elaborately explained, cannot do justice to the vast field of data exploration and especially visual data exploration.

Chapter 3 discusses binary classification for the specific purpose of spam filtering, using a dataset from SpamAssassin. It introduces the reader to the naïve Bayes classifier and the principles of text mining using the tm package in R. Some of the example code could have been better commented for easier readability in the book. Overall it is quite an easy tutorial for creating a naïve Bayes classifier, even for beginners.

Chapter 4 discusses the issues in importance ranking and creating recommendation systems, specifically in the case of ordering email messages into important and not important. It introduces the useful grepl, gsub, strsplit, strptime, difftime and strtrim functions for parsing data. The chapter further introduces the reader to the concept of log (and affine) transformations in a lucid and clear way that can help even beginners learn this powerful transformation concept. Again, the code within this chapter is sparsely commented, which can cause difficulties for people not used to reading reams of code. (The comments may have been part of the code attached with the book, but I am reading an electronic book and I did not find an easy way to go back and forth between the code and the book.) The readability of the chapters would be further enhanced by flow charts explaining the path and process followed, rather than overly verbose textual descriptions running into multiple pages. The chapters are quite clearly written, but a helpful visual summary can help both in revising the concepts and in elucidating the approach taken. A suggestion for the authors would be to compile the list of useful functions they introduce in this book as a sort of reference card (or Ref Card) for R hackers, or at least to have a chapter-wise summary of functions, datasets and packages used.

Chapter 5 discusses linear regression, and surprisingly the explanation of regression theory in its introduction is not very good. However, the chapter makes up in practical examples what it oversimplifies in theory. The chapter on regression is not the finest chapter written in this otherwise excellent book. Part of this is because of a relative lack of organization: correlation is explained after linear regression is explained. Once again the lack of a function summary and a process flow diagram hinders readability, and a separate section on the regression metrics that make a regression result good or not so good would be a welcome addition. Functions introduced include lm.

Chapter 6 showcases the Generalized Additive Model (GAM) and polynomial regression, including an introduction to singularity and to over-fitting. Functions included in this chapter are transform and poly, while the package glmnet is also used here. The chapter also introduces the reader formally to the concepts of cross validation (though examples of cross validation had been introduced in earlier chapters) and regularization. Logistic regression is also introduced at the end of this chapter.

Chapter 7 is about optimization. It describes the error metric in a very easy to understand way. It creates a grid by using nested loops over various values of the intercept and slope of a regression equation and computing the sum of squared errors. It then describes the optim function in detail, including how it works and its various parameters. It introduces the curve function. The chapter then describes ridge regression, including its definition and the hyperparameter lambda. The use of the optim function to minimize the error in regression is useful learning for the aspiring hacker. Lastly it describes a case study of breaking codes using the simplistic Caesar cipher, a lexical database and the Metropolis method. The chapter also introduces the constant .Machine$double.eps.
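To give a flavour of the approach Chapter 7 describes, here is my own minimal sketch (not the book's code) of a nested-loop grid search over intercept and slope, followed by the same minimization with optim. The simulated data, grid ranges and starting values are all assumptions made for illustration.

```r
set.seed(1)

# Simulated data: true intercept 2, true slope 3, plus noise
x <- 1:20
y <- 2 + 3 * x + rnorm(20)

# Error metric: sum of squared errors for a given (intercept, slope)
sse <- function(par) sum((y - (par[1] + par[2] * x))^2)

# Grid search with nested loops
best_par <- c(NA, NA)
best_sse <- Inf
for (a in seq(0, 4, by = 0.5)) {
  for (b in seq(2, 4, by = 0.25)) {
    s <- sse(c(a, b))
    if (s < best_sse) {
      best_sse <- s
      best_par <- c(a, b)
    }
  }
}

# Let optim search the same error surface (default Nelder-Mead method)
fit <- optim(c(0, 0), sse)

# Ordinary least squares for comparison
ols <- coef(lm(y ~ x))

rbind(grid = best_par, optim = fit$par, lm = ols)
```

The grid can only be as precise as its step size, while optim refines the estimate and should land close to the lm coefficients.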

Chapter 8 deals with Principal Component Analysis and unsupervised learning. It uses the ymd function from the lubridate package to convert strings to date objects, and the cast function from the reshape package to further manipulate the structure of the data. The princomp function enables PCA in R. The case study creates a stock market index and compares the results with the Dow Jones index.

Chapter 9 deals with Multidimensional Scaling as well as clustering US senators on the basis of similarity in voting records on legislation. It showcases matrix multiplication using %*% and also the dist function to compute a distance matrix.

Chapter 10 has the subject of K Nearest Neighbors for recommendation systems. Packages used include class and reshape, and functions used include cor, function and log. It also demonstrates creating a custom kNN function for calculating Euclidean distance between center of centroids and data. The case study used is the R package recommendation contest on Kaggle. Overall it is a simplistic introduction to creating a recommendation system using K nearest neighbors, without getting into any of the prepackaged packages within R that deal with association analysis, clustering or recommendation systems.

Chapter 11 introduces the reader to social network analysis (and elements of graph theory), using the Erdos Number as an interesting example of social networks of mathematicians. The example of hacking with the Social Graph API by Google is quite new and intriguing (though it has been made a bit obsolete by changes, which should be rectified in either the errata or the next edition). However, there exist packages within R that should at least be referenced or used within this chapter (like the twitteR package, which uses the Twitter API, and the ROAuth package for other social networks). Packages used within this chapter include the RCurl, RJSONIO and igraph packages, and functions used include rbind and ifelse. It also introduces the reader to the advanced software Gephi. The last example is to build a recommendation engine for whom to follow on Twitter using R.

Chapter 12 is about model comparison and introduces the concept of Support Vector Machines. It uses the package e1071 and shows the svm function. It also introduces the concept of tuning hyperparameters within default algorithms. A small problem in understanding the concepts is the misalignment of diagram pages with the relevant code. It concludes with using mean squared error as a method for comparing models built with different algorithms.
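As a rough idea of the operations mentioned for Chapter 9, here is a toy sketch with an invented three-legislator vote matrix (1 = yea, -1 = nay). Using cmdscale, base R's classical multidimensional scaling function, is my assumption; the review does not say which call the book uses.

```r
# Invented roll-call matrix: rows are legislators, columns are bills
votes <- matrix(c( 1,  1, -1,
                   1, -1, -1,
                  -1, -1,  1),
                nrow = 3, byrow = TRUE,
                dimnames = list(c("A", "B", "C"),
                                c("bill1", "bill2", "bill3")))

# Matrix multiplication: pairwise agreement scores between legislators
agreement <- votes %*% t(votes)

# Euclidean distance matrix between voting records
d <- dist(votes)

# Classical MDS: place the legislators in two dimensions
coords <- cmdscale(d, k = 2)
```

Legislators with similar records (A and B here) end up closer together in coords than those who rarely agree (A and C).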

 

Overall the book is a welcome addition to the library of books based on the R programming language. The refreshing flow of the material and the practicality of its case studies make it a recommended addition for both academic and corporate business analysts trying to derive insights by hacking lots of heterogeneous data.

Have a look for yourself at-
http://shop.oreilly.com/product/0636920018483.do

Reading Google Docs using R Curl

 

and finally I can download my Google spreadsheet file using-

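require(RCurl)

# Fetch a CA certificate bundle so getURL can verify the SSL connection
download.file(url="http://curl.haxx.se/ca/cacert.pem", destfile="cacert.pem")

url="https://docs.google.com/spreadsheet/pub?key=0AtYMMvghK2ytcldUcWNNZTltcXdIZUZ2MWU0R1NfeWc&output=csv"

# getURL returns the published sheet as one CSV-formatted string
b <- getURL(url,cainfo="cacert.pem")

# Write the text as-is; write.table would prepend a header and a row name
writeLines(b,"test.csv")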
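Since getURL hands back the whole sheet as a single character string, it can also be parsed straight into a data frame without going through a file. The sketch below uses a made-up string, csv_text, standing in for the downloaded text so that it runs offline.

```r
# Stand-in for the string returned by getURL (invented content)
csv_text <- "page,visits\nhome,120\nabout,45\n"

# read.csv accepts raw text through its `text` argument
sheet <- read.csv(text = csv_text, stringsAsFactors = FALSE)

str(sheet)
```

With the real download you would pass b in place of csv_text.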

 

Previously (for the past 3, 5, 7 hours)-

The code at http://blog.revolutionanalytics.com/2011/09/using-google-spreadsheets-with-r-an-update.html doesn't work because of the SSL authentication issue, and the packages at [http://www.omegahat.org/RGoogleDocs/] and [https://r-forge.r-project.org/projects/rgoogledata/] are missing in action.

So I mixed the code at http://blog.revolutionanalytics.com/2009/09/how-to-use-a-google-spreadsheet-as-data-in-r.html and http://www.brocktibert.com/blog/2012/01/19/358/

and I get this error while using http://thebiobucket.blogspot.in/2012/03/r-function-to-read-data-from-google.html#more

Error in read.table(file = file, header = header, sep = sep, quote = quote, :
more columns than column names

 

How to add or change the %PATH% variable in Windows 7

See this for a simple 5 step way to add or change the %PATH% variable in Windows 7, if you need to install an application that shows an error while installing (because that app was built for Linux based systems…)

Radoop 0.3 launched- Open Source Graphical Analytics meets Big Data

What is Radoop? Quite possibly an exciting mix of analytics and big data computing

 

http://blog.radoop.eu/?p=12

What is Radoop?

Hadoop is an excellent tool for analyzing large data sets, but it lacks an easy-to-use graphical interface. RapidMiner is an excellent tool for data analytics, but its data size is limited by the memory available, and a single machine is often not enough to run the analyses on time. In this project, we combine the strengths of both projects and provide a RapidMiner extension for editing and running ETL, data analytics and machine learning processes over Hadoop.

We have closely integrated the highly optimized data analytics capabilities of Hive and Mahout, and the user-friendly interface of RapidMiner to form a powerful and easy-to-use data analytics solution for Hadoop.

 

and what’s new

http://blog.radoop.eu/?p=198

Radoop 0.3 released – fully graphical big data analytics

Today, Radoop had a major step forward with its 0.3 release. The new version of the visual big data analytics package adds full support for all major Hadoop distributions used these days: Apache Hadoop 0.20.2, 0.20.203, 1.0 and Cloudera’s Distribution including Apache Hadoop 3 (CDH3). It also adds support for large clusters by allowing the namenode, the jobtracker and the Hive server to reside on different nodes.

As Radoop’s promise is to make big data analytics easier, the 0.3 release is also focused on improving the user interface. It has an enhanced breakpointing system which allows to investigate intermediate results, and it adds dozens of quick fixes, so common process design mistakes get much easier to solve.

There are many further improvements and fixes, so please consult the release notes for more details. Radoop is in private beta mode, but heading towards a public release in Q2 2012. If you would like to get early access, then please apply at the signup page or describe your use case in email (beta at radoop.eu).

Radoop 0.3 (15 February 2012)

  • Support for Apache Hadoop 0.20.2, 0.20.203, 1.0 and Cloudera’s Distribution Including Apache Hadoop 3 (CDH3) in a single release
  • Support for clusters with separate master nodes (namenode, jobtracker, Hive server)
  • Enhanced breakpointing to evaluate intermediate results
  • Dozens of quick fixes for the most common process design errors
  • Improved process design and error reporting
  • New welcome perspective to help in the first steps
  • Many bugfixes and performance improvements

Radoop 0.2.2 (6 December 2011)

  • More Aggregate functions and distinct option
  • Generate ID operator for convenience
  • Numerous bug fixes and improvements
  • Improved user interface

Radoop 0.2.1 (16 September 2011)

  • Set Role and Data Multiplier operators
  • Management panel for testing Hadoop connections
  • Stability improvements for Hive access
  • Further small bugfixes and improvements

Radoop 0.2 (26 July 2011)

  • Three new algorithms: Fuzzy K-Means, Canopy, and Dirichlet clustering
  • Three new data preprocessing operators: Normalize, Replace, and Replace Missing Values
  • Significant speed improvements in data transmission and interactive analytics
  • Increased stability and speedup for K-Means
  • More flexible settings for Join operations
  • More meaningful error messages
  • Other small bugfixes and improvements

Radoop 0.1 (14 June 2011)

Initial release with 26 operators for data transmission, data preprocessing, and one clustering algorithm.

Note that RapidMiner also has a great R extension, so you can combine R, a graphical interface and big data analytics. Big data analytics is now easier and more powerful than ever.


Business Analytics Projects

In my view, analytics projects go through these four broad phases-

  • Business Problem Phase: What needs to be done?
  1. Increase Revenues
  2. Cut Costs
  3. Investigate Unusual Events
  4. Project Timelines
  • Technical Problem Phase: Technical Problems in Project Execution
  1. Data Availability / Data Quality / Data Augmentation Costs
  2. Statistical (Technique based approach): Hypothesis Formulation, Sampling, Iterations
  3. Programming (Tool based approach): Analytics Platform Coding (Input, Formats, Processing)
  • Technical Solution Phase: Problem Solving using the Tools and Skills Available
  1. Data Cleaning / Outlier Treatment / Missing Value Imputation
  2. Statistical (Technique based approach): Error Minimization, Model Validation, Confidence Levels
  3. Programming (Tool based approach): Analytics Platform Coding (Output, Display, Graphs)
  • Business Solution Phase: Put it all together in a word document, presentation and/or spreadsheet
  1. Finalized Forecasts, Models and Data Strategies
  2. Improvements in existing processes
  3. Control and Monitoring of Analytical Results post Implementation
  4. Legal and Compliance guidelines to execution
  5. (Internal or External) Client Satisfaction and Expectation Management
  6. Audience Feedback based on presenting final deliverable to broader audience

Funny HTML hack in Bit.ly on Twitter

Just saw a funny bit.ly hack/spam

Scroll down to see the title tag that shows the link to my blog when I hover the mouse over the bit.ly url.

Now you get an error message if you go here, because I changed my url structures. Note the bit.ly url is uATQ13 (:-)

http://www.decisionstats.com/2008/07/02/phpb

But if you go to bit.ly and type in

bit.ly/uATQ13

You first get a redirect to usyy.net/redirect.php?cookies=true

and then a redirect to

http://www.justz.info/mobilemoneymachines

What surprised me is the hacking of the bit.ly link

which changed the title in html (html newbies refer to http://www.w3schools.com/tags/att_standard_title.asp)

text from usyy.net/redirect.php?cookies=true to http://www.decisionstats.com/2008/07/02/phpb (see bottom of the picture) when my mouse pointer hovered over it. How did this happen? Is it due to my Chrome extensions... hmm... or is it my alternate identity (cyber Jekyll and cyber Hyde)... hmmm

Now I know it has been just two days since I wrote about the Chrome redirect extension called MafiaaFire (which could be one possible reason for this, since I installed it on my Chrome browser too, besides having the AdBlock extension) http://www.decisionstats.com/chrome-extension-mafiaafire/

 

But nice hack, huh? Two days is fast!!! Someone help bit.ly/chrome/me figure this out. 😉

 
