R for Business Analytics- Book by Ajay Ohri

So the cover art is ready, and if you are a reviewer, you can reserve online copies of the book I have been writing for the past two years. Special thanks to my mentors, detractors, readers and students- I owe you a beer!

You can also go here-

http://www.springer.com/statistics/book/978-1-4614-4342-1

 

R for Business Analytics


Ohri, Ajay

2012, XVI, 300 p., 208 illus., 162 in color.

Hardcover

ISBN 978-1-4614-4342-1

Due: September 30, 2012

Price: approx. 44,95 € (net)
  • Covers full spectrum of R packages related to business analytics
  • Step-by-step instruction on the use of R packages, in addition to exercises, references, interviews and useful links
  • Background information and exercises are all applied to practical business analysis topics, such as code examples on web and social media analytics, data mining, clustering and regression models

R for Business Analytics looks at some of the most common tasks performed by business analysts and helps the user navigate the wealth of information in R and its 4,000 packages. With this information the reader can select the packages that help accomplish analytical tasks with minimum effort and maximum usefulness. The book emphasizes the use of graphical user interfaces (GUIs) to further flatten R's famously steep learning curve. It is meant to kick-start your analytics work, with chapters on data visualization and code examples on web analytics, social media analytics, clustering, regression models, text mining, data mining models and forecasting. The book aims to expose the reader to a breadth of business analytics topics without burying the user in needless depth, and the included references and links let the reader pursue each topic further.

 

This book is aimed at business analysts with basic programming skills who want to use R for business analytics. Its scope is neither statistical theory nor graduate-level statistical research; it is written for business analytics practitioners. Business analytics (BA) refers to the exploration and investigation of data generated by businesses. Business intelligence (BI) is the seamless dissemination of information through the organization, primarily business metrics, both past and current, used for decision support. Data mining (DM) is the process of discovering new patterns in large data using algorithms and statistical methods. To differentiate between the three: BI is mostly current reporting, BA builds models to predict and strategize, and DM matches patterns in big data. The R statistical software is the fastest-growing analytics platform in the world, and it is established in both academia and corporations for robustness, reliability and accuracy.

Content Level » Professional/practitioner

Keywords » Business Analytics – Data Mining – Data Visualization – Forecasting – GUI – Graphical User Interface – R software – Text Mining

Related subjects » Business, Economics & Finance – Computational Statistics – Statistics

TABLE OF CONTENTS

Why R.- R Infrastructure.- R Interfaces.- Manipulating Data.- Exploring Data.- Building Regression Models.- Data Mining using R.- Clustering and Data Segmentation.- Forecasting and Time-Series Models.- Data Export and Output.- Optimizing your R Coding.- Additional Training Literature.- Appendix

Book Review- Machine Learning for Hackers

This is a review of the fashionably named book Machine Learning for Hackers by Drew Conway and John Myles White (O’Reilly). The book is about hacking code in R.

 

The preface introduces the reader to the authors’ conception of what machine learning and hacking are all about. If the book had been called Machine Learning for Business Analysts or Data Miners, I am sure the content would have been unchanged, though the popularity (and ambiguity) of the word hacker can often substitute for its usefulness. Indeed, the many wise and learned professors of statistics departments throughout the civilized world would be mildly surprised and bemused to hear their day-to-day activities described as hacking or teaching hackers. The book follows a case-study and example-based approach and uses the ggplot2 package within R almost to the point of ignoring any other native graphics system in R. It can be quite useful for the aspiring reader who wishes to understand and join the booming market for skilled talent in statistical computing.

Chapter 1 has a very useful set of functions for data cleansing and formatting. It walks you through the basics of formatting based on dates and conditions, missing-value and outlier treatment, and using the ggplot2 package in R for graphical analysis. The case study used is an Infochimps dataset with 60,000 recorded UFO sightings. The case study is lucid and proceeds at an extremely helpful pace, illustrating the powerful and flexible nature of the R functions that can be used for data cleansing. The chapter mentions text editors and IDEs but fails to list them in a tabular format, even though it provides several other tables, such as the packages used in the book. It also jumps straight from installation instructions to functions in R without covering the various data types within R or specifying where these can be referenced. It thus assumes a higher level of basic programming understanding than the average R book.
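To give a flavour of the kind of cleansing the chapter covers, here is a rough sketch of my own (not the book's code; the data frame and column names are invented) showing date parsing, missing-value treatment and a quick ggplot2 check:

# Hypothetical sightings data, not the book's UFO dataset
library(ggplot2)
sightings <- data.frame(
  DateOccurred    = c("19950110", "19950612", "bad-date", "19961103"),
  DurationSeconds = c(120, NA, 45, 600),
  stringsAsFactors = FALSE
)
# Parse yyyymmdd strings; malformed values become NA instead of errors
sightings$DateOccurred <- as.Date(sightings$DateOccurred, format = "%Y%m%d")
# Drop rows with missing dates, impute missing durations with the median
sightings <- subset(sightings, !is.na(DateOccurred))
sightings$DurationSeconds[is.na(sightings$DurationSeconds)] <-
  median(sightings$DurationSeconds, na.rm = TRUE)
# Quick graphical check of the cleaned variable
ggplot(sightings, aes(x = DateOccurred, y = DurationSeconds)) + geom_point()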

Chapter 2 discusses data exploration and has a very clear set of diagrams that explain the various data summary operations performed routinely. This is an innovative approach and will help students or newcomers to the field of data analysis. It introduces the reader to type-determination functions, as well as different kinds of encoding. The introduction to creating functions is quite elegant and simple, and numerical summary methods are explained adequately. While the chapter explains data exploration with the help of various histogram options in ggplot2, it fails to create a more generic framework for data exploration, or rules to assist the reader in visual data exploration in non-standard data situations. While the examples are very helpful, there needs to be slightly more depth to step out of the example and into a framework for visual data exploration (or at least references for the same). A couple of case studies, however elaborately explained, cannot do justice to the vast field of data exploration, and especially visual data exploration.
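A tiny exploration sketch of my own, on simulated data rather than the book's dataset, of the summary and histogram ideas the chapter covers:

library(ggplot2)
heights <- data.frame(height = rnorm(1000, mean = 170, sd = 10))
# Numerical summaries
summary(heights$height)
quantile(heights$height, probs = c(0.05, 0.25, 0.5, 0.75, 0.95))
sd(heights$height)
# Histograms with different binwidths -- the choice changes what you see
ggplot(heights, aes(x = height)) + geom_histogram(binwidth = 1)
ggplot(heights, aes(x = height)) + geom_histogram(binwidth = 5)
# Smoothed alternative to a histogram
ggplot(heights, aes(x = height)) + geom_density()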

Chapter 3 discusses binary classification for the specific purpose of spam filtering, using a dataset from SpamAssassin. It introduces the reader to the naïve Bayes classifier and the principles of text mining using the tm package in R. Some of the example code could have been better commented for easier readability in the book. Overall it is quite an easy tutorial for creating a naïve Bayes classifier, even for beginners.
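As a rough illustration (my own toy example, using a simpler word-overlap scorer rather than the book's actual naïve Bayes code), the tm workflow looks roughly like this:

library(tm)
# Invented toy messages, not the SpamAssassin corpus
spam <- c("win money now", "free money offer", "claim your prize now")
ham  <- c("meeting at noon tomorrow", "please review the report", "lunch tomorrow?")
term_freq <- function(docs) {
  corpus <- VCorpus(VectorSource(docs))
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removePunctuation)
  tdm <- TermDocumentMatrix(corpus)
  rowSums(as.matrix(tdm))          # term counts across the documents
}
spam_freq <- term_freq(spam)
ham_freq  <- term_freq(ham)
# Naive scoring: count how many words of a new message appear in each table
classify <- function(msg) {
  words <- unlist(strsplit(tolower(msg), "\\s+"))
  if (sum(words %in% names(spam_freq)) > sum(words %in% names(ham_freq))) "spam" else "ham"
}
classify("free prize money")   # expected: "spam"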

Chapter 4 discusses the issues in importance ranking and creating recommendation systems, specifically for ordering email messages into important and not important. It introduces the useful grepl, gsub, strsplit, strptime, difftime and strtrim functions for parsing data. The chapter further introduces the reader to the concept of log (and affine) transformations in a lucid and clear way that can help even beginners learn this powerful transformation concept. Again, the code within this chapter is sparsely commented, which can cause difficulties for people not used to reading reams of code. (The comments may be part of the code bundled with the book, but I am reading an electronic copy and did not find an easy way to go back and forth between the code and the book.) The readability of the chapters would be further enhanced by flow charts explaining the path and process followed, rather than overly verbose textual descriptions running into multiple pages. The chapters are quite clearly written, but a helpful visual summary would aid both in revising the concepts and in elucidating the approach taken. A suggestion for the authors: compile the list of useful functions they introduce in this book as a reference card for R hackers, or at least provide a chapter-wise summary of functions, datasets and packages used.
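A few one-line illustrations of the parsing functions mentioned above, on invented strings rather than the book's email corpus:

header <- "Date: Thu, 05 Jan 2012 14:30:00"
grepl("^Date:", header)                        # TRUE: does the line start with "Date:"?
datestr <- gsub("^Date: ", "", header)         # strip the prefix
strsplit(datestr, ", ")[[1]]                   # split into day name and timestamp
when <- strptime("05 Jan 2012 14:30:00", format = "%d %b %Y %H:%M:%S")
difftime(Sys.time(), when, units = "days")     # elapsed time since that message
strtrim(header, 5)                             # first five characters: "Date:"
# The log transformation compresses heavily skewed counts
x <- c(1, 3, 10, 250, 10000)
log1p(x)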

Chapter 5 discusses linear regression, and the introduction to regression theory is surprisingly weak. However, the chapter makes up in practical example what it oversimplifies in theory. The chapter on regression is not the finest in this otherwise excellent book. Part of this is a relative lack of organization: correlation is explained after linear regression. Once again, the lack of a function summary and a process-flow diagram hinders readability, and a separate section on the metrics that make a regression result good or not so good would be a welcome addition. Functions introduced include lm.
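For completeness, a minimal lm() example of my own on simulated data; summary() is where most of the regression metrics I wish the book had collected actually live:

set.seed(42)
x <- rnorm(100)
y <- 3 + 2 * x + rnorm(100, sd = 0.5)
fit <- lm(y ~ x)
summary(fit)            # coefficients, R-squared, residual standard error
cor(x, y)               # the correlation the chapter discusses after lm()
plot(x, y); abline(fit, col = "red")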

Chapter 6 showcases generalized additive models (GAM) and polynomial regression, including an introduction to singularity and over-fitting. Functions introduced in this chapter include transform and poly, and the glmnet package is also used here. The chapter also formally introduces the reader to the concepts of cross-validation (though examples of cross-validation appear in earlier chapters) and regularization. Logistic regression is introduced at the end of this chapter.
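A hedged sketch of my own of the polynomial-plus-regularization idea, on simulated data; the degree-10 basis and the alpha = 0 (ridge-style) choice are illustrative assumptions, not the book's settings:

library(glmnet)
set.seed(1)
x <- runif(100)
y <- sin(2 * pi * x) + rnorm(100, sd = 0.2)
# An unregularized high-degree polynomial fit can overfit badly
fit_poly <- lm(y ~ poly(x, 10))
# Regularization via glmnet: build raw polynomial terms, then let
# cross-validation pick the shrinkage parameter lambda
X <- sapply(1:10, function(d) x^d)
cv_fit <- cv.glmnet(X, y, alpha = 0)   # alpha = 0 gives ridge-style shrinkage
cv_fit$lambda.min                      # lambda chosen by cross-validation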

Chapter 7 is about optimization. It describes error metrics in a very easy-to-understand way. It creates a grid by using nested loops over various values of the intercept and slope of a regression equation and computing the sum of squared errors. It then describes the optim function in detail, including how it works and its various parameters, and introduces the curve function. The chapter then describes ridge regression, including its definition and the hyperparameter lambda. Using the optim function to minimize regression error is useful learning for the aspiring hacker. Lastly, it describes a case study of code breaking using the simplistic Caesar cipher, a lexical database and the Metropolis method. Constants introduced in this chapter include .Machine$double.eps.
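A small sketch of my own of the optim() idea, fitting a line by minimizing the sum of squared errors and then adding a ridge-style penalty; the data and the lambda value are invented:

set.seed(7)
x <- rnorm(50)
y <- 1.5 + 0.8 * x + rnorm(50, sd = 0.3)
sse <- function(par) {
  intercept <- par[1]; slope <- par[2]
  sum((y - (intercept + slope * x))^2)
}
fit <- optim(par = c(0, 0), fn = sse)
fit$par           # should be close to c(1.5, 0.8)
coef(lm(y ~ x))   # sanity check against lm()
# A ridge-style variant adds a penalty on the slope, controlled by lambda
ridge_sse <- function(par, lambda = 0.1) sse(par) + lambda * par[2]^2
optim(par = c(0, 0), fn = ridge_sse)$par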

Chapter 8 deals with principal component analysis and unsupervised learning. It uses the ymd function from the lubridate package to convert strings to date objects, and the cast function from the reshape package to further manipulate the structure of the data. The princomp function enables PCA in R. The case study creates a stock market index and compares the results with the Dow Jones index.
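A minimal princomp() sketch of my own on simulated prices (not real stock data) to show the index-building idea:

set.seed(8)
market <- rnorm(250)                                   # a common "market" driver
prices <- sapply(1:10, function(i) market + rnorm(250, sd = 0.5))
colnames(prices) <- paste0("stock", 1:10)
pca <- princomp(prices)
summary(pca)                # variance explained by each component
index <- pca$scores[, 1]    # the first component acts as a market index
plot(index, type = "l")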

Chapter 9 deals with multidimensional scaling, as well as clustering US senators on the basis of similarity in their voting records on legislation. It showcases matrix multiplication using %*% and the dist function to compute a distance matrix.
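A toy sketch of my own of the distance-and-scaling idea; the vote matrix is invented, and I use base R's cmdscale for the MDS step (the book may do this differently):

set.seed(9)
votes <- matrix(sample(c(-1, 0, 1), 10 * 6, replace = TRUE), nrow = 10)
rownames(votes) <- paste0("senator", 1:10)
similarity <- votes %*% t(votes)   # matrix multiplication: agreement counts
d <- dist(votes)                   # Euclidean distance between voting records
coords <- cmdscale(d, k = 2)       # classical MDS down to two dimensions
plot(coords, type = "n"); text(coords, labels = rownames(votes))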

Chapter 10 covers k-nearest neighbors for recommendation systems. Packages used include class and reshape, and functions used include cor, function and log. It also demonstrates creating a custom kNN function for calculating the Euclidean distance between centroids and data points. The case study used is the R package recommendation contest on Kaggle. Overall, it is a simple introduction to creating a recommendation system using k-nearest neighbors, without getting into any of the prepackaged R packages that deal with association analysis, clustering or recommendation systems.
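A hand-rolled sketch of my own of the Euclidean-distance nearest-neighbour idea; the data, and the choice of k = 3, are purely illustrative:

set.seed(10)
train  <- matrix(rnorm(40), ncol = 2)
labels <- rep(c("A", "B"), each = 10)
train[11:20, ] <- train[11:20, ] + 2        # shift class B away from class A
newpoint <- c(1.8, 2.1)
knn_predict <- function(point, train, labels, k = 3) {
  # Euclidean distance from the new point to every training row
  dists <- sqrt(rowSums((train - matrix(point, nrow(train), 2, byrow = TRUE))^2))
  nearest <- order(dists)[1:k]
  names(sort(table(labels[nearest]), decreasing = TRUE))[1]   # majority vote
}
knn_predict(newpoint, train, labels)   # expected: "B"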

Chapter 11 introduces the reader to social network analysis (and elements of graph theory) using the Erdős number as an interesting example of a social network of mathematicians. The example of hacking Google's Social Graph API is quite new and intriguing (though rendered a bit obsolete by subsequent changes, which should be rectified in either the errata or the next edition). However, there exist packages within R that should at least be referenced or used within this chapter (like the twitteR package that uses the Twitter API, and the ROAuth package for other social networks). Packages used within this chapter include RCurl, RJSONIO and igraph, and functions used include rbind and ifelse. It also introduces the reader to the advanced software Gephi. The last example builds a recommendation engine for whom to follow on Twitter using R.
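A tiny igraph sketch of my own of the kind of follower graph the chapter builds; the edge list is invented, and the function names follow recent igraph releases rather than the version used in the book:

library(igraph)
edges <- data.frame(
  from = c("alice", "alice", "bob",   "carol", "dave"),
  to   = c("bob",   "carol", "carol", "dave",  "alice")
)
g <- graph_from_data_frame(edges, directed = TRUE)
degree(g, mode = "in")                               # who is followed the most
shortest_paths(g, from = "alice", to = "dave")$vpath # path through the network
plot(g)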

Chapter 12 is about model comparison and introduces the concept of support vector machines. It uses the e1071 package and shows the svm function. It also introduces the concept of tuning hyperparameters within default algorithms. A small obstacle to understanding is the misalignment of diagram pages with the relevant code. It concludes by using mean squared error as a method for comparing models built with different algorithms.
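A short sketch of my own of svm() from e1071 plus a mean-squared-error comparison against lm(), echoing the chapter's model-comparison theme; the data are simulated:

library(e1071)
set.seed(12)
x <- runif(200, -3, 3)
y <- sin(x) + rnorm(200, sd = 0.2)
dat <- data.frame(x = x, y = y)
train <- dat[1:150, ]; test <- dat[151:200, ]
svm_fit <- svm(y ~ x, data = train)   # default radial kernel, regression mode
lm_fit  <- lm(y ~ x, data = train)
mse <- function(pred, actual) mean((pred - actual)^2)
mse(predict(svm_fit, test), test$y)
mse(predict(lm_fit,  test), test$y)   # the straight line should do worse here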

 

Overall the book is a welcome addition to the library of books on the R programming language, and the refreshing flow of material and the practicality of its case studies make it a recommended addition for both academic and corporate business analysts trying to derive insights by hacking lots of heterogeneous data.

Have a look for yourself at-
http://shop.oreilly.com/product/0636920018483.do

Using Google Analytics API with R: dimensions and metrics

I modified the query I wrote earlier at http://www.decisionstats.com/using-google-analytics-with-r/ to get multiple dimensions and metrics from the Google Analytics API, like hour of day and day of week, to capture cyclical parameters. We are adding these dimensions and metrics to bring more depth to our analysis. Basically we are trying to do a time-series analysis for forecasting web analytics data (which is time-stamped and rich in detail).

I am modifying the dimensions and metrics parameters of the query using the list at

http://code.google.com/apis/analytics/docs/gdata/dimsmets/dimsmets.html

 

query <- QueryBuilder()
query$Init(start.date = "2011-08-20",
                   end.date = "2012-08-25",
                   dimensions = c("ga:date","ga:hour","ga:dayOfWeek"),
                   metrics = c("ga:visitors","ga:visits","ga:pageviews","ga:timeOnSite"),
                   sort = c("ga:date","ga:hour","ga:dayOfWeek"),
                   table.id = paste(profiles$profile[3,3]))

#5. Make a request to get the data from the API 

ga.data <- ga$GetReportData(query)

#6. Look at the returned data 

str(ga.data)

head(ga.data$data) 

We need the lubridate package to create a ymd:hour time stamp, since GA gives data aggregated at an hourly level at most. We also need to smooth out the effect of weekends on web analytics data.

#Using package lubridate to convert character dates into time

library(lubridate)
ga.data$data[,1]=ymd(ga.data$data[,1])
ls()
dataset1=ga.data$data
names(dataset1) <- make.names(names(dataset1))
str(dataset1)
head(dataset1)
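A possible next step (my own sketch, not part of the original post): paste ga:date and ga:hour together into a full POSIXct timestamp with lubridate, and flag weekends so their effect can be modelled separately. The column positions below assume the query order date, hour, dayOfWeek.

# Combine date and hour into one timestamp; GA codes Sunday as 0 and Saturday as 6
dataset1$timestamp <- ymd_h(paste(dataset1[, 1], dataset1[, 2]))
dataset1$weekend   <- dataset1[, 3] %in% c("0", "6")
head(dataset1[, c("timestamp", "weekend")])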

To be continued-

Using Google Analytics with R

Some code to read in data from Google Analytics. Modifications include adding the SSL authentication code and changing the table.id parameter to choose the correct website from a GA profile with many websites.

The Google Analytics Package files can be downloaded from http://code.google.com/p/r-google-analytics/downloads/list

It provides access to Google Analytics data natively from the R statistical computing language. You can use this library to retrieve an R data.frame with Google Analytics data, and then perform advanced statistical analysis, like time series analysis and regressions.

Supported Features

  • Access to v2 of the Google Analytics Data Export API Data Feed
  • A QueryBuilder class to simplify creating API queries
  • API response is converted directly into R as a data.frame
  • The library returns the aggregates and confidence intervals of the metrics dynamically, if they exist
  • Auto-pagination to return more than 10,000 rows of information by combining multiple data requests. (Upper Limit 1M rows)
  • Authorization through the ClientLogin routine
  • Access to all the profiles ids for the authorized user
  • Full documentation and unit tests
Code-

> library(XML)

>

> library(RCurl)

Loading required package: bitops

>

> #Change path name in the following to the folder you downloaded the Google Analytics Package

>

> source("C:/Users/KUs/Desktop/CANADA/R/RGoogleAnalytics/R/RGoogleAnalytics.R")

>

> source("C:/Users/KUs/Desktop/CANADA/R/RGoogleAnalytics/R/QueryBuilder.R")

> # download the file needed for authentication

> download.file(url="http://curl.haxx.se/ca/cacert.pem", destfile="cacert.pem")

trying URL 'http://curl.haxx.se/ca/cacert.pem' Content type 'text/plain' length 215993 bytes (210 Kb) opened

URL downloaded 210 Kb

>

> # set the curl options

> curl <- getCurlHandle()

> options(RCurlOptions = list(capath = system.file("CurlSSL", "cacert.pem",

+ package = "RCurl"),

+ ssl.verifypeer = FALSE))

> curlSetOpt(.opts = list(proxy = 'proxyserver:port'), curl = curl)

An object of class “CURLHandle” Slot “ref”: <pointer: 0000000006AA2B70>

>

> # 1. Create a new Google Analytics API object

>

> ga <- RGoogleAnalytics()

>

> # 2. Authorize the object with your Google Analytics Account Credentials

>

> ga$SetCredentials("USERNAME", "PASSWORD")

>

> # 3. Get the list of different profiles, to help build the query

>

> profiles <- ga$GetProfileData()

>

> profiles #Error Check to See if we get the right website

$profile
        AccountName       ProfileName     TableId
1    dudeofdata.com    dudeofdata.com ga:44926237
2   knol.google.com   knol.google.com ga:45564890
3 decisionstats.com decisionstats.com ga:46751946

$total.results
  total.results
1             3

>

> # 4. Build the Data Export API query

>

> #Modify the start.date and end.date parameters based on data requirements

>

> #Modify the table.id at table.id = paste(profiles$profile[X,3]) to get the X th website in your profile

> # 4. Build the Data Export API query

> query <- QueryBuilder()
> query$Init(start.date = "2012-01-09",
+            end.date = "2012-03-20",
+            dimensions = "ga:date",
+            metrics = "ga:visitors",
+            sort = "ga:date",
+            table.id = paste(profiles$profile[3,3]))

>

>

> #5. Make a request to get the data from the API

>

> ga.data <- ga$GetReportData(query)

[1] "Executing query: https://www.google.com/analytics/feeds/data?start-date=2012%2D01%2D09&end-date=2012%2D03%2D20&dimensions=ga%3Adate&metrics=ga%3Avisitors&sort=ga%3Adate&ids=ga%3A46751946"

>

> #6. Look at the returned data

>

> str(ga.data)

List of 3
 $ data         :'data.frame': 72 obs. of 2 variables:
  ..$ ga:date    : chr [1:72] "20120109" "20120110" "20120111" "20120112" ...
  ..$ ga:visitors: num [1:72] 394 405 381 390 323 47 169 67 94 89 ...
 $ aggr.totals  :'data.frame': 1 obs. of 1 variable:
  ..$ aggregate.totals: num 28348
 $ total.results: num 72

>

> head(ga.data$data)

   ga:date ga:visitors
1 20120109         394
2 20120110         405
3 20120111         381
4 20120112         390
5 20120113         323
6 20120114          47

>

> #Plotting the Traffic

> plot(ga.data$data[,2], type="l")

Update: some errors come from pasting LaTeX directly into WordPress. Here is the same code, run through Pretty R, in case you want to play with the GA API.

library(XML)

library(RCurl)

#Change path name in the following to the folder you downloaded the Google Analytics Package 

source("C:/Users/KUs/Desktop/CANADA/R/RGoogleAnalytics/R/RGoogleAnalytics.R")

source("C:/Users/KUs/Desktop/CANADA/R/RGoogleAnalytics/R/QueryBuilder.R")
# download the file needed for authentication
download.file(url="http://curl.haxx.se/ca/cacert.pem", destfile="cacert.pem")

# set the curl options
curl <- getCurlHandle()
options(RCurlOptions = list(capath = system.file("CurlSSL", "cacert.pem",
package = "RCurl"),
ssl.verifypeer = FALSE))
curlSetOpt(.opts = list(proxy = 'proxyserver:port'), curl = curl)

# 1. Create a new Google Analytics API object 

ga <- RGoogleAnalytics()

# 2. Authorize the object with your Google Analytics Account Credentials 

ga$SetCredentials("ohri2007@gmail.com", "XXXXXXX")

# 3. Get the list of different profiles, to help build the query

profiles <- ga$GetProfileData()

profiles #Error Check to See if we get the right website

# 4. Build the Data Export API query 

#Modify the start.date and end.date parameters based on data requirements 

#Modify the table.id at table.id = paste(profiles$profile[X,3]) to get the X th website in your profile 
# 4. Build the Data Export API query
query <- QueryBuilder()
query$Init(start.date = "2012-01-09",
                   end.date = "2012-03-20",
                   dimensions = "ga:date",
                   metrics = "ga:visitors",
                   sort = "ga:date",
                   table.id = paste(profiles$profile[3,3]))

#5. Make a request to get the data from the API 

ga.data <- ga$GetReportData(query)

#6. Look at the returned data 

str(ga.data)

head(ga.data$data)

#Plotting the Traffic 

plot(ga.data$data[,2],type="l")


Amazon gives away 750 hours/month of Windows-based computing

and an additional 750 hours/month of Linux-based computing. The Windows instance makes it really quite easy for users to start getting the hang of cloud computing, and it is quite useful for people to tinker around with, given that Google's retail cloud offerings are taking so long to hit the market.

But it is only for new users.

http://aws.typepad.com/aws/2012/01/aws-free-usage-tier-now-includes-microsoft-windows-on-ec2.html

AWS Free Usage Tier now Includes Microsoft Windows on EC2

The AWS Free Usage Tier now allows you to run Microsoft Windows Server 2008 R2 on an EC2 t1.micro instance for up to 750 hours per month. This benefit is open to new AWS customers and to those who are already participating in the Free Usage Tier, and is available in all AWS Regions with the exception of GovCloud. This is an easy way for Windows users to start learning about and enjoying the benefits of cloud computing with AWS.

The micro instances provide a small amount of consistent processing power and the ability to burst to a higher level of usage from time to time. You can use this instance to learn about Amazon EC2, support a development and test environment, build an AWS application, or host a web site (or all of the above). We’ve fine-tuned the micro instances to make them even better at running Microsoft Windows Server.

You can launch your instance from the AWS Management Console.

We have lots of helpful resources to get you started.

Along with 750 instance hours of Windows Server 2008 R2 per month, the Free Usage Tier also provides another 750 instance hours to run Linux (also on a t1.micro), Elastic Load Balancer time and bandwidth, Elastic Block Storage, Amazon S3 Storage, and SimpleDB storage, a bunch of Simple Queue Service and Simple Notification Service requests, and some CloudWatch metrics and alarms (see the AWS Free Usage Tier page for details). We’ve also boosted the amount of EBS storage space offered in the Free Usage Tier to 30GB, and we’ve doubled the I/O requests in the Free Usage Tier, to 2 million.

 

Information Ladder for Analytics

One diagram very commonly used in marketing and sales by analytics providers, yet hardly ever credited to its author, is the information ladder.

http://en.wikipedia.org/wiki/Information_ladder

The information ladder is a diagram created by education professor Norman Longworth to describe the stages in human learning. According to the ladder, a learner moves through the following progression to construct “wisdom” at the highest level from “data” at the lowest level:

Data → Information → Knowledge → Understanding → Insight → Wisdom

Whereas the first two steps can be scientifically exactly defined, the upper parts belong to the domain of psychology and philosophy.

I sometimes think the information ladder, especially its last two rungs, is underutilized, under-quantified as metrics and rarely understood completely by the wise men of analytics and information display.

Some visual versions are below

 

Funnily enough, it is one of the rare analytics concepts first inspired by poetry-

http://en.wikipedia.org/wiki/DIKW

The earliest formalized distinction between wisdom, knowledge, and information may have been made by the poet and playwright T.S. Eliot:

Where is the Life we have lost in living?
Where is the wisdom we have lost in knowledge?
Where is the knowledge we have lost in information?

 

Google Webinar on Web Analytics

Google webinar on web analytics-

recommended for anyone with anything to do with the WWW

From

http://analytics.blogspot.com/2011/11/webinar-reaching-your-goals-with.html

 

Webinar: Reaching Your Goals with Analytics

 

 

Is your website performing as well as it could be? Do you want to get more out of your digital marketing campaigns, including AdWords and other digital media? Do you feel like you have gaps in your current Google Analytics setup?

We’ve heard from many of our users who want to go deeper into their Analytics — with so much data, it can be hard to know where to look first. If you’d like to move beyond standard “pageview” metrics and visitor statistics, then please join us next Thursday:

Webinar: Reaching Your Goals with Analytics
Date: Thursday, December 1
Time: 11am PST / 2pm EST
Sign up here!

During the webinar, we’ll cover:

  • Key questions to ask for richer insights from your data
  • How to define “success” (for websites, visitors, or campaigns)
  • How to set up and use Goals
  • How to set up and use Ecommerce (for websites with a shopping cart)
  • How to link AdWords to your Google Analytics account

Whatever your online business model — shopping, lead-generation, or pure content — these tools will deliver actionable insights into your buying cycle.

This webinar will be led by Joe Larkin, a technical specialist on the Google Analytics team, and it’s designed for intermediate users of Google Analytics. If you’re comfortable with the basics, but you’d like to do more with your data, then we hope you’ll join us next week!