## Random Sampling a Dataset in R

A common task in business analytics is to take a random sample of a very large dataset in order to test your analytics code. Note that most business analytics datasets are data.frames (records as rows and variables as columns) in structure, or database bound. This is partly due to the legacy of traditional analytics software.

Here is how we do it in R-

• Referring to parts of a data.frame rather than the whole dataset.

Using square brackets to reference variable columns and rows

The notation dataset[i,j] refers to the element in the ith row and jth column.

The notation dataset[i,] refers to all elements in the ith row - or a record, for a data.frame.

The notation dataset[,j] refers to all elements in the jth column - or a variable, for a data.frame.
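To make the notation concrete, here is a tiny sketch using a made-up data.frame (the df object and its columns are hypothetical, just for illustration):

```r
# A small made-up data.frame: 3 records (rows), 2 variables (columns)
df = data.frame(age = c(25, 32, 41), income = c(50, 65, 80))

df[2, 1]  # element in the 2nd row, 1st column: 32
df[2, ]   # all of row 2 - one record
df[, 2]   # all of column 2 - the income variable
```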

For a data.frame dataset:

```
nrow(dataset) #This gives number of rows
ncol(dataset) #This gives number of columns
```

An example of correlation between only a few variables in a data.frame:

> cor(dataset1[,4:6])

Splitting a dataset into test and control:

```
ts.test=dataset2[1:200] #First 200 rows
ts.control=dataset2[201:275] #Next 75 rows
```

• Sampling

Random sampling enables us to work on a smaller size of the whole dataset.

Use sample() to create a random permutation of a vector x, or to draw a random subset of it.

Suppose we want to take a 5% sample of a data frame with no replacement.

Let us create a dataset ajay of random numbers

`ajay=matrix( round(rnorm(200, 5,15)), ncol=10)`

#This is the kind of code line that frightens most MBAs!!

Note we use the round function to round off values.

```
ajay=as.data.frame(ajay)
nrow(ajay)
# [1] 20
ncol(ajay)
# [1] 10
```

This is a typical business data scenario: we want to select only a few records to do our analysis (or test our code), but keep all the columns for those records. Let us assume we want to sample only 5% of the whole data so we can run our code on it.

Then the number of rows in the new object will be 0.05*nrow(ajay). That will be the size of the sample.

The sample() function chooses a random subset of the row indices of the original object, with the size parameter setting how many to draw.

We also use replace=FALSE (or F) so that the same row is not picked again and again. The result, new_rows, is thus a 5% sample of the existing row indices.

Then, using square brackets, ajay[new_rows,] gives us the sampled records-

`b=ajay[sample(nrow(ajay),replace=F,size=0.05*nrow(ajay)),]`

You can change the percentage from 5% to whatever you want accordingly.
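If you want the same sample every time you rerun your test code, you can fix the random seed first; a minimal end-to-end sketch (the seed value 42 is arbitrary):

```r
set.seed(42)  # fix the seed so the same 5% sample is drawn on every run

# Recreate the example dataset: 20 rows, 10 columns of rounded random numbers
ajay = as.data.frame(matrix(round(rnorm(200, 5, 15)), ncol = 10))

# Draw 5% of the row indices without replacement, then subset the rows
new_rows = sample(nrow(ajay), size = 0.05 * nrow(ajay), replace = FALSE)
b = ajay[new_rows, ]

nrow(b)  # 0.05 * 20 = 1 row in the sample
```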

## Using Google Analytics API with R: dimensions and metrics

I modified the query I wrote earlier at http://www.decisionstats.com/using-google-analytics-with-r/ to get multiple dimensions and metrics from the Google Analytics API, like hour of day and day of week, to get cyclical parameters. We are adding the dimensions and metrics to bring more depth to our analysis. Basically we are trying to do a time series analysis for forecasting web analytics data (which is time-stamped and rich in detail).

Basically I am modifying the dimensions and metrics parameters of the query code using the list at

```
query <- QueryBuilder()
query$Init(start.date = "2011-08-20",
           end.date = "2012-08-25",
           dimensions = c("ga:date","ga:hour","ga:dayOfWeek"),
           metrics = c("ga:visitors","ga:visits","ga:pageviews","ga:timeOnSite"),
           sort = c("ga:date","ga:hour","ga:dayOfWeek"),
           table.id = paste(profiles$profile[3,3]))

#5. Make a request to get the data from the API
ga.data <- ga$GetReportData(query)

#6. Look at the returned data
str(ga.data)
```

and we need the lubridate package to create a ymd:hour time stamp, since GA gives data aggregated at an hourly level at most. Also, we need to smooth the effect of weekends on web analytics data.

```
#Using package lubridate to convert character dates into time
library(lubridate)
ga.data$data[,1]=ymd(ga.data$data[,1])
ls()
dataset1=ga.data$data
names(dataset1) <- make.names(names(dataset1))
str(dataset1)
```
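A sketch of building the hourly time stamp mentioned above with lubridate. The sample values here are made up to mimic what the ga:date and ga:hour dimensions return (yyyymmdd dates and two-digit hour strings):

```r
library(lubridate)

# Assumed shape of the GA API dimensions: date as "yyyymmdd", hour as "00".."23"
dates = c("20120820", "20120820", "20120821")
hours = c("09", "23", "00")

# ymd_h() parses a combined "date hour" string into a POSIXct time stamp
stamps = ymd_h(paste(dates, hours))
stamps
```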

To be continued-

## Funny Photos- it happens only in India 2

Raj Weds Deepika (with spelling error) and no, it's not photoshopped

Translation- No Girlfriend, No Tension (back of a truck). Note the amazing spelling in the picture above

Polite reminder- Please do not spit

OK, maybe this is an international thing- but it is funny in India (which is hot enough, thank you, and our curries are hot enough too!)

environmental sign

—–

Based on the award winning series of pictures at http://www.decisionstats.com/funny-photo-it-happens-only-in-india/

## Web Analytics using R, Google Analytics and TS Forecasting

This is a continuation of the previous post on using Google Analytics.

Now that we have downloaded and plotted the data- we try and fit time series to the website data to forecast future traffic.

Some observations-

1) Google Analytics has no predictive analytics; it is just descriptive analytics and data visualization (including the recent social analytics). However, you can very well add basic time series functions using R on top of the GA API.

Why do people look at website analytics? To know today's traffic and derive insights for the future.

2) Web data clearly follows a 7-day peak-and-trough pattern for weekly effects (weekdays and weekends); this is also true for hourly data ... and this can be used for smoothing historic web data for future forecasts.

3) On an advanced level, any hugely popular viral post can be treated as a level shift (not drift) and accordingly dampened.

Test and Control!

Similarly, using ARIMAX, we can factor in the quantity and tags of posts as X regressor variables.

and now the code- (don't laugh at the simplicity please, I am just tinkering and playing with data here!)

You need to copy and paste the code at the bottom of this post http://www.decisionstats.com/using-google-analytics-with-r/ if you want to download your GA data first.

Note I am using the lubridate, forecast and timeSeries packages in this section.

```
#Plotting the Traffic
plot(ga.data$data[,2],type="l")

library(timeSeries)
library(forecast)
```

```
#Using package lubridate to convert character dates into time
library(lubridate)
ga.data$data[,1]=ymd(ga.data$data[,1])
ls()
dataset1=ga.data$data
names(dataset1) <- make.names(names(dataset1))
str(dataset1)
dataset2 <- ts(dataset1$ga.visitors, start=0,
               frequency = frequency(dataset1$ga.visitors),
               names = dataset1$ga.date)
str(dataset2)
ts.test=dataset2[1:200]
ts.control=dataset2[201:275]

#Note I am splitting the data into test and control here

fitets=ets(ts.test)
plot(fitets)
testets=ets(ts.control,model=fitets)
accuracy(testets)
plot(testets)
spectrum(ts.test,method='ar')
decompose(ts.test)

library("TTR")
#We are doing a simple moving average over every 7 days. Note this can be
#24 hrs for hourly data, 30 days for daily data for month-to-month
#comparison, or 12 months for annual comparison
bb=SMA(dataset2,n=7)
#We notice that web analytics data needs smoothing for every 7th day, as
#there is some relation between traffic on weekdays/weekends/the same time last week

par(mfrow=c(2,1))
plot(bb,type="l",main="Using Seven Day Moving Average for Web Visitors")
plot(dataset2,main="Original Data")
```

Created by Pretty R at inside-R.org

Though I still wonder why the R query and GA R code/package could not run on the cloud (why does it need to be downloaded)- cloud computing Gs?

To be continued-

auto.arima() and forecasts!!!

cross validations!!!

and adapting the idiosyncratic periods and cycles  of web analytics to time series !!
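As a preview of the auto.arima() step promised above, here is a minimal sketch on simulated traffic with a weekly cycle. The data here is made up (a sine wave plus noise standing in for GA visitor counts), not real GA output:

```r
library(forecast)

set.seed(7)
# Simulated daily visitors with a 7-day cycle; frequency = 7 marks the weekly period
visitors = ts(200 + 50 * sin(2 * pi * (1:140) / 7) + rnorm(140, 0, 10),
              frequency = 7)

fit = auto.arima(visitors)  # searches (p,d,q)(P,D,Q) orders by AICc
fc = forecast(fit, h = 14)  # forecast the next two weeks
summary(fc)
plot(fc)
```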

## iPhone's Siri to replace call centers. Is it doable?

I was wondering why the planet spends so much money on the $150-billion business process outsourcing industry, especially on voice calls to call centers.

If your iPhone's Siri can be configured to answer any query, why can't it be configured to be a virtual assistant, customer support, outbound marketing, or even a super-charged call center interactive voice response?

Can we run some tests on this?

## Using Google Analytics with R

Some code to read in data from Google Analytics. Modifications include adding the SSL authentication code and modifying (in bold) the table.id parameter to choose the correct website from a GA profile with many websites.

It provides access to Google Analytics data natively from the R Statistical Computing programming language. You can use this library to retrieve an R data.frame with Google Analytics data. Then perform advanced statistical analysis, like time series analysis and regressions.

## Supported Features

• A QueryBuilder class to simplify creating API queries
• API response is converted directly into R as a data.frame
• Library returns the aggregates, and confidence intervals of the metrics, dynamically if they exist
• Auto-pagination to return more than 10,000 rows of information by combining multiple data requests. (Upper Limit 1M rows)
• Authorization through the ClientLogin routine
• Full documentation and unit tests
Code-

```
> library(XML)
> library(RCurl)
trying URL 'http://curl.haxx.se/ca/cacert.pem'
Content type 'text/plain' length 215993 bytes (210 Kb)
opened

> # set the curl options
> curl <- getCurlHandle()
> options(RCurlOptions = list(capath = system.file("CurlSSL", "cacert.pem",
+ package = "RCurl"),
+ ssl.verifypeer = FALSE))
> curlSetOpt(.opts = list(proxy = 'proxyserver:port'), curl = curl)
An object of class "CURLHandle"
Slot "ref":
<pointer: 0000000006AA2B70>

> # 1. Create a new Google Analytics API object
> # 2. Authorize the object with your Google Analytics Account Credentials
> # 3. Get the list of different profiles, to help build the query
> profiles <- ga$GetProfileData()
> profiles #Error Check to See if we get the right website
$profile
        AccountName       ProfileName     TableId
1    dudeofdata.com    dudeofdata.com ga:44926237
3 decisionstats.com decisionstats.com ga:46751946

$total.results
  total.results
1             3

> # 4. Build the Data Export API query
> #Modify the start.date and end.date parameters based on data requirements
> #Modify the table.id at table.id = paste(profiles$profile[X,3]) to get the X th website in your profile
> query <- QueryBuilder()
> query$Init(start.date = "2012-01-09",
+ end.date = "2012-03-20",
+ dimensions = "ga:date",
+ metrics = "ga:visitors",
+ sort = "ga:date",
+ table.id = paste(profiles$profile[3,3]))

> #5. Make a request to get the data from the API
> ga.data <- ga$GetReportData(query)

> #6. Look at the returned data
> str(ga.data)
List of 3
 $ data         :'data.frame': 72 obs. of 2 variables:
  ..$ ga:date    : chr [1:72] "20120109" "20120110" "20120111" "20120112" ...
  ..$ ga:visitors: num [1:72] 394 405 381 390 323 47 169 67 94 89 ...
 $ aggr.totals  :'data.frame': 1 obs. of 1 variable:
  ..$ aggregate.totals: num 28348
 $ total.results: num 72

   ga:date ga:visitors
1 20120109         394
2 20120110         405
3 20120111         381
4 20120112         390
5 20120113         323
6 20120114          47

> #Plotting the Traffic
> plot(ga.data$data[,2],type="l")
```

Update- Some errors came from pasting directly into WordPress. Here is the same code, made pretty, in case you want to play with the GA API.

```
library(XML)
library(RCurl)
library(RGoogleAnalytics) #provides RGoogleAnalytics() and QueryBuilder()

# set the curl options
curl <- getCurlHandle()
options(RCurlOptions = list(capath = system.file("CurlSSL", "cacert.pem",
                                                 package = "RCurl"),
                            ssl.verifypeer = FALSE))
curlSetOpt(.opts = list(proxy = 'proxyserver:port'), curl = curl)

# 1. Create a new Google Analytics API object
ga <- RGoogleAnalytics()

# 2. Authorize the object with your Google Analytics Account Credentials
ga$SetCredentials("ohri2007@gmail.com", "XXXXXXX")

# 3. Get the list of different profiles, to help build the query
profiles <- ga$GetProfileData()
profiles #Error Check to See if we get the right website

# 4. Build the Data Export API query
#Modify the start.date and end.date parameters based on data requirements
#Modify the table.id at table.id = paste(profiles$profile[X,3]) to get the X th website in your profile
query <- QueryBuilder()
query$Init(start.date = "2012-01-09",
           end.date = "2012-03-20",
           dimensions = "ga:date",
           metrics = "ga:visitors",
           sort = "ga:date",
           table.id = paste(profiles$profile[3,3]))

#5. Make a request to get the data from the API
ga.data <- ga$GetReportData(query)

#6. Look at the returned data
str(ga.data)

#Plotting the Traffic
plot(ga.data$data[,2],type="l")
```


```
require(RCurl)
b <- getURL(url, cainfo="cacert.pem")
write.table(b, quote = FALSE, sep = ",", file="test.csv")
```

Previously (for the past 3, 5, 7 hours)-

The code at http://blog.revolutionanalytics.com/2011/09/using-google-spreadsheets-with-r-an-update.html doesn't work, thanks to the SSL authentication issue, and the packages at [http://www.omegahat.org/RGoogleDocs/] and [https://r-forge.r-project.org/projects/rgoogledata/] are missing in action.