Tatvic bets on R

Tatvic, a up and coming startup founded by an ex-Trilogy colleague, has helped with the R for Google Analytics package. While Tatvic is into heavy duty web analytics, they are betting big on R, and using it for Web Analytics. David Smith, most excellent blogger-de-chief in R universe has blogged on them before here http://blog.revolutionanalytics.com/2013/02/analyze-web-traffic-data-with-google-analytics-and-r.html

Here is an upcoming seminar on R in Web Analytics.

Click here

From this webinar, you will get to know:

  • What is R and why should you use this tool? How to extract your Web Analytics data into R?
  • How to build a predictive model using web analytics data with the help of R?
  • How predictive modelling can take your analysis to the next level?
  • How to carry out insightful analysis through visualization?

Who should attend: Every web analyst who wants to take his analysis to the next level.

ps- Hat tip to Caroline  A

R for Business Analytics- Book by Ajay Ohri

So the cover art is ready, and if you are a reviewer, you can reserve online copies of the book I have been writing for past 2 years. Special thanks to my mentors, detractors, readers and students- I owe you a beer!

You can also go here-

http://www.springer.com/statistics/book/978-1-4614-4342-1

 

R for Business Analytics

R for Business Analytics

Ohri, Ajay

2012, 2012, XVI, 300 p. 208 illus., 162 in color.

Hardcover
Information

ISBN 978-1-4614-4342-1

Due: September 30, 2012

(net)

approx. 44,95 €
  • Covers full spectrum of R packages related to business analytics
  • Step-by-step instruction on the use of R packages, in addition to exercises, references, interviews and useful links
  • Background information and exercises are all applied to practical business analysis topics, such as code examples on web and social media analytics, data mining, clustering and regression models

R for Business Analytics looks at some of the most common tasks performed by business analysts and helps the user navigate the wealth of information in R and its 4000 packages.  With this information the reader can select the packages that can help process the analytical tasks with minimum effort and maximum usefulness. The use of Graphical User Interfaces (GUI) is emphasized in this book to further cut down and bend the famous learning curve in learning R. This book is aimed to help you kick-start with analytics including chapters on data visualization, code examples on web analytics and social media analytics, clustering, regression models, text mining, data mining models and forecasting. The book tries to expose the reader to a breadth of business analytics topics without burying the user in needless depth. The included references and links allow the reader to pursue business analytics topics.

 

This book is aimed at business analysts with basic programming skills for using R for Business Analytics. Note the scope of the book is neither statistical theory nor graduate level research for statistics, but rather it is for business analytics practitioners. Business analytics (BA) refers to the field of exploration and investigation of data generated by businesses. Business Intelligence (BI) is the seamless dissemination of information through the organization, which primarily involves business metrics both past and current for the use of decision support in businesses. Data Mining (DM) is the process of discovering new patterns from large data using algorithms and statistical methods. To differentiate between the three, BI is mostly current reports, BA is models to predict and strategize and DM matches patterns in big data. The R statistical software is the fastest growing analytics platform in the world, and is established in both academia and corporations for robustness, reliability and accuracy.

Content Level » Professional/practitioner

Keywords » Business Analytics – Data Mining – Data Visualization – Forecasting – GUI – Graphical User Interface – R software – Text Mining

Related subjects » Business, Economics & Finance – Computational Statistics – Statistics

TABLE OF CONTENTS

Why R.- R Infrastructure.- R Interfaces.- Manipulating Data.- Exploring Data.- Building Regression Models.- Data Mining using R.- Clustering and Data Segmentation.- Forecasting and Time-Series Models.- Data Export and Output.- Optimizing your R Coding.- Additional Training Literature.- Appendix

Data Quality in R #rstats

Many Data Quality Formats give problems when importing in your statistical software.A statistical software is quite unable to distingush between $1,000, 1000% and 1,000 and 1000 and will treat the former three as character variables while the third as a numeric variable by default. This issue is further compounded by the numerous ways we can represent date-time variables.

The good thing is for specific domains like finance and web analytics, even these weird data input formats are fixed, so we can fix up a list of handy data quality conversion functions in R for reference.

 

After much muddling about with coverting internet formats (or data used in web analytics) (mostly time formats without date like 00:35:23)  into data frame numeric formats, I found that the way to handle Date-Time conversions in R is

Dataset$Var2= strptime(as.character(Dataset$Var1),”%M:%S”)

The problem with this approach is you will get the value as a Date Time format (02/31/2012 04:00:45-  By default R will add today’s date to it.)  while you are interested in only Time Durations (4:00:45 or actually just the equivalent in seconds).

this can be handled using the as.difftime function

dataset$Var2=as.difftime(paste(dataset$Var1))

or to get purely numeric values so we can do numeric analysis (like summary)

dataset$Var2=as.numeric(as.difftime(paste(dataset$Var1)))

(#Maybe there is  a more elegant way here- but I dont know)

The kind of data is usually one we get in web analytics for average time on site , etc.

 

 

 

 

 

and

for factor variables

Dataset$Var2= as.numeric(as.character(Dataset$Var1))

 

or

Dataset$Var2= as.numeric(paste(Dataset$Var1))

 

Slight problem is suppose there is data like 1,504 – it will be converted to NA instead of 1504

The way to solve this is use the nice gsub function ONLy on that variable. Since the comma is also the most commonly used delimiter , you dont want to replace all the commas, just only the one in that variable.

 

dataset$Variable2=as.numeric(paste(gsub(“,”,””,dataset$Variable)))

 

Now lets assume we have data in the form of % like 0.00% , 1.23%, 3.5%

again we use the gsub function to replace the % value in the string with  (nothing).

 

dataset$Variable2=as.numeric(paste(gsub(“%”,””,dataset$Variable)))

 

 

If you simply do the following for a factor variable, it will show you the level not the value. This can create an error when you are reading in CSV data which may be read as character or factor data type.

Dataset$Var2= as.numeric(Dataset$Var1)

An additional way is to use substr (using substr( and concatenate (using paste) for manipulating string /character variables.

 

iris$sp=substr(iris$Species,1,3) –will reduce the famous Iris species into three digits , without losing any analytical value.

The other issue is with missing values, and na.rm=T helps with getting summaries of numeric variables with missing values, we need to further investigate how suitable, na.omit functions are for domains which have large amounts of missing data and need to be treated.

 

 

Using Google Analytics API with R:dimensions and metrics

I modified the query I wrote earlier  at http://www.decisionstats.com/using-google-analytics-with-r/to get multiple dimensions and metrics from the Google Analytics API, like hour of day,day of week to get cyclical parameters.We are adding the dimensions, and metrics to bring more depth in our analysis.Basically we are trying to do a time series analysis for forecasting web analytics data( which is basically time -stamped and rich in details ).

Basically I am modifying the dimensions and metrics parameters of the query code using the list at

http://code.google.com/apis/analytics/docs/gdata/dimsmets/dimsmets.html

 

query <- QueryBuilder()
query$Init(start.date = "2011-08-20",
                   end.date = "2012-08-25",
                   dimensions = c("ga:date","ga:hour","ga:dayOfWeek"),
                   metrics = c("ga:visitors","ga:visits","ga:pageviews","ga:timeOnSite"),
                   sort = c("ga:date","ga:hour","ga:dayOfWeek"),
                   table.id = paste(profiles$profile[3,3]))

#5. Make a request to get the data from the API 

ga.data <- ga$GetReportData(query)

#6. Look at the returned data 

str(ga.data)

head(ga.data$data) 

and we need the lubridate package to create a ymd:hour (time stamp)    since GA gives data aggregated at a hourly level at most. Also we need to smoothen the effect of weekend on web analytics data.

#Using package lubridate to convert character dates into time

library(lubridate)
ga.data$data[,1]=ymd(ga.data$data[,1])
ls()
dataset1=ga.data$data
names(dataset1) <- make.names(names(dataset1))
str(dataset1)
head(dataset1)

To be continued-

Web Analytics using R , Google Analytics and TS Forecasting

This is a continuation of the previous post on using Google Analytics .

Now that we have downloaded and plotted the data- we try and fit time series to the website data to forecast future traffic.

Some observations-

1) Google Analytics has 0 predictive analytics, it is just descriptive analytics and data visualization models (including the recent social analytics). However you can very well add in basic TS function using R to the GA API.

Why do people look at Website Analytics? To know today’s traffic and derive insights for the Future

2) Web Data clearly follows a 7 day peak and trough for weekly effects (weekdays and weekends), this is also true for hourly data …and this can be used for smoothing historic web data for future forecast.

3) On an advanced level, any hugely popular viral posts can be called a level shift (not drift) and accoringly dampened.

Test and Control!

Similarly using ARIMAX, we can factor in quantity and tag of posts as X regressor variables.

and now the code-( dont laugh at the simplicity please, I am just tinkering and playing with data here!)

You need to copy and paste the code at the bottom of   this post  http://www.decisionstats.com/using-google-analytics-with-r/ if you want to download your GA data down first.

Note I am using lubridate ,forecast and timeSeries packages in this section.

#Plotting the Traffic  plot(ga.data$data[,2],type="l") 

library(timeSeries)
library(forecast)

#Using package lubridate to convert character dates into time
library(lubridate)
ga.data$data[,1]=ymd(ga.data$data[,1])
ls()
dataset1=ga.data$data
names(dataset1) <- make.names(names(dataset1))
str(dataset1)
head(dataset1)
dataset2 <- ts(dataset1$ga.visitors,start=0,frequency = frequency(dataset1$ga.visitors), names=dataset1$ga.date)
str(dataset2)
head(dataset2)
ts.test=dataset2[1:200]
ts.control=dataset2[201:275]

 #Note I am splitting the data into test and control here

fitets=ets(ts.test)
plot(fitets)
testets=ets(ts.control,model=fitets)
accuracy(testets)
plot(testets)
spectrum(ts.test,method='ar')
decompose(ts.test)

library("TTR")
bb=SMA(dataset2,n=7)#We are doing a simple moving average for every 7 days. Note this can be 24 hrs for hourly data, or 30 days for daily data for month # 

to month comparison or 12 months for annual
#We notice that Web Analytics needs sommethening for every 7 thday as there is some relation to traffic on weekedays /weekends /same time last week
head(dataset2,40)
head(bb,40)

par(mfrow=c(2,1))
plot(bb,type="l",main="Using Seven Day Moving Average for Web Visitors")
plot(dataset2,main="Original Data")

Created by Pretty R at inside-R.org

Though I still wonder why the R query, gA R code /package could not be on the cloud (why it  needs to be downloaded)– cloud computing Gs?

Also how about adding some MORE predictive analytics to Google Analytics, chaps!

To be continued-

auto.arima() and forecasts!!!

cross validations!!!

and adapting the idiosyncratic periods and cycles  of web analytics to time series !!

Does the Internet need its own version of credit bureaus

Data Miners love data. The more data they have the better model they can build. Consumers do not love data so much and find sharing data generally a cumbersome task. They need to be incentivize for filling out survey forms , and for signing to loyalty programs. Lawyers, and privacy advocates love to use examples of improper data collection and usage as the harbinger of an ominous scenario. George Orwell’s 1984 never “mentioned” anything about Big Brother trying to sell you one more loan, credit card or product.

Data generated by customers is now growing without their needing to fill out forms and surveys. This data is about their preferences , tastes and choices and is growing in size and depth because it is generated from social media channels on the Internet.It is this data that can be and is captured by social media analytics.

Mobile data is also growing, including usage of location based applications and usage of Internet from the mobile phone is leading to further increases in data about consumers.Increasingly , location based applications help to provide a much more relevant context to the data generated. Just mobile data is expected to grow to 15 exabytes by 2015.

People want to have more and more conversations online publicly , share pictures , activity and interact with a large number of people whom  they have never met. But resent that information being used or abused without their knowledge.

Also the Internet is increasingly being consolidated into a few players like Microsoft, Amazon, Google  and Facebook, who are unable to agree on agreements to share that data between themselves. Interestingly you can use Yahoo as a data middleman between Google and Facebook.

At the same time, more and more purchases are being done online by customers and Internet advertising has grown much above the rate of growth of other mediums of communication.
Internet retail sales have the advantage that better demand predictability can lead to lower inventories as retailers need not stock up displays to look good. An Amazon warehouse need not keep material to simply stock up it shelves like a K-Mart does.

Our Hypothesis – An Analogy with how Financial Data Marketing is managed offline

  1. Financial information regarding spending and saving is much more sensitive yet the presence of credit bureaus alleviates these concerns.
  2. Credit bureaus collect information from all sources, aggregate and anonymize the individual components accordingly.They use SSN as a unique identifier.
  3. The Internet has a unique number too , called the Internet Protocol Address (I.P) 
  4. Should there be a unique identifier like Internet Security Number for the Internet to ensure adequate balance between the need for privacy as well as the need for appropriate targeting? 

After all, no one complains about privacy intrusions if their credit bureau data is aggregated , rolled up, and anonymized and turned into a propensity model for sending them direct mailers.

Advertising using Social Media and Internet

https://www.facebook.com/about/ads/#stories

1. A business creates an ad
Let’s say a gym opens in your neighborhood. The owner creates an ad to get people to come in for a free workout.
2. Facebook gets paid to deliver the ad
The owner sends the ad to Facebook and describes who should see it: people who live nearby and like running.
The right people see the ad
3. Facebook only shows you the ad if you live in town and like to run. That’s how advertisers reach you without knowing who you are.

Adding in credit bureau data and legislative regulation for anonymizing  and handling privacy data can expand the internet selling market, which is much more efficient from a supply chain perspective than the offline display and shop models.

Privacy Regulations on Marketing using Internet data
Should laws on opt out and do not mail, do not call, lists be extended to do not show ads , do not collect information on social media. In the offline world, you can choose to be part of direct marketing or opt out of direct marketing by enrolling yourself in various do not solicit lists. On the internet the only option from advertisements is to use the Adblock plugin if you are Google Chrome or Firefox browser user. Even Facebook gives you many more ads than you need to see.

One reason for so many ads on the Internet is lack of central anonymize data repositories for giving high quality data to these marketing companies.Software that can be used for social media analytics is already available off the shelf.

The growth of the Internet has helped carved out a big industry for Internet web analytics so it is a matter of time before social media analytics becomes a multi billion dollar business as well. What new developments would be unleashed in this brave new world is just a matter of time, and of course of the social media data!