Facebook Search- The fall of the machines

Increasingly I am beginning to search more and more on Facebook. This is for the following reasons-

1) Facebook is walled off to Google (mostly). While within Facebook , I get both people results and content results (from Bing).

Bing is an okay alternative , though not as fast as Google Instant.

2) Cleaner Web Results When Facebook increases the number of results from 3 top links to say 10 top links, there should be more outbound traffic from FB search to websites.For some reason Google continues to show 14 pages of results… Why? Why not limit to just one page.

3) Better People Search than  Pipl and Google. But not much (or any) image search. This is curious and I am hoping the Instagram results would be added to search results.

4) I am hoping for any company Facebook or Microsoft to challenge Adsense . Adwords already has rivals. Adsense is a de facto monopoly and my experiences in advertising show that content creators can make much more money from a better Adsense (especially ) if Adsense and Adwords do not have a conflict of interest from same advertisers.

Adwords should have been a special case of Adsense for Google.com but it is not.

5) Machine learning can only get you from tau to delta tau. When ad click behavior is inherently dependent on humans who behave mostly on chaotic , or genetic models than linear CPC models. I find FB has an inherent advantage in the quantity and quality of data collected on people behavior rather than click behavior. They are also more aggressive and less apologetic about behavorially targeted  ads.

Additional point- Analytics for Google Analytics is not as rich as analytics from Facebook pages in terms of demographic variables. This can be tested by anyone.

 

send email by R

For automated report delivery I have often used send email options in BASE SAS. For R, for scheduling tasks and sending me automated mails on completion of tasks I have two R options and 1 Windows OS scheduling option. Note red font denotes the parameters that should be changed. Anything else should NOT be changed.

Option 1-

Use the mail package at

http://cran.r-project.org/web/packages/mail/mail.pdf

> library(mail)

Attaching package: ‘mail’

The following object(s) are masked from ‘package:sendmailR’:

sendmail

>
> sendmail(“ohri2007@gmail.com“, subject=”Notification from R“,message=“Calculation finished!”, password=”rmail”)
[1] “Message was sent to ohri2007@gmail.com! You have 19 messages left.”

Disadvantage- Only 20 email messages by IP address per day. (but thats ok!)

Option 2-

use sendmailR package at http://cran.r-project.org/web/packages/sendmailR/sendmailR.pdf

install.packages()
library(sendmailR)
from <- sprintf(“<sendmailR@%s>”, Sys.info()[4])
to <- “<ohri2007@gmail.com>”
subject <- “Hello from R
body <- list(“It works!”, mime_part(iris))
sendmail(from, to, subject, body,control=list(smtpServer=”ASPMX.L.GOOGLE.COM”))

 

 

BiocInstaller version 1.2.1, ?biocLite for help
> install.packages(“sendmailR”)
Installing package(s) into ‘/home/ubuntu/R/library’
(as ‘lib’ is unspecified)
also installing the dependency ‘base64’

trying URL ‘http://cran.at.r-project.org/src/contrib/base64_1.1.tar.gz&#8217;
Content type ‘application/x-gzip’ length 61109 bytes (59 Kb)
opened URL
==================================================
downloaded 59 Kb

trying URL ‘http://cran.at.r-project.org/src/contrib/sendmailR_1.1-1.tar.gz&#8217;
Content type ‘application/x-gzip’ length 6399 bytes
opened URL
==================================================
downloaded 6399 bytes

BiocInstaller version 1.2.1, ?biocLite for help
* installing *source* package ‘base64’ …
** package ‘base64’ successfully unpacked and MD5 sums checked
** libs
gcc -std=gnu99 -I/usr/local/lib64/R/include -I/usr/local/include -fpic -g -O2 -c base64.c -o base64.o
gcc -std=gnu99 -shared -L/usr/local/lib64 -o base64.so base64.o -L/usr/local/lib64/R/lib -lR
installing to /home/ubuntu/R/library/base64/libs
** R
** preparing package for lazy loading
** help
*** installing help indices
** building package indices …
** testing if installed package can be loaded
BiocInstaller version 1.2.1, ?biocLite for help

* DONE (base64)
BiocInstaller version 1.2.1, ?biocLite for help
* installing *source* package ‘sendmailR’ …
** package ‘sendmailR’ successfully unpacked and MD5 sums checked
** R
** preparing package for lazy loading
** help
*** installing help indices
** building package indices …
** testing if installed package can be loaded
BiocInstaller version 1.2.1, ?biocLite for help

* DONE (sendmailR)

The downloaded packages are in
‘/tmp/RtmpsM222s/downloaded_packages’
> library(sendmailR)
Loading required package: base64
> from <- sprintf(“<sendmailR@%s>”, Sys.info()[4])
> to <- “<ohri2007@gmail.com>”
> subject <- “Hello from R”
> body <- list(“It works!”, mime_part(iris))
> sendmail(from, to, subject, body,
+ control=list(smtpServer=”ASPMX.L.GOOGLE.COM”))
$code
[1] “221”

$msg
[1] “2.0.0 closing connection ff2si17226764qab.40”

Disadvantage-This worked when I used the Amazon Cloud using the BioConductor AMI (for free 2 hours) at http://www.bioconductor.org/help/cloud/

It did NOT work when I tried it use it from my Windows 7 Home Premium PC from my Indian ISP (!!) .

It gave me this error

or in wait_for(250) :
SMTP Error: 5.7.1 [180.215.172.252] The IP you’re using to send mail is not authorized

 

PAUSE–

ps Why do this (send email by R)?

Note you can add either of the two programs of the end of the code that you want to be notified automatically. (like daily tasks)

This is mostly done for repeated business analytics tasks (like reports and analysis that need to be run at specific periods of time)

pps- What else can I do with this?

Can be modified to include sms or tweets  or even blog by email by modifying the   “to”  location appropriately.

3) Using Windows Task Scheduler to run R codes automatically (either the above)

or just sending an email

got to Start>  All Programs > Accessories >System Tools > Task Scheduler ( or by default C:Windowssystem32taskschd.msc)

Create a basic task

Now you can use this to run your daily/or scheduled R code  or you can send yourself email as well.

and modify the parameters- note the SMTP server (you can use the ones for google in example 2 at ASPMX.L.GOOGLE.COM)

and check if it works!

 

Related

 Geeky Things , Bro

Configuring IIS on your Windows 7 Home Edition-

note path to do this is-

Control Panel>All Control Panel Items> Program and Features>Turn Windows features on or off> Internet Information Services

and

http://stackoverflow.com/questions/709635/sending-mail-from-batch-file

 

R for Predictive Modeling- PAW Toronto

A nice workshop on using R for Predictive Modeling by Max Kuhn Director, Nonclinical Statistics, Pfizer is on at PAW Toronto.

Workshop

Monday, April 23, 2012 in Toronto
Full-day: 9:00am – 4:30pm

R for Predictive Modeling:
A Hands-On Introduction

Intended Audience: Practitioners who wish to learn how to execute on predictive analytics by way of the R language; anyone who wants “to turn ideas into software, quickly and faithfully.”

Knowledge Level: Either hands-on experience with predictive modeling (without R) or hands-on familiarity with any programming language (other than R) is sufficient background and preparation to participate in this workshop.


What prior attendees have exclaimed


Workshop Description

This one-day session provides a hands-on introduction to R, the well-known open-source platform for data analysis. Real examples are employed in order to methodically expose attendees to best practices driving R and its rich set of predictive modeling packages, providing hands-on experience and know-how. R is compared to other data analysis platforms, and common pitfalls in using R are addressed.

The instructor, a leading R developer and the creator of CARET, a core R package that streamlines the process for creating predictive models, will guide attendees on hands-on execution with R, covering:

  • A working knowledge of the R system
  • The strengths and limitations of the R language
  • Preparing data with R, including splitting, resampling and variable creation
  • Developing predictive models with R, including decision trees, support vector machines and ensemble methods
  • Visualization: Exploratory Data Analysis (EDA), and tools that persuade
  • Evaluating predictive models, including viewing lift curves, variable importance and avoiding overfitting

Hardware: Bring Your Own Laptop
Each workshop participant is required to bring their own laptop running Windows or OS X. The software used during this training program, R, is free and readily available for download.

Attendees receive an electronic copy of the course materials and related R code at the conclusion of the workshop.


Schedule

  • Workshop starts at 9:00am
  • Morning Coffee Break at 10:30am – 11:00am
  • Lunch provided at 12:30 – 1:15pm
  • Afternoon Coffee Break at 2:30pm – 3:00pm
  • End of the Workshop: 4:30pm

Instructor

Max Kuhn, Director, Nonclinical Statistics, Pfizer

Max Kuhn is a Director of Nonclinical Statistics at Pfizer Global R&D in Connecticut. He has been applying models in the pharmaceutical industries for over 15 years.

He is a leading R developer and the author of several R packages including the CARET package that provides a simple and consistent interface to over 100 predictive models available in R.

Mr. Kuhn has taught courses on modeling within Pfizer and externally, including a class for the India Ministry of Information Technology.

Source-

http://www.predictiveanalyticsworld.com/toronto/2012/r_for_predictive_modeling.php

Random Sampling a Dataset in R

A common example in business  analytics data is to take a random sample of a very large dataset, to test your analytics code. Note most business analytics datasets are data.frame ( records as rows and variables as columns)  in structure or database bound.This is partly due to a legacy of traditional analytics software.

Here is how we do it in R-

• Refering to parts of data.frame rather than whole dataset.

Using square brackets to reference variable columns and rows

The notation dataset[i,k] refers to element in the ith row and jth column.

The notation dataset[i,] refers to all elements in the ith row .or a record for a data.frame

The notation dataset[,j] refers to all elements in the jth column- or a variable for a data.frame.

For a data.frame dataset

> nrow(dataset) #This gives number of rows

> ncol(dataset) #This gives number of columns

An example for corelation between only a few variables in a data.frame.

> cor(dataset1[,4:6])

Splitting a dataset into test and control.

ts.test=dataset2[1:200] #First 200 rows

ts.control=dataset2[201:275] #Next 75 rows

• Sampling

Random sampling enables us to work on a smaller size of the whole dataset.

use sample to create a random permutation of the vector x.

Suppose we want to take a 5% sample of a data frame with no replacement.

Let us create a dataset ajay of random numbers

ajay=matrix( round(rnorm(200, 5,15)), ncol=10)

#This is the kind of code line that frightens most MBAs!!

Note we use the round function to round off values.

ajay=as.data.frame(ajay)

 nrow(ajay)

[1] 20

> ncol(ajay)

[1] 10

This is a typical business data scenario when we want to select only a few records to do our analysis (or test our code), but have all the columns for those records. Let  us assume we want to sample only 5% of the whole data so we can run our code on it

Then the number of rows in the new object will be 0.05*nrow(ajay).That will be the size of the sample.

The new object can be referenced to choose only a sample of all rows in original object using the size parameter.

We also use the replace=FALSE or F , to not the same row again and again. The new_rows is thus a 5% sample of the existing rows.

Then using the square backets and ajay[new_rows,] to get-

b=ajay[sample(nrow(ajay),replace=F,size=0.05*nrow(ajay)),]

 

You can change the percentage from 5 % to whatever you want accordingly.

Using Google Analytics API with R:dimensions and metrics

I modified the query I wrote earlier  at http://www.decisionstats.com/using-google-analytics-with-r/to get multiple dimensions and metrics from the Google Analytics API, like hour of day,day of week to get cyclical parameters.We are adding the dimensions, and metrics to bring more depth in our analysis.Basically we are trying to do a time series analysis for forecasting web analytics data( which is basically time -stamped and rich in details ).

Basically I am modifying the dimensions and metrics parameters of the query code using the list at

http://code.google.com/apis/analytics/docs/gdata/dimsmets/dimsmets.html

 

query <- QueryBuilder()
query$Init(start.date = "2011-08-20",
                   end.date = "2012-08-25",
                   dimensions = c("ga:date","ga:hour","ga:dayOfWeek"),
                   metrics = c("ga:visitors","ga:visits","ga:pageviews","ga:timeOnSite"),
                   sort = c("ga:date","ga:hour","ga:dayOfWeek"),
                   table.id = paste(profiles$profile[3,3]))

#5. Make a request to get the data from the API 

ga.data <- ga$GetReportData(query)

#6. Look at the returned data 

str(ga.data)

head(ga.data$data) 

and we need the lubridate package to create a ymd:hour (time stamp)    since GA gives data aggregated at a hourly level at most. Also we need to smoothen the effect of weekend on web analytics data.

#Using package lubridate to convert character dates into time

library(lubridate)
ga.data$data[,1]=ymd(ga.data$data[,1])
ls()
dataset1=ga.data$data
names(dataset1) <- make.names(names(dataset1))
str(dataset1)
head(dataset1)

To be continued-

Web Analytics using R , Google Analytics and TS Forecasting

This is a continuation of the previous post on using Google Analytics .

Now that we have downloaded and plotted the data- we try and fit time series to the website data to forecast future traffic.

Some observations-

1) Google Analytics has 0 predictive analytics, it is just descriptive analytics and data visualization models (including the recent social analytics). However you can very well add in basic TS function using R to the GA API.

Why do people look at Website Analytics? To know today’s traffic and derive insights for the Future

2) Web Data clearly follows a 7 day peak and trough for weekly effects (weekdays and weekends), this is also true for hourly data …and this can be used for smoothing historic web data for future forecast.

3) On an advanced level, any hugely popular viral posts can be called a level shift (not drift) and accoringly dampened.

Test and Control!

Similarly using ARIMAX, we can factor in quantity and tag of posts as X regressor variables.

and now the code-( dont laugh at the simplicity please, I am just tinkering and playing with data here!)

You need to copy and paste the code at the bottom of   this post  http://www.decisionstats.com/using-google-analytics-with-r/ if you want to download your GA data down first.

Note I am using lubridate ,forecast and timeSeries packages in this section.

#Plotting the Traffic  plot(ga.data$data[,2],type="l") 

library(timeSeries)
library(forecast)

#Using package lubridate to convert character dates into time
library(lubridate)
ga.data$data[,1]=ymd(ga.data$data[,1])
ls()
dataset1=ga.data$data
names(dataset1) <- make.names(names(dataset1))
str(dataset1)
head(dataset1)
dataset2 <- ts(dataset1$ga.visitors,start=0,frequency = frequency(dataset1$ga.visitors), names=dataset1$ga.date)
str(dataset2)
head(dataset2)
ts.test=dataset2[1:200]
ts.control=dataset2[201:275]

 #Note I am splitting the data into test and control here

fitets=ets(ts.test)
plot(fitets)
testets=ets(ts.control,model=fitets)
accuracy(testets)
plot(testets)
spectrum(ts.test,method='ar')
decompose(ts.test)

library("TTR")
bb=SMA(dataset2,n=7)#We are doing a simple moving average for every 7 days. Note this can be 24 hrs for hourly data, or 30 days for daily data for month # 

to month comparison or 12 months for annual
#We notice that Web Analytics needs sommethening for every 7 thday as there is some relation to traffic on weekedays /weekends /same time last week
head(dataset2,40)
head(bb,40)

par(mfrow=c(2,1))
plot(bb,type="l",main="Using Seven Day Moving Average for Web Visitors")
plot(dataset2,main="Original Data")

Created by Pretty R at inside-R.org

Though I still wonder why the R query, gA R code /package could not be on the cloud (why it  needs to be downloaded)– cloud computing Gs?

Also how about adding some MORE predictive analytics to Google Analytics, chaps!

To be continued-

auto.arima() and forecasts!!!

cross validations!!!

and adapting the idiosyncratic periods and cycles  of web analytics to time series !!

Using Google Analytics with R

Some code to read in data from Google Analytics data. Some modifications include adding the SSL authentication code and modifying (in bold) the table.id parameter to choose correct website from a GA profile with many websites

The Google Analytics Package files can be downloaded from http://code.google.com/p/r-google-analytics/downloads/list

It provides access to Google Analytics data natively from the R Statistical Computing programming language. You can use this library to retrieve an R data.frame with Google Analytics data. Then perform advanced statistical analysis, like time series analysis and regressions.

Supported Features

  • Access to v2 of the Google Analytics Data Export API Data Feed
  • A QueryBuilder class to simplify creating API queries
  • API response is converted directly into R as a data.frame
  • Library returns the aggregates, and confidence intervals of the metrics, dynamically if they exist
  • Auto-pagination to return more than 10,000 rows of information by combining multiple data requests. (Upper Limit 1M rows)
  • Authorization through the ClientLogin routine
  • Access to all the profiles ids for the authorized user
  • Full documentation and unit tests
Code-

> library(XML)

>

> library(RCurl)

Loading required package: bitops

>

> #Change path name in the following to the folder you downloaded the Google Analytics Package

>

> source(“C:/Users/KUs/Desktop/CANADA/R/RGoogleAnalytics/R/RGoogleAnalytics.R”)

>

> source(“C:/Users/KUs/Desktop/CANADA/R/RGoogleAnalytics/R/QueryBuilder.R”)

> # download the file needed for authentication

> download.file(url=”http://curl.haxx.se/ca/cacert.pem&#8221;, destfile=”cacert.pem”)

trying URL ‘http://curl.haxx.se/ca/cacert.pem&#8217; Content type ‘text/plain’ length 215993 bytes (210 Kb) opened

URL downloaded 210 Kb

>

> # set the curl options

> curl <- getCurlHandle()

> options(RCurlOptions = list(capath = system.file(“CurlSSL”, “cacert.pem”,

+ package = “RCurl”),

+ ssl.verifypeer = FALSE))

> curlSetOpt(.opts = list(proxy = ‘proxyserver:port’), curl = curl)

An object of class “CURLHandle” Slot “ref”: <pointer: 0000000006AA2B70>

>

> # 1. Create a new Google Analytics API object

>

> ga <- RGoogleAnalytics()

>

> # 2. Authorize the object with your Google Analytics Account Credentials

>

> ga$SetCredentials(“USERNAME”, “PASSWORD”)

>

> # 3. Get the list of different profiles, to help build the query

>

> profiles <- ga$GetProfileData()

>

> profiles #Error Check to See if we get the right website

$profile AccountName ProfileName TableId

1 dudeofdata.com dudeofdata.com ga:44926237

2 knol.google.com knol.google.com ga:45564890

3 decisionstats.com decisionstats.com ga:46751946

$total.results

total.results

1 3

>

> # 4. Build the Data Export API query

>

> #Modify the start.date and end.date parameters based on data requirements

>

> #Modify the table.id at table.id = paste(profiles$profile[X,3]) to get the X th website in your profile

> # 4. Build the Data Export API query

> query <- QueryBuilder() > query$Init(start.date = “2012-01-09”, + end.date = “2012-03-20”, + dimensions = “ga:date”,

+ metrics = “ga:visitors”,

+ sort = “ga:date”,

+ table.id = paste(profiles$profile[3,3]))

>

>

> #5. Make a request to get the data from the API

>

> ga.data <- ga$GetReportData(query)

[1] “Executing query: https://www.google.com/analytics/feeds/data?start-date=2012%2D01%2D09&end-date=2012%2D03%2D20&dimensions=ga%3Adate&metrics=ga%3Avisitors&sort=ga%3Adate&ids=ga%3A46751946&#8221;

>

> #6. Look at the returned data

>

> str(ga.data)

List of 3

$ data :’data.frame’: 72 obs. of 2 variables: ..

$ ga:date : chr [1:72] “20120109” “20120110” “20120111” “20120112” … ..

$ ga:visitors: num [1:72] 394 405 381 390 323 47 169 67 94 89 …

$ aggr.totals :’data.frame’: 1 obs. of 1 variable: ..

$ aggregate.totals: num 28348

$ total.results: num 72

>

> head(ga.data$data)

ga:date ga:visitors

1 20120109 394

2 20120110 405

3 20120111 381

4 20120112 390

5 20120113 323

6 20120114 47 >

> #Plotting the Traffic >

> plot(ga.data$data[,2],type=”l”)

Update- Some errors come from pasting Latex directly to WordPress. Here is some code , made pretty-r in case you want to play with the GA api

library(XML)

library(RCurl)

#Change path name in the following to the folder you downloaded the Google Analytics Package 

source("C:/Users/KUs/Desktop/CANADA/R/RGoogleAnalytics/R/RGoogleAnalytics.R")

source("C:/Users/KUs/Desktop/CANADA/R/RGoogleAnalytics/R/QueryBuilder.R")
# download the file needed for authentication
download.file(url="http://curl.haxx.se/ca/cacert.pem", destfile="cacert.pem")

# set the curl options
curl <- getCurlHandle()
options(RCurlOptions = list(capath = system.file("CurlSSL", "cacert.pem",
package = "RCurl"),
ssl.verifypeer = FALSE))
curlSetOpt(.opts = list(proxy = 'proxyserver:port'), curl = curl)

# 1. Create a new Google Analytics API object 

ga <- RGoogleAnalytics()

# 2. Authorize the object with your Google Analytics Account Credentials 

ga$SetCredentials("ohri2007@gmail.com", "XXXXXXX")

# 3. Get the list of different profiles, to help build the query

profiles <- ga$GetProfileData()

profiles #Error Check to See if we get the right website

# 4. Build the Data Export API query 

#Modify the start.date and end.date parameters based on data requirements 

#Modify the table.id at table.id = paste(profiles$profile[X,3]) to get the X th website in your profile 
# 4. Build the Data Export API query
query <- QueryBuilder()
query$Init(start.date = "2012-01-09",
                   end.date = "2012-03-20",
                   dimensions = "ga:date",
                   metrics = "ga:visitors",
                   sort = "ga:date",
                   table.id = paste(profiles$profile[3,3]))

#5. Make a request to get the data from the API 

ga.data <- ga$GetReportData(query)

#6. Look at the returned data 

str(ga.data)

head(ga.data$data)

#Plotting the Traffic 

plot(ga.data$data[,2],type="l")

Created by Pretty R at inside-R.org