Facebook Search- The fall of the machines

Increasingly I am beginning to search more and more on Facebook. This is for the following reasons-

1) Facebook is walled off to Google (mostly). While within Facebook , I get both people results and content results (from Bing).

Bing is an okay alternative , though not as fast as Google Instant.

2) Cleaner Web Results When Facebook increases the number of results from 3 top links to say 10 top links, there should be more outbound traffic from FB search to websites.For some reason Google continues to show 14 pages of results… Why? Why not limit to just one page.

3) Better People Search than  Pipl and Google. But not much (or any) image search. This is curious and I am hoping the Instagram results would be added to search results.

4) I am hoping for any company Facebook or Microsoft to challenge Adsense . Adwords already has rivals. Adsense is a de facto monopoly and my experiences in advertising show that content creators can make much more money from a better Adsense (especially ) if Adsense and Adwords do not have a conflict of interest from same advertisers.

Adwords should have been a special case of Adsense for Google.com but it is not.

5) Machine learning can only get you from tau to delta tau. When ad click behavior is inherently dependent on humans who behave mostly on chaotic , or genetic models than linear CPC models. I find FB has an inherent advantage in the quantity and quality of data collected on people behavior rather than click behavior. They are also more aggressive and less apologetic about behavorially targeted  ads.

Additional point- Analytics for Google Analytics is not as rich as analytics from Facebook pages in terms of demographic variables. This can be tested by anyone.

 

send email by R

For automated report delivery I have often used send email options in BASE SAS. For R, for scheduling tasks and sending me automated mails on completion of tasks I have two R options and 1 Windows OS scheduling option. Note red font denotes the parameters that should be changed. Anything else should NOT be changed.

Option 1-

Use the mail package at

http://cran.r-project.org/web/packages/mail/mail.pdf

> library(mail)

Attaching package: ‘mail’

The following object(s) are masked from ‘package:sendmailR’:

sendmail

>
> sendmail(“ohri2007@gmail.com“, subject=”Notification from R“,message=“Calculation finished!”, password=”rmail”)
[1] “Message was sent to ohri2007@gmail.com! You have 19 messages left.”

Disadvantage- Only 20 email messages by IP address per day. (but thats ok!)

Option 2-

use sendmailR package at http://cran.r-project.org/web/packages/sendmailR/sendmailR.pdf

install.packages()
library(sendmailR)
from <- sprintf(“<sendmailR@%s>”, Sys.info()[4])
to <- “<ohri2007@gmail.com>”
subject <- “Hello from R
body <- list(“It works!”, mime_part(iris))
sendmail(from, to, subject, body,control=list(smtpServer=”ASPMX.L.GOOGLE.COM”))

 

 

BiocInstaller version 1.2.1, ?biocLite for help
> install.packages(“sendmailR”)
Installing package(s) into ‘/home/ubuntu/R/library’
(as ‘lib’ is unspecified)
also installing the dependency ‘base64’

trying URL ‘http://cran.at.r-project.org/src/contrib/base64_1.1.tar.gz&#8217;
Content type ‘application/x-gzip’ length 61109 bytes (59 Kb)
opened URL
==================================================
downloaded 59 Kb

trying URL ‘http://cran.at.r-project.org/src/contrib/sendmailR_1.1-1.tar.gz&#8217;
Content type ‘application/x-gzip’ length 6399 bytes
opened URL
==================================================
downloaded 6399 bytes

BiocInstaller version 1.2.1, ?biocLite for help
* installing *source* package ‘base64’ …
** package ‘base64’ successfully unpacked and MD5 sums checked
** libs
gcc -std=gnu99 -I/usr/local/lib64/R/include -I/usr/local/include -fpic -g -O2 -c base64.c -o base64.o
gcc -std=gnu99 -shared -L/usr/local/lib64 -o base64.so base64.o -L/usr/local/lib64/R/lib -lR
installing to /home/ubuntu/R/library/base64/libs
** R
** preparing package for lazy loading
** help
*** installing help indices
** building package indices …
** testing if installed package can be loaded
BiocInstaller version 1.2.1, ?biocLite for help

* DONE (base64)
BiocInstaller version 1.2.1, ?biocLite for help
* installing *source* package ‘sendmailR’ …
** package ‘sendmailR’ successfully unpacked and MD5 sums checked
** R
** preparing package for lazy loading
** help
*** installing help indices
** building package indices …
** testing if installed package can be loaded
BiocInstaller version 1.2.1, ?biocLite for help

* DONE (sendmailR)

The downloaded packages are in
‘/tmp/RtmpsM222s/downloaded_packages’
> library(sendmailR)
Loading required package: base64
> from <- sprintf(“<sendmailR@%s>”, Sys.info()[4])
> to <- “<ohri2007@gmail.com>”
> subject <- “Hello from R”
> body <- list(“It works!”, mime_part(iris))
> sendmail(from, to, subject, body,
+ control=list(smtpServer=”ASPMX.L.GOOGLE.COM”))
$code
[1] “221”

$msg
[1] “2.0.0 closing connection ff2si17226764qab.40”

Disadvantage-This worked when I used the Amazon Cloud using the BioConductor AMI (for free 2 hours) at http://www.bioconductor.org/help/cloud/

It did NOT work when I tried it use it from my Windows 7 Home Premium PC from my Indian ISP (!!) .

It gave me this error

or in wait_for(250) :
SMTP Error: 5.7.1 [180.215.172.252] The IP you’re using to send mail is not authorized

 

PAUSE–

ps Why do this (send email by R)?

Note you can add either of the two programs of the end of the code that you want to be notified automatically. (like daily tasks)

This is mostly done for repeated business analytics tasks (like reports and analysis that need to be run at specific periods of time)

pps- What else can I do with this?

Can be modified to include sms or tweets  or even blog by email by modifying the   “to”  location appropriately.

3) Using Windows Task Scheduler to run R codes automatically (either the above)

or just sending an email

got to Start>  All Programs > Accessories >System Tools > Task Scheduler ( or by default C:Windowssystem32taskschd.msc)

Create a basic task

Now you can use this to run your daily/or scheduled R code  or you can send yourself email as well.

and modify the parameters- note the SMTP server (you can use the ones for google in example 2 at ASPMX.L.GOOGLE.COM)

and check if it works!

 

Related

 Geeky Things , Bro

Configuring IIS on your Windows 7 Home Edition-

note path to do this is-

Control Panel>All Control Panel Items> Program and Features>Turn Windows features on or off> Internet Information Services

and

http://stackoverflow.com/questions/709635/sending-mail-from-batch-file

 

Google introduces Google Play

Some nice new features from the big G men from Mountain view. Google Play- for movies, games, apps, music and books. Nice to see entertainment is back on Google’s priority.

 

See this to read more

https://play.google.com/about/

When will I get Google Play?

About Google Play

Q: What is Google Play?
A: Google Play is a new digital content experience from Google where you can find your favorite music, movies, books, and Android apps and games. It’s your entertainment hub: you can access it from the web or from your Android device or even TV, and all your content is instantly available across all of these devices.

Q: What is your strategy with Google Play?
A: Our goal with Google Play is to bring together all your favorite content in one place that you can access across your devices. Specifically, digital content is fundamental to the mobile experience, so bringing all of this content together in one place for users makes the Android platform even more compelling. We’re also simplifying digital content for Google users – you can go to the Google Play website on your desktop and purchase and experience the latest movies, music and books. With Google Play, we’re giving you a simpler way to get your digital content.

Q: What will the experience be for users? What will happen to my existing account?
A: All content and apps in your existing account will remain in your account, but will transition to Google Play. On your device, the Android Market app icon will become the Google Play store icon. You’ll see “Play Store.” For the movies, books and music apps, you’ll begin to see Play versions of these as well, such as “Play Music,” and “Play Movies.”

Q: When will I get Google Play? What markets is this available in?
A: We’ll be rolling out Google Play globally starting today. On the web, Google Play will be live today. On devices, it will take a few days for the Android Market app to update to the Google Play Store app. The music, books and movies apps will also receive an update today.
Around the globe, Google Play will include Android apps and games. In countries where we have already launched music, books or movies, you will see those categories available in Google Play, too.

Q: I live outside the US. When will I get the books, music or movies verticals? I only see Android apps and games?
A: We want to bring different content categories to as many countries as possible. We’ve already launched movies and books in several countries outside the U.S. and will continue to do so overtime, but we don’t have a specific timeline to share.

Q: What types of content are available in my country?

  • Paid Apps: Available in these countries
  • Movies: Available in US, UK, Canada, and Japan
  • eBooks: Available in US, UK, Canada, and Australia
  • Music: Available in US

 

Q: Does this mean Google Music and the Google eBookstore will cease to exist? What about my account?
A: Both Google Music and the Google eBookstore are now part of Google Play. Your music and your books, including anything you bought, are still there, available to you in Google Play and accessible through your Google account.

Q: Where did my Google eBooks books go? Will I still have access to them?
A: Your books are now part of Google Play. Your books are still there, available to you in your Google Play library and accessible through your Google account.

Q: I don’t use an Android phone, can I still use Google Play?
A: Yes. Google Play is available on any computer with a modern browser at play.google.com. On the web, you can browse and buy books, movies and music. You can read books on the Google Play web reader, listen to music on your computer or watch movies online. Your digital content is all stored in the cloud, so you can access from anywhere using your Google Account.
We’ve also created ways to experience your music and books on other platforms such as the Google Books iOS app.

Q: Why do I not see Google Play yet on my device?
A: Please see our help center article on this here.

Q: How can I contact Google Play consumer support?
A: You can call or email our team here.

Book Review- Machine Learning for Hackers

This is review of the fashionably named book Machine Learning for Hackers by Drew Conway and John Myles White (O’Reilly ). The book is about hacking code in R.

 

The preface introduces the reader to the authors conception of what machine learning and hacking is all about. If the name of the book was machine learning for business analytsts or data miners, I am sure the content would have been unchanged though the popularity (and ambiguity) of the word hacker can often substitute for its usefulness. Indeed the many wise and learned Professors of statistics departments through out the civilized world would be mildly surprised and bemused by their day to day activities as hacking or teaching hackers. The book follows a case study and example based approach and uses the GGPLOT2 package within R programming almost to the point of ignoring any other native graphics system based in R. It can be quite useful for the aspiring reader who wishes to understand and join the booming market for skilled talent in statistical computing.

Chapter 1 has a very useful set of functions for data cleansing and formatting. It walks you through the basics of formatting based on dates and conditions, missing value and outlier treatment and using ggplot package in R for graphical analysis. The case study used is an Infochimps dataset with 60,000 recordings of UFO sightings. The case study is lucid, and done at a extremely helpful pace illustrating the powerful and flexible nature of R functions that can be used for data cleansing.The chapter mentions text editors and IDEs but fails to list them in a tabular format, while listing several other tables like Packages used in the book. It also jumps straight from installation instructions to functions in R without getting into the various kinds of data types within R or specifying where these can be referenced from. It thus assumes a higher level of basic programming understanding for the reader than the average R book.

Chapter 2 discusses data exploration, and has a very clear set of diagrams that explain the various data summary operations that are performed routinely. This is an innovative approach and will help students or newcomers to the field of data analysis. It introduces the reader to type determination functions, as well different kinds of encoding. The introduction to creating functions is quite elegant and simple , and numerical summary methods are explained adequately. While the chapter explains data exploration with the help of various histogram options in ggplot2 , it fails to create a more generic framework for data exploration or rules to assist the reader in visual data exploration in non standard data situations. While the examples are very helpful for a reader , there needs to be slightly more depth to step out of the example and into a framework for visual data exploration (or references for the same). A couple of case studies however elaborately explained cannot do justice to the vast field of data exploration and especially visual data exploration.

Chapter 3 discussed binary classification for the specific purpose for spam filtering using a dataset from SpamAssassin. It introduces the reader to the naïve Bayes classifier and the principles of text mining suing the tm package in R. Some of the example codes could have been better commented for easier readability in the book. Overall it is quite a easy tutorial for creating a naïve Bayes classifier even for beginners.

Chapter 4 discusses the issues in importance ranking and creating recommendation systems specifically in the case of ordering email messages into important and not important. It introduces the useful grepl, gsub, strsplit, strptime ,difftime and strtrim functions for parsing data. The chapter further introduces the reader to the concept of log (and affine) transformations in a lucid and clear way that can help even beginners learn this powerful transformation concept. Again the coding within this chapter is sparsely commented which can cause difficulties to people not used to learn reams of code. ( it may have been part of the code attached with the book, but I am reading an electronic book and I did not find an easy way to go back and forth between the code and the book). The readability of the chapters would be further enhanced by the use of flow charts explaining the path and process followed than overtly verbose textual descriptions running into multiple pages. The chapters are quite clearly written, but a helpful visual summary can help in both revising the concepts and elucidate the approach taken further.A suggestion for the authors could be to compile the list of useful functions they introduce in this book as a sort of reference card (or Ref Card) for R Hackers or atleast have a chapter wise summary of functions, datasets and packages used.

Chapter 5 discusses linear regression , and it is a surprising and not very good explanation of regression theory in the introduction to regression. However the chapter makes up in practical example what it oversimplifies in theory. The chapter on regression is not the finest chapter written in this otherwise excellent book. Part of this is because of relative lack of organization- correlation is explained after linear regression is explained. Once again the lack of a function summary and a process flow diagram hinders readability and a separate section on regression metrics that help make a regression result good or not so good could be a welcome addition. Functions introduced include lm.

Chapter 6 showcases Generalized Additive Model (GAM) and Polynomial Regression, including an introduction to singularity and of over-fitting. Functions included in this chapter are transform, and poly while the package glmnet is also used here. The chapter also introduces the reader formally to the concept of cross validation (though examples of cross validation had been introduced in earlier chapters) and regularization. Logistic regression is also introduced at the end in this chapter.

Chapter 7 is about optimization. It describes error metric in a very easy to understand way. It creates a grid by using nested loops for various values of intercept and slope of a regression equation and computing the sum of square of errors. It then describes the optim function in detail including how it works and it’s various parameters. It introduces the curve function. The chapter then describes ridge regression including definition and hyperparameter lamda. The use of optim function to optimize the error in regression is useful learning for the aspiring hacker. Lastly it describes a case study of breaking codes using the simplistic Caesar cipher, a lexical database and the Metropolis method. Functions introduced in this chapter include .Machine$double.eps .

Chapter 8 deals with Principal Component Analysis and unsupervised learning. It uses the ymd function from lubridate package to convert string to date objects, and the cast function from reshape package to further manipulate the structure of data. Using the princomp functions enables PCA in R.The case study creates a stock market index and compares the results with the Dow Jones index.

Chapter 9 deals with Multidimensional Scaling as well as clustering US senators on the basis of similarity in voting records on legislation .It showcases matrix multiplication using %*% and also the dist function to compute distance matrix.

Chapter 10 has the subject of K Nearest Neighbors for recommendation systems. Packages used include class ,reshape and and functions used include cor, function and log. It also demonstrates creating a custom kNN function for calculating Euclidean distance between center of centroids and data. The case study used is the R package recommendation contest on Kaggle. Overall a simplistic introduction to creating a recommendation system using K nearest neighbors, without getting into any of the prepackaged packages within R that deal with association analysis , clustering or recommendation systems.

Chapter 11 introduces the reader to social network analysis (and elements of graph theory) using the example of Erdos Number as an interesting example of social networks of mathematicians. The example of Social Graph API by Google for hacking are quite new and intriguing (though a bit obsolete by changes, and should be rectified in either the errata or next edition) . However there exists packages within R that should be atleast referenced or used within this chapter (like TwitteR package that use the Twitter API and ROauth package for other social networks). Packages used within this chapter include Rcurl, RJSONIO, and igraph packages of R and functions used include rbind and ifelse. It also introduces the reader to the advanced software Gephi. The last example is to build a recommendation engine for whom to follow in Twitter using R.

Chapter 12 is about model comparison and introduces the concept of Support Vector Machines. It uses the package e1071 and shows the svm function. It also introduces the concept of tuning hyper parameters within default algorithms . A small problem in understanding the concepts is the misalignment of diagram pages with the relevant code. It lastly concludes with using mean square error as a method for comparing models built with different algorithms.

 

Overall the book is a welcome addition in the library of books based on R programming language, and the refreshing nature of the flow of material and the practicality of it’s case studies make this a recommended addition to both academic and corporate business analysts trying to derive insights by hacking lots of heterogeneous data.

Have a look for yourself at-
http://shop.oreilly.com/product/0636920018483.do

Using R for Cloud Computing – made very easy and free by BioConductor

I really liked the no hassles way Biocnoductor has put a cloud AMI loaded with RStudio to help people learn R, and even try using R from within a browser in the cloud.

Not only is the tutorial very easy to use- they also give away 2 hours for free computing!!!

Check it out-

Step 1

Step 2

Step 3

and wow! I am using Google Chrome to run R ..and its awesome!

Interesting- check out two hours for free — all you need is a browser and internet connection

http://www.bioconductor.org/help/cloud/

Using Google Analytics API with R:dimensions and metrics

I modified the query I wrote earlier  at http://www.decisionstats.com/using-google-analytics-with-r/to get multiple dimensions and metrics from the Google Analytics API, like hour of day,day of week to get cyclical parameters.We are adding the dimensions, and metrics to bring more depth in our analysis.Basically we are trying to do a time series analysis for forecasting web analytics data( which is basically time -stamped and rich in details ).

Basically I am modifying the dimensions and metrics parameters of the query code using the list at

http://code.google.com/apis/analytics/docs/gdata/dimsmets/dimsmets.html

 

query <- QueryBuilder()
query$Init(start.date = "2011-08-20",
                   end.date = "2012-08-25",
                   dimensions = c("ga:date","ga:hour","ga:dayOfWeek"),
                   metrics = c("ga:visitors","ga:visits","ga:pageviews","ga:timeOnSite"),
                   sort = c("ga:date","ga:hour","ga:dayOfWeek"),
                   table.id = paste(profiles$profile[3,3]))

#5. Make a request to get the data from the API 

ga.data <- ga$GetReportData(query)

#6. Look at the returned data 

str(ga.data)

head(ga.data$data) 

and we need the lubridate package to create a ymd:hour (time stamp)    since GA gives data aggregated at a hourly level at most. Also we need to smoothen the effect of weekend on web analytics data.

#Using package lubridate to convert character dates into time

library(lubridate)
ga.data$data[,1]=ymd(ga.data$data[,1])
ls()
dataset1=ga.data$data
names(dataset1) <- make.names(names(dataset1))
str(dataset1)
head(dataset1)

To be continued-

Web Analytics using R , Google Analytics and TS Forecasting

This is a continuation of the previous post on using Google Analytics .

Now that we have downloaded and plotted the data- we try and fit time series to the website data to forecast future traffic.

Some observations-

1) Google Analytics has 0 predictive analytics, it is just descriptive analytics and data visualization models (including the recent social analytics). However you can very well add in basic TS function using R to the GA API.

Why do people look at Website Analytics? To know today’s traffic and derive insights for the Future

2) Web Data clearly follows a 7 day peak and trough for weekly effects (weekdays and weekends), this is also true for hourly data …and this can be used for smoothing historic web data for future forecast.

3) On an advanced level, any hugely popular viral posts can be called a level shift (not drift) and accoringly dampened.

Test and Control!

Similarly using ARIMAX, we can factor in quantity and tag of posts as X regressor variables.

and now the code-( dont laugh at the simplicity please, I am just tinkering and playing with data here!)

You need to copy and paste the code at the bottom of   this post  http://www.decisionstats.com/using-google-analytics-with-r/ if you want to download your GA data down first.

Note I am using lubridate ,forecast and timeSeries packages in this section.

#Plotting the Traffic  plot(ga.data$data[,2],type="l") 

library(timeSeries)
library(forecast)

#Using package lubridate to convert character dates into time
library(lubridate)
ga.data$data[,1]=ymd(ga.data$data[,1])
ls()
dataset1=ga.data$data
names(dataset1) <- make.names(names(dataset1))
str(dataset1)
head(dataset1)
dataset2 <- ts(dataset1$ga.visitors,start=0,frequency = frequency(dataset1$ga.visitors), names=dataset1$ga.date)
str(dataset2)
head(dataset2)
ts.test=dataset2[1:200]
ts.control=dataset2[201:275]

 #Note I am splitting the data into test and control here

fitets=ets(ts.test)
plot(fitets)
testets=ets(ts.control,model=fitets)
accuracy(testets)
plot(testets)
spectrum(ts.test,method='ar')
decompose(ts.test)

library("TTR")
bb=SMA(dataset2,n=7)#We are doing a simple moving average for every 7 days. Note this can be 24 hrs for hourly data, or 30 days for daily data for month # 

to month comparison or 12 months for annual
#We notice that Web Analytics needs sommethening for every 7 thday as there is some relation to traffic on weekedays /weekends /same time last week
head(dataset2,40)
head(bb,40)

par(mfrow=c(2,1))
plot(bb,type="l",main="Using Seven Day Moving Average for Web Visitors")
plot(dataset2,main="Original Data")

Created by Pretty R at inside-R.org

Though I still wonder why the R query, gA R code /package could not be on the cloud (why it  needs to be downloaded)– cloud computing Gs?

Also how about adding some MORE predictive analytics to Google Analytics, chaps!

To be continued-

auto.arima() and forecasts!!!

cross validations!!!

and adapting the idiosyncratic periods and cycles  of web analytics to time series !!