FaceBook IPO- Who hacked whom?

Some thoughts on the FB IPO-

1) Is Zuck reading emails on his honeymoon? Where is he?

2) In 3 days FB lost 34 billion USD in market valuation. Thats enough to buy AOL,Yahoo, LinkedIn and Twitter (combined)

3) People are now shorting FB based on 3-4 days of trading performance. Maybe they know more ARIMA !

4) Who made money on the over-pricing in terms on employees who sold on 1 st day, financial bankers who did the same?

5) Who lost money on the first three days due to Nasdaq’s problems?

6) What is the exact technical problem that Nasdaq had?

7) The much deplored FaceBook Price/Earnings ratio (99) is still comparable to AOL’s (85) and much less than LI (620!). see http://www.google.com/finance?cid=296878244325128

8) Maybe FB can stop copying Google’s ad model (which Google invented) and go back to the drawing table. Like a FB kind of Paypal

9) There are more experts on the blogosphere than experts in Wall Street.

10) No blogger is willing to admit that they erred in the optimism on the great white IPO hope.

I did. Mea culpa. I thought FB is a good stock. I would buy it still- but the rupee tanked by 10% since past 1 week against the dollar.

 

I am now waiting for Chinese social network market to open with IPO’s. Thats walled gardens within walled gardens of Jade and Bamboo.

Related- Art Work of Another 100 billion dollar company (2006)

Book Review- Machine Learning for Hackers

This is review of the fashionably named book Machine Learning for Hackers by Drew Conway and John Myles White (O’Reilly ). The book is about hacking code in R.

 

The preface introduces the reader to the authors conception of what machine learning and hacking is all about. If the name of the book was machine learning for business analytsts or data miners, I am sure the content would have been unchanged though the popularity (and ambiguity) of the word hacker can often substitute for its usefulness. Indeed the many wise and learned Professors of statistics departments through out the civilized world would be mildly surprised and bemused by their day to day activities as hacking or teaching hackers. The book follows a case study and example based approach and uses the GGPLOT2 package within R programming almost to the point of ignoring any other native graphics system based in R. It can be quite useful for the aspiring reader who wishes to understand and join the booming market for skilled talent in statistical computing.

Chapter 1 has a very useful set of functions for data cleansing and formatting. It walks you through the basics of formatting based on dates and conditions, missing value and outlier treatment and using ggplot package in R for graphical analysis. The case study used is an Infochimps dataset with 60,000 recordings of UFO sightings. The case study is lucid, and done at a extremely helpful pace illustrating the powerful and flexible nature of R functions that can be used for data cleansing.The chapter mentions text editors and IDEs but fails to list them in a tabular format, while listing several other tables like Packages used in the book. It also jumps straight from installation instructions to functions in R without getting into the various kinds of data types within R or specifying where these can be referenced from. It thus assumes a higher level of basic programming understanding for the reader than the average R book.

Chapter 2 discusses data exploration, and has a very clear set of diagrams that explain the various data summary operations that are performed routinely. This is an innovative approach and will help students or newcomers to the field of data analysis. It introduces the reader to type determination functions, as well different kinds of encoding. The introduction to creating functions is quite elegant and simple , and numerical summary methods are explained adequately. While the chapter explains data exploration with the help of various histogram options in ggplot2 , it fails to create a more generic framework for data exploration or rules to assist the reader in visual data exploration in non standard data situations. While the examples are very helpful for a reader , there needs to be slightly more depth to step out of the example and into a framework for visual data exploration (or references for the same). A couple of case studies however elaborately explained cannot do justice to the vast field of data exploration and especially visual data exploration.

Chapter 3 discussed binary classification for the specific purpose for spam filtering using a dataset from SpamAssassin. It introduces the reader to the naïve Bayes classifier and the principles of text mining suing the tm package in R. Some of the example codes could have been better commented for easier readability in the book. Overall it is quite a easy tutorial for creating a naïve Bayes classifier even for beginners.

Chapter 4 discusses the issues in importance ranking and creating recommendation systems specifically in the case of ordering email messages into important and not important. It introduces the useful grepl, gsub, strsplit, strptime ,difftime and strtrim functions for parsing data. The chapter further introduces the reader to the concept of log (and affine) transformations in a lucid and clear way that can help even beginners learn this powerful transformation concept. Again the coding within this chapter is sparsely commented which can cause difficulties to people not used to learn reams of code. ( it may have been part of the code attached with the book, but I am reading an electronic book and I did not find an easy way to go back and forth between the code and the book). The readability of the chapters would be further enhanced by the use of flow charts explaining the path and process followed than overtly verbose textual descriptions running into multiple pages. The chapters are quite clearly written, but a helpful visual summary can help in both revising the concepts and elucidate the approach taken further.A suggestion for the authors could be to compile the list of useful functions they introduce in this book as a sort of reference card (or Ref Card) for R Hackers or atleast have a chapter wise summary of functions, datasets and packages used.

Chapter 5 discusses linear regression , and it is a surprising and not very good explanation of regression theory in the introduction to regression. However the chapter makes up in practical example what it oversimplifies in theory. The chapter on regression is not the finest chapter written in this otherwise excellent book. Part of this is because of relative lack of organization- correlation is explained after linear regression is explained. Once again the lack of a function summary and a process flow diagram hinders readability and a separate section on regression metrics that help make a regression result good or not so good could be a welcome addition. Functions introduced include lm.

Chapter 6 showcases Generalized Additive Model (GAM) and Polynomial Regression, including an introduction to singularity and of over-fitting. Functions included in this chapter are transform, and poly while the package glmnet is also used here. The chapter also introduces the reader formally to the concept of cross validation (though examples of cross validation had been introduced in earlier chapters) and regularization. Logistic regression is also introduced at the end in this chapter.

Chapter 7 is about optimization. It describes error metric in a very easy to understand way. It creates a grid by using nested loops for various values of intercept and slope of a regression equation and computing the sum of square of errors. It then describes the optim function in detail including how it works and it’s various parameters. It introduces the curve function. The chapter then describes ridge regression including definition and hyperparameter lamda. The use of optim function to optimize the error in regression is useful learning for the aspiring hacker. Lastly it describes a case study of breaking codes using the simplistic Caesar cipher, a lexical database and the Metropolis method. Functions introduced in this chapter include .Machine$double.eps .

Chapter 8 deals with Principal Component Analysis and unsupervised learning. It uses the ymd function from lubridate package to convert string to date objects, and the cast function from reshape package to further manipulate the structure of data. Using the princomp functions enables PCA in R.The case study creates a stock market index and compares the results with the Dow Jones index.

Chapter 9 deals with Multidimensional Scaling as well as clustering US senators on the basis of similarity in voting records on legislation .It showcases matrix multiplication using %*% and also the dist function to compute distance matrix.

Chapter 10 has the subject of K Nearest Neighbors for recommendation systems. Packages used include class ,reshape and and functions used include cor, function and log. It also demonstrates creating a custom kNN function for calculating Euclidean distance between center of centroids and data. The case study used is the R package recommendation contest on Kaggle. Overall a simplistic introduction to creating a recommendation system using K nearest neighbors, without getting into any of the prepackaged packages within R that deal with association analysis , clustering or recommendation systems.

Chapter 11 introduces the reader to social network analysis (and elements of graph theory) using the example of Erdos Number as an interesting example of social networks of mathematicians. The example of Social Graph API by Google for hacking are quite new and intriguing (though a bit obsolete by changes, and should be rectified in either the errata or next edition) . However there exists packages within R that should be atleast referenced or used within this chapter (like TwitteR package that use the Twitter API and ROauth package for other social networks). Packages used within this chapter include Rcurl, RJSONIO, and igraph packages of R and functions used include rbind and ifelse. It also introduces the reader to the advanced software Gephi. The last example is to build a recommendation engine for whom to follow in Twitter using R.

Chapter 12 is about model comparison and introduces the concept of Support Vector Machines. It uses the package e1071 and shows the svm function. It also introduces the concept of tuning hyper parameters within default algorithms . A small problem in understanding the concepts is the misalignment of diagram pages with the relevant code. It lastly concludes with using mean square error as a method for comparing models built with different algorithms.

 

Overall the book is a welcome addition in the library of books based on R programming language, and the refreshing nature of the flow of material and the practicality of it’s case studies make this a recommended addition to both academic and corporate business analysts trying to derive insights by hacking lots of heterogeneous data.

Have a look for yourself at-
http://shop.oreilly.com/product/0636920018483.do

Facebook and R

Part 1 How do people at Facebook use R?

tamar Rosenn, Facebook

Itamar conveyed how Facebook’s Data Team used R in 2007 to answer two questions about new users: (i) which data points predict whether a user will stay? and (ii) if they stay, which data points predict how active they’ll be after three months?

For the first question, Itamar’s team used recursive partitioning (via the rpartpackage) to infer that just two data points are significantly predictive of whether a user remains on Facebook: (i) having more than one session as a new user, and (ii) entering basic profile information.

For the second question, they fit the data to a logistic model using a least angle regression approach (via the lars package), and found that activity at three months was predicted by variables related to three classes of behavior: (i) how often a user was reached out to by others, (ii) frequency of third party application use, and (iii) what Itamar termed “receptiveness” — related to how forthcoming a user was on the site.

source-http://www.dataspora.com/2009/02/predictive-analytics-using-r/

and cute graphs like the famous

https://www.facebook.com/notes/facebook-engineering/visualizing-friendships/469716398919

 

and

studying baseball on facebook

https://www.facebook.com/notes/facebook-data-team/baseball-on-facebook/10150142265858859

by counting the number of posts that occurred the day after a team lost divided by the total number of wins, since losses for great teams are remarkable and since winning teams’ fans just post more.

 

But mostly at

https://www.facebook.com/data?sk=notes and https://www.facebook.com/data?v=app_4949752878

 

and creating new packages

1. jjplot (not much action here!)

https://r-forge.r-project.org/scm/viewvc.php/?root=jjplot

though

I liked the promise of JJplot at

http://pleasescoopme.com/2010/03/31/using-jjplot-to-explore-tipping-behavior/

2. ising models

https://github.com/slycoder/Rflim

https://www.facebook.com/note.php?note_id=10150359708746212

3. R pipe

https://github.com/slycoder/Rpipe

 

even the FB interns are cool

http://brenocon.com/blog/2009/02/comparison-of-data-analysis-packages-r-matlab-scipy-excel-sas-spss-stata/

 

Part 2 How do people with R use Facebook?

Using the API at https://developers.facebook.com/tools/explorer

and code mashes from

 

http://romainfrancois.blog.free.fr/index.php?post/2012/01/15/Crawling-facebook-with-R

http://applyr.blogspot.in/2012/01/mining-facebook-data-most-liked-status.html

but the wonderful troubleshooting code from http://www.brocktibert.com/blog/2012/01/19/358/

which needs to be added to the code first

 

and using network package

>access_token=”XXXXXXXXXXXX”

Annoyingly the Facebook token can expire after some time, this can lead to huge wait and NULL results with Oauth errors

If that happens you need to regenerate the token

What we need
> require(RCurl)
> require(rjson)
> download.file(url=”http://curl.haxx.se/ca/cacert.pem”, destfile=”cacert.pem”)

Roman’s Famous Facebook Function (altered)

> facebook <- function( path = “me”, access_token , options){
+ if( !missing(options) ){
+ options <- sprintf( “?%s”, paste( names(options), “=”, unlist(options), collapse = “&”, sep = “” ) )
+ } else {
+ options <- “”
+ }
+ data <- getURL( sprintf( “https://graph.facebook.com/%s%s&access_token=%s&#8221;, path, options, access_token ), cainfo=”cacert.pem” )
+ fromJSON( data )
+ }

 

Now getting the friends list
> friends <- facebook( path=”me/friends” , access_token=access_token)
> # extract Facebook IDs
> friends.id <- sapply(friends$data, function(x) x$id)
> # extract names
> friends.name <- sapply(friends$data, function(x) iconv(x$name,”UTF-8″,”ASCII//TRANSLIT”))
> # short names to initials
> initials <- function(x) paste(substr(x,1,1), collapse=””)
> friends.initial <- sapply(strsplit(friends.name,” “), initials)

This matrix can take a long time to build, so you can change the value of N to say 40 to test your network. I needed to press the escape button to cut short the plotting of all 400 friends of mine.
> # friendship relation matrix
> N <- length(friends.id)
> friendship.matrix <- matrix(0,N,N)
> for (i in 1:N) {
+ tmp <- facebook( path=paste(“me/mutualfriends”, friends.id[i], sep=”/”) , access_token=access_token)
+ mutualfriends <- sapply(tmp$data, function(x) x$id)
+ friendship.matrix[i,friends.id %in% mutualfriends] <- 1
+ }

 

Plotting using Network package in R (with help from the  comments at http://applyr.blogspot.in/2012/01/mining-facebook-data-most-liked-status.html)

> require(network)

>net1<- as.network(friendship.matrix)

> plot(net1, label=friends.initial, arrowhead.cex=0)

(Rgraphviz is tough if you are on Windows 7 like me)

but there is an alternative igraph solution at https://github.com/sciruela/facebookFriends/blob/master/facebook.r

 

After all that-..talk.. a graph..of my Facebook Network with friends initials as labels..

 

Opinion piece-

I hope plans to make the Facebook R package get fulfilled (just as the twitteR  package led to many interesting analysis)

and also Linkedin has an API at http://developer.linkedin.com/apis

I think it would be interesting to plot professional relationships across social networks as well. But I hope to see a LinkedIn package (or blog code) soon.

As for jjplot, I had hoped ggplot and jjplot merged or atleast had some kind of inclusion in the Deducer GUI. Maybe a Google Summer of Code project if people are busy!!

Also the geeks at Facebook.com can think of giving something back to the R community, as Google generously does with funding packages like RUnit, Deducer and Summer of Code, besides sponsoring meet ups etc.

 

(note – this is part of the research for the upcoming book ” R for Business Analytics”)

 

ps-

but didnt get time to download all my posts using R code at

https://gist.github.com/1634662#

or do specific Facebook Page analysis using R at

http://tonybreyal.wordpress.com/2012/01/06/r-web-scraping-r-bloggers-facebook-page-to-gain-further-information-about-an-authors-r-blog-posts-e-g-number-of-likes-comments-shares-etc/

Updated-

 #access token from https://developers.facebook.com/tools/explorer
access_token="AAuFgaOcVaUZAssCvL9dPbZCjghTEwwhNxZAwpLdZCbw6xw7gARYoWnPHxihO1DcJgSSahd67LgZDZD"
require(RCurl)
 require(rjson)
# download the file needed for authentication http://www.brocktibert.com/blog/2012/01/19/358/
download.file(url="http://curl.haxx.se/ca/cacert.pem", destfile="cacert.pem")
# http://romainfrancois.blog.free.fr/index.php?post/2012/01/15/Crawling-facebook-with-R
facebook <- function( path = "me", access_token = token, options){
if( !missing(options) ){
options <- sprintf( "?%s", paste( names(options), "=", unlist(options), collapse = "&", sep = "" ) )
} else {
options <- ""
}
data <- getURL( sprintf( "https://graph.facebook.com/%s%s&access_token=%s", path, options, access_token ), cainfo="cacert.pem" )
fromJSON( data )
}

 # see http://applyr.blogspot.in/2012/01/mining-facebook-data-most-liked-status.html

# scrape the list of friends
friends <- facebook( path="me/friends" , access_token=access_token)
# extract Facebook IDs
friends.id <- sapply(friends$data, function(x) x$id)
# extract names 
friends.name <- sapply(friends$data, function(x)  iconv(x$name,"UTF-8","ASCII//TRANSLIT"))
# short names to initials 
initials <- function(x) paste(substr(x,1,1), collapse="")
friends.initial <- sapply(strsplit(friends.name," "), initials)

# friendship relation matrix
#N <- length(friends.id)
N <- 200
friendship.matrix <- matrix(0,N,N)
for (i in 1:N) {
  tmp <- facebook( path=paste("me/mutualfriends", friends.id[i], sep="/") , access_token=access_token)
  mutualfriends <- sapply(tmp$data, function(x) x$id)
  friendship.matrix[i,friends.id %in% mutualfriends] <- 1
}
require(network)
net1<- as.network(friendship.matrix)
plot(net1, label=friends.initial, arrowhead.cex=0)

Created by Pretty R at inside-R.org