Using Google Image Search for OkCupid Profile Images

  1. Suppose you like someone on OkCupid. Click to navigate to their photos.
  2. Click "Save as" to save the webpage completely. All the images are now in a folder on your laptop.
  3. Upload them to Google Image Search one by one. The biggest privacy drawback: Google Image Search doesn't find Instagram, Twitter, Facebook or LinkedIn profile images very well, but it does extremely well with Google Plus profile images. Lolcat!
  4. Solution- have two sets of photos that make you look good, one for friends, the other for OkCupid or other online dating shenanigans!

I interviewed Stephen, an awesome face recognition hacker, and I think his solution is the best for privacy- use a robots.txt equivalent for images, just as you use one for websites.
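Today's robots.txt can already keep images out of Google Image Search, via Google's real Googlebot-Image user agent; an image-privacy standard could generalize the same convention to face-recognition crawlers. A minimal sketch (the /photos/ path is hypothetical):

```
# Keep profile photos out of Google Image Search
User-agent: Googlebot-Image
Disallow: /photos/

# Let everything else be crawled normally
User-agent: *
Disallow:
```

Of course this only works for crawlers that choose to honor it, which is exactly the part a standard would need to fix.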

I hope they find a solution, in time for wearable computing to take off properly.

In the meantime, I am telling people on OkCupid that I am a fortune teller. It is working really well!

Facebook IPO- Who hacked whom?

Some thoughts on the FB IPO-

1) Is Zuck reading emails on his honeymoon? Where is he?

2) In 3 days FB lost $34 billion in market valuation. That's enough to buy AOL, Yahoo, LinkedIn and Twitter (combined).

3) People are now shorting FB based on 3-4 days of trading performance. Maybe they know more ARIMA than the rest of us!

4) Who made money on the over-pricing? The employees who sold on the first day, and the bankers who did the same?

5) Who lost money on the first three days due to Nasdaq’s problems?

6) What is the exact technical problem that Nasdaq had?

7) The much-deplored Facebook price/earnings ratio (99) is still comparable to AOL's (85), and much less than LinkedIn's (620!).

8) Maybe FB can stop copying Google's ad model (which Google invented) and go back to the drawing board. Like an FB kind of PayPal.

9) There are more experts in the blogosphere than there are on Wall Street.

10) No blogger is willing to admit that they erred in their optimism about the great white IPO hope.

I did. Mea culpa. I thought FB was a good stock. I would still buy it- but the rupee has tanked 10% against the dollar in the past week.
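On point 3: with only 3-4 trading days of data, no time-series model is meaningful. A toy sketch of what "knowing more ARIMA" even involves, on simulated data (38 is FB's IPO price; everything else here is invented):

```r
# Simulate 100 days of a stationary AR(1) series around $38 (FB's IPO
# price; everything else is invented). With only the 3-4 real trading
# days available at the time, no such fit would be meaningful.
set.seed(42)
prices <- 38 + arima.sim(model = list(ar = 0.6), n = 100)
fit <- arima(prices, order = c(1, 0, 0))  # fit an AR(1) model
coef(fit)  # ar1 coefficient and intercept
```

The shorts would need far more history than a few post-IPO sessions before a fit like this says anything.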


I am now waiting for the Chinese social network market to open with IPOs. That's walled gardens within walled gardens of jade and bamboo.

Related- Art Work of Another 100 billion dollar company (2006)

Facebook and R

Part 1: How do people at Facebook use R?

Itamar Rosenn, Facebook

Itamar conveyed how Facebook’s Data Team used R in 2007 to answer two questions about new users: (i) which data points predict whether a user will stay? and (ii) if they stay, which data points predict how active they’ll be after three months?

For the first question, Itamar's team used recursive partitioning (via the rpart package) to infer that just two data points are significantly predictive of whether a user remains on Facebook: (i) having more than one session as a new user, and (ii) entering basic profile information.
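For flavor, here is a minimal sketch of that kind of recursive-partitioning analysis on simulated data. The variable names and the data-generating process are invented for illustration; only the rpart package and the two predictors come from the post.

```r
library(rpart)  # recursive partitioning, the package named above

# Simulated new-user data; variable names and effect sizes are invented.
set.seed(1)
n <- 1000
users <- data.frame(
  sessions       = rpois(n, 2),       # number of sessions as a new user
  filled_profile = rbinom(n, 1, 0.5)  # entered basic profile information?
)
# Make retention actually depend on the two predictors described above
p_stay <- plogis(-2 + 1.5 * (users$sessions > 1) + 1.2 * users$filled_profile)
users$stayed <- factor(rbinom(n, 1, p_stay), labels = c("left", "stayed"))

tree <- rpart(stayed ~ sessions + filled_profile, data = users, method = "class")
print(tree)
```

On real data the interesting part is which splits rpart refuses to make- here, any variable outside those two would simply never appear in the tree.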

For the second question, they fit the data to a logistic model using a least angle regression approach (via the lars package), and found that activity at three months was predicted by variables related to three classes of behavior: (i) how often a user was reached out to by others, (ii) frequency of third party application use, and (iii) what Itamar termed “receptiveness” — related to how forthcoming a user was on the site.
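A base-R sketch of the shape of that second model, using glm() as a stand-in (the team used least angle regression via the lars package; swap that in for real variable selection). All variable names and data below are simulated, not Facebook's.

```r
# Simulated sketch of the three-month-activity model. The team used least
# angle regression (lars package) for variable selection; plain glm()
# stands in here. All variables and data are invented.
set.seed(2)
n <- 500
d <- data.frame(
  msgs_received = rpois(n, 5),   # how often others reached out to the user
  app_uses      = rpois(n, 3),   # third-party application use
  profile_items = rpois(n, 4)    # "receptiveness": how forthcoming the user is
)
p_active <- plogis(-3 + 0.2 * d$msgs_received + 0.3 * d$app_uses +
                   0.2 * d$profile_items)
d$active_3mo <- rbinom(n, 1, p_active)

fit <- glm(active_3mo ~ msgs_received + app_uses + profile_items,
           data = d, family = binomial)
summary(fit)$coefficients
```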


and cute graphs, like the famous ones



studying baseball on Facebook

by counting the number of posts that occurred the day after a team lost, divided by the team's total number of wins- since losses for great teams are remarkable, and since winning teams' fans just post more.
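That normalization is easy to sketch in R on invented data:

```r
# Toy data: fan posts the day after each game, for two invented teams
games <- data.frame(
  team       = c("A", "A", "A", "B", "B", "B"),
  result     = c("win", "loss", "win", "loss", "loss", "win"),
  posts_next = c(10, 40, 12, 20, 25, 8)
)
# posts the day after a loss, divided by the team's win count
loss_posts <- tapply(games$posts_next * (games$result == "loss"), games$team, sum)
wins       <- tapply(games$result == "win", games$team, sum)
loss_posts / wins
```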




and creating new packages

1. jjplot (not much action here!)


I liked the promise of jjplot at

2. Ising models

3. R pipe


Even the FB interns are cool.


Part 2: How do people with R use Facebook?

Using the API at

and code mashes from

but the wonderful troubleshooting code from

which needs to be added to the code first


and using network package


Annoyingly, the Facebook token can expire after some time; this can lead to huge waits and NULL results with OAuth errors.

If that happens, you need to regenerate the token.

What we need
> require(RCurl)
> require(rjson)
> download.file(url="", destfile="cacert.pem")

Roman’s Famous Facebook Function (altered)

> facebook <- function( path = "me", access_token, options){
+   if( !missing(options) ){
+     options <- sprintf( "?%s", paste( names(options), "=", unlist(options), collapse = "&", sep = "" ) )
+   } else {
+     options <- ""
+   }
+   data <- getURL( sprintf( "https://graph.facebook.com/%s%s&access_token=%s", path, options, access_token ), cainfo="cacert.pem" )
+   fromJSON( data )
+ }


Now getting the friends list
> friends <- facebook( path="me/friends" , access_token=access_token)
> # extract Facebook IDs
> friends.id <- sapply(friends$data, function(x) x$id)
> # extract names
> friends.name <- sapply(friends$data, function(x) iconv(x$name,"UTF-8","ASCII//TRANSLIT"))
> # short names to initials
> initials <- function(x) paste(substr(x,1,1), collapse="")
> friends.initial <- sapply(strsplit(friends.name," "), initials)

This matrix can take a long time to build, so you can change the value of N to, say, 40 to test your network. I needed to press the Escape key to cut short the plotting of all 400 of my friends.
> # friendship relation matrix
> N <- length(friends.id)
> friendship.matrix <- matrix(0,N,N)
> for (i in 1:N) {
+   tmp <- facebook( path=paste("me/mutualfriends", friends.id[i], sep="/") , access_token=access_token)
+   mutualfriends <- sapply(tmp$data, function(x) x$id)
+   friendship.matrix[i, friends.id %in% mutualfriends] <- 1
+ }


Plotting using the network package in R (with help from the comments at

> require(network)
> net1 <- as.network(friendship.matrix)
> plot(net1, label=friends.initial, arrowhead.cex=0)

(Rgraphviz is tough if you are on Windows 7 like me)

but there is an alternative igraph solution at


After all, a graph of my Facebook network, with friends' initials as labels.


Opinion piece-

I hope plans to make a Facebook R package get fulfilled (just as the twitteR package led to many interesting analyses)

and LinkedIn also has an API at

I think it would be interesting to plot professional relationships across social networks as well. But I hope to see a LinkedIn package (or blog code) soon.

As for jjplot, I had hoped ggplot2 and jjplot would merge, or at least that jjplot would get some kind of inclusion in the Deducer GUI. Maybe a Google Summer of Code project, if people are busy!!

Also, the geeks at Facebook can think of giving something back to the R community, as Google generously does by funding packages like RUnit and Deducer and the Summer of Code, besides sponsoring meetups etc.


(note- this is part of the research for the upcoming book "R for Business Analytics")



but I didn't get time to download all my posts using the R code at

or do specific Facebook Page analysis using R at


# access token from
# download the file needed for authentication
download.file(url="", destfile="cacert.pem")
facebook <- function( path = "me", access_token = token, options){
  if( !missing(options) ){
    options <- sprintf( "?%s", paste( names(options), "=", unlist(options), collapse = "&", sep = "" ) )
  } else {
    options <- ""
  }
  data <- getURL( sprintf( "https://graph.facebook.com/%s%s&access_token=%s", path, options, access_token ), cainfo="cacert.pem" )
  fromJSON( data )
}

 # see

# scrape the list of friends
friends <- facebook( path="me/friends" , access_token=access_token)
# extract Facebook IDs
friends.id <- sapply(friends$data, function(x) x$id)
# extract names
friends.name <- sapply(friends$data, function(x) iconv(x$name,"UTF-8","ASCII//TRANSLIT"))
# short names to initials
initials <- function(x) paste(substr(x,1,1), collapse="")
friends.initial <- sapply(strsplit(friends.name," "), initials)

# friendship relation matrix
# N <- length(friends.id)   # use the full friend list
N <- 200
friendship.matrix <- matrix(0,N,N)
for (i in 1:N) {
  tmp <- facebook( path=paste("me/mutualfriends", friends.id[i], sep="/") , access_token=access_token)
  mutualfriends <- sapply(tmp$data, function(x) x$id)
  friendship.matrix[i, friends.id %in% mutualfriends] <- 1
}
require(network)
net1 <- as.network(friendship.matrix)
plot(net1, label=friends.initial, arrowhead.cex=0)


Why LinkedIn and Twitter are up for grabs in 2012-14

Given Facebook's valuation at $60-$100 billion, Apple's $100 billion cash pile, Microsoft's cash of $52 billion and Google's cash of $43 billion, there is a lot of money floating around. I am not counting Amazon, as it deals with its own Fire issues.

But what is left to buy? In terms of richness of data available for mining for better advertising, it is Twitter and LinkedIn that have the best sources of data.

And LinkedIn is worth only 9 billion dollars, while Twitter is only 8.5 billion. Throw in a competitive-dynamics premium, and you can get 50% of both these companies at 13 billion dollars. If the owners don't want to sell 100%, well, buy a big, big stake.

It makes a good case- buy the company, buy the data, sell them ads, sell them better products.

What do you think?

Quantitative Modeling for Arbitrage Positions in Ad Keywords Internet Marketing

Assume you treat an ad keyword as an equity stock. There are slight differences in the cost of advertising for that keyword across various locations (Zurich vs Delhi) and various channels (Facebook vs Google). You get revenue if your website ranks naturally in organic search for the keyword, and you pay costs to get traffic to your website for that keyword.
An arbitrage position is defined as a riskless profit when the cost of a keyword is less than the revenue from that keyword. We take the examples of AdSense and AdWords primarily.
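A toy version of that arbitrage test in R (the function name and all numbers are invented):

```r
# Naive arbitrage check: in this toy sense, the position is profitable
# whenever expected revenue per click exceeds cost per click.
keyword_arbitrage <- function(cost_per_click, revenue_per_click, clicks) {
  margin <- (revenue_per_click - cost_per_click) * clicks
  list(arbitrage = margin > 0, expected_profit = margin)
}
keyword_arbitrage(cost_per_click = 0.40, revenue_per_click = 0.55, clicks = 1000)
```

In practice neither curve is fixed- cost and revenue per click both move with the factors discussed next, which is what makes the position risky rather than riskless.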
There are primarily two types of economic curves on whose foundation the commerce of the internet resides-
1) Cost Curve- cost of advertising to drive traffic into the website (Google AdWords, Twitter ads, Facebook ads, LinkedIn ads)
2) Revenue Curve- revenue from ads clicked by the incoming traffic on the website (like AdSense, LinkAds, banner ads, ad sharing programs, in-game ads)
The cost and revenue curves are primarily dependent on two things-
1) Type of keyword- also sub-dependent on
a) location of the prospective customer, and
b) net present value of the good or service to be eventually purchased
For example, a keyword targeting sales of enterprise "business intelligence software" should ideally cost, say, X times as much as keywords for "flower shop for birthdays", where X is the multiple of the expected payoff from sales of business intelligence software divided by the expected payoff from sales of flowers (say in Daytona Beach, Florida or Austin, Texas)
2) Traffic volume- also sub-dependent on the time series-
a) Seasonality- annual shopping cycle
b) Cyclicality- macroeconomic shifts in the time series
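The multiple X from the keyword example above, computed with invented payoffs:

```r
# Invented payoffs: deal size times conversion probability
expected_payoff_bi      <- 50000 * 0.02   # enterprise BI software sale
expected_payoff_flowers <- 60 * 0.05      # birthday flower sale
X <- expected_payoff_bi / expected_payoff_flowers
X  # the BI keyword could justify roughly X times the flower keyword's price
```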
The cost and revenue curves are not linear, and ideally should be continuous in a definite exponential or polynomial manner, but in reality they may have sharp inflections due to location, time, and web traffic volume thresholds.
Type of keyword- for example, keywords targeting sales of Eminem albums may shoot up in a non-linear manner after the musician dies.
The third, and not so publicly known, component of both the cost and revenue curves is factoring in internet industry dynamics, including the relative market shares of internet advertising platforms, as well as the percentage splits between content creators and ad-providing platforms.
For example, based on internet advertising spend, people believe that internet advertising is currently heading for a duopoly, with Google and Facebook as the top two players, while Microsoft/Skype/Yahoo and LinkedIn/Twitter offer niche options but primarily depend on price setting by Google/Bing/Facebook.
It is difficult to quantify the elasticity and efficiency of these market curves, as most literature and research on this comes from in-house corporate teams, or from advisors, mentors or consultants to the primary market leaders, in a kind of incestuous fraternal hold on public academic research.
It is recommended that-
1) a balance be found between the need for corporate secrecy (to protect shareholder/stakeholder value maximization) and the need for data liberation, to spur innovation and grow the internet ad pie faster;
2) cost and revenue curves across different keywords, times, locations and service providers be studied by quants for hedging internet ad inventory and/or choosing arbitrage positions. This kind of analysis is done for groups of stocks and commodities in the financial world, but as commerce grows on the internet it may need more specific and independent quants;
3) attention be paid to how cost and revenue curves mature with the level of sophistication of the underlying economy- countries like Brazil, Russia, China, Korea, the US and Sweden may be at different stages of internet ad market evolution.
For example-
A study of cost and revenue curves for certain keywords, across domains, ad providers and locations from 2003-2008, could help academia and research (much more than non-quantitative reports like top-ten lists of popular terms), while ensuring that current algorithmic weightings are not inadvertently given away.
Part 2 of this series will explore ways to create third-party resellers of keywords, and to measure the impact of search and ad engine optimization based on keywords.

Should you buy Zynga, or wait for the FB IPO?

I am going to make a case for whether to buy Zynga now, or to wait and buy Facebook instead. Of course, if Mark Pincus offers you a deep discount, or Mark Zuckerberg totally goes over the top with his P/E multiple, all bets would be re-evaluated.

In the interest of your time, and my personal happiness, I am going to use a fairly standard way to measure the attractiveness of both these companies- notably Porter's Five Forces model. I will also review the recent experiences of the Groupon and LinkedIn valuations, to underscore how subtle differences in culture and in the founders' reputations can affect the eventual value creation or destruction in an IPO.

(to be continued)

How many LinkedIn Connections do I have?

I have 8116 LinkedIn connections at

But wait I have only 7557 connections at

Somebody refresh those DB tables faster!! Where did 600 connections go!!

Of course if you see my profile at

Someone at LinkedIn totally forgot to update the 500+ cutoff for connections.