Facebook and R

Part 1 How do people at Facebook use R?

tamar Rosenn, Facebook

Itamar conveyed how Facebook’s Data Team used R in 2007 to answer two questions about new users: (i) which data points predict whether a user will stay? and (ii) if they stay, which data points predict how active they’ll be after three months?

For the first question, Itamar’s team used recursive partitioning (via the rpartpackage) to infer that just two data points are significantly predictive of whether a user remains on Facebook: (i) having more than one session as a new user, and (ii) entering basic profile information.

For the second question, they fit the data to a logistic model using a least angle regression approach (via the lars package), and found that activity at three months was predicted by variables related to three classes of behavior: (i) how often a user was reached out to by others, (ii) frequency of third party application use, and (iii) what Itamar termed “receptiveness” — related to how forthcoming a user was on the site.

source-http://www.dataspora.com/2009/02/predictive-analytics-using-r/

and cute graphs like the famous

https://www.facebook.com/notes/facebook-engineering/visualizing-friendships/469716398919

 

and

studying baseball on facebook

https://www.facebook.com/notes/facebook-data-team/baseball-on-facebook/10150142265858859

by counting the number of posts that occurred the day after a team lost divided by the total number of wins, since losses for great teams are remarkable and since winning teams’ fans just post more.

 

But mostly at

https://www.facebook.com/data?sk=notes and https://www.facebook.com/data?v=app_4949752878

 

and creating new packages

1. jjplot (not much action here!)

https://r-forge.r-project.org/scm/viewvc.php/?root=jjplot

though

I liked the promise of JJplot at

http://pleasescoopme.com/2010/03/31/using-jjplot-to-explore-tipping-behavior/

2. ising models

https://github.com/slycoder/Rflim

https://www.facebook.com/note.php?note_id=10150359708746212

3. R pipe

https://github.com/slycoder/Rpipe

 

even the FB interns are cool

http://brenocon.com/blog/2009/02/comparison-of-data-analysis-packages-r-matlab-scipy-excel-sas-spss-stata/

 

Part 2 How do people with R use Facebook?

Using the API at https://developers.facebook.com/tools/explorer

and code mashes from

 

http://romainfrancois.blog.free.fr/index.php?post/2012/01/15/Crawling-facebook-with-R

http://applyr.blogspot.in/2012/01/mining-facebook-data-most-liked-status.html

but the wonderful troubleshooting code from http://www.brocktibert.com/blog/2012/01/19/358/

which needs to be added to the code first

 

and using network package

>access_token=”XXXXXXXXXXXX”

Annoyingly the Facebook token can expire after some time, this can lead to huge wait and NULL results with Oauth errors

If that happens you need to regenerate the token

What we need
> require(RCurl)
> require(rjson)
> download.file(url=”http://curl.haxx.se/ca/cacert.pem”, destfile=”cacert.pem”)

Roman’s Famous Facebook Function (altered)

> facebook <- function( path = “me”, access_token , options){
+ if( !missing(options) ){
+ options <- sprintf( “?%s”, paste( names(options), “=”, unlist(options), collapse = “&”, sep = “” ) )
+ } else {
+ options <- “”
+ }
+ data <- getURL( sprintf( “https://graph.facebook.com/%s%s&access_token=%s&#8221;, path, options, access_token ), cainfo=”cacert.pem” )
+ fromJSON( data )
+ }

 

Now getting the friends list
> friends <- facebook( path=”me/friends” , access_token=access_token)
> # extract Facebook IDs
> friends.id <- sapply(friends$data, function(x) x$id)
> # extract names
> friends.name <- sapply(friends$data, function(x) iconv(x$name,”UTF-8″,”ASCII//TRANSLIT”))
> # short names to initials
> initials <- function(x) paste(substr(x,1,1), collapse=””)
> friends.initial <- sapply(strsplit(friends.name,” “), initials)

This matrix can take a long time to build, so you can change the value of N to say 40 to test your network. I needed to press the escape button to cut short the plotting of all 400 friends of mine.
> # friendship relation matrix
> N <- length(friends.id)
> friendship.matrix <- matrix(0,N,N)
> for (i in 1:N) {
+ tmp <- facebook( path=paste(“me/mutualfriends”, friends.id[i], sep=”/”) , access_token=access_token)
+ mutualfriends <- sapply(tmp$data, function(x) x$id)
+ friendship.matrix[i,friends.id %in% mutualfriends] <- 1
+ }

 

Plotting using Network package in R (with help from the  comments at http://applyr.blogspot.in/2012/01/mining-facebook-data-most-liked-status.html)

> require(network)

>net1<- as.network(friendship.matrix)

> plot(net1, label=friends.initial, arrowhead.cex=0)

(Rgraphviz is tough if you are on Windows 7 like me)

but there is an alternative igraph solution at https://github.com/sciruela/facebookFriends/blob/master/facebook.r

 

After all that-..talk.. a graph..of my Facebook Network with friends initials as labels..

 

Opinion piece-

I hope plans to make the Facebook R package get fulfilled (just as the twitteR  package led to many interesting analysis)

and also Linkedin has an API at http://developer.linkedin.com/apis

I think it would be interesting to plot professional relationships across social networks as well. But I hope to see a LinkedIn package (or blog code) soon.

As for jjplot, I had hoped ggplot and jjplot merged or atleast had some kind of inclusion in the Deducer GUI. Maybe a Google Summer of Code project if people are busy!!

Also the geeks at Facebook.com can think of giving something back to the R community, as Google generously does with funding packages like RUnit, Deducer and Summer of Code, besides sponsoring meet ups etc.

 

(note – this is part of the research for the upcoming book ” R for Business Analytics”)

 

ps-

but didnt get time to download all my posts using R code at

https://gist.github.com/1634662#

or do specific Facebook Page analysis using R at

http://tonybreyal.wordpress.com/2012/01/06/r-web-scraping-r-bloggers-facebook-page-to-gain-further-information-about-an-authors-r-blog-posts-e-g-number-of-likes-comments-shares-etc/

Updated-

 #access token from https://developers.facebook.com/tools/explorer
access_token="AAuFgaOcVaUZAssCvL9dPbZCjghTEwwhNxZAwpLdZCbw6xw7gARYoWnPHxihO1DcJgSSahd67LgZDZD"
require(RCurl)
 require(rjson)
# download the file needed for authentication http://www.brocktibert.com/blog/2012/01/19/358/
download.file(url="http://curl.haxx.se/ca/cacert.pem", destfile="cacert.pem")
# http://romainfrancois.blog.free.fr/index.php?post/2012/01/15/Crawling-facebook-with-R
facebook <- function( path = "me", access_token = token, options){
if( !missing(options) ){
options <- sprintf( "?%s", paste( names(options), "=", unlist(options), collapse = "&", sep = "" ) )
} else {
options <- ""
}
data <- getURL( sprintf( "https://graph.facebook.com/%s%s&access_token=%s", path, options, access_token ), cainfo="cacert.pem" )
fromJSON( data )
}

 # see http://applyr.blogspot.in/2012/01/mining-facebook-data-most-liked-status.html

# scrape the list of friends
friends <- facebook( path="me/friends" , access_token=access_token)
# extract Facebook IDs
friends.id <- sapply(friends$data, function(x) x$id)
# extract names 
friends.name <- sapply(friends$data, function(x)  iconv(x$name,"UTF-8","ASCII//TRANSLIT"))
# short names to initials 
initials <- function(x) paste(substr(x,1,1), collapse="")
friends.initial <- sapply(strsplit(friends.name," "), initials)

# friendship relation matrix
#N <- length(friends.id)
N <- 200
friendship.matrix <- matrix(0,N,N)
for (i in 1:N) {
  tmp <- facebook( path=paste("me/mutualfriends", friends.id[i], sep="/") , access_token=access_token)
  mutualfriends <- sapply(tmp$data, function(x) x$id)
  friendship.matrix[i,friends.id %in% mutualfriends] <- 1
}
require(network)
net1<- as.network(friendship.matrix)
plot(net1, label=friends.initial, arrowhead.cex=0)

Created by Pretty R at inside-R.org

How to add or change the %PATH variable in Windows 7

See this for a simple 5 step way to add or change the %PATH variable in Windows 7 if you need to install an application that shows error while installing (because that APP was built for Linux based systems… )

Interview: Hjálmar Gíslason, CEO of DataMarket.com

Here is an interview with Hjálmar Gíslason, CEO of Datamarket.com  . DataMarket is an active marketplace for structured data and statistics. Through powerful search and visual data exploration, DataMarket connects data seekers with data providers.

 

Ajay-  Describe your journey as an entrepreneur and techie in Iceland. What are the 10 things that surprised you most as a tech entrepreneur.

HG- DataMarket is my fourth tech start-up since at age 20 in 1996. The previous ones have been in gaming, mobile and web search. I come from a technical background but have been moving more and more to the business side over the years. I can still prototype, but I hope there isn’t a single line of my code in production!

Funny you should ask about the 10 things that have surprised me the most on this journey, as I gave a presentation – literally yesterday – titled: “9 things nobody told me about the start-up business”

They are:
* Do NOT generalize – especially not to begin with
* Prioritize – and find a work-flow that works for you
* Meet people – face to face
* You are a sales person – whether you like it or not
* Technology is not a product – it’s the entire experience
* Sell the current version – no matter how amazing the next one is
* Learn from mistakes – preferably others’
* Pick the right people – good people is not enough
* Tell a good story – but don’t make them up

I obviously elaborate on each of these points in the talk, but the points illustrate roughly some of the things I believe I’ve learned … so far 😉

9 things nobody told me about the start-up business

Ajay-

Both Amazon  and Google  have entered the public datasets space. Infochimps  has 14,000+ public datasets. The US has http://www.data.gov/

So clearly the space is both competitive and yet the demand for public data repositories is clearly under served still. 

How does DataMarket intend to address this market in a unique way to differentiate itself from others.

HG- DataMarket is about delivering business data to decision makers. We help data seekers find the data they need for planning and informed decision making, and data publishers reaching this audience. DataMarket.com is the meeting point, where data seekers can come to find the best available data, and data publishers can make their data available whether for free or for a fee. We’ve populated the site with a wealth of data from public sources such as the UN, Eurostat, World Bank, IMF and others, but there is also premium data that is only available to those that subscribe to and pay for the access. For example we resell the entire data offering from the EIU (Economist Intelligence Unit) (link: http://datamarket.com/data/list/?q=provider:eiu)

DataMarket.com allows all this data to be searched, visualized, compared and downloaded in a single place in a standard, unified manner.

We see many of these efforts not as competition, but as valuable potential sources of data for our offering, while others may be competing with parts of our proposition, such as easy access to the public data sets.

 

Ajay- What are your views on data confidentiality and access to data owned by Governments funded by tax payer money.

HG- My views are very simple: Any data that is gathered or created for taxpayers’ money should be open and free of charge unless higher priorities such as privacy or national security indicate otherwise.

Reflecting that, any data that is originally open and free of charge is still open and free of charge on DataMarket.com, just easier to find and work with.

Ajay-  How is the technology entrepreneurship and venture capital scene in Iceland. What things work and what things can be improved?

HG- The scene is quite vibrant, given the small community. Good teams with promising concepts have been able to get the funding they need to get started and test their footing internationally. When the rapid growth phase is reached outside funding may still be needed.

There are positive and negative things about any location. Among the good things about Iceland from the stand point of a technology start-up are highly skilled tech people and a relatively simple corporate environment. Among the bad things are a tiny local market, lack of skills in international sales and marketing and capital controls that were put in place after the crash of the Icelandic economy in 2008.

I’ve jokingly said that if a company is hot in the eyes of VCs it would get funding even if it was located in the jungles of Congo, while if they’re only lukewarm towards you, they will be looking for any excuse not to invest. Location can certainly be one of them, and in that case being close to the investor communities – physically – can be very important.

We’re opening up our sales and marketing offices in Boston as we speak. Not to be close to investors though, but to be close to our market and current customers.

Ajay- Describe your hobbies when you are not founding amazing tech startups.

HG- Most of my time is spent working – which happens to by my number one hobby.

It is still important to step away from it all every now and then to see things in perspective and come back with a clear mind.

I *love* traveling to exotic places. Me and my wife have done quite a lot of traveling in Africa and S-America: safari, scuba diving, skiing, enjoying nature. When at home I try to do some sports activities 3-4 times a week at least, and – recently – play with my now 8 month old son as much as I can.

About-

http://datamarket.com/p/about/team/

Management

Hjalmar GislasonHjálmar Gíslason, Founder and CEO: Hjalmar is a successful entrepreneur, founder of three startups in the gaming, mobile and web sectors since 1996. Prior to launching DataMarket, Hjalmar worked on new media and business development for companies in the Skipti Group (owners of Iceland Telecom) after their acquisition of his search startup – Spurl. Hjalmar offers a mix of business, strategy and technical expertise. DataMarket is based largely on his vision of the need for a global exchange for structured data.

hjalmar.gislason@datamarket.com

To know more, have a quick  look at  http://datamarket.com/

Teradata Analytics

A recent announcement showing Teradata partnering with KXEN and Revolution Analytics for Teradata Analytics.

http://www.teradata.com/News-Releases/2012/Teradata-Expands-Integrated-Analytics-Portfolio/

The Latest in Open Source Emerging Software Technologies
Teradata provides customers with two additional open source technologies – “R” technology from Revolution Analytics for analytics and GeoServer technology for spatial data offered by the OpenGeo organization – both of which are able to leverage the power of Teradata in-database processing for faster, smarter answers to business questions.

In addition to the existing world-class analytic partners, Teradata supports the use of the evolving “R” technology, an open source language for statistical computing and graphics. “R” technology is gaining popularity with data scientists who are exploiting its new and innovative capabilities, which are not readily available. The enhanced “R add-on for Teradata” has a 50 percent performance improvement, it is easier to use, and its capabilities support large data analytics. Users can quickly profile, explore, and analyze larger quantities of data directly in the Teradata Database to deliver faster answers by leveraging embedded analytics.

Teradata has partnered with Revolution Analytics, the leading commercial provider of “R” technology, because of customer interest in high-performing R applications that deliver superior performance for large-scale data. “Our innovative customers understand that big data analytics takes a smart approach to the entire infrastructure and we will enable them to differentiate their business in a cost-effective way,” said David Rich, chief executive officer, Revolution Analytics. “We are excited to partner with Teradata, because we see great affinity between Teradata and Revolution Analytics – we embrace parallel computing and the high performance offered by multi-core and multi-processor hardware.”

and

The Teradata Data Lab empowers business users and leading analytic partners to start building new analytics in less than five minutes, as compared to waiting several weeks for the IT department’s assistance.

“The Data Lab within the Teradata database provides the perfect foundation to enable self-service predictive analytics with KXEN InfiniteInsight,” said John Ball, chief executive officer, KXEN. “Teradata technologies, combined with KXEN’s automated modeling capabilities and in-database scoring, put the power of predictive analytics and data mining directly into the hands of business users. This powerful combination helps our joint customers accelerate insight by delivering top-quality models in orders of magnitude faster than traditional approaches.”

Read more at

http://www.sacbee.com/2012/03/06/4315500/teradata-expands-integrated-analytics.html

Using Cloud Computing for Hacking

This is not about hacking the cloud. Instead this is about using the cloud to hack

 

Some articles last year wrote on how hackers used Amazon Ec2 for hacking/ddos attacks.

http://www.pcworld.com/businesscenter/article/216434/cloud_computing_used_to_hack_wireless_passwords.html

Roth claims that a typical wireless password can be guessed by EC2 and his software in about six minutes. He proved this by hacking networks in the area where he lives. The type of EC2 computers used in the attack costs 28 cents per minute, so $1.68 is all it could take to lay open a wireless network.

and

http://www.bloomberg.com/news/2011-05-15/sony-attack-shows-amazon-s-cloud-service-lures-hackers-at-pennies-an-hour.html

Cloud services are also attractive for hackers because the use of multiple servers can facilitate tasks such as cracking passwords, said Ray Valdes, an analyst at Gartner Inc. Amazon could improve measures to weed out bogus accounts, he said.

 

and this article by Anti-Sec pointed out how one can obtain a debit card anonymously

https://www.facebook.com/notes/lulzsec/want-to-be-a-ghost-on-the-internet/230293097062823

VPN Account without paper trail

  • Purchase prepaid visa card with cash
  • Purchase Bitcoins with Money Order
  • Donate Bitcoins to different account

 

Masking your IP address to log on is done by TOR

https://www.torproject.org/download/download.html.en

and the actual flooding is done by tools like LOIC or HOIC

http://sourceforge.net/projects/loic/

and

http://www.4shared.com/rar/UmCu0ds1/hoic.html

 

So what safeguards can be expected from the next wave of Teenage Mutant Ninjas..?