Interview Alain Chesnais Chief Scientist Trendspottr.com

Here is a brief interview with Alain Chesnais ,Chief Scientist  Trendspottr.com. It is a big honor to interview such a legend in computer science, and I am grateful to both him and Mark Zohar for taking time to write these down.
alain_chesnais2.jpg

Ajay-  Describe your career from your student days to being the President of ACM (Association of Computing Machinery http://www.acm.org/ ). How can we increase  the interest of students in STEM education, particularly in view of the shortage of data scientists.
 
Alain- I’m trying to sum up a career of over 35 years. This may be a bit long winded…
I started my career in CS when I was in high school in the early 70’s. I was accepted in the National Science Foundation’s Science Honors Program in 9th grade and the first course I took was a Fortran programming course at Columbia University. This was on an IBM 360 using punch cards.
The next year my high school got a donation from DEC of a PDP-8E mini computer. I ended up spending a lot of time in the machine room all through high school at a time when access to computers wasn’t all that common. I went to college in Paris and ended up at l’Ecole Normale Supérieure de Cachan in the newly created Computer Science department.
My first job after finishing my graduate studies was as a research assistant at the Centre National de la Recherche Scientifique where I focused my efforts on modelling the behaviour of distributed database systems in the presence of locking. When François Mitterand was elected president of France in 1981, he invited Nicholas Negroponte and Seymour Papert to come to France to set up the Centre Mondial Informatique. I was hired as a researcher there and continued on to become director of software development until it was closed down in 1986. I then started up my own company focusing on distributed computer graphics. We sold the company to Abvent in the early 90’s.
After that, I was hired by Thomson Digital Image to lead their rendering team. We were acquired by Wavefront Technologies in 1993 then by SGI in 1995 and merged with Alias Research. In the merged company: Alias|wavefront, I was director of engineering on the Maya project. Our team received an Oscar in 2003 for the creation of the Maya software system.
Since then I’ve worked at various companies, most recently focusing on social media and Big Data issues associated with it. Mark Zohar and I worked together at SceneCaster in 2007 where we developed a Facebook app that allowed users to create their own 3D scenes and share them with friends via Facebook without requiring a proprietary plugin. In December 2007 it was the most popular app in its category on Facebook.
Recently Mark approached me with a concept related to mining the content of public tweets to determine what was trending in real time. Using math similar to what I had developed during my graduate studies to model the performance of distributed databases in the presence of locking, we built up a real time analytics engine that ranks the content of tweets as they stream in. The math is designed to scale linearly in complexity with the volume of data that we analyze. That is the basis for what we have created for TrendSpottr.
In parallel to my professional career, I have been a very active volunteer at ACM. I started out as a member of the Paris ACM SIGGRAPH chapter in 1985 and volunteered to help do our mailings (snail mail at the time). After taking on more responsibilities with the chapter, I was elected chair of the chapter in 1991. I was first appointed to the SIGGRAPH Local Groups Steering Committee, then became ACM Director for Chapters. Later I was successively elected SIGGRAPH Vice Chair, ACM SIG Governing Board (SGB) Vice Chair for Operations, SGB Chair, ACM SIGGRAPH President, ACM Secretary/Treasurer, ACM Vice President, and finally, in 2010, I was elected ACM President. My term as ACM President has just ended on July 1st. Vint Cerf is our new President. I continue to serve on the ACM Executive Committee in my role as immediate Past President.
(Note- About ACM
ACM, the Association for Computing Machinery www.acm.org, is the world’s largest educational and scientific computing society, uniting computing educators, researchers and professionals to inspire dialogue, share resources and address the field’s challenges. )
Ajay- What sets Trendspotter apart from other startups out there in terms of vision in trying to achieve a more coherent experience on the web.
 
Alain- The Basic difference with other approaches that we are aware of is that we have developed an incremental solution that calculates the results on the fly as the data streams in. Our evaluators are based on solid mathematical foundations that have proven their usefulness over time. One way to describe what we do is to think of it as signal processing where the tweets are the signal and our evaluators are like triggers that tell you what elements of the signal have the characteristics that we are filtering for (velocity and acceleration). One key result of using this approach is that our unit cost per tweet analyzed does not go up with increased volume. Using more traditional data analysis approaches involving an implicit sort would imply a complexity of N*log(N), where N is the volume of tweets being analyzed. That would imply that the cost per tweet analyzed would go up with the volume of tweets. Our approach was designed to avoid that, so that we can maintain a cap on our unit costs of analysis, no matter what volume of data we analyze.
Ajay- What do you think is the future of big data visualization going to look like? What are some of the technologies that you are currently bullish on?
Alain- I see several trends that would have deep impact on Big Data visualization. I firmly believe that with large amounts of data, visualization is key tool for understanding both the structure and the relationships that exist between data elements. Let’s focus on some of the key things that are pushing in this direction:
  • the volume of data that is available is growing at a rate we have never seen before. Cisco has measured an 8 fold increase in the volume of IP traffic over the last 5 years and predicts that we will reach the zettabyte of data over IP in 2016
  • more of the data is becoming publicly available. This isn’t only on social networks such as Facebook and twitter, but joins a more general trend involving open research initiatives and open government programs
  • the desired time to get meaningful results is going down dramatically. In the past 5 years we have seen the half life of data on Facebook, defined as the amount of time that half of the public reactions to any given post (likes, shares., comments) take place, go from about 12 hours to under 3 hours currently
  • our access to the net is always on via mobile device. You are always connected.
  • the CPU and GPU capabilities of mobile devices is huge (an iPhone has 10 times the compute power of a Cray-1 and more graphics capabilities than early SGI workstations)
Put all of these observations together and you quickly come up with a massive opportunity to analyze data visually on the go as it happens no matter where you are. We can’t afford to have to wait for results. When something of interest occurs we need to be aware of it immediately.
Ajay- What are some of the applications we could use Trendspottr. Could we predict events like Arab Spring, or even the next viral thing.
 
Alain- TrendSpottr won’t predict what will happen next. What it *will* do is alert you immediately when it happens. You can think of it like a smoke detector. It doesn’t tell that a fire will take place, but it will save your life when a fire does break out.
Typical uses for TrendSpottr are
  • thought leadership by tracking content that your readership is interested in via TrendSpottr you can be seen as a thought leader on the subject by being one of the first to share trending content on a given subject. I personally do this on my Facebook page (http://www.facebook.com/alain.chesnais) and have seen my klout score go up dramatically as a result
  • brand marketing to be able to know when something is trending about your brand and take advantage of it as it happens.
  • competitive analysis to see what is being said about two competing elements. For instance, searching TrendSpottr for “Obama OR Romney” gives you a very good understanding about how social networks are reacting to each politician. You can also do searches like “$aapl OR $msft OR $goog” to get a sense of what is the current buzz for certain hi tech stocks.
  • understanding your impact in real time to be able to see which of the content that you are posting is trending the most on social media so that you can highlight it on your main page. So if all of your content is hosted on common domain name (ourbrand.com), searching for ourbrand.com will show you the most active of your site’s content. That can easily be set up by putting a TrendSpottr widget on your front page

Ajay- What are some of the privacy guidelines that you keep in  mind- given the fact that you collect individual information but also have government agencies as potential users.

 
Alain- We take privacy very seriously and anonymize all of the data that we collect. We don’t keep explicit records of the data we collected through the various incoming streams and only store the aggregate results of our analysis.
About
Alain Chesnais is immediate Past President of ACM, elected for the two-year term beginning July 1, 2010.Chesnais studied at l’Ecole Normale Supérieure de l’Enseignement Technique and l’Université de Paris where he earned a Maîtrise de Mathematiques, a Maitrise de Structure Mathématique de l’Informatique, and a Diplôme d’Etudes Approfondies in Compuer Science. He was a high school student at the United Nations International School in New York, where, along with preparing his International Baccalaureate with a focus on Math, Physics and Chemistry, he also studied Mandarin Chinese.Chesnais recently founded Visual Transitions, which specializes in helping companies move to HTML 5, the newest standard for structuring and presenting content on the World Wide Web. He was the CTO of SceneCaster.com from June 2007 until April 2010, and was Vice President of Product Development at Tucows Inc. from July 2005 – May 2007. He also served as director of engineering at Alias|Wavefront on the team that received an Oscar from the Academy of Motion Picture Arts and Sciences for developing the Maya 3D software package.

Prior to his election as ACM president, Chesnais was vice president from July 2008 – June 2010 as well as secretary/treasurer from July 2006 – June 2008. He also served as president of ACM SIGGRAPH from July 2002 – June 2005 and as SIG Governing Board Chair from July 2000 – June 2002.

As a French citizen now residing in Canada, he has more than 20 years of management experience in the software industry. He joined the local SIGGRAPH Chapter in Paris some 20 years ago as a volunteer and has continued his involvement with ACM in a variety of leadership capacities since then.

About Trendspottr.com

TrendSpottr is a real-time viral search and predictive analytics service that identifies the most timely and trending information for any topic or keyword. Our core technology analyzes real-time data streams and spots emerging trends at their earliest acceleration point — hours or days before they have become “popular” and reached mainstream awareness.

TrendSpottr serves as a predictive early warning system for news and media organizations, brands, government agencies and Fortune 500 companies and helps them to identify emerging news, events and issues that have high viral potential and market impact. TrendSpottr has partnered with HootSuite, DataSift and other leading social and big data companies.

Book Review- Machine Learning for Hackers

This is review of the fashionably named book Machine Learning for Hackers by Drew Conway and John Myles White (O’Reilly ). The book is about hacking code in R.

 

The preface introduces the reader to the authors conception of what machine learning and hacking is all about. If the name of the book was machine learning for business analytsts or data miners, I am sure the content would have been unchanged though the popularity (and ambiguity) of the word hacker can often substitute for its usefulness. Indeed the many wise and learned Professors of statistics departments through out the civilized world would be mildly surprised and bemused by their day to day activities as hacking or teaching hackers. The book follows a case study and example based approach and uses the GGPLOT2 package within R programming almost to the point of ignoring any other native graphics system based in R. It can be quite useful for the aspiring reader who wishes to understand and join the booming market for skilled talent in statistical computing.

Chapter 1 has a very useful set of functions for data cleansing and formatting. It walks you through the basics of formatting based on dates and conditions, missing value and outlier treatment and using ggplot package in R for graphical analysis. The case study used is an Infochimps dataset with 60,000 recordings of UFO sightings. The case study is lucid, and done at a extremely helpful pace illustrating the powerful and flexible nature of R functions that can be used for data cleansing.The chapter mentions text editors and IDEs but fails to list them in a tabular format, while listing several other tables like Packages used in the book. It also jumps straight from installation instructions to functions in R without getting into the various kinds of data types within R or specifying where these can be referenced from. It thus assumes a higher level of basic programming understanding for the reader than the average R book.

Chapter 2 discusses data exploration, and has a very clear set of diagrams that explain the various data summary operations that are performed routinely. This is an innovative approach and will help students or newcomers to the field of data analysis. It introduces the reader to type determination functions, as well different kinds of encoding. The introduction to creating functions is quite elegant and simple , and numerical summary methods are explained adequately. While the chapter explains data exploration with the help of various histogram options in ggplot2 , it fails to create a more generic framework for data exploration or rules to assist the reader in visual data exploration in non standard data situations. While the examples are very helpful for a reader , there needs to be slightly more depth to step out of the example and into a framework for visual data exploration (or references for the same). A couple of case studies however elaborately explained cannot do justice to the vast field of data exploration and especially visual data exploration.

Chapter 3 discussed binary classification for the specific purpose for spam filtering using a dataset from SpamAssassin. It introduces the reader to the naïve Bayes classifier and the principles of text mining suing the tm package in R. Some of the example codes could have been better commented for easier readability in the book. Overall it is quite a easy tutorial for creating a naïve Bayes classifier even for beginners.

Chapter 4 discusses the issues in importance ranking and creating recommendation systems specifically in the case of ordering email messages into important and not important. It introduces the useful grepl, gsub, strsplit, strptime ,difftime and strtrim functions for parsing data. The chapter further introduces the reader to the concept of log (and affine) transformations in a lucid and clear way that can help even beginners learn this powerful transformation concept. Again the coding within this chapter is sparsely commented which can cause difficulties to people not used to learn reams of code. ( it may have been part of the code attached with the book, but I am reading an electronic book and I did not find an easy way to go back and forth between the code and the book). The readability of the chapters would be further enhanced by the use of flow charts explaining the path and process followed than overtly verbose textual descriptions running into multiple pages. The chapters are quite clearly written, but a helpful visual summary can help in both revising the concepts and elucidate the approach taken further.A suggestion for the authors could be to compile the list of useful functions they introduce in this book as a sort of reference card (or Ref Card) for R Hackers or atleast have a chapter wise summary of functions, datasets and packages used.

Chapter 5 discusses linear regression , and it is a surprising and not very good explanation of regression theory in the introduction to regression. However the chapter makes up in practical example what it oversimplifies in theory. The chapter on regression is not the finest chapter written in this otherwise excellent book. Part of this is because of relative lack of organization- correlation is explained after linear regression is explained. Once again the lack of a function summary and a process flow diagram hinders readability and a separate section on regression metrics that help make a regression result good or not so good could be a welcome addition. Functions introduced include lm.

Chapter 6 showcases Generalized Additive Model (GAM) and Polynomial Regression, including an introduction to singularity and of over-fitting. Functions included in this chapter are transform, and poly while the package glmnet is also used here. The chapter also introduces the reader formally to the concept of cross validation (though examples of cross validation had been introduced in earlier chapters) and regularization. Logistic regression is also introduced at the end in this chapter.

Chapter 7 is about optimization. It describes error metric in a very easy to understand way. It creates a grid by using nested loops for various values of intercept and slope of a regression equation and computing the sum of square of errors. It then describes the optim function in detail including how it works and it’s various parameters. It introduces the curve function. The chapter then describes ridge regression including definition and hyperparameter lamda. The use of optim function to optimize the error in regression is useful learning for the aspiring hacker. Lastly it describes a case study of breaking codes using the simplistic Caesar cipher, a lexical database and the Metropolis method. Functions introduced in this chapter include .Machine$double.eps .

Chapter 8 deals with Principal Component Analysis and unsupervised learning. It uses the ymd function from lubridate package to convert string to date objects, and the cast function from reshape package to further manipulate the structure of data. Using the princomp functions enables PCA in R.The case study creates a stock market index and compares the results with the Dow Jones index.

Chapter 9 deals with Multidimensional Scaling as well as clustering US senators on the basis of similarity in voting records on legislation .It showcases matrix multiplication using %*% and also the dist function to compute distance matrix.

Chapter 10 has the subject of K Nearest Neighbors for recommendation systems. Packages used include class ,reshape and and functions used include cor, function and log. It also demonstrates creating a custom kNN function for calculating Euclidean distance between center of centroids and data. The case study used is the R package recommendation contest on Kaggle. Overall a simplistic introduction to creating a recommendation system using K nearest neighbors, without getting into any of the prepackaged packages within R that deal with association analysis , clustering or recommendation systems.

Chapter 11 introduces the reader to social network analysis (and elements of graph theory) using the example of Erdos Number as an interesting example of social networks of mathematicians. The example of Social Graph API by Google for hacking are quite new and intriguing (though a bit obsolete by changes, and should be rectified in either the errata or next edition) . However there exists packages within R that should be atleast referenced or used within this chapter (like TwitteR package that use the Twitter API and ROauth package for other social networks). Packages used within this chapter include Rcurl, RJSONIO, and igraph packages of R and functions used include rbind and ifelse. It also introduces the reader to the advanced software Gephi. The last example is to build a recommendation engine for whom to follow in Twitter using R.

Chapter 12 is about model comparison and introduces the concept of Support Vector Machines. It uses the package e1071 and shows the svm function. It also introduces the concept of tuning hyper parameters within default algorithms . A small problem in understanding the concepts is the misalignment of diagram pages with the relevant code. It lastly concludes with using mean square error as a method for comparing models built with different algorithms.

 

Overall the book is a welcome addition in the library of books based on R programming language, and the refreshing nature of the flow of material and the practicality of it’s case studies make this a recommended addition to both academic and corporate business analysts trying to derive insights by hacking lots of heterogeneous data.

Have a look for yourself at-
http://shop.oreilly.com/product/0636920018483.do

Facebook and R

Part 1 How do people at Facebook use R?

tamar Rosenn, Facebook

Itamar conveyed how Facebook’s Data Team used R in 2007 to answer two questions about new users: (i) which data points predict whether a user will stay? and (ii) if they stay, which data points predict how active they’ll be after three months?

For the first question, Itamar’s team used recursive partitioning (via the rpartpackage) to infer that just two data points are significantly predictive of whether a user remains on Facebook: (i) having more than one session as a new user, and (ii) entering basic profile information.

For the second question, they fit the data to a logistic model using a least angle regression approach (via the lars package), and found that activity at three months was predicted by variables related to three classes of behavior: (i) how often a user was reached out to by others, (ii) frequency of third party application use, and (iii) what Itamar termed “receptiveness” — related to how forthcoming a user was on the site.

source-http://www.dataspora.com/2009/02/predictive-analytics-using-r/

and cute graphs like the famous

https://www.facebook.com/notes/facebook-engineering/visualizing-friendships/469716398919

 

and

studying baseball on facebook

https://www.facebook.com/notes/facebook-data-team/baseball-on-facebook/10150142265858859

by counting the number of posts that occurred the day after a team lost divided by the total number of wins, since losses for great teams are remarkable and since winning teams’ fans just post more.

 

But mostly at

https://www.facebook.com/data?sk=notes and https://www.facebook.com/data?v=app_4949752878

 

and creating new packages

1. jjplot (not much action here!)

https://r-forge.r-project.org/scm/viewvc.php/?root=jjplot

though

I liked the promise of JJplot at

http://pleasescoopme.com/2010/03/31/using-jjplot-to-explore-tipping-behavior/

2. ising models

https://github.com/slycoder/Rflim

https://www.facebook.com/note.php?note_id=10150359708746212

3. R pipe

https://github.com/slycoder/Rpipe

 

even the FB interns are cool

http://brenocon.com/blog/2009/02/comparison-of-data-analysis-packages-r-matlab-scipy-excel-sas-spss-stata/

 

Part 2 How do people with R use Facebook?

Using the API at https://developers.facebook.com/tools/explorer

and code mashes from

 

http://romainfrancois.blog.free.fr/index.php?post/2012/01/15/Crawling-facebook-with-R

http://applyr.blogspot.in/2012/01/mining-facebook-data-most-liked-status.html

but the wonderful troubleshooting code from http://www.brocktibert.com/blog/2012/01/19/358/

which needs to be added to the code first

 

and using network package

>access_token=”XXXXXXXXXXXX”

Annoyingly the Facebook token can expire after some time, this can lead to huge wait and NULL results with Oauth errors

If that happens you need to regenerate the token

What we need
> require(RCurl)
> require(rjson)
> download.file(url=”http://curl.haxx.se/ca/cacert.pem”, destfile=”cacert.pem”)

Roman’s Famous Facebook Function (altered)

> facebook <- function( path = “me”, access_token , options){
+ if( !missing(options) ){
+ options <- sprintf( “?%s”, paste( names(options), “=”, unlist(options), collapse = “&”, sep = “” ) )
+ } else {
+ options <- “”
+ }
+ data <- getURL( sprintf( “https://graph.facebook.com/%s%s&access_token=%s&#8221;, path, options, access_token ), cainfo=”cacert.pem” )
+ fromJSON( data )
+ }

 

Now getting the friends list
> friends <- facebook( path=”me/friends” , access_token=access_token)
> # extract Facebook IDs
> friends.id <- sapply(friends$data, function(x) x$id)
> # extract names
> friends.name <- sapply(friends$data, function(x) iconv(x$name,”UTF-8″,”ASCII//TRANSLIT”))
> # short names to initials
> initials <- function(x) paste(substr(x,1,1), collapse=””)
> friends.initial <- sapply(strsplit(friends.name,” “), initials)

This matrix can take a long time to build, so you can change the value of N to say 40 to test your network. I needed to press the escape button to cut short the plotting of all 400 friends of mine.
> # friendship relation matrix
> N <- length(friends.id)
> friendship.matrix <- matrix(0,N,N)
> for (i in 1:N) {
+ tmp <- facebook( path=paste(“me/mutualfriends”, friends.id[i], sep=”/”) , access_token=access_token)
+ mutualfriends <- sapply(tmp$data, function(x) x$id)
+ friendship.matrix[i,friends.id %in% mutualfriends] <- 1
+ }

 

Plotting using Network package in R (with help from the  comments at http://applyr.blogspot.in/2012/01/mining-facebook-data-most-liked-status.html)

> require(network)

>net1<- as.network(friendship.matrix)

> plot(net1, label=friends.initial, arrowhead.cex=0)

(Rgraphviz is tough if you are on Windows 7 like me)

but there is an alternative igraph solution at https://github.com/sciruela/facebookFriends/blob/master/facebook.r

 

After all that-..talk.. a graph..of my Facebook Network with friends initials as labels..

 

Opinion piece-

I hope plans to make the Facebook R package get fulfilled (just as the twitteR  package led to many interesting analysis)

and also Linkedin has an API at http://developer.linkedin.com/apis

I think it would be interesting to plot professional relationships across social networks as well. But I hope to see a LinkedIn package (or blog code) soon.

As for jjplot, I had hoped ggplot and jjplot merged or atleast had some kind of inclusion in the Deducer GUI. Maybe a Google Summer of Code project if people are busy!!

Also the geeks at Facebook.com can think of giving something back to the R community, as Google generously does with funding packages like RUnit, Deducer and Summer of Code, besides sponsoring meet ups etc.

 

(note – this is part of the research for the upcoming book ” R for Business Analytics”)

 

ps-

but didnt get time to download all my posts using R code at

https://gist.github.com/1634662#

or do specific Facebook Page analysis using R at

http://tonybreyal.wordpress.com/2012/01/06/r-web-scraping-r-bloggers-facebook-page-to-gain-further-information-about-an-authors-r-blog-posts-e-g-number-of-likes-comments-shares-etc/

Updated-

 #access token from https://developers.facebook.com/tools/explorer
access_token="AAuFgaOcVaUZAssCvL9dPbZCjghTEwwhNxZAwpLdZCbw6xw7gARYoWnPHxihO1DcJgSSahd67LgZDZD"
require(RCurl)
 require(rjson)
# download the file needed for authentication http://www.brocktibert.com/blog/2012/01/19/358/
download.file(url="http://curl.haxx.se/ca/cacert.pem", destfile="cacert.pem")
# http://romainfrancois.blog.free.fr/index.php?post/2012/01/15/Crawling-facebook-with-R
facebook <- function( path = "me", access_token = token, options){
if( !missing(options) ){
options <- sprintf( "?%s", paste( names(options), "=", unlist(options), collapse = "&", sep = "" ) )
} else {
options <- ""
}
data <- getURL( sprintf( "https://graph.facebook.com/%s%s&access_token=%s", path, options, access_token ), cainfo="cacert.pem" )
fromJSON( data )
}

 # see http://applyr.blogspot.in/2012/01/mining-facebook-data-most-liked-status.html

# scrape the list of friends
friends <- facebook( path="me/friends" , access_token=access_token)
# extract Facebook IDs
friends.id <- sapply(friends$data, function(x) x$id)
# extract names 
friends.name <- sapply(friends$data, function(x)  iconv(x$name,"UTF-8","ASCII//TRANSLIT"))
# short names to initials 
initials <- function(x) paste(substr(x,1,1), collapse="")
friends.initial <- sapply(strsplit(friends.name," "), initials)

# friendship relation matrix
#N <- length(friends.id)
N <- 200
friendship.matrix <- matrix(0,N,N)
for (i in 1:N) {
  tmp <- facebook( path=paste("me/mutualfriends", friends.id[i], sep="/") , access_token=access_token)
  mutualfriends <- sapply(tmp$data, function(x) x$id)
  friendship.matrix[i,friends.id %in% mutualfriends] <- 1
}
require(network)
net1<- as.network(friendship.matrix)
plot(net1, label=friends.initial, arrowhead.cex=0)

Created by Pretty R at inside-R.org

Interview Michal Kosinski , Concerto Web Based App using #Rstats

Here is an interview with Michal Kosinski , leader of the team that has created Concerto – a web based application using R. What is Concerto? As per http://www.psychometrics.cam.ac.uk/page/300/concerto-testing-platform.htm

Concerto is a web based, adaptive testing platform for creating and running rich, dynamic tests. It combines the flexibility of HTML presentation with the computing power of the R language, and the safety and performance of the MySQL database. It’s totally free for commercial and academic use, and it’s open source

Ajay-  Describe your career in science from high school to this point. What are the various stats platforms you have trained on- and what do you think about their comparative advantages and disadvantages?  

Michal- I started with maths, but quickly realized that I prefer social sciences – thus after one year, I switched to a psychology major and obtained my MSc in Social Psychology with a specialization in Consumer Behaviour. At that time I was mostly using SPSS – as it was the only statistical package that was taught to students in my department. Also, it was not too bad for small samples and the rather basic analyses I was performing at that time.

 

My more recent research performed during my Mphil course in Psychometrics at Cambridge University followed by my current PhD project in social networks and research work at Microsoft Research, requires significantly more powerful tools. Initially, I tried to squeeze as much as possible from SPSS/PASW by mastering the syntax language. SPSS was all I knew, though I reached its limits pretty quickly and was forced to switch to R. It was a pretty dreary experience at the start, switching from an unwieldy but familiar environment into an unwelcoming command line interface, but I’ve quickly realized how empowering and convenient this tool was.

 

I believe that a course in R should be obligatory for all students that are likely to come close to any data analysis in their careers. It is really empowering – once you got the basics you have the potential to use virtually any method there is, and automate most tasks related to analysing and processing data. It is also free and open-source – so you can use it wherever you work. Finally, it enables you to quickly and seamlessly migrate to other powerful environments such as Matlab, C, or Python.

Ajay- What was the motivation behind building Concerto?

Michal- We deal with a lot of online projects at the Psychometrics Centre – one of them attracted more than 7 million unique participants. We needed a powerful tool that would allow researchers and practitioners to conveniently build and deliver online tests.

Also, our relationships with the website designers and software engineers that worked on developing our tests were rather difficult. We had trouble successfully explaining our needs, each little change was implemented with a delay and at significant cost. Not to mention the difficulties with embedding some more advanced methods (such as adaptive testing) in our tests.

So we created a tool allowing us, psychometricians, to easily develop psychometric tests from scratch an publish them online. And all this without having to hire software developers.

Ajay -Why did you choose R as the background for Concerto? What other languages and platforms did you consider. Apart from Concerto, how else do you utilize R in your center, department and University?

Michal- R was a natural choice as it is open-source, free, and nicely integrates with a server environment. Also, we believe that it is becoming a universal statistical and data processing language in science. We put increasing emphasis on teaching R to our students and we hope that it will replace SPSS/PASW as a default statistical tool for social scientists.

Ajay -What all can Concerto do besides a computer adaptive test?

Michal- We did not plan it initially, but Concerto turned out to be extremely flexible. In a nutshell, it is a web interface to R engine with a built-in MySQL database and easy-to-use developer panel. It can be installed on both Windows and Unix systems and used over the network or locally.

Effectively, it can be used to build any kind of web application that requires a powerful and quickly deployable statistical engine. For instance, I envision an easy to use website (that could look a bit like SPSS) allowing students to analyse their data using a web browser alone (learning the underlying R code simultaneously). Also, the authors of R libraries (or anyone else) could use Concerto to build user-friendly web interfaces to their methods.

Finally, Concerto can be conveniently used to build simple non-adaptive tests and questionnaires. It might seem to be slightly less intuitive at first than popular questionnaire services (such us my favourite Survey Monkey), but has virtually unlimited flexibility when it comes to item format, test flow, feedback options, etc. Also, it’s free.

Ajay- How do you see the cloud computing paradigm growing? Do you think browser based computation is here to stay?

Michal – I believe that cloud infrastructure is the future. Dynamically sharing computational and network resources between online service providers has a great competitive advantage over traditional strategies to deal with network infrastructure. I am sure the security concerns will be resolved soon, finishing the transformation of the network infrastructure as we know it. On the other hand, however, I do not see a reason why client-side (or browser) processing of the information should cease to exist – I rather think that the border between the cloud and personal or local computer will continually dissolve.

About

Michal Kosinski is Director of Operations for The Psychometrics Centre and Leader of the e-Psychometrics Unit. He is also a research advisor to the Online Services and Advertising group at the Microsoft Research Cambridge, and a visiting lecturer at the Department of Mathematics in the University of Namur, Belgium. You can read more about him at http://www.michalkosinski.com/

You can read more about Concerto at http://code.google.com/p/concerto-platform/ and http://www.psychometrics.cam.ac.uk/page/300/concerto-testing-platform.htm

Note on Internet Privacy (Updated)and a note on DNSCrypt

I noticed the brouaha on Google’s privacy policy. I am afraid that social networks capture much more private information than search engines (even if they integrate my browser history, my social network, my emails, my search engine keywords) – I am still okay. All they are going to do is sell me better ads (maybe than just flood me with ads hoping to get a click). Of course Microsoft should take it one step forward and capture data from my desktop as well for better ads, that would really complete the curve. In any case , with the Patriot Act, most information is available to the Government anyway.

But it does make sense to have an easier to understand privacy policy, and one of my disappointments is the complete lack of visual appeal in such notices. Make things simple as possible, but no simpler, as Al-E said.

 

Privacy activists forget that ads run on models built on AGGREGATED data, and most models are scored automatically. Unless you do something really weird and fake like, chances are the data pertaining to you gets automatically collected, algorithmic-ally aggregated, then modeled and scored, and a corresponding ad to your score, or segment is shown to you. Probably no human eyes see raw data (but big G can clarify that)

 

( I also noticed Google gets a lot of free advice from bloggers. hey, if you were really good at giving advice to Google- they WILL hire you !)

on to another tool based (than legalese based approach to privacy)

I noticed tools like DNSCrypt increase internet security, so that all my integrated data goes straight to people I am okay with having it (ad sellers not governments!)

Unfortunately it is Mac Only, and I will wait for Windows or X based tools for a better review. I noticed some lag in updating these tools , so I can only guess that the boys of Baltimore have been there, so it is best used for home users alone.

 

Maybe they can find a chrome extension for DNS dummies.

http://www.opendns.com/technology/dnscrypt/

Why DNSCrypt is so significant

In the same way the SSL turns HTTP web traffic into HTTPS encrypted Web traffic, DNSCrypt turns regular DNS traffic into encrypted DNS traffic that is secure from eavesdropping and man-in-the-middle attacks.  It doesn’t require any changes to domain names or how they work, it simply provides a method for securely encrypting communication between our customers and our DNS servers in our data centers.  We know that claims alone don’t work in the security world, however, so we’ve opened up the source to our DNSCrypt code base and it’s available onGitHub.

DNSCrypt has the potential to be the most impactful advancement in Internet security since SSL, significantly improving every single Internet user’s online security and privacy.

and

http://dnscurve.org/crypto.html

The DNSCurve project adds link-level public-key protection to DNS packets. This page discusses the cryptographic tools used in DNSCurve.

Elliptic-curve cryptography

DNSCurve uses elliptic-curve cryptography, not RSA.

RSA is somewhat older than elliptic-curve cryptography: RSA was introduced in 1977, while elliptic-curve cryptography was introduced in 1985. However, RSA has shown many more weaknesses than elliptic-curve cryptography. RSA’s effective security level was dramatically reduced by the linear sieve in the late 1970s, by the quadratic sieve and ECM in the 1980s, and by the number-field sieve in the 1990s. For comparison, a few attacks have been developed against some rare elliptic curves having special algebraic structures, and the amount of computer power available to attackers has predictably increased, but typical elliptic curves require just as much computer power to break today as they required twenty years ago.

IEEE P1363 standardized elliptic-curve cryptography in the late 1990s, including a stringent list of security criteria for elliptic curves. NIST used the IEEE P1363 criteria to select fifteen specific elliptic curves at five different security levels. In 2005, NSA issued a new “Suite B” standard, recommending the NIST elliptic curves (at two specific security levels) for all public-key cryptography and withdrawing previous recommendations of RSA.

Some specific types of elliptic-curve cryptography are patented, but DNSCurve does not use any of those types of elliptic-curve cryptography.

 

Research on Social Games

Social Gaming is slightly different from arcade gaming, and the heavy duty PSP3, XBox, Wii world of gaming.  Some observations on my research ( 😉 ) on social gaming across internet is as follows-

There are mostly 3 types of social games-

1) Quest- Build a town/area/farm to earn in game money or points

2) Fight- fight other people /players /pigs earn in game money or points

3) Puzzle- Stack up, make three of a kind, etc

Most successful social games are a crossover between the above three kinds of social games (so build and fight, or fight and puzzle etc)

In addition most social games have some in game incentives that are peculiar to social networks only. In game incentives are mostly in game cash to build, energy to fight others, or shortcuts in puzzle games. These social gaming incentives are-

1) Some incentive to log in daily/regularly/visit game site more often

2) Some incentive to invite other players on the social network

A characteristic of this domain is blatant me-too, copying and ripping creative ideas (but not the creative itself)  from other social games. In general the successful game which is the early leader gets most of the players but other game studios can and do build up substantial long tail network of players by copying games. Thus there are a huge variety of games.

However there are massive hits like Farmville and Angry Birds, that prove that a single social game well executed can be very valuable and profitable to both itself as well as the primary social network hosting it.

Accordingly the leading game studios are Zynga, Electronic Arts and (yes) Microsoft while Google has been mostly a investor in these.

A good website for studying data about social games is http://www.appdata.com/ while a sister website for studying developments is  http://www.insidesocialgames.com/

As you can see below Appdata is a formidable data gatherer here (though I find the top App – Static HTML as both puzzling and a sign of un corrected automated data gathering),

but I expect more competition in this very lucrative segment.

 

 

Games on Google Plus get- Faster, Higher, Stronger

I am spending some time and some money on two games on Google Plus. One is Crime City at https://plus.google.com/games/865772480172 which I talk about in this post

http://www.decisionstats.com/google-plus-games-crime-city-or-fun-with-funzio-on-g/

and Global Warfare  https://plus.google.com/games/216622099218 (which is similar to Evony of the bad ads fame, and I will write on that in another post)

But the total number of games at Google Plus is increasingly and quietly getting better. It seems there is a distinct preference for existing blockbuster games , from both Zynga and non Zynga sources Even though Google is an investor in  Zynga, it clearly wants Google plus to avoid being so dependent on Zynga as Facebook clearly is. Continue reading “Games on Google Plus get- Faster, Higher, Stronger”