Thursday is for fun reading

Thats the world’s most widely read marketing textbook in slideshare format slides. You think you are a marketing guru expert at selling or promoting software- well spend 10 minutes flipping for a fun reading

and a presentation trying to be the worlds best presentation by putting social causes, geeky languages, hot looks in the same slides – Hi It is BO (not Barack Obama)

and if you are like me and suck at presentations , but unlike me would like to get better at presentations

if you are still reading this you probably have too much time on a Friday, so here is one YouTube poetry video I created while in a graphics design course in Vol State- it’s a mashuo of 12 poems, some Prezi, some music by  that big proft making Google machine called You Tub

Interesting R competition at Reddit

Image representing Reddit as depicted in Crunc...
Image via CrunchBase

Here is an interesting R competition going on at Reddit and it is to help Reddit make a recommendation engine 🙂

by ketralnis

As promised, here is the big dump of voting information that you guys donated to research. Warning: this contains much geekery that may result in discomfort for the nerd-challenged.

I’m trying to use it to build a recommender, and I’ve got some preliminary source code. I’m looking for feedback on all of these steps, since I’m not experienced at machine learning.

Here’s what I’ve done

  • I dumped all of the raw data that we’ll need to generate the public dumps. The queries are the comments in the two .pig files and it took about 52 minutes to do the dump against production. The result of this raw dump looks like:
    $ wc -l *.dump
     13,830,070 reddit_data_link.dump
    136,650,300 reddit_linkvote.dump
         69,489 reddit_research_ids.dump
     13,831,374 reddit_thing_link.dump
  • I filtered the list of votes for the list of users that gave us permission to use their data. For the curious, that’s 67,059 users: 62,763 with “public votes” and 6,726 with “allow my data to be used for research”. I’d really like to see that second category significantly increased, and hopefully this project will be what does it. This filtering is done by srrecs_researchers.pig and took 83m55.335s on my laptop.
  • I converted data-dumps that were in our DB schema format to a more useable format using srrecs.pig(about 13min)
  • From that dump I mapped all of the account_ids, link_ids, and sr_ids to salted hashes (using obscure() with a random seed, so even I don’t know it). This took about 13min on my laptop. The result of this, votes.dump is the file that is actually public. It is a tab-separated file consisting in:

    There are 23,091,688 votes from 43,976 users over 3,436,063 links in 11,675 reddits. (Interestingly these ~44k users represent almost 17% of our total votes). The dump is 2.2gb uncompressed, 375mb in bz2.

What to do with it

The recommendations system that I’m trying right now turns those votes into a set of affinities. That is, “67% of user #223’s votes on /r/ are upvotes and 52% on programming). To make these affinities (55m45.107s on my laptop):

 cat votes.dump | ./ "affinities_m()" | sort -S200m | ./ "affinities_r()" > affinities.dump

Then I turn the affinities into a sparse matrix representing N-dimensional co-ordinates in the vector space of affinities (scaled to -1..1 instead of 0..1), in the format used by R’s skmeans package (less than a minute on my laptop). Imagine that this matrix looks like

 pics       programming horseporn  bacon
          ---------- ---------- ----------- ---------  -----
ketralnis -0.5       (no votes) +0.45       (no votes) +1.0
jedberg   (no votes) -0.25      +0.95       +1.0       -1.0
raldi     +0.75      +0.75      +0.7        (no votes) +1.0

We build it like:

# they were already grouped by account_id, so we don't have to
# sort. changes to the previous step will probably require this
# step to have to sort the affinities first
cat affinities.dump | ./ "write_matrix('', 'affinities.clabel', 'affinities.rlabel')"

I pass that through an R program srrecs.r (if you don’t have R installed, you’ll need to install that, and the packageskmeans like install.packages('skmeans')). This program plots the users in this vector space finding clusters using a sperical kmeans clustering algorithm (on my laptop, takes about 10 minutes with 15 clusters and 16 minutes with 50 clusters, during which R sits at about 220mb of RAM)

# looks for the files created by write_matrix in the current directory
R -f ./srrecs.r

The output of the program is a generated list of cluster-IDs, corresponding in order to the order of user-IDs inaffinities.clabel. The numbers themselves are meaningless, but people in the same cluster ID have been clustered together.

Here are the files

These are torrents of bzip2-compressed files. If you can’t use the torrents for some reason it’s pretty trivial to figure out from the URL how to get to the files directly on S3, but please try the torrents first since it saves us a few bucks. It’s S3 seeding the torrents anyway, so it’s unlikely that direct-downloading is going to go any faster or be any easier.

  • votes.dump.bz2 — A tab-separated list of:
    account_id, link_id, sr_id, direction
  • For your convenience, a tab-separated list of votes already reduced to percent-affinities affinities.dump.bz2, formatted:
    account_id, sr_id, affinity (scaled 0..1)
  • For your convenience, affinities-matrix.tar.bz2 contains the R CLUTO format matrix files,affinities.clabelaffinities.rlabel

And the code

  • srrecs.pigsrrecs_researchers.pig — what I used to generate and format the dumps (you probably won’t need this)
  • — what I used to salt/hash the user information and generate the R CLUTO-format matrix files (you probably won’t need this unless you want different information in the matrix)
  • srrecs.r — the R-code to generate the clusters

Here’s what you can experiment with

  • The code isn’t nearly useable yet. We need to turn the generated clusters into an actual set of recommendations per cluster, preferably ordered by predicted match. We probably need to do some additional post-processing per user, too. (If they gave us an affinity of 0% to /r/askreddit, we shouldn’t recommend it, even if we predicted that the rest of their cluster would like it.)
  • We need a test suite to gauge the accuracy of the results of different approaches. This could be done by dividing the data-set in and using 80% for training and 20% to see if the predictions made by that 80% match.
  • We need to get the whole process to less than two hours, because that’s how often I want to run the recommender. It’s okay to use two or three machines to accomplish that and a lot of the steps can be done in parallel. That said we might just have to accept running it less often. It needs to run end-to-end with no user-intervention, failing gracefully on error
  • It would be handy to be able to idenfity the cluster of just a single user on-the-fly after generating the clusters in bulk
  • The results need to be hooked into the reddit UI. If you’re willing to dive into the codebase, this one will be important as soon as the rest of the process is working and has a lot of room for creativity
  • We need to find the sweet spot for the number of clusters to use. Put another way, how many different types of redditors do you think there are? This could best be done using the aforementioned test-suite and a good-old-fashioned binary search.

Some notes:

  • I’m not attached to doing this in R (I don’t even know much R, it just has a handy prebaked skmeans implementation). In fact I’m not attached to my methods here at all, I just want a good end-result.
  • This is my weekend fun project, so it’s likely to move very slowly if we don’t pick up enough participation here
  • The final version will run against the whole dataset, not just the public one. So even though I can’t release the whole dataset for privacy reasons, I can run your code and a test-suite against it



I am thinking of using Rattle and using the arules package, and running it on the EC2 to get the horsepower.

How else do you think you can tackle a recommendation engine problem.




Microsoft Online Games

No, this is not about the X Box kind of games. It is about Microsoft ‘s tactical shift in the online space from going it alone, and building stuff itself, –to partnering, and sometimes investing and exiting business.

In Blogs- It recently announced a migration of MS Live Spaces to – It gives Automattic 30 million more users- no small change consider there were 26 million existing WP users.

Microsoft Messenger, which is the oldest online app in the suite, now provides instant messaging services to about 350 million users, and from now on Windows Live Writer works specifically with the blog service by default. Hopefully Skype, and Google Voice will show MS the way to monitize that business app yet.

Google buying blogger-blogspot seems to have done little, but given Biz Stone room to create another content disruption-Twitter.

With the round of lawsuits by proxy, in Android -Motorola, or for acquisitions – MS is just doing what Marc Anderseen (who’s apparently a better VC than Paul Allen was), Sun and co did to it in the nineties.

Google seems to be regretting putting a spade in the Yahoo acquisition- that would have tied up a big chunk of Idle MS cash- leaving it little room for niche investments (like the 250 mill that helped Facebook ramp up in time).

The real surprise here could be Apple- it has shown little interest in cloud computing- and it seems to be testing the waters with Ping. But Apple sure smells competition- and Android is doing to Iphone what Windows did to the Mac in the early 1990’s.

Google lacks presence in online gaming (despite it’s own Zynga investment)- and needs to start monetizing properties like Android OS (say 10$ for every phone license ??), Google Maps (as an app for GPS) and Google Voice. Indeed it may be time for the big G to start thinking of spinning off atleast some products- earning better returns, while retaining control (dual stock splits) and killing those anti trust lawyer fees forever.

As the Ancient Chinese said, May you live in interesting times. Fun to watch the online games people play.



Blog Update

Some changes at Decisionstats-

1) We are back at and will point to that as well. The SEO effects would be interesting and so would be the Instant Pagerank or LinkRank or whatever Coffee/Percolator they use in Cali to index the site.

2) AsterData is no longer a sponsor- but Predictive Analytics Conference is. Welcome PAWS! I have been a blog partner to PAWS ever since it began- and it’s a great marketing fit. Expect to see a lot of exclusive content and interviews from great speakers at PAWS.

3) The Feedblitz newsletter (now at 404 subscribers) is now a weekly subscription to send one big big email rather than lots of email through the week- this is because my blogging frequency is moving up as I collect material for a new book on business analytics that I would probably release in 2011 (if all goes well, touchwood). Linkedin group would be getting a weekly update announcement. If you are connected to Decisionstats on Analyticbridge _ I would soon try to find a way to update the whole post automatically using RSS and . or not. Depends.

4) R continues to be a bigger focus. So will SPSS and maybe JMP. Newer softwares or older softwares that change more rapidly would get more coverage. Generally a particular software is covered if it has newer features, or an interesting techie conference, or it gets sued.

5) I will occasionally write a poem or post a video once a week randomly to prove geeks and nerds and analysts can have fun (much more fun actually dont we)

Thanks for reading this. Sept 2010 was the best ever for – we crossed 15,000 + visitors and thanks for that again! I promise to bore you less and less as we grow old together on the blog 😉

The Great Driving Challenge- coolest young couples

Here is one of the new startups in India. A batch mate from B school whom I owe too many beers, and too few

calculus notes —–well he asked me to help him vote. Treat this as shameless self promotion just like ‘s moustache and R rated R stats profanity on #rstats in twitter

Please do vote and read- they are a fun couple.

The Great Driving Challenge