Interesting R competition at Reddit

Image representing Reddit as depicted in Crunc...
Image via CrunchBase

Here is an interesting R competition going on at Reddit and it is to help Reddit make a recommendation engine 🙂

http://www.reddit.com/r/redditdev/comments/dtg4j/want_to_help_reddit_build_a_recommender_a_public/

by ketralnis

As promised, here is the big dump of voting information that you guys donated to research. Warning: this contains much geekery that may result in discomfort for the nerd-challenged.

I’m trying to use it to build a recommender, and I’ve got some preliminary source code. I’m looking for feedback on all of these steps, since I’m not experienced at machine learning.

Here’s what I’ve done

  • I dumped all of the raw data that we’ll need to generate the public dumps. The queries are the comments in the two .pig files and it took about 52 minutes to do the dump against production. The result of this raw dump looks like:
    $ wc -l *.dump
     13,830,070 reddit_data_link.dump
    136,650,300 reddit_linkvote.dump
         69,489 reddit_research_ids.dump
     13,831,374 reddit_thing_link.dump
    
  • I filtered the list of votes for the list of users that gave us permission to use their data. For the curious, that’s 67,059 users: 62,763 with “public votes” and 6,726 with “allow my data to be used for research”. I’d really like to see that second category significantly increased, and hopefully this project will be what does it. This filtering is done by srrecs_researchers.pig and took 83m55.335s on my laptop.
  • I converted data-dumps that were in our DB schema format to a more useable format using srrecs.pig(about 13min)
  • From that dump I mapped all of the account_ids, link_ids, and sr_ids to salted hashes (using obscure() insrrecs.py with a random seed, so even I don’t know it). This took about 13min on my laptop. The result of this, votes.dump is the file that is actually public. It is a tab-separated file consisting in:
    account_id,link_id,sr_id,dir
    

    There are 23,091,688 votes from 43,976 users over 3,436,063 links in 11,675 reddits. (Interestingly these ~44k users represent almost 17% of our total votes). The dump is 2.2gb uncompressed, 375mb in bz2.

What to do with it

The recommendations system that I’m trying right now turns those votes into a set of affinities. That is, “67% of user #223’s votes on /r/reddit.com are upvotes and 52% on programming). To make these affinities (55m45.107s on my laptop):

 cat votes.dump | ./srrecs.py "affinities_m()" | sort -S200m | ./srrecs.py "affinities_r()" > affinities.dump

Then I turn the affinities into a sparse matrix representing N-dimensional co-ordinates in the vector space of affinities (scaled to -1..1 instead of 0..1), in the format used by R’s skmeans package (less than a minute on my laptop). Imagine that this matrix looks like

          reddit.com pics       programming horseporn  bacon
          ---------- ---------- ----------- ---------  -----
ketralnis -0.5       (no votes) +0.45       (no votes) +1.0
jedberg   (no votes) -0.25      +0.95       +1.0       -1.0
raldi     +0.75      +0.75      +0.7        (no votes) +1.0
...

We build it like:

# they were already grouped by account_id, so we don't have to
# sort. changes to the previous step will probably require this
# step to have to sort the affinities first
cat affinities.dump | ./srrecs.py "write_matrix('affinities.cm', 'affinities.clabel', 'affinities.rlabel')"

I pass that through an R program srrecs.r (if you don’t have R installed, you’ll need to install that, and the packageskmeans like install.packages('skmeans')). This program plots the users in this vector space finding clusters using a sperical kmeans clustering algorithm (on my laptop, takes about 10 minutes with 15 clusters and 16 minutes with 50 clusters, during which R sits at about 220mb of RAM)

# looks for the files created by write_matrix in the current directory
R -f ./srrecs.r

The output of the program is a generated list of cluster-IDs, corresponding in order to the order of user-IDs inaffinities.clabel. The numbers themselves are meaningless, but people in the same cluster ID have been clustered together.

Here are the files

These are torrents of bzip2-compressed files. If you can’t use the torrents for some reason it’s pretty trivial to figure out from the URL how to get to the files directly on S3, but please try the torrents first since it saves us a few bucks. It’s S3 seeding the torrents anyway, so it’s unlikely that direct-downloading is going to go any faster or be any easier.

  • votes.dump.bz2 — A tab-separated list of:
    account_id, link_id, sr_id, direction
    
  • For your convenience, a tab-separated list of votes already reduced to percent-affinities affinities.dump.bz2, formatted:
    account_id, sr_id, affinity (scaled 0..1)
    
  • For your convenience, affinities-matrix.tar.bz2 contains the R CLUTO format matrix files affinities.cm,affinities.clabelaffinities.rlabel

And the code

  • srrecs.pigsrrecs_researchers.pig — what I used to generate and format the dumps (you probably won’t need this)
  • mr_tools.pysrrecs.py — what I used to salt/hash the user information and generate the R CLUTO-format matrix files (you probably won’t need this unless you want different information in the matrix)
  • srrecs.r — the R-code to generate the clusters

Here’s what you can experiment with

  • The code isn’t nearly useable yet. We need to turn the generated clusters into an actual set of recommendations per cluster, preferably ordered by predicted match. We probably need to do some additional post-processing per user, too. (If they gave us an affinity of 0% to /r/askreddit, we shouldn’t recommend it, even if we predicted that the rest of their cluster would like it.)
  • We need a test suite to gauge the accuracy of the results of different approaches. This could be done by dividing the data-set in and using 80% for training and 20% to see if the predictions made by that 80% match.
  • We need to get the whole process to less than two hours, because that’s how often I want to run the recommender. It’s okay to use two or three machines to accomplish that and a lot of the steps can be done in parallel. That said we might just have to accept running it less often. It needs to run end-to-end with no user-intervention, failing gracefully on error
  • It would be handy to be able to idenfity the cluster of just a single user on-the-fly after generating the clusters in bulk
  • The results need to be hooked into the reddit UI. If you’re willing to dive into the codebase, this one will be important as soon as the rest of the process is working and has a lot of room for creativity
  • We need to find the sweet spot for the number of clusters to use. Put another way, how many different types of redditors do you think there are? This could best be done using the aforementioned test-suite and a good-old-fashioned binary search.

Some notes:

  • I’m not attached to doing this in R (I don’t even know much R, it just has a handy prebaked skmeans implementation). In fact I’m not attached to my methods here at all, I just want a good end-result.
  • This is my weekend fun project, so it’s likely to move very slowly if we don’t pick up enough participation here
  • The final version will run against the whole dataset, not just the public one. So even though I can’t release the whole dataset for privacy reasons, I can run your code and a test-suite against it

——————————————————————————————-

 

I am thinking of using Rattle and using the arules package, and running it on the EC2 to get the horsepower.

How else do you think you can tackle a recommendation engine problem.

 

Ajay

 

Revolution R for Linux

Screenshot of the Redhat Enterprise Linux Desktop
Image via Wikipedia

New software just released from the guys in California (@RevolutionR) so if you are a Linux user and have academic credentials you can download it for free  (@Cmastication doesnt), you can test it to see what the big fuss is all about (also see http://www.revolutionanalytics.com/why-revolution-r/benchmarks.php) –

Revolution Analytics has just released Revolution R Enterprise 4.0.1 for Red Hat Enterprise Linux, a significant step forward in enterprise data analytics. Revolution R Enterprise 4.0.1 is built on R 2.11.1, the latest release of the open-source environment for data analysis and graphics. Also available is the initial release of our deployment server solution, RevoDeployR 1.0, designed to help you deliver R analytics via the Web. And coming soon to Linux: RevoScaleR, a new package for fast and efficient multi-core processing of large data sets.

As a registered user of the Academic version of Revolution R Enterprise for Linux, you can take advantage of these improvements by downloading and installing Revolution R Enterprise 4.0.1 today. You can install Revolution R Enterprise 4.0.1 side-by-side with your existing Revolution R Enterprise installations; there is no need to uninstall previous versions.

Download Information

The following information is all you will need to download and install the Academic Edition.

Supported Platforms:

Revolution R Enterprise Academic edition and RevoDeployR are supported on Red Hat® Enterprise Linux® 5.4 or greater (64-bit processors).

Approximately 300MB free disk space is required for a full install of Revolution R Enterprise. We recommend at least 1GB of RAM to use Revolution R Enterprise.

For the full list of system requirements for RevoDeployR, refer to the RevoDeployR™ Installation Guide for Red Hat® Enterprise Linux®.

Download Links:

You will first need to download the Revolution R Enterprise installer.

Installation Instructions for Revolution R Enterprise Academic Edition

After downloading the installer, do the following to install the software:

  • Log in as root if you have not already.
  • Change directory to the directory containing the downloaded installer.
  • Unpack the installer using the following command:
    tar -xzf Revo-Ent-4.0.1-RHEL5-desktop.tar.gz
  • Change directory to the RevolutionR_4.0.1 directory created.
  • Run the installer by typing ./install.py and following the on-screen prompts.

Getting Started with the Revolution R Enterprise

After you have installed the software, launch Revolution R Enterprise by typing Revo64 at the shell prompt.

Documentation is available in the form of PDF documents installed as part of the Revolution R Enterprise distribution. Type Revo.home(“doc”) at the R prompt to locate the directory containing the manuals Getting Started with Revolution R (RevoMan.pdf) and the ParallelR User’s Guide(parRman.pdf).

Installation Instructions for RevoDeployR (and RServe)

After downloading the RevoDeployR distribution, use the following steps to install the software:

Note: These instructions are for an automatic install.  For more details or for manual install instructions, refer to RevoDeployR_Installation_Instructions_for_RedHat.pdf.

  1. Log into the operating system as root.
    su –
  2. Change directory to the directory containing the downloaded distribution for RevoDeployR and RServe.
  3. Unzip the contents of the RevoDeployR tar file. At prompt, type:
    tar -xzf deployrRedHat.tar.gz
  4. Change directories. At the prompt, type:
    cd installFiles
  5. Launch the automated installation script and follow the on-screen prompts. At the prompt, type:
    ./installRedHat.sh
    Note: Red Hat installs MySQL without a password.

Getting Started with RevoDeployR

After installing RevoDeployR, you will be directed to the RevoDeployR landing page. The landing page has links to documentation, the RevoDeployR management console, the API Explorer development tool, and sample code.

Support

For help installing this Academic Edition, please email support@revolutionanalytics.com

Also interestingly some benchmarks on Revolution R vs R.

http://www.revolutionanalytics.com/why-revolution-r/benchmarks.php

R-25 Benchmarks

The simple R-benchmark-25.R test script is a quick-running survey of general R performance. The Community-developed test consists of three sets of small benchmarks, referred to in the script as Matrix Calculation, Matrix Functions, and Program Control.

R-25 Matrix Calculation R-25 Matrix Functions R-Matrix Program Control
R-25 Benchmarks Base R 2.9.2 Revolution R (1-core) Revolution R (4-core) Speedup (4 core)
Matrix Calculation 34 sec 6.6 sec 4.4 sec 7.7x
Matrix Functions 20 sec 4.4 sec 2.1 sec 9.5x
Program Control 4.7 sec 4 sec 4.2 sec Not Appreciable

Speedup = Slower time / Faster Time – 1   Test descriptions available at http://r.research.att.com/benchmarks

Additional Benchmarks

Revolution Analytics has created its own tests to simulate common real-world computations.  Their descriptions are explained below.

Matrix Multiply Cholesky Factorization
Singular Value Decomposition Principal Component Analysis Linear Discriminant Analysis
Linear Algebra Computation Base R 2.9.2 Revolution R (1-core) Revolution R (4-core) Speedup (4 core)
Matrix Multiply 243 sec 22 sec 5.9 sec 41x
Cholesky Factorization 23 sec 3.8 sec 1.1 sec 21x
Singular Value Decomposition 62 sec 13 sec 4.9 sec 12.6x
Principal Components Analysis 237 sec 41 sec 15.6 sec 15.2x
Linear Discriminant Analysis 142 sec 49 sec 32.0 sec 4.4x

Speedup = Slower time / Faster Time – 1

Matrix Multiply

This routine creates a random uniform 10,000 x 5,000 matrix A, and then times the computation of the matrix product transpose(A) * A.

set.seed (1)
m <- 10000
n <-  5000
A <- matrix (runif (m*n),m,n)
system.time (B <- crossprod(A))

The system will respond with a message in this format:

User   system elapsed
37.22    0.40   9.68

The “elapsed” times indicate total wall-clock time to run the timed code.

The table above reflects the elapsed time for this and the other benchmark tests. The test system was an INTEL® Xeon® 8-core CPU (model X55600) at 2.5 GHz with 18 GB system RAM running Windows Server 2008 operating system. For the Revolution R benchmarks, the computations were limited to 1 core and 4 cores by calling setMKLthreads(1) and setMKLthreads(4) respectively. Note that Revolution R performs very well even in single-threaded tests: this is a result of the optimized algorithms in the Intel MKL library linked to Revolution R. The slight greater than linear speedup may be due to the greater total cache available to all CPU cores, or simply better OS CPU scheduling–no attempt was made to pin execution threads to physical cores. Consult Revolution R’s documentation to learn how to run benchmarks that use less cores than your hardware offers.

Cholesky Factorization

The Cholesky matrix factorization may be used to compute the solution of linear systems of equations with a symmetric positive definite coefficient matrix, to compute correlated sets of pseudo-random numbers, and other tasks. We re-use the matrix B computed in the example above:

system.time (C <- chol(B))

Singular Value Decomposition with Applications

The Singular Value Decomposition (SVD) is a numerically-stable and very useful matrix decompisition. The SVD is often used to compute Principal Components and Linear Discriminant Analysis.

# Singular Value Deomposition
m <- 10000
n <- 2000
A <- matrix (runif (m*n),m,n)
system.time (S <- svd (A,nu=0,nv=0))

# Principal Components Analysis
m <- 10000
n <- 2000
A <- matrix (runif (m*n),m,n)
system.time (P <- prcomp(A))

# Linear Discriminant Analysis
require (‘MASS’)
g <- 5
k <- round (m/2)
A <- data.frame (A, fac=sample (LETTERS[1:g],m,replace=TRUE))
train <- sample(1:m, k)
system.time (L <- lda(fac ~., data=A, prior=rep(1,g)/g, subset=train))

The auto-suggest link/tags for WP.com blogs

WordPress.com blogs have a great new option for generating tags, and links and thus improving their search engine optimization for posts.

Just go to Users-Personal Settings- and check the options shown. Thats it every time you write a post it suggests links and tags. Links are helpful for your readers (like Wikipedia links to understand dense technical jargon, or associated websites). Tags help to classify your contents so that all visitors to the web site including spiders ,search engines and your readers can search it better.

The bad thing is I need to go back to all 1025 posts on this site and auto generate tags for the archives ! Oh well. Great collaboration between zementa and Automattic for this new feature.

PAW Reception and R Meetup

New DC meetup for R Users-

source- http://www.meetup.com/R-users-DC/calendar/14236478/

October’s R meet-up will be co-located with the Predictive Analytics World Conference (http://www.predictive…) taking place in Washington DC October 19-20. PAW is the premiere business-focused event for predictive analytics professionals, managers and commercial practitioners.

Agenda:

6:30 – 7:30 PAW Reception (open to meet-up attendees)
7:30 – 9:00 DC-R Meetup

Talks:
“How to speak ggplot2 like a native”
Harlan D. Harris, PhD @HarlanH

“Saving the world with R”
Michael Milton @michaelmilton

Important Registration Instructions:
You are welcome to RSVP here at meetup. The PAW organizers have requested that we register in the PAW site for the R meetup so they can provide badges to members which will give you access to the reception. There is no charge to register using the PAW site. Please click here to register.


Speaker Bios

Harlan D. Harris, PhD, is a statistical data scientist working for Kaplan Test Prep and Admissions in New York City. He has degrees from the University of Wisconsin-Madison and the University of Illinois at Urbana-Champaign. Prior to turning to the private sector, he worked as a researcher and lecturer in various areas of Artificial Intelligence and Cognitive Science at the University of Illinois, Columbia University, the University of Connecticut, and New York University.

Harlan’s talk is titled “How to speak ggplot2 like a native.”. One of the most innovative ideas in data visualization in recent years is that graphical images can be described using a grammar. Just as a fluent speaker of a language can talk more precisely and clearly than someone using a tourist phrasebook, graphics based on a grammar can yield more insights than graphics based on a limited set of templates (bar chart, pie graph, etc.). There are at least two implementations of the Grammar of Graphics idea in R, of which the most popular is the ggplot2 package written by Prof. Hadley Wickham. Just as with natural languages, ggplot2 has a surface structure made up of R vocabulary elements, as well as a deep structure that mediates the link between the vocabulary and the “semantic” representation of the data shown on a computer screen. In this introductory presentation, the links among these levels of representation are demonstrated, so that new ggplot2 users can build the mental models necessary for fluent and creative visualization of their data.

Michael Milton is a Client Manager at Blue State Digital. When he’s not saving the world by designing interactive marketing strategies that connect passionate users with causes and organizations, he writes about data and analytics. For O’Reilly Media, he wrote Head First Data Analysis and Head First Excel and has created the videos Great R: Level 1 and Getting the Most Out of Google Apps for Business.

Michael’s talk is called “How to Save the World Using R.” In this wide-ranging discussion, Michael will highlight individuals and organizations who are using R to help others as well as ways in which R can be used to promote good statistical thinking.

Data Mining 2010:SAS Conference in Vegas

An interesting conference which I attended last year, this year one of the main guests is an ex professor of mine at UTenn. I am India bound this year though for family reasons.

http://www.sas.com/events/dmconf/over.html

Latest News

Early Bird Special
Register for M2010 before Sept. 17 and save $200 on conference fees!

Additional Data Mining Resources
Find additional data mining resouces including links to whitepapers, webinars, audio seminars, videos, blogs and online communities.

Location
Caesars Palace
Las Vegas, NV

Conference: October 25-26
Pre-conference workshops: October 24
Post-conference training: October 27-29

The M2010 Data Mining Conference is an international educational conference and exhibition for data mining practitioners including analysts, statisticians, programmers, consultants and anyone involved with data management within their organization, Hosted by SAS, M2010 is now in its 13th year and has become the world’s largest data mining conference, attracting over 600 people from various industries including Financial Services, Retail, Insurance, Technology, Education, Healthcare, Pharmaceutical, Government and more.

This conference is the top-choice for serious education and career networking. Conference highlights include

  • 6 keynotes
  • 36 sessions
  • 6 session tracks
  • exhibit hall
  • poster session
  • SAS software training
  • educational workshops
  • special events
  • networking opportunities
  • predictive modeling certification testing event.

Session Topics

  • Business applications
  • Data augmentation
  • Perspectives from the financial services industry
  • Fraud detection
  • Perspectives from the healthcare industry
  • New and emerging technologies
  • Perspectives from the retail industry
  • Data mining in marketing
  • Retention and Life Cycle Analysis
  • Text mining
  • And more! (View session abstracts.)

GNU PSPP- The Open Source SPSS

If you are SPSS user (for statistics/ not data mining) you can also try 0ut GNU PSPP- which is the open source equivalent and quite eerily impressive in performance. It is available at http://www.gnu.org/software/pspp/ or http://pspp.awardspace.com/ and you can also read more at http://en.wikipedia.org/wiki/PSPP

PSPP is a program for statistical analysis of sampled data. It is a Free replacement for the proprietary program SPSS, and appears very similar to it with a few exceptions.

[ Image of Variable Sheet ]The most important of these exceptions are, that there are no “time bombs”; your copy of PSPP will not “expire” or deliberately stop working in the future. Neither are there any artificial limits on the number of cases or variables which you can use. There are no additional packages to purchase in order to get “advanced” functions; all functionality that PSPP currently supports is in the core package.

PSPP can perform descriptive statistics, T-tests, linear regression and non-parametric tests. Its backend is designed to perform its analyses as fast as possible, regardless of the size of the input data. You can use PSPP with its graphical interface or the more traditional syntax commands.

A brief list of some of the features of PSPP follows:

  • Supports over 1 billion cases.
  • Supports over 1 billion variables.
  • Syntax and data files are compatible with SPSS.
  • Choice of terminal or graphical user interface.
  • Choice of text, postscript or html output formats.
  • Inter-operates with GnumericOpenOffice.Org and other free software.
  • Easy data import from spreadsheets, text files and database sources.
  • Fast statistical procedures, even on very large data sets.
  • No license fees.
  • No expiration period.
  • No unethical “end user license agreements”.
  • Fully indexed user manual.
  • Free Software; licensed under GPLv3 or later.
  • Cross platform; Runs on many different computers and many different operating systems.

PSPP is particularly aimed at statisticians, social scientists and students requiring fast convenient analysis of sampled data.

and

Features

This software provides a basic set of capabilities: frequencies, cross-tabs comparison of means (T-tests and one-way ANOVA); linear regression, reliability (Cronbach’s Alpha, not failure or Weibull), and re-ordering data, non-parametric tests, factor analysis and more.

At the user’s choice, statistical output and graphics are done in asciipdfpostscript or html formats. A limited range of statistical graphs can be produced, such as histogramspie-charts and np-charts.

PSPP can import GnumericOpenDocument and Excel spreadsheetsPostgres databasescomma-separated values– and ASCII-files. It can export files in the SPSS ‘portable’ and ‘system’ file formats and to ASCII files. Some of the libraries used by PSPP can be accessed programmatically; PSPP-Perl provides an interface to the libraries used by PSPP.

Origins

The PSPP project (originally called “Fiasco”) is a free, open-source alternative to the proprietary statistics package SPSS. SPSS is closed-source and includes a restrictive licence anddigital rights management. The author of PSPP considered this ethically unacceptable, and decided to write a program which might with time become functionally identical to SPSS, except that there would be no licence expiry, and everyone would be permitted to copy, modify and share the program.

Release history

  • 0.7.5 June 2010 http://pspp.awardspace.com/
  • 0.6.2 October 2009
  • 0.6.1 October 2008
  • 0.6.0 June 2008
  • 0.4.0.1 August 2007
  • 0.4.0 August 2005
  • 0.3.0 April 2004
  • 0.2.4 January 2000
  • 0.1.0 August 1998

Third Party Reviews

In the book “SPSS For Dummies“, the author discusses PSPP under the heading of “Ten Useful Things You Can Find on the Internet” [1]. In 2006, the South African Statistical Association presented a conference which included an analysis of how PSPP can be used as a free replacement to SPSS [2].

Citation-

Please send FSF & GNU inquiries to gnu@gnu.org. There are also other ways to contact the FSF. Please send broken links and other corrections (or suggestions) to bug-gnu-pspp@gnu.org.

Copyright © 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007 Free Software Foundation, Inc., 51 Franklin St – Suite 330, Boston, MA 02110, USA – Verbatim copying and distribution of this entire article are permitted worldwide, without royalty, in any medium, provided this notice, and the copyright notice, are preserved.

Best Internet Site of 2009

Here is the best internet site of 2009.
It basically shows how many jobs have been created per dollar spent.
Funded by the debt of American Treasuries………

Here is the best internet site of 2009.
It basically shows how many jobs have been created per dollar spent.
Funded by the debt of American Treasuries
sold to Chinese.

Remember the Chinese Opium Wars.
Well the Chinese are hooked to American Treasuries and they probably need a Warship with Admiral to open their markets and currency. Oui!

Well anyway the website is called http://Recovery.gov