Data Science for Olympics and lack of Reproducible Research

Despite the plethora of data generated in sports, there is not much open data for the Olympics. One wonders whether openly sharing data and best practices on what works and what does not could reduce the number of Russian athletes being banned in what looks like a cyclical, Cold War era game.

Some links I found useful

Could data mining techniques accurately predict the medal counts at the Olympics? A predictive model could give us an estimate of the number of medals each nation might win; but how close could we get to the actual outcomes? It was a tantalizing project …

Sochi-Ru By Dan Graettinger with Tim Graettinger

• Which nation will bring home the most medals at the upcoming Winter Olympics in Sochi, Russia?

• Will any nation from Africa, South America, or the Middle East finally break through and win a medal?

• Why do some nations win a bundle of medals while others win only a few?

• Can data mining give us the answers to these questions?


What did the Graettinger brothers do? They used a seemingly standard methodology: learn from the past to predict the future. More precisely, they used past Olympics results to build a predictive model. Each country is represented by a feature vector, i.e. a set of quantities drawn from several categories:

  • Economic
  • Population
  • Human Development
  • Geography
  • Religion
  • Politics and Freedom

Then they used a standard technique known as linear regression to find which set of features was best for predicting medal count. I was reading their blog post with great interest until I saw which features the linear regression algorithm found most meaningful:

  • Geographic area
  • GDP per capita
  • Value of Exports
  • Latitude of Nation’s Capital
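
To make the regression step concrete, here is a minimal sketch of ordinary least squares in plain Python. It is not the Graettingers' model or their data: the two feature names (a scaled GDP per capita and a scaled capital latitude), the helper functions, and every number are hypothetical, and the targets are generated from a known formula so the fit can be checked.

```python
# Toy ordinary least squares in pure Python.
# ALL numbers below are SYNTHETIC placeholders, not real Olympic or country data.

def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.

    Assumes the matrix is square and non-singular (no error handling here).
    """
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Move the row with the largest pivot candidate into place.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_ols(features, targets):
    """Fit intercept + weights by solving the normal equations (X^T X) w = X^T y."""
    rows = [[1.0] + list(f) for f in features]  # leading 1.0 is the intercept column
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(weights, feature_row):
    """Return intercept + dot(weights, features) for one country."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], feature_row))


# Hypothetical feature rows: (gdp_per_capita_scaled, capital_latitude_scaled).
features = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0), (5.0, 6.0)]
# Synthetic "medal counts" generated as 2 + 3 * gdp - 1 * latitude.
targets = [2.0 + 3.0 * g - 1.0 * lat for g, lat in features]

weights = fit_ols(features, targets)
print("intercept and weights:", [round(w, 3) for w in weights])
print("prediction for (2.5, 2.5):", round(predict(weights, (2.5, 2.5)), 3))
```

With real data the fit would not be exact, and comparing the weights only makes sense after the features have been put on comparable scales.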


I was able to find data in many categories:

  • Economic
  • Population
  • Human Development
  • Geography
  • Religion
  • Politics and Freedom

Thankfully, there were some good sources out there[f3], and I collected enough data that I felt I had a good chance to predict some meaningful outcomes.  But would it be enough?  There is more than one way to go about predicting the medal count at the Olympics, and the route before me was the “30,000 feet” approach.

So any takers?

Hackers for Hacking the Olympics 🙂



Latest DecisionStats Intern

Congratulations to our latest intern for completing the intensive internship at DecisionStats. See her work here:

Her latest blog post uses Python to try to understand police shootings in the USA


Previous interns wrote great Python and R code

see (Sarah Masud and Farheen)


Anshul Gupta

Cricket Analysis –



Chandan Routray


Some points for future interns at DecisionStats-

  1. We normally don't pay interns anything
  2. 80% of interns drop out or are let go because they cannot keep up with the assignments
  3. The remaining 20% usually learn a lot in the intensive program
  4. Internships are like a free boot camp
  5. No more internships till June 2017 because I am trying to write a book
  6. Some research assistantships might be available in December 2016 to help with some code or LyX formatting for the book
  7. See my LinkedIn profile for reviews given by the 20% of interns who manage to stick around
  8. I usually emphasize writing, polyglot tools (R, SAS and Python), logical thinking and concise communication for my interns
  9. I usually treat them as students since I don't work for or in a university. That might change as I try to transition out of business to academic research options for a non-PhD


Movie Review: Sultan

I had just come back after 5 months in the land of the brave and free (Trumpistan), so it was my first Bollywood movie in some time, and Salman Khan did not disappoint in his annual Eid outing. Sultan follows a man who becomes a wrestler to impress a girl and then gets separated from her. Fairly standard sports movie fare of rise, fall and rise again, though with a great Indian twist, and I really think they could make Sultan 2. The heroine should be Ronda Rousey in that one though. Anushka is not credible enough as a skinny wrestling prodigy, though Salman easily does the best acting of his career (minus the dancing).

The songs were ok ok, though the dialogues were amazing. Bhai rocks




Movie Review: Star Trek Beyond

This Star Trek movie was a bit of a slow starter. There was no Khan / Benedict Cumberbatch to play the foil, which would be a tough act to follow, and no JJ Abrams either. No cameo by Spock Prime, though the only emotional high points were his reminiscences. New Chekov is dead and that was a bummer. Bones and Karl Urban still rock! New Sulu is gay, sigh, we live in politically correct times, but what a cheesy way to honor George Takei's work.

I wish they showed a bit of Kirk Prime someday. New Spock and new Kirk clearly seemed a bit tired. Oh no, third movie of a franchise already, they said. The soundtrack is oki doki, and actually the Beastie Boys song was the worst bit of casting since George Clooney was cast as the Dark Knight.

The plot seemed nice enough and so were the graphics, but Idris Elba just did not seem menacing enough. See it if you are a fan boy, but rather pop in an older version of the TV series if you really need Vitamin Trek. I think it is time they brought in Picard, eh amigo

Send in Rogue One, me hearties



Why law enforcement continues to be reactive rather than predictive

First of all, my dad worked 37 years in Indian federal law enforcement, so this is not a song to the heroes (no heroes, they are doing a job they chose to do) or a left wing critique (it IS a demanding job). As with most of my writing, I will try (and succeed a bit and fail a bit) to be objective.

  1. Law enforcement here means cops, police officers, first responders, peace officers. It does not include detectives, federal agencies and military based intelligence.
  2. It means people who try to deter crime or terrorism by reacting fast, being present in force, or anticipating where crime or attacks are likely to happen
  3. It is usually a shift based system, involving employees who are less skilled than those in other agencies
  4. A rotation system involving other agencies would help with both coordination and upskilling (e.g. in some professional armies, the logistics people spend 3 years in infantry. Some policemen should spend time in federal agencies and vice versa as per rotation)
  5. Over investment in weapons (hardware), under investment in training (especially cross geography and cross agency) and upskilling (in the age of the Internet and social media), and almost zero investment in formal analytics (spatial analytics, time series analytics, where is crime likely to happen, who or what is a significant factor, how effective have frisk-check-patrolling efforts been) mean that law enforcement and community relationships continue to be maligned in democracies like India, USA, France. In a non democracy, there is no one to complain to and no one to complain of.
  6. In the national defense and homeland security budget, local law enforcement gets lower priority even though it remains the first line of defense in situations of maximum harm

It is often said (the line is usually attributed to Edmund Burke, not Gandhi) that for evil to flourish the only thing that needs to happen is for good men to do nothing. Law enforcement in democracies critically needs investment in cyber science and data science to restore community relationships.


Making a fresh data science box by updating your Ubuntu

Operating System

  1. Find out if you are 32 bit or 64 bit (the output below is from lscpu)


Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
Thread(s) per core:    2
Core(s) per socket:    2
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 69
Model name:            Intel(R) Core(TM) i3-4005U CPU @ 1.70GHz
Stepping:              1
CPU MHz:               1698.273
CPU max MHz:           1700.0000
CPU min MHz:           800.0000
BogoMIPS:              3392.00
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              3072K
NUMA node0 CPU(s):     0-3
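
The same check can be sketched from Python using only the standard library. The function name `describe_architecture` is made up for this example. Note that the pointer size reflects how the running Python interpreter was built, which can be 32-bit even on a 64-bit CPU, so treat it as a cross-check rather than a replacement for the listing above.

```python
import platform
import struct


def describe_architecture():
    """Return (machine string, pointer size in bits) for the running interpreter."""
    machine = platform.machine()             # e.g. "x86_64" on a 64-bit Intel box
    pointer_bits = struct.calcsize("P") * 8  # size of a C pointer in this Python build
    return machine, pointer_bits


machine, pointer_bits = describe_architecture()
print(f"machine: {machine}, Python pointer size: {pointer_bits}-bit")
```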

  2. Upgrade to Ubuntu 16 (download it and make a fresh startup USB)
  3. Download Chrome


Dependencies from Terminal

  • Package Manager Synaptic: sudo apt-get install synaptic
  • Package Manager Software Centre: sudo apt-get install software-center*
  • SSL: sudo apt-get install libssl-dev
  • XML: sudo apt-get install libxml2-dev
  • Curl: sudo apt-get install libcurl4-openssl-dev
  • GIT: sudo apt install git
  • GTK: sudo apt-get install wajig
    wajig install libgtk2.0-dev

  • Java
    java -version

    sudo apt-add-repository ppa:webupd8team/java
    sudo apt-get update
    sudo apt-get install oracle-java8-installer
    export JAVA_HOME=/usr/lib/jvm/java-8-oracle

    sudo R CMD javareconf

  • R for Linux
Edit the sources manually with nano (use Ctrl O to write the file and Ctrl X to exit)

sudo nano /etc/apt/sources.list 

# Grabs your version of Ubuntu as a BASH variable
CODENAME=`grep CODENAME /etc/lsb-release | cut -c 18-`

# Appends the CRAN repository to your sources.list file 
sudo sh -c 'echo "deb $CODENAME" >> /etc/apt/sources.list'


sudo apt-key adv --keyserver --recv-keys E084DAB9

sudo apt-get install r-base-dev
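
The codename lookup above works because the prefix `DISTRIB_CODENAME=` is 17 characters long, so `cut -c 18-` keeps everything after it. A small Python sketch of the same parsing follows. The function name `parse_codename` and the sample file content are illustrative assumptions, not output read from a real machine.

```python
def parse_codename(lsb_release_text):
    """Return the DISTRIB_CODENAME value from /etc/lsb-release content, or None."""
    prefix = "DISTRIB_CODENAME="
    for line in lsb_release_text.splitlines():
        if line.startswith(prefix):
            # Same effect as `cut -c 18-`: drop the 17-character prefix.
            return line[len(prefix):].strip().strip('"')
    return None


# Illustrative content in the usual /etc/lsb-release layout (assumed, not captured).
sample = (
    "DISTRIB_ID=Ubuntu\n"
    "DISTRIB_RELEASE=16.04\n"
    "DISTRIB_CODENAME=xenial\n"
    'DISTRIB_DESCRIPTION="Ubuntu 16.04 LTS"\n'
)

codename = parse_codename(sample)
print(codename)
```

On a real system the text would come from reading /etc/lsb-release instead of the sample string.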

  • RStudio

Download RStudio

  • RStudio Addins


  • Data Mining


install.packages(c("RcmdrPlugin.epack", "RcmdrPlugin.KMggplot2"))




$ cd Downloads

~/Downloads$ bash

$ conda install -c r r-essentials

