Using R for Cricket Analysis #rstats

New Zealand just made it to their first ever World Cup final (yes, it is cricket), and they made it with a thrilling six (like a home run) in the last over. Congrats to New Zealand. Of course, R was created in New Zealand too, and Hadley Wickham is from New Zealand.

I recently installed the rvest package from https://github.com/hadley/rvest and it's now on CRAN as well.

 

rvest helps you scrape information from web pages. It is designed to work with magrittr to make it easy to express common web scraping tasks, inspired by libraries like beautiful soup.

library(rvest)
lego_movie <- html("http://www.imdb.com/title/tt1490017/")

rating <- lego_movie %>% 
  html_nodes("strong span") %>%
  html_text() %>%
  as.numeric()
rating
#> [1] 7.9

cast <- lego_movie %>%
  html_nodes("#titleCast .itemprop span") %>%
  html_text()
cast
#>  [1] "Will Arnett"     "Elizabeth Banks" "Craig Berry"    
#>  [4] "Alison Brie"     "David Burrows"   "Anthony Daniels"
#>  [7] "Charlie Day"     "Amanda Farinos"  "Keith Ferguson" 
#> [10] "Will Ferrell"    "Will Forte"      "Dave Franco"    
#> [13] "Morgan Freeman"  "Todd Hansen"     "Jonah Hill"

poster <- lego_movie %>%
  html_nodes("#img_primary img") %>%
  html_attr("src")
poster
#> [1] "http://ia.media-imdb.com/images/M/MV5BMTg4MDk1ODExN15BMl5BanBnXkFtZTgwNzIyNjg3MDE@._V1_SX214_AL_.jpg"

The most important functions in rvest are:

  • Create an html document from a url, a file on disk or a string containing html with html().
  • Select parts of a document using css selectors: html_nodes(doc, "table td") (or if you’re a glutton for punishment, use xpath selectors with html_nodes(doc, xpath = "//table//td")). If you haven’t heard of selectorgadget, make sure to read vignette("selectorgadget") to learn about it.
  • Extract components with html_tag() (the name of the tag), html_text() (all text inside the tag), html_attr() (contents of a single attribute) and html_attrs() (all attributes).
  • (You can also use rvest with XML files: parse with xml(), then extract components using xml_node(), xml_attr(), xml_attrs(), xml_text() and xml_tag().)
  • Parse tables into data frames with html_table().
  • Extract, modify and submit forms with html_form(), set_values() and submit_form().
  • Detect and repair encoding problems with guess_encoding() and repair_encoding().
  • Navigate around a website as if you’re in a browser with html_session(), jump_to(), follow_link(), back(), forward(), submit_form() and so on. (This is still a work in progress, so I’d love your feedback.)
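
To make the list above concrete, here is a minimal sketch that selects nodes and extracts text and attributes from an inline HTML string, so it runs without a network connection. The HTML, the player names and the variable names are all made up for illustration. It also assumes a more recent rvest release, where the parsing function is read_html() rather than html().

```r
library(rvest)

# Hypothetical scorecard markup, invented for this example (not real data).
page <- read_html('
<html><body>
  <h1 id="title">Sample Scorecard</h1>
  <ul>
    <li class="player"><a href="/players/a">Batter A</a></li>
    <li class="player"><a href="/players/b">Batter B</a></li>
  </ul>
</body></html>')

# Select by id with a css selector, then pull out the text.
title <- page %>%
  html_nodes("#title") %>%
  html_text()

# Select every link inside a list item with class "player".
player_links <- page %>% html_nodes("li.player a")

players <- html_text(player_links)          # text inside each <a>
links   <- html_attr(player_links, "href")  # value of the href attribute
```

The same three steps (parse, select, extract) apply to a live page; only the string passed to the parser changes to a URL.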

 

While Hadley Wickham seems busy with reading Excel files (see https://github.com/hadley/readxl), maybe rvest can help with more sports analysis now!

http://decisionstats.com/2013/04/25/using-r-for-cricket-analysis-rstats-ipl/

Meanwhile, I am searching for an equivalent of readHTMLTable from the XML package; html_table() in the list above looks like the closest match.
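
As a rough sketch of how html_table() could stand in for readHTMLTable, the example below parses a small inline table into a data frame and derives a strike rate from it. The table contents, the column names and the strike_rate column are hypothetical, and read_html() is assumed to be available (as in recent rvest releases).

```r
library(rvest)

# Hypothetical batting table, invented for illustration (not real match data).
doc <- read_html('
<table id="batting">
  <tr><th>Batter</th><th>Runs</th><th>Balls</th></tr>
  <tr><td>Batter A</td><td>50</td><td>40</td></tr>
  <tr><td>Batter B</td><td>30</td><td>20</td></tr>
</table>')

# html_table() on a set of nodes returns a list with one data frame per table.
tables  <- doc %>% html_nodes("table#batting") %>% html_table()
batting <- tables[[1]]

# Strike rate is runs scored per 100 balls faced.
batting$strike_rate <- 100 * batting$Runs / batting$Balls
```

Because the header row uses th cells, the column names come through as Batter, Runs and Balls, and the numeric columns are converted automatically.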

A Writer’s Dilemma A Data Scientist’s Decision

Writing sucks as a way of making money. You have to constantly ask for gigs and favours so you can pay the bills until your publisher sends you the royalties for writing statistics books (which are not much).

Recently I was approached by someone to do research on Indian nuclear policy. I mentioned my billing rate as 80 pounds per hour, but said person wanted me to raise it to 110 pounds per hour.

Only one small hitch: the sub-part of Indian nuclear policy that I was asked to write a report on was to locate, interview and identify India’s top nuclear scientists for small reactors.

India has, of course, a big research interest in nuclear energy, but we seemed to be going in for big reactors and thorium reactors.

There are only two small nuclear reactors in India, and one of them is in INS Arihant (India’s nuclear submarine, launched last year). In effect, I was being asked to make a list of the top 20 likely candidates who had helped India with a nuclear submarine reactor.

I have said no to the person, and I have been subjected to verbal insults, threats and innuendo.

But Data Scientists trust in God. Everybody else has to work harder for the data.

I hope this is a lesson for fellow researchers and data scientists because, as I said, writing is a lousy way of making money.

Was I wrong? Am I just living in a fantasy land? Do you believe I am a criminal and a thug?


Training in R on the WeekendR in February

I have agreed to teach R on the weekend. As a change from my usual online trainings, these will be held in a classroom. I am collaborating with http://weekendr.in/r-training.html

The introductory price is Rs 5500 (~100 USD) for 8 classroom sessions of 3 hours each. This is only for New Delhi, India, as of now.

You can review the course here http://weekendr.in/r-training.html
