Using R for Cricket Analysis #rstats

ESPN Crincinfo is the best site for cricket data (you can see an earlier detailed post on the database  here  ), and using the XML package in R we can easily scrape and manipulate data

Here is the code.

#Note I can also break the url string and use paste command to modify this url with parameters
tables$"Overall figures"

#Now see this- since I only got 50 results in each page, I look at the url of next page

table1=tables$"Overall figures"
table2=tables$"Overall figures"

#Now I need to join these two tables vertically


Note-I can also automate the web scraping .
Now the data is within R, we can use something like Deducer to visualize.
Created by Pretty R at

Topic Models in R- search documents for similarity by frequency

Image via Wikipedia

From the marvelous lovely Journal of Statistical Software, ignored by mainstream corporatia, but beloved to academia. here is one more interesting and very timely paper.

Can be used to grade stdudents homework, catch terrorists as in plagiarists , search engine spam linkers. Enjoy!

Why search optimization can make you like Rebecca Black

Felicia Day, actress and web content producer.
Image via Wikipedia

A highly optimized blog post or web content can get you a lot of attention just like Rebecca Black’s video (provided it passes through the new quality metrics \change*/ in the Search Engine)

But if the underlying content is weak, or based on a shoddy understanding of the content-it can drive lots of horrid comments as well as ensuring that bad word of mouth is spread about the content or you/despite your hard work.

An example of this is copy and paste journalism especially in technology circles, where even a bigger Page Ranked website /blog can get away with scraping or stealing content from a lower page ranked website (or many websites)  after adding a cursory “expert comment”. This is also true when someone who is basically a corporate communication specialist (or PR -public relations) person is given a techinical text and encourage to write about it without completely understanding it.

A mild technical defect in the search engine algorithm is that it does not seem to pay attention to when the content was published, so the copying website or blog actually can get by as fresher content even if it is practically has 90% of the same words). The second flaw is over punishment or manual punishment of excessive linking – this can encourage search optimization minded people to hoard links or discourage trackbacks.

A free internet is one which promotes free sharing of content and does not encourage stealing or un-authorized scraping or content copying. Unfortunately current search engine optimization can encourage scraping and content copying without paying too much attention to origin of the words.

In addition the analytical rigor by which search algorithms search your inboxes (as in search all emails for a keyword) or media rich sites (like Youtube) are quite on a different level of quality altogether. The chances of garbage results are much more while searching for media content and/or emails.

R Commercial Software

Revolution Analytics Download- Official Screenshot-

• XL Solutions Download- Official Screenshot-

Information Builder Official Screenshot-

Blue Reference- Inference for R Download- Official Screenshot-

R for Excel


Also integrates R with Word, Open Office and Excel with Scilab

How to balance your online advertising and your offline conscience

Google in 1998, showing the original logo
Image via Wikipedia

I recently found an interesting example of  a website that both makes a lot of money and yet is much more efficient than any free or non profit. It is called ECOSIA

If you see a website that wants to balance administrative costs  plus have a transparent way to make the world better- this is a great example.

    You search with Ecosia.
  • Perhaps you click on an interesting sponsored link.
  • The sponsoring company pays Bing or Yahoo for the click.
  • Bing or Yahoo gives the bigger chunk of that money to Ecosia.
  • Ecosia donates at least 80% of this income to support WWF’s work in the Amazon.
  • If you like what we’re doing, help us spread the word!
  • Key facts about the park:

    • World’s largest tropical forest reserve (38,867 square kilometers, or about the size of Switzerland)
    • Home to about 14% of all amphibian species and roughly 54% of all bird species in the Amazon – not to mention large populations of at least eight threatened species, including the jaguar
    • Includes part of the Guiana Shield containing 25% of world’s remaining tropical rainforests – 80 to 90% of which are still pristine
    • Holds the last major unpolluted water reserves in the Neotropics, containing approximately 20% of all of the Earth’s water
    • One of the last tropical regions on Earth vastly unaltered by humans
    • Significant contributor to climatic regulation via heat absorption and carbon storage


    They claim to have donated 141,529.42 EUR !!!











    Well suppose you are the Web Admin of a very popular website like Wikipedia or etc

    One way to meet server costs is to say openly hey i need to balance my costs so i need some money.

    The other way is to use online advertising.

    I started mine with Google Adsense.

    Click per milli (or CPM)  gives you a very low low conversion compared to contacting ad sponsor directly.

    But its a great data experiment-

    as you can monitor which companies are likely to be advertised on your site (assume google knows more about their algols than you will)

    which formats -banner or text or flash have what kind of conversion rates

    what are the expected pay off rates from various keywords or companies (like business intelligence software, predictive analytics software and statistical computing software are similar but have different expected returns (if you remember your eco class)


    NOW- Based on above data, you know whats your minimum baseline to expect from a private advertiser than a public, crowd sourced search engine one (like Google or Bing)

    Lets say if you have 100000 views monthly. and assume one out of 1000 page views will lead to a click. Say the advertiser will pay you 1 $ for every 1 click (=1000 impressions)

    Then your expected revenue is $100.But if your clicks are priced at 2.5$ for every click , and your click through rate is now 3 out of 1000 impressions- (both very moderate increases that can done by basic placement optimization of ad type, graphics etc)-your new revenue is  750$.

    Be a good Samaritan- you decide to share some of this with your audience -like 4 Amazon books per month ( or I free Amazon book per week)- That gives you a cost of 200$, and leaves you with some 550$.

    Wait! it doesnt end there- Adam Smith‘s invisible hand moves on .

    You say hmm let me put 100 $ for an annual paper writing contest of $1000, donate $200 to one laptop per child ( or to Amazon rain forests or to Haiti etc etc etc), pay $100 to your upgraded server hosting, and put 350$ in online advertising. say $200 for search engines and $150 for Facebook.


    Month 1 would should see more people  visiting you for the first time. If you have a good return rate (returning visitors as a %, and low bounce rate (visits less than 5 secs)- your traffic should see atleast a 20% jump in new arrivals and 5-10 % in long term arrivals. Ignoring bounces- within  three months you will have one of the following

    1) An interesting case study on statistics on online and social media advertising, tangible motivations for increasing community response , and some good data for study

    2) hopefully better cost management of your server expenses

    3)very hopefully a positive cash flow


    you could even set a percentage and share the monthly (or annually is better actions) to your readers and advertisers.

    go ahead- change the world!

    the key paradigms here are sharing your traffic and revenue openly to everyone

    donating to a suitable cause

    helping increase awareness of the suitable cause

    basing fixed percentages rather than absolute numbers to ensure your site and cause are sustained for years.

    The auto-suggest link/tags for blogs blogs have a great new option for generating tags, and links and thus improving their search engine optimization for posts.

    Just go to Users-Personal Settings- and check the options shown. Thats it every time you write a post it suggests links and tags. Links are helpful for your readers (like Wikipedia links to understand dense technical jargon, or associated websites). Tags help to classify your contents so that all visitors to the web site including spiders ,search engines and your readers can search it better.

    The bad thing is I need to go back to all 1025 posts on this site and auto generate tags for the archives ! Oh well. Great collaboration between zementa and Automattic for this new feature.