How to Analyze Wikileaks Data – R SPARQL


Drew Conway, one of the very few Project R voices I used to respect until recently, declared on his blog http://www.drewconway.com/zia/

Why I Will Not Analyze The New WikiLeaks Data

and followed it up with how HE analyzed the post announcing the non-analysis.

“If you have not visited the site in a week or so you will have missed my previous post on analyzing WikiLeaks data, which from the traffic and 35 Comments and 255 Reactions was at least somewhat controversial. Given this rare spotlight I thought it would be fun to use the infochimps API to map out the geo-location of everyone that visited the blog post over the last few days. Unfortunately, after nearly two years with the same web hosting service, only today did I realize that I was not capturing daily log files for my domain”

Anyway, non-American users of the R Project can analyze the WikiLeaks data using the R SPARQL package. I would advise American friends not to use this approach or attempt to analyze any of the data, because technically the data is still classified and its possession is illegal (which is the reason federal employees and organizations receiving federal funds have been advised not to use this or any WikiLeaks dataset).

https://code.google.com/p/r-sparql/

Overview

R is a programming language designed for statistics.

R Sparql allows you to run SPARQL queries inside R and store the results as an R data frame.

The main objective is to allow the integration of Ontologies with Statistics.

It requires Java and rJava installed.

Example (in R console):

> library(sparql)
> data <- query("<SPARQL query>", "RDF file or remote SPARQL Endpoint")

and the data is available for remote SPARQL querying; see http://www.ckan.net/package/cablegate

SPARQL is an easy language to pick up, but dammit, I am not supposed to blog on my vacations.

http://code.google.com/p/r-sparql/wiki/GettingStarted

Getting Started

1. Installation

1.1 Make sure Java is installed and is the default JVM:

$ sudo apt-get install sun-java6-bin sun-java6-jre sun-java6-jdk
$ sudo update-java-alternatives -s java-6-sun

1.2 Configure R to use the correct version of Java

$ sudo R CMD javareconf

1.3 Install the rJava library

$ R
> install.packages("rJava")
> q()

1.4 Download and install the sparql library

Download: http://code.google.com/p/r-sparql/downloads/list

$ R CMD INSTALL sparql-0.1-X.tar.gz

2. Executing a SPARQL query

2.1 Start R

# Load the library
library(sparql)
# Run the query
result <- query("SELECT ... ", "http://...")
# Print the result
print(result)

3. Examples

3.1 The Query can be a string or a local file:

query("SELECT ?date ?number ?season WHERE {  ... }", "local-file.rdf")
query("my-query.rq", "local-file.rdf")

The package will detect if my-query.rq exists and will load it from the file.

3.2 The URI can be a file or a URL (for remote queries):

query("SELECT ... ","local-file.db")
query("SELECT ... ","http://dbpedia.org/sparql")

3.3 Get some examples here: http://code.google.com/p/r-sparql/downloads/list
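
Putting the steps above together, here is a minimal sketch in R. It relies only on the query(<SPARQL text>, <file or endpoint>) call described above and the DBpedia endpoint from the remote-query example. The query text and the variable names are illustrative assumptions, and the call is skipped when the sparql package is not installed.

```r
# Illustrative SPARQL text (an assumption, not taken from the package docs):
# fetch any ten triples from the endpoint.
sparql_text <- paste(
  "SELECT ?s ?p ?o",
  "WHERE { ?s ?p ?o }",
  "LIMIT 10"
)

# Remote endpoint taken from the remote-query example above.
endpoint <- "http://dbpedia.org/sparql"

if (requireNamespace("sparql", quietly = TRUE)) {
  library(sparql)
  # query(<SPARQL text>, <file or endpoint>), as described in Getting Started.
  result <- query(sparql_text, endpoint)
  print(head(result))
} else {
  message("sparql package not installed; skipping the query")
}
```

Swapping the endpoint for a local RDF file, or the query text for the name of a .rq file, should work the same way according to the examples above.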

SPARQL Tutorial-

http://openjena.org/ARQ/Tutorial/index.html

Also read-

http://webr3.org/blog/linked-data/virtuoso-6-sparqlgeo-and-linked-data/

and from the favorite blog of Project R, also known as the NY Times:

http://bits.blogs.nytimes.com/2010/11/15/sorting-through-the-government-data-explosion/?twt=nytimesbits

In May 2009, the Obama administration started putting raw government data on the Web. It started with 47 data sets. Today, there are more than 270,000 government data sets, spanning every imaginable category from public health to foreign aid.

Zen and the art of applying T tests to Spam Data

DecisionStats traffic seemed up, mmm, but spam is way, way up.

Who’s spamming my dear bloggie?

hmm

Is it the Russians doing link spam? Unlikely, they don’t bot against Akismet that much (as they fail).

And CAPTCHA can be defeated by Python (apparently, sigh).

Is there a correlation between certain post tags and the count of spam, with spammers hoping to distort, say, the blog’s search engine rankings for "SAS WPS Lawsuit" or "jet ski across pacific" in Google?
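
One hedged way to check this in R is a two-sample t test, as the post title suggests. The sketch below compares daily spam counts on days with and without a post carrying a given tag. The counts are made-up numbers for illustration only, not DecisionStats data, and the variable names are hypothetical.

```r
# Hypothetical daily spam-comment counts (made-up, for illustration only).
spam_tagged_days   <- c(42, 55, 61, 48, 70, 66, 53)  # days with a tagged post
spam_untagged_days <- c(30, 28, 35, 41, 33, 29, 37)  # days without one

# Welch two-sample t test (R's default: variances are not assumed equal).
tt <- t.test(spam_tagged_days, spam_untagged_days)
print(tt)

# A small p-value suggests the mean spam counts differ between the groups;
# it does not by itself show that the tag causes the spam.
cat("Difference in means:",
    mean(spam_tagged_days) - mean(spam_untagged_days), "\n")
```

With real data you would pull the daily counts from the Akismet or WordPress stats instead of typing them in.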

Sigh, an old retired outlaw black hat is never left in peace. Try doing a blog search for R in Google: Revo is now down to number 7 (which is, hmm, given Google Instant).

Of course I think too much about SEO, but I don’t run CPC ads. I made much more money when traffic was low, from, say, 5-10 small businesses needing to forecast their sales.

and enjoy your Thanksgiving. Remember the Indians bring the Turkeys.

 

New Google Ad Planner


The new Google Ad Planner is really nice. It seems better than the old AdWords interface, though it needs a UI redesign before it can compete with the clean-cut slice and dice of Facebook Ad Planner.

It’s the interface, stupid, that makes an iPhone sell more than Symbian even with 90% of the functionality. The same reasoning explains why Google Storage is okay but Google Prediction API gets a slower liftoff than Amazon Console (now with FREE instances), though the R interface to Prediction API sure helps.

Prediction API is a terrific tool dying for oxygen out there (and may end up like Wave, though I hope not).

Sometimes you need artists as well as engineers to design query tools, G Men. And I guess the DoubleClick antitrust rumours have quietened down enough, because why the heck did the DoubleClick interface integration take so loooong?

(And btw, why can’t Google just get into the multi-billion dashboard business if they can manage ALL the data ON THE INTERNET? They sure can do it for specific companies. But wait:

they are probably waiting for AsterData to stop sucking thumbs, chanting MapReduce SQL, MapReduce SQL nursery rhymes, and start inventing NEW STUFF again, or at least creating two product brands from nCluster (when you and I were in school together, giggle).)

By the time Google makes up its mind to enter BI, or waits for Aster to finish, IBM will have gulped and burped all there is, and that’s the way that market rolls.

Back to ads and Mad Men.

Here are some screenshots of the new Google Ad Planner:

I found it useful for reviewing traffic for third-party websites (even better than Google Trends), and that’s a definite plus over Facebook’s closed dormitory world of ads.

Click on them for some more views, or go straight to http://google.com/adplanner and Enjoy Baby!

Which websites attract your target customers?

View a site listing: 

Ad Planner top 1,000 sites

Refine your online advertising with DoubleClick Ad Planner, a free media planning tool that can help you:

Identify websites your target customers are likely to visit

  • Define audiences by demographics and interests.
  • Search for websites relevant to your target audience.
  • Access unique users, page views, and other data for millions of websites from over 40 countries.

Easily build media plans for yourself or your clients

  • Create lists of websites where you’d like to advertise.
  • Generate aggregated website statistics for your media plan.

and

Take charge of your DoubleClick Ad Planner site listing

DoubleClick Ad Planner is a media planning tool where advertisers find sites for their media buys. As a site owner, you can access the DoubleClick Ad Planner Publisher Center and:

Market your site

  • Write a site description to present your audience and unique value to advertisers.

Help advertisers search for you

  • Choose categories for your site and ad formats you support.

Improve the data that advertisers see

  • Share your Google Analytics data to reflect the most accurate traffic numbers for your site.

 

The SEO mess on joining blog aggregators

 


 

If you are an analytics blogger who writes and is aggregated on an analytical community, read on. Here’s how blog aggregation communities can help you lose 30% of all future traffic in the long term, while giving you a short-term lift.

The problem is not created by blogging communities (like R-Bloggers, or PlanteR, or Smart Data Collective or AnalyticBridge or even BeyeBlogs).

It is created by the way Google PageRank is structured. Given exactly the same content on two different web pages, Google will rank the page with the higher PageRank higher. This is counterintuitive and quite simple to rectify: the Google spider can just use the time stamp to determine where the article was published first (obviously on your blog, and only later on the aggregator).

How bad is the mess? Well, joining ANY blog aggregator will lead to an instant lift of 10-50% in your current traffic as similar bloggers try to read about you. However, you can lose the long-term 30% share of traffic that is a benchmark for what search engines send you.

So do you opt out of blog aggregation? No. It’s an SEO mess, and it’s unfair to punish your blog aggregators, most of whom are running on ad-supported sponsors or on their own funds and dry fumes to publish your content. Most of the aforementioned communities were created by excellent people I have interacted with heavily, and they are genuinely motivated to give readers an easy way to keep up with blogs, especially Smart Data Collective, AnalyticBridge and R-bloggers, whose founders I have known personally.

You can do one thing: create manual summaries in the excerpt feature of your blog posts (it’s just below the WordPress page). And switch your RSS feed to summary rather than full. It avoids losing keyword rank to other websites, it prevents the blog aggregator from gaining too much influence in keyword-related searches, and it keeps your whole ecosystem happy. Best of all, it helps readers of blog aggregators, since most of them use a summary on the front page anyway.

An additional thought on Google PageRank, something I have sulked over but not spoken about for a long, long time: it ignores the value of the reader. If Bill Gates, Steve Jobs, and 500 CEOs from Fortune 500 companies read my blog but do not link to it, it will count daily traffic as 500. Probably it will give more weightage to Paris Hilton fans.

A suggestion, humbly: you can use IP address lookup of visitors to see if traffic is coming from corporate sources or retail sources (Clicky from GetClicky does this). Use it as feedback in Google Analytics as well as Google Trends.

And maybe PageRank needs to add quantity and quality of visitors as additional variables. Do an A/B test, guys, with some chi-square juice. It’s not quite Mad Men advertising, but it’s still good fun.
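
For what that chi-square juice might look like, here is a minimal sketch in R of a test on a 2x2 A/B table. The counts are invented purely for illustration, and the variant and outcome labels are assumptions, not anything Google publishes.

```r
# Hypothetical A/B counts (made-up): rows are ranking variants,
# columns are whether a visitor clicked the result.
ab_counts <- matrix(
  c(120, 880,
    150, 850),
  nrow = 2, byrow = TRUE,
  dimnames = list(variant = c("A", "B"),
                  outcome = c("clicked", "not_clicked"))
)

# Pearson chi-square test of independence; for a 2x2 table R applies
# the Yates continuity correction by default.
res <- chisq.test(ab_counts)
print(res)

# Expected counts under "variant makes no difference".
print(res$expected)
```

A small p-value would suggest the click rate differs between the two variants. Splitting the same table by corporate versus retail visitors would be one way to bring in the quality-of-visitor idea.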

 


 

and the world is one big community as per xkcd


The Comic Water Games (aka Common Wealth Games)

We in Delhi, India are a tough people. With summer temperatures up to 46 degrees Celsius (about 115 degrees Fahrenheit), winter temperatures down to 2-3 degrees Celsius (just above freezing), high pollution levels, and the worst traffic jams (and the highest per capita car ownership), there is very little that intimidates the average Delhiite.

But the return of the British Empire is scaring us, and it is called the Commonwealth Games. The Commonwealth is a group of countries that were colonized by Britain in her colonial days (the USA is not a member, though, as they probably kicked way too much British butt while gaining independence).

And every 4 years they have the Commonwealth Games (read: games for the non-US English-speaking world). So when our commie neighbors, the Chinese, went and got themselves an Olympics, we decided to get ourselves these Games too. Big deal: national pride, rising economic power and all that.

So far the Games have meant the following: lots of roads dug up, lots of stadiums in various degrees of preparation, a total cost of 2 billion USD, and rampant allegations of corruption due to the tenfold increase in budget, including rather suspicious-looking documents procured by our local press (yes, the Indian press is free, as India is a democracy).

And add divine grace. Delhi has had its wettest monsoon since 1978, it is raining cats and dogs in September, and we now have a mini dengue and malaria epidemic. Four countries have declared the athletes’ living quarters uninhabitable, some have walked out, and the inevitable terrorists injured two Taiwanese tourists this weekend (in a semi-ironic email they said they were prepared as the government was prepared; it isn’t).

Today a bridge collapsed:

http://www.nytimes.com/2010/09/22/sports/22iht-GAMES.html?_r=1&hp

On Tuesday afternoon, a bridge next to Jawaharlal Nehru Stadium, the main Games venue, fell apart. The footbridge collapsed into three pieces, taking several workers with it and uprooting one side of the arch that supported it.

A police officer at the scene said that 27 people had been injured, four of them seriously, in the collapse.

“This will not affect the Games,” said Raj Kumar Chauhan, a Delhi minister for development, who spoke on the scene. “We can put the bridge up again, or make a new one.”

and

http://www.nytimes.com/2010/09/20/world/asia/20india.html?ref=sports

“We really need to learn how to plan,” said Vrinda Walavalkar, a public relations executive who is not connected to the Games.

“Maybe we feel we have so many lifetimes to achieve things” that it does not matter if it gets done this time, she said.

Mr. Gupta, the shopkeeper, found a metaphor in Hindu wedding tradition.

The groom’s party, known as the barat, traditionally marches to the bride’s house on horseback with his friends and family, he explained. When the barat appears, the bride has to come to the door, he said.

“If the bride is not ready, you patch her up and try to hide all her defects,” Mr. Gupta said, and then you send her outside.

————————————————————————————————————–

To some this may be shocking. To the average Delhiite battling traffic and rain, this is one more episode in the chaotic capital. As a small solace, Delhi still has the best and cheapest street food in this part of the world, with golgappas, tikki and chaat. If only you can beat the rain to get them!

Also see http://en.wikipedia.org/wiki/Delhi if you would like to know more.

Software Lawsuits :Ergo

The latest round of software lawsuits makes things more interesting, especially for Google. There are two notable developments:

1) Google’s pact with Verizon for an even more open Internet. From

http://googlepublicpolicy.blogspot.com/2010/08/joint-policy-proposal-for-open-internet.html

A provider that offers a broadband Internet access service complying with the above principles could offer any other additional or differentiated services. Such other services would have to be distinguishable in scope and purpose from broadband Internet access service, but could make use of or access Internet content, applications or services and could include traffic prioritization.

2) Oracle’s lawsuit against Google for intellectual property enforcement of Java for Android (read here: http://news.cnet.com/8301-30685_3-20013549-264.html )

I once joked that nothing remains cool forever, not even Google (see https://decisionstats.wordpress.com/2008/08/05/11-ways-to-beat-up-google/ ), but I did not foresee the big G beating itself into knots on its own.

It is hard to sympathize with Google (or Oracle or Verizon), but this is the mess that is created when lawyers (with a briefcase) can steal more value than a thousand engineers can create.

Interestingly, Google owns the IP for MapReduce, so could it itself sue the Hadoop community over royalty terms someday, like Oracle did with Java? Hmmmmm, interesting revenue stream.

All in all, I would be happy to see zero tiers on the Internet (wireless or wired), and even Java developers making some money from writing code. Open source is not free source.

The Plot to Kill Obama

The Plot to Kill Obama :Updated 28 August

Note: This is a continuing article, started in Feb 2008, collecting all data points on plots against Mr Barack Obama. The media sources are authentic, the scenario and what-if analysis are fictional, and readers are expected to use OBJECTIVITY rather than passion to draw their conclusions.

Barack Obama’s assassination would pave the way to the White House for either McCain or Hillary Clinton. Neither of them is expected to seriously damage corporate interests: they won’t be clamping down on the outsourcing of American jobs, they are less likely to cut rebates for the wealthy, and less likely to shut out lobbyists from the gun and health and pharma lobbies. They are less likely to expedite the end of the Iraq war, and the tens of billions spent every month that go from taxpayers to major corporations are less likely to be threatened with Obama out of the way. So a few people would definitely make serious money if Barack Hussein Obama goes back to his Kenyan family graveyard.

COUNTERPOINT: Also, if Obama gets elected, gets into a cowboyish missile crisis, loses a Bay of Pigs due to morals, appoints a brother as attorney general (who is not charged or questioned before an actress dies) but assumes a high moral ground (rather than political ambition) and goes out to investigate Mafia men who may be critical allies in a new cold war, he may end up in a Harvey Dent state funeral with people self-justifying the extreme use of force. The last famous political dynasty (which won a very close election and had a non-mainstream religion, Catholicism) was the Kennedys.

All three Kennedy brothers had close brushes with death in the 1960s. Ted had two: a plane crash, which gave him a back injury, and the car crash on the bridge. The other two brothers didn’t quite make it.

THINK….What would likely happen if someone killed Obama.

The question is – is there a credible plot to kill Obama?

As per Feb’s NY times –

From the NYT story: DALLAS: There is a hushed worry on the minds of many supporters of Senator Barack Obama, echoing in conversations from state to state, rally to rally: Will he be safe?

Representative Bennie Thompson, Democrat of Mississippi and chairman of the House Homeland Security Committee, raised concerns in a letter in January to officials who oversee the Secret Service. While Mr. Obama was already receiving protection, Mr. Thompson said that the intense interest in the election prompted him to make sure that Mr. Obama and the other candidates were offered adequate security.

This means that there is a security threat perceived already.

As per the Huffington Post, police at a recent Obama rally in Dallas complained about a security breach when, in order to speed up the queues that were forming, metal detection and frisking were waived. And this at a Texas rally, where concealed weapons laws are the most liberal. http://www.star-telegram.com/667/story/486413.html

This means that hectic electioneering and Obama’s popularity, his strengths, could also prove to be his fatal weakness.

Obama seeks to take JFK’s legacy. We hope the corporate interests and interests controlling today’s Washington allow him the chance to do so. Except for the end of course.

Hopefully 2008 won’t be like 1968.

Update – July 3 2008- An officer guarding Obama’s Motorcade was injured in a collision when a van struck him.

http://www.huffingtonpost.com/2008/07/03/officer-injured-guarding_n_110696.html

On a fast ride Wednesday from the U.S. Air Force Academy to a luxury hotel where Obama was staying and holding a fundraiser, Orvin suffered minor injuries after a van crashed into his motorcycle, which was blocking traffic for the entourage.

Colorado Springs police Sgt. Scott Wisler said the driver of a van was cited for rear-ending the officer.

Updated –

http://thecaucus.blogs.nytimes.com/2008/07/07/obamas-plane-diverted/

Editorial Disclaimer: This is based on prior events and hypothesizes a fictional what-if scenario. Any correlation is coincidental. The sources and past events quoted are actual. Keep abusive comments to a minimum; you DO have the right to passionately articulate your fears and misgivings. Read the privacy policy before submitting comments under a false name. The post gets almost 2-3 visitors per week from search engines with the keywords “kill obama”.

