Google Plus API – statistical text mining, anyone?

For the past year or two I have noticed a lot of statistical analysis using #rstats/R on the unstructured text generated in real time by the social network Twitter. From an analytic point of view, Google Plus is an interesting social network: it is new, and it arrived after the analytic tools had become relatively refined. It is thus a useful case for studying how people's behavior, measured globally, evolves AFTER text mining tools have matured, and how that behavior varies as the social network and its user interface evolve.

And it would also be a nice benchmark for doing sentiment analysis across multiple social networks.

Some interesting use cases of Twitter analysis done in R:

  • Using R to search Twitter for analysis
http://www.franklincenterhq.org/2429/using-r-to-search-twitter-for-analysis/
  • Text Data Mining With Twitter And R
  • TWITTER FROM R… SURE, WHY NOT!
  • A package called TwitteR
  • slides from my R tutorial on Twitter text mining #rstats
  • Generating graphs of retweets and @-messages on Twitter using R and Gephi
But the Google Plus API is now active, and the Google APIs Console makes it straightforward to get started.

The Console lets you see and manage the following project information:

  • Activated APIs – Activate one or more APIs to enable traffic monitoring, filtering, and billing, and API-specific pages for your project. Read more about activating APIs here.
  • Traffic information – The Console reports traffic information for each activated API. Additionally, you can cap or filter usage by API. Read more about traffic reporting and request filtering here.
  • Billing information – When you activate billing, your activated APIs can exceed the courtesy usage quota. Usage fees are billed to the Google Checkout account that you specify. Read more about billing here.
  • Project keys – Each project is identified by either an API key or an OAuth 2.0 token. Use this key/token in your API requests to identify the project, in order to record usage data, enforce your filtering restrictions, and bill usage to the proper project. You can use the Console to generate or revoke API keys or OAuth 2.0 certificates to use in your application. Read more about keys here.
  • Team members – You can specify additional members with read, write, or ownership access to this project’s Console page. Read more about team members here.
Google+ API Courtesy limit: 1,000 queries/day

Effective limits:

  • API: Google+ API
  • Per-User Limit: 5.0 requests/second/user
  • Used: 0%
  • Courtesy Limit: 1,000 queries/day
API Calls
Most of the Google+ API follows a RESTful API design, meaning that you use standard HTTP methods to retrieve and manipulate resources. For example, to get the profile of a user, you might send an HTTP request like:

GET https://www.googleapis.com/plus/v1/people/userId
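
As a quick sketch (not official Google client code), such a request can be issued from R with the RCurl package; the userId below is the one from the profile example later in this post, and YOUR_API_KEY is a placeholder for your own key:

# a minimal sketch: fetch a public Google+ profile from R
# YOUR_API_KEY is a placeholder for the key from your APIs Console project
library(RCurl)

user_id <- "118051310819094153327"
url <- paste("https://www.googleapis.com/plus/v1/people/", user_id,
             "?key=YOUR_API_KEY", sep = "")
json <- getURL(url)   # the raw JSON response as a character string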

Common Parameters

Different API methods require parameters to be passed either as part of the URL path or as query parameters. Additionally, there are a few parameters that are common to all API endpoints. These are all passed as optional query parameters.

  • callback (string) – Specifies a JavaScript function that will be passed the response data, for using the API with JSONP.
  • fields (string) – Selector specifying which fields to include in a partial response.
  • key (string) – API key. Your API key identifies your project and provides you with API access, quota, and reports. Required unless you provide an OAuth 2.0 token.
  • access_token (string) – OAuth 2.0 token for the current user. Learn more about OAuth.
  • prettyPrint (boolean) – If set to “true”, data output will include line breaks and indentation to make it more readable. If set to “false”, unnecessary whitespace is removed, reducing the size of the response. Defaults to “true”.
  • userIp (string) – Identifies the IP address of the end user for whom the API call is being made. This allows per-user quotas to be enforced when calling the API from a server-side application. Learn more about Capping Usage.
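
For illustration, here is one hedged way of passing these common parameters from R: RCurl's getForm() treats named arguments as query parameters and URL-encodes them (the key value is again a placeholder):

# a sketch of the common query parameters; getForm() builds the query string
library(RCurl)
resp <- getForm("https://www.googleapis.com/plus/v1/people/118051310819094153327",
                key         = "YOUR_API_KEY",       # identifies your project
                fields      = "displayName,url",    # ask for a partial response
                prettyPrint = "false")              # drop unnecessary whitespace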

Data Formats

Resources in the Google+ API are represented using JSON data formats. For example, retrieving a user’s profile may result in a response like:

{
  "kind": "plus#person",
  "id": "118051310819094153327",
  "displayName": "Chirag Shah",
  "url": "https://plus.google.com/118051310819094153327",
  "image": {
    "url": "https://lh5.googleusercontent.com/-XnZDEoiF09Y/AAAAAAAAAAI/AAAAAAAAYCI/7fow4a2UTMU/photo.jpg"
  }
}
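
A response like this is easy to handle in R. As a small sketch, continuing the request from earlier, the rjson package turns the JSON into a nested list:

# parse the JSON profile into an R list and pick out the common properties
library(rjson)
profile <- fromJSON(json)
profile$kind          # "plus#person"
profile$displayName   # "Chirag Shah"
profile$url           # the profile permalink
profile$image$url     # URL of the profile photo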

Common Properties

While each type of resource will have its own unique representation, there are a number of common properties that are found in almost all resource representations.

  • displayName (string) – This is the name of the resource, suitable for displaying to a user.
  • id (string) – This property uniquely identifies a resource. Every resource of a given kind has a unique id. Even though an id may sometimes look like a number, it should always be treated as a string.
  • kind (string) – This identifies what kind of resource a JSON object represents. This is particularly useful when programmatically determining how to parse an unknown object.
  • url (string) – This is the primary URL, or permalink, for the resource.

Pagination

In requests that can respond with potentially large collections, such as Activities list, each response contains a limited number of items, set by maxResults (default: 20). Each response also contains a nextPageToken property. To obtain the next page of items, you pass this value of nextPageToken to the pageToken property of the next request. Repeat this process to page through the full collection.

For example, calling Activities list returns a response with nextPageToken:

{
  "kind": "plus#activityFeed",
  "title": "Plus Public Activities Feed",
  "nextPageToken": "CKaEL",
  "items": [
    {
      "kind": "plus#activity",
      "id": "123456789",
      ...
    },
    ...
  ]
  ...
}

To get the next page of activities, pass the value of this token in with your next Activities list request:

https://www.googleapis.com/plus/v1/people/me/activities/public?pageToken=CKaEL

As before, the response to this request includes nextPageToken, which you can pass in to get the next page of results. You can continue this cycle to get new pages — for the last page, “nextPageToken” will be absent.
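
A hedged sketch of that paging loop in R, using the public profile id from the earlier example and a placeholder API key, might look like this:

# page through a public activity feed by following nextPageToken
library(RCurl)
library(rjson)

base  <- "https://www.googleapis.com/plus/v1/people/118051310819094153327/activities/public"
token <- NULL
items <- list()
repeat {
  url   <- paste(base, "?maxResults=20&key=YOUR_API_KEY",
                 if (is.null(token)) "" else paste("&pageToken=", token, sep = ""),
                 sep = "")
  page  <- fromJSON(getURL(url))
  items <- c(items, page$items)          # accumulate this page's activities
  token <- page$nextPageToken
  if (is.null(token)) break              # last page: nextPageToken is absent
}
length(items)                            # total number of activities fetched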

 

It would be interesting to see the first wave of analysis on this new social network, and whether it is any different from the others, if at all.
After all, an API is only as good as the analysis and applications that can be built on the data it provides.

 

US-CERT Incident Reporting System

Here are some resources if your cyber assets have been breached. Note that the form does not use CAPTCHA at all.

US-CERT Incident Reporting System (their head Randy Vickers quit last week)

https://forms.us-cert.gov/report/

Using the US-CERT Incident Reporting System: In order for us to respond appropriately, please answer the questions as completely and accurately as possible. Questions that must be answered are labeled “Required”. As always, we will protect your sensitive information. This web site uses Secure Sockets Layer (SSL) to provide secure communications. Your browser must allow at least 40-bit encryption. This method of communication is much more secure than unencrypted email.

Tools for Hackers: Beginners

How to disguise your IP Address from your most wonderful glorious leaders-

From

 

https://www.torproject.org/projects/torbrowser.html.en

Tor Browser Bundle


The Tor software protects you by bouncing your communications around a distributed network of relays run by volunteers all around the world: it prevents somebody watching your Internet connection from learning what sites you visit, it prevents the sites you visit from learning your physical location, and it lets you access sites which are blocked.

The Tor Browser Bundle lets you use Tor on Windows, Mac OS X, or Linux without needing to install any software. It can run off a USB flash drive, comes with a pre-configured web browser, and is self-contained. The Tor IM Browser Bundle additionally allows instant messaging and chat over Tor. If you would prefer to use your existing web browser, install Tor permanently, or if you don’t use Windows, see the other ways to download Tor.

Freedom House has produced a video on how to find and use the Tor Browser Bundle. If you don’t see a video below, view it at YouTube. Know of a better video or one translated into your language? Let us know!

 

 

 

And if you now want to check your own website against a Denial of Service attack, download this:

http://sourceforge.net/projects/loic/

This is the software for which 32 Turkish teenagers got arrested for bringing down their government websites. Do NOT use it for ILLEGAL purposes, because:

1) It is hosted on a western website that, thanks to the Patriot Act, would be tracking downloads and would most likely be inserting some logging code into your computer (especially if you are still on Windows).

2) Turkey, being a NATO member, got rather immediate notice of this – which makes it very likely that this tool is compromised in the Western Hemisphere. You can probably use it in an Eastern Hemisphere country, excluding Israel, Turkey, China, India, Korea, or Japan, because these countries also have sophisticated hackers working for the government.

3) This is just a beginner's tool for understanding how flooding a website with requests works.

http://sourceforge.net/projects/loic/files/

Basically: download and unzip the file.

Enter the URL and click "Lock on" to get the IP address.

Use the HTTP method. Set, say, 1000 threads.

Then press the big IMMA CHARGING MY LAZER button.

Note that the Failed tab tells you how good or bad this method is.

Note – it won't work on my blogs hosted on wordpress.com, but then those blogs had a root-level breach some time back. It did work on both my Blogspot and my Tumblr blogs, and it completely shattered my son's self-hosted WordPress blog (see below).

 

 

2011 Forecast-ying

[Image: Free twitter badge, via Wikipedia]

I recently asked some friends from my Twitter lists for their take on 2011. At least 3 of them responded with an answer, 1 said they were still working on it, and 1 cited a recent office event.

Anyway, I take note of this view of forecasting from

http://www.uiah.fi/projekti/metodi/190.htm

The most primitive method of forecasting is guessing. The result may be rated acceptable if the person making the guess is an expert in the matter.

Ajay – people will forecast at the end of 2010 and through 2011. Many of them will get forecasts wrong, some very wrong, but by Dec 2011 most of them will be writing forecasts for 2012. Almost no one will get called out by irate users or readers ("hey, you got 4 out of 7 wrong in last year's forecast!") – it just won't happen. People thrive on hope. So does marketing. In 2011 – and before.

And some forecasts from Tom Davenport's International Institute for Analytics (IIA) at

http://iianalytics.com/2010/12/2011-predictions-for-the-analytics-industry/

Regulatory and privacy constraints will continue to hamper growth of marketing analytics.

(I wonder how privacy and analytics can coexist in peace forever. One view is that model building can use anonymized data: suppose your IP address was anonymized using a standard secret Coca-Cola-style formula – then whatever model gets built would not concern you individually, as your privacy is protected by the anonymization formula.)

Anyway, back to the question I asked:

What are the top 5 events in your industry (events as in things that occurred, not conferences), and what are the top 3 trends for 2011?

I define my industry as online technology writing and research (with a heavy skew towards statistical computing).

My top 5 events for 2010 were-

1) Consolidation – The Big 5 software providers in BI and Analytics bought more, sued more, and consolidated more. The valuations rose, and rose, leading to even more small players entering. Thus consolidation proved an oxymoron, as the total number of influential AND disruptive players grew.

 

2) Cloudy Computing – Computing shifted away from the desktop, but to the mobile and the tablet more than to the cloud. iPad front end with an Amazon EC2 backend – yup, it happened.

3) Open Source grew louder – yes, it got more clients, and more revenue. Did it get more market share? That depends on whether you define market share by revenues or by users.

Both Open Source and Closed Source had a good year – the pie grew faster and bigger, so no one minded as long as their slices grew bigger.

4) We didn't see that coming –

Technology continued to surprise with events (that's what we love! the surprises).

Revolution Analytics broke through R's Big Data barrier, Tableau Software created a big buzz, and Wikileaks and Chinese firewalls gave technology an entirely new dimension (though not a universally popular one).

People fought wars over email and servers and social media – unfortunately, the ones fighting real wars in 2009 continued to fight them in 2010 too.

5) Money –

SAP, SAS, IBM, Oracle, Google, and Microsoft made more money than ever before. Only Facebook got a movie named after it. Venture capitalists pumped money into promising startups – really as if in a hurry to park money before tax cuts expired in some countries.

 

2011 Top Three Forecasts

1) Surprises – Expect to be surprised at least 10% of the time by business events. As the internet grows, the communication cycle shortens and the hype cycle amplifies buzz –

more unstructured data is created (especially for marketing analytics), leading to enhanced volatility.

2) Growth – Yes, we predict technology will grow faster than the automobile industry. Game changers may happen in the form of Chrome OS (really, it's Linux, guys) and customer adaptability to new USER INTERFACES. Design will matter much more in technology – on your phone, on your desktop, and on your internet. Packaging sells.

False Top Trend 3) I will write a book on business analytics in 2011. Yes, it is true, and I am working with a publisher. No, it is not really going to be a top-3 event for anyone except me, the publisher, and the lucky guys who read it.

3) Creating technology and technically enabling creativity will converge at an accelerated rate. The use of widgets, GUIs, snippets, and IDEs will ensure creative left brains can code more easily, and right brains can design faster and better thanks to a global supply chain of techie and artsy professionals.

 

 

Interesting R competition at Reddit

[Image: Reddit, via CrunchBase]

Here is an interesting R competition going on at Reddit – it is to help Reddit build a recommendation engine 🙂

http://www.reddit.com/r/redditdev/comments/dtg4j/want_to_help_reddit_build_a_recommender_a_public/

by ketralnis

As promised, here is the big dump of voting information that you guys donated to research. Warning: this contains much geekery that may result in discomfort for the nerd-challenged.

I’m trying to use it to build a recommender, and I’ve got some preliminary source code. I’m looking for feedback on all of these steps, since I’m not experienced at machine learning.

Here’s what I’ve done

  • I dumped all of the raw data that we’ll need to generate the public dumps. The queries are the comments in the two .pig files and it took about 52 minutes to do the dump against production. The result of this raw dump looks like:
    $ wc -l *.dump
     13,830,070 reddit_data_link.dump
    136,650,300 reddit_linkvote.dump
         69,489 reddit_research_ids.dump
     13,831,374 reddit_thing_link.dump
    
  • I filtered the list of votes for the list of users that gave us permission to use their data. For the curious, that’s 67,059 users: 62,763 with “public votes” and 6,726 with “allow my data to be used for research”. I’d really like to see that second category significantly increased, and hopefully this project will be what does it. This filtering is done by srrecs_researchers.pig and took 83m55.335s on my laptop.
  • I converted data-dumps that were in our DB schema format to a more usable format using srrecs.pig (about 13 min)
  • From that dump I mapped all of the account_ids, link_ids, and sr_ids to salted hashes (using obscure() in srrecs.py with a random seed, so even I don’t know it). This took about 13 min on my laptop. The result of this, votes.dump, is the file that is actually public. It is a tab-separated file consisting of:
    account_id,link_id,sr_id,dir
    

    There are 23,091,688 votes from 43,976 users over 3,436,063 links in 11,675 reddits. (Interestingly these ~44k users represent almost 17% of our total votes). The dump is 2.2gb uncompressed, 375mb in bz2.

What to do with it

The recommendations system that I’m trying right now turns those votes into a set of affinities. That is, “67% of user #223’s votes on /r/reddit.com are upvotes, and 52% on programming”. To make these affinities (55m45.107s on my laptop):

 cat votes.dump | ./srrecs.py "affinities_m()" | sort -S200m | ./srrecs.py "affinities_r()" > affinities.dump
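
(As an aside for R-only readers: a rough, unofficial equivalent of this affinity step, assuming the votes.dump columns listed above with dir = 1 for an upvote and -1 for a downvote, could be sketched as:)

# share of upvotes per (user, subreddit) pair, i.e. the affinity
votes <- read.delim("votes.dump", header = FALSE,
                    col.names = c("account_id", "link_id", "sr_id", "dir"))
aff   <- aggregate(dir ~ account_id + sr_id, data = votes,
                   FUN = function(d) mean(d > 0))
names(aff)[3] <- "affinity"   # scaled 0..1, as in affinities.dump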

Then I turn the affinities into a sparse matrix representing N-dimensional co-ordinates in the vector space of affinities (scaled to -1..1 instead of 0..1), in the format used by R’s skmeans package (less than a minute on my laptop). Imagine that this matrix looks like

          reddit.com pics       programming horseporn  bacon
          ---------- ---------- ----------- ---------  -----
ketralnis -0.5       (no votes) +0.45       (no votes) +1.0
jedberg   (no votes) -0.25      +0.95       +1.0       -1.0
raldi     +0.75      +0.75      +0.7        (no votes) +1.0
...

We build it like:

# they were already grouped by account_id, so we don't have to
# sort. changes to the previous step will probably require this
# step to have to sort the affinities first
cat affinities.dump | ./srrecs.py "write_matrix('affinities.cm', 'affinities.clabel', 'affinities.rlabel')"

I pass that through an R program, srrecs.r (if you don’t have R installed, you’ll need to install that, plus the package skmeans, e.g. install.packages('skmeans')). This program plots the users in this vector space, finding clusters using a spherical k-means clustering algorithm (on my laptop it takes about 10 minutes with 15 clusters and 16 minutes with 50 clusters, during which R sits at about 220mb of RAM).

# looks for the files created by write_matrix in the current directory
R -f ./srrecs.r

The output of the program is a generated list of cluster-IDs, corresponding in order to the order of user-IDs in affinities.clabel. The numbers themselves are meaningless, but people in the same cluster ID have been clustered together.
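
(For the curious, a minimal sketch of that clustering step – not the actual srrecs.r, and assuming the slam package's read_stm_CLUTO() can read the CLUTO-format matrix written above – would be:)

# a minimal sketch of spherical k-means on the affinities matrix (not the real srrecs.r)
library(slam)     # assumption: read_stm_CLUTO() reads the CLUTO sparse-matrix file
library(skmeans)  # spherical k-means clustering

aff <- read_stm_CLUTO("affinities.cm")   # users x subreddits affinity matrix
fit <- skmeans(aff, 15)                  # 15 clusters, as in the timings above
head(fit$cluster)                        # cluster id per user, in the label-file order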

Here are the files

These are torrents of bzip2-compressed files. If you can’t use the torrents for some reason it’s pretty trivial to figure out from the URL how to get to the files directly on S3, but please try the torrents first since it saves us a few bucks. It’s S3 seeding the torrents anyway, so it’s unlikely that direct-downloading is going to go any faster or be any easier.

  • votes.dump.bz2 — A tab-separated list of:
    account_id, link_id, sr_id, direction
    
  • For your convenience, a tab-separated list of votes already reduced to percent-affinities affinities.dump.bz2, formatted:
    account_id, sr_id, affinity (scaled 0..1)
    
  • For your convenience, affinities-matrix.tar.bz2 contains the R CLUTO-format matrix files affinities.cm, affinities.clabel, and affinities.rlabel

And the code

  • srrecs.pig, srrecs_researchers.pig — what I used to generate and format the dumps (you probably won’t need this)
  • mr_tools.py, srrecs.py — what I used to salt/hash the user information and generate the R CLUTO-format matrix files (you probably won’t need this unless you want different information in the matrix)
  • srrecs.r — the R-code to generate the clusters

Here’s what you can experiment with

  • The code isn’t nearly useable yet. We need to turn the generated clusters into an actual set of recommendations per cluster, preferably ordered by predicted match. We probably need to do some additional post-processing per user, too. (If they gave us an affinity of 0% to /r/askreddit, we shouldn’t recommend it, even if we predicted that the rest of their cluster would like it.)
  • We need a test suite to gauge the accuracy of the results of different approaches. This could be done by dividing the data set, using 80% for training and 20% to see if the predictions made from that 80% match.
  • We need to get the whole process to less than two hours, because that’s how often I want to run the recommender. It’s okay to use two or three machines to accomplish that and a lot of the steps can be done in parallel. That said we might just have to accept running it less often. It needs to run end-to-end with no user-intervention, failing gracefully on error
  • It would be handy to be able to identify the cluster of just a single user on-the-fly after generating the clusters in bulk
  • The results need to be hooked into the reddit UI. If you’re willing to dive into the codebase, this one will be important as soon as the rest of the process is working and has a lot of room for creativity
  • We need to find the sweet spot for the number of clusters to use. Put another way, how many different types of redditors do you think there are? This could best be done using the aforementioned test-suite and a good-old-fashioned binary search.

Some notes:

  • I’m not attached to doing this in R (I don’t even know much R, it just has a handy prebaked skmeans implementation). In fact I’m not attached to my methods here at all, I just want a good end-result.
  • This is my weekend fun project, so it’s likely to move very slowly if we don’t pick up enough participation here
  • The final version will run against the whole dataset, not just the public one. So even though I can’t release the whole dataset for privacy reasons, I can run your code and a test-suite against it

——————————————————————————————-

 

I am thinking of using Rattle and the arules package, and running it on EC2 to get the horsepower.
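
As a rough, hedged sketch of that arules idea (treating the set of subreddits each user has upvoted as a market-basket "transaction"; the column names and the dir = 1 upvote encoding are assumptions about the votes.dump layout):

# association rules over co-upvoted subreddits – a sketch, not a finished recommender
library(arules)

votes   <- read.delim("votes.dump", header = FALSE,
                      col.names = c("account_id", "link_id", "sr_id", "dir"))
up      <- votes[votes$dir > 0, ]                     # keep upvotes only (assumed encoding)
baskets <- lapply(split(as.character(up$sr_id), up$account_id), unique)
trans   <- as(baskets, "transactions")                # one basket of subreddits per user

rules <- apriori(trans, parameter = list(supp = 0.01, conf = 0.5))
inspect(head(sort(rules, by = "lift"), 10))           # strongest co-vote patterns

Rules of the form {subreddit A, subreddit B} => {subreddit C} could then be used to suggest subreddit C to users who vote in A and B.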

How else do you think you can tackle a recommendation engine problem?

 

Ajay

 

The SEO mess on joining blog aggregators

 

[Image: Mug shot of Paris Hilton, via Wikipedia]

 

If you are an analytics blogger whose writing is aggregated on an analytics community, read on. Here's how blog aggregation communities can help you lose 30% of all future traffic in the long term, while giving you a short-term lift.

The problem is not created by blogging communities (like R-Bloggers, or PlanetR, or Smart Data Collective, or AnalyticBridge, or even BeyeBlogs).

It is created by the way Google PageRank is structured: given exactly the same content on two different web pages, Google will rank the page with the higher PageRank higher. This is counter-intuitive and quite simple to rectify – the Google spider could just use the timestamp to determine where the article was published first (obviously on your blog, AND only later on the aggregator).

How bad is the mess? Well, joining ANY blog aggregator will lead to an instant lift of up to 10-50% of your current traffic as similar bloggers try to read about you. However, you can lose the roughly 30% of long-term traffic that search engines typically create for you.

So do you opt out of blog aggregation? No. It's an SEO mess, and it's unfair to punish your blog aggregator, most of whom are running on ad-supported sponsors or their own funds, on dry fumes, to publish your content. Most of the aforementioned communities are created by excellent people I have interacted with heavily, and they are genuinely motivated to give readers an easy way to keep up with blogs – especially Smart Data Collective, AnalyticBridge, and R-bloggers, whose founders I have known personally.

You can do one thing: create manual summaries in the excerpt feature of your blog posts (it's just below the WordPress editor), and switch your RSS feed to summary rather than full. This avoids losing keyword rank to other websites, it prevents the blog aggregator from gaining too much influence in keyword-related searches, and it keeps your whole ecosystem happy. Best of all, it helps readers of blog aggregators, since most of them use a summary on the front page anyway.

An additional thought on Google PageRank – something I have sulked over but not spoken about for a long, long time: it ignores the value of the reader. If Bill Gates, Steve Jobs, and 500 CEOs from Fortune 500 companies read my blog but do not link to it, it will just count daily traffic as 500. It will probably give more weightage to Paris Hilton fans.

A suggestion, humbly: you can use IP address lookup of visitors to see whether traffic is coming from corporate or retail sources – Clicky from GetClicky does this. Use it as feedback in Google Analytics as well as Google Trends.

And maybe PageRank needs to add quantity and quality of visitors as additional variables. Do an A/B test, guys, with some chi-square juice – it's not quite Mad Men advertising, but it's still good fun.

 

[Image: PageRank, via Wikipedia]

 

And the world is one big community, as per xkcd.