Ajay Ohri

Google Plus API: statistical text mining, anyone?

For the past year or two I have noticed a lot of statistical analysis using #rstats /R on unstructured text generated in real time by the social network Twitter. From an analytic point of view, Google Plus is an interesting social network because it is new and arrived after the analytic tools had become relatively refined. It is thus an interesting use case: with text mining tools already mature, we can measure how people behave globally, and how that behavior varies as the social network and its user interface evolve.

And it would also be a nice benchmark to do sentiment analysis across multiple social networks.

Here are some interesting examples of Twitter analysis done in R.

Using R to search Twitter for analysis

http://www.franklincenterhq.org/2429/using-r-to-search-twitter-for-analysis/

Text Data Mining With Twitter And R

http://heuristically.wordpress.com/2011/04/08/text-data-mining-twitter-r/

TWITTER FROM R… SURE, WHY NOT!

http://www.cerebralmastication.com/2009/06/twitter-from-r-sure-why-not/

A package called TwitteR

http://cran.r-project.org/web/packages/twitteR/

http://cran.r-project.org/web/packages/twitteR/vignettes/twitteR.pdf

slides from my R tutorial on Twitter text mining #rstats

http://jeffreybreen.wordpress.com/2011/07/04/twitter-text-mining-r-slides/

Generating graphs of retweets and @-messages on Twitter using R and Gephi

http://blog.ynada.com/339

But the Google Plus API is now active.

The Console lets you see and manage the following project information:

Activated APIs – Activate one or more APIs to enable traffic monitoring, filtering, and billing, and API-specific pages for your project. Read more about activating APIs here.
Traffic information – The Console reports traffic information for each activated API. Additionally, you can cap or filter usage by API. Read more about traffic reporting and request filtering here.
Billing information – When you activate billing, your activated APIs can exceed the courtesy usage quota. Usage fees are billed to the Google Checkout account that you specify. Read more about billing here.
Project keys – Each project is identified by either an API key or an OAuth 2.0 token. Use this key/token in your API requests to identify the project, in order to record usage data, enforce your filtering restrictions, and bill usage to the proper project. You can use the Console to generate or revoke API keys or OAuth 2.0 certificates to use in your application. Read more about keys here.
Team members – You can specify additional members with read, write, or ownership access to this project’s Console page. Read more about team members here.

https://code.google.com/apis/console/b/0/

Google+ API			Courtesy limit: 1,000 queries/day

Effective limits:

API	Per-User Limit	Used	Courtesy Limit
Google+ API	5.0 requests/second/user	0%	1,000 queries/day

http://developers.google.com/+/api/

API Calls

Most of the Google+ API follows a RESTful API design, meaning that you use standard HTTP methods to retrieve and manipulate resources. For example, to get the profile of a user, you might send an HTTP request like:

GET https://www.googleapis.com/plus/v1/people/userId

Common Parameters

Different API methods require parameters to be passed either as part of the URL path or as query parameters. Additionally, there are a few parameters that are common to all API endpoints. These are all passed as optional query parameters.

Parameter Name	Value	Description
`callback`	`string`	Specifies a JavaScript function that will be passed the response data for using the API with JSONP.
`fields`	`string`	Selector specifying which fields to include in a partial response.
`key`	`string`	API key. Your API key identifies your project and provides you with API access, quota, and reports. Required unless you provide an OAuth 2.0 token.
`access_token`	`string`	OAuth 2.0 token for the current user. Learn more about OAuth.
`prettyPrint`	`boolean`	If set to “true”, data output will include line breaks and indentation to make it more readable. If set to “false”, unnecessary whitespace is removed, reducing the size of the response. Defaults to “true”.
`userIp`	`string`	Identifies the IP address of the end user for whom the API call is being made. This allows per-user quotas to be enforced when calling the API from a server-side application. Learn more about Capping Usage.
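As a rough illustration of how the common parameters combine with a resource path, here is a minimal Python sketch that only builds the request URL and does not send it. The helper name `build_people_url` and the placeholder key `YOUR_API_KEY` are my own assumptions; the base URL and parameter names come from the documentation quoted above.

```python
from urllib.parse import quote, urlencode

# Base URL taken from the example GET request quoted above.
API_ROOT = "https://www.googleapis.com/plus/v1"


def build_people_url(user_id, api_key, fields=None, pretty_print=None):
    """Build the URL for a people/userId request (hypothetical helper)."""
    params = {"key": api_key}
    if fields is not None:
        # Selector for a partial response, e.g. "id,displayName".
        params["fields"] = fields
    if pretty_print is not None:
        # The documentation describes the values as "true" and "false".
        params["prettyPrint"] = "true" if pretty_print else "false"
    return f"{API_ROOT}/people/{quote(user_id, safe='')}?{urlencode(params)}"


url = build_people_url(
    "118051310819094153327",
    "YOUR_API_KEY",  # placeholder, not a real key
    fields="id,displayName",
    pretty_print=False,
)
print(url)
```

Setting `prettyPrint` to false and restricting `fields` keeps responses small, which matters when the courtesy limit is 1,000 queries a day.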

Data Formats

Resources in the Google+ API are represented using JSON data formats. For example, retrieving a user’s profile may result in a response like:

{
  "kind": "plus#person",
  "id": "118051310819094153327",
  "displayName": "Chirag Shah",
  "url": "https://plus.google.com/118051310819094153327",
  "image": {
    "url": "https://lh5.googleusercontent.com/-XnZDEoiF09Y/AAAAAAAAAAI/AAAAAAAAYCI/7fow4a2UTMU/photo.jpg"
  }
}

Common Properties

While each type of resource will have its own unique representation, there are a number of common properties that are found in almost all resource representations.

Property Name	Value	Description
`displayName`	`string`	This is the name of the resource, suitable for displaying to a user.
`id`	`string`	This property uniquely identifies a resource. Every resource of a given kind will have a unique `id`. Even though an `id` may sometimes look like a number, it should always be treated as a string.
`kind`	`string`	This identifies what kind of resource a JSON object represents. This is particularly useful when programmatically determining how to parse an unknown object.
`url`	`string`	This is the primary URL, or permalink, for the resource.
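To make the common properties concrete, the following Python sketch parses a trimmed copy of the sample profile response shown earlier and reads `kind`, `id` and `displayName`. The function name `summarize` is hypothetical, and the sketch assumes the response has already been fetched as text.

```python
import json

# Trimmed copy of the sample profile response quoted earlier in this post.
RAW_RESPONSE = """
{
  "kind": "plus#person",
  "id": "118051310819094153327",
  "displayName": "Chirag Shah",
  "url": "https://plus.google.com/118051310819094153327"
}
"""


def summarize(resource):
    """Return a one-line description built from the common properties."""
    kind = resource.get("kind", "")
    # Keep the id as a string even though it looks like a number.
    resource_id = str(resource["id"])
    # "plus#person" -> "person"; fall back to "unknown" if kind is missing.
    label = kind.split("#", 1)[-1] if kind else "unknown"
    return f"{label} {resource_id}: {resource.get('displayName', '')}"


person = json.loads(RAW_RESPONSE)
print(summarize(person))
```

Branching on `kind` is the natural hook when the same code has to handle people, activities and feeds.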

Pagination

In requests that can respond with potentially large collections, such as Activities list, each response contains a limited number of items, set by maxResults (default: 20). Each response also contains a nextPageToken property. To obtain the next page of items, you pass this value of nextPageToken to the pageToken property of the next request. Repeat this process to page through the full collection.

For example, calling Activities list returns a response with nextPageToken:

{
  "kind": "plus#activityFeed",
  "title": "Plus Public Activities Feed",
  "nextPageToken": "CKaEL",
  "items": [
    {
      "kind": "plus#activity",
      "id": "123456789",
      ...
    },
    ...
  ]
  ...
}

To get the next page of activities, pass the value of this token in with your next Activities list request:

https://www.googleapis.com/plus/v1/people/me/activities/public?pageToken=CKaEL

As before, the response to this request includes nextPageToken, which you can pass in to get the next page of results. You can continue this cycle to get new pages — for the last page, “nextPageToken” will be absent.
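The paging cycle described above can be sketched in Python as a small generator. This is an illustration under assumptions, not the official client: `iter_items` and `fetch_page` are hypothetical names, and the HTTP call is replaced by two canned pages so the example runs offline.

```python
def iter_items(fetch_page):
    """Yield every item from a paginated collection.

    fetch_page(page_token) must return the decoded JSON dict for one page;
    page_token is None for the first request.
    """
    token = None
    while True:
        page = fetch_page(token)
        yield from page.get("items", [])
        # The last page has no nextPageToken, which ends the loop.
        token = page.get("nextPageToken")
        if not token:
            break


# Stand-in for real HTTP requests: canned pages keyed by page token.
FAKE_PAGES = {
    None: {"nextPageToken": "CKaEL", "items": [{"id": "1"}, {"id": "2"}]},
    "CKaEL": {"items": [{"id": "3"}]},
}

ids = [item["id"] for item in iter_items(FAKE_PAGES.get)]
print(ids)
```

In real use, `fetch_page` would issue the Activities list request and add `pageToken` to the query string whenever the token is not None.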

It would be interesting to see the first wave of analysis on this new social network, and whether it differs from the others at all.

After all, an API is only as good as the analysis and applications that can be done on the data it provides.

More fun on Google Plus

I have been posting cool stuff from my G+ stream almost since the social network was released, so here is the next in the series of posts on great stuff I get in my Google Plus stream.

1) Photographers are good sharers

Anna Rumiantseva originally shared this post:

Photos from our recent trip to Santa Fe, NM. These are of Loretto Chapel which has the Miraculous Staircase. This staircase has a mystery to it, as it is said to be built without nails by a carpenter who showed up after the sisters of the chapel prayed for 9 days. It took several months to be built by this carpenter who then left without pay and could not be found. The sisters believe it was St. Joseph himself that built the staircase and answered their prayers.
Please share if you like!

New Mexico (4 photos)

2) Cool Designer Retro Stuff

the water cooler at my workplace.

3) Social Media Experts-

Jay Jaboneta originally shared this post:

GMA Network launched an online campaign to raise awareness about the responsible use of social media, so please think before you click.

Jay Jaboneta changed his profile photo.

4) No, you can't share GIFs on Facebook

amazing rescue in Utah. http://www.sltrib.com/sltrib/news/52574438-78/story.csp?page=1

5) Cool Art

Monica Rocha originally shared this post:

6) Toons

Rupesh Nandy originally shared this post:

Birthdays – Then & Now

8) Geeks rock!

David Smith originally shared this post:

Yet another instance of the Golden Ratio in Nature: Irene.

lastly 9) Digital art

Marcelo Almeida originally shared this post:

behind the smile

But Willie Nelson rules them all

Willie Nelson covers Coldplay. Sounds pretty good! This reminds me of Johnny Cash’s cover of Hurt. (Yes, this is a Chipotle ad. It’s still pretty cool.)

Back to the Start

youtube.com – Coldplay’s haunting classic ‘The Scientist’ is performed by country music legend Willie Nelson

https://www.youtube-nocookie.com/v/aMfSGt6rHos?version=3&hl=en_US&rel=0

– see earlier posts at

Warning: this and earlier posts deal with cute memes that can take a lot of time and energy!

Cloud Computing using Python

I liked the new features in PiCloud, which is a cloud computing platform for Python. Python is increasingly popular as a computational language, and the cloud is where hardware is headed, at least as of 2011-12.

http://www.picloud.com/

The new feature allows you to publish your own Python functions as URLs.

Why would you want to publish a function?

To call your Python functions from a programming language other than Python.

To use PiCloud from Google AppEngine, which does not support our native client library.

To easily setup a scalable RPC system.

Here’s a peek at the interface:

You publish a Python function

cloud.rest.publish(your_func, 'myfunction')

We give you a URL back

https://api.picloud.com/r/2/myfunction/

You make an HTTP request using your method of choice to the URL

curl -k -u 'key:secret_key' https://api.picloud.com/r/2/myfunction/
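For readers who would rather call the published URL from Python than from curl, here is a minimal sketch using only the standard library. It assumes the endpoint accepts HTTP Basic authentication with the same key and secret that curl's -u flag sends. The helper name `build_call` is hypothetical, and the request is built but not sent.

```python
import base64
import urllib.request


def build_call(url, key, secret_key):
    """Build a request carrying HTTP Basic credentials (hypothetical helper)."""
    credentials = f"{key}:{secret_key}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    request = urllib.request.Request(url)
    request.add_header("Authorization", f"Basic {token}")
    return request


# 'key' and 'secret_key' are the placeholders from the curl example above.
request = build_call("https://api.picloud.com/r/2/myfunction/", "key", "secret_key")
print(request.full_url)
print(request.get_header("Authorization"))

# Sending is left commented out because it needs real credentials:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

Unlike `curl -k`, this sketch leaves certificate verification switched on, which is the safer default.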

It certainly is an interesting development and I am wondering how other languages can adopt this paradigm as well.

For R, as of now http://www.cloudnumbers.com/ seems to be the only player in the cloud.

It would be exciting to see more players in the cloud statistical analytical space.

Page Mathematics

I was looking at the site http://www.google.com/adplanner/static/top1000/index.html

and I saw the list below. Using a Google Doc at https://docs.google.com/spreadsheet/pub?hl=en_US&hl=en_US&key=0AtYMMvghK2ytdE9ybmVQeUxMeXdjWlVKYzRlMkxjX0E&output=html

I then decided to divide page views by unique visitors to check the maths.

Facebook is AAAAAmazing! and the Russian social network is amazing too!

The maths is wrong! (maybe sampling, maybe virtual pageviews caused by friendstream refresh)

But the average of 1,136 page views per unique visitor per month works out to roughly 37 page views per visitor per day!
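The division is easy to reproduce. The Python sketch below uses three rows from the Ad Planner list in this post; the 30-day month is my own simplifying assumption, and `views_per_visitor` is just an illustrative helper name.

```python
# (unique visitors, page views) per month, copied from the list in this post.
SITES = {
    "facebook.com": (880_000_000, 1_000_000_000_000),
    "linkedin.com": (80_000_000, 2_500_000_000),
    "vkontakte.ru": (34_000_000, 48_000_000_000),
}

DAYS_PER_MONTH = 30  # assumption; a 31-day month lowers the daily figure slightly


def views_per_visitor(unique_visitors, page_views):
    """Average monthly page views per unique visitor."""
    return page_views / unique_visitors


for site, (visitors, views) in SITES.items():
    monthly = views_per_visitor(visitors, views)
    daily = monthly / DAYS_PER_MONTH
    print(f"{site}: {monthly:,.0f} per month, {daily:.1f} per day")
```

The same loop over the full top-1000 list would make it easy to spot which sites have implausibly high ratios.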

Rank	Site	Category	Unique Visitors (users)	Page Views	Page Views/Visitors
1	facebook.com	Social Networks	880000000	1000000000000	1,136
29	linkedin.com	Social Networks	80000000	2500000000	31
38	orkut.com	Social Networks	66000000	4000000000	61
40	orkut.com.br	Social Networks	62000000	43000000000	694
65	weibo.com	Social Networks	42000000	2800000000	67
66	renren.com	Social Networks	42000000	3300000000	79
84	odnoklassniki.ru	Social Networks	37000000	13000000000	351
90	scribd.com	Social Networks	34000000	140000000	4
95	vkontakte.ru	Social Networks	34000000	48000000000	1,412

and

Rank	Site	Category	Unique Visitors (users)	Page Views	Page Views/Visitors
1	facebook.com	Social Networks	880000000	1000000000000	1,136
2	youtube.com	Online Video	800000000	100000000000	125
3	yahoo.com	Web Portals	590000000	77000000000	131
4	live.com	Search Engines	490000000	84000000000	171
5	msn.com	Web Portals	440000000	20000000000	45
6	wikipedia.org	Dict	410000000	6000000000	15
7	blogspot.com	Blogging	340000000	4900000000	14
8	baidu.com	Search Engines	300000000	110000000000	367
9	microsoft.com	Software	250000000	2500000000	10
10	qq.com	Web Portals	250000000	39000000000	156

See the complete list at http://www.google.com/adplanner/static/top1000/index.html

A Sacrifice of Statistics

From an advertisement placed by Govt of Pakistan in Wall Street Journal,

Only Pakistan= Making sacrifices statistics cannot reflect.

Oh dear! What would the statisticians say?

Also see http://blogs.wsj.com/indiarealtime/2011/09/13/pakistan-wsj-ad-unlikely-to-change-narrative/

The ad cites a series of statistics. Almost 22,000 Pakistani civilians have died or been seriously injured in the fight against terrorism, the ad said. The army has lost almost 3,000 soldiers. More than 3.5 million people have been displaced by the fighting and the damage to the economy over the past decade is estimated at $68 billion, it added.

People will quibble with these statistics from a country where reporters often find it difficult to get basic data.

Interview: Dan Steinberg, Founder, Salford Systems

Here is an interview with Dan Steinberg, Founder and President of Salford Systems (http://www.salford-systems.com/ )

Ajay- Describe your journey from academia to technology entrepreneurship. What are the key milestones or turning points that you remember?

Dan- When I was in graduate school studying econometrics at Harvard, a number of distinguished professors at Harvard (and MIT) were actively involved in substantial real world activities. Professors that I interacted with, or studied with, or whose software I used became involved in the creation of such companies as Sun Microsystems, Data Resources, Inc. or were heavily involved in business consulting through their own companies or other influential consultants. Some not involved in private sector consulting took on substantial roles in government such as membership on the President’s Council of Economic Advisors. The atmosphere was one that encouraged free movement between academia and the private sector so the idea of forming a consulting and software company was quite natural and did not seem in any way inconsistent with being devoted to the advancement of science.

Ajay- What are the latest products by Salford Systems? Are there any future product plans or modifications for Big Data analytics, mobile computing and cloud computing?

Dan- Our central set of data mining technologies are CART, MARS, TreeNet, RandomForests, and PRIM, and we have always maintained feature rich logistic regression and linear regression modules. In our latest release scheduled for January 2012 we will be including a new data mining approach to linear and logistic regression allowing for the rapid processing of massive numbers of predictors (e.g., one million columns), with powerful predictor selection and coefficient shrinkage. The new methods allow not only classic techniques such as ridge and lasso regression, but also sub-lasso model sizes. Clear tradeoff diagrams between model complexity (number of predictors) and predictive accuracy allow the modeler to select an ideal balance suitable for their requirements.

The new version of our data mining suite, Salford Predictive Modeler (SPM), also includes two important extensions to the boosted tree technology at the heart of TreeNet. The first, Importance Sampled learning Ensembles (ISLE), is used for the compression of TreeNet tree ensembles. Starting with, say, a 1,000 tree ensemble, the ISLE compression might well reduce this down to 200 reweighted trees. Such compression will be valuable when models need to be executed in real time. The compression rate is always under the modeler’s control, meaning that if a deployed model may only contain, say, 30 trees, then the compression will deliver an optimal 30-tree weighted ensemble. Needless to say, compression of tree ensembles should be expected to be lossy and how much accuracy is lost when extreme compression is desired will vary from case to case. Prior to ISLE, practitioners have simply truncated the ensemble to the maximum allowable size. The new methodology will substantially outperform truncation.

The second major advance is RULEFIT, a rule extraction engine that starts with a TreeNet model and decomposes it into the most interesting and predictive rules. RULEFIT is also a tree ensemble post-processor and offers the possibility of improving on the original TreeNet predictive performance. One can think of the rule extraction as an alternative way to explain and interpret an otherwise complex multi-tree model. The rules extracted are similar conceptually to the terminal nodes of a CART tree but the various rules will not refer to mutually exclusive regions of the data.

Ajay- You have led teams that have won multiple data mining competitions. What are some of your favorite techniques or approaches to a data mining problem?

Dan- We only enter competitions involving problems for which our technology is suitable, generally, classification and regression. In these areas, we are partial to TreeNet because it is such a capable and robust learning machine. However, we always find great value in analyzing many aspects of a data set with CART, especially when we require a compact and easy to understand story about the data. CART is exceptionally well suited to the discovery of errors in data, often revealing errors created by the competition organizers themselves. More than once, our reports of data problems have been responsible for the competition organizer’s decision to issue a corrected version of the data and we have been the only group to discover the problem.

In general, tackling a data mining competition is no different than tackling any analytical challenge. You must start with a solid conceptual grasp of the problem and the actual objectives, and the nature and limitations of the data. Following that comes feature extraction, the selection of a modeling strategy (or strategies), and then extensive experimentation to learn what works best.

Ajay- I know you have created your own software. But is there other software that you use or like to use?

Dan- For analytics we frequently test open source software to make sure that our tools will in fact deliver the superior performance we advertise. In general, if a problem clearly requires technology other than that offered by Salford, we advise clients to seek other consultants expert in that other technology.

Ajay- Your software is installed at 3500 sites including 400 universities, as per http://www.salford-systems.com/company/aboutus/index.html . What is the key to managing and keeping so many customers happy?

Dan- First, we have taken great pains to make our software reliable and we make every effort to avoid problems related to bugs. Our testing procedures are extensive and we have experts dedicated to stress-testing software. Second, our interface is designed to be natural, intuitive, and easy to use, so the challenges to the new user are minimized. Also, clear documentation, help files, and training videos round out how we allow the user to look after themselves. Should a client need to contact us we try to achieve 24-hour turn around on tech support issues and monitor all tech support activity to ensure timeliness, accuracy, and helpfulness of our responses. WebEx/GotoMeeting and other internet based contact permit real time interaction.

Ajay- What do you do to relax and unwind?

Dan- I am in the gym almost every day combining weight and cardio training. No matter how tired I am before the workout I always come out energized so locating a good gym during my extensive travels is a must. I am also actively learning Portuguese so I look to watch a Brazilian TV show or Portuguese dubbed movie when I have time; I almost never watch any form of video unless it is available in Portuguese.

Biography-

http://www.salford-systems.com/blog/dan-steinberg.html

Dan Steinberg, President and Founder of Salford Systems, is a well-respected member of the statistics and econometrics communities. In 1992, he developed the first PC-based implementation of the original CART procedure, working in concert with Leo Breiman, Richard Olshen, Charles Stone and Jerome Friedman. In addition, he has provided consulting services on a number of biomedical and market research projects, which have sparked further innovations in the CART program and methodology.

Dr. Steinberg received his Ph.D. in Economics from Harvard University, and has given full day presentations on data mining for the American Marketing Association, the Direct Marketing Association and the American Statistical Association. After earning a PhD in Econometrics at Harvard, Steinberg began his professional career as a Member of the Technical Staff at Bell Labs, Murray Hill, and then as Assistant Professor of Economics at the University of California, San Diego. A book he co-authored on Classification and Regression Trees was awarded the 1999 Nikkei Quality Control Literature Prize in Japan for excellence in statistical literature promoting the improvement of industrial quality control and management.

His consulting experience at Salford Systems has included complex modeling projects for major banks worldwide, including Citibank, Chase, American Express, Credit Suisse, and has included projects in Europe, Australia, New Zealand, Malaysia, Korea, Japan and Brazil. Steinberg led the teams that won first place awards in the KDDCup 2000, and the 2002 Duke/TeraData Churn modeling competition, and the teams that won awards in the PAKDD competitions of 2006 and 2007. He has published papers in economics, econometrics, computer science journals, and contributes actively to the ongoing research and development at Salford.

Congrats to Matt Stromberg: Winner of 2 free passes to PAW New York

Here is a big congrats to Matt Stromberg of San Diego for winning 2 free passes to Predictive Analytics World. Each pass can be used for 2 days of the conference, and it is exclusive to that conference alone.

Connect with Matt?

https://www.facebook.com/profile.php?id=3611395 or http://www.linkedin.com/pub/matt-stromberg/6/a3b/47a

A coincidence: it's his birthday today. Happy Birthday, Matt, and enjoy NY and PAW Con!

WINNER- Matt Stromberg

Mgr., Project Management & Business Analytics

Greater San Diego Area
