Google introduces Analytics Academy for e-learning

I really liked this and promptly signed up at https://analyticsacademy.withgoogle.com/course

I of course passed the Google Web Analytics IQ test some two years back (but it's only valid for 18 months).

Digital Analytics Fundamentals

This three-week course provides a foundation for marketers and analysts seeking to understand the core principles of digital analytics and to improve business performance through better digital measurement.

Course highlights include:

An overview of today’s digital measurement landscape
Guidance on how to build an effective measurement plan
Best practices for collecting actionable data
Descriptions of key digital measurement concepts, terminology and analysis techniques
Deep-dives into Google Analytics reports with specific examples for evaluating your digital marketing performance

View lessons from experts

Watch or read lessons from digital analytics advocate Justin Cutroni, all at your own pace.

Test your knowledge

Apply what you learn in the course by completing short quizzes and practice exercises.

Join the learning community

Engage with other course participants and analytics experts in the course forum and on Google+.

Algorithms.io is Dataweek Startup of September

Algorithms.io, the startup of Andy Bartley, one of the guys I keep bouncing ideas off on an irregular basis, just won a startup competition:

http://dataweek.co/algorithms-io-wins-data-2-0-summit-2013-startup-pitch-competition/

Andy was kind enough to mention me at the link above (I extracted it here). What is really cool is that they are now going to demo analytics for wearable computing. That's right: Analytics + Google Glass? Any takers? 🙂

See:

Visit Algorithms.io tomorrow and Thursday at Dataweek 2013 at the Fort Mason Center in San Francisco. We will be in booth #118 giving a live demo of our new machine learning platform for wearable devices.

This new platform intelligently classifies streaming data from wearable devices into actionable events that can be used to build predictive applications. It combines a data scientist, dev ops engineer, and developer all into one simple service.

Geoff: Is Algorithms.io a “marketplace for algorithms” or do you plan on producing / curating most of the algorithms internally?

Andy: Right now we are performing the curation internally. When you get past the marketing hype around Big Data, Machine Learning, Predictive Analytics, etc. what you’ll find is most companies still aren’t sure exactly how these technologies can benefit their business. We talk with Fortune 500 companies every week who have few if any data scientists in house, and aren’t using any intelligent algorithms. Our main focus right now is working with those companies to help them understand the use cases and how they integrate with the business model.

Longer term, we think there is an opportunity for an algorithm marketplace. This isn’t a new topic, one of our advisors Ajay Ohri, also the author of Springer’s book on R, wrote about this idea back in 2011 (http://readwrite.com/2011/06/01/an-app-store-for-algorithms#awesm=~ohfvTpPiq6Jmt5). We’ve discussed this topic with folks at some of the potential players like Google who could be interested in this type of marketplace. Two of the primary gating factors for an algorithm marketplace are data quality and use cases. Data quality is still a fundamental challenge, and the really compelling business use cases today can be tackled with a relatively limited set of algorithms. As companies get more sophisticated data infrastructure in the next 2 – 3 years, the bar will begin to rise and an opportunity could emerge for commerce around algorithms. We’re doing a number of things on the technology and IP fronts to position us to play in this space when it emerges.

Using R for random number creation from time stamps #rstats

Suppose (let us just suppose) you want to create random numbers that are reproducible and derived from time stamps.

Here is the code in R:

> a=as.numeric(Sys.time())
> set.seed(a)
> rnorm(log(a))

Note: you can use any custom function of the system time (I used the log) to decide how many random numbers to generate. The result is a list of pseudo-random numbers (since nothing machine-driven is purely random in the strict philosophical sense of the word).

a=as.numeric(Sys.time())
set.seed(a)
abs(100000000*rnorm(abs(log(a))))

[1] 39621645 99451316 109889294 110275233 278994547 6554596 38654159 68748122 8920823 13293010
[11] 57664241 24533980 174529340 105304151 168006526 39173857 12810354 145341412 241341095 86568818
[21] 105672257
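The numbers above are only reproducible if the time stamp used as the seed is kept somewhere. Here is a minimal sketch of that idea; the helper name draws_for_stamp and the choice of five draws are my own assumptions for illustration, not part of the original code.

```r
# Capture the time stamp once and keep it: this value is the seed to store.
stamp <- as.integer(Sys.time())

# Hypothetical helper: regenerate the draws that belong to a stored time stamp.
draws_for_stamp <- function(stamp, n = 5) {
  set.seed(stamp)   # the same seed always restarts the same sequence
  rnorm(n)
}

first  <- draws_for_stamp(stamp)
second <- draws_for_stamp(stamp)   # replayed later from the stored stamp

identical(first, second)           # TRUE
```

Storing stamp next to the event it belongs to (for example in a log line) is what lets someone else rebuild the same numbers later.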

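Possible applications: things that need both pseudo-random numbers and time stamps, like events, web or industrial logs, or pseudo-random pass codes in the style of Google two-factor authentication. Keep in mind that a seed taken from the clock can be guessed, so numbers produced this way should not be used as real encryption keys.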

Note that I used the rnorm function, but you could also pick the generating function itself at random (for example rnorm or rcauchy).
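A rough sketch of picking the generating function at random could look like this. The list name generators and the choice of ten draws are assumptions made for the example.

```r
# Seed from the clock, as in the code above
a <- as.integer(Sys.time())
set.seed(a)

# Candidate generators (both are in the base 'stats' package)
generators <- list(rnorm = rnorm, rcauchy = rcauchy)

# Pick one generator by name, then draw ten numbers from it
pick <- sample(names(generators), 1)
x <- generators[[pick]](10)

pick   # which distribution was used for this seed
x
```

Because the pick happens after set.seed(a), the same time stamp selects the same distribution and the same draws.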

Again, I would rather trust my own randomness than one generated by an arm of the US Govt (see http://www.nist.gov/itl/csd/ct/nist_beacon.cfm )

Update- Random numbers in R

http://stat.ethz.ch/R-manual/R-patched/library/base/html/Random.html

Details

The currently available RNG kinds are given below. kind is partially matched to this list. The default is "Mersenne-Twister".

"Wichmann-Hill":

The seed, .Random.seed[-1] == r[1:3] is an integer vector of length 3, where each r[i] is in 1:(p[i] - 1), where p is the length 3 vector of primes, p = (30269, 30307, 30323). The Wichmann–Hill generator has a cycle length of 6.9536e12 (= prod(p-1)/4, see Applied Statistics (1984) 33, 123 which corrects the original article).

"Marsaglia-Multicarry":

A multiply-with-carry RNG is used, as recommended by George Marsaglia in his post to the mailing list ‘sci.stat.math’. It has a period of more than 2^60 and has passed all tests (according to Marsaglia). The seed is two integers (all values allowed).

"Super-Duper":

Marsaglia’s famous Super-Duper from the 70’s. This is the original version which does not pass the MTUPLE test of the Diehard battery. It has a period of about 4.6*10^18 for most initial seeds. The seed is two integers (all values allowed for the first seed: the second must be odd).

We use the implementation by Reeds et al. (1982–84).

The two seeds are the Tausworthe and congruence long integers, respectively. A one-to-one mapping to S’s .Random.seed[1:12] is possible but we will not publish one, not least as this generator is not exactly the same as that in recent versions of S-PLUS.

"Mersenne-Twister":

From Matsumoto and Nishimura (1998). A twisted GFSR with period 2^19937 - 1 and equidistribution in 623 consecutive dimensions (over the whole period). The ‘seed’ is a 624-dimensional set of 32-bit integers plus a current position in that set.

"Knuth-TAOCP-2002":

A 32-bit integer GFSR using lagged Fibonacci sequences with subtraction. That is, the recurrence used is

X[j] = (X[j-100] - X[j-37]) mod 2^30

and the ‘seed’ is the set of the 100 last numbers (actually recorded as 101 numbers, the last being a cyclic shift of the buffer). The period is around 2^129.

"Knuth-TAOCP":

An earlier version from Knuth (1997).

The 2002 version was not backwards compatible with the earlier version: the initialization of the GFSR from the seed was altered. R did not allow you to choose consecutive seeds, the reported ‘weakness’, and already scrambled the seeds.

Initialization of this generator is done in interpreted R code and so takes a short but noticeable time.

"L'Ecuyer-CMRG":

A ‘combined multiple-recursive generator’ from L’Ecuyer (1999), each element of which is a feedback multiplicative generator with three integer elements: thus the seed is a (signed) integer vector of length 6. The period is around 2^191.

The 6 elements of the seed are internally regarded as 32-bit unsigned integers. Neither the first three nor the last three should be all zero, and they are limited to less than 4294967087 and 4294944443 respectively.

This is not particularly interesting of itself, but provides the basis for the multiple streams used in package parallel.
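To make the "multiple streams" remark concrete, here is a small sketch using nextRNGStream from package parallel. The seed value 2013 and the variable names are arbitrary choices for the example.

```r
library(parallel)

old_kind <- RNGkind("L'Ecuyer-CMRG")   # switch generator, remember the old kinds
set.seed(2013)

# The current state is the first stream; nextRNGStream() derives the next one
stream1 <- get(".Random.seed", envir = globalenv())
stream2 <- nextRNGStream(stream1)

# Draw three uniforms from each stream by installing its state first
assign(".Random.seed", stream1, envir = globalenv())
u1 <- runif(3)
assign(".Random.seed", stream2, envir = globalenv())
u2 <- runif(3)

RNGkind(old_kind[1])                   # restore the previous generator kind
```

Each worker in a parallel job would get its own stream in this way, so the workers do not replay one another's numbers.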

"user-supplied":

Use a user-supplied generator.

Function RNGkind allows user-coded uniform and normal random number generators to be supplied.
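As a quick illustration of the text above, this sketch switches between two of the listed kinds with RNGkind and shows that the same seed gives different numbers under different generators. The seed 42 is an arbitrary choice.

```r
previous <- RNGkind()            # query the current kinds without changing them

RNGkind("Wichmann-Hill")
set.seed(42)
wh <- runif(3)                   # three uniforms from Wichmann-Hill

RNGkind("Mersenne-Twister")
set.seed(42)
mt <- runif(3)                   # same seed, different generator

identical(wh, mt)                # FALSE: the seed alone does not fix the numbers

RNGkind(previous[1], previous[2])  # put the original settings back
```

So a result is only reproducible if both the seed and the RNG kind are recorded.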

BigML goes on hyperdrive- exciting new features

Some changes at BigML.com, whose CEO I have interviewed here

Their earlier innovation in making a marketplace for models (similar to a marketplace for apps) was written about here

I like the concept of BigMLer, a command line tool: https://bigml.com/bigmler

New changes are:

1) Text Analysis is now available. It looks like a rudimentary tdm (term document matrix), but I have yet to test whether I can do clustering within text data too

2) A cloud server called BigML Predict Server, which should make adoption faster in industries like finance where data hygiene for sensitive data matters

3) Confusion Matrix for evaluation, a long overdue step. Maybe some curves should be added to evaluation here 😉

4) Misc technical upgrades that are more complex to execute and less interesting to write about

multi label classification
secret urls for sharing models (view model only not data)
export to MS Excel (maybe add Google Docs export?)
etc
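For readers who have not used a confusion matrix before, here is a plain R sketch of the idea. It is not BigML's implementation, and the actual and predicted labels are made-up toy values.

```r
# Made-up labels for eight cases (illustration only)
actual    <- factor(c("yes", "no", "yes", "yes", "no", "no",  "yes", "no"),
                    levels = c("no", "yes"))
predicted <- factor(c("yes", "no", "no",  "yes", "no", "yes", "yes", "no"),
                    levels = c("no", "yes"))

# Rows are predictions, columns are the true classes
cm <- table(Predicted = predicted, Actual = actual)
cm

# Share of cases on the diagonal, i.e. predicted correctly
accuracy <- sum(diag(cm)) / sum(cm)
accuracy   # 0.75
```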

Overall, with the addition of training courses as well, this is a new phase for this data science startup that I have been tracking for the past few years.


Life Cycle of a Data Science Project

It is best to use CRISP-DM, SEMMA and/or KDD for a systematic approach

1) Understanding Business Requirements from Client

2) Converting Business Problem to a Statistical Problem

what data to collect
what is the cost of data
how can I enhance the data
data quality issues

3) Solving Statistical Problem with Tools (R, SAS, Excel)

import
data quality
outlier and missing value treatment
exploratory analysis
data visualization
hypothesis and problem framing
data mining and pattern identification
create success parameters for statistical solution
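A few of the items under step 3 can be sketched in R as follows. The built-in airquality dataset and median imputation are my own choices for illustration; a real project would pick the treatment to suit the data.

```r
# Import: use a dataset that ships with R
aq <- datasets::airquality

# Data quality: count missing values per column
missing_per_column <- colSums(is.na(aq))
missing_per_column

# Missing value treatment: fill Ozone gaps with the median (one simple option)
aq$Ozone[is.na(aq$Ozone)] <- median(aq$Ozone, na.rm = TRUE)

# Outlier check: values beyond the boxplot whiskers
ozone_outliers <- boxplot.stats(aq$Ozone)$out

# Exploratory analysis: summary statistics and a simple relationship
summary(aq$Ozone)
cor(aq$Ozone, aq$Temp)
```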

4) Converting Statistical Solution to Business Solution

project report template
assumptions and caveats
feedback from stakeholders

5) Communicating Business Solution to Client

presentation
report
customer satisfaction
monitoring of results

Using R and TwitteR together on Windows #rstats

Apparently you need to add the following on a Windows OS:

options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))
download.file(url="http://curl.haxx.se/ca/cacert.pem", destfile="cacert.pem")

twitCred$handshake(cainfo="cacert.pem")

I am still investigating this so that I can update the tutorial in my previous post into a complete, stand-alone tutorial from beginning to end

Using Twitter Data with R #rstats updated for API changes

Step 1

Install Package twitteR

install.packages("twitteR")
> install.packages("twitteR")
Installing package(s) into ‘/home/R/library’
(as ‘lib’ is unspecified)
also installing the dependencies ‘ROAuth’, ‘rjson’

trying URL 'http://cran.rstudio.com/src/contrib/ROAuth_0.9.3.tar.gz'
Content type 'application/x-gzip' length 6202 bytes
opened URL
==================================================
downloaded 6202 bytes

trying URL 'http://cran.rstudio.com/src/contrib/rjson_0.2.13.tar.gz'
Content type 'application/x-gzip' length 98132 bytes (95 Kb)
opened URL
==================================================
downloaded 95 Kb

trying URL 'http://cran.rstudio.com/src/contrib/twitteR_1.1.7.tar.gz'
Content type 'application/x-gzip' length 121696 bytes (118 Kb)
opened URL
==================================================
downloaded 118 Kb

* installing *source* package ‘ROAuth’ ...
** package ‘ROAuth’ successfully unpacked and MD5 sums checked
** R
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded

* DONE (ROAuth)
* installing *source* package ‘rjson’ ...
** package ‘rjson’ successfully unpacked and MD5 sums checked
** libs
g++ -I/usr/share/R/include -DNDEBUG      -fpic  -O3 -pipe  -g  -c dump.cpp -o dump.o
gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG      -fpic  -O3 -pipe  -g  -c parser.c -o parser.o
g++ -shared -o rjson.so dump.o parser.o -L/usr/lib/R/lib -lR
installing to /home/R/library/rjson/libs
** R
** inst
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** installing vignettes
   ‘json_rpc_server.Rnw’ 
** testing if installed package can be loaded

* DONE (rjson)
* installing *source* package ‘twitteR’ ...
** package ‘twitteR’ successfully unpacked and MD5 sums checked
** R
** inst
** preparing package for lazy loading
Creating a generic function for ‘as.data.frame’ from package ‘base’ in package ‘twitteR’
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded

* DONE (twitteR)

The downloaded source packages are in
	‘/tmp/RtmpvY7yMN/downloaded_packages’

Step 2

Load Package twitteR

library(twitteR)
> library(twitteR)
Loading required package: ROAuth
Loading required package: RCurl
Loading required package: bitops
Loading required package: digest
Loading required package: rjson

Step 3

Log in to https://dev.twitter.com/apps/ using your Twitter login ID and password

In case you have forgotten your twitter.com username or password, click on Forgot Password to reset the password

Step 4

Create a new app for yourself by navigating to My Applications

Step 5

Your Apps are here

https://dev.twitter.com/apps

Click on New Application (button on top right)

Step 6

Fill in the options here and leave the callback URL blank

Name should be unique

Description should be at least 10 characters

Website can be a placeholder as of now (or your blog address)

Agree to Terms and Conditions

Type the Spam Check Number and Letters

Step 7

Note these details from your new APP

Consumer Key

Consumer Secret

At the bottom:

Click on Create your OAuth Token

Finally, your app page should look like this (don't worry, I will be deleting this app, so you can't hack my Twitter yet)

Step 8

Go to R

Type the following code after you have changed the two consumer values (IMPORTANT: you will need to change the Consumer Key and Consumer Secret to the ones specific to YOUR app)

NOTE: WordPress makes some changes when you copy and paste code to your blog

(like adding &8221 to lines 2-4 below; please ignore this)

The final formatted code is at the very end of the post

library(twitteR)
reqURL <- "https://api.twitter.com/oauth/request_token"
accessURL <- "https://api.twitter.com/oauth/access_token"
authURL <- "https://api.twitter.com/oauth/authorize"
consumerKey <- "2uQlGBBMMXdDffcK2IkAsg"
consumerSecret <- "xrGr71kTfdT3ypWFURGxyJOC4Oqf46Rwu4qxyxoEfM"
twitCred <- OAuthFactory$new(consumerKey=consumerKey,
                             consumerSecret=consumerSecret,
                             requestURL=reqURL,
                             accessURL=accessURL,
                             authURL=authURL)

Step 9

Do the Twitter handshake by pasting this command in the R console

twitCred$handshake()

You will see a message like this from R

> twitCred$handshake()

To enable the connection, please direct your web browser to:
http://api.twitter.com/oauth/authorize?oauth_token=pJqojAg2gxmqip3SprJAyOckdcD1nB3MvlbP2dWUDGQ
When complete, record the PIN given to you and provide it here:

Step 10

Go to the link above given by R

You will get this message

Click on blue button

Authorize app

Step 11 Entering the Pin

Now you will see a PIN here.

You can't copy and paste it. Write it down and then type it into your R console

Step 12

Now register the credentials using

registerTwitterOAuth(twitCred)

If done correctly, you will see this:

> registerTwitterOAuth(twitCred)
[1] TRUE

Step 13

Search Twitter using commands like the one here. Note that it returns only 499 tweets

> a=searchTwitter("#rstats", n=2000)

Warning message:
In doRppAPICall("search/tweets", n, params = params, retryOnRateLimit = retryOnRateLimit, :
  2000 tweets were requested but the API can only return 499

Step 14 Now you can start analyzing the data

Convert the data into a data frame

tweets_df = twListToDF(a)

Install packages tm (for text mining) and wordcloud

> install.packages(c("tm", "wordcloud"))

Load the Packages

library(tm)

library(wordcloud)

Basic Word Cloud can be created using code below

b=Corpus(VectorSource(tweets_df$text), readerControl = list(language = "eng"))

b<- tm_map(b, tolower) #Changes case to lower case

b<- tm_map(b, stripWhitespace) #Strips White Space

b <- tm_map(b, removePunctuation) #Removes Punctuation

inspect(b)

tdm <- TermDocumentMatrix(b)

m1 <- as.matrix(tdm)

v1<- sort(rowSums(m1),decreasing=TRUE)

d1<- data.frame(word = names(v1),freq=v1)

wordcloud(d1$word,d1$freq)
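If you want to see what the term document matrix and rowSums steps compute without a Twitter connection, here is a base R sketch on three made-up tweets. It is only an approximation of the tm pipeline: for instance, TermDocumentMatrix drops words shorter than three characters by default, and this sketch does not.

```r
# Made-up tweet texts (illustration only, not real Twitter data)
texts <- c("Learning #rstats is fun",
           "Fun with R and Twitter",
           "rstats and twitter data")

clean <- tolower(texts)                        # lower case
clean <- gsub("[[:punct:]]", "", clean)        # remove punctuation
words <- unlist(strsplit(clean, "[[:space:]]+"))  # split on white space

# Word frequencies, most frequent first
freq <- sort(table(words), decreasing = TRUE)
d <- data.frame(word = names(freq), freq = as.vector(freq))
d
```

The data frame d has the same shape as d1 above, so it could be passed to wordcloud(d$word, d$freq) in the same way.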

For more detailed analysis on what you can do with Twitter and R, read this http://cran.r-project.org/web/packages/twitteR/twitteR.pdf or this https://sites.google.com/site/miningtwitter/

Step 15 Keep your OAuth keys safe, and do your homework without bothering your instructor too much.

If you copy and paste the code from a website, be sure to change the curly quotation marks “” to straight ones manually in your R console
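Changing the quotation marks can also be done in R itself. This is a small sketch; the function name straighten_quotes is hypothetical and only the four common curly quote characters are handled.

```r
# Hypothetical helper: replace curly quotes with straight ones in pasted code
straighten_quotes <- function(code) {
  code <- gsub("[\u201c\u201d]", "\"", code)   # curly double quotes
  code <- gsub("[\u2018\u2019]", "'", code)    # curly single quotes
  code
}

# A pasted line with curly quotes, written with escapes so this file stays ASCII
pasted <- "a=searchTwitter(\u201c#rstats\u201d, n=2000)"
fixed <- straighten_quotes(pasted)
cat(fixed, "\n")
```

The cleaned string can then be inspected and run, for example with eval(parse(text = fixed)), once you are happy with it.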

Also see on text mining https://decisionstats.com/2012/03/19/text-mining-barack-obama/

FINAL CODE

install.packages("twitteR")
library(twitteR)
reqURL <- "https://api.twitter.com/oauth/request_token"
accessURL <- "https://api.twitter.com/oauth/access_token"
authURL <- "https://api.twitter.com/oauth/authorize"
consumerKey <- "rR16FxDLkTYmuVhqH4s4EQ"
consumerSecret <- "xrGr71kTfdT3ypWFURGxyJOC4Oqf46Rwu4qxyxoEfM"
twitCred <- OAuthFactory$new(consumerKey=consumerKey,
                             consumerSecret=consumerSecret,
                             requestURL=reqURL,
                             accessURL=accessURL,
                             authURL=authURL)
twitCred$handshake() #Pause here for the Handshake Pin Code
registerTwitterOAuth(twitCred) #Wait till you see True


a=searchTwitter("#rstats", n=2000) #Get the Tweets

tweets_df = twListToDF(a) #Convert to Data Frame
install.packages(c("tm", "wordcloud"))
library(tm)
library(wordcloud)
b=Corpus(VectorSource(tweets_df$text), readerControl = list(language = "eng"))
b<- tm_map(b, tolower) #Changes case to lower case
b<- tm_map(b, stripWhitespace) #Strips White Space
b <- tm_map(b, removePunctuation) #Removes Punctuation
inspect(b)
tdm <- TermDocumentMatrix(b)
m1 <- as.matrix(tdm)
v1<- sort(rowSums(m1),decreasing=TRUE)
d1<- data.frame(word = names(v1),freq=v1)
wordcloud(d1$word,d1$freq)