Cloud Computing using Python

I liked the new features in PiCloud , which is a cloud computing way to use Python. Python is increasingly popular as a computational language, and the cloud is the way where HW is headed to atleast as of 2011-12

http://www.picloud.com/

The new features allows you to publish your own functions as urls.

 By publishing your Python functions to URLs. Why would you want to publish a function?

  • To call your Python functions from a programming language other than Python.
  • To use PiCloud from Google AppEngine, which does not support our native client library.
  • To easily setup a scalable RPC system.

Here’s a peek at the interface:

You publish a Python function

cloud.rest.publish(your_func, ‘myfunction’)

We give you a URL Back

https://api.picloud.com/r/2/myfunction/

You make an HTTP request using your method of choice to the URL

curl -k -u ‘key:secret_key’ https://api.picloud.com/r/2/myfunction/

It certainly is an interesting development and I am wondering how other languages can adopt this paradigm as well.
For R, as of now http://www.cloudnumbers.com/ seems to be the only player in the cloud.
It would be exciting to see more players in the cloud statistical analytical space.

 

Using Google Fusion Tables from #rstats

But after all that- I was quite happy to see Google Fusion Tables within Google Docs. Databases as a service ? Not quite but still quite good, and lets see how it goes.

https://www.google.com/fusiontables/DataSource?dsrcid=implicit&hl=en_US&pli=1

http://googlesystem.blogspot.com/2011/09/fusion-tables-new-google-docs-app.html

 

But what interests me more is

http://code.google.com/apis/fusiontables/docs/developers_guide.html

The Google Fusion Tables API is a set of statements that you can use to search for and retrieve Google Fusion Tables data, insert new data, update existing data, and delete data. The API statements are sent to the Google Fusion Tables server using HTTP GET requests (for queries) and POST requests (for inserts, updates, and deletes) from a Web client application. The API is language agnostic: you can write your program in any language you prefer, as long as it provides some way to embed the API calls in HTTP requests.

The Google Fusion Tables API does not provide the mechanism for submitting the GET and POST requests. Typically, you will use an existing code library that provides such functionality; for example, the code libraries that have been developed for the Google GData API. You can also write your own code to implement GET and POST requests.

Also see http://code.google.com/apis/fusiontables/docs/sample_code.html

 

Google Fusion Tables API Sample Code

Libraries

SQL API

Language Library Public repository Samples
Python Fusion Tables Python Client Library fusion-tables-client-python/ Samples
PHP Fusion Tables PHP Client Library fusion-tables-client-php/ Samples

Featured Samples

An easy way to learn how to use an API can be to look at sample code. The table above provides links to some basic samples for each of the languages shown. This section highlights particularly interesting samples for the Fusion Tables API.

SQL API

Language Featured samples API version
cURL
  • Hello, cURLA simple example showing how to use curl to access Fusion Tables.
SQL API
Google Apps Script SQL API
Java
  • Hello, WorldA simple walkthrough that shows how the Google Fusion Tables API statements work.
  • OAuth example on fusion-tables-apiThe Google Fusion Tables team shows how OAuth authorization enables you to use the Google Fusion Tables API from a foreign web server with delegated authorization.
SQL API
Python
  • Docs List ExampleDemonstrates how to:
    • List tables
    • Set permissions on tables
    • Move a table to a folder
Docs List API
Android (Java)
  • Basic Sample ApplicationDemo application shows how to create a crowd-sourcing application that allows users to report potholes and save the data to a Fusion Table.
SQL API
JavaScript – FusionTablesLayer Using the FusionTablesLayer, you can display data on a Google Map

Also check out FusionTablesLayer Builder, which generates all the code necessary to include a Google Map with a Fusion Table Layer on your own website.

FusionTablesLayer, Google Maps API
JavaScript – Google Chart Tools Using the Google Chart Tools, you can request data from Fusion Tables to use in visualizations or to display directly in an HTML page. Note: responses are limited to 500 rows of data.

Google Chart Tools

External Resources

Google Fusion Tables is dedicated to providing code examples that illustrate typical uses, best practices, and really cool tricks. If you do something with the Google Fusion Tables API that you think would be interesting to others, please contact us at googletables-feedback@google.com about adding your code to our Examples page.

  • Shape EscapeA tool for uploading shape files to Fusion Tables.
  • GDALOGR Simple Feature Library has incorporated Fusion Tables as a supported format.
  • Arc2CloudArc2Earth has included support for upload to Fusion Tables via Arc2Cloud.
  • Java and Google App EngineODK Aggregate is an AppEngine application by the Open Data Kit team, uses Google Fusion Tables to store survey data that is collected through input forms on Android mobile phones. Notable code:
  • R packageAndrei Lopatenko has written an R interface to Fusion Tables so Fusion Tables can be used as the data store for R.
  • RubySimon Tokumine has written a Ruby gem for access to Fusion Tables from Ruby.

 

Updated-You can use Google Fusion Tables from within R from http://andrei.lopatenko.com/rstat/fusion-tables.R

 

ft.connect <- function(username, password) {
  url = "https://www.google.com/accounts/ClientLogin";
  params = list(Email = username, Passwd = password, accountType="GOOGLE", service= "fusiontables", source = "R_client_API")
 connection = postForm(uri = url, .params = params)
 if (length(grep("error", connection, ignore.case = TRUE))) {
 	stop("The wrong username or password")
 	return ("")
 }
 authn = strsplit(connection, "\nAuth=")[[c(1,2)]]
 auth = strsplit(authn, "\n")[[c(1,1)]]
 return (auth)
}

ft.disconnect <- function(connection) {
}

ft.executestatement <- function(auth, statement) {
      url = "http://tables.googlelabs.com/api/query"
      params = list( sql = statement)
      connection.string = paste("GoogleLogin auth=", auth, sep="")
      opts = list( httpheader = c("Authorization" = connection.string))
      result = postForm(uri = url, .params = params, .opts = opts)
      if (length(grep("<HTML>\n<HEAD>\n<TITLE>Parse error", result, ignore.case = TRUE))) {
      	stop(paste("incorrect sql statement:", statement))
      }
      return (result)
}

ft.showtables <- function(auth) {
   url = "http://tables.googlelabs.com/api/query"
   params = list( sql = "SHOW TABLES")
   connection.string = paste("GoogleLogin auth=", auth, sep="")
   opts = list( httpheader = c("Authorization" = connection.string))
   result = getForm(uri = url, .params = params, .opts = opts)
   tables = strsplit(result, "\n")
   tableid = c()
   tablename = c()
   for (i in 2:length(tables[[1]])) {
     	str = tables[[c(1,i)]]
   	    tnames = strsplit(str,",")
   	    tableid[i-1] = tnames[[c(1,1)]]
   	    tablename[i-1] = tnames[[c(1,2)]]
   	}
   	tables = data.frame( ids = tableid, names = tablename)
    return (tables)
}

ft.describetablebyid <- function(auth, tid) {
   url = "http://tables.googlelabs.com/api/query"
   params = list( sql = paste("DESCRIBE", tid))
   connection.string = paste("GoogleLogin auth=", auth, sep="")
   opts = list( httpheader = c("Authorization" = connection.string))
   result = getForm(uri = url, .params = params, .opts = opts)
   columns = strsplit(result,"\n")
   colid = c()
   colname = c()
   coltype = c()
   for (i in 2:length(columns[[1]])) {
     	str = columns[[c(1,i)]]
   	    cnames = strsplit(str,",")
   	    colid[i-1] = cnames[[c(1,1)]]
   	    colname[i-1] = cnames[[c(1,2)]]
   	    coltype[i-1] = cnames[[c(1,3)]]
   	}
   	cols = data.frame(ids = colid, names = colname, types = coltype)
    return (cols)
}

ft.describetable <- function (auth, table_name) {
   table_id = ft.idfromtablename(auth, table_name)
   result = ft.describetablebyid(auth, table_id)
   return (result)
}

ft.idfromtablename <- function(auth, table_name) {
    tables = ft.showtables(auth)
	tableid = tables$ids[tables$names == table_name]
	return (tableid)
}

ft.importdata <- function(auth, table_name) {
	tableid = ft.idfromtablename(auth, table_name)
	columns = ft.describetablebyid(auth, tableid)
	column_spec = ""
	for (i in 1:length(columns)) {
		column_spec = paste(column_spec, columns[i, 2])
		if (i < length(columns)) {
			column_spec = paste(column_spec, ",", sep="")
		}
	}
	mdata = matrix(columns$names,
	              nrow = 1, ncol = length(columns),
	              dimnames(list(c("dummy"), columns$names)), byrow=TRUE)
	select = paste("SELECT", column_spec)
	select = paste(select, "FROM")
	select = paste(select, tableid)
	result = ft.executestatement(auth, select)
    numcols = length(columns)
    rows = strsplit(result, "\n")
    for (i in 3:length(rows[[1]])) {
    	row = strsplit(rows[[c(1,i)]], ",")
    	mdata = rbind(mdata, row[[1]])
   	}
   	output.frame = data.frame(mdata[2:length(mdata[,1]), 1])
   	for (i in 2:ncol(mdata)) {
   		output.frame = cbind(output.frame, mdata[2:length(mdata[,i]),i])
   	}
   	colnames(output.frame) = columns$names
    return (output.frame)
}

quote_value <- function(value, to_quote = FALSE, quote = "'") {
	 ret_value = ""
     if (to_quote) {
     	ret_value = paste(quote, paste(value, quote, sep=""), sep="")
     } else {
     	ret_value = value
     }
     return (ret_value)
}

converttostring <- function(arr, separator = ", ", column_types) {
	con_string = ""
	for (i in 1:(length(arr) - 1)) {
		value = quote_value(arr[i], column_types[i] != "number")
		con_string = paste(con_string, value)
	    con_string = paste(con_string, separator, sep="")
	}

    if (length(arr) >= 1) {
    	value = quote_value(arr[length(arr)], column_types[length(arr)] != "NUMBER")
    	con_string = paste(con_string, value)
    }
}

ft.exportdata <- function(auth, input_frame, table_name, create_table) {
	if (create_table) {
       create.table = "CREATE TABLE "
       create.table = paste(create.table, table_name)
       create.table = paste(create.table, "(")
       cnames = colnames(input_frame)
       for (columnname in cnames) {
         create.table = paste(create.table, columnname)
    	 create.table = paste(create.table, ":string", sep="")
    	   if (columnname != cnames[length(cnames)]){
    		  create.table = paste(create.table, ",", sep="")
           }
       }
      create.table = paste(create.table, ")")
      result = ft.executestatement(auth, create.table)
    }
    if (length(input_frame[,1]) > 0) {
    	tableid = ft.idfromtablename(auth, table_name)
	    columns = ft.describetablebyid(auth, tableid)
	    column_spec = ""
	    for (i in 1:length(columns$names)) {
		   column_spec = paste(column_spec, columns[i, 2])
		   if (i < length(columns$names)) {
			  column_spec = paste(column_spec, ",", sep="")
		   }
	    }
    	insert_prefix = "INSERT INTO "
    	insert_prefix = paste(insert_prefix, tableid)
    	insert_prefix = paste(insert_prefix, "(")
    	insert_prefix = paste(insert_prefix, column_spec)
    	insert_prefix = paste(insert_prefix, ") values (")
    	insert_suffix = ");"
    	insert_sql_big = ""
    	for (i in 1:length(input_frame[,1])) {
    		data = unlist(input_frame[i,])
    		values = converttostring(data, column_types  = columns$types)
    		insert_sql = paste(insert_prefix, values)
    		insert_sql = paste(insert_sql, insert_suffix) ;
    		insert_sql_big = paste(insert_sql_big, insert_sql)
    		if (i %% 500 == 0) {
    			ft.executestatement(auth, insert_sql_big)
    			insert_sql_big = ""
    		}
    	}
        ft.executestatement(auth, insert_sql_big)
    }
}

#rstats -Basic Data Manipulation using R

Continuing my series of basic data manipulation using R. For people knowing analytics and
new to R.
1 Keeping only some variables

Using subset we can keep only the variables we want-

Sitka89 <- subset(Sitka89, select=c(size,Time,treat))

Will keep only the variables we have selected (size,Time,treat).

2 Dropping some variables

Harman23.cor$cov.arm.span <- NULL
This deletes the variable named cov.arm.span in the dataset Harman23.cor

3 Keeping records based on character condition

Titanic.sub1<-subset(Titanic,Sex=="Male")

Note the double equal-to sign
4 Keeping records based on date/time condition

subset(DF, as.Date(Date) >= '2009-09-02' & as.Date(Date) <= '2009-09-04')

5 Converting Date Time Formats into other formats

if the variable dob is “01/04/1977) then following will convert into a date object

z=strptime(dob,”%d/%m/%Y”)

and if the same date is 01Apr1977

z=strptime(dob,"%d%b%Y")

6 Difference in Date Time Values and Using Current Time

The difftime function helps in creating differences in two date time variables.

difftime(time1, time2, units='secs')

or

difftime(time1, time2, tz = "", units = c("auto", "secs", "mins", "hours", "days", "weeks"))

For current system date time values you can use

Sys.time()

Sys.Date()

This value can be put in the difftime function shown above to calculate age or time elapsed.

7 Keeping records based on numerical condition

Titanic.sub1<-subset(Titanic,Freq >37)

For enhanced usage-
you can also use the R Commander GUI with the sub menu Data > Active Dataset

8 Sorting Data

Sorting A Data Frame in Ascending Order by a variable

AggregatedData<- sort(AggregatedData, by=~ Package)

Sorting a Data Frame in Descending Order by a variable

AggregatedData<- sort(AggregatedData, by=~ -Installed)

9 Transforming a Dataset Structure around a single variable

Using the Reshape2 Package we can use melt and acast functions

library("reshape2")

tDat.m<- melt(tDat)

tDatCast<- acast(tDat.m,Subject~Item)

If we choose not to use Reshape package, we can use the default reshape method in R. Please do note this takes longer processing time for bigger datasets.

df.wide <- reshape(df, idvar="Subject", timevar="Item", direction="wide")

10 Type in Data

Using scan() function we can type in data in a list

11 Using Diff for lags and Cum Sum function forCumulative Sums

We can use the diff function to calculate difference between two successive values of a variable.

Diff(Dataset$X)

Cumsum function helps to give cumulative sum

Cumsum(Dataset$X)

> x=rnorm(10,20) #This gives 10 Randomly distributed numbers with Mean 20

> x

[1] 20.76078 19.21374 18.28483 20.18920 21.65696 19.54178 18.90592 20.67585

[9] 20.02222 18.99311

> diff(x)

[1] -1.5470415 -0.9289122 1.9043664 1.4677589 -2.1151783 -0.6358585 1.7699296

[8] -0.6536232 -1.0291181 >

cumsum(x)

[1] 20.76078 39.97453 58.25936 78.44855 100.10551 119.64728 138.55320

[8] 159.22905 179.25128 198.24438

> diff(x,2) # The diff function can be used as diff(x, lag = 1, differences = 1, ...) where differences is the order of differencing

[1] -2.4759536 0.9754542 3.3721252 -0.6474195 -2.7510368 1.1340711 1.1163064

[8] -1.6827413

Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) The New S Language. Wadsworth & Brooks/Cole.

12 Merging Data

Deducer GUI makes it much simpler to merge datasets. The simplest syntax for a merge statement is

totalDataframeZ <- merge(dataframeX,dataframeY,by=c("AccountId","Region"))

13 Aggregating and group processing of a variable

We can use multiple methods for aggregating and by group processing of variables.
Two functions we explore here are aggregate and Tapply.

Refering to the R Online Manual at
[http://stat.ethz.ch/R-manual/R-patched/library/stats/html/aggregate.html]

## Compute the averages for the variables in 'state.x77', grouped

## according to the region (Northeast, South, North Central, West) that

## each state belongs to

aggregate(state.x77, list(Region = state.region), mean)

Using TApply

## tapply(Summary Variable, Group Variable, Function)

Reference

[http://www.ats.ucla.edu/stat/r/library/advanced_function_r.htm#tapply]

We can also use specialized packages for data manipulation.

For additional By-group processing you can see the doBy package as well as Plyr package
 for data manipulation.Doby contains a variety of utilities including:
 1) Facilities for groupwise computations of summary statistics and other facilities for working with grouped data.
 2) General linear contrasts and LSMEANS (least-squares-means also known as population means),
 3) HTMLreport for autmatic generation of HTML file from R-script with a minimum of markup, 4) various other utilities and is available at[ http://cran.r-project.org/web/packages/doBy/index.html]
Also Available at [http://cran.r-project.org/web/packages/plyr/index.html],
Plyr is a set of tools that solves a common set of problems:
you need to break a big problem down into manageable pieces,
operate on each pieces and then put all the pieces back together.
 For example, you might want to fit a model to each spatial location or
 time point in your study, summarise data by panels or collapse high-dimensional arrays
 to simpler summary statistics.

Ten steps to analysis using R

I am just listing down a set of basic R functions that allow you to start the task of business analytics, or analyzing a dataset(data.frame). I am doing this both as a reference for myself as well as anyone who wants to learn R- quickly.

I am not putting in data import functions, because data manipulation is a seperate baby altogether. Instead I assume you have a dataset ready for analysis and what are the top R commands you would need to analyze it.

 

For anyone who thought R was too hard to learn- here is ten functions to learning R

1) str(dataset) helps you with the structure of dataset

2) names(dataset) gives you the names of variables

3)mean(dataset) returns the mean of numeric variables

4)sd(dataset) returns the standard deviation of numeric variables

5)summary(variables) gives the summary quartile distributions and median of variables

That about gives me the basic stats I need for a dataset.

> data(faithful)
> names(faithful)
[1] "eruptions" "waiting"
> str(faithful)
'data.frame':   272 obs. of  2 variables:
 $ eruptions: num  3.6 1.8 3.33 2.28 4.53 ...
 $ waiting  : num  79 54 74 62 85 55 88 85 51 85 ...
> summary(faithful)
   eruptions        waiting
 Min.   :1.600   Min.   :43.0
 1st Qu.:2.163   1st Qu.:58.0
 Median :4.000   Median :76.0
 Mean   :3.488   Mean   :70.9
 3rd Qu.:4.454   3rd Qu.:82.0
 Max.   :5.100   Max.   :96.0

> mean(faithful)
eruptions   waiting
 3.487783 70.897059
> sd(faithful)
eruptions   waiting
 1.141371 13.594974

6) I can do a basic frequency analysis of a particular variable using the table command and $ operator (similar to dataset.variable name in other statistical languages)

> table(faithful$waiting)

43 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 62 63 64 65 66 67 68 69 70
 1  3  5  4  3  5  5  6  5  7  9  6  4  3  4  7  6  4  3  4  3  2  1  1  2  4
71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 96
 5  1  7  6  8  9 12 15 10  8 13 12 14 10  6  6  2  6  3  6  1  1  2  1  1
or I can do frequency analysis of the whole dataset using
> table(faithful)
         waiting
eruptions 43 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 62 63 64 65 66 67
    1.6    0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0
    1.667  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0
    1.7    0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0
    1.733  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0
.....output truncated
7) plot(dataset)
It helps plot the dataset

8) hist(dataset$variable) is better at looking at histograms

hist(faithful$waiting)

9) boxplot(dataset)

10) The tenth function for a beginner would be cor(dataset$var1,dataset$var2)

> cor(faithful)
          eruptions   waiting
eruptions 1.0000000 0.9008112
waiting   0.9008112 1.0000000

 

I am assuming that as a beginner you would use the list of GUI at http://rforanalytics.wordpress.com/graphical-user-interfaces-for-r/  to import and export Data. I would deal with ten steps to data manipulation in R another post.

 

Careers in #Rstats

I saw a posting for career with Revolution Analytics. Now I am probably on the wrong side of a H1 visa and the C,R skill-o-meter, but these look great for any aspiring R coder. Includes one free lance opp as well.

http://www.revolutionanalytics.com/aboutus/careers.php

We have many opportunities opening up—among them:

Job Title Location
Pre-sales Consultants / Technical Sales Palo Alto, CA
Parallel Computing Developer Palo Alto, CA or Seattle, WA
R Programmer (Freelance) Palo Alto, CA
Software Training Course Developer (Freelance) Palo Alto, CA
Build / Release Engineer Seattle, WA
QA Engineer Seattle, WA
Technical Writer Seattle, WA

 

Please send your resume to careers@revolutionanalytics.com

2) Indeed.com

Searching for “R” jobs and not just , R jobs, gives better results in search engines and job sites. It is still a tough keyword to search but it is getting better.

You can use this RSS feed http://www.indeed.co.in/rss?q=%22R%22++analytics+jobs or send by email option to get alerts

3) http://icrunchdata.com/

 

I Crunch Data has a good number of Analytics Jobs, and again using the keyword as R within quotes of “R” you can see lots of jobs here

http://www.icrunchdata.com/ViewJob.aspx?id=334914&keys=%22R%22

There used to be a Google Group on R jobs, but is too low volume compared to the actual number of R jobs out there.

Note the big demand is for analytics, and knowing more than one platform helps you in the job search than knowing just a single language.

 

 

 

Using #Rstats for online data access

There are multiple packages in R to read data straight from online datasets.
These are as follows- Continue reading “Using #Rstats for online data access”

Revolution #Rstats Webinar

David Smith of Revo presents a nice webinar on the capabilities and abilities of Revolution R- if you are R curious and wonder how the commercial version has matured- you may want to take a look.

click below to view an executive Webinar

——————————————————————————————-

Revolution R Enterprise—presented by author and blogger David Smith:

Revolution R: 100% R and More
On-Demand Webinar

This Webinar covers how R users can upgrade to:

  • Multi-processor speed improvements and parallel processing
  • Productivity and debugging with an integrated development environment (IDE) for the R language
  • “Big Data” analysis, with out-of-memory storage of multi-gigabyte data sets
  • Web Services for R, to integrate R computations and graphics into 3rd-Party applications like Excel and BI Dashboards
  • Expert technical support and consulting services for R

This webinar will be of value to current R users who want to learn more about the additional capabilities of Revolution R Enterprise to enhance the productivity, ease of use, and enterprise readiness of open source R. R users in academia will also find this webinar valuable: we will explain how all members of the academic community can obtain Revolution R Enterprise free of charge.

—————————————————————————————

contact -1-855-GET-REVO or via online form.
info@revolutionanalytics.com | (650) 330-0553 | Twitter @RevolutionR