Here Comes the Cloud: VMware


From http://vcloudexpress.terremark.com

From the fascinating umbrella group at http://twitter.com/cloudcomp_group

If you are a desktop software expert who thinks the Cloud is nothing to worry about, remember what happened to punch tapes, mainframes, and vacuum tubes.

With the current pricing, even smaller software package creators can offer Statistics as a Service (STaS).

Great move by VMware!

MS Smacks Google Docs with SlideShare

Our favorite dropouts from the PhD program just learned that they should not moon the giant. The company founded in the Paul Allen building at Stanford, also known as Gogol/Google, announced with much fanfare that they would create a Cloud OS, only to find their own cloud productivity offering, Google Docs, bested by SlideShare.

Now you can import your Gmail attachments and Google Docs into SlideShare, for much better professional sharing within your office.

Here is an embedded SlideShare presentation called Google Hacks; note the much better visual appeal vis-a-vis your Google Docs.

Well, as for the Stanford dropouts, this is what happens when you don't complete your PhD education.

Citation: http://www.slideshare.net/rickdog/google-hacks

As far as cloud computing and office productivity go:

Harvard Dropouts (Microsoft) 1 - Stanford Dropouts (Google) 0

Unless Google creates a cloud version of OpenOffice, but who needs that anyway?

Who needs search? Just Ctrl+F.

[Embedded SlideShare presentation: Google Hacks, from rickdog]
Disclaimer: The author uses Google Docs extensively. If you are from Google, please do not block his Gmail ID, guys.
Academic Disclaimer: The author intends to complete his PhD. These are his personal views only.

Hive Tutorial: Cloud Computing

Here is a nice video from Cloudera on a Hive tutorial. I wonder what would happen if they put a real analytical system, and not just basic analytics and reporting … like R or SPSS or JMP or SAS, on a big database system like Hadoop (including some text-mined data from legacy company documents).

Unlike Oracle or other database systems, Hadoop is free now and for the reasonably foreseeable future (like MySQL used to be before it was acquired by big fish Sun, which was itself acquired by the bigger Oracle).

Citation:

http://wiki.apache.org/hadoop/Hive

Hive is a data warehouse infrastructure built on top of Hadoop that provides tools to enable easy data summarization, ad hoc querying, and analysis of large datasets stored in Hadoop files.

Hive is based on Hadoop, which is a batch processing system. Accordingly, this system does not and cannot promise low latencies on queries. The paradigm here is strictly one of submitting jobs and being notified when the jobs are completed, as opposed to real-time queries. As a result it should not be compared with systems like Oracle, where analysis is done on a significantly smaller amount of data but proceeds much more iteratively, with the response times between iterations being less than a few minutes. For Hive queries, response times for even the smallest jobs can be of the order of 5-10 minutes, and for larger jobs this may even run into hours.

If your input data is small you can execute a query in a short time. For example, if a table has 100 rows you can ‘set mapred.reduce.tasks=1’ and ‘set mapred.map.tasks=1’ and the query time will be ~15 seconds.
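
Tying the two snippets above together, here is a minimal sketch (mine, not Cloudera's or the Hive wiki's) of submitting a small Hive query from R by shelling out to the Hive command line. The table name weblogs and the result handling are hypothetical; it assumes the hive binary is on the PATH of the machine running R.

# minimal sketch: run a small Hive query from R via the hive CLI
# assumptions: hive is on the PATH, and a table named 'weblogs' (hypothetical) exists
query <- paste(
  "set mapred.map.tasks=1;",
  "set mapred.reduce.tasks=1;",
  "SELECT COUNT(1) FROM weblogs;"
)
# 'hive -e' executes the quoted query string and prints the result to stdout,
# which system(..., intern = TRUE) captures as a character vector in R
result <- system(paste("hive -e", shQuote(query)), intern = TRUE)
print(result)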


Zementis News

From a Zementis newsletter: interesting advances on the R-on-the-cloud front. Thanks to Rom Ramos for sending this, and I hope Zementis and someone like Google/Biocep team up so all I need to make a model is some data and a browser. 🙂

The R Journal – A Refereed Journal for the R Project Launches

As a sign of the open source R project for statistical computing gaining momentum, the R newsletter has been transformed into The R Journal, a refereed journal for articles covering topics that are of interest to users or developers of R.  As a supporter of the R PMML Package (see blog and video tutorial), we are honored that our article “PMML: An Open Standard for Sharing Models” which emphasizes the importance of the Predictive Model Markup Language (PMML) standard is part of the inaugural issue.  If you already develop your models in R, export them via PMML, then deploy and scale your models in ADAPA on the Amazon EC2 cloud. Read the full story.
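
As a quick illustration of the R-to-PMML route described above (my sketch, not part of the newsletter), assuming the pmml and XML packages are installed:

library(pmml)   # converts fitted R models to PMML
library(XML)    # for writing the XML document to disk

# fit a simple linear model on a built-in dataset
fit <- lm(mpg ~ wt + hp, data = mtcars)

# convert the model to PMML and save it; a PMML scoring engine such as
# ADAPA would then consume a file like this
saveXML(pmml(fit), file = "mpg_model.pmml")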

Integrating Predictive Analytics via Web Services

Predictive analytics will deliver more value and become more pervasive across the enterprise, once we manage to seamlessly integrate predictive models into any business process.  In order to execute predictive models on-demand, in real-time or in batch mode, the integration via web services presents a simple and effective way to leverage scoring results within different applications.  For most scenarios, the best way to incorporate predictive models into the business process is as a decision service.  Query the model(s) daily, hourly, or in real-time, but if at all possible try to design a loosely coupled system following a Service Oriented Architecture (SOA).

Using web services, for example, one can quickly improve existing systems and processes by adding predictive decision models.  Following the idea of a loosely coupled architecture, it is even possible to use integration tools like Jitterbit or Microsoft SQL Server Integration Services (SSIS) to embed predictive models that are deployed in ADAPA on the Amazon Elastic Compute Cloud without the need to write any code.  Of course, there is also the option to use custom Java code or MS SQL Server SSIS scripting, for which we provide a sample client application.  Read the full story.
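
For flavor, here is a hedged sketch of what calling a scoring web service from R might look like using RCurl. The endpoint URL, parameter names and response handling are invented for illustration; ADAPA's actual web service interface is documented by Zementis and will differ.

library(RCurl)

# hypothetical scoring endpoint and parameters, for illustration only
score_url <- "https://scoring.example.com/adapa/score"
response <- postForm(score_url,
                     model = "mpg_model",   # hypothetical model name
                     wt    = "2.62",
                     hp    = "110")
cat(response)                               # print the (hypothetical) scoring result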

About ADAPA®:

A fast, real-time deployment environment for predictive analytics models: a stand-alone scoring engine that reads XML-based PMML descriptions of models and scores streams of data. Developed by Zementis, it is a fully hosted Software-as-a-Service (SaaS) solution on the Amazon Elastic Compute Cloud.  It's easy to use and remarkably inexpensive, starting at only $0.99 per instance hour.

R and Hadoop

Here is an exciting project for using R in a cloud computing environment (two of my favorite things). It is called RHIPE.

R and Hadoop Integrated Processing Environment v.0.38


The website source is http://ml.stat.purdue.edu/rhipe/

RHIPE (phonetic spelling: hree-pay) is a Java package that integrates the R environment with Hadoop, the open source implementation of Google's MapReduce. Using RHIPE it is possible to code map-reduce algorithms in R, e.g.:

# word-count example: the map function tokenizes each input line and emits
# (word, count) pairs; the reduce function sums the counts for each word
m <- function(key,val){
  words <- strsplit(val," +")[[1]]   # split the line on whitespace
  wc <- table(words)                 # count each distinct word
  cln <- names(wc)
  names(wc)<-NULL; names(cln)<-NULL;
  # emit one (key = word, value = count) pair per distinct word
  return(sapply(1:length(wc),function(r) list(key=cln[r],value=wc[[r]]),simplify=F))
}
r <- function(key,value){
  value <- do.call("rbind",value)    # stack all counts emitted for this word
  return(list(list(key=key,value=sum(value))))
}
# submit the job; "X" and "Y" are input and output folders on the DFS
rhmr(map=m,reduce=r,combiner=T,input.folder="X",output.folder="Y")
rhapply packages the user's request into an R vector object. This is serialized and sent to the RHIPE server. The RHIPE server picks apart the object, creating a job request that Hadoop can understand. Each element of the provided list is processed by the user's function during the Map stage of MapReduce. The results are returned and, if the output is to a file, these results are serialized and written to a Hadoop sequence file; the values can be read back into R using the rhsq* functions.

2 rhlapply

rhlapply <- function( list.object,
                    func,
                    configure=expression(),
                    output.folder='',
                    shared.files=c(),
                    hadoop.mapreduce=list(),
                    verbose=T,
                    takeAll=T)
list.object
This can either be a list or a single scalar. In case of the former, the function given by func will be applied to each element of list.object. In case of a scalar, the function will be applied to the list 1:n where n is the value of the scalar
func
A function that takes one parameter: an element of the list.
configure
A configuration expression to run before func is executed. Executed once for every JVM. If you need variables or data frames, save them with rhsave or rhsave.image, use rhput to copy the file to the DFS, and then use shared.files:

config = expression({
              library(lattice)
              load("mydataset.Rdata")
})
output.folder
Any file that is created by the function is stored in output.folder. This folder is deleted first. If not given, the files created will not be copied. For side-effect files to be copied, create them in tmp, e.g. pdf("tmp/x.pdf") (note: no leading slash). The directory will contain a slew of part* files, as many as there are maps. These contain the binary key-value pairs.
shared.files
The function or the preload expression might require the presence of resource files, e.g. *.Rdata files. The user could copy them from the HDFS in the R code, or just load them from the local work directory were the files present. This is the role of shared.files. It is a vector of paths to files on the HDFS; each of these will be copied to the work directory where the R code is run, e.g. c('/tmp/x.Rdata','/foo.tgz'), so that the first file can be loaded via load("x.Rdata"). For those familiar with Hadoop terminology, this is implemented via DistributedCache.
hadoop.mapreduce
A list of Hadoop-specific options, e.g.:

list(mapreduce.map.tasks=10,mapreduce.reduce.tasks=3)
takeAll
If takeAll is TRUE, the value returned is a list, each entry of which is the return value of the function. The entries are not in order, so element 1 of the returned list is not necessarily the result of applying func to the first element of list.object.
verbose
If TRUE, the user will see the job progress in the R console. If FALSE, the web URL to the JobTracker will be displayed. Cancelling the command with CTRL-C will not cancel the job; use rhkill for that.
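
To tie the parameters above together, here is a toy call (my sketch; it assumes a configured Hadoop cluster, that RHIPE is installed, and that the package loads as library(Rhipe)). It squares the numbers 1 to 10 on the cluster and collects the results:

library(Rhipe)   # package name assumed; see the RHIPE site for install details

squares <- rhlapply(10,                # scalar, so func is applied over the list 1:10
                    function(x) x * x, # the function applied to each element
                    takeAll = TRUE)    # collect the (unordered) return values
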
MapReduce in R

rhmr <- function(map,reduce,input.folder,configure=list(map=expression(),reduce=expression()),
                 close=list(map=expression(),reduce=expression()),
                 output.folder='',combiner=F,step=F,
                 shared.files=c(),inputformat="TextInputFormat",
                 outputformat="TextOutputFormat",
                 hadoop.mapreduce=list(),verbose=T,libjars=c())

Execute map-reduce algorithms from within R. A discussion of the parameters follows.

input.folder
A folder on the DFS containing the files to process. Can be a vector.
output.folder
A folder on the DFS where output will go to.
inputformat
Either TextInputFormat or SequenceFileInputFormat. Use the former for text files and the latter for sequence files created from within R or as outputs from RHIPE (e.g. rhlapply or rhmr). Note, one can't use just any sequence file; it must have been created via a RHIPE function. Custom input formats are also possible. Download the source and look at code/java/RXTextInputFormat.java.
outputformat
Either TextOutputFormat or SequenceFileOutputFormat. In the case of the former, the return value from the mapper or reducer is converted to character and written to disk. The following code is used to convert to character:

paste(key,sep='',collapse=field_separator)

Custom output formats are also possible. Download the source and look at code/java/RXTextOutputFormat.java

If custom formats implement their own writables, these must subclass RXWritable or use one of the writables present in RHIPE.

shared.files
Same as in rhlapply; see that for documentation.
verbose
If TRUE, the job progress is displayed. If FALSE, the job URL is displayed.

At any time in the configure, close, map or reduce function/expression, the variable mapred.task.is.map will be equal to "true" if it is a map task and "false" otherwise (both strings). Also, mapred.iswhat is mapper, reducer, or combiner in their respective environments.

configure
A list with either one element (an expression) or two elements, map and reduce, both of which must be expressions. These expressions are called in their respective environments, i.e. the map expression is called during the map configure and similarly for the reduce expression. The reduce expression is also called for the combiner configure method. If there is only one list element, the expression is used for both the map and the reduce.
close
Same as configure .
map
A function that takes two values, key and value. It should return a list of lists; each list entry must contain two elements, key and value, e.g.:

...
ret <- list()
ret[[1]] <-  list(key=c(1,2), value=c('x','b'))
return(ret)

If either of key/value is missing, the output is not collected; e.g. return NULL to skip this record. If the input format is a TextInputFormat, the value is the entire line and the key is probably useless to the user (it is a number indicating bytes into the file). If the input format is SequenceFileInputFormat, the key and value are taken from the sequence file.

reduce
Not needed if mapred.reduce.tasks is 0. Takes a key and a list of values (all values emitted from the maps that share the same map output key). If step is TRUE, the second argument is a single value rather than a list. Must return a list of lists, each element of which must have two elements, key and value. This collects all the values and sends them to the function. If NULL is returned, or the return value does not conform to the above, nothing is collected by the Hadoop collector.
step
If step is TRUE, then the reduce function is called for every value corresponding to a key, that is, once for every value.

  • The variable red.status is equal to 1 on the first call.
  • red.status is equal to 0 for every subsequent call, including the last value.
  • The reducer function is called one last time with red.status equal to -1; the value is NULL. Anything returned at any of these stages is written to disk. The close function is called once every value for a given key has been processed, but returning anything from it has no effect. To assign to the global environment, use the <<- operator. A step-wise reducer is sketched below.
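
A sketch of such a step-wise reducer (mine, written against the red.status protocol described above): it keeps a running sum per key and emits the total only on the final call.

r.step <- function(key, value){
  if (red.status == 1) running.sum <<- 0   # first value for this key: reset the accumulator
  if (red.status == -1)                    # final call: value is NULL, emit the total
    return(list(list(key = key, value = running.sum)))
  running.sum <<- running.sum + as.numeric(value)
  NULL                                     # intermediate calls collect nothing
}
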
combiner
TRUE or FALSE, to use the reducer as a combiner. Using a combiner makes computation more efficient. If combiner is TRUE, the reduce function will be called as a combiner (0 or more times; it may never be called during the combine stage even if combiner is TRUE). The value of mapred.task.is.map is 'true' or 'false' (both strings) if the combiner is being executed as part of the map stage or reduce stage, respectively.

Whether knowledge of this is useful or not is something I'm not sure of. However, if combiner is TRUE, keep in mind that your reduce function must be able to handle inputs sent from the map as well as inputs sent from the reduce function (itself).

libjars
If specifying a custom input/output format, the user might need to specify jar files here.
hadoop.mapreduce
Set RHIPE and Hadoop options via this; a small example follows.
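
For example (a sketch under the same cluster assumptions as above, reusing the m and r functions from the word-count example), Hadoop options and the RHIPE-specific options listed in the table below can be passed together through hadoop.mapreduce:

rhmr(map = m, reduce = r,
     input.folder = "/data/logs", output.folder = "/data/wordcount",  # hypothetical DFS paths
     inputformat = "TextInputFormat", outputformat = "TextOutputFormat",
     combiner = T,
     hadoop.mapreduce = list(mapred.reduce.tasks = 4,                 # a Hadoop option
                             rhipejob.textoutput.fieldsep = ","))     # a RHIPE option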

1.1 RHIPE Options for mapreduce

Option (default): Explanation

rhipejob.rport (8888): The port on which Rserve runs; should be the same across all machines.
rhipejob.charsxp.short (0): If 1, RHIPE optimizes serialization for character vectors. This reduces the length of the serialization.
rhipejob.getmapbatches (100): If the mapper/reducer emits several key-value pairs, how many to get from Rserve at a time. A higher number reduces the number of network reads (the network reads are to localhost).
rhipejob.outfmt.is.text (1 if TextInputFormat): Must be 1 if the output is textual.
rhipejob.textoutput.fieldsep (' '): The field separator for any text-based output format.
rhipejob.textinput.comment ('#'): In TextInputFormat, lines beginning with this are skipped.
rhipejob.combinerspill (100,000): The combiner is run after collecting at most this many items.
rhipejob.tor.batch (200,000): Number of values for the same key to collate before sending to the reducer. If you have dollops of memory, set this larger; however, too large and you hit Java's heap space limit.
rhipejob.max.count.reduce (Java's INT_MAX, about 2 billion): The total number of values for a given key to be collected; note the values are not ordered by any variable.
rhipejob.inputformat.keyclass (default chosen depending on TextInputFormat or SequenceFileInputFormat): Provide the full Java class name for the key class, e.g. org.saptarshiguha.rhipe.hadoop.RXWritableText. When using a custom InputFormat, implement RXWritable and its methods.
rhipejob.inputformat.valueclass (default chosen depending on TextInputFormat or SequenceFileInputFormat): Provide the full Java class name for the value class, e.g. org.saptarshiguha.rhipe.hadoop.RXWritableText. When using a custom InputFormat, implement RXWritable and its methods.
mapred.input.format.class (default is either org.saptarshiguha.rhipe.hadoop.RXTextInputFormat or org.apache.hadoop.mapred.SequenceFileInputFormat): Specify yours here.
rhipejob.outputformat.keyclass (default chosen depending on TextInputFormat or SequenceFileInputFormat): Provide the full Java class name for the key class, e.g. org.saptarshiguha.rhipe.hadoop.RXWritableText; the key class must implement RXWritable.
rhipejob.outputformat.valueclass (default chosen depending on TextInputFormat or SequenceFileInputFormat): Provide the full Java class name for the value class, e.g. org.saptarshiguha.rhipe.hadoop.RXWritableText; the value class must implement RXWritable.
mapred.output.format.class (default is either org.saptarshiguha.rhipe.hadoop.RXTextOutputFormat or org.apache.hadoop.mapred.SequenceFileInputFormat): Specify yours here; provide libjars if required.

Citation: http://ml.stat.purdue.edu/rhipe/index.html

Great, exciting news for the world of remote computing.

Cloud, say hello to R. R, say hello to Cloud.


Here is a terrific project from Biocep, which I covered before in January at http://www.decisionstats.com/2009/01/r-and-cloud-computing/

But with some exciting steps ahead at http://biocep-distrib.r-forge.r-project.org/

Basically: take open source R, add a user-friendly GUI, and host it on a cloud computer to crunch data better and save hardware costs as well. Upload, crunch data, download.

Save hardware and software costs in a recession, before your boss decides to save on his staffing costs.


    Biocep combines the capabilities of R and the flexibility of a Java based distributed system to create a tool of considerable power and utility. A Biocep based R virtualization infrastructure has been successfully deployed on the British National Grid Service, demonstrating its usability and usefulness for researchers. 

    The virtual workbench enhances the user experience and the productivity of anyone working with R.

A lovely presentation on it is here

and I am taking an extract:

What is missing now

• High Level Java API for Accessing R

• Stateful, Reusable, Remotable R Components

• Scalable, Distributed, R-Based Infrastructure

• Safe multiple-client framework for component usage as a pool of indistinguishable Remote Resources

• User-friendly interface for remote resource creation, tracking and debugging

    Citation: Karim Chine, "Biocep, Towards a Federative, Collaborative, User-Centric, Grid-Enabled and Cloud-Ready Computational Open Platform," eScience, pp. 321-322, 2008 Fourth IEEE International Conference on eScience, 2008.

Ajay: With thanks to Bob Marcus for pointing this out from an older post of mine. I did write on this in August on the Ohri framework, but that was before the recession moved me out from cloud computing to blog computing.