Hive Tutorial: Cloud Computing

Here is a nice video from Cloudera on a Hive tutorial. I wonder what would happen if they put a real analytical system, and not just basic analytics and reporting, like R, SPSS, JMP, or SAS, on a big database system like Hadoop (including some text-mined data from legacy company documents).

Unlike Oracle and other database systems, Hadoop is free now and for the foreseeable future (much as MySQL used to be before it was acquired by big fish Sun, which was in turn acquired by the bigger Oracle).

Citation-

http://wiki.apache.org/hadoop/Hive

Hive is a data warehouse infrastructure built on top of Hadoop that provides tools to enable easy data summarization, ad hoc querying, and analysis of large datasets stored in Hadoop files.

Hive is based on Hadoop which is a batch processing system. Accordingly, this system does not and cannot promise low latencies on queries. The paradigm here is strictly of submitting jobs and being notified when the jobs are completed as opposed to real time queries. As a result it should not be compared with systems like Oracle where analysis is done on a significantly smaller amount of data but the analysis proceeds much more iteratively with the response times between iterations being less than a few minutes. For Hive queries response times for even the smallest jobs can be of the order of 5-10 minutes and for larger jobs this may even run into hours.

If your input data is small you can execute a query in a short time. For example, if a table has 100 rows you can ‘set mapred.reduce.tasks=1’ and ‘set mapred.map.tasks=1’ and the query time will be ~15 seconds.


Reactions to the IBM-SPSS takeover.

The business intelligence / business analytics / data mining industry (or, as James Taylor would say, the Decision Management industry) has some reactions to the IBM-SPSS deal (which was NOT a surprise to many, including me). Really.

From SAS Institute, Anne Milley

http://blogs.sas.com/sascom/index.php?/archives/557-Analytics-is-still-our-middle-name.html

Besides SAS, SPSS was one of the last independent analytic software companies. A colleague says, “It’s the end of the analytics cold war.”

I’ve been saying all along that analytics is required for success. Yes, data integration, data quality, and query & reporting are important too but, as W. Edwards Deming says, “The object of taking data is to provide a basis for action.”

The end of the analytics cold war, hmm. We all know what the end of the real cold war brought us: Google, cloud computing, and other non-technical issues.

From KXEN, Roger Haddad

“The price paid for SPSS of four times revenues and 25 times earnings shows just how valuable this sector really is,” says Haddad. “But the deal has also created a tremendous opportunity for the sector’s remaining independent vendors that KXEN is well placed to capitalize on.”

“There is no For Sale sign hanging in our window,” continues Haddad. “We launched KXEN in 1998 to democratize the benefits of data mining and predictive analytics, making them practical and affordable across the whole enterprise and not just the exclusive preserve of a few specialists. It’s going to take up to two years for the dust to settle following the IBM acquisition.”

“Former SPSS partners, systems integrators and distributors will face uncertainty.”

I think the PE multiple was still low- SPSS was worth more if you count the client base, the active community, and the brand itself in the valuation. There are tremendous cross-sell opportunities, and IBM, with its fine research and development, is a good supporter of pure science. Yes, the next two years will see increasing consolidation and more “surprising” news. At 4 times revenues, almost any publicly listed company can be bought in the present market. 😉

From the rather subdued voices on the SPSS list, some subjective and non-quantitative “strategic” forecasts.

http://www.listserv.uga.edu/cgi-bin/wa?A2=ind0907&L=spssx-l&F=&S=&P=36324

I think the Ancient Chinese said it best: “May you live in interesting times”.

Having worked with some flavors of Cognos and SPSS, I think there could be areas for technical integration for querying and GUI-based forecasting as well, apart from the financial merger and administrative readjustments. I mean, people pull data not just to report it, but to estimate what comes next as well.

This could also spell the end of single-platform analysts. You now need to learn at least two different platforms, like SAS, SPSS or KXEN, R or Cognos, or Business Objects, to hedge your chances of getting offshored (Note: I worked in offshored data analytics in India for almost 4 years).

As for what IBM will do with SPSS, its open source commitment to R, and the consequences for employees, customers, vendors, and partners who have more choices now than ever

…. well it depends. Who is John Galt?

An interesting community event from the R/statistical community

Citation-
http://en.oreilly.com/oscon2009/public/schedule/detail/10432

StackOverflow Flash Mob for the R User Community
Moderated by: Michael E. Driscoll
7:00pm Wednesday, 07/22/2009
Location: Ballroom A2

In concert with users online across the country, this session will lead a flashmob to populate StackOverflow with R language content.

R, the open source statistical language, has a notoriously steep learning curve. The same technical questions tend to be asked repeatedly on the R-help mailing lists, to the detriment of both R experts (who tire of repeating themselves) and the learners (who often receive a technically correct, but terse response).

We have developed a list of the most common 100 technical R questions, based on an analysis of (i) queries sent to the RSeek.org web portal, and (ii) an examination of the R-help list archives, and (iii) a survey of members of R Users Groups in San Francisco, LA, and New York City.

In the first hour, participants will pair up to claim a question, formulate it on StackOverflow, and provide a comprehensive answer. In the second hour, participants will rate, review, and comment on the set of submitted questions and answers.

While Stackoverflow currently lacks content for the R language, we believe this effort will provide the spark to attract more R users, and emerge as a valuable resource to the growing R community.

This is an interesting example of a statistical software community using Twitter for a tech-help event. I hope this trend/event gets replicated again and again.

Statisticians worldwide unite in the language of maths !!!

Please follow @rstatsmob to participate. See you at 7 PM PST!

twitter.com/Rstatsmob

R language on the GPU

Here are some nice articles on using R on Graphics Processing Units (GPUs), mainly made by NVIDIA. Think of a GPU as customized, specialized computing hardware on your desktop that translates into much faster computation for suitable workloads. For example, Matlab users can read the webinars here: http://www.nvidia.com/object/webinar.html

Now a slightly better definition of GPU computing is from http://www.nvidia.com/object/GPU_Computing.html

GPU computing is the use of a GPU (graphics processing unit) to do general purpose scientific and engineering computing.
The model for GPU computing is to use a CPU and GPU together in a heterogeneous computing model. The sequential part of the application runs on the CPU and the computationally-intensive part runs on the GPU. From the user’s perspective, the application just runs faster because it is using the high-performance of the GPU to boost performance.

rgpu

Citation:

http://brainarray.mbni.med.umich.edu/brainarray/rgpgpu/

R is the most popular open source statistical environment in the biomedical research community. However, most of the popular R function implementations involve no parallelism and they can only be executed as separate instances on multicore or cluster hardware for large data-parallel analysis tasks. The arrival of modern graphics processing units (GPUs) with user friendly programming tools, such as nVidia’s CUDA toolkit (http://www.nvidia.com/cuda), provides a possibility of increasing the computational efficiency of many common tasks by more than one order of magnitude (http://gpgpu.org/). However, most R users are not trained to program a GPU, a key obstacle for the widespread adoption of GPUs in biomedical research.

The research project at the page mentioned above has developed special packages for exactly this need: R on a GPU.

The initial package is hosted on CRAN as gputools, a source package for UNIX and Linux systems. Be sure to set the environment variable CUDA_HOME to the root of your CUDA toolkit installation. Then install the package in the usual R manner. The installation process will automatically make use of nVidia’s nvcc compiler and the CUBLAS shared library.
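In R, that installation flow looks roughly like this (a minimal sketch; the CUDA path is an assumption and should point to your own toolkit root):

# Point the build at the CUDA toolkit (path is an assumption; adjust to your system)
Sys.setenv(CUDA_HOME = "/usr/local/cuda")

# Install the source package from CRAN in the usual R manner;
# the build step invokes nVidia's nvcc compiler and links against CUBLAS
install.packages("gputools")

library(gputools)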

And here are some performance figures:

Figure 1 provides performance comparisons between original R functions, assuming a four-thread data-parallel solution on an Intel Core i7 920, and our GPU-enabled R functions on a GTX 295 GPU. The speedup test consisted of testing each of three algorithms with five randomly generated data sets. The Granger causality algorithm was tested with a lag of 2 for 200, 400, 600, 800, and 1000 random variables with 10 observations each. Complete hierarchical clustering was tested with 1000, 2000, 4000, 6000, and 8000 points. Calculation of Kendall’s correlation coefficient was tested with 20, 30, 40, 50, and 60 random variables with 10000 observations each.
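To get a feel for this kind of comparison on your own hardware, here is a minimal timing sketch. It assumes gputools exposes a gpuCor function with a "kendall" method, as described on the project page:

library(gputools)

# Simulated data: 20 random variables with 500 observations each
x <- matrix(rnorm(500 * 20), nrow = 500, ncol = 20)

# CPU timing: Kendall's correlation coefficient in base R
system.time(cor(x, method = "kendall"))

# GPU timing: the gputools equivalent (assumed interface)
system.time(gpuCor(x, method = "kendall"))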

Ajay- For hard-core data mining people, customized GPUs for accelerated analytics and data mining sound like fun and common sense. Are there other packages for customization on a GPU? Let me know.


Download

Download the gputools package for R on a Linux platform here: version 0.01.

Zementis News

From a Zementis newsletter, some interesting advances on the R-on-the-cloud front. Thanks to Rom Ramos for sending this, and I hope Zementis and someone like Google/Biocep team up so all I need to make a model is some data and a browser. 🙂

The R Journal – A Refereed Journal for the R Project Launches

As a sign of the open source R project for statistical computing gaining momentum, the R newsletter has been transformed into The R Journal, a refereed journal for articles covering topics that are of interest to users or developers of R.  As a supporter of the R PMML Package (see blog and video tutorial), we are honored that our article “PMML: An Open Standard for Sharing Models”, which emphasizes the importance of the Predictive Model Markup Language (PMML) standard, is part of the inaugural issue.  If you already develop your models in R, export them via PMML, then deploy and scale your models in ADAPA on the Amazon EC2 cloud. Read the full story.
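For instance, exporting a model from R as PMML takes only a couple of lines. This is a minimal sketch, assuming the pmml and XML packages are installed, and it uses a toy linear model rather than anything from the article:

library(pmml)   # PMML exporter for R models
library(XML)    # provides saveXML()

# Fit a toy model on a built-in dataset
fit <- lm(Sepal.Length ~ Sepal.Width + Petal.Length, data = iris)

# Convert the model to a PMML document and write it to disk;
# the resulting XML file is what a PMML scoring engine would consume
saveXML(pmml(fit), file = "iris_lm.xml")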

Integrating Predictive Analytics via Web Services

Predictive analytics will deliver more value and become more pervasive across the enterprise, once we manage to seamlessly integrate predictive models into any business process.  In order to execute predictive models on-demand, in real-time or in batch mode, the integration via web services presents a simple and effective way to leverage scoring results within different applications.  For most scenarios, the best way to incorporate predictive models into the business process is as a decision service.  Query the model(s) daily, hourly, or in real-time, but if at all possible try to design a loosely coupled system following a Service Oriented Architecture (SOA).

Using web services, for example, one can quickly improve existing systems and processes by adding predictive decision models.  Following the idea of a loosely coupled architecture, it is even possible to use integration tools like Jitterbit or Microsoft SQL Server Integration Services (SSIS) to embed predictive models that are deployed in ADAPA on the Amazon Elastic Compute Cloud without the need to write any code.  Of course, there is also the option to use custom Java code or MS SQL Server SSIS scripting, for which we provide a sample client application.  Read the full story.
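As a rough illustration of the web-services idea (not Zementis’ actual API: the endpoint URL and the input record below are placeholders invented for this sketch), scoring a record over HTTP from R with the RCurl package could look something like this:

library(RCurl)

# Placeholder scoring endpoint and input record; a real deployment such as
# ADAPA documents its own URL, authentication and payload format
endpoint <- "https://example.com/adapa/score"
record   <- '{"age": 42, "income": 55000}'

# POST the record to the decision service and print whatever score comes back
response <- getURL(endpoint,
                   customrequest = "POST",
                   postfields    = record,
                   httpheader    = c("Content-Type" = "application/json"))
cat(response)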

About ADAPA®:

A fast, real-time deployment environment for predictive analytics models: a stand-alone scoring engine that reads XML-based PMML descriptions of models and scores streams of data. Developed by Zementis as a fully hosted Software-as-a-Service (SaaS) solution on the Amazon Elastic Compute Cloud.  It’s easy to use and remarkably inexpensive, starting at only $0.99 per instance hour.

R and Hadoop

Here is an exciting project for using R in a cloud computing environment (two of my favorite things). It is called RHIPE.

R and Hadoop Integrated Processing Environment v.0.38


The website source is http://ml.stat.purdue.edu/rhipe/

RHIPE (phonetic spelling: hree-pay) is a Java package that integrates the R environment with Hadoop, the open source implementation of Google’s MapReduce. Using RHIPE it is possible to code map-reduce algorithms in R, e.g.

m <- function(key,val){
  # split the line into words and tabulate word counts
  words <- strsplit(val," +")[[1]]
  wc <- table(words)
  cln <- names(wc)
  names(wc)<-NULL; names(cln)<-NULL;
  # emit one (word, count) pair per distinct word
  return(sapply(1:length(wc),function(r) list(key=cln[r],value=wc[[r]]),simplify=F))
}
r <- function(key,value){
  # sum all the counts emitted for this word
  value <- do.call("rbind",value)
  return(list(list(key=key,value=sum(value))))
}
# run the map-reduce job, reusing the reducer as a combiner
rhmr(map=m,reduce=r,combiner=T,input.folder="X",output.folder="Y")
1 rhapply

rhapply packages the user's request into an R vector object. This is serialized and sent to the RHIPE server. The RHIPE server picks apart the object, creating a job request that Hadoop can understand. Each element of the provided list is processed by the user's function during the Map stage of mapreduce. The results are returned and, if the output is to a file, these results are serialized and written to a Hadoop sequence file; the values can be read back into R using the rhsq* functions.

2 rhlapply

rhlapply <- function( list.object,
                    func,
                    configure=expression(),
                    output.folder='',
                    shared.files=c(),
                    hadoop.mapreduce=list(),
                    verbose=T,
                    takeAll=T)
list.object
This can either be a list or a single scalar. In the case of the former, the function given by func will be applied to each element of list.object. In the case of a scalar, the function will be applied to the list 1:n, where n is the value of the scalar.
func
A function that takes one parameter: an element of the list.
configure
A configuration expression to run before func is executed. It is executed once for every JVM. If you need variables or data frames, use rhsave or rhsave.image, use rhput to copy the file to the DFS, and then use shared.files, e.g.

config = expression({
              library(lattice)
              load("mydataset.Rdata")
})
output.folder
Any file that is created by the function is stored in output.folder. This folder is deleted first. If not given, the files created will not be copied. For side-effect files to be copied, create them in tmp, e.g. pdf("tmp/x.pdf"); note there is no leading slash. The directory will contain a slew of part* files, as many as there are maps. These contain the binary key-value pairs.
shared.files
The function or the preload expression might require the presence of resource files, e.g. *.Rdata files. The user could copy them from the HDFS in the R code, or simply load them from the local work directory if the files are present there. This is the role of shared.files. It is a vector of paths to files on the HDFS, each of which will be copied to the work directory where the R code is run, e.g. c('/tmp/x.Rdata','/foo.tgz'); the first file can then be loaded via load("x.Rdata"). For those familiar with Hadoop terminology, this is implemented via DistributedCache.
hadoop.mapreduce
A list of Hadoop-specific options, e.g.

list(mapreduce.map.tasks=10,mapreduce.reduce.tasks=3)
takeAll
If takeAll is true, the value returned is a list, each entry being the return value of the function. The results are not in order, so element 1 of the returned list is not necessarily the result of applying func to the first element of list.object.
verbose
If TRUE, the user will see the job progress in the R console. If FALSE, the web URL to the jobtracker will be displayed. Cancelling the command with CTRL-C will not cancel the job; use rhkill for that.
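Putting those parameters together, a minimal usage sketch based on the descriptions above might look like this (the DFS output path is a placeholder):

# Since list.object is a scalar, func is applied to each element of 1:100
results <- rhlapply(100,
                    func = function(x) x^2,
                    output.folder = "/tmp/rhlapply-out",  # placeholder DFS path
                    verbose = TRUE,
                    takeAll = TRUE)

# With takeAll=TRUE the return value is a list of func's results (not in order)
length(results)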
Mapreduce in R.

rhmr <- function(map,reduce,input.folder,
                 configure=list(map=expression(),reduce=expression()),
                 close=list(map=expression(),reduce=expression()),
                 output.folder='',combiner=F,step=F,
                 shared.files=c(),inputformat="TextInputFormat",
                 outputformat="TextOutputFormat",
                 hadoop.mapreduce=list(),verbose=T,libjars=c())

Execute map-reduce algorithms from within R. A discussion of the parameters follows.

input.folder
A folder on the DFS containing the files to process. Can be a vector.
output.folder
A folder on the DFS where output will go to.
inputformat
Either TextInputFormat or SequenceFileInputFormat. Use the former for text files and the latter for sequence files created from within R or as outputs from RHIPE (e.g. rhlapply or rhmr). Note, one cannot use just any sequence file; they must have been created via a RHIPE function. Custom input formats are also possible. Download the source and look at code/java/RXTextInputFormat.java.
outputformat
Either TextOutputFormat or SequenceFileOutputFormat. In the case of the former, the return value from the mapper or reducer is converted to character and written to disk. The following code is used to convert to character:

paste(key,sep='',collapse=field_separator)

Custom output formats are also possible. Download the source and look at code/java/RXTextOutputFormat.java

If custom formats implement their own writables, they must subclass RXWritable or use one of the writables present in RHIPE.

shared.files
Same as in rhlapply; see that for documentation.
verbose
If TRUE, the job progress is displayed. If FALSE, the web URL to the jobtracker is displayed.

At any time in the configure, close, map, or reduce function/expression, the variable mapred.task.is.map will be equal to "true" if it is a map task and "false" otherwise (both strings). Also, mapred.iswhat is mapper, reducer, or combiner in the respective environments.

configure
A list with either one element (an expression) or two elements, map and reduce, both of which must be expressions. These expressions are called in their respective environments, i.e. the map expression is called during the map configure and similarly for the reduce expression. The reduce expression is also called for the combiner configure method. If there is only one list element, the expression is used for both map and reduce.
close
Same as configure.
map
A function that takes two values, key and value. It should return a list of lists; each list entry must contain two elements, key and value, e.g.

...
ret <- list()
ret[[1]] <-  list(key=c(1,2), value=c('x','b'))
return(ret)

If either key or value is missing, the output is not collected; e.g. return NULL to skip this record. If the input format is TextInputFormat, the value is the entire line and the key is probably useless to the user (it is a number indicating the byte offset into the file). If the input format is SequenceFileInputFormat, the key and value are taken from the sequence file.

reduce
Not needed if mapred.reduce.tasks is 0. Takes a key and a list of values (all values emitted from the maps that share the same map output key). If step is TRUE, the second argument is not a list. The function must return a list of lists, each element of which must have two elements, key and value. RHIPE collects all the values for a key and sends them to this function. If NULL is returned, or the return value does not conform to the above, nothing is collected by the Hadoop collector.
step
If step is TRUE, the reduce function is called for every value corresponding to a key, that is, once for every value.

  • The variable red.status is equal to 1 on the first call.
  • red.status is equal to 0 for every subsequent call, including the last value.
  • The reducer function is called one last time with red.status equal to -1; the value is NULL. Anything returned at any of these stages is written to disk.

The close function is called once every value for a given key has been processed, but returning anything has no effect. To assign to the global environment, use the <<- operator.
combiner
T or F, to use the reducer as a combiner. Using a combiner makes computation more efficient. If combiner is TRUE, the reduce function will be called as a combiner (0 or more times; it may never be called during the combine stage even if combiner is TRUE). The value of mapred.task.is.map is 'true' or 'false' (both strings) depending on whether the combiner is being executed as part of the map stage or the reduce stage respectively.

Whether knowledge of this is useful or not is something I'm not sure of. However, if combiner is TRUE, keep in mind that your reduce function must be able to handle inputs sent from the map as well as inputs sent from the reduce function (itself); a short sketch of such a combiner-safe call follows after this parameter list.

libjars
If specifying a custom input/output format, the user might need to specify jar files here.
hadoop.mapreduce
Set RHIPE and Hadoop options via this list.
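To tie several of these parameters together, here is a hedged sketch of a fuller rhmr call that reuses the word-count map and reduce functions defined earlier; the folder names and the shared .Rdata file are placeholders:

rhmr(map = m,                                   # map function from the word-count example
     reduce = r,                                # sum reducer; safe to reuse as a combiner
     combiner = TRUE,                           # the reducer only sums, so combining is harmless
     input.folder  = "/data/text",              # placeholder DFS input folder
     output.folder = "/data/wordcounts",        # placeholder DFS output folder
     inputformat  = "TextInputFormat",          # plain text lines in
     outputformat = "SequenceFileOutputFormat", # binary key-value pairs out
     configure = list(map    = expression(load("lookup.Rdata")),
                      reduce = expression()),
     shared.files = c("/tmp/lookup.Rdata"),     # HDFS file copied to each task's work directory
     hadoop.mapreduce = list(mapreduce.map.tasks = 10,
                             mapreduce.reduce.tasks = 3))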

1.1 RHIPE Options for mapreduce

  • rhipejob.rport (default 8888): The port on which Rserve runs; should be the same across all machines.
  • rhipejob.charsxp.short (default 0): If 1, RHIPE optimizes serialization for character vectors. This reduces the length of the serialization.
  • rhipejob.getmapbatches (default 100): If the mapper/reducer emits several key-value pairs, how many to get from Rserve at a time. A higher number reduces the number of network reads (the network reads are to localhost).
  • rhipejob.outfmt.is.text (default 1 if TextInputFormat): Must be 1 if the output is textual.
  • rhipejob.textoutput.fieldsep (default ' '): The field separator for any text-based output format.
  • rhipejob.textinput.comment (default '#'): In the TextInputFormat, lines beginning with this are skipped.
  • rhipejob.combinerspill (default 100,000): The combiner is run after collecting at most this many items.
  • rhipejob.tor.batch (default 200,000): Number of values for the same key to collate before sending to the reducer. If you have dollops of memory, set this larger; however, too large and you hit Java's heap space limit.
  • rhipejob.max.count.reduce (default Java's INT_MAX, about 2 billion): The total number of values for a given key to be collected; note the values are not ordered by any variable.
  • rhipejob.inputformat.keyclass (default chosen depending on TextInputFormat or SequenceFileInputFormat): Provide the full Java URL to the keyclass, e.g. org.saptarshiguha.rhipe.hadoop.RXWritableText. When using a custom InputFormat, implement RXWritable and implement the methods.
  • rhipejob.inputformat.valueclass (default chosen depending on TextInputFormat or SequenceFileInputFormat): Provide the full Java URL to the valueclass, e.g. org.saptarshiguha.rhipe.hadoop.RXWritableText. When using a custom InputFormat, implement RXWritable and implement the methods.
  • mapred.input.format.class (default is either org.saptarshiguha.rhipe.hadoop.RXTextInputFormat or org.apache.hadoop.mapred.SequenceFileInputFormat): Specify yours here.
  • rhipejob.outputformat.keyclass (default chosen depending on TextInputFormat or SequenceFileInputFormat): Provide the full Java URL to the keyclass, e.g. org.saptarshiguha.rhipe.hadoop.RXWritableText; the keyclass must implement RXWritable and implement the methods.
  • rhipejob.outputformat.valueclass (default chosen depending on TextInputFormat or SequenceFileInputFormat): Provide the full Java URL to the valueclass, e.g. org.saptarshiguha.rhipe.hadoop.RXWritableText; the valueclass must implement RXWritable.
  • mapred.output.format.class (default is either org.saptarshiguha.rhipe.hadoop.RXTextOutputFormat or org.apache.hadoop.mapred.SequenceFileInputFormat): Specify yours here; provide libjars if required.
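These options are passed through the hadoop.mapreduce argument of rhmr; for example, reusing the word-count functions m and r from earlier (the particular option values below are arbitrary, chosen only to illustrate the mechanism):

# Comma-separated text output, compact character serialization, and a larger
# batch of values collated per key before being sent to the reducer
rhmr(map = m, reduce = r,
     input.folder = "X", output.folder = "Y",
     hadoop.mapreduce = list(rhipejob.textoutput.fieldsep = ",",
                             rhipejob.charsxp.short       = 1,
                             rhipejob.tor.batch           = 500000))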

Citation: http://ml.stat.purdue.edu/rhipe/index.html

Great, exciting news for the world of remote computing.

Training on R

Here is an interesting training from Revolution Computing

New Training Course from REvolution Computing
High-Performance Computing with R
July 31, 2009 – Washington, DC – Prior to JSM
Time: 9am – 5pm
$600 commercial delegates, $450 government, $250 academic

Click Here to Register Now!

This one-day course will present an overview of available HPC technologies for the R language that enable faster, scalable analytics taking advantage of multiprocessor capability. It will include a comprehensive overview of REvolution’s recently released R packages foreach and iterators, which make parallel programming easier than ever before for R programmers, as well as other available technologies such as RMPI, SNOW and many more. We will demonstrate each technology with simple examples that can be used as starting points for more sophisticated work. The agenda will also cover:

  • Identifying performance problems
  • Profiling R programs
  • Multithreading, using compiled code, GPGPU
  • Multiprocess computing
  • SNOW, MPI, NetWorkSpaces, and more
  • Batch queueing systems
  • Dealing with lots of data

Attendees should have basic familiarity with the R language—we will keep examples elementary but relevant to real-world applications.

This course will be conducted hands-on, classroom style. Computers will not be provided. Registrants are required to bring their own laptops.

For the full agenda Click Here or Click Here to Register Now!

Source: www.revolution-computing.com

Disclaimer- I am NOT commercially related to REvolution, I just love R. I do hope the REvolution chaps spend a tiny bit of time improving the user GUI as well, not just for HPC purposes.

They recently released some new packages, free to the CRAN community, as well.

The release of 3 new packages for R designed to allow all R users to more quickly handle large, complex sets of data: iterators, foreach and doMC.

* iterators implements the “iterator” data structure familiar to users of languages like Java, C# and Python to make it easy to program useful sequences – from all the prime numbers to the columns of a matrix or the rows of an external database.

* foreach builds on the “iterators” package to introduce a new way of programming loops in R. Unlike the traditional “for” loop, foreach runs multiple iterations simultaneously, in parallel. This makes loops run faster on a multi-core laptop, and enables distribution of large parallel-processing problems to multiple workstations in a cluster or in the cloud, without additional complicated programming. foreach works with parallel programming backends for R from the open-source and commercial domains (a short sketch combining these packages follows below).

* doMC is an open source parallel programming backend to enable parallel computation with “foreach” on Unix/Linux machines. It automatically enables foreach and iterator functions to work with the “multicore” package from R Core member Simon Urbanek.
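As a quick taste of how the three packages fit together, here is a minimal sketch, assuming a multi-core Unix/Linux machine:

library(iterators)
library(foreach)
library(doMC)

# Register the multicore backend so %dopar% runs iterations in parallel
registerDoMC(cores = 2)

# Iterate over the columns of a matrix without materialising them all at once
mat <- matrix(rnorm(20), nrow = 5)
it  <- iter(mat, by = "column")

# Each parallel iteration receives one column and returns its mean;
# the results are combined into a single vector with c()
col.means <- foreach(x = it, .combine = c) %dopar% mean(x)
col.means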

The new packages have been developed by REvolution Computing and released under open source licenses to the R community, enabling all existing R users to take advantage of them.

Citation:

http://www.revolution-computing.com/aboutus/news-room/2009/breakthrough-parallel-packages-and-functions-for-r.php