This is a great data mining tutorial from John Elder. Visit his site at http://datamininglab.com/
for more great video tutorials- all very lucid, easy to understand and powerful.
From a Zementis newsletter: interesting advances on the R-on-the-cloud front. Thanks to Rom Ramos for sending this, and I hope Zementis and someone like the Google/Biocep team pair up so all I need to make a model is some data and a browser. 🙂
The R Journal – A Refereed Journal for the R Project Launches
As a sign of the open source R project for statistical computing gaining momentum, the R newsletter has been transformed into The R Journal, a refereed journal for articles covering topics of interest to users or developers of R. As supporters of the R PMML package (see blog and video tutorial), we are honored that our article “PMML: An Open Standard for Sharing Models”, which emphasizes the importance of the Predictive Model Markup Language (PMML) standard, is part of the inaugural issue. If you already develop your models in R, export them via PMML, then deploy and scale them in ADAPA on the Amazon EC2 cloud. Read the full story.
Integrating Predictive Analytics via Web Services
Predictive analytics will deliver more value and become more pervasive across the enterprise once we manage to seamlessly integrate predictive models into any business process. In order to execute predictive models on demand, in real time or in batch mode, integration via web services presents a simple and effective way to leverage scoring results within different applications. For most scenarios, the best way to incorporate predictive models into the business process is as a decision service. Query the model(s) daily, hourly, or in real time, but if at all possible try to design a loosely coupled system following a Service Oriented Architecture (SOA). Using web services, for example, one can quickly improve existing systems and processes by adding predictive decision models. Following the idea of a loosely coupled architecture, it is even possible to use integration tools like Jitterbit or Microsoft SQL Server Integration Services (SSIS) to embed predictive models that are deployed in ADAPA on the Amazon Elastic Compute Cloud without the need to write any code. Of course, there is also the option to use custom Java code or MS SQL Server SSIS scripting, for which we provide a sample client application. Read the full story.
About ADAPA®: A fast, real-time deployment environment for predictive analytics models. ADAPA is a stand-alone scoring engine that reads XML-based PMML descriptions of models and scores streams of data. Developed by Zementis, it is a fully hosted Software-as-a-Service (SaaS) solution on the Amazon Elastic Compute Cloud. It’s easy to use and remarkably inexpensive, starting at only $0.99 per instance-hour.
Here is an exciting project for using R in a cloud computing environment (two of my favorite things). It is called RHIPE.
R and Hadoop Integrated Processing Environment v.0.38
The website source is http://ml.stat.purdue.edu/rhipe/
RHIPE (phonetic spelling: hree-pay’) is a Java package that integrates the R environment with Hadoop, the open source implementation of Google’s MapReduce. Using RHIPE it is possible to code map-reduce algorithms in R, e.g.:
m <- function(key, val) {
  words <- strsplit(val, " +")[[1]]
  wc <- table(words)
  cln <- names(wc)
  names(wc) <- NULL
  names(cln) <- NULL
  return(sapply(1:length(wc), function(r) list(key = cln[r], value = wc[[r]]), simplify = F))
}

r <- function(key, value) {
  value <- do.call("rbind", value)
  return(list(list(key = key, value = sum(value))))
}

rhmr(map = m, reduce = r, combiner = T, input.folder = "X", output.folder = "Y")

rhlapply packages the user's request into an R vector object. This is serialized and sent to the RHIPE server. The RHIPE server picks apart the object, creating a job request that Hadoop can understand. Each element of the provided list is processed by the user's function during the Map stage of MapReduce. The results are returned and, if the output is to a file, these results are serialized and written to a Hadoop sequence file; the values can be read back into R using the rhsq* functions.

rhlapply
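Without a Hadoop cluster at hand, the word-count map and reduce functions (repeated below so the snippet is self-contained) can be sanity-checked in plain R by simulating the shuffle step by hand. The grouping and the two "input lines" here are a local stand-in for what RHIPE and Hadoop would actually supply in a real run:

```r
# Word-count map and reduce, as in the RHIPE example
m <- function(key, val) {
  words <- strsplit(val, " +")[[1]]
  wc <- table(words)
  cln <- names(wc)
  names(wc) <- NULL
  names(cln) <- NULL
  sapply(seq_along(wc), function(r) list(key = cln[r], value = wc[[r]]),
         simplify = FALSE)
}
r <- function(key, value) {
  value <- do.call("rbind", value)
  list(list(key = key, value = sum(value)))
}

# Simulate the map stage on two input "lines" (keys mimic byte offsets)
mapped <- c(m(1, "the cat sat"), m(2, "the hat"))

# Simulate the shuffle: group the emitted values by key
keys    <- sapply(mapped, function(x) x$key)
vals    <- lapply(mapped, function(x) x$value)
grouped <- split(vals, keys)

# Simulate the reduce stage and collect the word counts
reduced <- lapply(names(grouped), function(k) r(k, grouped[[k]])[[1]])
counts  <- setNames(sapply(reduced, function(x) x$value),
                    sapply(reduced, function(x) x$key))
counts  # "the" appears twice, every other word once
```

This is only a sequential simulation; the point of RHIPE is that the same m and r run distributed over a cluster.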
rhlapply <- function(list.object, func, configure = expression(),
                     output.folder = '', shared.files = c(),
                     hadoop.mapreduce = list(), verbose = T, takeAll = T)
- list.object
- This can either be a list or a single scalar. In the former case, the function given by func will be applied to each element of list.object. In the case of a scalar, the function will be applied to the list 1:n, where n is the value of the scalar.
- func
- A function that takes one parameter: an element of the list.
- configure
- A configuration expression to run before func is executed; executed once for every JVM. If you need variables or data frames, use rhsave or rhsave.image, use rhput to copy the file to the DFS, and then use shared.files, e.g. config = expression({ library(lattice); load("mydataset.Rdata") })
- output.folder
- Any file created by the function is stored in the output.folder, which is deleted first. If not given, the files created will not be copied. For side-effect files to be copied, create them in tmp, e.g. pdf("tmp/x.pdf") (note: no leading slash). The directory will contain a slew of part* files, as many as there are maps; these contain the binary key-value pairs.
- shared.files
- The function or the preload expression might require the presence of resource files, e.g. *.Rdata files. The user could copy them from the HDFS in the R code, or just load them from the local working directory if the files are present. This is the role of shared.files: a vector of paths to files on the HDFS, each of which will be copied to the working directory where the R code runs. E.g. given c('/tmp/x.Rdata','/foo.tgz'), the first file can be loaded via load("x.Rdata"). For those familiar with Hadoop terminology, this is implemented via DistributedCache.
- hadoop.mapreduce
- A list of Hadoop-specific options, e.g. list(mapreduce.map.tasks=10, mapreduce.reduce.tasks=3).
- takeAll
- If takeAll is true, the value returned is a list with one entry per return value of the function. The entries are not in order, so element 1 of the returned list is not necessarily the result of applying func to the first element of list.object.
- verbose
- If TRUE, the user will see the job progress in the R console. If FALSE, the web URL to the jobtracker will be displayed. Cancelling the command with CTRL-C will not cancel the job; use rhkill for that.

Mapreduce in R: rhmr

rhmr <- function(map, reduce, input.folder,
                 configure = list(map = expression(), reduce = expression()),
                 close = list(map = expression(), reduce = expression()),
                 output.folder = '', combiner = F, step = F,
                 shared.files = c(), inputformat = "TextInputFormat",
                 outputformat = "TextOutputFormat",
                 hadoop.mapreduce = list(), verbose = T, libjars = c())

Execute MapReduce algorithms from within R. A discussion of the parameters follows.
- input.folder
- A folder on the DFS containing the files to process. Can be a vector.
- output.folder
- A folder on the DFS where the output will go.
- inputformat
- Either TextInputFormat or SequenceFileInputFormat. Use the former for text files and the latter for sequence files created from within R or as outputs from RHIPE (e.g. rhlapply or rhmr). Note, one can't use just any sequence file; it must have been created via a RHIPE function. Custom input formats are also possible: download the source and look at code/java/RXTextInputFormat.java.
- outputformat
- Either TextOutputFormat or SequenceFileOutputFormat. In the former case, the return value from the mapper or reducer is converted to character and written to disk; the following code is used for the conversion: paste(key, sep='', collapse=field_separator). Custom output formats are also possible: download the source and look at code/java/RXTextOutputFormat.java. If a custom format implements its own writables, it must subclass RXWritable or use one of the writables present in RHIPE.
- shared.files
- Same as in rhlapply; see that for documentation.
- verbose
- If T, the job progress is displayed. If false, the job URL is displayed.

At any time in the configure, close, map or reduce function/expression, the variable mapred.task.is.map will be equal to "true" if it is a map task, "false" otherwise (both strings). Also, mapred.iswhat is mapper, reducer, or combiner in their respective environments.
- configure
- A list with either one element (an expression) or two elements, map and reduce, both of which must be expressions. These expressions are called in their respective environments, i.e. the map expression is called during the map configure, and similarly for the reduce expression. The reduce expression is also called for the combiner configure method. If there is only one list element, the expression is used for both the map and the reduce.
- close
- Same as configure.
- map
- A function that takes two values, key and value. It should return a list of lists; each list entry must contain two elements, key and value, e.g.

ret <- list()
ret[[1]] <- list(key = c(1, 2), value = c('x', 'b'))
return(ret)

If either key or value is missing, the output is not collected; e.g. return NULL to skip a record. If the input format is TextInputFormat, the value is the entire line and the key is probably useless to the user (it is a number indicating bytes into the file). If the input format is SequenceFileInputFormat, the key and value are taken from the sequence file.
- reduce
- Not needed if mapred.reduce.tasks is 0. Takes a key and a list of values (all values emitted from the maps that share the same map output key); if step is true, it takes a single value rather than a list. It must return a list of lists, each element of which must have two elements, key and value. This collects all the values and sends them to Hadoop. If NULL is returned, or the return value does not conform to the above, nothing is collected by the Hadoop collector.
- step
- If step is TRUE, then the reduce function is called for every value corresponding to a key, that is, once per value.
- The variable red.status is equal to 1 on the first call, and equal to 0 for every subsequent call, including the last value.
- The reducer function is called one last time with red.status equal to -1; the value is then NULL. Anything returned at any of these stages is written to disk. The close function is called once every value for a given key has been processed, but returning anything from it has no effect. To assign to the global environment, use the <<- operator.
- combiner
- T or F, to use the reducer as a combiner. Using a combiner makes computation more efficient. If combiner is true, the reduce function will be called as a combiner (0 or more times; it may never be called during the combine stage even if combiner is T). The value of mapred.task.is.map is 'true' or 'false' (both strings) depending on whether the combiner is being executed as part of the map stage or the reduce stage. Whether knowledge of this is useful is something I'm not sure of. However, if combiner is T, keep in mind that your reduce function must be able to handle inputs sent from the map as well as inputs sent from the reduce function (itself).
- libjars
- If specifying a custom input/output format, the user might need to specify jar files here.
- hadoop.mapreduce
- Set RHIPE and Hadoop options via this.
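The red.status protocol described above can be sketched as a plain R function. The hand-simulated calls below are only an illustration (in a real job, RHIPE itself sets red.status and invokes the reducer), but they show the state machine a step-wise reducer must implement:

```r
# A step-wise (step=TRUE) reducer that sums the values for a key,
# driven by the red.status variable RHIPE sets in the environment:
#    1 -> first value for this key (initialize state)
#    0 -> subsequent values (accumulate)
#   -1 -> end of key (value is NULL; emit the result)
reduce.sum <- function(key, value) {
  if (red.status == 1) total <<- 0             # first call: reset accumulator
  if (red.status >= 0) total <<- total + value # accumulate on calls 1 and 0
  if (red.status == -1) return(list(list(key = key, value = total)))
  NULL                                         # intermediate calls emit nothing
}

# Hand-simulating the calls RHIPE would make for key "x" with values 3, 4, 5:
red.status <- 1;  reduce.sum("x", 3)
red.status <- 0;  reduce.sum("x", 4)
red.status <- 0;  reduce.sum("x", 5)
red.status <- -1; out <- reduce.sum("x", NULL)
out[[1]]$value  # 12
```

Note the use of <<- so the accumulator survives between calls, as the text above suggests for assigning to the global environment.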
1.1 RHIPE Options for mapreduce
- rhipejob.rport (default 8888): The port on which Rserve runs; should be the same across all machines.
- rhipejob.charsxp.short (default 0): If 1, RHIPE optimizes serialization for character vectors, reducing the length of the serialization.
- rhipejob.getmapbatches (default 100): If the mapper/reducer emits several key-value pairs, how many to fetch from Rserve at a time. A higher number reduces the number of network reads (the network reads are to localhost).
- rhipejob.outfmt.is.text (default 1 if TextInputFormat): Must be 1 if the output is textual.
- rhipejob.textoutput.fieldsep (default ' '): The field separator for any text-based output format.
- rhipejob.textinput.comment (default '#'): In the TextInputFormat, lines beginning with this are skipped.
- rhipejob.combinerspill (default 100,000): The combiner is run after collecting at most this many items.
- rhipejob.tor.batch (default 200,000): Number of values for the same key to collate before sending to the reducer. If you have dollops of memory, set this larger; however, too large and you hit Java's heap space limit.
- rhipejob.max.count.reduce (default Java's INT_MAX, about 2 billion): The total number of values for a given key to be collected; note the values are not ordered by any variable.
- rhipejob.inputformat.keyclass (default chosen depending on TextInputFormat or SequenceFileInputFormat): The full Java name of the key class, e.g. org.saptarshiguha.rhipe.hadoop.RXWritableText. When using a custom InputFormat, implement RXWritable and its methods.
- rhipejob.inputformat.valueclass (default as above): The full Java name of the value class, e.g. org.saptarshiguha.rhipe.hadoop.RXWritableText. When using a custom InputFormat, implement RXWritable and its methods.
- mapred.input.format.class: As above, the default is either org.saptarshiguha.rhipe.hadoop.RXTextInputFormat or org.apache.hadoop.mapred.SequenceFileInputFormat; specify yours here.
- rhipejob.outputformat.keyclass (default as above): The full Java name of the key class, e.g. org.saptarshiguha.rhipe.hadoop.RXWritableText; the key class must implement RXWritable.
- rhipejob.outputformat.valueclass (default as above): The full Java name of the value class, e.g. org.saptarshiguha.rhipe.hadoop.RXWritableText; the value class must implement RXWritable.
- mapred.output.format.class: As above, the default is either org.saptarshiguha.rhipe.hadoop.RXTextOutputFormat or org.apache.hadoop.mapred.SequenceFileOutputFormat; specify yours here, and provide libjars if required.

Citation: http://ml.stat.purdue.edu/rhipe/index.html
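As a sketch of how these options reach a job (not runnable without Hadoop and RHIPE installed; the folder names, the m and r functions, and the option values here are hypothetical), they would be passed through the hadoop.mapreduce argument of rhmr:

```r
# Hypothetical: tune RHIPE/Hadoop options for a word-count job.
# Assumes m and r are map/reduce functions as defined earlier,
# and that "X" and "Y" are folders on the DFS.
rhmr(map = m, reduce = r,
     input.folder = "X", output.folder = "Y",
     hadoop.mapreduce = list(
       rhipejob.tor.batch  = 500000,  # collate more values per key before reducing
       mapred.reduce.tasks = 4        # number of reduce tasks
     ))
```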
Great, exciting news for the world of remote computing.
Here is an interesting training from Revolution Computing
New Training Course from REvolution Computing
High-Performance Computing with R
July 31, 2009 – Washington, DC – Prior to JSM
Time: 9am – 5pm
$600 commercial delegates, $450 government, $250 academic.

An overview of available HPC technologies for the R language, enabling faster, scalable analytics that take advantage of multiprocessor capability, will be presented in a one-day course. This will include a comprehensive overview of REvolution’s recently released R packages foreach and iterators, making parallel programming easier than ever before for R programmers, as well as other available technologies such as Rmpi, SNOW and many more. We will demonstrate each technology with simple examples that can be used as starting points for more sophisticated work. The agenda will also cover:
- Identifying performance problems
- Profiling R programs
- Multithreading, using compiled code, GPGPU
- Multiprocess computing
- SNOW, MPI, NetWorkSpaces, and more
- Batch queueing systems
- Dealing with lots of data
Attendees should have basic familiarity with the R language—we will keep examples elementary but relevant to real-world applications.
This course will be conducted hands-on, classroom style. Computers will not be provided. Registrants are required to bring their own laptops.
For the full agenda Click Here, or Click Here to Register Now!
Source: www.revolution-computing.com
Disclaimer: I am NOT commercially related to REvolution, I just love R. I do hope the REvolution chaps spend a tiny bit of time improving the user GUI as well, not just for HPC purposes.
They recently released some new packages, free to the CRAN community, as well: 3 new packages for R designed to allow all R users to more quickly handle large, complex sets of data: iterators, foreach and doMC.
* iterators implements the “iterator” data structure familiar to users of languages like Java,
C# and Python to make it easy to program useful sequences – from all the prime numbers to the columns of a matrix or the rows of an external database.
* foreach builds on the “iterators” package to introduce a new way of programming loops in R. Unlike the traditional “for” loop, foreach runs multiple iterations simultaneously, in parallel. This makes loops run faster on a multi-core laptop, and enables distribution of large parallel-processing problems to multiple workstations in a cluster or in the cloud, without additional complicated programming. foreach works with parallel programming backends for R from the open-source and commercial domains.
* doMC is an open source parallel programming backend to enable parallel computation with “foreach” on Unix/Linux machines. It automatically enables foreach and iterator functions to work with the “multicore” package from R Core member Simon Urbanek.
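A minimal sketch of the iterators and foreach packages together, assuming both are installed from CRAN: the sequential %do% operator works without any parallel backend, and the commented lines show how registering doMC would switch the same loop to parallel execution.

```r
library(iterators)
library(foreach)

# An iterator walks a sequence one element at a time
it <- iter(c(2, 3, 5))
nextElem(it)  # 2
nextElem(it)  # 3

# foreach runs a loop and combines the results; %do% is sequential
squares <- foreach(i = 1:4, .combine = c) %do% i^2
squares  # 1 4 9 16

# On Unix/Linux, registering the doMC backend makes the same loop parallel:
# library(doMC)
# registerDoMC(cores = 2)
# squares <- foreach(i = 1:4, .combine = c) %dopar% i^2
```

The appeal is that the loop body does not change between the sequential and parallel versions; only the registered backend does.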
The new packages have been developed by REvolution Computing and released under open source licenses to the R community, benefiting all existing R users.
Here is a great presentation on Facebook Analytics using text mining.
Here is one of the new startups in India. A batch mate from B-school, to whom I owe too many beers and too few calculus notes, asked me to help him get votes. Treat this as shameless self-promotion, just like http://www.cerebralmastication.com/ ‘s moustache and R-rated R stats profanity on #rstats on Twitter.
Please do vote and read- they are a fun couple. http://www.greatdrivingchallenge.com/application/1245656268196502/

KXEN remains a GOLDEN sponsor.
| General Chair | John Elder (Elder Research, Inc.) Francoise Soulie Fogelman (KXEN) |
| generalchair@kdd2009.com |
I asked Francoise in her interview this March (http://www.decisionstats.com/2009/03/27/interview-franoise-soulie-fogelman-kxen/) about her views on data mining and how KXEN fits in, and here is an extract –
Ajay –What kind of hardware solutions go best with KXEN’s software. What are the other BI vendors that your offerings best complement with.
Françoise – KXEN software in general, and KSN in particular, runs on any platform. When using KSN to build decent-size graphs (with tens of millions of nodes and hundreds of millions of links, for example), a 64-bit architecture is required. A recent survey of KXEN customers shows that the BI suites used by our customers are mostly MicroStrategy and Business Objects (SAP). We also very much like to mention Advizor Solutions, which offers data visualization software already embedding KXEN technology.
Francoise is of course well versed to be talking on knowledge discovery and data mining – her credentials are kind of awe-inspiring.
Ms Soulie Fogelman has over 30 years of experience in data mining and CRM, from both an academic and a business perspective. Prior to KXEN, she directed the first French research team on neural networks at Paris 11 University, where she was a CS professor. She then co-founded Mimetics, a start-up that develops and sells development environments, optical character recognition (OCR) products, and services using neural network technology, and became its Chief Scientific Officer. After that she started the Data Mining and CRM group at Atos Origin and, most recently, she created and managed the CRM Agency for Business & Decision, a French IS company specializing in Business Intelligence and CRM.
Ms Soulie Fogelman holds a master’s degree in mathematics from Ecole Normale Superieure and a PhD in Computer Science from the University of Grenoble. She has advised over 20 PhD students on data mining, has authored more than 100 scientific papers and books, and has been an invited speaker at many academic and business events.
Disclaimer: I have been a KXEN client, user, and vendor.