Rapid Miner User Conference 2012

One of those cool conferences that are on my bucket list, this time in Hungary (that’s a nice place).

But I am especially interested in seeing how far Radoop has come along!

Disclaimer: Rapid Miner has been a Decisionstats.com sponsor for many years. It is also very cool software, but I like the R Extension facility even more!

And it is not very expensive compared to other user conferences in Europe!


Information about Registration

  • Early Bird registration until July 20th, 2012.
  • Normal registration from July 21st, 2012 until August 13th, 2012.
  • Latest registration from August 14th, 2012 until August 24th, 2012.
  • Students have to provide a valid Student ID during registration.
  • The Dinner is included in the All Days and in the Conference packages.
  • All prices below are net prices. Value added tax (VAT) has to be added if applicable.

Prices for Regular Visitors

Days and Event                         Early Bird Rate   Normal Rate   Latest Registration
(Training / Development 1)             190 Euro          230 Euro      280 Euro
Wednesday + Thursday                   290 Euro          350 Euro      420 Euro
(Training / Development 2 and Exam)    190 Euro          230 Euro      280 Euro
All Days (Full Package)                610 Euro          740 Euro      900 Euro

Prices for Authors and Students

In case of students, please note that you will have to provide a valid student ID during registration.

Days and Event                         Early Bird Rate   Normal Rate   Latest Registration
(Training / Development 1)             90 Euro           110 Euro      140 Euro
Wednesday + Thursday                   140 Euro          170 Euro      210 Euro
(Training / Development 2 and Exam)    90 Euro           110 Euro      140 Euro
All Days (Full Package)                290 Euro          350 Euro      450 Euro

Training / Workshop 1
Conference 1
Conference 2
Training / Workshop 2
09:00 – 10:30
Introductory Speech
Ingo Mierswa; Rapid-I 

Data Analysis


NeurophRM: Integration of the Neuroph framework into RapidMiner
Miloš Jovanović, Jelena Stojanović, Milan Vukićević, Vera Stojanović, Boris Delibašić (University of Belgrade)

To be announced (Invited Talk)
To be announced


Recommender Systems


Extending RapidMiner with Recommender Systems Algorithms
Matej Mihelčić, Nino Antulov-Fantulin, Matko Bošnjak, Tomislav Šmuc (Ruđer Bošković Institute)

Implementation of User Based Collaborative Filtering in RapidMiner
Sérgio Morais, Carlos Soares (Universidade do Porto)

Parallel Training / Workshop Session

Advanced Data Mining and Data Transformations


Development Workshop Part 2

10:30 – 12:30
Data Analysis

Nearest-Neighbor and Clustering based Anomaly Detection Algorithms for RapidMiner
Mennatallah Amer, Markus Goldstein (DFKI)

Customers’ LifeStyle Targeting on Big Data using Rapid Miner
Maksim Drobyshev (LifeStyle Marketing Ltd)

Robust GPGPU Plugin Development for RapidMiner
Andor Kovács, Zoltán Prekopcsák (Budapest University of Technology and Economics)


Image Mining Extension – Year After
Radim Burget, Václav Uher, Jan Mašek (Brno University of Technology)

Incorporating R Plots into RapidMiner Reports
Peter Jeszenszky (University of Debrecen)

An Octave Extension for RapidMiner
Sylvain Marié (Schneider Electric)

12:30 – 13:30
13:30 – 15:00
Parallel Training / Workshop Session

Basic Data Mining and Data Transformations


Development Workshop Part 1


Application of RapidMiner in Steel Industry Research and Development
Bengt-Henning Maas, Hakan Koc, Martin Bretschneider (Salzgitter Mannesmann Forschung)

A Comparison of Data-driven Models for Forecast River Flow
Milan Cisty, Juraj Bezak (Slovak University of Technology)

Portfolio Optimization Using Local Linear Regression Ensembles in Rapid Miner
Gábor Nagy, Tamás Henk, Gergő Barta (Budapest University of Technology and Economics)

Unstructured Data

Processing Data Streams with the RapidMiner Streams-Plugin
Christian Bockermann, Hendrik Blom (TU Dortmund)

Automated Creation of Corpuses for the Needs of Sentiment Analysis
Peter Koncz, Jan Paralic (Technical University of Kosice)




News from the Rapid-I Labs
Simon Fischer; Rapid-I

This short session demonstrates the latest developments from the Rapid-I lab and will show you how you can build powerful analysis processes and routines by using those RapidMiner tools.

Certification Exam
15:00 – 17:00
Book Presentation and Game Show

Data Mining for the Masses: A New Textbook on Data Mining for Everyone
Matthew North (Washington & Jefferson College)

Matthew North presents his new book “Data Mining for the Masses” introducing data mining to a broader audience and making use of RapidMiner for practical data mining problems.


Game Show
Did you miss last year’s game show “Who wants to be a data miner?”? Use RapidMiner for problems it was never created for and beat the time and other contestants!

User Support

Get some Coffee for free – Writing Operators with RapidMiner Beans
Christian Bockermann, Hendrik Blom (TU Dortmund)

Meta-Modeling Execution Times of RapidMiner operators
Matija Piškorec, Matko Bošnjak, Tomislav Šmuc (Ruđer Bošković Institute) 

Social Event (Conference Dinner)
Social Event (Visit of Bar District)


Training: Basic Data Mining and Data Transformations

This is a short introductory training course for users who are not yet familiar with RapidMiner or have only a little experience with it so far. The topics of this training session include

  • Basic Usage
    • User Interface
    • Creating and handling RapidMiner repositories
    • Starting a new RapidMiner project
    • Operators and processes
    • Loading data from flat files
    • Storing data, processes, and results
  • Predictive Models
    • Linear Regression
    • Naïve Bayes
    • Decision Trees
  • Basic Data Transformations
    • Changing names and roles
    • Handling missing values
    • Changing value types by discretization and dichotomization
    • Normalization and standardization
    • Filtering examples and attributes
  • Scoring and Model Evaluation
    • Applying models
    • Splitting data
    • Evaluation methods
    • Performance criteria
    • Visualizing Model Performance


Training: Advanced Data Mining and Data Transformations

This is a short introductory training course for users who already know some basic concepts of RapidMiner and data mining and have already used the software before, for example in the first training on Tuesday. The topics of this training session include

  • Advanced Data Handling
    • Sampling
    • Balancing data
    • Joins and Aggregations
    • Detection and removal of outliers
    • Dimensionality reduction
  • Control process execution
    • Remember process results
    • Recall process results
    • Loops
    • Using branches and conditions
    • Exception handling
    • Definition of macros
    • Usage of macros
    • Definition of log values
    • Clearing log tables
    • Transforming log tables to data


Development Workshop Part 1 and Part 2

Want to exchange ideas with the developers of RapidMiner? Or learn more tricks for developing your own operators and extensions? During our development workshops on Tuesday and Friday, we will build small groups of developers, each working on a small development project around RapidMiner. Beginners will get a comprehensive overview of the architecture of RapidMiner before taking their first steps and learning how to write their own operators. Advanced developers will form groups with our experienced developers, identify shortcomings of RapidMiner and develop a new extension which might already be presented during the conference. Unfinished work can be continued in the second workshop on Friday before results are published on the Marketplace or taken home as a starting point for new custom operators.

R for Business Analytics- Book by Ajay Ohri

So the cover art is ready, and if you are a reviewer, you can reserve online copies of the book I have been writing for the past 2 years. Special thanks to my mentors, detractors, readers and students: I owe you a beer!

You can also go here-



R for Business Analytics


Ohri, Ajay

2012, XVI, 300 p. 208 illus., 162 in color.


ISBN 978-1-4614-4342-1

Due: September 30, 2012


approx. 44,95 €
  • Covers full spectrum of R packages related to business analytics
  • Step-by-step instruction on the use of R packages, in addition to exercises, references, interviews and useful links
  • Background information and exercises are all applied to practical business analysis topics, such as code examples on web and social media analytics, data mining, clustering and regression models

R for Business Analytics looks at some of the most common tasks performed by business analysts and helps the user navigate the wealth of information in R and its 4000 packages.  With this information the reader can select the packages that can help process the analytical tasks with minimum effort and maximum usefulness. The use of Graphical User Interfaces (GUI) is emphasized in this book to further cut down and bend the famous learning curve in learning R. This book is aimed to help you kick-start with analytics including chapters on data visualization, code examples on web analytics and social media analytics, clustering, regression models, text mining, data mining models and forecasting. The book tries to expose the reader to a breadth of business analytics topics without burying the user in needless depth. The included references and links allow the reader to pursue business analytics topics.


This book is aimed at business analysts with basic programming skills for using R for Business Analytics. Note the scope of the book is neither statistical theory nor graduate level research for statistics, but rather it is for business analytics practitioners. Business analytics (BA) refers to the field of exploration and investigation of data generated by businesses. Business Intelligence (BI) is the seamless dissemination of information through the organization, which primarily involves business metrics both past and current for the use of decision support in businesses. Data Mining (DM) is the process of discovering new patterns from large data using algorithms and statistical methods. To differentiate between the three, BI is mostly current reports, BA is models to predict and strategize and DM matches patterns in big data. The R statistical software is the fastest growing analytics platform in the world, and is established in both academia and corporations for robustness, reliability and accuracy.

Content Level » Professional/practitioner

Keywords » Business Analytics – Data Mining – Data Visualization – Forecasting – GUI – Graphical User Interface – R software – Text Mining

Related subjects » Business, Economics & Finance – Computational Statistics – Statistics


Why R.- R Infrastructure.- R Interfaces.- Manipulating Data.- Exploring Data.- Building Regression Models.- Data Mining using R.- Clustering and Data Segmentation.- Forecasting and Time-Series Models.- Data Export and Output.- Optimizing your R Coding.- Additional Training Literature.- Appendix

Revolution R Enterprise 6.0 launched!

Just got the email. More software is good news!

Revolution R Enterprise 6.0 for 32-bit and 64-bit Windows and 64-bit Red Hat Enterprise Linux (RHEL 5.x and RHEL 6.x) features an updated release of the RevoScaleR package that provides fast, scalable data management and data analysis: the same code scales from data frames to local, high-performance .xdf files to data distributed across a Windows HPC Server cluster or IBM Platform Computing LSF cluster.  RevoScaleR also allows distribution of the execution of essentially any R function across cores and nodes, delivering the results back to the user.

Detailed information on what’s new in 6.0 and known issues:

And from the manual, lots of function goodies for Big Data:


  • IBM Platform LSF Cluster support [Linux only]. The new RevoScaleR function, RxLsfCluster, allows you to create a distributed compute context for the Platform LSF workload manager.
  • Azure Burst support added for Microsoft HPC Server [Windows only]. The new RevoScaleR function, RxAzureBurst, allows you to create a distributed compute context to have computations performed in the cloud using Azure Burst.
  • The rxExec function allows distributed execution of essentially any R function across cores and nodes, delivering the results back to the user.
  • The functions RxLocalParallel and RxLocalSeq allow you to create compute context objects for local parallel and local sequential computation, respectively.
  • RxForeachDoPar allows you to create a compute context using the currently registered foreach parallel backend (doParallel, doSNOW, doMC, etc.). To execute rxExec calls, simply register the parallel backend as usual, then set your compute context as follows: rxSetComputeContext(RxForeachDoPar())
  • rxSetComputeContext and rxGetComputeContext simplify management of compute contexts.
  • rxGlm provides a fast, scalable, distributable implementation of generalized linear models. This expands the list of full-featured high performance analytics functions already available: summary statistics (rxSummary), cubes and cross tabs (rxCube, rxCrossTabs), linear models (rxLinMod), covariance and correlation matrices (rxCovCor), binomial logistic regression (rxLogit), and k-means clustering (rxKmeans). Example: a Tweedie family with 1 million observations and 78 estimated coefficients (categorical data) took 17 seconds with rxGlm compared with 377 seconds for glm on a quad-core laptop.


And it is now easier to work with files from R’s big brother, the SAS language:


RevoScaleR high-performance analysis functions will now conveniently work directly with a variety of external data sources (delimited and fixed format text files, SAS files, SPSS files, and ODBC data connections). New functions are provided to create data source objects to represent these data sources (RxTextData, RxOdbcData, RxSasData, and RxSpssData), which in turn can be specified for the ‘data’ argument for these RevoScaleR analysis functions: rxHistogram, rxSummary, rxCube, rxCrossTabs, rxLinMod, rxCovCor, rxLogit, and rxGlm.


You can analyze a SAS file directly as follows:

    # Create a SAS data source with information about variables and
    # rows to read in each chunk
    sasDataFile <- file.path(rxGetOption("sampleDataDir"), "claims.sas7bdat")
    sasDS <- RxSasData(sasDataFile, stringsAsFactors = TRUE,
                       colClasses = c(RowNum = "integer"), rowsPerRead = 50)

    # Compute and draw a histogram directly from the SAS file
    rxHistogram(~cost | type, data = sasDS)
    # Compute summary statistics
    rxSummary(~., data = sasDS)
    # Estimate a linear model
    linModObj <- rxLinMod(cost ~ age + car_age + type, data = sasDS)
    # Import a subset into a data frame for further inspection
    subData <- rxImport(inData = sasDS, rowSelection = cost > 400,
                        varsToKeep = c("cost", "age", "type"))


The installation instructions and instructions for getting started with Revolution R Enterprise & RevoDeployR for Windows: http://www.revolutionanalytics.com/downloads/instructions/windows.php

Interview Alvaro Tejada Galindo, SAP Labs Montreal, Using SAP Hana with #Rstats

Here is a brief interview with Alvaro Tejada Galindo, aka Blag, who is a developer working with SAP Hana and R at SAP Labs, Montreal. SAP Hana is SAP’s latest offering in BI; it is also a database and a computing environment. As per preliminary use cases, using R and HANA together on the cloud can give major productivity gains in terms of both speed and analytical ability.

Ajay- Describe how you got involved with databases and R language.
Blag- I used to work as an ABAP Consultant for 11 years, but I have also been involved with programming for the last 13 years, so I was in touch with SQLServer, Oracle, MySQL and SQLite. When I joined SAP, I heard that SAP HANA was going to use a statistical programming language called “R”. The next day I started my “R” learning.

Ajay- What made the R language a fit for SAP HANA? Did you consider other languages? What is your view on the Julia/Python/SPSS/SAS/Matlab languages?

Blag- I think “R” is a must for SAP HANA. As the fastest database in the market, we needed a language that could help us shape the data in the best possible way. “R” filled that purpose very well. Right now, “R” is not the only language as “L” can be used as well (http://wiki.tcl.tk/17068) …not forgetting “SQLScript” which is our own version of SQL (http://goo.gl/x3bwh) . I have to admit that I tried Julia, but couldn’t manage to make it work. Regarding Python, it’s an interesting question as I’m going to blog about Python and SAP HANA soon. About Matlab, SPSS and SAS I haven’t used them, so I got nothing to say there.

Ajay- What is your view on some of the limitations of R that can be overcome by using it with SAP HANA?

Blag-  I think mostly the ability of SAP HANA to work with big data. Again, SAP HANA and “R” can work very nicely together and achieve things that weren’t possible before.

Ajay- Have you considered other vendors of R, including working with RStudio, Revolution Analytics, and even Oracle R Enterprise?

Blag-  I’m not really part of the SAP HANA or the R groups inside SAP, so I can’t really comment on that. I can only say that I use RStudio every time I need to do something with R. Regarding Oracle…I don’t think so…but they can use any of our products whenever they want.

Ajay- Do you have a case study on an actual usage of R with SAP HANA that led to great results?

Blag- Right now the use of “R” and SAP HANA is very preliminary; I don’t think many people have started working on it. But as an example that it works, you can check this awesome blog entry from my friend Jitender Aswani, “Big Data, R and HANA: Analyze 200 Million Data Points and Later Visualize Using Google Maps” (http://allthingsr.blogspot.com/#!/2012/04/big-data-r-and-hana-analyze-200-million.html)

Ajay- Does your group in SAP plan to give back to the R ecosystem by attending conferences like UseR 2012, sponsoring meets, package development, etc.?

Blag- My group is in charge of everything developers, so sure, we’re planning to get more in touch with R developers and their ecosystem. Not sure how we’re going to deal with it, but at least I’m going to get myself involved in the Montreal R Group.




Name: Alvaro Tejada Galindo
Email: a.tejada.galindo@sap.com
Profession: Development
Company: SAP Canada Labs-Montreal
Town/City: Montreal
Country: Canada
Instant Messaging Type: Twitter
Instant Messaging ID: Blag
Personal URL: http://blagrants.blogspot.com
Professional Blog URL: http://www.sdn.sap.com/irj/scn/weblogs?blog=/pub/u/252210910
My Relation to SAP: employee
Short Bio: Development Expert for the Technology Innovation and Developer Experience team. Used to be an ABAP Consultant for the last 11 years. Addicted to programming since 1997.


and from


SAP HANA is SAP AG’s implementation of in-memory database technology. There are four components within the software group:[1]

  • SAP HANA DB (or HANA DB) refers to the database technology itself,
  • SAP HANA Studio refers to the suite of tools provided by SAP for modeling,
  • SAP HANA Appliance refers to HANA DB as delivered on partner certified hardware (see below) as an appliance. It also includes the modeling tools from HANA Studio as well as replication and data transformation tools to move data into HANA DB,[2]
  • SAP HANA Application Cloud refers to the cloud based infrastructure for delivery of applications (typically existing SAP applications rewritten to run on HANA).

R is integrated in HANA DB via TCP/IP. HANA uses SQL-SHM, a shared memory-based data exchange to incorporate R’s vertical data structure. HANA also introduces R scripts equivalent to native database operations like join or aggregation.[20] HANA developers can write R scripts in SQL and the types are automatically converted in HANA. R scripts can be invoked with HANA tables as both input and output in the SQLScript. R environments need to be deployed to use R within SQLScript.

More blog posts on using SAP and R together

Dealing with R and HANA

R meets HANA


HANA meets R

When SAP HANA met R – First kiss




SAP HANA: My experiences on using SAP HANA with R


and of course the blog that started it all-

Jitender Aswani’s http://allthingsr.blogspot.in/



Data Quality in R #rstats

Many data formats cause problems when you import them into your statistical software. Statistical software cannot tell that $1,000, 1000%, 1,000 and 1000 are all numbers: by default it will treat the first three as character variables and only the fourth as a numeric variable. This issue is further compounded by the numerous ways we can represent date-time variables.
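As a quick check of that claim, here is a minimal sketch in base R. The four sample strings and the names given to them are only illustrative. type.convert() is the base R function that read.table() and read.csv() rely on to guess column types, so it shows what an import would decide for each value.

```r
# Four ways the same quantity might show up in an input file (illustrative values).
raw <- c(dollars = "$1,000", percent = "1000%", commas = "1,000", plain = "1000")

# Ask R which type it would guess for each value on import.
guessed <- sapply(raw, function(x) class(type.convert(x, as.is = TRUE)))
print(guessed)
```

Only the plain 1000 is recognised as a number (R reports it as integer); the other three stay character until we clean them up.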

The good thing is that for specific domains like finance and web analytics, even these weird data input formats are fixed, so we can put together a list of handy data quality conversion functions in R for reference.


After much muddling about with converting internet formats (or data used in web analytics), mostly time formats without a date like 00:35:23, into numeric data frame formats, I found that the way to handle Date-Time conversions in R is

Dataset$Var2= strptime(as.character(Dataset$Var1),"%H:%M:%S")

The problem with this approach is that you get the value in a date-time format (by default R adds today’s date in front of the 04:00:45), while you are interested only in time durations (4:00:45, or actually just the equivalent in seconds).

This can be handled using the as.difftime function,

or by converting further to purely numeric values so we can do numeric analysis (like summary).

(Maybe there is a more elegant way here, but I don’t know.)

This kind of data is what we usually get in web analytics, for example average time on site.
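Here is one way the duration conversion could look, as a sketch rather than the only way to do it. The data frame and its column names (TimeOnSite, Duration, Seconds) are made up for illustration, and the values are assumed to be in HH:MM:SS form.

```r
# Hypothetical "time on site" values in HH:MM:SS form.
Dataset <- data.frame(TimeOnSite = c("00:35:23", "04:00:45", "00:00:07"),
                      stringsAsFactors = FALSE)

# as.difftime() reads the strings as durations, so no calendar date is attached.
Dataset$Duration <- as.difftime(Dataset$TimeOnSite, format = "%H:%M:%S",
                                units = "secs")

# Drop the difftime class to get plain numbers for summary(), mean() and friends.
Dataset$Seconds <- as.numeric(Dataset$Duration, units = "secs")

print(Dataset)
print(summary(Dataset$Seconds))
```

The three sample values come out as 2123, 14445 and 7 seconds.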







For factor variables, convert through character first:

Dataset$Var2= as.numeric(as.character(Dataset$Var1))

or

Dataset$Var2= as.numeric(paste(Dataset$Var1))


A slight problem: suppose there is data like 1,504. It will be converted to NA instead of 1504.

The way to solve this is to use the nice gsub function ONLY on that variable. Since the comma is also the most commonly used delimiter, you don’t want to replace all the commas, only the ones in that variable.
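A minimal sketch of the comma fix, using a made-up data frame with the same Var1/Var2 naming as above. The gsub call is applied to that single column only, so commas elsewhere in the data are left alone.

```r
# Hypothetical column that was read in as text because of the thousands separators.
Dataset <- data.frame(Var1 = c("1,504", "27", "1,000,250"),
                      stringsAsFactors = FALSE)

# Direct conversion fails for the values with commas (they become NA).
direct <- suppressWarnings(as.numeric(Dataset$Var1))

# Remove the commas in this one variable, then convert.
Dataset$Var2 <- as.numeric(gsub(",", "", Dataset$Var1, fixed = TRUE))

print(direct)
print(Dataset)
```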




Now let’s assume we have data in the form of percentages like 0.00%, 1.23%, 3.5%.

Again we use the gsub function to replace the % character in the string with an empty string (nothing).
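The percent case can be sketched the same way. The sample values are the ones from the text; dividing by 100 at the end is my own addition, for when you want fractions rather than percentage points.

```r
# Percentages as they typically arrive from a web analytics export.
pct <- c("0.00%", "1.23%", "3.5%")

# Strip the % sign, then convert to numbers (still in percentage points).
value <- as.numeric(gsub("%", "", pct, fixed = TRUE))

# Optional: express the same values as fractions.
fraction <- value / 100

print(value)
print(fraction)
```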





If you simply do the following for a factor variable, it will give you the internal level code, not the value. This can create errors when you are reading in CSV data, which may be read as the character or factor data type.

Dataset$Var2= as.numeric(Dataset$Var1)
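To see the difference between level codes and values, here is a small sketch with a made-up factor. The variable names are only for illustration.

```r
# A hypothetical numeric column that was read in as a factor.
Var1 <- factor(c("10", "200", "35", "200"))

# Converting the factor directly returns the level codes.
codes <- as.numeric(Var1)

# Going through character first returns the actual values.
values <- as.numeric(as.character(Var1))

print(levels(Var1))
print(codes)   # 1 2 3 2
print(values)  # 10 200 35 200
```

The levels are sorted as text ("10", "200", "35"), which is why 35 gets code 3 even though it is smaller than 200.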

An additional way is to use substring (using substr) and concatenate (using paste) for manipulating string/character variables.


iris$sp=substr(iris$Species,1,3) will reduce the famous Iris species names to three letters each, without losing any analytical value.
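A short sketch that puts substr and paste together on the built-in iris data. The row labels built here are just an illustration of concatenation, not something the analysis needs.

```r
# Work on the species names as plain character strings.
species <- as.character(iris$Species)

# Keep only the first three letters of each name.
short <- substr(species, 1, 3)

# Concatenate the short code with the row number to build a label.
label <- paste(short, seq_along(short), sep = "_")

print(unique(short))
print(head(label, 3))
```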

The other issue is missing values. na.rm=T helps with getting summaries of numeric variables that have missing values, but we need to investigate further how suitable the na.omit functions are for domains which have large amounts of missing data that need to be treated.
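A small sketch of the trade-off, with made-up data. na.rm = TRUE skips missing values inside one summary, while na.omit drops every row that has any missing value, which can throw away a lot of data.

```r
# Hypothetical numeric variable with missing values.
x <- c(12, NA, 7, 30, NA)

plain_mean <- mean(x)                # NA, because of the missing values
clean_mean <- mean(x, na.rm = TRUE)  # mean of 12, 7 and 30

# Hypothetical data frame with missing values in different columns.
df <- data.frame(a = c(1, NA, 3, 4),
                 b = c("u", "v", NA, "x"),
                 stringsAsFactors = FALSE)

# na.omit() keeps only the rows that are complete in every column.
complete <- na.omit(df)

# Share of missing values per column, worth checking before dropping rows.
share_missing <- colMeans(is.na(df))

print(clean_mean)
print(nrow(complete))
print(share_missing)
```

Here each column is missing only one value in four, but na.omit removes half of the rows.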



BigML meets R #rstats

I am just checking out the nice new R package created by BigML.com co-founder Justin Donaldson. The name of the new package is bigml, which can confuse a bit since there are many packages in R whose names start with big (including biglm).

The bigml package is available at CRAN http://cran.r-project.org/web/packages/bigml/index.html

I just tweaked the code given at http://blog.bigml.com/2012/05/10/r-you-ready-for-bigml/ to include the ssl authentication code at http://www.brocktibert.com/blog/2012/01/19/358/

so it goes

> library(bigml)
Loading required package: RJSONIO
Loading required package: RCurl
Loading required package: bitops
Loading required package: plyr
> setCredentials("bigml_username","API_key")

# download the file needed for authentication
download.file(url="http://curl.haxx.se/ca/cacert.pem", destfile="cacert.pem")

# set the curl options
curl <- getCurlHandle()
options(RCurlOptions = list(capath = system.file("CurlSSL", "cacert.pem",
package = "RCurl"),
ssl.verifypeer = FALSE))
curlSetOpt(.opts = list(proxy = 'proxyserver:port'), curl = curl)

> iris.model = quickModel(iris, objective_field = 'Species')

Of course there are lots of goodies added here, so read the post yourself at http://blog.bigml.com/2012/05/10/r-you-ready-for-bigml/

Incidentally, the author of this R package (bigml), Justin Donaldson, who goes by the name sudojudo at http://twitter.com/#!/sudojudo, has also recently authored two other R packages: tsne at http://cran.r-project.org/web/packages/tsne/index.html (tsne: T-distributed Stochastic Neighbor Embedding for R (t-SNE), a “pure R” implementation of the t-SNE algorithm) and a GUI toolbar at http://cran.r-project.org/web/packages/sculpt3d/index.html (sculpt3d is a GTK+ toolbar that allows for more interactive control of a dataset inside the RGL plot window. Controls for simple brushing, highlighting, labeling, and mouseMode changes are provided by point-and-click rather than through the R terminal interface).

This, along with the fact that their recently released Python bindings for bigml.com were among the top news items at Hacker News, shows that BigML.com is going for some traction in bringing cloud computing, better software interfaces and data mining together!
