New Plotters in Rapid Miner 5.2

I almost missed this because of my vacation and traveling

Rapid Miner has a tonne of new stuff (Statuary Ethics Declaration- Rapid Miner has been an advertising partner for Decisionstats – see the right margin)

see

http://rapid-i.com/component/option,com_myblog/Itemid,172/lang,en/

Great New Graphical Plotters

and some flashy work

and a great series of educational lectures

A Simple Explanation of Decision Tree Modeling based on Entropies

Link: http://www.simafore.com/blog/bid/94454/A-simple-explanation-of-how-entropy-fuels-a-decision-tree-model

Description of some of the basics of decision trees. Simple and hardly any math, I like the plots explaining the basic idea of the entropy as splitting criterion (although we actually calculate gain ratio differently than explained…)

Logistic Regression for Business Analytics using RapidMiner

Link: http://www.simafore.com/blog/bid/57924/Logistic-regression-for-business-analytics-using-RapidMiner-Part-2

Same as above, but this time for modeling with logistic regression.
Easy to read and covering all basic ideas together with some examples. If you are not familiar with the topic yet, part 1 (see below) might help.

Part 1 (Basics): http://www.simafore.com/blog/bid/57801/Logistic-regression-for-business-analytics-using-RapidMiner-Part-1

Deploy Model: http://www.simafore.com/blog/bid/82024/How-to-deploy-a-logistic-regression-model-using-RapidMiner

Advanced Information: http://www.simafore.com/blog/bid/99443/Understand-3-critical-steps-in-developing-logistic-regression-models

and lastly a new research project for collaborative data mining

http://www.e-lico.eu/

e-LICO Architecture and Components

The goal of the e-LICO project is to build a virtual laboratory for interdisciplinary collaborative research in data mining and data-intensive sciences. The proposed e-lab will comprise three layers: the e-science and data mining layers will form a generic research environment that can be adapted to different scientific domains by customizing the application layer.

  1. Drag a data set into one of the slots. It will be automatically detected as training data, test data or apply data, depending on whether it has a label or not.
  2. Select a goal. The most frequent one is probably “Predictive Modelling”. All goals have comments, so you see what they can be used for.
  3. Select “Fetch plans” and wait a bit to get a list of processes that solve your problem. Once the planning completes, select one of the processes (you can see a preview at the right) and run it. Alternatively, select multiple (selecting none means selecting all) and evaluate them on your data in a batch.

The assistant strives to generate processes that are compatible with your data. To do so, it performs a lot of clever operations, e.g., it automatically replaces missing values if missing values exist and this is required by the learning algorithm or performs a normalization when using a distance-based learner.

You can install the extension directly by using the Rapid-I Marketplace instead of the old update server. Just go to the preferences and enter http://rapidupdate.de:8180/UpdateServer as the update URL

Of course Rapid Miner has been of the most professional open source analytics company and they have been doing it for a long time now. I am particularly impressed by the product map (see below) and the graphical user interface.

http://rapid-i.com/content/view/186/191/lang,en/

Product Map

Just click on the products in the overview below in order to get more information about Rapid-I products.

 

Rapid-I Product Overview 

 

A Brief Overview of Open vs Closed in Computing

1984 – IBM   (Big Brother) vs Apple  (Computing opened for individuals)

1988- Apple (Closed Hardware and Software) vs Microsoft (  Licensed to all software)

1998- Microsoft (Source code is closed but licenses to all) vs Linux (Open Source Code)

2008- Apple (Closed Hardware and Software) vs Google (Android/Linux) -(Free and Open Source)

2010 – Google (Web open to search) vs Facebook (Closed to search)

2018 (?)-Google (Code is open for all non revenue generating software, but search engine algorithm is closed) VS       TBD

Interview Rapid-I -Ingo Mierswa and Simon Fischer

Here is an interview with Dr Ingo Mierswa , CEO of Rapid -I and Dr Simon Fischer, Head R&D. Rapid-I makes the very popular software Rapid Miner – perhaps one of the earliest leading open source software in business analytics and business intelligence. It is quite easy to use, deploy and with it’s extensions and innovations (including compatibility with R )has continued to grow tremendously through the years.

In an extensive interview Ingo and Simon talk about algorithms marketplace, extensions , big data analytics, hadoop, mobile computing and use of the graphical user interface in analytics.

Special Thanks to Nadja from Rapid I communication team for helping coordinate this interview.( Statuary Blogging Disclosure- Rapid I is a marketing partner with Decisionstats as per the terms in https://decisionstats.com/privacy-3/)

Ajay- Describe your background in science. What are the key lessons that you have learnt while as scientific researcher and what advice would you give to new students today.

Ingo: My time as researcher really was a great experience which has influenced me a lot. I have worked at the AI lab of Prof. Dr. Katharina Morik, one of the persons who brought machine learning and data mining to Europe. Katharina always believed in what we are doing, encouraged us and gave us the space for trying out new things. Funnily enough, I never managed to use my own scientific results in any real-life project so far but I consider this as a quite common gap between science and the “real world”. At Rapid-I, however, we are still heavily connected to the scientific world and try to combine the best of both worlds: solving existing problems with leading-edge technologies.

Simon: In fact, during my academic career I have not worked in the field of data mining at all. I worked on a field some of my colleagues would probably even consider boring, and that is theoretical computer science. To be precise, my research was in the intersection of game theory and network theory. During that time, I have learnt a lot of exciting things, none of which had any business use. Still, I consider that a very valuable experience. When we at Rapid-I hire people coming to us right after graduating, I don’t care whether they know the latest technology with a fancy three-letter acronym – that will be forgotten more quickly than it came. What matters is the way you approach new problems and challenges. And that is also my recommendation to new students: work on whatever you like, as long as you are passionate about it and it brings you forward.

Ajay-  How is the Rapid Miner Extensions marketplace moving along. Do you think there is a scope for people to say create algorithms in a platform like R , and then offer that algorithm as an app for sale just like iTunes or Android apps.

 Simon: Well, of course it is not going to be exactly like iTunes or Android apps are, because of the more business-orientated character. But in fact there is a scope for that, yes. We have talked to several developers, e.g., at our user conference RCOMM, and several people would be interested in such an opportunity. Companies using data mining software need supported software packages, not just something they downloaded from some anonymous server, and that is only possible through a platform like the new Marketplace. Besides that, the marketplace will not only host commercial extensions. It is also meant to be a platform for all the developers that want to publish their extensions to a broader community and make them accessible in a comfortable way. Of course they could just place them on their personal Web pages, but who would find them there? From the Marketplace, they are installable with a single click.

Ingo: What I like most about the new Rapid-I Marketplace is the fact that people can now get something back for their efforts. Developing a new algorithm is a lot of work, in some cases even more that developing a nice app for your mobile phone. It is completely accepted that people buy apps from a store for a couple of Dollars and I foresee the same for sharing and selling algorithms instead of apps. Right now, people can already share algorithms and extensions for free, one of the next versions will also support selling of those contributions. Let’s see what’s happening next, maybe we will add the option to sell complete RapidMiner workflows or even some data pools…

Ajay- What are the recent features in Rapid Miner that support cloud computing, mobile computing and tablets. How do you think the landscape for Big Data (over 1 Tb ) is changing and how is Rapid Miner adapting to it.

Simon: These are areas we are very active in. For instance, we have an In-Database-Mining Extension that allows the user to run their modelling algorithms directly inside the database, without ever loading the data into memory. Using analytic databases like Vectorwise or Infobright, this technology can really boost performance. Our data mining server, RapidAnalytics, already offers functionality to send analysis processes into the cloud. In addition to that, we are currently preparing a research project dealing with data mining in the cloud. A second project is targeted towards the other aspect you mention: the use of mobile devices. This is certainly a growing market, of course not for designing and running analyses, but for inspecting reports and results. But even that is tricky: When you have a large screen you can display fancy and comprehensive interactive dashboards with drill downs and the like. On a mobile device, that does not work, so you must bring your reports and visualizations very much to the point. And this is precisely what data mining can do – and what is hard to do for classical BI.

Ingo: Then there is Radoop, which you may have heard of. It uses the Apache Hadoop framework for large-scale distributed computing to execute RapidMiner processes in the cloud. Radoop has been presented at this year’s RCOMM and people are really excited about the combination of RapidMiner with Hadoop and the scalability this brings.

 Ajay- Describe the Rapid Miner analytics certification program and what steps are you taking to partner with academic universities.

Ingo: The Rapid-I Certification Program was created to recognize professional users of RapidMiner or RapidAnalytics. The idea is that certified users have demonstrated a deep understanding of the data analysis software solutions provided by Rapid-I and how they are used in data analysis projects. Taking part in the Rapid-I Certification Program offers a lot of benefits for IT professionals as well as for employers: professionals can demonstrate their skills and employers can make sure that they hire qualified professionals. We started our certification program only about 6 months ago and until now about 100 professionals have been certified so far.

Simon: During our annual user conference, the RCOMM, we have plenty of opportunities to talk to people from academia. We’re also present at other conferences, e.g. at ECML/PKDD, and we are sponsoring data mining challenges and grants. We maintain strong ties with several universities all over Europe and the world, which is something that I would not want to miss. We are also cooperating with institutes like the ITB in Dublin during their training programmes, e.g. by giving lectures, etc. Also, we are leading or participating in several national or EU-funded research projects, so we are still close to academia. And we offer an academic discount on all our products 🙂

Ajay- Describe the global efforts in making Rapid Miner a truly international software including spread of developers, clients and employees.

Simon: Our clients already are very international. We have a partner network in America, Asia, and Australia, and, while I am responding to these questions, we have a training course in the US. Developers working on the core of RapidMiner and RapidAnalytics, however, are likely to stay in Germany for the foreseeable future. We need specialists for that, and it would be pointless to spread the development team over the globe. That is also owed to the agile philosophy that we are following.

Ingo: Simon is right, Rapid-I already is acting on an international level. Rapid-I now has more than 300 customers from 39 countries in the world which is a great result for a young company like ours. We are of course very strong in Germany and also the rest of Europe, but also concentrate on more countries by means of our very successful partner network. Rapid-I continues to build this partner network and to recruit dynamic and knowledgeable partners and in the future. However, extending and acting globally is definitely part of our strategic roadmap.

Biography

Dr. Ingo Mierswa is working as Chief Executive Officer (CEO) of Rapid-I. He has several years of experience in project management, human resources management, consulting, and leadership including eight years of coordinating and leading the multi-national RapidMiner developer team with about 30 developers and contributors world-wide. He wrote his Phd titled “Non-Convex and Multi-Objective Optimization for Numerical Feature Engineering and Data Mining” at the University of Dortmund under the supervision of Prof. Morik.

Dr. Simon Fischer is heading the research & development at Rapid-I. His interests include game theory and networks, the theory of evolutionary algorithms (e.g. on the Ising model), and theoretical and practical aspects of data mining. He wrote his PhD in Aachen where he worked in the project “Design and Analysis of Self-Regulating Protocols for Spectrum Assignment” within the excellence cluster UMIC. Before, he was working on the vtraffic project within the DFG Programme 1126 “Algorithms for large and complex networks”.

http://rapid-i.com/content/view/181/190/ tells you more on the various types of Rapid Miner licensing for enterprise, individual and developer versions.

(Note from Ajay- to receive an early edition invite to Radoop, click here http://radoop.eu/z1sxe)

 

Analytics 2011 Conference

From http://www.sas.com/events/analytics/us/

The Analytics 2011 Conference Series combines the power of SAS’s M2010 Data Mining Conference and F2010 Business Forecasting Conference into one conference covering the latest trends and techniques in the field of analytics. Analytics 2011 Conference Series brings the brightest minds in the field of analytics together with hundreds of analytics practitioners. Join us as these leading conferences change names and locations. At Analytics 2011, you’ll learn through a series of case studies, technical presentations and hands-on training. If you are in the field of analytics, this is one conference you can’t afford to miss.

Conference Details

October 24-25, 2011
Grande Lakes Resort
Orlando, FL

Analytics 2011 topic areas include:

RapidMiner launches extensions marketplace

For some time now, I had been hoping for a place where new package or algorithm developers get at least a fraction of the money that iPad or iPhone application developers get. Rapid Miner has taken the lead in establishing a marketplace for extensions. Is there going to be paid extensions as well- I hope so!!

This probably makes it the first “app” marketplace in open source and the second app marketplace in analytics after salesforce.com

It is hard work to think of new algols, and some of them can really be usefull.

Can we hope for #rstats marketplace where people downloading say ggplot3.0 atleast get a prompt to donate 99 cents per download to Hadley Wickham’s Amazon wishlist. http://www.amazon.com/gp/registry/1Y65N3VFA613B

Do you think it is okay to pay 99 cents per iTunes song, but not pay a cent for open source software.

I dont know- but I am just a capitalist born in a country that was socialist for the first 13 years of my life. Congratulations once again to Rapid Miner for innovating and leading the way.

http://rapid-i.com/component/option,com_myblog/show,Rapid-I-Marketplace-Launched.html/Itemid,172

RapidMinerMarketplaceExtensions 30 May 2011
Rapid-I Marketplace Launched by Simon Fischer

Over the years, many of you have been developing new RapidMiner Extensions dedicated to a broad set of topics. Whereas these extensions are easy to install in RapidMiner – just download and place them in the plugins folder – the hard part is to find them in the vastness that is the Internet. Extensions made by ourselves at Rapid-I, on the other hand,  are distributed by the update server making them searchable and installable directly inside RapidMiner.

We thought that this was a bit unfair, so we decieded to open up the update server to the public, and not only this, we even gave it a new look and name. The Rapid-I Marketplace is available in beta mode at http://rapidupdate.de:8180/ . You can use the Web interface to browse, comment, and rate the extensions, and you can use the update functionality in RapidMiner by going to the preferences and entering http://rapidupdate.de:8180/UpdateServer/ as the update server URL. (Once the beta test is complete, we will change the port back to 80 so we won’t have any firewall problems.)

As an Extension developer, just register with the Marketplace and drop me an email (fischer at rapid-i dot com) so I can give you permissions to upload your own extension. Upload is simple provided you use the standard RapidMiner Extension build process and will boost visibility of your extension.

Looking forward to see many new extensions there soon!

Disclaimer- Decisionstats is a partner of Rapid Miner. I have been liking the software for a long long time, and recently agreed to partner with them just like I did with KXEN some years back, and with Predictive AnalyticsConference, and Aster Data until last year.

I still think Rapid Miner is a very very good software,and a globally created software after SAP.

Here is the actual marketplace

http://rapidupdate.de:8180/UpdateServer/faces/index.xhtml

Welcome to the Rapid-I Marketplace Public Beta Test

The Rapid-I Marketplace will soon replace the RapidMiner update server. Using this marketplace, you can share your RapidMiner extensions and make them available for download by the community of RapidMiner users. Currently, we are beta testing this server. If you want to use this server in RapidMiner, you must go to the preferences and enter http://rapidupdate.de:8180/UpdateServer for the update url. After the beta test, we will change the port back to 80, which is currently occupied by the old update server. You can test the marketplace as a user (downloading extensions) and as an Extension developer. If you want to publish your extension here, please let us know via the contact form.

Hot Downloads
«« « 1 2 3 » »»
[Icon]The Image Processing Extension provides operators for handling image data. You can extract attributes describing colour and texture in the image, you can make several transformation of a image data which allows you to perform segmentation and detection of suspicious areas in image data.The extension provides many of image transformation and extraction operators ranging from Wavelet Decomposition, Hough Circle to Block Difference of Inverse probabilities.

[Icon]RapidMiner is unquestionably the world-leading open-source system for data mining. It is available as a stand-alone application for data analysis and as a data mining engine for the integration into own products. Thousands of applications of RapidMiner in more than 40 countries give their users a competitive edge.

  • Data IntegrationAnalytical ETLData Analysis, and Reporting in one single suite
  • Powerful but intuitive graphical user interface for the design of analysis processes
  • Repositories for process, data and meta data handling
  • Only solution with meta data transformation: forget trial and error and inspect results already during design time
  • Only solution which supports on-the-fly error recognition and quick fixes
  • Complete and flexible: Hundreds of data loading, data transformation, data modeling, and data visualization methods
[Icon]All modeling methods and attribute evaluation methods from the Weka machine learning library are available within RapidMiner. After installing this extension you will get access to about 100 additional modelling schemes including additional decision trees, rule learners and regression estimators.This extension combines two of the most widely used open source data mining solutions. By installing it, you can extend RapidMiner to everything what is possible with Weka while keeping the full analysis, preprocessing, and visualization power of RapidMiner.

[Icon]Finally, the two most widely used data analysis solutions – RapidMiner and R – are connected. Arbitrary R models and scripts can now be directly integrated into the RapidMiner analysis processes. The new R perspective offers the known R console together with the great plotting facilities of R. All variables and R scripts can be organized in the RapidMiner Repository.A directly included online help and multi-line editing makes the creation of R scripts much more comfortable.

Interview- Top Data Mining Blogger on Earth , Sandro Saitta

Surajustement Modèle 2
Image via Wikipedia

If you do a Google search for Data Mining Blog- for the past several years one Blog will come on top. data mining blog – Google Search http://bit.ly/kEdPlE

To honor 5 years of Sandro Saitta’s blog (yes thats 5 years!) , we cover an exclusive interview with him where he reveals his unique sauce for cool techie blogging.

Ajay- Describe your journey as a scientist and data miner, from early experiences, to schooling to your work/research/blogging.

Sandro- My first experience with data mining was my master project. I used decision tree to predict pollen concentration for the following week using input data such as wind, temperature and rain. The fact that an algorithm can make a computer learn from experience was really amazing to me. I found it so interesting that I started a PhD in data mining. This time, the field of application was civil engineering. Civil engineers put a lot of sensors on their structure in order to understand how they behave. With all these sensors they generate a lot of data. To interpret these data, I used data mining techniques such as feature selection and clustering. I started my blog, Data Mining Research, during my PhD, to share with other researchers.

I then started applying data mining in the stock market as my first job in industry. I realized the difference between image recognition, where 99% correct classification rate is state of the art, and stock market, where you’re happy with 55%. However, the company ambiance was not as good as I thought, so I moved to consulting. There, I applied data mining in behavioral targeting to increase click-through rates. When you compare the number of customers who click with the ones who don’t, then you really understand what class imbalance mean. A few months ago, I accepted a very good opportunity at SICPA. I’m looking forward to resolving new challenges there.

Ajay- Your blog is the top ranked blog for “data mining blog”. Could you share some tips on better blogging for analytics and technical people

Sandro- It’s always difficult to start a blog, since at the beginning you have no reader. Writing for nobody may seem stupid, but it is not. By writing my first posts during my PhD I was reorganizing my ideas. I was expressing concepts which were not always clear to me. I thus learned a lot and also improved my English level. Of course, it’s still not perfect, but I hope most people can understand me.

Next come the readers. A few dozen each week first. To increase this number, I then started to learn SEO (Search Engine Optimization) by reading books and blogs. I tested many techniques that increased Data Mining Research visibility in the blogosphere. I think SEO is interesting when you already have some content published (which means not at the very beginning of your blog). After a while, once your blog is nicely ranked, the main task is to work on the content of the blog. To be of interest, your content must be particular: original, informative or provocative for example. I also had the chance to have a good visibility thanks to well-known people in the field like Kevin Hillstrom, Gregory Piatetsky-Shapiro, Will Dwinnell / Dean Abbott, Vincent Granville, Matthew Hurst and many others.

Ajay- Whats your favorite statistical software and what are the various softwares that you have worked with.
Could you compare and contrast these software as well.

Sandro- My favorite software at this point is SAS. I worked with it for two years. Once you know the language, you can perform ETL and data mining so easily. It’s also very fast compared to others. There are a lot of tools for data mining, but I cannot think of a tool that is as powerful as SAS and, in the same time, has a high-level programming language behind it.

I also worked with R and Matlab. R is very nice since you have all the up-to-date data mining algorithms implemented. However, working in the memory is not always a good choice, especially for ETL. Matlab is an excellent tool for prototyping. It’s not so fast and certainly not done for ETL, but the price is low regarding all the possibilities for data mining. According to me, SAS is the best choice for ETL and a good choice for data mining. Of course, there is the price.

Ajay- What are your favorite techniques and training resources for learning basics of data mining to say statisticians or business management graduates.

Sandro- I’m the kind of guy who likes to read books. I read data mining books one after the other. The fact that the same concepts are explained differently (and by different people) helps a lot in learning a topic like data mining. Of course, nothing replaces experience in the field. You can read hundreds of books, you will still not be a good practitioner until you really apply data mining in specific fields. My second choice after books is blogs. By reading data mining blogs, you will really see the issues and challenges in the field. It’s still not experience, but we are closer. Finally, web resources and networks such as KDnuggets of course, but also AnalyticBridge and LinkedIn.

Ajay- Describe your hobbies and how they help you ,if at all in your professional life.

Sandro- One of my hobbies is reading. I read a lot of books about data mining, SEO, Google as well as Sci-Fi and Fantasy. I’m a big fan of Asimov by the way. My other hobby is playing tennis. I think I simply use my hobbies as a way to find equilibrium in my life. I always try to find the best balance between work, family, friends and sport.

Ajay- What are your plans for your website for 2011-2012.

Sandro- I will continue to publish guest posts and interviews. I think it is important to let other people express themselves about data mining topics. I will not write about my current applications due to the policies of my current employer. But don’t worry, I still have a lot to write, whether it is technical or not. I will also emphasis more on my experience with data mining, advices for data miners, tips and tricks, and of course book reviews!

Standard Disclosure of Blogging- Sandro awarded me the Peoples Choice award for his blog for 2010 and carried out my interview. There is a lot of love between our respective wordpress blogs, but to reassure our puritan American readers- it is platonic and intellectual.

About Sandro S-



Sandro Saitta is a Data Mining Research Engineer at SICPA Security Solutions. He is also a blogger at Data Mining Research (www.dataminingblog.com). His interests include data mining, machine learning, search engine optimization and website marketing.

You can contact Mr Saitta at his Twitter address- 

https://twitter.com/#!/dataminingblog

Heritage offers 3 million chump change for Monkeys

My perspective is life is not fair, and if someone offers me 1 mill a year so they make 1 bill a year, I would still take it, especially if it leads to better human beings and better humanity on this planet. Health care isnt toothpaste.

Unless there are even more fine print changes involved- there exist several players in the pharma sector who do build and deploy models internally for denying claims or prospecting medical doctors with freebies, but they might just get caught with the new open data movement

————————————————————————————————–

A note from KDNuggets-

Heritage Health Prizereleased a second set of data on May 4. They also recently modified their ruleswhich now demand complete exclusivity and seem to disallow use of other tools (emphasis mine – Gregory PS)

21. LICENSE
By registering for the Competition, each Entrant (a) grants to Sponsor and its designees a worldwide, exclusive (except with respect to Entrant) , sub-licensable (through multiple tiers), transferable, fully paid-up, royalty-free, perpetual, irrevocable right to use, not use, reproduce, distribute (through multiple tiers), create derivative works of, publicly perform, publicly display, digitally perform, make, have made, sell, offer for sale and import the entry and the algorithm used to produce the entry, as well as any other algorithm, data or other information whatsoever developed or produced at any time using the data provided to Entrant in this Competition (collectively, the “Licensed Materials”), in any media now known or hereafter developed, for any purpose whatsoever, commercial or otherwise, without further approval by or payment to Entrant (the “License”) and
(b) represents that he/she/it has the unrestricted right to grant the License. 
Entrant understands and agrees that the License is exclusive except with respect to Entrant: Entrant may use the Licensed Materials solely for his/her/its own patient management and other internal business purposes but may not grant or otherwise transfer to any third party any rights to or interests in the Licensed Materials whatsoever.

This has lead to a call to boycott the competition by Tristan, who also notes that academics cannot publish their results without prior written approval of the Sponsor.

Anthony Goldbloom, CEO of Kaggle, emailed the HHP participants on May 4

HPN have asked me to pass on the following message: “The Heritage Provider Network is sponsoring the Heritage Health Prize to spur innovation and creative thinking in healthcare. HPN, however, is a medical group and must retain an exclusive license to the algorithms created using its data so as to ensure that the algorithms are used responsibly, and are only used to provide better health care to patients and not for improper purposes.
Put simply, while the competition hopes to spur innovation, this is not a competition regarding movie ratings or chess results. We hope that the clarifications we have made to the Rules and the FAQ adequately address your concerns and look forward to your participation in the competition.”

What do you think? Will the exclusive license prevent you from participating?