Hadoop is an excellent tool for analyzing large data sets, but it lacks an easy-to-use graphical interface. RapidMiner is an excellent tool for data analytics, but its data size is limited by the memory available, and a single machine is often not enough to run the analyses on time. In this project, we combine the strengths of both projects and provide a RapidMiner extension for editing and running ETL, data analytics and machine learning processes over Hadoop.
We have closely integrated the highly optimized data analytics capabilities of Hive and Mahout, and the user-friendly interface of RapidMiner to form a powerful and easy-to-use data analytics solution for Hadoop.
Today, Radoop took a major step forward with its 0.3 release. The new version of the visual big data analytics package adds full support for all major Hadoop distributions in use today: Apache Hadoop 0.20.2, 0.20.203, 1.0 and Cloudera’s Distribution Including Apache Hadoop 3 (CDH3). It also adds support for large clusters by allowing the namenode, the jobtracker and the Hive server to reside on different nodes.
As Radoop’s promise is to make big data analytics easier, the 0.3 release also focuses on improving the user interface. It has an enhanced breakpointing system that lets you inspect intermediate results, and it adds dozens of quick fixes, so common process design mistakes become much easier to correct.
There are many further improvements and fixes, so please consult the release notes for more details. Radoop is in private beta, but is heading towards a public release in Q2 2012. If you would like early access, please apply at the signup page or describe your use case by email (beta at radoop.eu).
Radoop 0.3 (15 February 2012)
Support for Apache Hadoop 0.20.2, 0.20.203, 1.0 and Cloudera’s Distribution Including Apache Hadoop 3 (CDH3) in a single release
Support for clusters with separate master nodes (namenode, jobtracker, Hive server)
Enhanced breakpointing to evaluate intermediate results
Dozens of quick fixes for the most common process design errors
Improved process design and error reporting
New welcome perspective to help in the first steps
Many bugfixes and performance improvements
Radoop 0.2.2 (6 December 2011)
More Aggregate functions and distinct option
Generate ID operator for convenience
Numerous bug fixes and improvements
Improved user interface
Radoop 0.2.1 (16 September 2011)
Set Role and Data Multiplier operators
Management panel for testing Hadoop connections
Stability improvements for Hive access
Further small bugfixes and improvements
Radoop 0.2 (26 July 2011)
Three new algorithms: Fuzzy K-Means, Canopy, and Dirichlet clustering
Three new data preprocessing operators: Normalize, Replace, and Replace Missing Values
Significant speed improvements in data transmission and interactive analytics
Increased stability and speedup for K-Means
More flexible settings for Join operations
More meaningful error messages
Other small bugfixes and improvements
Radoop 0.1 (14 June 2011)
Initial release with 26 operators for data transmission, data preprocessing, and one clustering algorithm.
Note that RapidMiner also has a great R extension, so you can combine R, a graphical interface and big data analytics more easily and more powerfully than ever.
Hive is a data warehouse infrastructure built on top of Hadoop, i.e. it uses Hadoop's distributed file system and its efficient access technologies. Hive was initially developed by Facebook and is now used and developed by many other companies for their distributed data warehouses.
Mahout is a machine learning library that already offers many scalable machine learning algorithms, likewise implemented on top of Hadoop and its map & reduce paradigm. Hence, Mahout is one of the first distributed data analytics frameworks to make use of the power of Hadoop.
You will see below that both frameworks will be tightly integrated with RapidMiner.
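For readers new to the map & reduce paradigm that Mahout builds on, here is a minimal single-machine sketch in plain Python. It is not Hadoop or Mahout code; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration, and a real cluster would run the map and reduce steps in parallel across many nodes.

```python
from collections import defaultdict


def map_phase(record):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group emitted values by key, as a framework would between the two phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all values that share a key into one result."""
    return key, sum(values)


def word_count(records):
    """Run map, shuffle and reduce in sequence on an in-memory list of lines."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data", "big cluster"])
```

The point of the split is that each map call sees only one record and each reduce call sees only one key, which is what lets a framework like Hadoop spread the work over many machines.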
What can RapidMiner bring into the game?
Hadoop is great for large-scale analytics, but it lacks an easy-to-use graphical interface. RapidMiner is an excellent tool for data analytics, but unless the analyst performs some nasty tricks, the data size is limited by the memory available. So we have the algorithms, the support for analytical process design, the user interface, and of course the community with a demand for large-scale analytics.
RapidMiner + Hadoop = Radoop
Radoop combines the strengths of RapidMiner and Hadoop. The result is a RapidMiner extension for editing and running ETL, data analytics and machine learning processes over Hadoop. The developers have closely integrated the highly optimized data analytics capabilities of Hive and Mahout, and the user-friendly interface of RapidMiner to form a powerful and easy-to-use data analytics solution for Hadoop.
Using WP-Stats, I set about answering this question:
What search keywords lead here?
Clearly, Michael Jackson is down this year,
and R GUI and Data Mining are up.
How does that affect my writing, given that I get almost 250 visitors a day from search engines alone, even assuming I write nothing on this blog from now on?
It doesn't. I still write whatever code or poem comes to my mind. So it is hurtful when people underestimate the effort that goes into writing and jump to conclusions (especially if I write about a company: I am not on the payroll of that company, just as when I write a poem I am not a full-time poet).
Over to xkcd
All Time (for Decisionstats.Wordpress.com)
michael jackson history
wps sas lawsuit
sas wps lawsuit
google maps jet ski
google maps jetski
sas sues wps
donald farmer microsoft
best statistics software
r gui ubuntu
tamilnadu advanced technical training institute tatti
Here is a nice video from Cloudera with a Hive tutorial. I wonder what would happen if they put a real analytical system, like R or SPSS or JMP or SAS, and not just basic analytics and reporting, on a big data system like Hadoop (including some text-mined data from legacy company documents).
Unlike Oracle or other database systems, Hadoop is free now and should stay free for the foreseeable future (as MySQL used to be before it was acquired by the big fish Sun, which was in turn acquired by the bigger Oracle).
Hive is a data warehouse infrastructure built on top of Hadoop that provides tools to enable easy data summarization, ad hoc querying and analysis of large datasets stored in Hadoop files.
Hive is based on Hadoop which is a batch processing system. Accordingly, this system does not and cannot promise low latencies on queries. The paradigm here is strictly of submitting jobs and being notified when the jobs are completed as opposed to real time queries. As a result it should not be compared with systems like Oracle where analysis is done on a significantly smaller amount of data but the analysis proceeds much more iteratively with the response times between iterations being less than a few minutes. For Hive queries response times for even the smallest jobs can be of the order of 5-10 minutes and for larger jobs this may even run into hours.
If your input data is small you can execute a query in a short time. For example, if a table has 100 rows you can ‘set mapred.reduce.tasks=1’ and ‘set mapred.map.tasks=1’ and the query time will be ~15 seconds.
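As a rough illustration of the small-table tip above, the sketch below uses plain Python to build a Hive script that applies the two settings before a query. The helper name `hive_script` and the table name `small_table` are hypothetical; only the two `mapred.*` settings come from the text. The resulting text could be saved to a file and passed to the Hive command line (for example with `hive -f`), but the ~15 second figure is the quoted claim, not something this snippet measures.

```python
def hive_script(query, settings):
    """Prefix a Hive query with one `set name=value;` line per setting."""
    lines = [f"set {name}={value};" for name, value in settings.items()]
    # Make sure the query ends with exactly one semicolon.
    lines.append(query.rstrip().rstrip(";") + ";")
    return "\n".join(lines)


# Settings quoted in the text for very small inputs.
SMALL_JOB_SETTINGS = {"mapred.reduce.tasks": 1, "mapred.map.tasks": 1}

# `small_table` is a placeholder table name used only for this example.
script = hive_script("SELECT COUNT(*) FROM small_table", SMALL_JOB_SETTINGS)
print(script)
```

Keeping the settings in a dictionary makes it easy to drop them again for large tables, where a single map and a single reduce task would be a bottleneck.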