Radoop 0.3 launched- Open Source Graphical Analytics meets Big Data

What is Radoop? Quite possibly an exciting mix of analytics and big data computing

http://blog.radoop.eu/?p=12

What is Radoop?

BY ZOLTÁN PREKOPCSÁK

Hadoop is an excellent tool for analyzing large data sets, but it lacks an easy-to-use graphical interface. RapidMiner is an excellent tool for data analytics, but its data size is limited by the memory available, and a single machine is often not enough to run the analyses on time. In this project, we combine the strengths of both projects and provide a RapidMiner extension for editing and running ETL, data analytics and machine learning processes over Hadoop.

We have closely integrated the highly optimized data analytics capabilities of Hive and Mahout, and the user-friendly interface of RapidMiner to form a powerful and easy-to-use data analytics solution for Hadoop.

and what’s new

http://blog.radoop.eu/?p=198

Radoop 0.3 released – fully graphical big data analytics

BY ZOLTÁN PREKOPCSÁK

Today, Radoop took a major step forward with its 0.3 release. The new version of the visual big data analytics package adds full support for all major Hadoop distributions used these days: Apache Hadoop 0.20.2, 0.20.203, 1.0 and Cloudera’s Distribution including Apache Hadoop 3 (CDH3). It also adds support for large clusters by allowing the namenode, the jobtracker and the Hive server to reside on different nodes.

As Radoop’s promise is to make big data analytics easier, the 0.3 release also focuses on improving the user interface. It has an enhanced breakpointing system that lets you investigate intermediate results, and it adds dozens of quick fixes, so common process design mistakes become much easier to solve.

There are many further improvements and fixes, so please consult the release notes for more details. Radoop is in private beta mode, but heading towards a public release in Q2 2012. If you would like to get early access, then please apply at the signup page or describe your use case in email (beta at radoop.eu).

Radoop 0.3 (15 February 2012)

Support for Apache Hadoop 0.20.2, 0.20.203, 1.0 and Cloudera’s Distribution Including Apache Hadoop 3 (CDH3) in a single release
Support for clusters with separate master nodes (namenode, jobtracker, Hive server)
Enhanced breakpointing to evaluate intermediate results
Dozens of quick fixes for the most common process design errors
Improved process design and error reporting
New welcome perspective to help in the first steps
Many bugfixes and performance improvements

Radoop 0.2.2 (6 December 2011)

More Aggregate functions and distinct option
Generate ID operator for convenience
Numerous bug fixes and improvements
Improved user interface

Radoop 0.2.1 (16 September 2011)

Set Role and Data Multiplier operators
Management panel for testing Hadoop connections
Stability improvements for Hive access
Further small bugfixes and improvements

Radoop 0.2 (26 July 2011)

Three new algorithms: Fuzzy K-Means, Canopy, and Dirichlet clustering
Three new data preprocessing operators: Normalize, Replace, and Replace Missing Values
Significant speed improvements in data transmission and interactive analytics
Increased stability and speedup for K-Means
More flexible settings for Join operations
More meaningful error messages
Other small bugfixes and improvements

Radoop 0.1 (14 June 2011)

Initial release with 26 operators for data transmission, data preprocessing, and one clustering algorithm.

Note that RapidMiner also has a great R extension, so you can combine R, a graphical interface and big data analytics. Together they are now easier and more powerful than ever.

Introducing Radoop

That's right- this is Radoop, and it is

Hadoop meets RapidMiner = Radoop

http://prezi.com/dxx7m50le5hr/radoop-presentation-at-rcomm-2011/

Radoop presentation at RCOMM 2011 on Prezi

What about Hive and Mahout?

Hive is a data warehouse infrastructure built on top of Hadoop, i.e. it uses the distributed file system of Hadoop and the efficient access technologies. Hive was initially developed by Facebook and is now used and developed by many other companies for their distributed data warehouse.

Mahout is a machine learning library that already offers many scalable machine learning algorithms, also implemented on top of Hadoop and its MapReduce paradigm. Hence, Mahout is one of the first distributed data analytics frameworks to make use of the power of Hadoop.
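
To make the map and reduce idea concrete, here is a minimal sketch of a single K-Means iteration written in that style in plain Python. It is not Mahout's implementation and does not run on Hadoop. The function names and the sample points are made up for illustration. The map step assigns each point to its nearest centroid, and the reduce step averages the points in each group.

```python
from collections import defaultdict
from math import dist  # Euclidean distance, Python 3.8+


def map_point(point, centroids):
    """Map step: emit (index of the nearest centroid, point)."""
    nearest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
    return nearest, point


def reduce_cluster(points):
    """Reduce step: average all points that share one centroid index."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans_iteration(points, centroids):
    # Shuffle phase: group the mapped points by the key from the map step.
    groups = defaultdict(list)
    for point in points:
        key, value = map_point(point, centroids)
        groups[key].append(value)
    # A centroid that attracted no points is kept unchanged.
    return [
        reduce_cluster(groups[i]) if groups[i] else centroids[i]
        for i in range(len(centroids))
    ]


# Hypothetical sample data: two obvious groups of 2-D points.
points = [(1.0, 1.0), (1.0, 3.0), (9.0, 9.0), (11.0, 9.0)]
new_centroids = kmeans_iteration(points, [(0.0, 0.0), (10.0, 10.0)])
print(new_centroids)
```

On a real cluster the map calls would run in parallel on different nodes and the grouping would be done by the framework. Here everything runs in one process, just to show the shape of the computation.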

You will see below that both frameworks will be tightly integrated with RapidMiner.

What can RapidMiner bring into the game?

Hadoop is great for large-scale analytics, but it lacks an easy-to-use graphical interface. RapidMiner is an excellent tool for data analytics, but unless the analyst performs some nasty tricks, the data size is limited by the memory available. So we have the algorithms, the support for analytical process design, the user interface, and of course the community with a demand for large-scale analytics.

RapidMiner + Hadoop = Radoop

Radoop combines the strengths of RapidMiner and Hadoop. The result is a RapidMiner extension for editing and running ETL, data analytics and machine learning processes over Hadoop. The developers have closely integrated the highly optimized data analytics capabilities of Hive and Mahout, and the user-friendly interface of RapidMiner to form a powerful and easy-to-use data analytics solution for Hadoop.

Source-https://rapid-i.com/component/option,com_myblog/show,Big-data-analytics-made-easy-Radoop.html/Itemid,172/

and http://blog.radoop.eu/

Interesting? Sign up here- http://radoop.eu/z1sxe

Who searches for this Blog?

Using WP-Stats, I set about answering this question:

What search keywords lead here?

Clearly Michael Jackson is down this year

And R GUI and Data Mining are up.

How does that affect my writing, given that I get almost 250 visitors a day from search engines alone, even if I were to write nothing on this blog from now on?

It doesn't. I still write whatever code or poem comes to my mind. So it is hurtful when people misunderestimate the effort in writing and jump to conclusions (especially if I write about a company: I am not on that company's payroll, just as, if I write a poem, I am not a full-time poet).

Over to xkcd

All Time (for Decisionstats.Wordpress.com)

Search	Views
libre office	818
facebook analytics	806
michael jackson history	240
wps sas lawsuit	180
r gui	168
wps sas	154
wordle.net	118
sas wps	116
decision stats	110
sas wps lawsuit	100
google maps jet ski	94
data mining	88
doug savage	72
hive tutorial	63
spss certification	63
hadley wickham	63
google maps jetski	62
sas sues wps	60
decisionstats	58
donald farmer microsoft	45
libreoffice	44
wps statistics	44
best statistics software	42
r gui ubuntu	41
rstat	37
tamilnadu advanced technical training institute tatti	37

YTD

2009-11-24 to Today

Search	Views
libre office	818
facebook analytics	781
wps sas lawsuit	170
r gui	164
wps sas	125
wordle.net	118
sas wps	101
sas wps lawsuit	95
google maps jet ski	94
data mining	86
decision stats	82
doug savage	63
hadley wickham	63
google maps jetski	62
hive tutorial	56
donald farmer microsoft	45
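
One way to read the two tables together is to ask what share of a keyword's all-time views fall inside the 2009-11-24 to Today window. The sketch below does that arithmetic in Python for a handful of keywords, with the numbers copied from the tables above. The function name is made up for illustration. A keyword missing from the second table (such as "michael jackson history") simply has no share that can be computed from this data.

```python
# Views copied from the two tables above.
all_time = {
    "libre office": 818,
    "facebook analytics": 806,
    "michael jackson history": 240,
    "r gui": 168,
    "decision stats": 110,
    "data mining": 88,
    "hive tutorial": 63,
}
recent = {
    "libre office": 818,
    "facebook analytics": 781,
    "r gui": 164,
    "data mining": 86,
    "decision stats": 82,
    "hive tutorial": 56,
}


def recent_share_percent(keyword):
    """Percent of all-time views in the recent window, or None if not listed."""
    if keyword not in recent:
        return None
    return round(100 * recent[keyword] / all_time[keyword], 1)


shares = {keyword: recent_share_percent(keyword) for keyword in all_time}
for keyword, share in shares.items():
    print(keyword, share)
```

For "r gui" this gives 164 / 168, or roughly 97.6 percent, which is the sense in which that keyword is up.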

Hive Tutorial: Cloud Computing

Here is a nice video from Cloudera with a Hive tutorial. I wonder what would happen if they put a real analytical system, like R or SPSS or JMP or SAS, and not just basic analytics and reporting, on a big database system like Hadoop (including some text-mined data from legacy company documents).

Unlike Oracle or other database systems, Hadoop is free now and should stay free for the foreseeable future (as MySQL used to be before it was acquired by big fish Sun, which was in turn acquired by the bigger Oracle).

Citation-

http://wiki.apache.org/hadoop/Hive

Hive is a data warehouse infrastructure built on top of Hadoop that provides tools to enable easy data summarization, ad hoc querying and analysis of large datasets stored in Hadoop files.

Hive is based on Hadoop which is a batch processing system. Accordingly, this system does not and cannot promise low latencies on queries. The paradigm here is strictly of submitting jobs and being notified when the jobs are completed as opposed to real time queries. As a result it should not be compared with systems like Oracle where analysis is done on a significantly smaller amount of data but the analysis proceeds much more iteratively with the response times between iterations being less than a few minutes. For Hive queries response times for even the smallest jobs can be of the order of 5-10 minutes and for larger jobs this may even run into hours.

If your input data is small you can execute a query in a short time. For example, if a table has 100 rows you can ‘set mapred.reduce.tasks=1’ and ‘set mapred.map.tasks=1’ and the query time will be ~15 seconds.
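
As a rough sketch of that tip, the Python snippet below assembles a small Hive script that sets the two properties quoted above before running a query. The helper name and the table name `tiny_table` are made up for illustration. The snippet only builds the script text and does not talk to Hive itself.

```python
def small_input_script(query):
    """Prefix a query with the two settings suggested for very small inputs."""
    settings = {"mapred.reduce.tasks": 1, "mapred.map.tasks": 1}
    lines = [f"set {key}={value};" for key, value in settings.items()]
    # Make sure the query ends with exactly one semicolon.
    lines.append(query.rstrip().rstrip(";") + ";")
    return "\n".join(lines)


# tiny_table is a hypothetical 100-row table.
script = small_input_script("SELECT COUNT(*) FROM tiny_table")
print(script)
```

The resulting text could be saved to a file and passed to the Hive command line, for example with `hive -f`. Whether the query really finishes in about 15 seconds depends on the cluster.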
