AsterData releases nCluster 4.6
From the press release
Aster Data has released nCluster 4.6, which includes a column data store, making it the first platform with a unified SQL-MapReduce analytic framework on a hybrid row and column massively parallel processing (MPP) database management system (DBMS). The unified SQL-MapReduce analytic framework, together with Aster Data's suite of 1000+ MapReduce-ready analytic functions, delivers a substantial breakthrough in richer, high-performance analytics on large data volumes, where data can be stored in either a row or column format.
With Aster Data nCluster 4.6, customers can choose the data format best suited to their needs and benefit from the power of Aster Data’s SQL-MapReduce analytic capabilities, providing maximum query performance by leveraging row-only, column-only, or hybrid storage strategies. Aster Data makes selection of the appropriate storage strategy easy with the new Data Model Express tool that determines the optimal data model based on a customer’s query workloads. Both row and column stores in Aster Data nCluster 4.6 benefit from platform-level services including Online Precision Scaling™ on commodity hardware, dynamic workload management, and always-on availability, all of which now operate on both row and column stores. All 1000+ MapReduce-ready analytic functions released previously through Aster Data Analytic Foundation — a powerful suite of pre-built MapReduce analytic software building blocks — now run on a hybrid row and column architecture. Aster Data nCluster 4.6 also includes new pre-built analytic functions, including decision trees and histograms. For custom analytic application development, the Aster Data IDE, Aster Data Developer Express, also fully and seamlessly supports the hybrid row and column store in Aster Data nCluster 4.6.
More advanced analytics infrastructure.
Professors and Patches: For a Betterrrr R
Professors sometimes throw out provocative statements to spark intellectual debate. I have had nearly 1,500 hits in less than 2 days (and I am glad I am on wordpress.com; my old beloved server would have crashed).
The remarks from Ross Ihaka, covered here before, are also discussed at Xian's blog:
http://xianblog.wordpress.com/2010/09/06/insane/comment-page-4/#comments
Note that most of his remarks are technical, and only a single line refers to Revolution Analytics. Other senior members of the community (read: professors) are silent, though some thought has probably been ignited behind the scenes.
Ross Ihaka Says (September 12, 2010 at 1:23 pm):
Since (something like) my name has been taken in vain here, let me chip in.
I've been worried for some time that R isn't going to provide the base that we're going to need for statistical computation in the future. (It may well be that the future is already upon us.) There are certainly efficiency problems (speed and memory use), but there are more fundamental issues too. Some of these were inherited from S and some are peculiar to R.
One of the worst problems is scoping. Consider the following little gem.
f = function() {
    if (runif(1) > .5)
        x = 10
    x
}
The x being returned by this function is randomly local or global. There are other examples where variables alternate between local and non-local throughout the body of a function. No sensible language would allow this. It's ugly and it makes optimisation really difficult. This isn't the only problem; even weirder things happen because of interactions between scoping and lazy evaluation.
In light of this, I've come to the conclusion that rather than "fixing" R, it would be much more productive to simply start over and build something better. I think the best you could hope for by fixing the efficiency problems in R would be to boost performance by a small multiple, or perhaps as much as an order of magnitude. This probably isn't enough to justify the effort (Luke Tierney has been working on R compilation for over a decade now).
To try to get an idea of how much speedup is possible, a number of us have been carrying out some experiments to see how much better we could do with something new. Based on prototyping we've been doing at Auckland, it looks like it should be straightforward to get two orders of magnitude speedup over R, at least for those computations which are currently bottle-necked. There are a couple of ways to make this happen.
First, scalar computations in R are very slow. This is in part because the R interpreter is very slow, but also because there are no scalar types. By introducing scalars and using compilation it looks like it's possible to get a speedup by a factor of several hundred for scalar computations. This is important because it means that many ghastly uses of array operations and the apply functions could be replaced by simple loops. The cost of these improvements is that scope declarations become mandatory and (optional) type declarations are necessary to help the compiler.
As a side-effect of compilation and the use of type-hinting it should be possible to eliminate dispatch overhead for certain (sealed) classes (scalars and arrays in particular). This won't bring huge benefits across the board, but it will mean that you won't have to do foreign language calls to get efficiency.
A second big problem is that computations on aggregates (data frames in particular) run at glacial rates. This is entirely down to unnecessary copying because of the call-by-value semantics. Preserving call-by-value semantics while eliminating the extra copying is hard. The best we can probably do is to take a conservative approach. R already tries to avoid copying where it can, but fails in an epic fashion. The alternative is to abandon call-by-value and move to reference semantics. Again, prototyping indicates that several hundredfold speedup is possible (for data frames in particular).
The changes in semantics mentioned above mean that the new language will not be R. However, it won't be all that far from R and it should be easy to port R code to the new system, perhaps using some form of automatic translation.
If we're smart about building the new system, it should be possible to make use of multi-cores and parallelism. Adding this to the mix might just make it possible to get a three order-of-magnitude performance boost with just a fraction of the memory that R uses. I think it's something really worth putting some effort into.
I also think one other change is necessary. The license will need to do a better job of protecting work donated to the commons than GPL2 seems to have done. I'm not willing to have any more of my work purloined by the likes of Revolution Analytics, so I'll be looking for better protection from the license (and being a lot more careful about who I work with).
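Ihaka's scoping complaint is easy to reproduce at the R prompt. The following short session is my own illustration (not part of his comment), with arbitrary values chosen purely for demonstration:
x <- 99                     # a global x happens to exist
f <- function() {
    if (runif(1) > .5)
        x <- 10             # sometimes creates a local x ...
    x                       # ... otherwise the global x is returned
}
set.seed(1)
replicate(6, f())           # returns a mix of 10 (local) and 99 (global)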
The discussion spilled over to Stack Overflow as well.
In the past week I've been following a discussion where Ross Ihaka wrote (here):
I’ve been worried for some time that R isn’t going to provide the base that we’re going to need for statistical computation in the future. (It may well be that the future is already upon us.) There are certainly efficiency problems (speed and memory use), but there are more fundamental issues too. Some of these were inherited from S and some are peculiar to R.
He then continued explaining. This discussion started from this post, and was then followed by comments here, here, here, here, here, here and maybe some more places I don't know of.
We all know the problem now: R can be improved substantially in terms of speed.
For some solutions, here are the patches by Radford Neal:
http://www.cs.toronto.edu/~radford/speed-patches-doc
patch-dollar
Speeds up access to lists, pairlists, and environments using the $ operator. The speedup comes mainly from avoiding the overhead of calling DispatchOrEval if there are no complexities, from passing on the field to extract as a symbol, or a name, or both, as available, and then converting only as necessary, from simplifying and inlining the pstrmatch procedure, and from not translating the string multiple times.
Relevant timing test script: test-dollar.r
This test shows about a 40% decrease in the time needed to extract elements of lists and environments.
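(Aside: a quick, purely illustrative way to time this kind of access from the R prompt, not Radford's actual test-dollar.r script, is the following.)
lst <- list(abc = 1, def = 2)
env <- new.env()
env$abc <- 1
system.time(for (i in 1:1e6) lst$abc)   # $ access on a list
system.time(for (i in 1:1e6) env$abc)   # $ access on an environment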
Changes unrelated to speed improvement:
A small error-reporting bug is fixed, illustrated by the following output with r52822:
> options(warnPartialMatchDollar=TRUE)
> pl <- pairlist(abc=1,def=2)
> pl$ab
[1] 1
Warning message:
In pl$ab : partial match of 'ab' to ''
Some code is changed at the end of R_subset3_dflt because it seems to be more correct, as discussed in code comments.
patch-evalList
Speeds up a large number of operations by avoiding allocation of an extra CONS cell in the procedures for evaluating argument lists.
Relevant timing test scripts: all of them, but will look at test-em.r
On test-em.r, the speedup from this patch is about 5%.
patch-fast-base
Speeds up lookup of symbols defined in the base environment, by flagging symbols that have a base environment definition recorded in the global cache. This allows the definition to be retrieved quickly without looking in the hash table.
Relevant timing test scripts: all of them, but will look at test-em.r
On test-em.r, the speedup from this patch is about 3%.
Issue: This patch uses the "spare" bit for the flag. This bit is misnamed, since it is already used elsewhere (for closures). It is possible that one of the "gp" bits should be used instead. The "gp" bits should really be divided up for faster access, and so that their present use is apparent in the code.
In case this use of the "spare" bit proves unwise, the patch code is conditional on FAST_BASE_CACHE_LOOKUP being defined at the start of envir.c.
patch-fast-spec
Speeds up lookup of function symbols that begin with a character other than a letter or ".", by allowing fast bypass of non-global environments that do not contain (and have never contained) symbols of this sort. Since it is expected that only functions will be given names of this sort, the check is done only in findFun, though it could also be done in findVar.
Relevant timing test scripts: all of them, but will look at test-em.r
On test-em.r, the speedup from this patch is about 8%.
Issue: This patch uses the "spare" bit to flag environments known to not have symbols starting with a special character. See remarks on patch-fast-base.
In case this use of the "spare" bit proves unwise, the patch code is conditional on FAST_SPEC_BYPASS being defined at the start of envir.c.
patch-for
Speeds up for loops by not allocating new space for the loop variable every iteration, unless necessary.
Relevant timing test script: test-for.r
This test shows a speedup of about 5%.
Change unrelated to speed improvement:
Fixes what I consider to be a bug, in which the loop clobbers a global variable, as demonstrated by the following output with r52822:
> i <- 99
> f <- function () for (i in 1:3) { print(i); if (i==2) rm(i); }
> f()
[1] 1
[1] 2
[1] 3
> print(i)
[1] 3
patch-matprod
Speeds up matrix products, including vector dot products. The speed issue here is that the R code checks for any NAs, and does the multiply in the matprod procedure (in array.c) if so, since BLAS isn't trusted with NAs. If this check takes about as long as just doing the multiply in matprod, calling a BLAS routine makes no sense.
Relevant time test script: test-matprod.r
With no external BLAS, this patch speeds up long vector-vector products by a factor of about six, matrix-vector products by a factor of about three, and some matrix-matrix products by a factor of about two.
Issue: The matrix multiply code in matprod uses an LDOUBLE (long double) variable to accumulate sums, for improved accuracy. On a SPARC system I tested on, operations on long doubles are vastly slower than on doubles, so that the patch produces a large slowdown rather than an improvement. This is also an issue for the "sum" function, which also uses an LDOUBLE to accumulate the sum. Perhaps an ordinary double should be used in these places, or perhaps the configuration script should define LDOUBLE as double on architectures where long doubles are extraordinarily slow.
Due to this issue, not defining MATPROD_CAN_BE_DONE_HERE at the start of array.c will disable this patch.
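(Aside: to get a feel for these numbers, a purely illustrative dot-product timing in plain R, not Radford's test-matprod.r, could look like this.)
x <- runif(1e6); y <- runif(1e6)
system.time(for (i in 1:100) x %*% y)      # dot product via the matprod/BLAS path
system.time(for (i in 1:100) sum(x * y))   # plain vector arithmetic, for comparison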
patch-parens
Speeds up parentheses by making "(" a special operator whose argument is not evaluated, thereby bypassing the overhead of evalList. Also slightly speeds up curly brackets by inlining a function that is stylistically better inline anyway.
Relevant test script: test-parens.r
In the parens part of test-parens.r, the speedup is about 9%.
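(Aside: the overhead being removed is visible even from R level; a tiny, purely illustrative comparison, not the actual test-parens.r script, is shown below.)
f_plain  <- function(n) { s <- 0; for (i in 1:n) s <- s + i; s }
f_parens <- function(n) { s <- 0; for (i in 1:n) s <- (s + i); s }
system.time(f_plain(1e6))    # no redundant parentheses
system.time(f_parens(1e6))   # each iteration evaluates an extra "("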
patch-protect
Speeds up numerous operations by making PROTECT, UNPROTECT, etc. be mostly macros in the files in src/main. This takes effect only for files that include Defn.h after defining the symbol USE_FAST_PROTECT_MACROS. With these macros, code of the form v = PROTECT(...) must be replaced by PROTECT(v = ...).
Relevant timing test scripts: all of them, but will look at test-em.r
On test-em.r, the speedup from this patch is about 9%.
patch-save-alloc
Speeds up some binary and unary arithmetic operations by, when possible, using the space holding one of the operands to hold the result, rather than allocating new space. Though primarily a speed improvement, for very long vectors avoiding this allocation could avoid running out of space.
Relevant test script: test-complex-expr.r
On this test, the speedup is about 5% for scalar operands and about 8% for vector operands.
Issues: There are some tricky issues with attributes, but I think I got them right. This patch relies on NAMED being set correctly in the rest of the code. In case it isn't, the patch can be disabled by not defining AVOID_ALLOC_IF_POSSIBLE at the top of arithmetic.c.
patch-square
Speeds up a^2 when a is a long vector by not checking for the special case of an exponent of 2 over and over again for every vector element.
Relevant test script: test-square.r
The time for squaring a long vector is reduced in this test by a factor of more than five.
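(Aside: an illustrative way to see the baseline cost being discussed, not the actual test-square.r script, is to compare squaring via ^ with an explicit multiply.)
a <- runif(1e6)
system.time(for (i in 1:100) a^2)     # goes through the general power code
system.time(for (i in 1:100) a * a)   # explicit multiply, for comparison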
patch-sum-prod
Speeds up the "sum" and "prod" functions by not checking for NA
when na.rm=FALSE, and other detailed code improvements.
Relevant test script: test-sum-prod.r
For sum, the improvement is about a factor of 2.5 when na.rm=FALSE,
and about 10% when na.rm=TRUE.
Issue: See the discussion of patch-matprod regarding LDOUBLE.
There is no change regarding this issue due to this patch, however.
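(Aside: a minimal way to compare the two paths mentioned above, purely illustrative and not Radford's test-sum-prod.r, is shown here.)
x <- runif(1e7)
system.time(sum(x))                  # na.rm = FALSE, the default
system.time(sum(x, na.rm = TRUE))    # the explicit NA-removal path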
patch-transpose
Speeds up the transpose operation (the "t" function) from detailed code improvements.
Relevant test script: test-transpose.r
The improvement for 200x60 matrices is about a factor of two. There is little or no improvement for long row or column vectors.
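(Aside: the kind of measurement quoted above can be approximated with this purely illustrative snippet, not the actual test-transpose.r script.)
m <- matrix(runif(200 * 60), nrow = 200)   # a 200x60 matrix, as in the note above
system.time(for (i in 1:1e4) t(m))         # repeated transposes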
patch-vec-arith
Speeds up arithmetic on vectors of the same length, or when one vector is of length one. This is done with detailed code improvements.
Relevant test script: test-vec-arith.r
On long vectors, the +, -, and * operators are sped up by about 20% when operands are the same length or one operand is of length one. Rather mysteriously, when the operands are not length one or the same length, there is about a 20% increase in time required, though this may be due to some strange C optimizer peculiarity or some strange cache effect, since the C code for this is the same as before, with negligible additional overhead getting to it. Regardless, this case is much less common than equal lengths or length one.
There is little change for the / operator, which is much slower than +, -, or *.
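(Aside: the two common cases the patch targets can be timed from R as follows; this is my own illustration, not test-vec-arith.r.)
x <- runif(1e6); y <- runif(1e6)
system.time(for (i in 1:100) x + y)   # operands of the same length
system.time(for (i in 1:100) x + 1)   # one operand of length one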
patch-vec-subset
Speeds up extraction of subsets of vectors or matrices (eg, v[10:20] or M[1:10,101:110]). This is done with detailed code improvements.
Relevant test script: test-vec-subset.r
There are lots of tests in this script. The most dramatic improvement is for extracting many rows and columns of a large array, where the improvement is by about a factor of four. Extracting many rows from one column of a matrix is sped up by about 30%.
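(Aside: the two cases mentioned above can be timed with a sketch like this, purely illustrative and not the actual test-vec-subset.r script.)
M <- matrix(runif(1000 * 1000), nrow = 1000)
system.time(for (i in 1:100) M[1:500, 1:500])    # many rows and columns of a large matrix
system.time(for (i in 1:1000) M[1:500, 1])       # many rows from one column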
Changes unrelated to speed improvement:
Fixes two latent bugs where the code incorrectly refers to NA_LOGICAL when NA_INTEGER is appropriate and where LOGICAL and INTEGER types are treated as interchangeable. These cause no problems at the moment, but would if representations were changed.
patch-subscript
(Formerly part of patch-vec-subset) This patch also speeds up extraction, and also replacement, of subsets of vectors or matrices, but focuses on the creation of the indexes rather than the copy operations. Often avoids a duplication (see below) and eliminates a second scan of the subscript vector for zero subscripts, folding it into a previous scan at no additional cost.
Relevant test script: test-vec-subset.r
Speeds up some operations with scalar or short vector indexes by about 10%. Speeds up subscripting with a longer vector of positive indexes by about 20%.
Issues: The current code duplicates a vector of indexes when it seems unnecessary. Duplication is for two reasons: to handle the situation where the index vector is itself being modified in a replace operation, and so that any attributes can be removed, which is helpful only for string subscripts, given how the routine to handle them returns information via an attribute. Duplication for the second reason can easily be avoided, so I avoided it. The first reason for duplication is sometimes valid, but can usually be avoided by first only doing it if the subscript is to be used for replacement rather than extraction, and second only doing it if the NAMED field for the subscript isn't zero.
I also removed two layers of procedure call overhead (passing seven arguments, so not trivial) that seemed to be doing nothing. Probably it used to do something, but no longer does, but if instead it is preparation for some future use, then removing it might be a mistake.
In my opinion, software problems are best solved by writing code or patches rather than by discussing them endlessly.
Some other solutions to a BETTERRRR R
1) Complete Code Design Review
2) Version 3 - Tuneup
3) Better Documentation
4) Suing Revolution Analytics for the code - Hand over da code pardner
KXEN Update
Update from a very good data mining software company, KXEN –
1) Longtime Chairman and founder Roger Haddad is retiring but will remain a Board Member. See his interview with Decisionstats here: https://decisionstats.wordpress.com/2009/01/05/interview-roger-haddad-founder-of-kxen-automated-modeling-software/ (note: images were hidden due to the migration from .com to .wordpress.com).
2) New members of the leadership team are listed below-
I hope John at least helps build a KXEN Force.com application; there are only 2 data mining apps on the App Exchange. Also on the wish list: more social media presence, a Web SaaS/Amazon API for KXEN, greater presence at American/Asian conferences, a solution for SMEs (which cannot afford the premium pricing of the flagship solution), and an alliance with bigger BI vendors like Oracle, SAP or IBM for selling the great social network analysis capabilities.
Bill Russell as Non-Executive Chairman-
Russell was appointed Non-Executive Chairman of the Board, effective July 16, 2010. Russell has 30 years of operational experience in enterprise software, with a special focus on business intelligence, analytics, and databases. Russell held a number of senior-level positions in his more than 20 years at Hewlett-Packard, including Vice President and General Manager of the multi-billion dollar Enterprise Systems Group. He has served as Non-executive Chairman of the Board for Sylantro Systems Corporation, webMethods Inc., and Network Physics, Inc. and has served as a board director for Cognos Inc. In addition to KXEN, Russell currently serves on the boards of Saba, PROS Holdings Inc., Global 360, ParAccel Inc., and B.T. Mancini Company.
Xavier Haffreingue as senior vice president, worldwide professional services and solutions.
He has almost 20 years of international enterprise software experience gained in the CRM, BI, Web and database sectors. Haffreingue joins KXEN from software provider Axway where he was VP global support operations. Prior to Axway, he held various leadership roles in the software industry, including VP self service solutions at Comverse Technologies and VP professional services and support at Netonomy, where he successfully delivered multi-million dollar projects across Europe, Asia-Pacific and Africa. Before that he was with Business Objects and Sybase, where he ran support and services in southern Europe managing over 2,500 customers in more than 20 countries.
David Guercio as senior vice president, Americas field operations. Guercio brings to the role more than 25 years experience of building and managing high-achieving sales teams in the data mining, business intelligence and CRM markets. Guercio comes to KXEN from product lifecycle management vendor Centric Software, where he was EVP sales and client services. Prior to Centric, he was SVP worldwide sales and client services at Inxight Software, where he was also Chairman and CEO of the company’s Federal Systems Group, a subsidiary of Inxight that saw success in the US Federal Government intelligence market. The success in sales growth and penetration into the federal government led to the acquisition of Inxight by Business Objects in 2007, where Guercio then led the Inxight sales organization until Business Objects was acquired by SAP. Guercio was also a key member of the management team and a co-founder at Neovista, an early pioneer in data mining and predictive analytics. Additionally, he held the positions of director of sales and VP of professional services at Metaphor Computer Systems, one of the first data extraction solutions companies, which was acquired by IBM. During his career, Guercio also held executive positions at Resonate and SiGen.
3) Venture Capital funding to fund expansion-
KXEN has closed $8 million in Series D funding to further accelerate its growth and international expansion. The round was led by NextStage and included participation from existing investors XAnge Capital, Sofinnova Ventures, Saints Capital and Motorola Ventures.
This was done after John Ball had joined as CEO.
4) Continued kudos from analysts and customers for its technical excellence.
KXEN was named a leader in predictive analytics and data mining by Forrester Research (1) and was rated highest for commercial deployments of social network analytics by Frost & Sullivan (2).
It also became an alliance partner of Accenture, which is a prominent SAS partner as well.
In-Database Optimization-
In KXEN V5.1, a new data manipulation module (ADM) is provided in conjunction with scoring to optimize database workloads and provide full in-database model deployment. Some leading data mining vendors are only now beginning to offer this kind of functionality, and then with only one or two selected databases, giving KXEN a more than five-year head start. Some other vendors offer only generic SQL generation, not optimized for each database, and do not provide the wealth of possible outputs for their scoring equations: real operational applications require not only scores but also decision probabilities, error bars, and individual input contributions (used to derive the reasons for a decision), all of which are available in KXEN's in-database scoring modules.
Since 2005, KXEN has leveraged databases as the data manipulation engine for analytical dataset generation. In 2008, the ADM (Analytical Data Management) module delivered a major enhancement by providing a very easy to use data manipulation environment with unmatched productivity and efficiency. ADM works as a generator of optimized database-specific SQL code and comes with an integrated layer for the management of meta-data for analytics.
KXEN Modeling Factory- (similar to SAS’s recent product Rapid Predictive Modeler http://www.sas.com/resources/product-brief/rapid-predictive-modeler-brief.pdf and http://jtonedm.com/2010/09/02/first-look-rapid-predictive-modeler/)
KXEN Modeling Factory (KMF) has been designed to automate the development and maintenance of predictive analytics-intensive systems, especially systems that include large numbers of models, vast amounts of data or require frequent model refreshes. Information about each project and model is monitored and disseminated to ensure complete management and oversight and to facilitate continual improvement in business performance.
Main Functions
Schedule: creation of the Analytic Data Set (ADS), setup of how and when to score, setup of when and how to perform model retraining and refreshes …
Report: monitor model execution over time, track changes in model quality over time, see how useful one variable is by considering its multiple instances in models …
Notification: rather than having to wade through pages of event logs, KMF Department allows users to manage by exception through notifications.
Other products from KXEN have been covered here before https://decisionstats.wordpress.com/tag/kxen/ , including Structural Risk Minimization- https://decisionstats.wordpress.com/2009/04/27/kxen-automated-regression-modeling/
That's all for the KXEN update. All the best to the new management team, and a splendid job done by Roger Haddad in creating what is France's and Europe's best-known data mining company.
Note- Source – http://www.kxen.com
Open Source Business Intelligence: Pentaho and Jaspersoft
Here are two products that are widely used for Business Intelligence. They are open source and both have free previews.
Jaspersoft: For the Enterprise version, click on the screenshot, while for the free community version you can go to
http://jasperforge.org/projects/jasperserver
Interestingly (and not surprisingly) Revolution Analytics is teaming up with Jaspersoft to use R for reporting along with the Jaspersoft BI stack.
FREE WEBINAR WEDNESDAY, SEPTEMBER 22ND @9AM PACIFIC
DEPLOYING R: ADVANCED ANALYTICS ON DEMAND IN APPLICATIONS, IN DASHBOARDS, AND ON THE WEB
A JOINT WEBINAR FROM REVOLUTION ANALYTICS AND JASPERSOFT
Date: Wednesday, September 22, 2010
Time: 9:00am PDT (12:00pm EDT; 4:00pm GMT)
Presenters: David Smith, Vice President of Marketing, Revolution Analytics
Andrew Lampitt, Senior Director of Technology Alliances, Jaspersoft
Matthew Dahlman, Business Development Engineer, Jaspersoft
Registration: Click here to register now!
R is a popular and powerful system for creating custom data analysis, statistical models, and data visualizations. But how can you make the results of these R-based computations easily accessible to others? A PhD statistician could use R directly to run the forecasting model on the latest sales data, and email a report on request, but then the process is just going to have to be repeated again next month, even if the model hasn't changed. Wouldn't it be better to empower the Sales manager to run the model on demand from within the BI application she already uses—daily, even!—and free up the statistician to build newer, better models for others?
In this webinar, David Smith (VP of Marketing, Revolution Analytics) will introduce the new “RevoDeployR” Web Services framework for Revolution R Enterprise, which is designed to make it easy to integrate dynamic R-based computations into applications for business users. RevoDeployR empowers data analysts working in R to publish R scripts to a server-based installation of Revolution R Enterprise. Application developers can then use the RevoDeployR Web Services API to securely and scalably integrate the results of these scripts into any application, without needing to learn the R language. With RevoDeployR, authorized users of hosted or cloud-based interactive Web applications, desktop applications such as Microsoft Excel, and BI applications like Jaspersoft can all benefit from on-demand analytics and visualizations developed by expert R users.
To demonstrate the power of deploying R-based computations to business users, Andrew Lampitt will introduce Jaspersoft commercial open source business intelligence, the world’s most widely used BI software. In a live demonstration, Matt Dahlman will show how to supercharge the BI process by combining Jaspersoft and Revolution R Enterprise, giving business users on-demand access to advanced forecasts and visualizations developed by expert analysts.
Click here to register for the webinar.
Speaker Biographies:
David Smith is the Vice President of Marketing at Revolution Analytics, the leading commercial provider of software and support for the open source “R” statistical computing language. David is the co-author (with Bill Venables) of the official R manual An Introduction to R. He is also the editor of Revolutions (http://blog.revolutionanalytics.com), the leading blog focused on the “R” language, and one of the originating developers of ESS: Emacs Speaks Statistics. You can follow David on Twitter as @revodavid.
Andrew Lampitt is Senior Director of Technology Alliances at Jaspersoft. Andrew is responsible for strategic initiatives and partnerships including cloud business intelligence, advanced analytics, and analytic databases. Prior to Jaspersoft, Andrew held other business positions with Sunopsis (Oracle), Business Objects (SAP), and Sybase (SAP). Andrew earned a BS in engineering from the University of Illinois at Urbana Champaign.
Matthew Dahlman is Jaspersoft’s Business Development Engineer, responsible for technical aspects of technology alliances and regional business development. Matt has held a wide range of technical positions including quality assurance, pre-sales, and technical evangelism with enterprise software companies including Sybase, Netonomy (Comverse), and Sunopsis (Oracle). Matt earned a BA in mathematics from Carleton College in Northfield, Minnesota.
The second widely used BI stack in open source is Pentaho.
You can download it here to evaluate it, or click on the screenshot to read more at
http://sourceforge.net/projects/pentaho/files/Business%20Intelligence%20Server/
Big Data Management and Advanced Analytics
Here is a new list of the top 10 considerations for Big Data Management and using Advanced Analytics, courtesy of AsterData.
Source-
http://www.asterdata.com/wp_10_considerations/index.php?ref=decisionstats
“There are ten strong reasons why competitive organizations are turning to new data management solutions to handle their growing data volumes and evolving analytic needs. This new platform – a ‘data-analytics server’ – merges data storage and data analytics into one single system to conquer the big data challenge.
“Big data storage is handled by a massively parallel database architecture; big data analytics is handled by an integrated analytics engine, so that analytics run fully in-database yielding ultra high performance on large data sets. The analytics engine leverages the powerful analytics framework MapReduce. The results are cost-effective, scalable data storage, ultra high performance and richer data analysis.”
The major considerations themselves are listed in the paper at the link above.
Rapid Miner- R Extension
Here is a new video which shows exactly how you can use Rapid Miner and R together. The advantage of using both together is combining Rapid Miner's GUI (including its flowchart style for data mining) with R's statistical functionality.
From http://rapid-i.com/content/view/219/1/
The web site features a video showing how easily R models and scripts can be integrated into the RapidMiner analysis processes. RapidMiner offers a new R perspective consisting of the known R console together with the great plotting facilities of R. All variables as well as R scripts can be stored in the RapidMiner Repository and used from there, which helps to organize the usually large number of scripts. Furthermore, widely used modeling methods are directly integrated as RapidMiner operators as usual.
“This is a huge step for open source data analysis. RapidMiner offers a great user interface, a clear process structure and lots of ETL and analysis capabilities necessary for real-world problems. R adds a lot of flexibility and many analysis and data manipulation methods. The result is the by far most powerful data transformation and analysis solution worldwide. And this analysis power is now combined with the ease-of-use already known from RapidMiner.” states Dr. Ingo Mierswa, CEO of Rapid-I.
Visit the RCOMM 2010 and learn more about how to integrate analysis and preprocessing methods offered by R as well as how to use the new R perspective offering a full R console and access to all R plotters.
Thus Rapid Miner is one more mainstream software package (after SPSS, SAS, etc.) to add R functionality.