Sector/Sphere – Faster than Hadoop/MapReduce at Terasort

Here is a preview of Sector and Sphere, a relatively young software stack that is claimed to outperform Hadoop/MapReduce on the Terasort benchmark, among others.

http://sector.sourceforge.net/tech.html

System Overview

The Sector/Sphere stack consists of the Sector distributed file system and the Sphere parallel data processing framework. The objective is to support highly effective and efficient large data storage and processing over commodity computer clusters.

Sector/Sphere Architecture

Sector consists of 4 parts, as shown in the above diagram. The security server maintains the system security configurations such as user accounts, data IO permissions, and IP access control lists. The master servers maintain file system metadata, schedule jobs, and respond to users’ requests. Sector supports multiple active masters that can join and leave at run time, and they all actively respond to users’ requests. The slave nodes are racks of computers that store and process data. The slave nodes can be located within a single data center or across multiple data centers with high-speed network connections. Finally, the client includes tools and programming APIs to access and process Sector data.

Sphere: Parallel Data Processing Framework

Sphere allows developers to write parallel data processing applications with a very simple set of APIs. It applies user-defined functions (UDFs) to all input data segments in parallel. In a Sphere application, both inputs and outputs are Sector files. Multiple Sphere processing steps can be combined to support more complicated applications, with inputs and outputs exchanged or shared via the Sector file system.

Data segments are processed at their storage locations whenever possible (data locality). Failed data segments may be restarted on other nodes to achieve fault tolerance.

The Sphere framework can be compared to MapReduce, as they both enforce data locality and provide simplified programming interfaces. In fact, Sphere can simulate any MapReduce operation, but Sphere is more efficient and flexible. Sphere can provide better data locality for applications that process whole files or multiple files as minimum input units, and for applications that involve iterative/combinative processing, which requires coordination of multiple UDFs to obtain the final result.

A Sphere application includes two parts: the client program that organizes inputs (including certain parameters), outputs, and UDFs; and the UDFs that process data segments. Data segmentation, load balancing, and fault tolerance are transparent to developers.
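As a rough illustration of the model described above, and not Sector/Sphere's actual API, the Python sketch below applies a user-defined function to every data segment in parallel and retries a segment that fails. The names `run_udf` and `word_count_udf` and the in-memory `segments` list are hypothetical stand-ins; a real Sphere job reads and writes Sector files and tries to run each UDF on the node that stores the segment.

```python
from concurrent.futures import ThreadPoolExecutor


def word_count_udf(segment):
    """Hypothetical UDF: count the words in one data segment."""
    return len(segment.split())


def run_udf(udf, segments, max_attempts=2):
    """Apply `udf` to every segment in parallel, retrying failures.

    This only mimics the programming model. Data placement, segmentation,
    and load balancing, which Sphere hides from the developer, are not
    modelled here.
    """
    def process(segment):
        last_error = None
        for _ in range(max_attempts):
            try:
                return udf(segment)
            except Exception as exc:  # a failed segment is simply retried
                last_error = exc
        raise last_error

    with ThreadPoolExecutor(max_workers=4) as pool:
        # map() preserves input order, so results line up with segments.
        return list(pool.map(process, segments))


segments = ["a b c", "d e", "f"]
counts = run_udf(word_count_udf, segments)
```

The client side here is just the last two lines: it names the inputs and the UDF, and everything else happens inside `run_udf`, which is the division of labour the paragraph above describes.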

Space: Column-based Distributed Data Table

Space stores data tables in Sector and uses Sphere for parallel query processing. Space is similar to BigTable. Each table is stored by columns and is segmented onto multiple slave nodes. Tables are independent, and no relationships between tables are supported. A reduced set of SQL operations is supported, including but not limited to table creation and modification, key-value update and lookup, and select operations based on UDFs.

Supported by the Sector data placement mechanism and the Sphere parallel processing framework, Space can support efficient key-value lookup and certain SQL queries on very large data tables.

Space is currently still in development.

And just when you thought Hadoop was the only way to be on the cloud.

http://sector.sourceforge.net/benchmark.html

The Terasort Benchmark

The table below lists the performance (total processing time in seconds) of the Terasort benchmark of both Sphere and Hadoop. (Terasort benchmark: suppose there are N nodes in the system, the benchmark generates a 10GB file on each node and sorts the total N*10GB data. Data generation time is excluded.) Note that it is normal to see a longer processing time for more nodes because the total amount of data also increases proportionally.

The performance values listed on this page were achieved using the Open Cloud Testbed. Currently the testbed consists of 4 racks. Each rack has 32 nodes, including 1 NFS server, 1 head node, and 30 compute/slave nodes. The head node is a Dell 1950, dual dual-core Xeon 3.0GHz, 16GB RAM. The compute nodes are Dell 1435s, single dual-core AMD Opteron 2.0GHz, 4GB RAM, and 1TB single disk. The 4 racks are located in JHU (Baltimore), StarLight (Chicago), UIC (Chicago), and Calit2 (San Diego). The inter-rack bandwidth is 10GE, supported by CiscoWave deployed over National Lambda Rail.

Sites	Sphere	Hadoop (3 replicas)	Hadoop (1 replica)
UIC	1265	2889	2252
UIC + StarLight	1361	2896	2617
UIC + StarLight + Calit2	1430	4341	3069
UIC + StarLight + Calit2 + JHU	1526	6675	3702

The benchmark uses the testfs/testdc examples of Sphere and randomwriter/sort examples of Hadoop. Hadoop parameters were tuned to reach good results.
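To make the gap easier to read, here is a small Python sketch that turns the table above into Hadoop-to-Sphere time ratios. The numbers are copied from the table; the variable names are my own, and the ratios are simple arithmetic, not part of the published benchmark.

```python
# Total Terasort times in seconds, copied from the table above:
# sites -> (Sphere, Hadoop with 3 replicas, Hadoop with 1 replica)
results = {
    "UIC": (1265, 2889, 2252),
    "UIC + StarLight": (1361, 2896, 2617),
    "UIC + StarLight + Calit2": (1430, 4341, 3069),
    "UIC + StarLight + Calit2 + JHU": (1526, 6675, 3702),
}

speedups = {}
for sites, (sphere, hadoop_3, hadoop_1) in results.items():
    # A ratio above 1 means Sphere finished faster than Hadoop.
    speedups[sites] = (hadoop_3 / sphere, hadoop_1 / sphere)
    print(f"{sites}: {speedups[sites][0]:.2f}x vs 3 replicas, "
          f"{speedups[sites][1]:.2f}x vs 1 replica")
```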

Updated on Sep. 22, 2009: We have benchmarked the most recent versions of Sector/Sphere (1.24a) and Hadoop (0.20.1) on a new set of servers. Each server node costs $2,200 and consists of a single Intel Xeon E5410 2.4GHz CPU, 16GB RAM, 4*1TB RAID0 disk, and 1Gb/s NIC. The 120 nodes are hosted on 4 racks within the same data center and the inter-rack bandwidth is 20Gb/s.

The table below lists the performance of sorting 1TB of data using Sector/Sphere version 1.24a and Hadoop 0.20.1. Related Hadoop parameters have been tuned for better performance (e.g., big block size), while Sector/Sphere does not require tuning. In addition, to achieve the highest performance, replication is disabled in both systems (note that replication does not affect the performance of Sphere but will significantly decrease the performance of Hadoop).

Number of Racks	Sphere	Hadoop
1	28m 25s	85m 49s
2	15m 20s	37m 0s
3	10m 19s	25m 14s
4	7m 56s	17m 45s
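Because this table sorts a fixed 1TB while adding racks, it also says something about scaling. The Python sketch below converts the times to seconds and computes, for each rack count, how much longer Hadoop took than Sphere and how much faster each system ran than on one rack. The times are copied from the table; the helper `to_seconds` and the dictionary keys are hypothetical names of mine.

```python
import re

# Times to sort 1TB, copied from the table above: racks -> (Sphere, Hadoop)
timings = {
    1: ("28m 25s", "85m 49s"),
    2: ("15m 20s", "37m 0s"),
    3: ("10m 19s", "25m 14s"),
    4: ("7m 56s", "17m 45s"),
}


def to_seconds(text):
    """Convert a string such as '28m 25s' to a number of seconds."""
    minutes, seconds = re.fullmatch(r"(\d+)m (\d+)s", text).groups()
    return int(minutes) * 60 + int(seconds)


base_sphere = to_seconds(timings[1][0])
base_hadoop = to_seconds(timings[1][1])

summary = {}
for racks, (sphere, hadoop) in timings.items():
    s, h = to_seconds(sphere), to_seconds(hadoop)
    summary[racks] = {
        "hadoop_over_sphere": h / s,        # above 1: Sphere was faster
        "sphere_scaling": base_sphere / s,  # speedup relative to 1 rack
        "hadoop_scaling": base_hadoop / h,
    }
    print(racks, {name: round(value, 2) for name, value in summary[racks].items()})
```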

Professors and Patches: For a Betterrrr R

Professors sometimes throw out provocative statements to spark intellectual debate. I have had roughly 1,500 hits in less than 2 days (and I am glad I am on wordpress.com; my old beloved server would have crashed).

The remarks from Ross Ihaka, covered before, also appear at Xian’s blog:

http://xianblog.wordpress.com/2010/09/06/insane/comment-page-4/#comments

Note that most of his remarks are technical, and only a single line refers to Revolution Analytics.

Other senior members of the community (read: professors) are silent, though probably some thought has been ignited behind the scenes.

Ross Ihaka Says:
September 12, 2010 at 1:23 pm

Since (something like) my name has been taken in vain here, let me
chip in.

I’ve been worried for some time that R isn’t going to provide the base
that we’re going to need for statistical computation in the
future. (It may well be that the future is already upon us.) There
are certainly efficiency problems (speed and memory use), but there
are more fundamental issues too. Some of these were inherited from S
and some are peculiar to R.

One of the worst problems is scoping. Consider the following little
gem.

f =
function() {
    if (runif(1) > .5)
        x = 10
    x
}

The x being returned by this function is randomly local or global.
There are other examples where variables alternate between local and
non-local throughout the body of a function. No sensible language
would allow this. It’s ugly and it makes optimisation really
difficult. This isn’t the only problem, even weirder things happen
because of interactions between scoping and lazy evaluation.

In light of this, I’ve come to the conclusion that rather than
“fixing” R, it would be much more productive to simply start over and
build something better. I think the best you could hope for by fixing
the efficiency problems in R would be to boost performance by a small
multiple, or perhaps as much as an order of magnitude. This probably
isn’t enough to justify the effort (Luke Tierney has been working on R
compilation for over a decade now).

To try to get an idea of how much speedup is possible, a number of us
have been carrying out some experiments to see how much better we
could do with something new. Based on prototyping we’ve been doing at
Auckland, it looks like it should be straightforward to get two orders
of magnitude speedup over R, at least for those computations which are
currently bottle-necked. There are a couple of ways to make this
happen.

First, scalar computations in R are very slow. This is in part because
the R interpreter is very slow, but also because there are no scalar
types. By introducing scalars and using compilation it looks like it’s
possible to get a speedup by a factor of several hundred for scalar
computations. This is important because it means that many ghastly
uses of array operations and the apply functions could be replaced by
simple loops. The cost of these improvements is that scope
declarations become mandatory and (optional) type declarations are
necessary to help the compiler.

As a side-effect of compilation and the use of type-hinting it should
be possible to eliminate dispatch overhead for certain (sealed)
classes (scalars and arrays in particular). This won’t bring huge
benefits across the board, but it will mean that you won’t have to do
foreign language calls to get efficiency.

A second big problem is that computations on aggregates (data frames
in particular) run at glacial rates. This is entirely down to
unnecessary copying because of the call-by-value semantics.
Preserving call-by-value semantics while eliminating the extra copying
is hard. The best we can probably do is to take a conservative
approach. R already tries to avoid copying where it can, but fails in
an epic fashion. The alternative is to abandon call-by-value and move
to reference semantics. Again, prototyping indicates that several
hundredfold speedup is possible (for data frames in particular).

The changes in semantics mentioned above mean that the new language
will not be R. However, it won’t be all that far from R and it should
be easy to port R code to the new system, perhaps using some form of
automatic translation.

If we’re smart about building the new system, it should be possible to
make use of multi-cores and parallelism. Adding this to the mix might just
make it possible to get a three order-of-magnitude performance boost
with just a fraction of the memory that R uses. I think it’s something
really worth putting some effort into.

I also think one other change is necessary. The license will need to do a
better job of protecting work donated to the commons than GPL2 seems
to have done. I’m not willing to have any more of my work purloined by
the likes of Revolution Analytics, so I’ll be looking for better
protection from the license (and being a lot more careful about who I
work with).

The discussion spilled over to Stack Overflow as well:

http://stackoverflow.com/questions/3706990/is-r-that-bad-that-it-should-be-rewritten-from-scratch/3710667#3710667

In the past week I’ve been following a discussion where Ross Ihaka wrote (here):

I’ve been worried for some time that R isn’t going to provide the base that we’re going to need for statistical computation in the future. (It may well be that the future is already upon us.) There are certainly efficiency problems (speed and memory use), but there are more fundamental issues too. Some of these were inherited from S and some are peculiar to R.

He then continued explaining. This discussion started from this post, and was then followed by comments here, here, here, here, here, here and maybe some more places I don’t know of.

We all know the problem now.

R can be improved substantially in terms of speed.

For some solutions, here are the patches by Radford Neal:

http://www.cs.toronto.edu/~radford/speed-patches-doc

patch-dollar

Speeds up access to lists, pairlists, and environments using the
$ operator. The speedup comes mainly from avoiding the overhead of
calling DispatchOrEval if there are no complexities, from passing
on the field to extract as a symbol, or a name, or both, as available,
and then converting only as necessary, from simplifying and inlining
the pstrmatch procedure, and from not translating strings multiple
times.

Relevant timing test script: test-dollar.r

This test shows about a 40% decrease in the time needed to extract
elements of lists and environments.

Changes unrelated to speed improvement:

A small error-reporting bug is fixed, illustrated by the following
output with r52822:

> options(warnPartialMatchDollar=TRUE)
> pl <- pairlist(abc=1,def=2)
> pl$ab
[1] 1
Warning message:
In pl$ab : partial match of 'ab' to ''

Some code is changed at the end of R_subset3_dflt because it seems
to be more correct, as discussed in code comments.

patch-evalList

Speeds up a large number of operations by avoiding allocation of
an extra CONS cell in the procedures for evaluating argument lists.

Relevant timing test scripts: all of them, but will look at test-em.r

On test-em.r, the speedup from this patch is about 5%.

patch-fast-base

Speeds up lookup of symbols defined in the base environment, by
flagging symbols that have a base environment definition recorded
in the global cache. This allows the definition to be retrieved
quickly without looking in the hash table.

Relevant timing test scripts: all of them, but will look at test-em.r

On test-em.r, the speedup from this patch is about 3%.

Issue: This patch uses the "spare" bit for the flag. This bit is
misnamed, since it is already used elsewhere (for closures). It is
possible that one of the "gp" bits should be used instead. The
"gp" bits should really be divided up for faster access, and so that
their present use is apparent in the code.

In case this use of the "spare" bit proves unwise, the patch code is
conditional on FAST_BASE_CACHE_LOOKUP being defined at the start of
envir.c.

patch-fast-spec

Speeds up lookup of function symbols that begin with a character
other than a letter or ".", by allowing fast bypass of non-global
environments that do not contain (and have never contained) symbols
of this sort. Since it is expected that only functions will be
given names of this sort, the check is done only in findFun, though
it could also be done in findVar.

Relevant timing test scripts: all of them, but will look at test-em.r

On test-em.r, the speedup from this patch is about 8%.

Issue: This patch uses the "spare" bit to flag environments known
to not have symbols starting with a special character. See remarks
on patch-fast-base.

In case this use of the "spare" bit proves unwise, the patch code is
conditional on FAST_SPEC_BYPASS being defined at the start of envir.c.

patch-for

Speeds up for loops by not allocating new space for the loop
variable every iteration, unless necessary.

Relevant timing test script: test-for.r

This test shows a speedup of about 5%.

Change unrelated to speed improvement:

Fixes what I consider to be a bug, in which the loop clobbers a
global variable, as demonstrated by the following output with r52822:

> i <- 99
> f <- function () for (i in 1:3) { print(i); if (i==2) rm(i); }
> f()
[1] 1
[1] 2
[1] 3
> print(i)
[1] 3

patch-matprod

Speeds up matrix products, including vector dot products. The
speed issue here is that the R code checks for any NAs, and
does the multiply in the matprod procedure (in array.c) if so,
since BLAS isn't trusted with NAs. If this check takes about
as long as just doing the multiply in matprod, calling a BLAS
routine makes no sense.

Relevant time test script: test-matprod.r

With no external BLAS, this patch speeds up long vector-vector
products by a factor of about six, matrix-vector products by a
factor of about three, and some matrix-matrix products by a
factor of about two.

Issue: The matrix multiply code in matprod uses an LDOUBLE
(long double) variable to accumulate sums, for improved accuracy.
On a SPARC system I tested on, operations on long doubles are
vastly slower than on doubles, so that the patch produces a
large slowdown rather than an improvement. This is also an issue
for the "sum" function, which also uses an LDOUBLE to accumulate
the sum. Perhaps an ordinary double should be used in these
places, or perhaps the configuration script should define LDOUBLE
as double on architectures where long doubles are extraordinarily
slow.

Due to this issue, not defining MATPROD_CAN_BE_DONE_HERE at the
start of array.c will disable this patch.

patch-parens

Speeds up parentheses by making "(" a special operator whose
argument is not evaluated, thereby bypassing the overhead of
evalList. Also slightly speeds up curly brackets by inlining
a function that is stylistically better inline anyway.

Relevant test script: test-parens.r

In the parens part of test-parens.r, the speedup is about 9%.

patch-protect

Speeds up numerous operations by making PROTECT, UNPROTECT, etc.
be mostly macros in the files in src/main. This takes effect
only for files that include Defn.h after defining the symbol
USE_FAST_PROTECT_MACROS. With these macros, code of the form
v = PROTECT(...) must be replaced by PROTECT(v = ...).

Relevant timing test scripts: all of them, but will look at test-em.r

On test-em.r, the speedup from this patch is about 9%.

patch-save-alloc

Speeds up some binary and unary arithmetic operations by, when
possible, using the space holding one of the operands to hold
the result, rather than allocating new space. Though primarily
a speed improvement, for very long vectors avoiding this allocation
could avoid running out of space.

Relevant test script: test-complex-expr.r

On this test, the speedup is about 5% for scalar operands and about
8% for vector operands.

Issues: There are some tricky issues with attributes, but I think
I got them right. This patch relies on NAMED being set correctly
in the rest of the code. In case it isn't, the patch can be disabled
by not defining AVOID_ALLOC_IF_POSSIBLE at the top of arithmetic.c.

patch-square

Speeds up a^2 when a is a long vector by not checking for the
special case of an exponent of 2 over and over again for every
vector element.

Relevant test script: test-square.r

The time for squaring a long vector is reduced in this test by a
factor of more than five.
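The idea behind this patch is ordinary loop-invariant hoisting. The Python sketch below is only an analogy of mine, not the C code from the patch: the first function repeats the "is the exponent 2?" test for every element, the second makes the test once and then runs a tight loop. Both function names are hypothetical.

```python
def power_checked_per_element(values, exponent):
    """Analogy for the old behaviour: test for exponent 2 on every element."""
    out = []
    for v in values:
        if exponent == 2:      # same test repeated for each element
            out.append(v * v)
        else:
            out.append(v ** exponent)
    return out


def power_checked_once(values, exponent):
    """Analogy for the patched behaviour: test once, then loop."""
    if exponent == 2:
        return [v * v for v in values]
    return [v ** exponent for v in values]


data = [1.5, 2.0, -3.0]
same_result = power_checked_per_element(data, 2) == power_checked_once(data, 2)
```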

patch-sum-prod

Speeds up the "sum" and "prod" functions by not checking for NA
when na.rm=FALSE, and other detailed code improvements.

Relevant test script: test-sum-prod.r

For sum, the improvement is about a factor of 2.5 when na.rm=FALSE,
and about 10% when na.rm=TRUE.

Issue: See the discussion of patch-matprod regarding LDOUBLE.
There is no change regarding this issue due to this patch, however.

patch-transpose

Speeds up the transpose operation (the "t" function) from detailed
code improvements.

Relevant test script: test-transpose.r

The improvement for 200x60 matrices is about a factor of two.
There is little or no improvement for long row or column vectors.

patch-vec-arith

Speeds up arithmetic on vectors of the same length, or when one
vector is of length one. This is done with detailed code improvements.

Relevant test script: test-vec-arith.r

On long vectors, the +, -, and * operators are sped up by about
20% when operands are the same length or one operand is of length one.

Rather mysteriously, when the operands are not length one or the
same length, there is about a 20% increase in time required, though
this may be due to some strange C optimizer peculiarity or some
strange cache effect, since the C code for this is the same as before,
with negligible additional overhead getting to it. Regardless, this
case is much less common than equal lengths or length one.

There is little change for the / operator, which is much slower than
+, -, or *.

patch-vec-subset

Speeds up extraction of subsets of vectors or matrices (eg, v[10:20]
or M[1:10,101:110]). This is done with detailed code improvements.

Relevant test script: test-vec-subset.r

There are lots of tests in this script. The most dramatic improvement
is for extracting many rows and columns of a large array, where the
improvement is by about a factor of four. Extracting many rows from
one column of a matrix is sped up by about 30%.

Changes unrelated to speed improvement:

Fixes two latent bugs where the code incorrectly refers to NA_LOGICAL
when NA_INTEGER is appropriate and where LOGICAL and INTEGER types
are treated as interchangeable. These cause no problems at the moment,
but would if representations were changed.

patch-subscript

(Formerly part of patch-vec-subset) This patch also speeds up
extraction, and also replacement, of subsets of vectors or
matrices, but focuses on the creation of the indexes rather than
the copy operations. Often avoids a duplication (see below) and
eliminates a second scan of the subscript vector for zero
subscripts, folding it into a previous scan at no additional cost.

Relevant test script: test-vec-subset.r

Speeds up some operations with scalar or short vector indexes by
about 10%. Speeds up subscripting with a longer vector of
positive indexes by about 20%.

Issues: The current code duplicates a vector of indexes when it
seems unnecessary. Duplication is for two reasons: to handle
the situation where the index vector is itself being modified in
a replace operation, and so that any attributes can be removed, which
is helpful only for string subscripts, given how the routine to handle
them returns information via an attribute. Duplication for the
second reason can easily be avoided, so I avoided it. The first
reason for duplication is sometimes valid, but can usually be avoided
by first only doing it if the subscript is to be used for replacement
rather than extraction, and second only doing it if the NAMED field
for the subscript isn't zero.

I also removed two layers of procedure call overhead (passing seven
arguments, so not trivial) that seemed to be doing nothing. Probably
it used to do something, but no longer does, but if instead it is
preparation for some future use, then removing it might be a mistake.

In my opinion, software problems are best solved by writing code or patches rather than by discussing them endlessly.
Some other solutions to a BETTERRRR R:
1) Complete Code Design Review
2) Version 3 - Tuneup
3) Better Documentation
4) Suing Revolution Analytics for the code - Hand over da code pardner

Dryad – Microsoft's answer to MR

While reading across the internet I came across Microsoft’s answer to MapReduce, called Dryad, which has been around for some time but has not generated quite the buzz that Hadoop or MapReduce have.

http://research.microsoft.com/en-us/projects/dryadlinq/

DryadLINQ

DryadLINQ is a simple, powerful, and elegant programming environment for writing large-scale data parallel applications running on large PC clusters.

Overview

New! An academic release of Dryad/DryadLINQ is now available for public download.

The goal of DryadLINQ is to make distributed computing on large compute clusters simple enough for every programmer. DryadLINQ combines two important pieces of Microsoft technology: the Dryad distributed execution engine and the .NET Language Integrated Query (LINQ).

Dryad provides reliable, distributed computing on thousands of servers for large-scale data parallel applications. LINQ enables developers to write and debug their applications in a SQL-like query language, relying on the entire .NET library and using Visual Studio.

DryadLINQ translates LINQ programs into distributed Dryad computations:

C# and LINQ data objects become distributed partitioned files.

LINQ queries become distributed Dryad jobs.

C# methods become code running on the vertices of a Dryad job.

DryadLINQ has the following features:

Declarative programming: computations are expressed in a high-level language similar to SQL

Automatic parallelization: from sequential declarative code the DryadLINQ compiler generates highly parallel query plans spanning large computer clusters. For exploiting multi-core parallelism on each machine DryadLINQ relies on the PLINQ parallelization framework.

Integration with Visual Studio: programmers in DryadLINQ take advantage of the comprehensive VS set of tools: Intellisense, code refactoring, integrated debugging, build, source code management.

Integration with .Net: all .Net libraries, including Visual Basic, and dynamic languages are available.

and

Conciseness: the following line of code is a complete implementation of the Map-Reduce computation framework in DryadLINQ:
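The DryadLINQ line itself is not reproduced in this post. As an analogy only, and not DryadLINQ code, the same shape can be sketched in Python: map each input to several records, group the records by key, and reduce each group. The function name `map_reduce`, its parameters, and the word-count example are my own assumptions.

```python
from collections import defaultdict


def map_reduce(source, mapper, key_selector, reducer):
    """Flat-map every input, group the records by key, reduce each group."""
    groups = defaultdict(list)
    for item in source:
        for record in mapper(item):
            groups[key_selector(record)].append(record)
    return [reducer(key, records) for key, records in groups.items()]


# Hypothetical usage: a word count over two lines of text.
lines = ["a b a", "b c"]
counts = dict(map_reduce(
    lines,
    mapper=lambda line: line.split(),
    key_selector=lambda word: word,
    reducer=lambda key, records: (key, len(records)),
))
```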

and http://research.microsoft.com/en-us/projects/dryad/

Dryad

The Dryad Project is investigating programming models for writing parallel and distributed programs to scale from a small cluster to a large data-center.

Overview

New! An academic release of DryadLINQ is now available for public download.

Dryad is an infrastructure which allows a programmer to use the resources of a computer cluster or a data center for running data-parallel programs. A Dryad programmer can use thousands of machines, each of them with multiple processors or cores, without knowing anything about concurrent programming.

The Structure of Dryad Jobs

A Dryad programmer writes several sequential programs and connects them using one-way channels. The computation is structured as a directed graph: programs are graph vertices, while the channels are graph edges. A Dryad job is a graph generator which can synthesize any directed acyclic graph. These graphs can even change during execution, in response to important events in the computation.

Dryad is quite expressive. It completely subsumes other computation frameworks, such as Google’s map-reduce, or the relational algebra. Moreover, Dryad handles job creation and management, resource management, job monitoring and visualization, fault tolerance, re-execution, scheduling, and accounting.

The Dryad Software Stack

As a proof of Dryad’s versatility, a rich software ecosystem has been built on top of Dryad:

SSIS on Dryad executes many instances of SQL server, each in a separate Dryad vertex, taking advantage of Dryad’s fault tolerance and scheduling. This system is currently deployed in a live production system as part of one of Microsoft’s AdCenter log processing pipelines.

DryadLINQ generates Dryad computations from the LINQ Language-Integrated Query extensions to C#.

The distributed shell is a generalization of the pipe concept from the Unix shell. If Unix pipes allow the construction of one-dimensional (1-D) process structures, the distributed shell allows the programmer to build 2-D structures in a scripting language. The distributed shell generalizes Unix pipes in three ways:

It allows processes to easily connect multiple file descriptors of each process — hence the 2-D aspect.

It allows the construction of pipes spanning multiple machines, across a cluster.

It virtualizes the pipelines, allowing the execution of pipelines with many more processes than available machines, by time-multiplexing processors and buffering results.

Several languages are compiled to distributed shell processes. PSQL is an early version, recently replaced with Scope.

Publications

Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks
Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly
European Conference on Computer Systems (EuroSys), Lisbon, Portugal, March 21-23, 2007

Video of a presentation on Dryad at the Google Campus, given by Michael Isard, Nov 1, 2007.

Also interesting to read:

Why does Dryad use a DAG?

The basic computational model we decided to adopt for Dryad is the directed-acyclic graph (DAG). Each node in the graph is a computation, and each edge in the graph is a stream of data traveling in the direction of the edge. The amount of data on any given edge is assumed to be finite, the computations are assumed to be deterministic, and the inputs are assumed to be immutable. This isn’t by any means a new way of structuring a distributed computation (for example Condor had DAGMan long before Dryad came along), but it seemed like a sweet spot in the design space given our other constraints.

So, why is this a sweet spot? A DAG is very convenient because it induces an ordering on the nodes in the graph. That makes it easy to design scheduling policies, since you can define a node to be ready when its inputs are available, and at any time you can choose to schedule as many ready nodes as you like in whatever order you like, and as long as you always have at least one scheduled you will continue to make progress and never deadlock. It also makes fault-tolerance easy, since given our determinism and immutability assumptions you can backtrack as far as you want in the DAG and re-execute as many nodes as you like to regenerate intermediate data that has been lost or is unavailable due to cluster failures.
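The scheduling rule described above (a node is ready once its inputs are available) can be sketched in a few lines of Python with the standard library's `graphlib` module. This is a toy illustration under my own assumptions, not Dryad's scheduler: the `job` graph, its vertex names, and the `run` function are hypothetical, and vertices run one at a time here, whereas a cluster would dispatch every ready vertex in parallel.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical job: each vertex maps to the set of vertices it reads from.
job = {
    "read_a": set(),
    "read_b": set(),
    "join": {"read_a", "read_b"},
    "aggregate": {"join"},
}


def run(job, execute):
    """Run vertices in dependency order and return the order used."""
    sorter = TopologicalSorter(job)
    sorter.prepare()  # raises CycleError if the graph is not acyclic
    order = []
    while sorter.is_active():
        # get_ready() returns vertices whose inputs have all been marked done.
        for vertex in sorter.get_ready():
            execute(vertex)
            order.append(vertex)
            sorter.done(vertex)  # may make downstream vertices ready
    return order


order = run(job, execute=lambda vertex: None)
```

Because the vertices are assumed deterministic and their inputs immutable, recovering from a lost intermediate result amounts to running the same loop again over the affected vertex and whichever of its ancestors also lost their outputs.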

from

http://blogs.msdn.com/b/dryad/archive/2010/07/23/why-does-dryad-use-a-dag.aspx
