Sector/ Sphere – Faster than Hadoop/Mapreduce at Terasort

Here is a preview of a relatively young software Sector and Sphere- which are claimed to be better than Hadoop /MapReduce at TeraSort Benchmark among others.

System Overview

The Sector/Sphere stack consists of the Sector distributed file system and the Sphere parallel data processing framework. The objective is to support highly effective and efficient large data storage and processing over commodity computer clusters.

Sector/Sphere Architecture

Sector consists of 4 parts, as shown in the above diagram. The Security server maintains the system security configurations such as user accounts, data IO permissions, and IP access control lists. The master servers maintain file system metadata, schedule jobs, and respond users’ requests. Sector supports multiple active masters that can join and leave at run time and they all actively respond users’ requests. The slave nodes are racks of computers that store and process data. The slaves nodes can be located within a single data center to across multiple data centers with high speed network connections. Finally, the client includes tools and programming APIs to access and process Sector data.

Sphere: Parallel Data Processing Framework

Sphere allows developers to write parallel data processing applications with a very simple set of API. It applies user-defined functions (UDF) on all input data segments in parallel. In a Sphere application, both inputs and outputs are Sector files. Multiple Sphere processing can be combined to support more complicated applications, with inputs/outputs exchanged/shared via the Sector file system.

Data segments are processed at their storage locations whenever possible (data locality). Failed data segments may be restarted on other nodes to achieve fault tolerance.

The Sphere framework can be compared to MapReduce as they both enforce data locality and provide simplified programming interfaces. In fact, Sphere can simulate any MapReduce operations, but Sphere is more efficient and flexible. Sphere can provide better data locality for applications that process files or multiple files as minimum input units and for applications that involve with iterative/combinative processing, which requires coordination of multiple UDFs to obtain the final result.

A Sphere application includes two parts: the client program that organizes inputs (including certain parameters), outputs, and UDFs; and the UDFs that process data segments. Data segmentation, load balancing, and fault tolerance are transparent to developers.

Space: Column-based Distbuted Data Table

Space stores data tables in Sector and uses Sphere for parallel query processing. Space is similar to BigTable. Table is stored by columns and is segmented on to multiple slave nodes. Tables are independent and no relationship between tables are supported. A reduced set of SQL operations is supported, including but not limited to table creation and modification, key-value update and lookup, and select operations based on UDF.

Supported by the Sector data placement mechanism and the Sphere parallel processing framework, Space can support efficient key-value lookup and certain SQL queries on very large data tables.

Space is currently still in development.

and just when you thought Hadoop was the only way to be on the cloud.

The Terasort Benchmark

The table below lists the performance (total processing time in seconds) of the Terasort benchmark of both Sphere and Hadoop. (Terasort benchmark: suppose there are N nodes in the system, the benchmark generates a 10GB file on each node and sorts the total N*10GB data. Data generation time is excluded.) Note that it is normal to see a longer processing time for more nodes because the total amount of data also increases proportionally.

The performance value listed in this page was achieved using the Open Cloud Testbed. Currently the testbed consists of 4 racks. Each rack has 32 nodes, including 1 NFS server, 1 head node, and 30 compute/slave nodes. The head node is a Dell 1950, dual dual-core Xeon 3.0GHz, 16GB RAM. The compute nodes are Dell 1435s, single dual core AMD Opteron 2.0GHz, 4GB RAM, and 1TB single disk. The 4 racks are located in JHU (Baltimore), StarLight (Chicago), UIC (Chicago), and Calit2(San Diego). The inter-rack bandwidth is 10GE, supported by CiscoWave deployed over National Lambda Rail.

Hadoop (3 replicas)
Hadoop (1 replica)
1265 2889 2252
UIC + StarLight
1361 2896 2617
UIC + StarLight + Calit2
1430 4341 3069
UIC + StarLight + Calit2 + JHU
1526 6675 3702

The benchmark uses the testfs/testdc examples of Sphere and randomwriter/sort examples of Hadoop. Hadoop parameters were tuned to reach good results.

Updated on Sep. 22, 2009: We have benchmarked the most recent versions of Sector/Sphere (1.24a) and Hadoop (0.20.1) on a new set of servers. Each server node costs $2,200 and consits of a single Intel Xeon E5410 2.4GHz CPU, 16GB RAM, 4*1TB RAID0 disk, and 1Gb/s NIC. The 120 nodes are hosted on 4 racks within the same data center and the inter-rack bandwidth is 20Gb/s.

The table below lists the performance of sorting 1TB data using Sector/Sphere version 1.24a and Hadoop 0.20.1. Related Hadoop parameters have been tuned for better performance (e.g., big block size), while Sector/Sphere does not require tuning. In addition, to achieve the highest performance, replication is disabled in both systems (note that replication does not afftect the performance of Sphere but will significantly decrease the performance of Hadoop).

Number of Racks
28m 25s 85m 49s
15m 20s 37m 0s
10m 19s 25m 14s
7m 56s 17m 45s