Sector/Sphere – Faster than Hadoop/MapReduce at TeraSort

Here is a preview of a relatively young software stack, Sector and Sphere, which is claimed to beat Hadoop/MapReduce at the TeraSort benchmark, among others.

http://sector.sourceforge.net/tech.html

System Overview

The Sector/Sphere stack consists of the Sector distributed file system and the Sphere parallel data processing framework. The objective is to support highly effective and efficient large data storage and processing over commodity computer clusters.

Sector/Sphere Architecture

Sector consists of four parts. The Security server maintains the system security configurations such as user accounts, data I/O permissions, and IP access control lists. The master servers maintain file system metadata, schedule jobs, and respond to users’ requests. Sector supports multiple active masters that can join and leave at run time, and they all actively respond to users’ requests. The slave nodes are racks of computers that store and process data. The slave nodes can be located within a single data center or spread across multiple data centers with high-speed network connections. Finally, the client includes tools and programming APIs to access and process Sector data.

Sphere: Parallel Data Processing Framework

Sphere allows developers to write parallel data processing applications with a very simple set of APIs. It applies user-defined functions (UDFs) to all input data segments in parallel. In a Sphere application, both inputs and outputs are Sector files. Multiple Sphere processing steps can be combined to support more complicated applications, with inputs/outputs exchanged/shared via the Sector file system.

Data segments are processed at their storage locations whenever possible (data locality). Failed data segments may be restarted on other nodes to achieve fault tolerance.

The Sphere framework can be compared to MapReduce, as both enforce data locality and provide simplified programming interfaces. In fact, Sphere can simulate any MapReduce operation, but Sphere is more efficient and flexible. Sphere can provide better data locality for applications that process files (or multiple files) as the minimum input unit, and for applications that involve iterative/combinative processing, which requires coordination of multiple UDFs to obtain the final result.

A Sphere application includes two parts: the client program that organizes inputs (including certain parameters), outputs, and UDFs; and the UDFs that process data segments. Data segmentation, load balancing, and fault tolerance are transparent to developers.
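Sphere’s native API is a C++ library, but the programming model is easy to picture. Purely as a conceptual sketch (in R, not Sphere’s actual interface; the directory names and the toy UDF are hypothetical), applying one user-defined function to every input segment in parallel, with files as both inputs and outputs, looks roughly like this:

library(parallel)

# hypothetical list of Sector-style data segments stored as files
segments <- list.files("input_dir", full.names = TRUE)

# the "UDF": processes one segment and writes its output back as a file
udf <- function(segment_file) {
  dat <- readLines(segment_file)          # read one data segment
  out <- toupper(dat)                     # stand-in for the real per-segment work
  out_file <- file.path("output_dir", basename(segment_file))
  writeLines(out, out_file)
  out_file                                # the output is again a file
}

# apply the UDF to all segments in parallel; Sphere additionally handles
# data locality, load balancing and restarting failed segments for you
result_files <- mclapply(segments, udf, mc.cores = 4)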

Space: Column-based Distributed Data Table

Space stores data tables in Sector and uses Sphere for parallel query processing. Space is similar to BigTable. Tables are stored by column and segmented onto multiple slave nodes. Tables are independent, and relationships between tables are not supported. A reduced set of SQL operations is supported, including but not limited to table creation and modification, key-value update and lookup, and select operations based on UDFs.

Supported by the Sector data placement mechanism and the Sphere parallel processing framework, Space can support efficient key-value lookup and certain SQL queries on very large data tables.

Space is currently still in development.

and just when you thought Hadoop was the only way to be on the cloud.

http://sector.sourceforge.net/benchmark.html

The TeraSort Benchmark

The table below lists the performance (total processing time in seconds) of the TeraSort benchmark for both Sphere and Hadoop. (TeraSort benchmark: suppose there are N nodes in the system; the benchmark generates a 10GB file on each node and sorts the total N*10GB of data. Data generation time is excluded.) Note that it is normal to see a longer processing time for more nodes, because the total amount of data also increases proportionally.

The performance values listed on this page were achieved using the Open Cloud Testbed. Currently the testbed consists of 4 racks. Each rack has 32 nodes, including 1 NFS server, 1 head node, and 30 compute/slave nodes. The head node is a Dell 1950, dual dual-core Xeon 3.0GHz, 16GB RAM. The compute nodes are Dell 1435s, single dual-core AMD Opteron 2.0GHz, 4GB RAM, and a single 1TB disk. The 4 racks are located at JHU (Baltimore), StarLight (Chicago), UIC (Chicago), and Calit2 (San Diego). The inter-rack bandwidth is 10GE, supported by CiscoWave deployed over National Lambda Rail.

Racks used                        Sphere   Hadoop (3 replicas)   Hadoop (1 replica)
UIC                               1265     2889                  2252
UIC + StarLight                   1361     2896                  2617
UIC + StarLight + Calit2          1430     4341                  3069
UIC + StarLight + Calit2 + JHU    1526     6675                  3702

The benchmark uses the testfs/testdc examples of Sphere and randomwriter/sort examples of Hadoop. Hadoop parameters were tuned to reach good results.

Updated on Sep. 22, 2009: We have benchmarked the most recent versions of Sector/Sphere (1.24a) and Hadoop (0.20.1) on a new set of servers. Each server node costs $2,200 and consists of a single Intel Xeon E5410 2.4GHz CPU, 16GB RAM, 4*1TB RAID0 disk, and 1Gb/s NIC. The 120 nodes are hosted on 4 racks within the same data center and the inter-rack bandwidth is 20Gb/s.

The table below lists the performance of sorting 1TB of data using Sector/Sphere version 1.24a and Hadoop 0.20.1. Related Hadoop parameters have been tuned for better performance (e.g., a large block size), while Sector/Sphere does not require tuning. In addition, to achieve the highest performance, replication is disabled in both systems (note that replication does not affect the performance of Sphere but will significantly decrease the performance of Hadoop).

Number of Racks   Sphere    Hadoop
1                 28m 25s   85m 49s
2                 15m 20s   37m 0s
3                 10m 19s   25m 14s
4                 7m 56s    17m 45s

R on Windows HPC Server

From HPC Wire, the newsletter/site for all HPC news-

Source- Link

PALO ALTO, Calif., Sept. 20 — Revolution Analytics, the leading commercial provider of software and support for the popular open source R statistics language, today announced it will deliver Revolution R Enterprise for Microsoft Windows HPC Server 2008 R2, released today, enabling users to analyze very large data sets in high-performance computing environments.

R is a powerful open source statistics language and the modern system for predictive analytics. Revolution Analytics recently introduced RevoScaleR, new “Big Data” analysis capabilities, to its R distribution, Revolution R Enterprise. RevoScaleR solves the performance and capacity limitations of the R language with parallelized algorithms that stream data across multiple cores on a laptop, workstation or server. Users can now process, visualize and model terabyte-class data sets at top speeds — without the need for specialized hardware.
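As a hedged sketch of what that workflow looks like (the RevoScaleR function names below are as documented by Revolution Analytics, while the file name, columns and formula are hypothetical):

library(RevoScaleR)

# import a large CSV into the chunked XDF binary format that RevoScaleR streams over
rxImport(inData = "transactions.csv", outFile = "transactions.xdf", overwrite = TRUE)

# fit a linear model block by block across the available cores, without loading
# the whole data set into memory
fit <- rxLinMod(amount ~ region + channel, data = "transactions.xdf")
summary(fit)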

“Revolution Analytics is pleased to support Microsoft’s Technical Computing initiative, whose efforts will benefit scientists, engineers and data analysts,” said David Champagne, CTO at Revolution. “We believe the engineering we have done for Revolution R Enterprise, in particular our work on big-data statistics and multicore computing, along with Microsoft’s HPC platform for technical computing, makes an ideal combination for high-performance large scale statistical computing.”

“Processing and analyzing this ‘big data’ is essential to better prediction and decision making,” said Bill Hamilton, director of technical computing at Microsoft Corp. “Revolution R Enterprise for Windows HPC Server 2008 R2 gives customers an extremely powerful tool that handles analysis of very large data and high workloads.”

To learn more about Revolution R Enterprise and its Big Data capabilities, download the white paper. Revolution Analytics also has an on-demand webcast, “High-performance analytics with Revolution R and Windows HPC Server,” available online.

AND from Microsoft’s website

http://www.microsoft.com/hpc/en/us/solutions/hpc-for-life-sciences.aspx

REvolution R Enterprise

REvolution Computing

REvolution R Enterprise is designed for both novice and experienced R users looking for a production-grade R distribution to perform mission-critical predictive analytics tasks right from the desktop and scale across multiprocessor environments. It features RPE™, REvolution’s R Productivity Environment for Windows.

Of course, R Enterprise is available on Linux, but only on Red Hat Enterprise Linux; it would be nice to see Amazon Machine Images as well as Ubuntu versions.

An Amazon Machine Image (AMI) is a special type of virtual appliance which is used to instantiate (create) a virtual machine within the Amazon Elastic Compute Cloud. It serves as the basic unit of deployment for services delivered using EC2.[1]

Like all virtual appliances, the main component of an AMI is a read-only filesystem image which includes an operating system (e.g., Linux, UNIX, or Windows) and any additional software required to deliver a service or a portion of it.[2]

The AMI filesystem is compressed, encrypted, signed, split into a series of 10MB chunks and uploaded into Amazon S3 for storage. An XML manifest file stores information about the AMI, including name, version, architecture, default kernel id, decryption key and digests for all of the filesystem chunks.

An AMI does not include a kernel image, only a pointer to the default kernel id, which can be chosen from an approved list of safe kernels maintained by Amazon and its partners (e.g., RedHat, Canonical, Microsoft). Users may choose kernels other than the default when booting an AMI.[3]

Types of images

  • Public: an AMI image that can be used by anyone.
  • Paid: a for-pay AMI image that is registered with Amazon DevPay and can be used by anyone who subscribes to it. DevPay allows developers to mark up Amazon’s usage fees and optionally add monthly subscription fees.

September Roundup by Revolution

From the monthly newsletter, which I consider quite useful for keeping updated on applications of R.

——————————————————

Revolution News
Every month, we’ll bring you the latest news about Revolution’s products and events in this section.
Follow us on Twitter at @RevolutionR for up-to-the-minute news and updates from Revolution Analytics!

Revolution R Enterprise 4.0 for Windows now available. Based on the latest R 2.11.1 and including the RevoScaleR package for big-data analysis in R, Revolution R Enterprise is now available for download for Windows 32-bit and 64-bit systems. Click here to subscribe; it is also available free to academia.

New! Integrate R with web applications, BI dashboards and more with web services. RevoDeployR is a new Web Services framework that integrates dynamic R-based computations into applications for business users. It will be available September 30 with Revolution R Enterprise Server on RHEL 5. Click here to learn more.

Free Webinar, September 22: In a joint webinar from Revolution Analytics and Jaspersoft, learn how to use RevoDeployR to integrate advanced analytics on-demand in applications, BI dashboards, and on the web. Register here.

Revolution in the News:
SearchBusinessAnalytics.com previews the forthcoming Revolution R GUI; Channel Register introduces RevoDeployR, while IT Business Edge shows off the Web Services architecture; and ReadWriteWeb.com looks at how RevoScaleR tackles the Big Data explosion.

Inside-R: A new site for the R Community. At www.inside-R.org you’ll find the latest information about R from around the Web, searchable R documentation and packages, hints and tips about R, and more. You can even add a “Download R” badge to your own web-page to help spread the word about R.

R News, Tips and Tricks from the Revolutions blog
The Revolutions blog brings you daily news and tips about R, statistics and open source. Here are some highlights from Revolutions from the past month.

R’s key role in the oil spill response: Read how NIST’s Division Chief of Statistical Engineering used R to provide critical analysis in real time to the Secretaries of Energy and the Interior, and helped coordinate the government’s response.

Animating data with R and Google Earth: Learn how to use R to create animated visualizations of geographical data with Google Earth, such as this video showing how tuna migrations intersect with the location of the Gulf oil spill.

Are baseball games getting longer? Or is it just Red Sox games? Ryan Elmore uses nonparametric regression in R to find out.

Keynote presentations from useR! 2010: the worldwide R users’ conference was a great success, and there’s a wealth of useful tips and information in the presentations. Videos of the keynote presentations are available too: check out in particular Frank Harrell’s talk Information Allergy, and Friedrich Leisch’s talk on reproducible statistical research.

Looking for more R tips and tricks? Check out the monthly round-ups at the Revolutions blog.

Upcoming Events
Every month, we’ll highlight some upcoming events from R Community Calendar.

September 23: The San Diego R User Group has a meetup on BioConductor and microarray data analysis.

September 28: The Sydney Users of R Forum has a meetup on building world-class predictive models in R (with dinner to follow).

September 28: The Los Angeles R User Group presents an introduction to statistical finance with R.

September 28: The Seattle R User Group meets to discuss, “What are you doing with R?”

September 29: The Raleigh-Durham-Chapel Hill R Users Group has its first meeting.

October 7: The NYC R User Group features a presentation by Prof. Andrew Gelman.

There are also new R user groups in Singapore, Seoul, Denver, Brisbane, and New Jersey. Please let us know if we’re missing your R user group, or if you want to get a new one started.

——————————————————
Editor

David Smith, VP Marketing
david@revolutionanalytics.com
Twitter: @revodavid

Subscribe here for Revo’s monthly newsletter.

SAS/Blades/Servers/GPU Benchmarks

Just checked out the cool new series of servers from NVidia.

Now, though SAS Inc/Jim Goodnight thinks HP blade servers are the cool thing, the GPU takes hardware-based high performance computing to another level. It would be interesting to see GPU-based cloud computers as well, say for the on-demand SAS offering (free for academics and students), which has had some complaints of being slow.

See this for SAS and Blade Servers-

http://www.sas.com/success/ncsu_analytics.html

To give users hands-on experience, the program is underpinned by a virtual computing lab (VCL), a remote access service that allows users to reserve a computer configured with a desired set of applications and operating system and then access that computer over the Internet. The lab is powered by an IBM BladeCenter infrastructure, which includes more than 500 blade servers, distributed between two locations. The assignment of the blade servers can be changed to meet shifts in the balance of demand among the various groups of users. Laura Ladrie, MSA Classroom Coordinator and Technical Support Specialist, says, “The virtual computing lab chose IBM hardware because of its quality, reliability and performance. IBM hardware is also energy efficient and lends itself well to high performance/low overhead computing.”

That’s interesting, since IBM now competes (as the owner of SPSS) and also cooperates with SAS Institute.

And

http://www.theaustralian.com.au/australian-it/the-world-according-to-jim-goodnight-blade-switch-slashes-job-times/story-e6frgakx-1225888236107

You’re effectively turbo-charging through deployment of many processors within the blade servers?

Yes. We’ve got machines with 192 blades on them. One of them has 202 or 203 blades. We’re using Hewlett-Packard blades with 12 CPU cores on each, so it’s a total of 2300 CPU cores doing the computation.

Our idea was to give every one of those cores a little piece of work to do, and we came up with a solution. It involved a very small change to the algorithm we were using, and it’s just incredible how fast we can do things now.

I don’t think of it as a grid, I think of it as essentially one computer. Most people will take a blade and make a grid out of it, where everything’s a separate computer running separate jobs.

We just look at it as one big machine that has memory and processors all over the place, so it’s a totally different concept.

GPU servers can be faster than CPU servers, though, Professor G.




Source-

http://www.nvidia.com/object/preconfigured_clusters.html

TESLA GPU COMPUTING SOLUTIONS FOR DATA CENTERS
Supercharge your cluster with the Tesla family of GPU computing solutions. Deploy 1U systems from NVIDIA or hybrid CPU-GPU servers from OEMs that integrate NVIDIA® Tesla™ GPU computing processors.

When compared to the latest quad-core CPU, Tesla 20-series GPU computing processors deliver equivalent performance at 1/20th the power consumption and 1/10th the cost. Each Tesla GPU features hundreds of parallel CUDA cores and is based on the revolutionary NVIDIA® CUDA™ parallel computing architecture, with a rich set of developer tools (compilers, profilers, debuggers) for popular programming languages and APIs like C, C++, and Fortran, and driver APIs like OpenCL and DirectCompute.

NVIDIA’s partners provide turnkey easy-to-deploy Preconfigured Tesla GPU clusters that are customizable to your needs. For 3D cloud computing applications, our partners offer the Tesla RS clusters that are optimized for running RealityServer with iray.

Available Tesla Products for Data Centers:
– Tesla S2050
– Tesla M2050/M2070
– Tesla S1070
– Tesla M1060

Also, I liked the hybrid GPU and CPU servers.

And, from a paper comparing GPU and CPU using benchmark tests on BLAS on a Debian system, via Dirk Eddelbuettel’s excellent blog-

http://dirk.eddelbuettel.com/blog/

Usage of accelerated BLAS libraries seems to be shrouded in some mystery, judging from somewhat regularly recurring requests for help on lists such as r-sig-hpc (gmane version), the R list dedicated to High-Performance Computing. Yet it doesn’t have to be; installation can be really simple (on appropriate systems).

Another issue that I felt needed addressing was a comparison between the different alternatives available, quite possibly including GPU computing. So a few weeks ago I sat down and wrote a small package to run, collect, analyse and visualize some benchmarks. That package, called gcbd (more about the name below), is now on CRAN as of this morning. The package both facilitates the data collection for the paper it also contains (in the vignette form common among R packages) and provides code to analyse the data, which is also included as a SQLite database. All this is done in the Debian and Ubuntu context by transparently installing and removing suitable packages providing BLAS implementations, so that we can fully automate data collection over several competing implementations via a single script (which is also included). Contributions of benchmark results are encouraged; that is the idea of the package.

And from his paper on the same-

Analysts are often eager to reap the maximum performance from their computing platforms.

A popular suggestion in recent years has been to consider optimised basic linear algebra subprograms (BLAS). Optimised BLAS libraries have been included with some (commercial) analysis platforms for a decade (Moler 2000), and have also been available for (at least some) Linux distributions for an equally long time (Maguire 1999). Setting BLAS up can be daunting: the R language and environment devotes a detailed discussion to the topic in its Installation and Administration manual (R Development Core Team 2010b, appendix A.3.1). Among the available BLAS implementations, several popular choices have emerged. Atlas (an acronym for Automatically Tuned Linear Algebra System) is popular as it has shown very good performance due to its automated and CPU-specific tuning (Whaley and Dongarra 1999; Whaley and Petitet 2005). It is also licensed in such a way that it permits redistribution, leading to fairly wide availability of Atlas. We deploy Atlas in both a single-threaded and a multi-threaded configuration. Another popular BLAS implementation is Goto BLAS, which is named after its main developer, Kazushige Goto (Goto and Van De Geijn 2008). While ‘free to use’, its license does not permit redistribution, putting the onus of configuration, compilation and installation on the end-user. Lastly, the Intel Math Kernel Library (MKL), a commercial product, also includes an optimised BLAS library.

A recent addition to the tool chain of high-performance computing are graphical processing units (GPUs). Originally designed for optimised single-precision arithmetic to accelerate computing as performed by graphics cards, these devices are increasingly used in numerical analysis. Earlier criticism of insufficient floating-point precision or severe performance penalties for double-precision calculation are being addressed by the newest models. Dependence on particular vendors remains a concern, with NVidia’s CUDA toolkit (NVidia 2010) currently still the preferred development choice, whereas the newer OpenCL standard (Khronos Group 2008) may become a more generic alternative that is independent of hardware vendors. Brodtkorb et al. (2010) provide an excellent recent survey. But what has been lacking is a comparison of the effective performance of these alternatives. This paper works towards answering this question. By analysing performance across five different BLAS implementations, as well as a GPU-based solution, we are able to provide a reasonably broad comparison.

Performance is measured as an end-user would experience it: we record computing times from launching commands in the interactive R environment (R Development Core Team 2010a) to their completion.
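A minimal sketch of that style of measurement in plain R (the matrix size is arbitrary): time the same dense matrix product under whichever BLAS the running R is linked against, then switch BLAS packages and repeat.

set.seed(42)
n <- 2000
A <- matrix(rnorm(n * n), n, n)
B <- matrix(rnorm(n * n), n, n)
system.time(A %*% B)        # dispatches to the dgemm routine of the installed BLAS
system.time(crossprod(A))   # t(A) %*% A via dsyrk, usually faster than the naive form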

And

Basic Linear Algebra Subprograms (BLAS) provide an Application Programming Interface (API) for linear algebra. For a given task such as, say, a multiplication of two conformant matrices, an interface is described via a function declaration, in this case sgemm for single precision and dgemm for double precision. The actual implementation becomes interchangeable thanks to the API definition and can be supplied by different approaches or algorithms. This is one of the fundamental code design features we are using here to benchmark the difference in performance from different implementations.

A second key aspect is the difference between static and shared linking. In static linking, object code is taken from the underlying library and copied into the resulting executable. This has several key implications. First, the executable becomes larger due to the copy of the binary code. Second, it makes it marginally faster as the library code is present and no additional look-up and subsequent redirection has to be performed. The actual amount of this performance penalty is the subject of near-endless debate. We should also note that this usually amounts to only a small load-time penalty combined with a function pointer redirection; the actual computation effort is unchanged as the actual object code is identical. Third, it makes the program more robust as fewer external dependencies are required. However, this last point also has a downside: no changes in the underlying library will be reflected in the binary unless a new build is executed. Shared library builds, on the other hand, result in smaller binaries that may run marginally slower, but which can make use of different libraries without a rebuild.


And summing up,

The reference BLAS is found to be dominated in all cases. Single-threaded Atlas BLAS improves on the reference BLAS but loses to multi-threaded BLAS. For multi-threaded BLAS we find the Goto BLAS dominate the Intel MKL, with a single exception of the QR decomposition on the Xeon-based system, which may reveal an error. The development version of Atlas, when compiled in multi-threaded mode, is competitive with both Goto BLAS and the MKL. GPU computing is found to be compelling only for very large matrix sizes. Our benchmarking framework in the gcbd package can be employed by others through the R packaging system, which could lead to a wider set of benchmark results. These results could be helpful for next-generation systems, which may need to make heuristic choices about when to compute on the CPU and when to compute on the GPU.

Source – Dirk Eddelbuettel’s paper and blog: http://dirk.eddelbuettel.com/papers/gcbd.pdf
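As a hedged illustration of the kind of CPU-versus-GPU comparison summed up above, here is the sort of side-by-side timing one could run with the gputools package (a CRAN package of that era offering CUDA-backed linear algebra from R; the matrix size is arbitrary):

library(gputools)

n <- 4000
A <- matrix(rnorm(n * n), n, n)
B <- matrix(rnorm(n * n), n, n)
system.time(A %*% B)            # the product on the CPU via the installed BLAS
system.time(gpuMatMult(A, B))   # the same product on the GPU via CUBLAS

# per the gcbd results, the GPU should only pay off once the matrices get very large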

Quite appropriately, hardware solutions, or at least hardware considerations, need to be a part of Revolution Analytics’ thinking as well. SPSS does not have any choice anymore though 😉

It would be interesting to see how the new SAS cloud computing / server farm / time-sharing facility is benchmarking CPU and GPU for SAS analytics performance; if this is being done already, it would be nice to see a SUGI paper on it at http://sascommunity.org.

Multi-threading needs to be taken care of automatically by statistical software to optimize local computing (including for a new R).
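Today much of this is still manual: with a multi-threaded BLAS, the thread count is typically fixed by an environment variable before R starts. A hedged sketch (the variable names depend on which BLAS is installed):

# set in the shell that launches R, for example:
#   OMP_NUM_THREADS=4  R    (OpenMP-based builds such as threaded Atlas or MKL)
#   GOTO_NUM_THREADS=4 R    (Goto BLAS)
# then time the same call under 1, 2, 4, ... threads and compare
n <- 3000
A <- matrix(rnorm(n * n), n, n)
system.time(crossprod(A))   # elapsed time should shrink as the BLAS thread count grows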

Acceptable benchmarks for testing hardware as well as software need to be reinforced and published across vendors, academia and companies.

What do you think?


Professors and Patches: For a Betterrrr R

Professors sometimes throw out provocative statements to spark intellectual debate. I have had almost 1500+ hits in less than 2 days (and I am glad I am on wordpress.com; my old beloved server would have crashed).

The remarks from Ross Ihaka have been covered before, and also at Xian’s blog (link below).

Note that most of his remarks are technical, and only a single line refers to Revolution Analytics.

Other senior members of the community (read: professors) are silent, though probably some thought may have been ignited behind the scenes.

http://xianblog.wordpress.com/2010/09/06/insane/comment-page-4/#comments

Ross Ihaka Says:
September 12, 2010 at 1:23 pm

Since (something like) my name has been taken in vain here, let me
chip in.

I’ve been worried for some time that R isn’t going to provide the base
that we’re going to need for statistical computation in the
future. (It may well be that the future is already upon us.) There
are certainly efficiency problems (speed and memory use), but there
are more fundamental issues too. Some of these were inherited from S
and some are peculiar to R.

One of the worst problems is scoping. Consider the following little
gem.

f =
function() {
    if (runif(1) > .5)
        x = 10
    x
}

The x being returned by this function is randomly local or global.
There are other examples where variables alternate between local and
non-local throughout the body of a function. No sensible language
would allow this. It’s ugly and it makes optimisation really
difficult. This isn’t the only problem, even weirder things happen
because of interactions between scoping and lazy evaluation.
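To see that concretely, here is a small illustration in plain R (an added sketch, not part of the quoted comment): with a global x defined, f() returns either the freshly created local 10 or the global value, depending on the random draw.

x <- 99                     # a global x
f <- function() {
  if (runif(1) > .5)
    x <- 10                 # sometimes creates a local x ...
  x                         # ... otherwise this falls back to the global x
}
set.seed(1)
table(replicate(20, f()))   # a mix of 10s and 99s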

In light of this, I’ve come to the conclusion that rather than
“fixing” R, it would be much more productive to simply start over and
build something better. I think the best you could hope for by fixing
the efficiency problems in R would be to boost performance by a small
multiple, or perhaps as much as an order of magnitude. This probably
isn’t enough to justify the effort (Luke Tierney has been working on R
compilation for over a decade now).

To try to get an idea of how much speedup is possible, a number of us
have been carrying out some experiments to see how much better we
could do with something new. Based on prototyping we’ve been doing at
Auckland, it looks like it should be straightforward to get two orders
of magnitude speedup over R, at least for those computations which are
currently bottle-necked. There are a couple of ways to make this
happen.

First, scalar computations in R are very slow. This is in part because
the R interpreter is very slow, but also because there are no scalar
types. By introducing scalars and using compilation it looks like its
possible to get a speedup by a factor of several hundred for scalar
computations. This is important because it means that many ghastly
uses of array operations and the apply functions could be replaced by
simple loops. The cost of these improvements is that scope
declarations become mandatory and (optional) type declarations are
necessary to help the compiler.

As a side-effect of compilation and the use of type-hinting it should
be possible to eliminate dispatch overhead for certain (sealed)
classes (scalars and arrays in particular). This won’t bring huge
benefits across the board, but it will mean that you won’t have to do
foreign language calls to get efficiency.

A second big problem is that computations on aggregates (data frames
in particular) run at glacial rates. This is entirely down to
unnecessary copying because of the call-by-value semantics.
Preserving call-by-value semantics while eliminating the extra copying
is hard. The best we can probably do is to take a conservative
approach. R already tries to avoid copying where it can, but fails in
an epic fashion. The alternative is to abandon call-by-value and move
to reference semantics. Again, prototyping indicates that several
hundredfold speedup is possible (for data frames in particular).

The changes in semantics mentioned above mean that the new language
will not be R. However, it won’t be all that far from R and it should
be easy to port R code to the new system, perhaps using some form of
automatic translation.

If we’re smart about building the new system, it should be possible to
make use of multi-cores and parallelism. Adding this to the mix might just
make it possible to get a three order-of-magnitude performance boost
with just a fraction of the memory that R uses. I think it’s something
really worth putting some effort into.

I also think one other change is necessary. The license will need to do a
better job of protecting work donated to the commons than GPL2 seems
to have done. I’m not willing to have any more of my work purloined by
the likes of Revolution Analytics, so I’ll be looking for better
protection from the license (and being a lot more careful about who I
work with).

The discussion spilled over to Stack Overflow as well

http://stackoverflow.com/questions/3706990/is-r-that-bad-that-it-should-be-rewritten-from-scratch/3710667#3710667

In the past week I’ve been following a discussion where Ross Ihaka wrote (here):

I’ve been worried for some time that R isn’t going to provide the base that we’re going to need for statistical computation in the future. (It may well be that the future is already upon us.) There are certainly efficiency problems (speed and memory use), but there are more fundamental issues too. Some of these were inherited from S and some are peculiar to R.

He then continued explaining. This discussion started from this post, and was then followed by comments here, here, here, here, here and here, and maybe some more places I don’t know of.

We all know the problem now.

R can be improved substantially in terms of speed.

For some solutions, here are the patches by Radford Neal-

http://www.cs.toronto.edu/~radford/speed-patches-doc

patch-dollar

    Speeds up access to lists, pairlists, and environments using the
    $ operator.  The speedup comes mainly from avoiding the overhead of 
    calling DispatchOrEval if there are no complexities, from passing
    on the field to extract as a symbol, or a name, or both, as available,
    and then converting only as necessary, from simplifying and inlining
    the pstrmatch procedure, and from not translating strings multiple
    times.  

    Relevant timing test script:  test-dollar.r 

    This test shows about a 40% decrease in the time needed to extract
    elements of lists and environments.

    Changes unrelated to speed improvement:

    A small error-reporting bug is fixed, illustrated by the following
    output with r52822:

    > options(warnPartialMatchDollar=TRUE)
    > pl <- pairlist(abc=1,def=2)
    > pl$ab
    [1] 1
    Warning message:
    In pl$ab : partial match of 'ab' to ''

    Some code is changed at the end of R_subset3_dflt because it seems 
    to be more correct, as discussed in code comments. 

patch-evalList

    Speeds up a large number of operations by avoiding allocation of
    an extra CONS cell in the procedures for evaluating argument lists.

    Relevant timing test scripts:  all of them, but will look at test-em.r 

    On test-em.r, the speedup from this patch is about 5%.

patch-fast-base

    Speeds up lookup of symbols defined in the base environment, by
    flagging symbols that have a base environment definition recorded
    in the global cache.  This allows the definition to be retrieved
    quickly without looking in the hash table.  

    Relevant timing test scripts:  all of them, but will look at test-em.r 

    On test-em.r, the speedup from this patch is about 3%.

    Issue:  This patch uses the "spare" bit for the flag.  This bit is
    misnamed, since it is already used elsewhere (for closures).  It is
    possible that one of the "gp" bits should be used instead.  The
    "gp" bits should really be divided up for faster access, and so that
    their present use is apparent in the code.

    In case this use of the "spare" bit proves unwise, the patch code is 
    conditional on FAST_BASE_CACHE_LOOKUP being defined at the start of
    envir.r.

patch-fast-spec

    Speeds up lookup of function symbols that begin with a character
    other than a letter or ".", by allowing fast bypass of non-global
    environments that do not contain (and have never contained) symbols 
    of this sort.  Since it is expected that only functions will be
    given names of this sort, the check is done only in findFun, though
    it could also be done in findVar.

    Relevant timing test scripts:  all of them, but will look at test-em.r 

    On test-em.r, the speedup from this patch is about 8%.    

    Issue:  This patch uses the "spare" bit to flag environments known
    to not have symbols starting with a special character.  See remarks
    on patch-fast-base.

    In case this use of the "spare" bit proves unwise, the patch code is 
    conditional on FAST_SPEC_BYPASS being defined at the start of envir.r.

patch-for

    Speeds up for loops by not allocating new space for the loop
    variable every iteration, unless necessary.  

    Relevant timing test script:  test-for.r

    This test shows a speedup of about 5%.  

    Change unrelated to speed improvement:

    Fixes what I consider to be a bug, in which the loop clobbers a
    global variable, as demonstrated by the following output with r52822:

    > i <- 99
    > f <- function () for (i in 1:3) { print(i); if (i==2) rm(i); }
    > f()
    [1] 1
    [1] 2
    [1] 3
    > print(i)
    [1] 3

patch-matprod

    Speeds up matrix products, including vector dot products.  The
    speed issue here is that the R code checks for any NAs, and 
    does the multiply in the matprod procedure (in array.c) if so,
    since BLAS isn't trusted with NAs.  If this check takes about
    as long as just doing the multiply in matprod, calling a BLAS
    routine makes no sense.  

    Relevant time test script:  test-matprod.r

    With no external BLAS, this patch speeds up long vector-vector 
    products by a factor of about six, matrix-vector products by a
    factor of about three, and some matrix-matrix products by a 
    factor of about two.

    Issue:  The matrix multiply code in matprod uses an LDOUBLE
    (long double) variable to accumulate sums, for improved accuracy.  
    On a SPARC system I tested on, operations on long doubles are 
    vastly slower than on doubles, so that the patch produces a 
    large slowdown rather than an improvement.  This is also an issue 
    for the "sum" function, which also uses an LDOUBLE to accumulate
    the sum.  Perhaps an ordinary double should be used in these
    places, or perhaps the configuration script should define LDOUBLE 
    as double on architectures where long doubles are extraordinarily 
    slow.

    Due to this issue, not defining MATPROD_CAN_BE_DONE_HERE at the
    start of array.c will disable this patch.

patch-parens

    Speeds up parentheses by making "(" a special operator whose
    argument is not evaluated, thereby bypassing the overhead of
    evalList.  Also slightly speeds up curly brackets by inlining
    a function that is stylistically better inline anyway.

    Relevant test script:  test-parens.r

    In the parens part of test-parens.r, the speedup is about 9%.

patch-protect

    Speeds up numerous operations by making PROTECT, UNPROTECT, etc.
    be mostly macros in the files in src/main.  This takes effect
    only for files that include Defn.h after defining the symbol
    USE_FAST_PROTECT_MACROS.  With these macros, code of the form
    v = PROTECT(...) must be replaced by PROTECT(v = ...).  

    Relevant timing test scripts:  all of them, but will look at test-em.r 

    On test-em.r, the speedup from this patch is about 9%.

patch-save-alloc

    Speeds up some binary and unary arithmetic operations by, when
    possible, using the space holding one of the operands to hold
    the result, rather than allocating new space.  Though primarily
    a speed improvement, for very long vectors avoiding this allocation 
    could avoid running out of space.

    Relevant test script:  test-complex-expr.r

    On this test, the speedup is about 5% for scalar operands and about
    8% for vector operands.

    Issues:  There are some tricky issues with attributes, but I think
    I got them right.  This patch relies on NAMED being set correctly 
    in the rest of the code.  In case it isn't, the patch can be disabled 
    by not defining AVOID_ALLOC_IF_POSSIBLE at the top of arithmetic.c.

patch-square

    Speeds up a^2 when a is a long vector by not checking for the
    special case of an exponent of 2 over and over again for every 
    vector element.

    Relevant test script:  test-square.r

    The time for squaring a long vector is reduced in this test by a
    factor of more than five.

patch-sum-prod

    Speeds up the "sum" and "prod" functions by not checking for NA
    when na.rm=FALSE, and other detailed code improvements.

    Relevant test script:  test-sum-prod.r

    For sum, the improvement is about a factor of 2.5 when na.rm=FALSE,
    and about 10% when na.rm=TRUE.

    Issue:  See the discussion of patch-matprod regarding LDOUBLE.
    There is no change regarding this issue due to this patch, however.

patch-transpose

    Speeds up the transpose operation (the "t" function) from detailed
    code improvements.

    Relevant test script:  test-transpose.r

    The improvement for 200x60 matrices is about a factor of two.
    There is little or no improvement for long row or column vectors.

patch-vec-arith

    Speeds up arithmetic on vectors of the same length, or when one
    vector is of length one.  This is done with detailed code improvements.

    Relevant test script:  test-vec-arith.r

    On long vectors, the +, -, and * operators are sped up by about     
    20% when operands are the same length or one operand is of length one.

    Rather mysteriously, when the operands are not length one or the
    same length, there is about a 20% increase in time required, though
    this may be due to some strange C optimizer peculiarity or some 
    strange cache effect, since the C code for this is the same as before,
    with negligible additional overhead getting to it.  Regardless, this 
    case is much less common than equal lengths or length one.

    There is little change for the / operator, which is much slower than
    +, -, or *.

patch-vec-subset

    Speeds up extraction of subsets of vectors or matrices (eg, v[10:20]
    or M[1:10,101:110]).  This is done with detailed code improvements.

    Relevant test script:  test-vec-subset.r

    There are lots of tests in this script.  The most dramatic improvement
    is for extracting many rows and columns of a large array, where the 
    improvement is by about a factor of four.  Extracting many rows from
    one column of a matrix is sped up by about 30%. 

    Changes unrelated to speed improvement:

    Fixes two latent bugs where the code incorrectly refers to NA_LOGICAL
    when NA_INTEGER is appropriate and where LOGICAL and INTEGER types
    are treated as interchangeable.  These cause no problems at the moment,
    but would if representations were changed.

patch-subscript

    (Formerly part of patch-vec-subset)  This patch also speeds up
    extraction, and also replacement, of subsets of vectors or
    matrices, but focuses on the creation of the indexes rather than
    the copy operations.  Often avoids a duplication (see below) and
    eliminates a second scan of the subscript vector for zero
    subscripts, folding it into a previous scan at no additional cost.

    Relevant test script:  test-vec-subset.r

    Speeds up some operations with scalar or short vector indexes by
    about 10%.  Speeds up subscripting with a longer vector of
    positive indexes by about 20%.

    Issues:  The current code duplicates a vector of indexes when it
    seems unnecessary.  Duplication is for two reasons:  to handle
    the situation where the index vector is itself being modified in
    a replace operation, and so that any attributes can be removed, which 
    is helpful only for string subscripts, given how the routine to handle 
    them returns information via an attribute.  Duplication for the
    second reason can easily be avoided, so I avoided it.  The first
    reason for duplication is sometimes valid, but can usually be avoided
    by first only doing it if the subscript is to be used for replacement
    rather than extraction, and second only doing it if the NAMED field
    for the subscript isn't zero.

    I also removed two layers of procedure call overhead (passing seven
    arguments, so not trivial) that seemed to be doing nothing.  Probably 
    it used to do something, but no longer does, but if instead it is 
    preparation for some future use, then removing it might be a mistake.

Software problems are best solved by writing code or patches, in my opinion, rather than by endless discussion.
Some other solutions for a BETTERRRR R:
1) Complete Code Design Review
2) Version 3 - Tuneup
3) Better Documentation
4) Suing Revolution Analytics for the code - Hand over da code pardner

KDNuggets Poll on SAS: Churn in Analytics Users

Here are some surprising results from the Bible of all data miners, KDnuggets.com, with some interesting comments about SAS being the Microsoft of analytics.

I believe technically advanced users will probably want to try out R before going in for a commercial license from Revolution Analytics, as it is free to try out. Also, WPS offers a one-month free preview of its software; the latest release competes with SAS/STAT, SAS/ACCESS, SAS/GRAPH and Base SAS, so anyone having these installations on a server would be interested to at least test it for free. Also, WPS would be interested in adding more data engines (like the ones they have for Oracle and Teradata).

One very crucial difference for SAS is its ability to pull in data from almost all data formats; so if you are using SAS/Connect to remote-submit code, you may not be able to switch soon.

Also, the more license-heavy customers are not the kind of customers who keep lots of data on their local desktops; data is usually pulled from servers and then crunched before being analysed. R has recently made some strides with the RevoScaleR package from Revolution Analytics, but its effectiveness will be tested and tried in the coming months; it seems like a great step in the right direction.

For SAS, the feedback should be a call to improve their product bundling, some of which can feel like overselling at times. But they have been fighting off challenges for the past four decades and have the pockets and intention to sustain market-share battles, including discounts (for repeat customers, SAS can be much cheaper than it would be for, say, a first-time user of WPS or R).

http://teamwpc.co.uk/home

This really should come as a surprise to some people. You can see the comments on WPS and R at the site itself. Interesting stuff, and we can check after, say, a year to see how many actually DID switch.

http://www.kdnuggets.com/polls/2010/switching-from-sas-to-wps.html

Interview: Stephanie McReynolds, Director of Product Marketing, AsterData

Here is an interview with Stephanie McReynolds, who works as Director of Product Marketing with AsterData. I asked her a couple of questions about the new product releases from AsterData in analytics and MapReduce.

Ajay – How does the new Eclipse Plugin help people who are already working with huge datasets but are new to AsterData’s platform?

Stephanie- Aster Data Developer Express, our new SQL-MapReduce development plug-in for Eclipse, makes MapReduce applications easy to develop. With Aster Data Developer Express, developers can develop, test and deploy a complete SQL-MapReduce application in under an hour. This is a significant increase in productivity over the traditional analytic application development process for Big Data applications, which requires significant time coding applications in low-level code and testing applications on sample data.

Ajay – What are the various analytical functions that you have introduced recently? List, say, the top 10.

Stephanie- At Aster Data, we have an intense focus on making the development process easier for SQL-MapReduce applications. Aster Developer Express is a part of this initiative, as is the release of pre-defined analytic functions. We recently launched both a suite of analytic modules and a partnership program dedicated to delivering pre-defined analytic functions for the Aster Data nCluster platform. Pre-defined analytic functions delivered by Aster Data’s engineering team are delivered as modules within the Aster Data Analytic Foundation offering and include analytics in the areas of pattern matching, clustering, statistics, and text analysis– just to name a few areas. Partners like Fuzzy Logix and Cobi Systems are extending this library by delivering industry-focused analytics like Monte Carlo Simulations for Financial Services and geospatial analytics for Public Sector– to give you a few examples.

Ajay – So okay, I want to do a k-means cluster on, say, a million rows (and say 200 columns) using the Aster method. How do I go about it using the new plug-in as well as your product?

Stephanie- The power of the Aster Data environment for analytic application development is in SQL-MapReduce. SQL is a powerful analytic query standard because it is a declarative language. MapReduce is a powerful programming framework because it can support high performance parallel processing of Big Data and extreme expressiveness, by supporting a wide variety of programming languages, including Java, C/C#/C++, .Net, Python, etc. Aster Data has taken the performance and expressiveness of MapReduce and combined it with the familiar declarativeness of SQL. This unique combination ensures that anyone who knows standard SQL can access advanced analytic functions programmed for Big Data analysis using MapReduce techniques.

kMeans is a good example of an analytic function that we pre-package for developers as part of the Aster Data Analytic Foundation. What does that mean? It means that the MapReduce portion of the development cycle has been completed for you. Each pre-packaged Aster Data function can be called using standard SQL, and executes the defined analytic in a fully parallelized manner in the Aster Data database using MapReduce techniques. The result? High performance analytics with the expressiveness of low-level languages accessed through declarative SQL.

Ajay – I see an increasing focus on analytics. Is this part of your product strategy, and how do you see yourself competing with pure analytics vendors?

Stephanie – Aster Data is an infrastructure provider. Our core product is a massively parallel processing database called nCluster that performs at or beyond the capabilities of any other analytic database in the market today. We developed our analytics strategy as a response to demand from our customers who were looking beyond the price/performance wars being fought today and wanted support for richer analytics from their database provider. Aster Data analytics are delivered in nCluster to enable analytic applications that are not possible in more traditional database architectures.

Ajay – Name some recent case studies in analytics of implementations of SQL-MapReduce with analytical functions.

Stephanie – There are three new classes of applications that Aster Data Express and Aster Analytic Foundation support: iterative analytics, prediction and optimization, and ad hoc analysis.

Aster Data customers are uncovering critical business patterns in Big Data by performing hypothesis-driven, iterative analytics. They are interactively exploring massive volumes of data—terabytes to petabytes—in a top-down deductive manner. ComScore, an Aster Data customer that performs website experience analysis, is a good example of a customer performing this type of analysis.

Other Aster Data customers are building applications for prediction and optimization that discover trends, patterns, and outliers in data sets. Examples of these types of applications are propensity to churn in telecommunications, proactive product and service recommendations in retail, and pricing and retention strategies in financial services. Full Tilt Poker, which is using Aster Data for fraud prevention, is a good example of a customer in this space.

The final class of application that I would like to highlight is ad hoc analysis. Examples of ad hoc analysis that can be performed include social network analysis, advanced click-stream analysis, graph analysis, cluster analysis, and a wide variety of mathematical, trigonometric, and statistical functions. LinkedIn, whose analysts and data scientists have access to all of their customer data in Aster Data, is a good example of a customer using the system in this manner.

While Aster Data customers are using nCluster in a number of other ways, these three new classes of applications are areas in which we are seeing particularly innovative application development.

Biography-

Stephanie McReynolds is Director of Product Marketing at Aster Data, where she is an evangelist for Aster Data’s massively parallel data-analytics server product. Stephanie has over a decade of experience in product management and marketing for business intelligence, data warehouse, and complex event processing products at companies such as Oracle, Peoplesoft, and Business Objects. She holds both a master’s and undergraduate degree from Stanford University.