R on Windows HPC Server

From HPC Wire, the newsletter/site for all HPC news-

Source- Link

PALO ALTO, Calif., Sept. 20 — Revolution Analytics, the leading commercial provider of software and support for the popular open source R statistics language, today announced it will deliver Revolution R Enterprise for Microsoft Windows HPC Server 2008 R2, released today, enabling users to analyze very large data sets in high-performance computing environments.

R is a powerful open source statistics language and the modern system for predictive analytics. Revolution Analytics recently introduced RevoScaleR, new “Big Data” analysis capabilities, to its R distribution, Revolution R Enterprise. RevoScaleR solves the performance and capacity limitations of the R language with parallelized algorithms that stream data across multiple cores on a laptop, workstation or server. Users can now process, visualize and model terabyte-class data sets at top speeds, without the need for specialized hardware.

“Revolution Analytics is pleased to support Microsoft’s Technical Computing initiative, whose efforts will benefit scientists, engineers and data analysts,” said David Champagne, CTO at Revolution. “We believe the engineering we have done for Revolution R Enterprise, in particular our work on big-data statistics and multicore computing, along with Microsoft’s HPC platform for technical computing, makes an ideal combination for high-performance large scale statistical computing.”

“Processing and analyzing this ‘big data’ is essential to better prediction and decision making,” said Bill Hamilton, director of technical computing at Microsoft Corp. “Revolution R Enterprise for Windows HPC Server 2008 R2 gives customers an extremely powerful tool that handles analysis of very large data and high workloads.”

To learn more about Revolution R Enterprise and its Big Data capabilities, download the white paper. Revolution Analytics also has an on-demand webcast, “High-performance analytics with Revolution R and Windows HPC Server,” available online.

AND from Microsoft’s website

http://www.microsoft.com/hpc/en/us/solutions/hpc-for-life-sciences.aspx

REvolution R Enterprise »

REvolution Computing

REvolution R Enterprise is designed for both novice and experienced R users looking for a production-grade R distribution to perform mission critical predictive analytics tasks right from the desktop and scale across multiprocessor environments. Featuring RPE™ REvolution’s R Productivity Environment for Windows.

Of course R Enterprise is available on Linux, but on Red Hat Enterprise Linux; it would be nice to see Amazon Machine Images and Ubuntu versions as well.

An Amazon Machine Image (AMI) is a special type of virtual appliance which is used to instantiate (create) a virtual machine within the Amazon Elastic Compute Cloud. It serves as the basic unit of deployment for services delivered using EC2.[1]

Like all virtual appliances, the main component of an AMI is a read-only filesystem image which includes an operating system (e.g., Linux, UNIX, or Windows) and any additional software required to deliver a service or a portion of it.[2]

The AMI filesystem is compressed, encrypted, signed, split into a series of 10MB chunks and uploaded into Amazon S3 for storage. An XML manifest file stores information about the AMI, including name, version, architecture, default kernel id, decryption key and digests for all of the filesystem chunks.
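The chunk-and-digest step described above can be sketched in a few lines of Python. This is only an illustration, not Amazon's bundling tool: the function name `chunk_digests`, the manifest entry layout, and the choice of SHA-1 are all assumptions, and the compression, encryption and signing steps are left out.

```python
import hashlib
import io

CHUNK_SIZE = 10 * 1024 * 1024  # 10 MB, the chunk size quoted above


def chunk_digests(stream, chunk_size=CHUNK_SIZE):
    """Read a binary stream in fixed-size chunks and return one entry per chunk.

    Each entry records the chunk's position, its size in bytes and a digest,
    which is the kind of information a manifest would keep per chunk.
    """
    entries = []
    index = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # end of stream
            break
        entries.append({
            "part": index,
            "size": len(chunk),
            # SHA-1 is an assumption made for this sketch only.
            "digest": hashlib.sha1(chunk).hexdigest(),
        })
        index += 1
    return entries


if __name__ == "__main__":
    # A 25 MB in-memory stand-in for a filesystem image.
    image = io.BytesIO(b"\0" * (25 * 1024 * 1024))
    for entry in chunk_digests(image):
        print(entry["part"], entry["size"], entry["digest"])
```

A 25 MB image yields three entries: two full 10 MB chunks and one 5 MB remainder.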

An AMI does not include a kernel image, only a pointer to the default kernel id, which can be chosen from an approved list of safe kernels maintained by Amazon and its partners (e.g., RedHat, Canonical, Microsoft). Users may choose kernels other than the default when booting an AMI.[3]

Types of images

  • Public: an AMI image that can be used by anyone.
  • Paid: a for-pay AMI image that is registered with Amazon DevPay and can be used by anyone who subscribes to it. DevPay allows developers to mark up Amazon’s usage fees and optionally add monthly subscription fees.

Windows Azure vs Amazon EC2 (and Google Storage)

Here is a comparison of Windows Azure instances vs Amazon compute instances

Compute Instance Sizes:

Developers have the ability to choose the size of VMs to run their application based on the applications resource requirements. Windows Azure compute instances come in four unique sizes to enable complex applications and workloads.

Compute Instance Size | CPU         | Memory  | Instance Storage | I/O Performance
Small                 | 1.6 GHz     | 1.75 GB | 225 GB           | Moderate
Medium                | 2 x 1.6 GHz | 3.5 GB  | 490 GB           | High
Large                 | 4 x 1.6 GHz | 7 GB    | 1,000 GB         | High
Extra large           | 8 x 1.6 GHz | 14 GB   | 2,040 GB         | High

Standard Rates:

Windows Azure

  • Compute
    • Small instance (default): $0.12 per hour
    • Medium instance: $0.24 per hour
    • Large instance: $0.48 per hour
    • Extra large instance: $0.96 per hour
  • Storage
    • $0.15 per GB stored per month
    • $0.01 per 10,000 storage transactions
  • Content Delivery Network (CDN)
    • $0.15 per GB for data transfers from European and North American locations*
    • $0.20 per GB for data transfers from other locations*
    • $0.01 per 10,000 transactions*

Source –

http://www.microsoft.com/windowsazure/offers/popup/popup.aspx?lang=en&locale=en-US&offer=MS-AZR-0001P

and

http://www.microsoft.com/windowsazure/windowsazure/

Amazon EC2 has more options though:

http://aws.amazon.com/ec2/pricing/

Instance Type                     | Linux/UNIX Usage | Windows Usage
Standard On-Demand Instances
  Small (Default)                 | $0.085 per hour  | $0.12 per hour
  Large                           | $0.34 per hour   | $0.48 per hour
  Extra Large                     | $0.68 per hour   | $0.96 per hour
Micro On-Demand Instances
  Micro                           | $0.02 per hour   | $0.03 per hour
High-Memory On-Demand Instances
  Extra Large                     | $0.50 per hour   | $0.62 per hour
  Double Extra Large              | $1.00 per hour   | $1.24 per hour
  Quadruple Extra Large           | $2.00 per hour   | $2.48 per hour
High-CPU On-Demand Instances
  Medium                          | $0.17 per hour   | $0.29 per hour
  Extra Large                     | $0.68 per hour   | $1.16 per hour
Cluster Compute Instances
  Quadruple Extra Large           | $1.60 per hour   | N/A*
* Windows is not currently available for Cluster Compute Instances.
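To put the hourly rates side by side, here is a small Python sketch that turns them into a rough monthly bill. The rates are copied from the Azure and EC2 tables above; the 30-day, always-on month and the names `HOURS_PER_MONTH` and `monthly_cost` are my own assumptions, and storage, bandwidth and transaction charges are ignored.

```python
HOURS_PER_MONTH = 24 * 30  # assumption: one instance running non-stop for 30 days

# Hourly on-demand rates in USD, copied from the tables above.
AZURE = {"small": 0.12, "medium": 0.24, "large": 0.48, "extra_large": 0.96}
EC2_LINUX = {"small": 0.085, "large": 0.34, "extra_large": 0.68}
EC2_WINDOWS = {"small": 0.12, "large": 0.48, "extra_large": 0.96}


def monthly_cost(hourly_rate, hours=HOURS_PER_MONTH):
    """Compute-only cost in USD for running one instance for `hours` hours."""
    return round(hourly_rate * hours, 2)


if __name__ == "__main__":
    for size in ("small", "large", "extra_large"):
        print(
            size,
            "Azure:", monthly_cost(AZURE[size]),
            "EC2 Linux:", monthly_cost(EC2_LINUX[size]),
            "EC2 Windows:", monthly_cost(EC2_WINDOWS[size]),
        )
```

Going by these tables, the Azure rates match the EC2 Windows rates for the small, large and extra large sizes, so the compute-only difference shows up on the Linux side: a small instance comes to about $61.20 a month on EC2 Linux against about $86.40 on Azure or EC2 Windows.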

http://aws.amazon.com/ec2/instance-types/

Standard Instances

Instances of this family are well suited for most applications.

Small Instance – default*

1.7 GB memory
1 EC2 Compute Unit (1 virtual core with 1 EC2 Compute Unit)
160 GB instance storage (150 GB plus 10 GB root partition)
32-bit platform
I/O Performance: Moderate
API name: m1.small

Large Instance

7.5 GB memory
4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute Units each)
850 GB instance storage (2×420 GB plus 10 GB root partition)
64-bit platform
I/O Performance: High
API name: m1.large

Extra Large Instance

15 GB memory
8 EC2 Compute Units (4 virtual cores with 2 EC2 Compute Units each)
1,690 GB instance storage (4×420 GB plus 10 GB root partition)
64-bit platform
I/O Performance: High
API name: m1.xlarge

Micro Instances

Instances of this family provide a small amount of consistent CPU resources and allow you to burst CPU capacity when additional cycles are available. They are well suited for lower throughput applications and web sites that consume significant compute cycles periodically.

Micro Instance

613 MB memory
Up to 2 EC2 Compute Units (for short periodic bursts)
EBS storage only
32-bit or 64-bit platform
I/O Performance: Low
API name: t1.micro

High-Memory Instances

Instances of this family offer large memory sizes for high throughput applications, including database and memory caching applications.

High-Memory Extra Large Instance

17.1 GB of memory
6.5 EC2 Compute Units (2 virtual cores with 3.25 EC2 Compute Units each)
420 GB of instance storage
64-bit platform
I/O Performance: Moderate
API name: m2.xlarge

High-Memory Double Extra Large Instance

34.2 GB of memory
13 EC2 Compute Units (4 virtual cores with 3.25 EC2 Compute Units each)
850 GB of instance storage
64-bit platform
I/O Performance: High
API name: m2.2xlarge

High-Memory Quadruple Extra Large Instance

68.4 GB of memory
26 EC2 Compute Units (8 virtual cores with 3.25 EC2 Compute Units each)
1690 GB of instance storage
64-bit platform
I/O Performance: High
API name: m2.4xlarge

High-CPU Instances

Instances of this family have proportionally more CPU resources than memory (RAM) and are well suited for compute-intensive applications.

High-CPU Medium Instance

1.7 GB of memory
5 EC2 Compute Units (2 virtual cores with 2.5 EC2 Compute Units each)
350 GB of instance storage
32-bit platform
I/O Performance: Moderate
API name: c1.medium

High-CPU Extra Large Instance

7 GB of memory
20 EC2 Compute Units (8 virtual cores with 2.5 EC2 Compute Units each)
1690 GB of instance storage
64-bit platform
I/O Performance: High
API name: c1.xlarge

Cluster Compute Instances

Instances of this family provide proportionally high CPU resources with increased network performance and are well suited for High Performance Compute (HPC) applications and other demanding network-bound applications. Learn more about use of this instance type for HPC applications.

Cluster Compute Quadruple Extra Large Instance

23 GB of memory
33.5 EC2 Compute Units (2 x Intel Xeon X5570, quad-core “Nehalem” architecture)
1690 GB of instance storage
64-bit platform
I/O Performance: Very High (10 Gigabit Ethernet)
API name: cc1.4xlarge

Also http://www.microsoft.com/en-us/sqlazure/default.aspx

offers SQL Databases as a service with a free trial offer

If you are into .Net/SQL big time or too dependent on MS, Azure is a nice alternative to EC2 http://www.microsoft.com/windowsazure/offers/popup/popup.aspx?lang=en&locale=en-US&offer=COMPARE_PUBLIC

Updated: I just got approved for Google Storage, so I am adding their info, though they are in Preview (and it's free right now) 🙂

https://code.google.com/apis/storage/docs/overview.html

Functionality

Google Storage for Developers offers a rich set of features and capabilities:

Basic Operations

  • Store and access data from anywhere on the Internet.
  • Range-gets for large objects.
  • Manage metadata.

Security and Sharing

  • User authentication using secret keys or Google account.
  • Authenticated downloads from a web browser for Google account holders.
  • Secure access using SSL.
  • Easy, powerful sharing and collaboration via ACLs for individuals and groups.

Performance and scalability

  • Up to 100 gigabytes per object and 1,000 buckets per account during the preview.
  • Strong data consistency—read-after-write consistency for all upload and delete operations.
  • Namespace for your domain—only you can create bucket URIs containing your domain name.
  • Data replicated in multiple data centers across the U.S. and within the same data center.

Tools

  • Web-based storage manager.
  • GSUtil, an open source command line tool.
  • Compatible with many existing cloud storage tools and libraries.

Read the Getting Started Guide to learn more about the service.

Note: Google Storage for Developers does not support Google Apps accounts that use your company domain name at this time.


Pricing

Google Storage for Developers pricing is based on usage.

  • Storage—$0.17/gigabyte/month
  • Network
    • Upload data to Google
      • $0.10/gigabyte
    • Download data from Google
      • $0.15/gigabyte for Americas and EMEA
      • $0.30/gigabyte for Asia-Pacific
  • Requests
    • PUT, POST, LIST—$0.01 per 1,000 requests
    • GET, HEAD—$0.01 per 10,000 requests
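The usage-based pricing above is easy to turn into a rough estimate. The Python sketch below uses only the rates listed; the function name, its parameters and the example workload are hypothetical, and it assumes billing is strictly proportional with no free tier or rounding rules.

```python
# Rates in USD, copied from the price list above.
STORAGE_PER_GB_MONTH = 0.17
UPLOAD_PER_GB = 0.10
DOWNLOAD_PER_GB = {"americas_emea": 0.15, "asia_pacific": 0.30}
PUT_POST_LIST_PER_1000 = 0.01
GET_HEAD_PER_10000 = 0.01


def google_storage_monthly_cost(stored_gb, upload_gb, download_gb,
                                put_post_list_requests, get_head_requests):
    """Estimate one month's bill.

    `download_gb` maps a region key from DOWNLOAD_PER_GB to gigabytes
    downloaded in that region. Assumes strictly proportional billing.
    """
    cost = stored_gb * STORAGE_PER_GB_MONTH
    cost += upload_gb * UPLOAD_PER_GB
    for region, gigabytes in download_gb.items():
        cost += gigabytes * DOWNLOAD_PER_GB[region]
    cost += put_post_list_requests / 1000 * PUT_POST_LIST_PER_1000
    cost += get_head_requests / 10000 * GET_HEAD_PER_10000
    return round(cost, 2)


if __name__ == "__main__":
    # Hypothetical workload, for illustration only.
    print(google_storage_monthly_cost(
        stored_gb=100,
        upload_gb=20,
        download_gb={"americas_emea": 50},
        put_post_list_requests=10_000,
        get_head_requests=100_000,
    ))
```

For this made-up workload the storage charge ($17.00) dominates, followed by downloads ($7.50) and uploads ($2.00); the request charges add only about 20 cents.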

Matlab-Mathematica-R and GPU Computing

Matlab announced they have a Parallel Computing Toolbox, which also enables GPU computing

http://www.mathworks.com/products/parallel-computing/

Parallel Computing Toolbox™ lets you solve computationally and data-intensive problems using multicore processors, GPUs, and computer clusters. High-level constructs—parallel for-loops, special array types, and parallelized numerical algorithms—let you parallelize MATLAB® applications without CUDA or MPI programming. You can use the toolbox with Simulink® to run multiple simulations of a model in parallel.

MATLAB GPU Support

The toolbox provides eight workers (MATLAB computational engines) to execute applications locally on a multicore desktop. Without changing the code, you can run the same application on a computer cluster or a grid computing service (using MATLAB Distributed Computing Server™). You can run parallel applications interactively or in batch.

Parallel Computing with MATLAB on Amazon Elastic Compute Cloud (EC2)

Also a video of using Mathematica and GPU

Also R has many packages for GPU computing

Parallel computing: GPUs

from http://cran.r-project.org/web/views/HighPerformanceComputing.html

  • The gputools package by Buckner provides several common data-mining algorithms which are implemented using a mixture of nVidia's CUDA language and cublas library. Given a computer with an nVidia GPU these functions may be substantially more efficient than native R routines. The rpud package provides an optimised distance metric for NVidia-based GPUs.
  • The cudaBayesreg package by da Silva implements the rhierLinearModel from the bayesm package using nVidia’s CUDA language and tools to provide high-performance statistical analysis of fMRI voxels.
  • The rgpu package (see below for link) aims to speed up bioinformatics analysis by using the GPU.
  • The magma package provides an interface to the hybrid GPU/CPU library Magma (see below for link).
  • The gcbd package implements a benchmarking framework for BLAS and GPUs (using gputools).

I tried to search for SAS and GPU and SPSS and GPU but got nothing. Maybe they would do well to at least test this alternative hardware.

Also see Matlab on GPU comparison for the product Jacket vs Parallel Computing Toolbox

http://www.accelereyes.com/products/compare

Making NeW R

Tal G in his excellent blog piece talks of “Why R Developers  should not be paid” http://www.r-statistics.com/2010/09/open-source-and-money-why-r-developers-shouldnt-be-paid/

His argument of love is not very original, though: it was first made by these four guys

I am going to argue that “some” R developers should be paid, while the main focus should remain volunteer code. These R developers should be paid as per usage of their packages.

Let me expand.

Imagine the following conversation between Ross Ihaka, Norman Nie and Peter Dalgaard.

Norman- Hey Guys, Can you give me some code- I got this new startup.

Ross Ihaka and Peter Dalgaard- Sure dude. Here is 100,000 lines of code, 2000 packages and 2 decades of effort.

Norman- Thanks guys.

Ross Ihaka- Hey, What you gonna do with this code.

Norman- I will better it. Sell it. Finally beat Jim Goodnight and his **** Proc GLM and **** Proc Reg.

Ross- Okay, but what will you give us? Will you give us some code back of what you improve?

Norman – Uh, let me explain this open core …

Peter D- Well how about some royalty?

Norman- Sure, we will throw parties at all conferences, snacks you know at user groups.

Ross – Hmm. That does not sound fair. (walks away in a huff muttering)- He takes our code, sells it and won't share the code

Peter D- Doesn't sound fair. I am back to reading Hamlet, the great Dane, and writing the next edition of my book. I am glad I wrote a book- Ross didn't even write that.

Norman-Uh Oh. (picks his phone)- Hey David Smith, We need to write some blog articles pronto – these open source guys ,man…

I think that sums up what has been going on in the dynamics of R recently. If Ross Ihaka and R Gentleman had adopted an open core strategy, meaning you can create packages for R but not share the original, where would we all be?

At this point, if he is reading this, David Smith, long-suffering veteran of open source flameouts, is rolling his eyes, while Tal G is wondering if he will publish this on R Bloggers, and if so when, or something.

Let's bring in another R veteran: Hadley Wickham, who wrote a book on R and also created ggplot. That's the best quality, most often used graphics package.

In terms of economic utility to the end user, the ggplot package may be as useful, if not more, as the foreach package developed by Revolution Computing/Analytics.

Now http://cran.r-project.org/web/packages/foreach/index.html says that foreach is licensed under http://www.apache.org/licenses/LICENSE-2.0

However, let's come to open core licensing (read about it here: http://alampitt.typepad.com/lampitt_or_leave_it/2008/08/open-core-licen.html ), which is where the debate is. Revolution takes code, enhances it (in my opinion) substantially with the new XDF format for better efficiency, a web services API, and, coming next year, a GUI (thanks in advance, Dr Nie and guys),

and sells this advanced R code to businesses happy to pay (they are currently paying much more to Dr Goodnight and HIS guys).

Why would any sane customer buy it from Revolution if he could download exactly the same thing from http://r-project.org ?

Hence the business need for Revolution Analytics to have an enhanced R, as they are using a product-based software model, not a software-as-a-service model.

If Revolution gives away the source code of these new enhancements to the R core team, how will the R core team protect the above-mentioned intellectual property, given that they have 2 decades of experience of giving away free code, and going back and forth on just code?

Now Revolution also has a marketing budget, and that's how they sponsor some R Core events, conferences and after-conference snacks.

How would people decide if they are being too generous or too stingy in their contribution (compared to the formidable generosity of SAS Institute to its employees, stakeholders and even third party analysts)?

Would it not be better if Revolution shifted that aspect of the relationship from its marketing budget to its research and development budget, and came up with some sort of incentive for “SOME” developers? Even researchers need grants, assistantships and scholarships. Make a transparent royalty formula: say 17.5% of NEW R sales goes to an R package developers pool, which in turn examines the usage rate of packages and need/merit before allocation. That would require Revolution to evolve from a startup into a more sophisticated corporation, and R Core could use this the same way as the John M Chambers software award/scholarship.
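To make the royalty idea concrete, here is a minimal Python sketch of one possible formula: a fixed share of sales goes into a pool, and the pool is split in proportion to package usage. Only the 17.5% figure comes from the text above; the function name, the sales figure, the package names and the usage counts are all hypothetical, and the need/merit review is not modeled.

```python
ROYALTY_RATE = 0.175  # the 17.5% share suggested above


def allocate_royalties(sales, usage, royalty_rate=ROYALTY_RATE):
    """Split a royalty pool across packages in proportion to usage.

    `sales` is total revenue in USD; `usage` maps a package name to some
    usage count (downloads, installs, calls, whatever can be measured).
    """
    pool = sales * royalty_rate
    total_usage = sum(usage.values())
    if total_usage == 0:
        # Nothing was used, so nothing can be allocated by usage.
        return {name: 0.0 for name in usage}
    return {name: pool * count / total_usage for name, count in usage.items()}


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    shares = allocate_royalties(
        sales=1_000_000,
        usage={"pkg_a": 600, "pkg_b": 300, "pkg_c": 100},
    )
    for name, amount in shares.items():
        print(name, round(amount, 2))
```

With these made-up numbers the pool is $175,000, and the three packages receive roughly $105,000, $52,500 and $17,500. The hard part in practice would be agreeing on how to measure usage, not the arithmetic.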

Don't pay all developers: it would be an insult to many of them, say Prof Harrell, creator of Hmisc, to accept. But can Revolution expand its dev base (and prospects for future employees) by sponsoring some R scholarships?

And I am sure that if Revolution opens up some more code to the community, they would find the rest of the world and its help useful. If it can't trust people like R Gentleman with some source code, well, he is a board member.

——————————————————————————————–

Now to sum up some technical discussions on NeW R

1)  An accepted way of benchmarking efficiencies.

2) Code review and incorporation of efficiencies.

3) Multithreading and multicore usage are trends to be incorporated.

4) GUIs like R Commander (with its plugins for other packages), Rattle for data mining, or Deducer need focused development. This may involve hiring user interface designers (like from Apple 😉) who will work for love AND money (even the Beatles charge royalty for that song).

5) More support for cloud computing initiatives like Biocep and Elastic R, or Amazon AMIs for using cloud computers. Note that efficiency arguments don't matter if you just use a Chrome browser and pay 2 cents an hour for an Amazon instance. Probably R core needs more direct involvement from Google (Cloud OS makers) and Amazon, as well as even Salesforce.com (for creating Force.com apps). Note that even more corporates need to be involved here, as cloud computing does not have any free and open source infrastructure (YET).

_______________________________________________________

Debates will come and go. This is an interesting intellectual debate, and someday the little guys will win the Revolution-

From Hugh M of Gaping Void-

http://www.gapingvoid.com/Moveable_Type/archives/cat_microsoft_blue_monster_series.html

HOW DOES A SOFTWARE COMPANY MAKE MONEY, IF ALL

SOFTWARE IS FREE?

“If something goes wrong with Microsoft, I can phone Microsoft up and have it fixed. With Open Source, I have to rely on the community.”

And the community, as much as we may love it, is unpredictable. It might care about your problem and want to fix it, then again, it may not. Anyone who has ever witnessed something online go “viral”, good or bad, will know what I’m talking about.

and especially-

http://gapingvoid.com/2007/04/16/how-well-does-open-source-currently-meet-the-needs-of-shareholders-and-ceos/

Source-http://gapingvoidgallery.com/

Kind of sums up what open core licensing is all about.

Kill R? Wait a sec

1) Is R efficient? (scripting wise, and performance wise): It depends on how you code it. Some packages like foreach can help, but basic efficiency comes from the programmer. The XDF format from RevoScaleR, the non-open R package, further improves programming efficiency.

2) Should R be written from scratch?

You got to be kidding- It depends on how you define scratch after 2 million users

This has been done with S, then S Plus and now R.

3) What should be the license of R (if it was made a new)?

The GPL license is fine. You need to do a better job of executing the license. Currently interfaces to R exist from SPSS, SAS, KXEN and other companies as well. To my knowledge, royalty payments as well as formal code sharing have not been agreed.

R core needs to do a better job of protecting the work of 2500 package creators rather than settling for a few snacks at events, sponsorships, Corporate Board Membership for Prof Gentleman, and 4-5 packages donated to it. The only way R developers can currently support their research is to write a book (by Springer mostly).

E.g. ggplot and Hmisc are likely to be used more by the average corporate user. Do their creators deserve royalty if the creators of RevoScaleR are getting it?

If some of the 2 million users gave $1 to R core (compared to the 9 million in the last round of funding in Revolution Analytics), you would have enough money to create a 64-bit optimized R for Linux (missing in Enterprise R), Amazon R APIs (like Karim Chine’s efforts), R GUIs (like Rattle’s commercial version), etc.

The developments are not surprising given that Microsoft and Intel are funding Revolution Analytics http://www.dudeofdata.com/?p=1967

R controversies come and go (this has happened before including the NYT article and shakeup at Revo)

An interesting debate on whether R should be killed to make an upgrade to a more efficient language.

From Tal (creator R Bloggers) and on R help list-

There is currently a (very !) lively discussions happening around the web, surrounding the following topics:
1) Is R efficient? (scripting wise, and performance wise)
2) Should R be written from scratch?
3) What should be the license of R (if it was made a new)?

Very serious people have taken part in the debates so far.  I hope to let you know of the places I came by, so you might be able to follow/participate
in these (IMHO) important discussions.

The discussions started in the response for the following blog post on
Xi’An’s blog:
http://xianblog.wordpress.com/2010/09/06/insane/


Followed by the (short) response post by Ross Ihaka:
http://xianblog.wordpress.com/2010/09/13/simply-start-over-and-build-something-better/


Other discussions started to appear on Andrew Gelman’s blog:
http://www.stat.columbia.edu/~cook/movabletype/archives/2010/09/ross_ihaka_to_r.html

And (many) more responses started to appear in the hackers news website:
http://news.ycombinator.com/item?id=1687054

I hope these discussions will have fruitful results for our community,
Tal

Contact Details:
Contact me: Tal.Galili@gmail.com |  972-52-7275845
Read me: www.talgalili.com (Hebrew) | www.biostatistics.co.il (Hebrew) |
www.r-statistics.com (English)

My 0 cents (see, it would be 2 cents but it's free)

Google AppInventor -Android and Business Intelligence

Here is a great new tool for techies to start creating Android apps right away, even if you have no knowledge of the platform. Of course a great number of apps already exist, including my favorite Android data mining app in R, called AnalyticDroid http://analyticdroid.togaware.com/

Basically it calls the Rattle (R Analytical Tool To Learn Easily) data mining GUI, enabling data mining from an Android mobile using remote computing.

I don't know if any other statistical application is available on Android mobiles, though SAS did have a presentation on using SAS on the iPhone

http://www.wuss.org/proceedings09/09WUSSProceedings/papers/dpr/DPR-Truong.pdf



SAS Mobile -Iphone App

All you need to do is go to http://appinventor.googlelabs.com/about/index.html and request access (yes there is a 2 week approval waiting line)

Because App Inventor provides access to a GPS-location sensor, you can build apps that know where you are. You can build an app to help you remember where you parked your car, an app that shows the location of your friends or colleagues at a concert or conference, or your own custom tour app of your school, workplace, or a museum.
You can write apps that use the phone features of an Android phone. You can write an app that periodically texts “missing you” to your loved ones, or an app “No Text While Driving” that responds to all texts automatically with “sorry, I’m driving and will contact you later”. You can even have the app read the incoming texts aloud to you (though this might lure you into responding).
App Inventor provides a way for you to communicate with the web. If you know how to write web apps, you can use App Inventor to write Android apps that talk to your favorite web sites, such as Amazon and Twitter.

Here is a not so statistical Android app I am trying to create, called Hang-Out.

It uses the current GPS location of your phone to find the nearest pub, movie or diner, and to catch a bus or train based on your city, the GPS location, the time of the request and the schedule of that city's public transport. Very much WIP.

Amazon announces Micro Instances for cloud computing

From Amazon http://aws.amazon.com/ec2

Micro instances provide 613 MB of memory and support 32-bit and 64-bit platforms on both Linux and Windows. Micro instance pricing for On-Demand instances starts at $0.02 per hour for Linux and $0.03 per hour for Windows.

Customers have asked us for a lower priced instance type that could satisfy the needs of their less demanding applications. Micro instances are optimized for applications that require lower throughput, but which still may consume significant compute cycles periodically. Micro instances provide a small amount of consistent CPU resources, and also allow you to burst CPU capacity when additional cycles are available.

Micro instances are available immediately in all regions, and we invite you to go and try one out for yourself today! Learn more about Amazon EC2’s new Micro instances at aws.amazon.com/ec2.

Micro Instances

Instances of this family provide a small amount of consistent CPU resources and allow you to burst CPU capacity when additional cycles are available. They are well suited for lower throughput applications and web sites that consume significant compute cycles periodically.

  • Micro Instance 613 MB of memory, up to 2 ECUs (for short periodic bursts), EBS storage only, 32-bit or 64-bit platform

So don't buy that new CPU yet. Use existing hardware in tandem with these micro instances (and the internet) to compute (but only if your corporate IP administrator wasn’t trained in Windows-only certifications 😉)