2010 – Page 30 – DECISION STATS

Windows Azure vs Amazon EC2 (and Google Storage)

Here is a comparison of Windows Azure instances vs. Amazon EC2 compute instances.

Compute Instance Sizes:

Developers can choose the size of the VMs that run their application based on the application's resource requirements. Windows Azure compute instances come in four sizes to support complex applications and workloads.

Compute Instance Size	CPU	Memory	Instance Storage	I/O Performance
Small	1.6 GHz	1.75 GB	225 GB	Moderate
Medium	2 x 1.6 GHz	3.5 GB	490 GB	High
Large	4 x 1.6 GHz	7 GB	1,000 GB	High
Extra large	8 x 1.6 GHz	14 GB	2,040 GB	High

Standard Rates:

Windows Azure

Compute
- Small instance (default): $0.12 per hour
- Medium instance: $0.24 per hour
- Large instance: $0.48 per hour
- Extra large instance: $0.96 per hour
Storage
- $0.15 per GB stored per month
- $0.01 per 10,000 storage transactions
Content Delivery Network (CDN)
- $0.15 per GB for data transfers from European and North American locations*
- $0.20 per GB for data transfers from other locations*
- $0.01 per 10,000 transactions*

Source –

http://www.microsoft.com/windowsazure/offers/popup/popup.aspx?lang=en&locale=en-US&offer=MS-AZR-0001P

and

http://www.microsoft.com/windowsazure/windowsazure/

Amazon EC2 has more options, though:

http://aws.amazon.com/ec2/pricing/

Standard On-Demand Instances	Linux/UNIX Usage	Windows Usage
Small (Default)	$0.085 per hour	$0.12 per hour
Large	$0.34 per hour	$0.48 per hour
Extra Large	$0.68 per hour	$0.96 per hour
Micro On-Demand Instances	Linux/UNIX Usage	Windows Usage
Micro	$0.02 per hour	$0.03 per hour
High-Memory On-Demand Instances
Extra Large	$0.50 per hour	$0.62 per hour
Double Extra Large	$1.00 per hour	$1.24 per hour
Quadruple Extra Large	$2.00 per hour	$2.48 per hour
High-CPU On-Demand Instances
Medium	$0.17 per hour	$0.29 per hour
Extra Large	$0.68 per hour	$1.16 per hour
Cluster Compute Instances
Quadruple Extra Large	$1.60 per hour	N/A*
`*` Windows is not currently available for Cluster Compute Instances.
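
To make the rate tables easier to compare, here is a minimal Python sketch that turns the hourly rates quoted above into a monthly cost per instance. The 720-hour month (30 days, always on) and the function names are my own assumptions for illustration; the rates are the 2010 figures listed in this post, and storage and bandwidth charges are ignored.

```python
# Hourly on-demand rates (USD) copied from the tables above (2010 prices).
AZURE = {"small": 0.12, "medium": 0.24, "large": 0.48, "extra_large": 0.96}
EC2_STANDARD = {
    "small": {"linux": 0.085, "windows": 0.12},
    "large": {"linux": 0.34, "windows": 0.48},
    "extra_large": {"linux": 0.68, "windows": 0.96},
}

# Assumption: one instance running non-stop for a 30-day month.
HOURS_PER_MONTH = 720


def monthly_cost(hourly_rate, hours=HOURS_PER_MONTH):
    """Cost in USD of running one instance for the given number of hours."""
    return round(hourly_rate * hours, 2)


def compare(size):
    """Monthly cost of one size on Azure and on EC2 standard instances."""
    ec2 = EC2_STANDARD[size]
    return {
        "azure": monthly_cost(AZURE[size]),
        "ec2_windows": monthly_cost(ec2["windows"]),
        "ec2_linux": monthly_cost(ec2["linux"]),
    }


if __name__ == "__main__":
    for size in EC2_STANDARD:
        print(size, compare(size))
```

At these rates Azure and EC2 Windows instances cost the same per hour for the small, large and extra large sizes, so the real difference shows up only if you can run on Linux. Azure's medium size has no counterpart in the EC2 standard family listed above, which is why it is left out of the comparison.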

http://aws.amazon.com/ec2/instance-types/

Standard Instances

Instances of this family are well suited for most applications.

Small Instance – default*

1.7 GB memory
1 EC2 Compute Unit (1 virtual core with 1 EC2 Compute Unit)
160 GB instance storage (150 GB plus 10 GB root partition)
32-bit platform
I/O Performance: Moderate
API name: m1.small

Large Instance

7.5 GB memory
4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute Units each)
850 GB instance storage (2×420 GB plus 10 GB root partition)
64-bit platform
I/O Performance: High
API name: m1.large

Extra Large Instance

15 GB memory
8 EC2 Compute Units (4 virtual cores with 2 EC2 Compute Units each)
1,690 GB instance storage (4×420 GB plus 10 GB root partition)
64-bit platform
I/O Performance: High
API name: m1.xlarge

Micro Instances

Instances of this family provide a small amount of consistent CPU resources and allow you to burst CPU capacity when additional cycles are available. They are well suited for lower throughput applications and web sites that consume significant compute cycles periodically.

Micro Instance

613 MB memory
Up to 2 EC2 Compute Units (for short periodic bursts)
EBS storage only
32-bit or 64-bit platform
I/O Performance: Low
API name: t1.micro

High-Memory Instances

Instances of this family offer large memory sizes for high throughput applications, including database and memory caching applications.

High-Memory Extra Large Instance

17.1 GB of memory
6.5 EC2 Compute Units (2 virtual cores with 3.25 EC2 Compute Units each)
420 GB of instance storage
64-bit platform
I/O Performance: Moderate
API name: m2.xlarge

High-Memory Double Extra Large Instance

34.2 GB of memory
13 EC2 Compute Units (4 virtual cores with 3.25 EC2 Compute Units each)
850 GB of instance storage
64-bit platform
I/O Performance: High
API name: m2.2xlarge

High-Memory Quadruple Extra Large Instance

68.4 GB of memory
26 EC2 Compute Units (8 virtual cores with 3.25 EC2 Compute Units each)
1690 GB of instance storage
64-bit platform
I/O Performance: High
API name: m2.4xlarge

High-CPU Instances

Instances of this family have proportionally more CPU resources than memory (RAM) and are well suited for compute-intensive applications.

High-CPU Medium Instance

1.7 GB of memory
5 EC2 Compute Units (2 virtual cores with 2.5 EC2 Compute Units each)
350 GB of instance storage
32-bit platform
I/O Performance: Moderate
API name: c1.medium

High-CPU Extra Large Instance

7 GB of memory
20 EC2 Compute Units (8 virtual cores with 2.5 EC2 Compute Units each)
1690 GB of instance storage
64-bit platform
I/O Performance: High
API name: c1.xlarge

Cluster Compute Instances

Instances of this family provide proportionally high CPU resources with increased network performance and are well suited for High Performance Compute (HPC) applications and other demanding network-bound applications. Learn more about use of this instance type for HPC applications.

Cluster Compute Quadruple Extra Large Instance

23 GB of memory
33.5 EC2 Compute Units (2 x Intel Xeon X5570, quad-core “Nehalem” architecture)
1690 GB of instance storage
64-bit platform
I/O Performance: Very High (10 Gigabit Ethernet)
API name: cc1.4xlarge
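
Since Amazon quotes CPU capacity in EC2 Compute Units (ECU), one rough way to read the listings above is price per ECU-hour. The sketch below pairs each API name with its Linux/UNIX on-demand rate and ECU count from this post. The helper names are hypothetical, and the metric ignores memory, storage and network, so treat it as a back-of-the-envelope comparison only.

```python
# (API name, Linux/UNIX USD per hour, EC2 Compute Units), from the listings above.
# t1.micro is left out because its "up to 2 ECU" applies only to short bursts.
INSTANCES = [
    ("m1.small", 0.085, 1),
    ("m1.large", 0.34, 4),
    ("m1.xlarge", 0.68, 8),
    ("m2.xlarge", 0.50, 6.5),
    ("m2.2xlarge", 1.00, 13),
    ("m2.4xlarge", 2.00, 26),
    ("c1.medium", 0.17, 5),
    ("c1.xlarge", 0.68, 20),
    ("cc1.4xlarge", 1.60, 33.5),
]


def price_per_ecu(hourly_rate, ecu):
    """USD per hour for one EC2 Compute Unit."""
    return hourly_rate / ecu


def rank_by_price_per_ecu(instances):
    """Return (name, USD per ECU-hour) pairs, cheapest compute first."""
    priced = [(name, price_per_ecu(rate, ecu)) for name, rate, ecu in instances]
    return sorted(priced, key=lambda pair: pair[1])


if __name__ == "__main__":
    for name, cost in rank_by_price_per_ecu(INSTANCES):
        print(f"{name:12s} ${cost:.4f} per ECU-hour")
```

On this measure the High-CPU family is cheapest (about $0.034 per ECU-hour), the Cluster Compute instance comes next (about $0.048), and the standard family is the most expensive ($0.085), which matches how Amazon positions each family.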

Also http://www.microsoft.com/en-us/sqlazure/default.aspx

offers SQL Databases as a service with a free trial offer

If you are heavily into .NET and SQL Server, or otherwise dependent on Microsoft, Azure is a nice alternative to EC2: http://www.microsoft.com/windowsazure/offers/popup/popup.aspx?lang=en&locale=en-US&offer=COMPARE_PUBLIC

Update: I just got approved for Google Storage, so I am adding their info, though they are in Preview (and it's free right now) 🙂

https://code.google.com/apis/storage/docs/overview.html

Functionality

Google Storage for Developers offers a rich set of features and capabilities:

Basic Operations

Store and access data from anywhere on the Internet.
Range-gets for large objects.
Manage metadata.

Security and Sharing

User authentication using secret keys or Google account.
Authenticated downloads from a web browser for Google account holders.
Secure access using SSL.
Easy, powerful sharing and collaboration via ACLs for individuals and groups.

Performance and scalability

Up to 100 gigabytes per object and 1,000 buckets per account during the preview.
Strong data consistency—read-after-write consistency for all upload and delete operations.
Namespace for your domain—only you can create bucket URIs containing your domain name.
Data replicated in multiple data centers across the U.S. and within the same data center.

Tools

Web-based storage manager.
GSUtil, an open source command line tool.
Compatible with many existing cloud storage tools and libraries.

Read the Getting Started Guide to learn more about the service.

Note: Google Storage for Developers does not support Google Apps accounts that use your company domain name at this time.

Pricing

Google Storage for Developers pricing is based on usage.

Storage—$0.17/gigabyte/month
Network
- Upload data to Google
  - $0.10/gigabyte
- Download data from Google
  - $0.15/gigabyte for Americas and EMEA
  - $0.30/gigabyte for Asia-Pacific
Requests
- PUT, POST, LIST—$0.01 per 1,000 requests
- GET, HEAD—$0.01 per 10,000 requests
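
To see how these usage-based rates add up, here is a small Python sketch that estimates a monthly Google Storage bill from the prices listed above. The function and parameter names are hypothetical, and it assumes requests are billed proportionally rather than in whole blocks of 1,000 or 10,000, which the pricing text does not spell out.

```python
# Preview-period rates (USD) copied from the pricing list above.
RATES = {
    "storage_gb_month": 0.17,
    "upload_gb": 0.10,
    "download_gb": {"americas_emea": 0.15, "asia_pacific": 0.30},
    "put_post_list_per_1000": 0.01,
    "get_head_per_10000": 0.01,
}


def monthly_bill(stored_gb, upload_gb, download_gb, download_region,
                 write_requests, read_requests):
    """Estimate one month's charge in USD.

    write_requests counts PUT, POST and LIST calls; read_requests counts
    GET and HEAD calls. Assumption: requests are prorated, not rounded up
    to whole billing blocks.
    """
    storage = stored_gb * RATES["storage_gb_month"]
    upload = upload_gb * RATES["upload_gb"]
    download = download_gb * RATES["download_gb"][download_region]
    writes = write_requests / 1000 * RATES["put_post_list_per_1000"]
    reads = read_requests / 10000 * RATES["get_head_per_10000"]
    return round(storage + upload + download + writes + reads, 2)


if __name__ == "__main__":
    # 100 GB stored, 20 GB uploaded, 50 GB downloaded in the Americas,
    # 100,000 writes and 1,000,000 reads.
    print(monthly_bill(100, 20, 50, "americas_emea", 100_000, 1_000_000))
```

For that example the estimate is $28.50: $17.00 for storage, $2.00 for uploads, $7.50 for downloads, and $1.00 each for writes and reads. Note that write-type requests cost ten times as much per request as reads.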

Matlab-Mathematica-R and GPU Computing

MathWorks has announced that MATLAB's Parallel Computing Toolbox now supports GPU computing as well:

http://www.mathworks.com/products/parallel-computing/

Parallel Computing Toolbox™ lets you solve computationally and data-intensive problems using multicore processors, GPUs, and computer clusters. High-level constructs (parallel for-loops, special array types, and parallelized numerical algorithms) let you parallelize MATLAB® applications without CUDA or MPI programming. You can use the toolbox with Simulink® to run multiple simulations of a model in parallel.

MATLAB GPU Support

The toolbox provides eight workers (MATLAB computational engines) to execute applications locally on a multicore desktop. Without changing the code, you can run the same application on a computer cluster or a grid computing service (using MATLAB Distributed Computing Server™). You can run parallel applications interactively or in batch.

Parallel Computing with MATLAB on Amazon Elastic Compute Cloud (EC2)

Also a video of using Mathematica and GPU

Also R has many packages for GPU computing

Parallel computing: GPUs

from http://cran.r-project.org/web/views/HighPerformanceComputing.html

The gputools package by Buckner provides several common data-mining algorithms which are implemented using a mixture of nVidia's CUDA language and cublas library. Given a computer with an nVidia GPU these functions may be substantially more efficient than native R routines. The rpud package provides an optimised distance metric for NVidia-based GPUs.
The cudaBayesreg package by da Silva implements the rhierLinearModel from the bayesm package using nVidia’s CUDA language and tools to provide high-performance statistical analysis of fMRI voxels.
The rgpu package (see below for link) aims to speed up bioinformatics analysis by using the GPU.
The magma package provides an interface to the hybrid GPU/CPU library Magma (see below for link).
The gcbd package implements a benchmarking framework for BLAS and GPUs (using gputools).

I searched for SAS and GPU, and for SPSS and GPU, but found nothing. Maybe they would do well to at least test this alternative hardware.

Also see Matlab on GPU comparison for the product Jacket vs Parallel Computing Toolbox

http://www.accelereyes.com/products/compare

A Google App for Sales- ERPLY

While not quite Salesforce.com, a promising start for the first ERP Google App at https://www.google.com/enterprise/marketplace/viewListing?productListingId=5759+8485502070963042532

An interesting development: maybe there will be some statistical or BI apps on the Google App Marketplace soon 😉

The Comic Water Games (aka Common Wealth Games)

We in Delhi, India are a tough people. With summer temperatures of 46 degrees Celsius (114 degrees Fahrenheit), winter temperatures of 2-3 degrees Celsius (just above freezing), high pollution levels, and the worst traffic jams (and the most cars per capita), there is very little that intimidates the average Delhiite.

But the Return of the British Empire is scaring us, and it is called the Common Wealth Games. The Common Wealth is a group of countries that used to be colonized by Britain in her colonial days (the USA is not a member though, as they probably kicked way too much British butt while gaining independence).

And every 4 years they have the CommonWealth Games (read: games for the non-US English-speaking world). So when our commie neighbors, the Chinese, went and got themselves an Olympics, we decided to get ourselves the CWG too. Big deal: national pride, rising economic power and all that.

So far the Games have meant the following: lots of roads dug up, lots of stadiums in various degrees of preparation, a total cost of 2 billion USD, and rampant allegations of corruption due to the tenfold increase in budget, including rather suspicious-looking documents procured by our local press (yes, the Indian press is free, as India is a democracy).

And add divine grace. Delhi has had its wettest monsoon since 1978, it rains cats and dogs in September, and we now have a mini dengue and malaria epidemic. Four countries have declared the living quarters for athletes uninhabitable, some have walked out, and the inevitable terrorists injured two Taiwanese tourists this weekend (in a semi-ironic email they said they were prepared as the government was prepared; it isn't).

Today a bridge collapsed-

http://www.nytimes.com/2010/09/22/sports/22iht-GAMES.html?_r=1&hp

On Tuesday afternoon, a bridge next to Jawaharlal Nehru Stadium, the main Games venue, fell apart. The footbridge collapsed into three pieces, taking several workers with it and uprooting one side of the arch that supported it.

A police officer at the scene said that 27 people had been injured, four of them seriously, in the collapse.

“This will not affect the Games,” said Raj Kumar Chauhan, a Delhi minister for development, who spoke on the scene. “We can put the bridge up again, or make a new one.”

and

http://www.nytimes.com/2010/09/20/world/asia/20india.html?ref=sports

“We really need to learn how to plan,” said Vrinda Walavalkar, a public relations executive who is not connected to the Games.

“Maybe we feel we have so many lifetimes to achieve things” that it does not matter if it gets done this time, she said.

Mr. Gupta, the shopkeeper, found a metaphor in Hindu wedding tradition.

The groom’s party, known as the barat, traditionally marches to the bride’s house on horseback with his friends and family, he explained. When the barat appears, the bride has to come to the door, he said.

“If the bride is not ready, you patch her up and try to hide all her defects,” Mr. Gupta said, and then you send her outside.

————————————————————————————————————–

To some this may be shocking. To the average Delhiite battling traffic and rain, this is one more episode in the chaotic capital. As a small solace, Delhi still has the best and cheapest street food in this part of the world, with golgappas, tikki and chaat. If only you can beat the rain to get them!

Also see http://en.wikipedia.org/wiki/Delhi if you would like to know more.

Hearst DataMining Challenge

Check out the Hearst Data Mining Challenge, a new competition sponsored by DMA, Hearst Magazine, and EXL.

THE HEARST CHALLENGE STARTS ON OCTOBER 14TH

CHALLENGE

DESCRIPTION

Over the years, the magazine publishing industry has made significant strides in improving subscription based circulation by developing analytic frameworks that better predict customer response to acquisition and renewal offers. The objective of this contest is to apply the same analytic discipline and effectively predict newsstand locations “response”. Specifically the objective is to predict the number of copies to be placed in each newsstand location to optimize the overall contribution of the newsstand location typically referred to as draw.

Data for the competition is provided by CMG and Experian.

and

RULES

HOW TO ENTER: Beginning October 14th, 2010 at 12:01 AM (ET) through December 3rd, 2010 at 11:59 PM (ET) go to the Hearst Challenge website located at http://www.HearstChallenge.com (the “Site”) and complete and submit the entry form pursuant to the onscreen instructions. Entrants will be provided a historical sample of newsstand location draw, sales and associated location level data to help develop their predictive algorithm. Hearst will in turn hold back two distinct sets of draw/sales data, one to be used as a validation set by the contestant and one to be used as a final contest evaluation set. Entrants may not include any other external variables for the challenge. Additional details will be provided with the data. Entrants will be able to track their performance against the validation set throughout the course of the challenge via a leader tracking board to be made available on the Site. Entries must include the following documentation:

Data file with id variables and expected sales values by store and publication

The final model/ algorithm code used to score the final data set

Any supporting documentation that pertains to the development of the submitted model/algorithm including variable creation. Variables that were used in the model need to be traced through from input to coefficient / node (if using a tree based methodology).

Check out http://www.hearstchallenge.com/index.php for further details.

Where is Waldo? Webcast on Network Intelligence

From the good folks at AsterData, a webcast on a slightly interesting analytics topic

Enterprises and government agencies can become overwhelmed with information. The value of all that data lies in the insights it can reveal. To get the maximum value, you need an analytic platform that lets you analyze terabytes of information rapidly for immediate actionable insights.

Aster Data’s massively parallel database with an integrated analytics engine can quickly reveal hard-to-recognize trends on huge datasets which other systems miss. The secret? A patent-pending SQL-MapReduce framework that enables business analysts and business intelligence (BI) tools to iteratively analyze big data more quickly. This allows you to find anomalies more quickly and stop disasters before they happen.

Discover how you can improve:

Network intelligence via graph analysis to understand connectivity among suspects, information propagation, and the flow of goods
Security analysis to prevent fraud, bot attacks, and other breaches
Geospatial analytics to quickly uncover details about regions and subsets within those communities
Visual analytics to derive deeper insights more quickly

September Roundup by Revolution

From the monthly newsletter, which I consider quite useful for keeping up to date on applications of R:

——————————————————————————————————————————————————————————————————–

Revolution News
Every month, we’ll bring you the latest news about Revolution’s products and events in this section. Follow us on Twitter at @RevolutionR for up-to-the-minute news and updates from Revolution Analytics!

Revolution R Enterprise 4.0 for Windows now available. Based on the latest R 2.11.1 and including the RevoScaleR package for big-data analysis in R, Revolution R Enterprise is now available for download for Windows 32-bit and 64-bit systems. Click here to subscribe, or available free to academia.

New! Integrate R with web applications, BI dashboards and more with web services. RevoDeployR is a new Web Services framework that integrates dynamic R-based computations into applications for business users. It will be available September 30 with Revolution R Enterprise Server on RHEL 5. Click here to learn more.

Free Webinar, September 22: In a joint webinar from Revolution Analytics and Jaspersoft, learn how to use RevoDeployR to integrate advanced analytics on-demand in applications, BI dashboards, and on the web. Register here.

Revolution in the News: SearchBusinessAnalytics.com previews the forthcoming Revolution R GUI; Channel Register introduces RevoDeployR, while IT Business Edge shows off the Web Services architecture; and ReadWriteWeb.com looks at how RevoScaleR tackles the Big Data explosion.

Inside-R: A new site for the R Community. At www.inside-R.org you’ll find the latest information about R from around the Web, searchable R documentation and packages, hints and tips about R, and more. You can even add a “Download R” badge to your own web-page to help spread the word about R.

R News, Tips and Tricks from the Revolutions blog
The Revolutions blog brings you daily news and tips about R, statistics and open source. Here are some highlights from Revolutions from the past month.

R’s key role in the oil spill response: Read how NIST’s Division Chief of Statistical Engineering used R to provide critical analysis in real time to the Secretaries of Energy and the Interior, and helped coordinate the government’s response.

Animating data with R and Google Earth: Learn how to use R to create animated visualizations of geographical data with Google Earth, such as this video showing how tuna migrations intersect with the location of the Gulf oil spill.

Are baseball games getting longer? Or is it just Red Sox games? Ryan Elmore uses nonparametric regression in R to find out.

Keynote presentations from useR! 2010: the worldwide R user’s conference was a great success, and there’s a wealth of useful tips and information in the presentations. Video of the keynote presentations are available too: check out in particular Frank Harrell’s talk Information Allergy, and Friedrich Leisch’s talk on reproducible statistical research.

Looking for more R tips and tricks? Check out the monthly round-ups at the Revolutions blog.

Upcoming Events
Every month, we’ll highlight some upcoming events from R Community Calendar.

September 23: The San Diego R User Group has a meetup on BioConductor and microarray data analysis.

September 28: The Sydney Users of R Forum has a meetup on building world-class predictive models in R (with dinner to follow).

September 28: The Los Angeles R User Group presents an introduction to statistical finance with R.

September 28: The Seattle R User Group meets to discuss, “What are you doing with R?”

September 29: The Raleigh-Durham-Chapel Hill R Users Group has its first meeting.

October 7: The NYC R User Group features a presentation by Prof. Andrew Gelman.

There are also new R user groups in Singapore, Seoul, Denver, Brisbane, and New Jersey. Please let us know if we’re missing your R user group, or if you want to get a new one started.

———————————————————————————————-Editor

David Smith, VP Marketing
david@revolutionanalytics.com
Twitter: @revodavid

subscribe here for Revo’s Monthly newsletter-
