Different kinds of Clouds

Some slides I liked on cloud computing infrastructure as offered by Amazon, IBM, Google , Windows and Oracle

 

 

New Amazon Instance: High I/O for NoSQL

Latest from the Amazon Cloud-

hi1.4xlarge instances come with eight virtual cores that can deliver 35 EC2 Compute Units (ECUs) of CPU performance, 60.5 GiB of RAM, and 2 TiB of storage capacity across two SSD-based storage volumes. Customers using hi1.4xlarge instances for their applications can expect over 120,000 4 KB random write IOPS, and as many as 85,000 random write IOPS (depending on active LBA span). These instances are available on a 10 Gbps network, with the ability to launch instances into cluster placement groups for low-latency, full-bisection bandwidth networking.

High I/O instances are currently available in three Availability Zones in US East (N. Virginia) and two Availability Zones in EU West (Ireland) regions. Other regions will be supported in the coming months. You can launch hi1.4xlarge instances as On Demand instances starting at $3.10/hour, and purchase them as Reserved Instances

http://aws.amazon.com/ec2/instance-types/

High I/O Instances

Instances of this family provide very high instance storage I/O performance and are ideally suited for many high performance database workloads. Example applications include NoSQL databases like Cassandra and MongoDB. High I/O instances are backed by Solid State Drives (SSD), and also provide high levels of CPU, memory and network performance.

High I/O Quadruple Extra Large Instance

60.5 GB of memory
35 EC2 Compute Units (8 virtual cores with 4.4 EC2 Compute Units each)
2 SSD-based volumes each with 1024 GB of instance storage
64-bit platform
I/O Performance: Very High (10 Gigabit Ethernet)
Storage I/O Performance: Very High*
API name: hi1.4xlarge

*Using Linux paravirtual (PV) AMIs, High I/O Quadruple Extra Large instances can deliver more than 120,000 4 KB random read IOPS and between 10,000 and 85,000 4 KB random write IOPS (depending on active logical block addressing span) to applications. For hardware virtual machines (HVM) and Windows AMIs, performance is approximately 90,000 4 KB random read IOPS and between 9,000 and 75,000 4 KB random write IOPS. The maximum sequential throughput on all AMI types (Linux PV, Linux HVM, and Windows) per second is approximately 2 GB read and 1.1 GB write.

Talking on Big Data Analytics

I am going  being sponsored to a Government of India sponsored talk on Big Data Analytics at Bangalore on Friday the 13 th of July. If you are in Bangalore, India you may drop in for a dekko. Schedule and Abstracts (i am on page 7 out 9) .

Your tax payer money is hard at work- (hassi majak only if you are a desi. hassi to fassi.)

13 July 2012 (9.30 – 11.00 & 11.30 – 1.00)
Big Data Big Analytics
The talk will showcase using open source technologies in statistical computing for big data, namely the R programming language and its use cases in big data analysis. It will review case studies using the Amazon Cloud, custom packages in R for Big Data, tools like Revolution Analytics RevoScaleR package, as well as the newly launched SAP Hana used with R. We will also review Oracle R Enterprise. In addition we will show some case studies using BigML.com (using Clojure) , and approaches using PiCloud. In addition it will showcase some of Google APIs for Big Data Analysis.

Lastly we will talk on social media analysis ,national security use cases (i.e. cyber war) and privacy hazards of big data analytics.

Schedule

View more presentations from Ajay Ohri.
Abstracts

View more documents from Ajay Ohri.

 

How big is R on CRAN #rstats

3.87 GB and 3786 packages. Thats what you need to install the whole of R as on CRAN

( Note- Many IT administrators /Compliance Policies in enterprises forbid installing from the Internet in work offices.

Which is where the analytics,$$, and people are)

As downloaded from the CRAN Mirror at UCLA.

Takes 3 hours to download at 1 mbps (I was on an Amazon Ec2 instance)

See screenshot.

Next question- who is the man responsible in the R project for deleting old /depreciated/redundant packages if the authors dont do it.

 

Visualizing Bigger Data in R using Tabplot

The amazing tabplot package creates the tableplot feature for visualizing huge chunks of data. This is a great example of creative data visualization that is resource lite and extremely fast in a first look at the data. (note- The tabplot package is being used and table plot function is being used . The TABLEPLOT package is different and is NOT being used here).

library(ggplot2)
data(diamonds)
library(tabplot)
tableplot(diamonds)
system.time(tableplot(diamonds))

visualizing a 50000 row by 10 variable dataset in 0.7 s is fast !!

click on screenshot to see it

and some say R is slow 😉

 

Note I used a free Windows Amazon EC2 Instance for it-

See screenshot for hardware configuration

 

the best thing is there is a handy GTK GUI for this package. You can check it out at

 

 

send email by R

For automated report delivery I have often used send email options in BASE SAS. For R, for scheduling tasks and sending me automated mails on completion of tasks I have two R options and 1 Windows OS scheduling option. Note red font denotes the parameters that should be changed. Anything else should NOT be changed.

Option 1-

Use the mail package at

http://cran.r-project.org/web/packages/mail/mail.pdf

> library(mail)

Attaching package: ‘mail’

The following object(s) are masked from ‘package:sendmailR’:

sendmail

>
> sendmail(“ohri2007@gmail.com“, subject=”Notification from R“,message=“Calculation finished!”, password=”rmail”)
[1] “Message was sent to ohri2007@gmail.com! You have 19 messages left.”

Disadvantage- Only 20 email messages by IP address per day. (but thats ok!)

Option 2-

use sendmailR package at http://cran.r-project.org/web/packages/sendmailR/sendmailR.pdf

install.packages()
library(sendmailR)
from <- sprintf(“<sendmailR@%s>”, Sys.info()[4])
to <- “<ohri2007@gmail.com>”
subject <- “Hello from R
body <- list(“It works!”, mime_part(iris))
sendmail(from, to, subject, body,control=list(smtpServer=”ASPMX.L.GOOGLE.COM”))

 

 

BiocInstaller version 1.2.1, ?biocLite for help
> install.packages(“sendmailR”)
Installing package(s) into ‘/home/ubuntu/R/library’
(as ‘lib’ is unspecified)
also installing the dependency ‘base64’

trying URL ‘http://cran.at.r-project.org/src/contrib/base64_1.1.tar.gz&#8217;
Content type ‘application/x-gzip’ length 61109 bytes (59 Kb)
opened URL
==================================================
downloaded 59 Kb

trying URL ‘http://cran.at.r-project.org/src/contrib/sendmailR_1.1-1.tar.gz&#8217;
Content type ‘application/x-gzip’ length 6399 bytes
opened URL
==================================================
downloaded 6399 bytes

BiocInstaller version 1.2.1, ?biocLite for help
* installing *source* package ‘base64’ …
** package ‘base64’ successfully unpacked and MD5 sums checked
** libs
gcc -std=gnu99 -I/usr/local/lib64/R/include -I/usr/local/include -fpic -g -O2 -c base64.c -o base64.o
gcc -std=gnu99 -shared -L/usr/local/lib64 -o base64.so base64.o -L/usr/local/lib64/R/lib -lR
installing to /home/ubuntu/R/library/base64/libs
** R
** preparing package for lazy loading
** help
*** installing help indices
** building package indices …
** testing if installed package can be loaded
BiocInstaller version 1.2.1, ?biocLite for help

* DONE (base64)
BiocInstaller version 1.2.1, ?biocLite for help
* installing *source* package ‘sendmailR’ …
** package ‘sendmailR’ successfully unpacked and MD5 sums checked
** R
** preparing package for lazy loading
** help
*** installing help indices
** building package indices …
** testing if installed package can be loaded
BiocInstaller version 1.2.1, ?biocLite for help

* DONE (sendmailR)

The downloaded packages are in
‘/tmp/RtmpsM222s/downloaded_packages’
> library(sendmailR)
Loading required package: base64
> from <- sprintf(“<sendmailR@%s>”, Sys.info()[4])
> to <- “<ohri2007@gmail.com>”
> subject <- “Hello from R”
> body <- list(“It works!”, mime_part(iris))
> sendmail(from, to, subject, body,
+ control=list(smtpServer=”ASPMX.L.GOOGLE.COM”))
$code
[1] “221”

$msg
[1] “2.0.0 closing connection ff2si17226764qab.40”

Disadvantage-This worked when I used the Amazon Cloud using the BioConductor AMI (for free 2 hours) at http://www.bioconductor.org/help/cloud/

It did NOT work when I tried it use it from my Windows 7 Home Premium PC from my Indian ISP (!!) .

It gave me this error

or in wait_for(250) :
SMTP Error: 5.7.1 [180.215.172.252] The IP you’re using to send mail is not authorized

 

PAUSE–

ps Why do this (send email by R)?

Note you can add either of the two programs of the end of the code that you want to be notified automatically. (like daily tasks)

This is mostly done for repeated business analytics tasks (like reports and analysis that need to be run at specific periods of time)

pps- What else can I do with this?

Can be modified to include sms or tweets  or even blog by email by modifying the   “to”  location appropriately.

3) Using Windows Task Scheduler to run R codes automatically (either the above)

or just sending an email

got to Start>  All Programs > Accessories >System Tools > Task Scheduler ( or by default C:Windowssystem32taskschd.msc)

Create a basic task

Now you can use this to run your daily/or scheduled R code  or you can send yourself email as well.

and modify the parameters- note the SMTP server (you can use the ones for google in example 2 at ASPMX.L.GOOGLE.COM)

and check if it works!

 

Related

 Geeky Things , Bro

Configuring IIS on your Windows 7 Home Edition-

note path to do this is-

Control Panel>All Control Panel Items> Program and Features>Turn Windows features on or off> Internet Information Services

and

http://stackoverflow.com/questions/709635/sending-mail-from-batch-file

 

Interview: Hjálmar Gíslason, CEO of DataMarket.com

Here is an interview with Hjálmar Gíslason, CEO of Datamarket.com  . DataMarket is an active marketplace for structured data and statistics. Through powerful search and visual data exploration, DataMarket connects data seekers with data providers.

 

Ajay-  Describe your journey as an entrepreneur and techie in Iceland. What are the 10 things that surprised you most as a tech entrepreneur.

HG- DataMarket is my fourth tech start-up since at age 20 in 1996. The previous ones have been in gaming, mobile and web search. I come from a technical background but have been moving more and more to the business side over the years. I can still prototype, but I hope there isn’t a single line of my code in production!

Funny you should ask about the 10 things that have surprised me the most on this journey, as I gave a presentation – literally yesterday – titled: “9 things nobody told me about the start-up business”

They are:
* Do NOT generalize – especially not to begin with
* Prioritize – and find a work-flow that works for you
* Meet people – face to face
* You are a sales person – whether you like it or not
* Technology is not a product – it’s the entire experience
* Sell the current version – no matter how amazing the next one is
* Learn from mistakes – preferably others’
* Pick the right people – good people is not enough
* Tell a good story – but don’t make them up

I obviously elaborate on each of these points in the talk, but the points illustrate roughly some of the things I believe I’ve learned … so far 😉

9 things nobody told me about the start-up business

Ajay-

Both Amazon  and Google  have entered the public datasets space. Infochimps  has 14,000+ public datasets. The US has http://www.data.gov/

So clearly the space is both competitive and yet the demand for public data repositories is clearly under served still. 

How does DataMarket intend to address this market in a unique way to differentiate itself from others.

HG- DataMarket is about delivering business data to decision makers. We help data seekers find the data they need for planning and informed decision making, and data publishers reaching this audience. DataMarket.com is the meeting point, where data seekers can come to find the best available data, and data publishers can make their data available whether for free or for a fee. We’ve populated the site with a wealth of data from public sources such as the UN, Eurostat, World Bank, IMF and others, but there is also premium data that is only available to those that subscribe to and pay for the access. For example we resell the entire data offering from the EIU (Economist Intelligence Unit) (link: http://datamarket.com/data/list/?q=provider:eiu)

DataMarket.com allows all this data to be searched, visualized, compared and downloaded in a single place in a standard, unified manner.

We see many of these efforts not as competition, but as valuable potential sources of data for our offering, while others may be competing with parts of our proposition, such as easy access to the public data sets.

 

Ajay- What are your views on data confidentiality and access to data owned by Governments funded by tax payer money.

HG- My views are very simple: Any data that is gathered or created for taxpayers’ money should be open and free of charge unless higher priorities such as privacy or national security indicate otherwise.

Reflecting that, any data that is originally open and free of charge is still open and free of charge on DataMarket.com, just easier to find and work with.

Ajay-  How is the technology entrepreneurship and venture capital scene in Iceland. What things work and what things can be improved?

HG- The scene is quite vibrant, given the small community. Good teams with promising concepts have been able to get the funding they need to get started and test their footing internationally. When the rapid growth phase is reached outside funding may still be needed.

There are positive and negative things about any location. Among the good things about Iceland from the stand point of a technology start-up are highly skilled tech people and a relatively simple corporate environment. Among the bad things are a tiny local market, lack of skills in international sales and marketing and capital controls that were put in place after the crash of the Icelandic economy in 2008.

I’ve jokingly said that if a company is hot in the eyes of VCs it would get funding even if it was located in the jungles of Congo, while if they’re only lukewarm towards you, they will be looking for any excuse not to invest. Location can certainly be one of them, and in that case being close to the investor communities – physically – can be very important.

We’re opening up our sales and marketing offices in Boston as we speak. Not to be close to investors though, but to be close to our market and current customers.

Ajay- Describe your hobbies when you are not founding amazing tech startups.

HG- Most of my time is spent working – which happens to by my number one hobby.

It is still important to step away from it all every now and then to see things in perspective and come back with a clear mind.

I *love* traveling to exotic places. Me and my wife have done quite a lot of traveling in Africa and S-America: safari, scuba diving, skiing, enjoying nature. When at home I try to do some sports activities 3-4 times a week at least, and – recently – play with my now 8 month old son as much as I can.

About-

http://datamarket.com/p/about/team/

Management

Hjalmar GislasonHjálmar Gíslason, Founder and CEO: Hjalmar is a successful entrepreneur, founder of three startups in the gaming, mobile and web sectors since 1996. Prior to launching DataMarket, Hjalmar worked on new media and business development for companies in the Skipti Group (owners of Iceland Telecom) after their acquisition of his search startup – Spurl. Hjalmar offers a mix of business, strategy and technical expertise. DataMarket is based largely on his vision of the need for a global exchange for structured data.

hjalmar.gislason@datamarket.com

To know more, have a quick  look at  http://datamarket.com/