So I wanted to find out how cheap the cloud really is, but I got confused by the 23 kinds of instances that Amazon has (http://aws.amazon.com/ec2/pricing/) and the 15 kinds of instances at https://developers.google.com/compute/pricing. Or whether there is any price collusion between them ;)
Now Amazon has spot pricing, so I can bid for prices as well (http://aws.amazon.com/ec2/purchasing-options/spot-instances/), and up to 60% off for reserved instances (http://aws.amazon.com/ec2/purchasing-options/reserved-instances/), but it charges $2 per hour for dedicated instances (which are not really dedicated, but pay as you go):
- $2 per hour – An additional fee is charged once per hour in which at least one Dedicated Instance of any type is running in a Region.
Google has sustained-use discounts (though it will not offer Windows on its cloud!).
The table below describes the discount at each usage level. These discounts apply for all instance types.
| Usage Level (% of month) | % of base rate charged on the increment | Example incremental rate (USD/hour) for an n1-standard-1 instance |
|---|---|---|
| 0%–25% | 100% | $0.07 |
| 25%–50% | 80% | $0.056 |
| 50%–75% | 60% | $0.042 |
| 75%–100% | 40% | $0.028 |
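To make the tiered discount concrete, here is a minimal Python sketch (my own illustration, not Google's billing code) that averages the tiers from the table above into an effective hourly rate. The tier widths and multipliers come straight from the table; everything else is assumed for illustration.

```python
# Sustained-use tiers: (width as fraction of the month, multiplier on base rate).
# Each successive 25% slice of the month is billed at a lower rate.
TIERS = [(0.25, 1.00), (0.25, 0.80), (0.25, 0.60), (0.25, 0.40)]

def effective_hourly_rate(base_rate, fraction_of_month_used):
    """Average hourly rate after sustained-use discounts.

    fraction_of_month_used: 0.0-1.0, the share of the month the instance ran.
    """
    remaining = fraction_of_month_used
    cost = 0.0
    for width, multiplier in TIERS:
        slice_used = min(remaining, width)
        cost += slice_used * base_rate * multiplier
        remaining -= slice_used
        if remaining <= 0:
            break
    # Hours cancel out, so cost/fraction is the average per-hour rate.
    return cost / fraction_of_month_used

# An n1-standard-1 ($0.07/hour base) running the full month averages
# 0.25 * (1.0 + 0.8 + 0.6 + 0.4) * 0.07 = $0.049/hour, i.e. a 30% discount.
print(round(effective_hourly_rate(0.07, 1.0), 4))
```

So a full-month n1-standard-1 effectively costs 70% of the sticker price, which is the number to use when comparing against Amazon's reserved-instance discounts.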
Anyway, I tried to create this simple table to help me with it. After all, hard disks are cheap; it is memory I want on the cloud! Or maybe I am wrong and the cloud is not so cheap. Or maybe it is just too complicated for someone to build a pricing calculator that can take in prices from all providers (Amazon, Azure, Google Compute) and show us the money!
| Instance | vCPU | RAM (GiB) | $ per hour | Type (Linux usage) | Provider | Notes |
|---|---|---|---|---|---|---|
| t2.micro | 1 | 1 | $0.01 | General Purpose – Current Generation | Amazon (North Virginia) | Amazon also has spot instances that can lower prices |
| t2.small | 1 | 2 | $0.03 | General Purpose – Current Generation | Amazon (North Virginia) | |
| t2.medium | 2 | 4 | $0.05 | General Purpose – Current Generation | Amazon (North Virginia) | |
| m3.medium | 1 | 3.75 | $0.07 | General Purpose – Current Generation | Amazon (North Virginia) | |
| m3.large | 2 | 7.5 | $0.14 | General Purpose – Current Generation | Amazon (North Virginia) | |
| m3.xlarge | 4 | 15 | $0.28 | General Purpose – Current Generation | Amazon (North Virginia) | |
| m3.2xlarge | 8 | 30 | $0.56 | General Purpose – Current Generation | Amazon (North Virginia) | |
| c3.large | 2 | 3.75 | $0.11 | Compute Optimized – Current Generation | Amazon (North Virginia) | |
| c3.xlarge | 4 | 7.5 | $0.21 | Compute Optimized – Current Generation | Amazon (North Virginia) | |
| c3.2xlarge | 8 | 15 | $0.42 | Compute Optimized – Current Generation | Amazon (North Virginia) | |
| c3.4xlarge | 16 | 30 | $0.84 | Compute Optimized – Current Generation | Amazon (North Virginia) | |
| c3.8xlarge | 32 | 60 | $1.68 | Compute Optimized – Current Generation | Amazon (North Virginia) | |
| g2.2xlarge | 8 | 15 | $0.65 | GPU Instances – Current Generation | Amazon (North Virginia) | |
| r3.large | 2 | 15 | $0.18 | Memory Optimized – Current Generation | Amazon (North Virginia) | |
| r3.xlarge | 4 | 30.5 | $0.35 | Memory Optimized – Current Generation | Amazon (North Virginia) | |
| r3.2xlarge | 8 | 61 | $0.70 | Memory Optimized – Current Generation | Amazon (North Virginia) | |
| r3.4xlarge | 16 | 122 | $1.40 | Memory Optimized – Current Generation | Amazon (North Virginia) | |
| r3.8xlarge | 32 | 244 | $2.80 | Memory Optimized – Current Generation | Amazon (North Virginia) | |
| i2.xlarge | 4 | 30.5 | $0.85 | Storage Optimized – Current Generation | Amazon (North Virginia) | |
| i2.2xlarge | 8 | 61 | $1.71 | Storage Optimized – Current Generation | Amazon (North Virginia) | |
| i2.4xlarge | 16 | 122 | $3.41 | Storage Optimized – Current Generation | Amazon (North Virginia) | |
| i2.8xlarge | 32 | 244 | $6.82 | Storage Optimized – Current Generation | Amazon (North Virginia) | |
| hs1.8xlarge | 16 | 117 | $4.60 | Storage Optimized – Current Generation | Amazon (North Virginia) | |
| n1-standard-1 | 1 | 3.75 | $0.07 | Standard | Google – US | Google charges per minute of usage (subject to a minimum of 10 minutes) |
| n1-standard-2 | 2 | 7.5 | $0.14 | Standard | Google – US | |
| n1-highmem-2 | 2 | 13 | $0.16 | High Memory | Google – US | |
| n1-highmem-4 | 4 | 26 | $0.33 | High Memory | Google – US | |
| n1-highmem-8 | 8 | 52 | $0.66 | High Memory | Google – US | |
| n1-highmem-16 | 16 | 104 | $1.31 | High Memory | Google – US | |
| n1-highcpu-2 | 2 | 1.8 | $0.09 | High CPU | Google – US | |
| n1-highcpu-4 | 4 | 3.6 | $0.18 | High CPU | Google – US | |
| n1-highcpu-8 | 8 | 7.2 | $0.35 | High CPU | Google – US | |
| n1-highcpu-16 | 16 | 14.4 | $0.70 | High CPU | Google – US | |
| f1-micro | 1 | 0.6 | $0.01 | Shared Core | Google – US | |
| g1-small | 1 | 1.7 | $0.04 | Shared Core | Google – US | |
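Since it is memory I want, here is a quick Python sketch of the comparison the table enables: ranking instances by dollars per GiB of RAM per hour. The figures are a few rows copied from the table above; this is my own back-of-the-envelope calculator, not any provider's tool.

```python
# (instance, provider, RAM in GiB, on-demand $/hour), copied from the table above.
instances = [
    ("m3.medium",     "Amazon", 3.75, 0.07),
    ("m3.2xlarge",    "Amazon", 30,   0.56),
    ("r3.8xlarge",    "Amazon", 244,  2.80),
    ("n1-standard-1", "Google", 3.75, 0.07),
    ("n1-highmem-16", "Google", 104,  1.31),
]

# Dollars per GiB of RAM per hour -- lower means cheaper memory.
ranked = sorted(instances, key=lambda row: row[3] / row[2])

for name, provider, ram, price in ranked:
    print(f"{name:15s} {provider:7s} ${price / ram:.4f} per GiB-hour")
```

On these numbers, Amazon's memory-optimized r3.8xlarge comes out cheapest per GiB (about $0.0115/GiB-hour), with Google's n1-highmem-16 close behind; the general-purpose instances all cluster around $0.0187/GiB-hour.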
Here is a brief tutorial I wrote by playing with the software at manage.windowsazure.com.
Here is an interview with Louis Bajuk-Yorgan from TIBCO. TIBCO, which was the leading commercial vendor of S-PLUS, the precursor of the R language, makes a commercial enterprise version of R called TIBCO Enterprise Runtime for R (TERR). Louis also presented recently at useR! 2014: http://user2014.stat.ucla.edu/abstracts/talks/54_Bajuk-Yorgan.pdf
DecisionStats (DS)- How is TERR different from Revolution Analytics or Oracle R? How is it similar?
DS- How much of R is TERR compatible with?
DS- Describe TIBCO Cloud Compute Grid. What are its applications for data science?
DS- What advantages does TIBCO's rich history with the S project give it for the R project?
Lou- Our 20+ years of experience with S-PLUS gave us a unique knowledge of the commercial applications of the S/R language, deep experience with architecting, extending and maintaining a commercial S language engine, strong ties to the R community, and a rich trove of algorithms we could apply to developing the TERR engine.
DS- Describe some benchmarks of TERR against open source R.
DS- TERR is not open source. Why is that?
DS- How is TIBCO as a company to work for, for potential data scientists?
DS- How is TIBCO giving back to the R community globally? What are its plans for the community?
DS- As a six-time attendee of useR!, describe the evolution of the R ecosystem as you have observed it.
Here is an interview with one of the greatest statisticians and educators of this generation, Prof Jan de Leeuw. In this exclusive and freewheeling interview, Prof de Leeuw talks about the evolution of technology, education and statistics, and generously shares nuggets of knowledge of interest to present and future statisticians.
DecisionStats (DS)- You have described the UCLA Dept of Statistics as your magnum opus. Name a couple of turning points in your career which helped in this creation.
Jan de Leeuw (JDL) -From about 1980 until 1987 I was head of the Department of Data Theory at Leiden University. Our work there produced a large number of dissertations which we published using our own publishing company. I also became president of the Psychometric Society in 1987. These developments resulted in an invitation from UCLA to apply for the position of director of the interdepartmental program in social statistics, with a joint appointment in Mathematics and Psychology. I accepted the offer in 1987. The program eventually morphed into the new Department of Statistics in 1998.
DS- Describe your work with Gifi software and nonlinear multivariate analysis.
JDL- I started working on NLMVA and MDS in 1968, while I was a graduate student researcher in the new Department of Data Theory. Joe Kruskal and Doug Carroll invited me to spend a year at Bell Labs in Murray Hill in 1973-1974. At that time I also started working with Forrest Young and his student Yoshio Takane. This led to the sequence of “alternating least squares” papers, mainly in Psychometrika. After I returned to Leiden we set up a group of young researchers, supported by NWO (the Dutch equivalent of the NSF) and by SPSS, to develop a series of Fortran programs for NLMVA and MDS.
In 1980 the group had grown to about 10-15 people, and we gave a successful postgraduate course on the “Gifi methods”, which eventually became the 1990 Gifi book. By the time I left Leiden most people in the group had gone on to do other things, although I continued to work in the area with some graduate students from Leiden and Los Angeles. Then around 2010 I worked with Patrick Mair, visiting scholar at UCLA, to produce the R packages smacof, anacor, homals, aspect, and isotone. Also see https://www.youtube.com/watch?v=u733Mf7jX24
JDL- I started in 1968 with PL/I. Card decks had to be flown to Paris to be compiled and executed on the IBM/360 mainframes. Around the same time APL came up and satisfied my personal development needs, although of course APL code was difficult to communicate. It was even difficult to understand your own code after a week. We had APL symbol balls on the Selectric typewriters and APL symbols on the character terminals. The basic model was there — you develop in an interpreted language (APL) and then for production you use a compiled language (FORTRAN). Over the years APL was replaced by XLISP and then by R. Fortran was largely replaced by C; I never switched to C++ or Java. We discouraged our students from using SAS or SPSS or MATLAB. UCLA Statistics promoted XLISP-STAT for quite a long time, but eventually we had to give it up. See http://www.stat.ucla.edu/~deleeuw/janspubs/2005/articles/deleeuw_A_05.pdf.
(In 1998 the UCLA Department of Statistics, which had been one of the major users of Lisp-Stat, and one of the main producers of Lisp-Stat code, decided to switch to S/R. This paper discusses why this decision was made, and what the pros and the cons were. )
Generally, there has never been a computational environment like R — so integrated with statistical practice and development, and so enormous, accessible and democratic. I must admit I personally still prefer to use R as originally intended: as a convenient wrapper around and command line interface for compiled libraries and code. But it is also great for rapid prototyping, and in that role it has changed the face of statistics.
The fact that you cannot really propose statistical computations without providing R references and in many cases R code has contributed a great deal to reproducibility and open access.
JDL- I really don’t know how to answer this. A life cannot be corrected, repeated, or relived.
DecisionStats (DS)- Describe your career journey from being a computer science student to one of the principal creators of RHadoop. What motivated you, what challenges did you overcome, and what were the turning points? (You have 3500+ citations. What are most of those citations regarding?)
DS- What do you think is the future of R as an enterprise and research software in terms of computing on mobile, desktop and cloud, and how do you see things evolve from here?
Here is an interview with Vivian Zhang, CTO and co-founder of SupStat, an interesting startup in the R ecosystem. In this interview Vivian talks about the startup journey, helping spread R in China and New York City, and managing meetups, conferences and a training business with balance and excellence.
DecisionStats (DS)- Describe the story behind the creation of SupStat Inc and the journey so far, along with milestones and turning points. What is your vision for SupStat, what do you want it to accomplish, and how?
Vivian Zhang(VZ) -
SupStat was born in 2012 out of the collaboration of 60+ individuals (statisticians, computer engineers, mathematicians, professors, graduate students and talented data geniuses) who met through a well-known non-profit organization in China, Capital of Statistics. The SupStat team met through various collaborations on R packages and analytic work. In 2013, SupStat became involved in the New York City data science community through hosting the NYC Open Data Meetup, and soon began offering formal courses through the NYC Data Science Academy. SupStat offers consulting services in the areas of R development, data visualization, and big data solutions. We are experienced with many technologies and languages including R, Python, Hadoop, Spark, Node.js, etc. Courses offered include Data Science with R (Beginner, Intermediate), Data Science with Python (Beginner, Intermediate), and Hadoop (Beginner, Intermediate), as well as many targeted courses on R packages and data visualization tools.
Allen and I, the two co-founders, have been passionate about Data Mining since a young age (we talked about it back in 1997). With industry experience as Chief Data scientist/Senior Analyst and a spirit of entrepreneurship, we started the firm by gathering all the top R/Hadoop/D3.js programmers we knew.
Milestones of SupStat:
- June 2012: Established in Beijing
- July 2012: Offered an intensive R bootcamp in Beijing to 50+ college students
- June 2013: Established in NYC
- Nov 2013: Launched our NYC training brand, NYC Data Science Academy
- Jan 2014: Became a premium partner of Revolution Analytics in China
- Feb 2014: Became a training and reseller partner of RStudio in the US and China
- April 2014: Became the exclusive reseller partner of Transwarp in the US; started to offer R built-in and professional services for Hadoop/Spark
- May 2014: Organized and sponsored the R conference in Beijing; NYC Open Data Meetup reached 1800+ members in one year
- Jun 2014: Sponsored the UCLA R conference (Vivian was a panelist for the female R programmer talk)
The major turning point was in November, 2013, when we decided to start our NYC office and launched the NYC Data Science Academy.
We are committed to helping our clients make distinctive, lasting and substantial improvement in their performance, sales, clients and employee satisfaction by fully utilizing data. We are a value-driven firm. For us this means:
Solving the hardest problems
Utilizing state-of-the-art data science to help our clients succeed
Applying a structured problem-solving approach where all options are considered, researched, and analyzed carefully before recommendations are made
Our Vision: Be a firm that attracts exceptional people to meet the growing demand for data analysis and visualization.
With engaged clients, we want to share the excitement, unlimited potential and methodologies of using data to create business value. We want to be the go-to firm when people think of getting data analytic training, consulting, and big data products.
With top data scientists, we want to be the home for those who want different data challenges all the time. We promote their open data/demo work in the community and expand the impact of the analytic tools and methodologies they developed. We connect the best ones to build the strongest team.
With new college students and young professionals, we want to help them succeed and be able to handle real world problems right away though our short-term, intensive training programs and internship programs. Through our rich experience, we have tailored our training program to solve some of the critical problems people face in their workplace.
Through our partnerships we want to spread the best technologies between the US and China. We want to close the gap and bring solutions and offerings to clients we serve. We are at the frontline to pick what is the best product for our clients.
We are glad we have the opportunity to do what we love and are good at, and will continue to enjoy doing it with a growing and energetic team.
DS -What is the state of open source statistical software in China? How have you contributed to R in China and how do you see the future of open source evolve there?
VZ- People love R and embrace R. In May 2014, we helped to organize and sponsor the R conference in Beijing, with 1400 attendees. See our blog post for more details: http://www.r-bloggers.com/the-7th-china-r-conference-in-beijing/
We have helped organize two R conferences in China in the past year, spring in Beijing and winter in Shanghai. And we will do a summer R conference in Guangzhou this year. That's three R conferences in one year!
DS- Describe some of your work with your partners in helping sell and support R in China and USA
VZ- Revolution Analytics and RStudio are very respected in the R community. We are glad to work and learn from them through collaboration.
With Revolution, we provide services to do proof-of-concept and professional services including analytics and visualization. We also sell Revolution products and licenses in China. With RStudio, we sell Rstudio Server Pro and Shiny and promote training programs around those products in NYC. We plan to sell these products in China starting this summer. With Transwarp, we offer the best R analytic and paralleling experience through the Hadoop/Spark ecosystem.
DS- You have done many free workshops in multiple cities. What has been the response so far.
VZ- Let us first talk about what happened in NYC.
I went to a few meetups before I started my own meetup group. Most of the presentations/talks were awesome, but they were not delivered and constructed in a way that attendees could learn and apply the technology right away. Most of the time, those events didn't offer source code or technical details in the slides.
When I started my own group, my goal was “whatever cool stuff we showed you, we will teach you how to do it.” The majority of the events were designed as hands-on workshops, while we hosted a few high-profile speakers' talks from time to time (including the chief data scientist for the Obama campaign).
My workshops cover a wide range of topics, including R, Python, Hadoop, D3.js, data processing, Tableau, location data query, open data, etc. People are super excited and keep saying “oh wow oh wow”, “never thought that I could do it”, ”it is pretty cool.” Soon our attendees started giving back to the group by teaching their skills and fun projects, offering free conference rooms, and sponsoring pizzas.
We are glad we have built a community of sharing experience and passion for data science. And I think this is a very unique thing we can do in NYC (due to the fact everything is close to half-hour subway distance). We host events 2-3 times per week and have attracted 1900 members in one year.
In other cities such as Shanghai and Beijing, we do free workshops for college students and scholars every month. We promise to go to colleges as far as 24 hours by train from Beijing. Through partnerships with Capital of Statistics and DataUnion, we hosted entrepreneur sharing events with invited speeches and lightning talks.
In NYC, we typically see 15 to 150 people per event. U.S. sponsors have included McKinsey & Company, Thoughtworks, and others. Our Beijing monthly tech event sees over 500 attendees and gains traction from event co-hosts including Baiyu, Youku and others.
DS- What are some interesting projects of Supstat that you would like to showcase.
VZ- Let me start with one interesting open data project on Citibike data done by our team. The blog post, slides and meetup videos can be found at http://nycdatascience.com/meetup/nyc-open-data-project-ii-my-citibike/
Citibike provides a public bike service. There are many bike stations in NYC. People want to take a bike from a station with at least one available bike. And when they get to the destination, they want to return their bike to a station with at least one available slot. Our goal was to predict where to rent and where to return Citibikes. We showed the complete workflow including data scraping, cleaning, manipulation, processing, modeling, and making algorithms into a product.
We built a program to scrape data and save it to our database automatically. Using this data we utilized models from time series theory and machine learning to predict bike numbers in all the stations. Based on the models, we built a website for the Citibike system. This application helps users of Citibike arrange their trips better. We also showed a few tricks such as how to set up a cron job on Linux, Windows and Mac machines, and how to get around RAM limitations on servers with PostgreSQL.
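The scrape-and-store step described above can be sketched in a few lines of Python. This is my own illustrative reconstruction, not SupStat's code: the field names (`stationName`, `availableBikes`, `availableDocks`) are assumptions about the feed schema, and SQLite stands in for the PostgreSQL database they mention.

```python
import json
import sqlite3
from datetime import datetime, timezone

# Hypothetical station record, shaped like a typical bike-share JSON feed;
# the real Citibike feed's field names may differ.
SAMPLE = '{"stationName": "W 52 St & 11 Ave", "availableBikes": 9, "availableDocks": 30}'

def parse_station(record):
    """Keep only the fields the prediction models need, plus a timestamp."""
    return (record["stationName"],
            record["availableBikes"],
            record["availableDocks"],
            datetime.now(timezone.utc).isoformat())

def save_snapshot(db_path, records):
    """Append one scrape of all stations; a cron job would call this every few minutes."""
    conn = sqlite3.connect(db_path)
    conn.execute("""CREATE TABLE IF NOT EXISTS snapshots
                    (station TEXT, bikes INTEGER, docks INTEGER, scraped_at TEXT)""")
    conn.executemany("INSERT INTO snapshots VALUES (?, ?, ?, ?)",
                     [parse_station(r) for r in records])
    conn.commit()
    conn.close()

# In-memory demo; in production you would pass a file path (or a PostgreSQL
# connection) and fetch the live feed over HTTP instead of using SAMPLE.
save_snapshot(":memory:", [json.loads(SAMPLE)])
```

Accumulating these timestamped snapshots is what gives the time-series models their training data; everything downstream (forecasting per-station bike counts, serving predictions on a website) builds on this table.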
We’ve done other projects in China using R to solve problems ranging from Telecommunications data caching to Cardiovascular disease prevention. Each of these projects has required a unique combination of statistical knowledge and data science tools, with R being the backbone of the solution. The commercial cases can be found at our website: http://www.supstat.com/consulting/
VIVIAN S. ZHANG Co-founder & CTO, NYC, Beijing and Shanghai Office
Vivian is a data scientist who has been devoted to the analytics industry and the development and use of data technologies for several years. She gained expertise in data analysis and data management using various statistical analytical tools and programming languages as a Senior Analyst and Biostatistician at Memorial Sloan-Kettering Cancer Center and as a Scientific Programmer at Brown University. She is the co-founder of SupStat, the NYC Data Science Academy, and the NYC Open Data Meetup. She likes to portray herself as a programmer, data-holic, and visualization evangelist.
You can read more about SupStat at http://www.supstat.com/team/
From naming the algorithm after himself (PageRank?) to forsaking his professors at Stanford (who legally own the rights to much of the intellectual property), to first learning under Eric Schmidt and then pushing him out on the pretense of a political appointment that never came, to the era of silent cooperation with the US Government, to collecting a lot of data by assessing the risk of litigation (especially mobile), to pushing intellectual property rights between open source and patent rights, to massive, expensive lobbying, and now even sidelining his brother in arms: Larry Page has emerged as the most ruthless combination of business savvy and formidable technological skill since Bill Gates.
He now owns a representative sample of nearly all the data on video (YouTube), email (Gmail), website analytics (Google Analytics), search (Google.com), advertising clicks (AdWords and AdSense), and a majority of mobile phones (Android).
And he wants more. To collect data from your thermostat. Your glasses. His government will not file an antitrust case because of national security. As an extension of US foreign policy, he will lead protests against Chinese hackers and censorship and even abandon the market rather than comply with Chinese law, but he will gladly pay fines and delete links to comply with European law.
There are ways to make money that are not evil. But they do not teach what is evil or not, at Stanford. Not even to dropouts.