Interview JJ Allaire Founder, RStudio

Here is an interview with JJ Allaire, founder of RStudio. RStudio is the IDE that has overtaken other IDEs within the R community in terms of ease of use. On the eve of their latest product launch, JJ talks to DecisionStats about RStudio and more.

Ajay-  So what is new in the latest version of RStudio and how exactly is it useful for people?

JJ- The initial release of RStudio as well as the two follow-up releases we did last year were focused on the core elements of using R: editing and running code, getting help, and managing files, history, workspaces, plots, and packages. In the meantime users have also been asking for some bigger features that would improve the overall work-flow of doing analysis with R. In this release (v0.95) we focused on three of these features:

Projects. R developers tend to have several (and often dozens) of working contexts associated with different clients, analyses, data sets, etc. RStudio projects make it easy to keep these contexts well separated (with distinct R sessions, working directories, environments, command histories, and active source documents), switch quickly between project contexts, and even work with multiple projects at once (using multiple running versions of RStudio).

Version Control. The benefits of using version control for collaboration are well known, but we also believe that solo data analysis can achieve significant productivity gains by using version control (this discussion on Stack Overflow talks about why). In this release we introduced integrated support for the two most popular open-source version control systems: Git and Subversion. This includes changelist management, file diffing, and browsing of project history, all right from within RStudio.

Code Navigation. When you look at how programmers work a surprisingly large amount of time is spent simply navigating from one context to another. Modern programming environments for general purpose languages like C++ and Java solve this problem using various forms of code navigation, and in this release we’ve brought these capabilities to R. The two main features here are the ability to type the name of any file or function in your project and go immediately to it; and the ability to navigate to the definition of any function under your cursor (including the definition of functions within packages) using a keystroke (F2) or mouse gesture (Ctrl+Click).

Ajay- What’s the product road map for RStudio? When can we expect the IDE to turn into a full fledged GUI?

JJ- Linus Torvalds has said that “Linux is evolution, not intelligent design.” RStudio tries to operate on a similar principle—the world of statistical computing is too deep, diverse, and ever-changing for any one person or vendor to map out in advance what is most important. So, our internal process is to ship a new release every few months, listen to what people are doing with the product (and hope to do with it), and then start from scratch again making the improvements that are considered most important.

Right now some of the things which seem to be top of mind for users are improved support for authoring and reproducible research, various editor enhancements including code folding, and debugging tools.

What you’ll see us do in a given release is work on a combination of frequently requested features, smaller improvements to usability and work-flow, bug fixes, and finally architectural changes required to support current or future feature requirements.

While we do try to base what we work on as closely as possible on direct user-feedback, we also adhere to some core principles concerning the overall philosophy and direction of the product. So for example the answer to the question about the IDE turning into a full-fledged GUI is: never. We believe that textual representations of computations provide fundamental advantages in transparency, reproducibility, collaboration, and re-usability. We believe that writing code is simply the right way to do complex technical work, so we’ll always look for ways to make coding better, faster, and easier rather than try to eliminate coding altogether.

Ajay -Describe your journey in science from a high school student to your present work in R. I noticed you have been very successful in making software products that have been mostly proprietary products or sold to companies.

Why did you get into open source products with RStudio? What are your plans for monetizing RStudio further down the line?

JJ- In high school and college my principal areas of study were Political Science and Economics. I also had a very strong parallel interest in both computing and quantitative analysis. My first job out of college was as a financial analyst at a government agency. The tools I used in that job were SAS and Excel. I had a dim notion that there must be a better way to marry computation and data analysis than those tools, but of course no concept of what this would look like.

From there I went more in the direction of general purpose computing, starting a couple of companies where I worked principally on programming languages and authoring tools for the Web. These companies produced proprietary software, which at the time (between 1995 and 2005) was a workable model because it allowed us to build the revenue required to fund development and to promote and distribute the software to a wider audience.

By 2005, however, it was becoming clear that proprietary software would ultimately be overtaken by open source software in nearly all domains. The cost of development had shrunk dramatically thanks to both the availability of high-quality open source languages and tools and the scale of global collaboration possible on open source projects. The cost of promoting and distributing software had also collapsed thanks to the efficiency of both distribution and information diffusion on the Web.

When I heard about R and learned more about it, I became very excited and inspired by what the project had accomplished. A group of extremely talented and dedicated users had created the software they needed for their work and then shared the fruits of that work with everyone. R was a platform that everyone could rally around because it worked so well, was extensible in all the right ways, and most importantly was free (as in speech) so users could depend upon it as a long-term foundation for their work.

So I started RStudio with the aim of making useful contributions to the R community. We started with building an IDE because a first-rate development environment for R that was both powerful and easy to use seemed like an unmet need. Being aware that many other companies had built successful businesses around open-source software, we were also convinced that we could make RStudio available under a free and open-source license (the AGPLv3) while still creating a viable business. At this point RStudio is exclusively focused on creating the best IDE for R that we can. As the core product gets where it needs to be over the next couple of years we’ll then also begin to sell other products and services related to R and RStudio.

About-

http://rstudio.org/docs/about

JJ Allaire

JJ Allaire is a software engineer and entrepreneur who has created a wide variety of products including ColdFusion, Windows Live Writer, Lose It!, and RStudio.

From http://en.wikipedia.org/wiki/Joseph_J._Allaire
In 1995 Joseph J. (JJ) Allaire co-founded Allaire Corporation with his brother Jeremy Allaire, creating the web development tool ColdFusion.[1] In March 2001, Allaire was sold to Macromedia where ColdFusion was integrated into the Macromedia MX product line. Macromedia was subsequently acquired by Adobe Systems, which continues to develop and market ColdFusion.
After the sale of his company, Allaire became frustrated at the difficulty of keeping track of research he was doing using Google. To address this problem, he co-founded Onfolio in 2004 with Adam Berrey, former Allaire co-founder and VP of Marketing at Macromedia.
On March 8, 2006, Onfolio was acquired by Microsoft where many of the features of the original product are being incorporated into the Windows Live Toolbar. On August 13, 2006, Microsoft released the public beta of a new desktop blogging client called Windows Live Writer that was created by Allaire’s team at Microsoft.
Starting in 2009, Allaire has been developing a web-based interface to the widely used R technical computing environment. A beta version of RStudio was publicly released on February 28, 2011.
JJ Allaire received his B.A. from Macalester College (St. Paul, MN) in 1991.

RStudio-

RStudio is an integrated development environment (IDE) for R which works with the standard version of R available from CRAN. Like R, RStudio is available under a free software license. RStudio is designed to be as straightforward and intuitive as possible to provide a friendly environment for new and experienced R users alike. RStudio is also a company, and they plan to sell services (support, training, consulting, hosting) related to the open-source software they distribute.

SAS Institute Financials 2011

SAS Institute has released its financials for 2011 at http://www.sas.com/news/preleases/2011financials.html:

Revenue surged across all solution and industry categories. Software to detect fraud saw a triple-digit jump. Revenue from on-demand solutions grew almost 50 percent. Growth from analytics and information management solutions were double digit, as were gains from customer intelligence, retail, risk and supply chain solutions

AJAY- and as a private company it is quite nice that they are willing to share so much information every year.

The graphics are nice (and the colors much better than in 2010), but pie charts? Seriously, dude, there is no way to compare how much SAS revenue is shifting across geographies or even across industries. So my two cents: lose the pie charts, and stick to line graphs, please, for the share of revenue by country/industry.

In 2011, SAS grew staff 9.2 percent and reinvested 24 percent of revenue into research and development

AJAY- So that means 654 million dollars spent on research and development. I wonder if SAS has considered investing in much smaller startups (rather than its traditional strategy of doing all research in-house and completely acquiring a smaller company).

Even a small investment of, say, 5-10 million USD in open source, or even in PhD-level research projects, could greatly increase the ROI on that.
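The 654 million figure only follows from the 24 percent share once total revenue is known, and the excerpt quoted here does not state it. As a rough check, the two quoted numbers can be inverted to see what revenue they imply, and the suggested 5-10 million USD can be put in proportion to the R&D budget. The variable names are illustrative, and the revenue figure is derived from the post's own numbers rather than quoted from SAS.

```python
# Figures quoted in the post: 24 percent of revenue reinvested in R&D,
# which the author equates to 654 million dollars.
rd_share = 0.24
rd_spend_musd = 654.0

# Implied total revenue in millions of USD (derived here, not quoted from SAS).
implied_revenue_musd = rd_spend_musd / rd_share

# The 5-10 million USD investment suggested above, as a share of R&D spend.
small_bets_musd = (5.0, 10.0)
shares = [bet / rd_spend_musd for bet in small_bets_musd]

print(f"Implied 2011 revenue: about {implied_revenue_musd:,.0f} million USD")
for bet, share in zip(small_bets_musd, shares):
    print(f"{bet:.0f} million USD is {share:.1%} of R&D spend")
```

On these numbers, the suggested outside investment would be a small slice of the annual R&D budget, which is the point of the comment above.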

That means

Analyzing a private company’s financials is much more fun than analyzing a public company’s, and I remember the words of my finance professor (“dig, dig”), so let us compare the 2011 results with the 2010 results.

http://www.sas.com/news/preleases/2010financials.html

The percentage invested in R&D is exactly the same (24%), and the percentages of revenue earned from each geography are exactly the same. So even though revenue growth increased from 5.2% to 9% in 2011, both the geographic spread of revenues and the share of R&D costs remained EXACTLY the same.

The Americas accounted for 46 percent of total revenue; Europe, Middle East and Africa (EMEA) 42 percent; and Asia Pacific 12 percent.

Overall, I think SAS retains a 35% market share (despite all that noise from IBM, SAS clones, and open source) because it is good at providing solutions customized for industries (instead of just software products), because the market for analytics is not saturated (it seems to be growing faster than 12%, or is it?), and because of its ability to attract and retain the best analytical talent (which, in a non-American tradition for a software company, means no stock options, job security, and great benefits; SAS remains almost Japanese in its HR practices).

In 2010, SAS grew staff by 2.4 percent, in 2011 SAS grew staff by 9 percent.

But I liked the directional statement made here, and I think that design interfaces and algorithmic and computational efficiencies should increase analytical time (time to think about the business) and reduce data management time further!

“What would you do with the extra time if your code ran in two minutes instead of five hours?” Goodnight challenged.

PMML Augustus

Here is a new-old open source system for building and scoring statistical models, designed to work with data sets that are too large to fit into memory.

http://code.google.com/p/augustus/

Augustus is an open source software toolkit for building and scoring statistical models. It is written in Python and its most distinctive features are:
• Ability to be used on sets of big data; these are data sets that exceed either memory capacity or disk capacity, so that existing solutions like R or SAS cannot be used. Augustus is also perfectly capable of handling problems that can fit on one computer.
• PMML compliance and the ability to both:
– produce models with PMML-compliant formats (saved with extension .pmml).
– consume models from files with the PMML format.
Augustus has been tested and deployed on several operating systems. It is intended for developers who work in the financial or insurance industry, information technology, or in the science and research communities.

Usage

Augustus produces and consumes Baseline, Cluster, Tree, and Ruleset models. Currently, it uses an event-based approach to building Tree, Cluster and Ruleset models that is non-standard.

New to PMML ?

Read on http://code.google.com/p/augustus/wiki/PMML

The Predictive Model Markup Language, or PMML, is a vendor-driven XML markup language for specifying statistical and data mining models. In other words, it is an XML language so that ...
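To make the idea of an XML language for models concrete, here is a minimal sketch that reads a tiny PMML-style document using only Python's standard library. It does not use Augustus. The document string, the field names, and the `describe_pmml` helper are hypothetical, and the XML is trimmed to a handful of elements, so treat it as an illustration of the format rather than a complete or validated PMML file.

```python
import xml.etree.ElementTree as ET

# A toy, hand-written PMML-style document (hypothetical model and fields).
PMML_TEXT = """<PMML version="4.1" xmlns="http://www.dmg.org/PMML-4_1">
  <Header description="Hypothetical toy model"/>
  <DataDictionary numberOfFields="2">
    <DataField name="income" optype="continuous" dataType="double"/>
    <DataField name="risk" optype="categorical" dataType="string"/>
  </DataDictionary>
  <TreeModel modelName="toy_tree" functionName="classification">
    <MiningSchema>
      <MiningField name="income"/>
      <MiningField name="risk" usageType="predicted"/>
    </MiningSchema>
    <Node score="low">
      <True/>
      <Node score="high">
        <SimplePredicate field="income" operator="lessThan" value="20000"/>
      </Node>
    </Node>
  </TreeModel>
</PMML>
"""

# Namespace prefix used for lookups; it matches the xmlns declared above.
NS = {"pmml": "http://www.dmg.org/PMML-4_1"}


def describe_pmml(text):
    """Summarize a PMML string: version, declared fields, and model elements."""
    root = ET.fromstring(text)
    fields = [
        field.get("name")
        for field in root.findall("pmml:DataDictionary/pmml:DataField", NS)
    ]
    # Top-level children whose tag ends in "Model" (e.g. TreeModel) are models.
    models = [
        child.tag.split("}", 1)[1]
        for child in root
        if child.tag.endswith("Model")
    ]
    return {"version": root.get("version"), "fields": fields, "models": models}


summary = describe_pmml(PMML_TEXT)
print(summary)
```

Because the model is plain XML, any tool that can parse XML can at least inspect it, which is what makes the format useful for moving models between vendors' systems.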

Using Opera Unite to defeat SOPA?

Let's assume that the big bad world of American electoral politics forces some kind of modified SOPA to be passed, and the big American companies have to abide by that law (just as they share data for national security under the Patriot Act, but quietly).

I believe Opera Unite is the way forward for sharing content on the Internet.

From-

http://dev.opera.com/articles/view/opera-unite-developer-primer-revisited/

Opera Unite features a Web server running inside the Opera browser, which allows you to do some amazing things. At the touch of a button, you can share images, documents, video, music, games, collaborative applications and all manner of other things with your friends and colleagues

I can share music and files, and the web server is actually my own laptop. Try beating 2 billion new web servers that sprout! File system sharing is totally secure: you can create private, public, or password-protected files; there is a messaging system that can be used for drop messages (called Fridge) and a secure messaging system; and your own web server is ready to start at a click. The open web may just use Opera instead of Chromium, and US regulation would be solely to blame. Even URL blocking is of limited appeal thanks to software like MafiaWire Extension.

Throw in ad blocking, embedded BitTorrent sharing, and some more Tor-level encryption within the browser and, sorry, Senator, but the internet belongs to the planet, not to your lobbyist.

see-http://dev.opera.com/web

Some Ways Anonymous Could Disrupt the Internet if SOPA is passed

This is a piece of science fiction. I wrote it while reading Isaac Asimov’s advice to writers in Gold, while on a beach in Anjuna.

1) Identify senators, lobbyists, senior executives of companies advocating for SOPA. Go for selective targeting of these people than massive Denial of Service Attacks.

This could also include election fund raising websites in the United States.

2) Create hacking tools with simple interfaces to probe commonly known software errors, to enable wider audience including the Occupy Movement students to participate in hacking. thus making hacking more democratic. What are the top 25 errors as per  http://cwe.mitre.org/cwss/

http://www.decisionstats.com/top-25-most-dangerous-software-errors/ ?

 

Easy interface tools to check vulnerabilities would be the next generation to flooding tools like HOIC, LOIC – Massive DDoS attacks make good press coverage but not so good technically

3) Disrupt digital payment mechanisms for selected targets (in step1) using tools developed in Step 2, and introduce random noise errors in payment transfers.

4) Help create a better secure internet by embedding Tor within Chromium with all tools for anonymity embedded for easy usage – a more secure peer to peer browser (like a mashup of Opera , tor and chromium).

or maybe embed bit torrents within a browser.

5) Disrupt media companies and cloud computing based companies like iTunes, Spotify or Google Music, just as viruses and antivirus software disrupted the desktop model of computing. After that, offer solutions to the problems, as antivirus software companies did for decades.

6) Hacking websites is fine fun, but hacking internet databases and massively parallel data scrapers can help disrupt some of the status quo.

This applies to databases that offer data for sale, like credit bureaus etc. Making this kind of data public will eliminate data middlemen.

7) Use cross border, cross country regulatory arbitrage for better risk control of hacker attacks.

8) recruiting among universities using easy to use hacking tools to expand the pool of dedicated hacker armies.

9) using operations like those targeting child pornography to increase political acceptability of the hacker sub culture. Refrain from overtly negative and unimaginative bad Press Relations

10) If you can't convince them to pass SOPA, confuse them 😉 Use bots for random clicks on ads to confuse internet commerce.

 

Quantitative Modeling for Arbitrage Positions in Ad KeyWords Internet Marketing

Assume you treat an ad keyword as an equity stock. There are slight differences in the cost of advertising for that keyword across various locations (Zurich vs. Delhi) and various channels (Facebook vs. Google). You earn revenue if your website ranks naturally in organic search for the keyword, and you pay costs for getting traffic to your website for that keyword.

An arbitrage position is defined as a riskless profit when the cost of a keyword is less than the revenue from that keyword. We primarily take AdSense and AdWords as examples.
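
The definition above can be sketched in a few lines of code. This is only an illustration under simple assumptions: every price is made up, `KeywordQuote` and `arbitrage_positions` are hypothetical names, cost and revenue are both expressed per visitor, and the risk that real prices move between buying and selling is ignored.

```python
from dataclasses import dataclass


@dataclass
class KeywordQuote:
    """One hypothetical price observation for a keyword."""
    keyword: str
    location: str
    channel: str
    cost_per_visitor: float      # what you pay to bring in one visitor
    revenue_per_visitor: float   # what that visitor earns you in ad revenue

    @property
    def spread(self) -> float:
        # Positive spread means revenue exceeds cost for this quote.
        return self.revenue_per_visitor - self.cost_per_visitor


def arbitrage_positions(quotes):
    """Return the quotes where cost is below revenue, widest spread first."""
    profitable = [q for q in quotes if q.spread > 0]
    return sorted(profitable, key=lambda q: q.spread, reverse=True)


# Made-up numbers for one keyword across locations and channels.
quotes = [
    KeywordQuote("business intelligence software", "Zurich", "Google", 4.00, 3.20),
    KeywordQuote("business intelligence software", "Delhi", "Google", 0.90, 1.40),
    KeywordQuote("business intelligence software", "Delhi", "Facebook", 0.60, 1.40),
]

positions = arbitrage_positions(quotes)
for q in positions:
    print(f"{q.location}/{q.channel}: spread {q.spread:.2f} per visitor")
```

Sorting by spread is just one design choice; a real comparison would also have to weigh traffic volume, since a wide spread on a handful of visitors may matter less than a narrow spread on many.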
There are primarily two types of economic curves on which the commerce of the internet rests:
1) Cost Curve: cost of advertising to drive traffic to the website (Google AdWords, Twitter Ads, Facebook, LinkedIn ads)
2) Revenue Curve: revenue from ads clicked by the incoming traffic on the website (like AdSense, LinkAds, banner ads, ad sharing programs, in-game ads)
The cost and revenue curves depend primarily on two things:
1) Type of keyword, which in turn depends on
a) location of the prospective customer, and
b) net present value of the good or service to be eventually purchased
For example, a keyword targeting sales of enterprise “business intelligence software” should ideally cost X times as much as a keyword for “flower shop for birthdays”, where X is the expected payoff from a sale of business intelligence software divided by the expected payoff from a sale of flowers (say in Daytona Beach, Florida or Austin, Texas).
2) Traffic volume, which in turn depends on the time series and its
a) seasonality: the annual shopping cycle
b) cyclicality: macroeconomic shifts in the time series
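
The multiple X from the keyword example above can be worked through with a small calculation. The cash flows and the discount rate below are invented purely for illustration, and `net_present_value` and `keyword_cost_multiple` are hypothetical helper names; the only idea taken from the text is that X is the ratio of the two expected payoffs.

```python
def net_present_value(cash_flows, discount_rate):
    """Discount yearly cash flows (year 0 first) back to today's value."""
    return sum(
        cash / (1 + discount_rate) ** year
        for year, cash in enumerate(cash_flows)
    )


def keyword_cost_multiple(payoff_a, payoff_b):
    """Return X: how many times more keyword A should ideally cost than keyword B."""
    if payoff_b <= 0:
        raise ValueError("payoff_b must be positive")
    return payoff_a / payoff_b


# Made-up payoffs: a software sale with two years of renewals vs. one bouquet.
bi_software_npv = net_present_value([50_000, 10_000, 10_000], discount_rate=0.10)
flowers_npv = net_present_value([40], discount_rate=0.10)

x = keyword_cost_multiple(bi_software_npv, flowers_npv)
print(f"Ideal cost multiple X is roughly {x:,.0f}")
```

Location would enter this sketch through the payoffs themselves, since the same keyword can carry a different expected sale value in different cities.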
The cost and revenue curves are not linear. Ideally they should be continuous, following an exponential or polynomial form, but in reality they may have sharp inflections due to location, time, and web traffic volume thresholds.
Type of keyword: for example, keywords targeting sales of Eminem albums may shoot up in a non-linear manner after the musician dies.
The third, and less publicly known, component of both the cost and revenue curves is internet industry dynamics, including the relative market share of internet advertising platforms and the percentage splits between content creators and ad-providing platforms.
For example, based on internet advertising spend, people believe that internet advertising is currently heading for a duopoly in which Google and Facebook are the top two players, while Microsoft/Skype/Yahoo and LinkedIn/Twitter offer niche options but primarily depend on price setting from Google/Bing/Facebook.
It is difficult to quantify the elasticity and efficiency of market curves, as most literature and research on this is by in-house corporate teams, or by advisors, mentors, or consultants to the primary leaders, in a kind of incestuous fraternal hold on public academic research on this topic.
It is recommended that:
1) a balance be found between the need for corporate secrecy to protect shareholder/stakeholder value maximization and the need for data liberation to drive innovation and grow the internet ad pie faster;
2) cost and revenue curves across different keywords, times, locations, and service providers be studied by quants to hedge internet ad inventory and/or choose arbitrage positions. This kind of analysis is done for groups of stocks and commodities in the financial world, but as commerce grows on the internet it may need more specific and independent quants;
3) attention be paid to how cost and revenue curves mature with the level of sophistication of the underlying economy: Brazil, Russia, China, Korea, the US, and Sweden may be at different stages of internet ad market evolution.
For example:
A study of cost and revenue curves for certain keywords across domains, ad providers, and locations from 2003 to 2008 can help academia and research (much more than non-quantitative reports such as top-ten lists of popular terms) while ensuring that current algorithmic weightings are not inadvertently given away.
Part 2 of this series will explore ways to create third-party re-sellers of keywords and to measure the impact of search and ad engine optimization based on keywords.

2011 Analytics Recap

Events in the field of data that impacted us in 2011

1) Oracle unveiled plans for R Enterprise. This is one of the strongest statements of its focus on in-database analytics. Oracle also unveiled plans for a Public Cloud.

2) SAS Institute released version 9.3 of SAS, a major analytics software package in industry use.

3) IBM acquired many companies in analytics and high tech. Again. However, the expected benefits from Cognos-SPSS integration are yet to show a spectacular change in market share.

2011 Selected acquisitions

Emptoris Inc., December 2011
Cúram Software Ltd., December 2011
DemandTec, December 2011
Platform Computing, October 2011
Q1 Labs, October 2011
Algorithmics, September 2011
i2, August 2011
Tririga, March 2011

 

4) SAP promised a lot with SAP HANA, but again there were no major oohs and aahs in terms of market share fluctuations within analytics.

http://www.sap.com/india/news-reader/index.epx?articleID=17619

5) Amazon continued to lower prices of cloud computing and offer more options.

http://aws.amazon.com/about-aws/whats-new/2011/12/21/amazon-elastic-mapreduce-announces-support-for-cc2-8xlarge-instances/

6) Google continues to dilly-dally with its analytics and cloud-based APIs. I do not expect all the APIs in the Google APIs suite to survive and be viable in the enterprise software space. This includes Google Cloud Storage, Cloud SQL, and the Prediction API (see https://code.google.com/apis/console/b/0/). Some of the location-based and translation-based APIs may have interesting spin-offs that may be very, very commercially lucrative.

7) Microsoft did... hmm, I forgot. Except for its investment in Revolution Analytics round 1 many seasons ago, very little excitement has come from MS plans in data mining. The plugins for cloud-based data mining from Excel remain promising, while Azure remains a stealth-mode starter.

8) Revolution Analytics promised us a GUI and didn't deliver (yet 🙂). But it did reveal much better enterprise software: Revolution R 5.0 is one of the strongest enterprise software packages in the R/statistical computing space, and R’s memory handling problem is now more an issue of perception than of substance, thanks to newer advances in how it is used.

9) More conferences, more books and more news on analytics startups in 2011. Big Data analytics remained a strong buzzword. Expect more from this space including creative uses of Hadoop based infrastructure.

10) Data privacy issues continue to hamper and impede effective analytics usage. So does the lack of rational and balanced regulation in some of the most advanced economies. We expect more regulation and better guidelines in 2012.