The making of an R startup, Part 1 #rstats

Note- Decisionstats.com has done almost 105 interviews in the field of analytics, technology startups and thought leaders (you can see them here http://goo.gl/m3l31). We have covered some R authors (R for SAS and SPSS Users, Data Mining using R, Machine Learning for Hackers) and noted R package creators (ggplot2, RCommander, rattle GUI, forecast).

But what we truly enjoy is interviews with startups in the R ecosystem, including the founders of Revolution Analytics, Inference for R, RStudio and Cloudnumbers.

The latest startup in the R ecosystem with a promising product is Rapporter.net. It has actually been around for some time, but with the launch of their new product we asked them about the trials and tribulations of creating an open source startup in the data science field.

This is part 1 of the interview with Gergely Daróczi, co-founder of the Rapporter project.

Ajay- Describe the journey of Rapporter till now, and your product plans for 2013.

Greg- The idea of Rapporter presented itself more than 3 years ago, while we were giving statistics, SPSS and R courses at different Hungarian universities and, at the same time, creating custom statistical reports for a number of companies for a living.
Long story short, the three Hungarian co-founders faced similar problems in both sectors: students, just like business clients, admired the capabilities of R and the wide variety of tools found on CRAN, but were not eager at all to learn how to use them.
So we tried to work out how to let non-R users also build on the resources of R, and we came up with the idea of an intuitive web interface as an R front-end.

The real development of a helper R package (which later became “rapport”) was started in January 2011 by Aleksandar Blagotić and me, in our spare time and rather just for fun, as we had a dream about using “annotated statistical templates” in R after a few conversations on StackOverflow. We also worked on a front-end in the form of an Rserve-driven PHP engine with MySQL – to be dropped and completely rewritten later after some trying experiences and serious benchmarking.

We released the “rapport” package to the public at the end of 2011 on GitHub, and a few weeks later on CRAN too. Although we did our best to create decent documentation and some live examples, we somehow forgot to spread the news of the new package to the R community, so “rapport” did not attract any serious attention.

Even so, our enthusiasm for annotated R “templates” did not wane as time passed, so we continued to work on “rapport” by adding new features, and Aleksandar also started to fortify his Ruby on Rails skills. We also dropped Rserve with the MySQL back-end, and introduced Jeffrey Horner’s awesome RApache with some NoSQL databases.
To be honest, this change resulted in a one-year delay in releasing Rapporter and no end of headaches on our side, but in the long run it was a really smart move, as we now own an easily scalable and highly available cluster of servers.

But back to 2012.

As “rapport” got too complex over time with the newly added features, Aleksandar and I decided to split the package, a move that gave birth to “pander”. At that time “knitr” was getting more and more familiar among R users, so it was a brave move to release “another” similar package, but the roots of “pander” were more than one year old, we used some custom methods not available in “knitr” (like capturing the R object beside the printed output of chunks), we needed tweakable global options instead of chunk options, and we really wanted to build on the power of Pandoc – just like before.

So we had one package for converting R objects to Pandoc’s markdown with a general S3 method, and another package to run that automatically and also capture plots and images in a brew-like document with various output formats – like pdf, docx, odt etc.
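To illustrate what “a general S3 method” for converting R objects to markdown can look like, here is a minimal sketch. The generic name `to_md` and its methods are hypothetical and written only for this example; they are not the code of “pander”, which is not shown in the interview.

```r
# Hypothetical generic for illustration only; not pander's actual code.
to_md <- function(x, ...) UseMethod("to_md")

# Fallback: collapse any other object to a single line of text.
to_md.default <- function(x, ...) paste(format(x), collapse = ", ")

# Logical values are spelled out as words.
to_md.logical <- function(x, ...) paste(ifelse(x, "yes", "no"), collapse = ", ")

# Data frames become a pipe table, one character string per line.
to_md.data.frame <- function(x, ...) {
  header <- paste0("| ", paste(names(x), collapse = " | "), " |")
  rule <- paste0("|", paste(rep("---", ncol(x)), collapse = "|"), "|")
  cells <- as.matrix(format(x))  # every cell as text
  rows <- apply(cells, 1, function(r) {
    paste0("| ", paste(trimws(r), collapse = " | "), " |")
  })
  c(header, rule, unname(rows))
}

# The same call dispatches on the class of its argument.
cat(to_md(c(TRUE, FALSE)), "\n")
cat(to_md(data.frame(a = 1:2, b = c("x", "y"))), sep = "\n")
```

The point of the design is that the calling code never has to know what kind of object it received: adding support for a new class only means adding one more method.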
In the summer, while Aleksandar dealt with the web interface, I worked on some new features in our packages:
• automatic and robust caching of chunks with various options for performance reasons,
• automatically unifying “base”, “lattice” and “ggplot2” images to the same style with user options – like major/minor grid color, font family, color palette, margins etc.
• adding other global options to “pander”, to let our expected clients later personalize their custom report style with a few clicks.

At the same time, we were looking for ways to prevent malicious code from running in the parallel R sessions, which might compromise all our users’ sensitive data. Unfortunately no full-blown solution existed at that time, and we really wanted to steer clear of running Java-based interpreters in our network.
So I started to create a parser for R commands, which was supposed to filter out malicious R commands before evaluation. A bout of flu gave me some spare time to implement “sandboxR” with an open and live “hack my R server” demo, which turned into a great challenge on my side, but proved to really work after all.
I also had a few conversations with Jeroen Ooms (the author of the awesome OpenCPU), who faced similar problems on his servers and was eager to prevent the issues with the help of AppArmor. The great news of “RAppArmor” did not make “sandboxR” needless (as AppArmor just cannot regulate inner R calls), but we started to evaluate all user-specified R commands in a separate hat, which allowed me to make “sandboxR” more permissive about blacklisted functions.
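The idea of filtering commands before evaluation can be sketched in a few lines of base R. This is only an illustration under assumptions: the blacklist and the function names `find_forbidden` and `safe_eval` are hypothetical, and the real “sandboxR” parser is not described in the interview.

```r
# Hypothetical blacklist; the real sandboxR list is not given in the interview.
blacklist <- c("system", "system2", "unlink", "file.remove", "eval", "get")

# Return the blacklisted names that appear in a command string.
find_forbidden <- function(cmd, forbidden = blacklist) {
  parsed <- parse(text = cmd)  # parse only, nothing is evaluated here
  symbols <- all.names(parsed, functions = TRUE, unique = TRUE)
  intersect(symbols, forbidden)
}

# Evaluate the command only if no blacklisted name appears in it.
safe_eval <- function(cmd, forbidden = blacklist) {
  hits <- find_forbidden(cmd, forbidden)
  if (length(hits) > 0) {
    stop("forbidden call(s): ", paste(hits, collapse = ", "))
  }
  eval(parse(text = cmd), envir = new.env())
}

safe_eval("mean(c(1, 2, 3))")    # allowed, the command is evaluated
find_forbidden("system('ls')")   # reports "system" without running it
```

A purely name-based check like this is easy to get around, for example by passing a function name as a string to `do.call`, so it is best seen as one layer next to operating-system level confinement such as AppArmor rather than as a complete sandbox.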
In the middle of the summer, I realized that we had an almost working web application with any number of R workers able to serve tons of users thanks to the flexible NoSQL database back-ends, but we had no legal background to release such a service, nor did I have any solid financial background to found one; moreover, the Rapporter project had already taken a huge amount from my family budget.

As I was against letting venture capital dominate the project, and did not find any accelerator that would take on a project with a maturing, almost market-ready product, a few associates and I decided to found a UK company on our own, having confidence in the future and God.

So we founded Easystats Ltd, the company running rapporter.net, in July, and decided to release the first beta and pretty stable version of the application to the public at the end of September. At that time users could:
• upload and use text or SPSS sav data sets,
• specify more than 20 global options to be applied to all generated reports (like plot themes, table width, date format, decimal mark and number of digits, separators and copula in vectors etc.),
• create reports with the help of predefined statistical “templates”,
• “fork” (clone) any of our templates and modify without restriction, or create new statistical templates from scratch,
• edit the body or remove any part of the reports, resize images with the mouse or even with a finger on touch devices,
• and export reports to pdf, odt or docx formats.

A number of new features have been introduced since then:
• OpenBUGS integration with more permissive security profiles,
• custom styles for the exported documents (in LaTeX, docx and odt format) to generate unique and possibly branded reports,
• sharing public or even private reports with anyone through a simple hyperlink, without the need to register on rapporter.net,
• and embedding templates in any homepage, blog post or even HTML mail, so that anyone can use the power of R with a few clicks, building on the knowledge of template authors and our reliable back-end.

Although 2 years ago I was pretty sure that this job would be finished in a few months and that we would possibly have a successful project in a year or two, now I am certain that a bunch of new features will make Rapporter more and more user-friendly, intuitive and extensible in the next few years.
Currently, we are working hard on a redesigned GUI, at last with the help of a dedicated UX team (a really important structural change in the life of Rapporter, as we can now really assign and split tasks just like we dreamed of when the project was a two-man show), which is to be finished no later than the first quarter of the year. Besides design issues, this change would also result in some new features, like ordering the templates, data sets and reports by popularity, rating or relevance for the currently active data set, and letting users alter the style of the resulting reports in a more seamless way.

The next planned tasks for 2013 include:
• a “data transformation” front-end, which would let users rename and label variables in any uploaded data set, specify the level of measurement, recode/categorize or create new variables with the help of existing ones and/or any R functions,
• editing tables in reports on the fly (changing the decimal mark, highlighting some elements, renaming columns and splitting tables across multiple pages with a simple click),
• a more robust API to let third-party users temporarily upload data to be used in the analysis,
• option to use multiple data sets in a template and to let users merge or connect data online,
• and some top-secret surprises.

Besides the above tasks, which we came up with ourselves, our team is really interested in any feedback from the users, which might change the above order or add new tasks with higher priority, so be sure to add your two cents on our support page.

We will also have to come up with some account plans with reasonable pricing in 2013 for the hosted service, to let us cover the server fees and development expenses. But of course Rapporter will remain free forever for users with basic needs (like analyzing data sets with only a few hundred cases) and for anyone in the academic sector, and we also plan to provide an option to run Rapporter “off-site” on any Unix-like environment.

Ajay- What are some of the Big Data use cases I can do with Rapporter?

Greg- Although we released the Rapporter beta only a few months ago, we have already heard some pretty promising use cases from our (potential) clients.

But I must emphasize that, at first, we are not committed to dealing with Big Data in the sense of user-contributed data sets with billions of cases, but are rather concentrating on providing an intuitive and responsive way of analyzing traditional, survey-like data frames of up to about 100,000 cases.

Anyway, to be on topic: a really promising project of Optimum Dosing Strategies has been using Rapporter’s API for a number of weeks, starting back in 2012, to compute optimal doses for different kinds of antibiotics based on Monte Carlo simulation and Bayesian adaptive feedback, among other methods.
This collaboration lets the ID-ODS team develop a powerful calculator with full-blown reports ready to be attached to medical records, without any special technical knowledge on their side: we maintain the R engine and the integration part, while they code in R. This results in pleased clients all over the world, which makes us happy too.

We really look forward to shipping a number of educational templates to be used in real life at several (multilingual) universities from September 2013. These templates would let teachers show customizable and interactive reports to the students with any number of comments and narrative paragraphs. These introductory statistical modules would provide a free alternative to other desktop software used in education.

In the next few months, a part of our team will focus on spatial analysis templates, which would mean that our users could not just map, but really analyze any of their spatially related data with a few clicks and clear parameters.

Another feature request from a client seems to be a really exciting idea. Currently, Google Analytics and other tracking services provide basic options to view, filter and export the historical data of websites, blogs etc.
Creating an interface between Rapporter and the tracking services to fetch the most recent data is no longer beyond possibility with the help of existing API resources, so our clients could generate annotated usage reports for any specified period of time – without restrictions. Just to emphasize some potential add-ons: using the time-series R packages in the analysis, or creating real-time “dashboards” with optional forecasts about live data.

Of course you could think of other kinds of live or historical data instead of Google Analytics, as creating a template for e.g. transaction data or the gas usage of a household could be addressed at any time, and please do not forget about the use cases listed under the 3rd question (“[…]Rapporter can help: […]”).

But wait: the beauty of Rapporter is that you could implement all of the above ideas by yourself in our system, even without any help from us.

Ajay- What are some of the things that can be done more easily with Rapporter than with plain vanilla R?

Greg- Rapporter is basically developed for creating reproducible, literate and annotated statistical modules (a.k.a. “templates”), which means that passing a data set and a list of variables with some optional arguments ends up in a full-blown written report with automatically styled tables and charts.

So using Rapporter is like writing “Sweave” or “knitr” documents, but you write the template only once, and then apply that to any number of data sets with a simple click on an intuitive user interface.
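The “write once, apply to any data set” idea can be sketched in plain R. This is only an illustration under assumptions: the function `describe_template` is hypothetical and does not use the syntax of Rapporter or “rapport” templates; `mtcars` and `iris` are data sets that ship with R.

```r
# A hypothetical "template": a function that takes a data set and a variable
# name and returns a short annotated report as lines of markdown text.
describe_template <- function(data, variable) {
  x <- data[[variable]]
  c(
    paste0("## Description of ", variable),
    "",
    sprintf(
      "The variable has %d valid cases, with a mean of %.2f and a standard deviation of %.2f.",
      sum(!is.na(x)), mean(x, na.rm = TRUE), sd(x, na.rm = TRUE)
    )
  )
}

# The same template is applied to two unrelated data sets.
report_cars <- describe_template(mtcars, "mpg")
report_iris <- describe_template(iris, "Sepal.Length")

cat(report_cars, sep = "\n")
cat(report_iris, sep = "\n")
```

The narrative text and the computed values live together in the template, so the author writes the interpretation once and every new data set gets the same wording with its own numbers filled in.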

Besides this major objective: as Rapporter runs in the cloud, and sharing reports and templates (or even data sets) with collaborators or with anyone on the Internet is really easy, our users can post and share any R code for free and without restrictions, or release their templates with a specified license and/or fees in a secured environment.

This means that Rapporter can help:

• scholars to share scientific results or methods with a reproducible and instantly available demo and/or dedicated implementation along with publications,
• teachers to create self-explanatory statistical templates which would help the students internalize the subject by practice,
• any R developer to share a live and interactive demo of the implemented features of their functions with a few clicks,
• businesses to use a statistical platform without restrictions for a reasonable monthly fee instead of expensive and non-portable statistical programs,
• governments and national statistical offices to publicize census or other big data with a scientific and reliable analytic tool with annotated and clear reports, while ensuring the anonymity of the respondents by automatically applying custom methods (like data swapping, rounding, micro-aggregation, PRAM, adding noise etc.) to the tables and results, etc.

And of course, do not forget about one of our main objectives: to open up the world of R to non-R users too with an intuitive, driving user interface.

(To be continued)

About

Gergely Daróczi is co-ordinating the development of Rapporter and maintaining their R packages. Besides trying to be active in some open-source projects and on StackOverflow, he is a PhD candidate in sociology and also a lecturer at Corvinus University of Budapest and Pázmány Péter Catholic University in Hungary.

Rapporter is a web application helping you to create comprehensive, reliable statistical reports on any mobile device or PC, using an intuitive user interface.

The application builds on the power of R alongside other technologies and is intended to be used in any browser, doing the heavy computations on the server side. Some might consider Rapporter a customizable graphical user interface to R – running in the cloud.

Currently, Rapporter is under heavy development and only invited alpha testers can access the application. Please sign up for an invitation if you want to have an early-bird insight into Rapporter.