So many R Packages Everywhere, which one do I use? #rstats

Some thoughts on R Packages

  • CRAN is no longer the sole repository for useful R packages. Alternatives include R-Forge, Google Code and, increasingly, GitHub.
  • CRAN lacks the flexibility and social aspect of GitHub.
  • CRAN Task Views is the only subject-wide listing of R packages, but its categorization is done more by method than by use case or business domain.
  • There are multiple R packages for the same thing. Which one do I use? Only Stack Overflow helps with that; there is no rating or recommendation system.
  • The package suggestions in R package descriptions need better, automatic association analysis. Right now the process is manual and dependent on the package author and maintainer.
  • Quis custodiet ipsos custodes? Who guards the guardians of R packages? In an era of cyber security, we need better transparency on security measures within R packages, especially given the international nature of the project. I am very sure I (or anyone) could create R code to communicate discreetly, especially on Windows.

  • I would rather not install anything on my local machine, and instead read and use packages directly from CRAN. CRAN was designed in an era of low bandwidth; this needs to be upgraded.
  • Note that I am respectfully refraining from commenting on the atrocious aesthetics of the project's home website. Many statisticians see no use in making R user-friendly. My professors at UTenn (from which I dropped out after two semesters) were horrified when I took courses in graphic design, as I wanted to know more about the A and the B that make up the A/B testing of statistical design. Now that I am getting older, I get horrified by the lack of HTML, CSS and jQuery among some of the brightest programmers in this project.
  • Please comment below.

 

Interview Jeroen Ooms OpenCPU #rstats

Below is an interview with Jeroen Ooms, a pioneer in R and web development. Jeroen contributes to R by developing packages and web applications for multiple projects.


Ajay- What are you working on these days?
Jeroen- My research revolves around the challenges and opportunities of using R in embedded applications and scalable systems. After developing numerous web applications, I started the OpenCPU project about 1.5 years ago as a first attempt at a complete framework for proper integration of R in web services. As I work on this, I run into challenges that shape my research and sometimes become projects in their own right. For example, the RAppArmor package provides the security framework for OpenCPU, but can be used for other purposes as well. RAppArmor interfaces to methods in the Linux kernel related to setting security and resource limits. The GitHub page contains the source code, installation instructions, video demos, and a draft of a paper for the Journal of Statistical Software. Another problem that appeared in OpenCPU is that applications that used to work were breaking unexpectedly later on due to changes in dependency packages on CRAN. This is actually a general problem that affects almost all R users, as it compromises the reliability of CRAN packages and the reproducibility of results. In a paper (forthcoming in The R Journal), this problem is discussed in more detail and directions for improvement are suggested. A preprint of the paper is available on arXiv: http://arxiv.org/abs/1303.2140.
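For readers curious what that sandboxing layer looks like in practice, below is a minimal sketch (Linux only). The function names follow RAppArmor's documentation and JSS draft, but treat the exact signatures as illustrative rather than authoritative.

  # Sandboxing a computation with RAppArmor (Linux only); a sketch.
  library(RAppArmor)

  # Cap resources for the current R process via the kernel's rlimit interface:
  rlimit_as(1e9)   # at most ~1 GB of address space
  rlimit_cpu(60)   # at most 60 seconds of CPU time

  # Evaluate untrusted code in a forked process under an AppArmor profile:
  out <- eval.secure(mean(rnorm(1e6)), profile = "r-user", timeout = 10)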

I am also working on software not directly related to R. For example, in project Mobilize we teach high school students in Los Angeles the basics of collecting and analyzing data. They use mobile devices to upload surveys with questions, photos, GPS coordinates, etc. using the Ohmage software. Within Mobilize and Ohmage, I am in charge of developing web applications that help students visualize the data they collaboratively collected. One public demo, with actual data collected by students about snacking behavior, is available at http://jeroenooms.github.com/snack. The application allows students to explore their data by filtering, zooming, browsing, comparing, etc. It helps students and teachers access and learn from their data without complicated tools or programming. This approach would easily generalize to other fields, like medical data or BI. The great thing about this application is that it is fully client-side; the backend is simply a CSV file, so it is very easy to deploy and maintain.

Ajay- What's your take on the difference between OpenCPU and RevoDeployR?
Jeroen- RevoDeployR and OpenCPU both provide a system for developing R web applications, but in fairly different contexts. OpenCPU is open source and written completely in R, whereas RevoDeployR is proprietary and written in Java. I think Revolution focuses more on a complete solution in a corporate environment. It integrates with the Revolution Enterprise suite and their other big data products, and has built-in functionality for authentication, managing privileges, server administration, support for MS Windows, etc. OpenCPU, on the other hand, is much smaller and should be seen as just a computational backend, analogous to a database backend. It exposes a clean HTTP API to call R functions so they can be embedded in larger systems, but it is not a complete end product in itself.

OpenCPU is designed to make it easy for a statistician to expose statistical functionality that will be used by web developers who do not need to understand or learn R. One interesting example is how we use OpenCPU inside OpenMHealth, a project that designs an architecture for mobile applications in the health domain. Part of the architecture are so-called "Data Processing Units", aka DPUs. These are simple, modular I/O units that do various sorts of data processing, similar to Unix tools, but over HTTPS. For example, the mobility DPU is used to calculate distances between GPS coordinates via a simple HTTP call, which OpenCPU maps to the corresponding R function implementing the haversine formula.
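For concreteness, here is a plain-R haversine function of the shape such a DPU would wrap; the formula itself is standard, and any naming around it is illustrative. OpenCPU then maps an HTTP POST to /ocpu/library/{package}/R/{function} onto a call like this one.

  # Great-circle distance via the haversine formula, in kilometres.
  haversine <- function(lat1, lon1, lat2, lon2, r = 6371) {
    rad  <- pi / 180                     # degrees to radians
    dlat <- (lat2 - lat1) * rad
    dlon <- (lon2 - lon1) * rad
    a <- sin(dlat / 2)^2 +
         cos(lat1 * rad) * cos(lat2 * rad) * sin(dlon / 2)^2
    2 * r * asin(sqrt(a))
  }

  haversine(34.05, -118.25, 37.77, -122.42)  # LA to SF: roughly 559 km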

Ajay- What are your views on Shiny by RStudio?
Jeroen- RStudio seems very promising. Like Revolution, they deliver a more full featured product than any of my projects. However, RStudio is completely open source, which is great because it allows anyone to leverage the software and make it part of their projects. I think this is one of the reasons why the product has gotten a lot of traction in the community, which has in turn provided RStudio with great feedback to further improve the product. It illustrates how open source can be a win-win situation. I am currently developing a package to run OpenCPU inside RStudio, which will make developing and running OpenCPU apps much easier.

Ajay- Are you still developing excellent RApache web apps (which IMHO could be used for visualization, like business intelligence tools)?
Jeroen- The OpenCPU framework was a result of those web apps (including ggplot2 for graphical exploratory analysis, lme4 for online random effects modeling, stockplot for stock predictions and irttool.com, an R web application for online IRT analysis). I started developing some of those apps a couple of years ago, and realized that I was repeating a large share of the infrastructure for each application. Based on those experiences I extracted a general-purpose framework. Once the framework is done, I'll go back to developing applications 🙂

Ajay- You have helped build web apps, OpenCPU, RAppArmor, Ohmage, Snack, and mobility apps. What is your thesis on?
Jeroen- My thesis revolves around the technical and social challenges of moving statistical computing beyond the academic and private labs, into more public, accessible and social places. Currently statistics is still done mostly manually by specialists using software to load data, perform some analysis, and produce results that end up in a report or presentation. There are great opportunities to leverage the open source analysis and visualization methods that R has to offer as part of open source stacks, services, systems and applications. However, several problems need to be addressed before this can actually be put into production. I hope my doctoral research will contribute to taking a step in that direction.

Ajay- R is RAM-constrained but the cloud offers lots of RAM. Do you see R increasing in usage on the cloud? Why or why not?
Jeroen- Statistical computing can greatly benefit from the resources that the cloud has to offer. Software like OpenCPU, RStudio, Shiny and RevoDeployR all provide some approach to moving computation to centralized servers. This is only the beginning. Statisticians, researchers and analysts will increasingly share and publish data, code and results on social cloud-based computing platforms. This will address some of the hardware challenges, but also contribute towards reproducible research and further socialize data analysis, i.e. improve learning, collaboration and integration.

That said, the cloud is not going to solve all problems. You mention the need for more memory, but that is only one direction to scale in. Some of the issues we need to address are more fundamental and require new algorithms, different paradigms, or a cultural change. There are many exciting efforts going on that are at least as relevant as big hardware. Gelman's mc-stan implements a new Monte Carlo method that makes Bayesian inference easier and faster while supporting more complex models. This is going to make advanced Bayesian methods more accessible to applied researchers, i.e. scale in terms of complexity and applicability. JavaScript is also rapidly becoming more interesting. The performance of Google's JavaScript engine V8 outruns any other scripting language at this point, and the huge JavaScript community provides countless excellent software libraries. For example, D3 is a graphics library that is about to surpass R in terms of functionality, reliability and user base. The Snack viz that I developed for Mobilize is based largely on D3. Finally, Julia is another young language for technical computing with lots of activity and very smart people behind it. These developments are just as important for the future of statistical computing as big data solutions.

About-
You can read more on Jeroen and his work at http://jeroenooms.github.com/ and reach out to him at http://www.linkedin.com/in/datajeroen

Interview Jeff Allen Trestle Technology #rstats #rshiny

Here is an interview with Jeff Allen, who works with R and the new package Shiny in his technology startup. We featured his RGL Demo in our list of Shiny Demos here.


Ajay- Describe how you started using R. What are some of the benefits you noticed on moving to R?

Jeff- I began using R in an internship while working on my undergraduate degree. I was provided with some unformatted R code and asked to modularize the code then wrap it up into an R package for distribution alongside a publication.

To be honest, as a Computer Science student with training more heavily emphasizing the big high-level languages, R took some getting used to for me. It wasn't until after I concluded that initial project and began using R to do my own data analysis that I began to realize its potential and value. It was the first scripting language which really made interactive use appealing to me; the experience of exploring a dataset in R was unlike anything…

Shiny 0.3 released. New era for #rstats

Message from Winston Chang of RStudio.

—-

We’ve released Shiny 0.3.0, and it’s available on CRAN now. Glimmer will be updated with the latest version of Shiny some time later today.
To update your installation of Shiny, run:
  install.packages('shiny')
Highlights of the new version include:
* Some bugs were fixed in `reactivePrint()` and `reactiveText()`, so they have slightly different rules for collecting output. Please be aware that some changes to your apps' text output are possible. The help pages for these functions explain the behavior.
* New `runGitHub()` function, which can run apps directly from a repository on GitHub.
* New `runUrl()` function, which can run apps stored as zip or tar files on a remote web server.
* New `isolate()` function, which allows you to access reactive values (from input) without making the function dependent on them.
* Improved scheduling of evaluation of reactive functions, which should reduce the number of “extra” times a reactive function is called.
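Below is a quick sketch of the new functions in action; the repository, URL and input names are placeholders, not working endpoints.

  library(shiny)

  # Run an app straight from a GitHub repository:
  runGitHub("shiny_example", "rstudio")

  # Run an app stored as a zip file on a remote server:
  runUrl("http://example.com/myapp.zip")

  # isolate() reads a reactive value without taking a dependency on it,
  # so changing input$n alone will not re-run this output:
  output$summary <- reactivePrint(function() {
    n <- isolate(input$n)
    summary(rnorm(n))
  })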

Little Book of R For Time Series #rstats

I loved this book. Only 75 pages, very lucidly written, and available on GitHub for free. Nice job by Avril Coghlan (a.coghlan@ucc.ie).

Of course, my usual suspects for time series readings are:

1) The seminal PDF (2008!!) by a certain Prof Hyndman

(PDF: Rtimeseries-ohp.pdf)

 

2) JSS paper: Automatic Time Series Forecasting: The forecast Package for R, http://www.jstatsoft.org/v27/i03/paper

3) The CRAN View http://cran.r-project.org/web/views/TimeSeries.html

This is cluttered and getting more cluttered. Some help for recent converts to R, especially in the field of corporate forecasting or time series for business analytics, would really help; a minimal quick start follows below.
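A minimal sketch using Hyndman's forecast package on a built-in dataset, the kind of on-ramp recent converts need:

  library(forecast)

  fit <- auto.arima(AirPassengers)  # automatically select an ARIMA model
  fc  <- forecast(fit, h = 12)      # forecast the next 12 months
  plot(fc)                          # point forecasts with prediction intervals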

Avril does an awesome job with this curiously named ( 😉 ) booklet at http://a-little-book-of-r-for-time-series.readthedocs.org/en/latest/src/timeseries.html

Who made Who in #Rstats

While Bob M, my old mentor and fellow TN man, maintains the website http://r4stats.com/ tracking how popular R is across various forums, I am interested in who within the R community of 3 million (give or take a few) is contributing more. I am very sure that by 2014 we can have a new fork of R called Hadley R, in which all packages would be made by Hadley Wickham and you won't need anything else.

But jokes apart, since I didn't have the time to either

1) scrape CRAN for all package authors

2) scrape for lines of code across all packages

3) allocate lines of code (itself a dubious software productivity metric) to various authors of R packages,

OR

1) scrape the entire R-help list (and its 2011 archives)

2) determine who is the most frequent R question-and-answer user (à la SAS-L's annual MVP and rookie of the year awards),

I did the following to at least see who is talking about R across easily scrapable Q&A websites.

Stack Overflow still rules over all.

http://stackoverflow.com/tags/r/topusers shows the statistics on who made whom in R on Stack Overflow

All in all, initial ardour seems to have slowed for #Rstats on Stack Overflow? Or is it just summer?

No; the answer (credit to Rob J Hyndman) is that most(?) activity is shifting to Stats Exchange.

http://stats.stackexchange.com/tags/r/topusers


You could also paste this into Notepad and make some graphs of average score per answer, or even make a social network graph if you had the time; a rough scraping sketch follows below.
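A sketch with the XML package; it assumes the top-users page renders its statistics as an HTML table, which may well not hold in practice.

  library(XML)

  url    <- "http://stackoverflow.com/tags/r/topusers"
  tables <- readHTMLTable(url, stringsAsFactors = FALSE)
  str(tables)  # inspect which table (if any) holds users, answers and scores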

Do NOT (Google/Bing) search for "Stack Overflow API" or "web scraping Stack Overflow": it gives you all the answers on the website but 0 answers on how to scrape these websites.

I have added a new website called MetaOptimize to this list, based on Tal G's interview of Joseph Turian, at http://www.r-statistics.com/2010/07/statistical-analysis-qa-website-did-stackoverflow-just-lose-it-to-metaoptimize-and-is-it-good-or-bad/

http://metaoptimize.com/qa/tags/r/?sort=hottest

There are only 17 questions tagged R, but it seems a lot of views are being generated.

I also decided to add views from Quora, since it is a Q&A site (and one which I really like):

http://www.quora.com/R-software

Again, very few questions but many followers.

How big is R on CRAN #rstats

3.87 GB and 3,786 packages. That's what you need to install the whole of CRAN.

(Note: many IT administrators and compliance policies in enterprises forbid installing from the Internet in work offices, which is where the analytics, $$, and people are.)

As downloaded from the CRAN Mirror at UCLA.

It takes 3 hours to download at 1 Mbps (I was on an Amazon EC2 instance).

See screenshot.
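If you would rather size up CRAN from within R than from a download log, here is a sketch using the UCLA mirror named above:

  repo <- "http://cran.stat.ucla.edu"
  ap   <- available.packages(contrib.url(repo, type = "source"))
  nrow(ap)  # number of source packages on CRAN right now

  # Downloading every source tarball (what produced the 3.87 GB figure)
  # would then be something like:
  # download.packages(ap[, "Package"], destdir = "cran-src", repos = repo)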

Next question: who in the R project is responsible for deleting old/deprecated/redundant packages if the authors don't do it?