Using Views in R and comparing functions across multiple packages

Some RDF hacking relating to updating probabil...
Image via Wikipedia

R has almost 2923 available packages

This makes the task of searching among these packages and comparing functions for the same analytical task across different packages a bit tedious and prone to manual searching (of reading multiple Pdfs of help /vignette of packages) or sending an email to the R help list.

However using R Views is a slightly better way of managing all your analytical requirements for software rather than the large number of packages (see Graphics view below).

CRAN Task Views allow you to browse packages by topic and provide tools to automatically install all packages for special areas of interest. Currently, 28 views are available. http://cran.r-project.org/web/views/

Bayesian Bayesian Inference
ChemPhys Chemometrics and Computational Physics
ClinicalTrials Clinical Trial Design, Monitoring, and Analysis
Cluster Cluster Analysis & Finite Mixture Models
Distributions Probability Distributions
Econometrics Computational Econometrics
Environmetrics Analysis of Ecological and Environmental Data
ExperimentalDesign Design of Experiments (DoE) & Analysis of Experimental Data
Finance Empirical Finance
Genetics Statistical Genetics
Graphics Graphic Displays & Dynamic Graphics & Graphic Devices & Visualization
gR gRaphical Models in R
HighPerformanceComputing High-Performance and Parallel Computing with R
MachineLearning Machine Learning & Statistical Learning
MedicalImaging Medical Image Analysis
Multivariate Multivariate Statistics
NaturalLanguageProcessing Natural Language Processing
OfficialStatistics Official Statistics & Survey Methodology
Optimization Optimization and Mathematical Programming
Pharmacokinetics Analysis of Pharmacokinetic Data
Phylogenetics Phylogenetics, Especially Comparative Methods
Psychometrics Psychometric Models and Methods
ReproducibleResearch Reproducible Research
Robust Robust Statistical Methods
SocialSciences Statistics for the Social Sciences
Spatial Analysis of Spatial Data
Survival Survival Analysis
TimeSeries Time Series Analysis

To automatically install these views, the ctv package needs to be installed, e.g., via

install.packages("ctv")
library("ctv")
Created by Pretty R at inside-R.org


and then the views can be installed via install.views or update.views (which first assesses which of the packages are already installed and up-to-date), e.g.,

install.views("Econometrics")
 update.views("Econometrics")
 Created by Pretty R at inside-R.org

CRAN Task View: Graphic Displays & Dynamic Graphics & Graphic Devices & Visualization

Maintainer: Nicholas Lewin-Koh
Contact: nikko at hailmail.net
Version: 2009-10-28

R is rich with facilities for creating and developing interesting graphics. Base R contains functionality for many plot types including coplots, mosaic plots, biplots, and the list goes on. There are devices such as postscript, png, jpeg and pdf for outputting graphics as well as device drivers for all platforms running R. lattice and grid are supplied with R’s recommended packages and are included in every binary distribution. lattice is an R implementation of William Cleveland’s trellis graphics, while grid defines a much more flexible graphics environment than the base R graphics.

R’s base graphics are implemented in the same way as in the S3 system developed by Becker, Chambers, and Wilks. There is a static device, which is treated as a static canvas and objects are drawn on the device through R plotting commands. The device has a set of global parameters such as margins and layouts which can be manipulated by the user using par() commands. The R graphics engine does not maintain a user visible graphics list, and there is no system of double buffering, so objects cannot be easily edited without redrawing a whole plot. This situation may change in R 2.7.x, where developers are working on double buffering for R devices. Even so, the base R graphics can produce many plots with extremely fine graphics in many specialized instances.

One can quickly run into trouble with R’s base graphic system if one wants to design complex layouts where scaling is maintained properly on resizing, nested graphs are desired or more interactivity is needed. grid was designed by Paul Murrell to overcome some of these limitations and as a result packages like latticeggplot2vcd or hexbin (on Bioconductor ) use grid for the underlying primitives. When using plots designed with grid one needs to keep in mind that grid is based on a system of viewports and graphic objects. To add objects one needs to use grid commands, e.g., grid.polygon() rather than polygon(). Also grid maintains a stack of viewports from the device and one needs to make sure the desired viewport is at the top of the stack. There is a great deal of explanatory documentation included with grid as vignettes.

The graphics packages in R can be organized roughly into the following topics, which range from the more user oriented at the top to the more developer oriented at the bottom. The categories are not mutually exclusive but are for the convenience of presentation:

  • Plotting : Enhancements for specialized plots can be found in plotrix, for polar plotting, vcd for categorical data, hexbin (on Bioconductor ) for hexagon binning, gclus for ordering plots and gplots for some plotting enhancements. Some specialized graphs, like Chernoff faces are implemented in aplpack, which also has a nice implementation of Tukey’s bag plot. For 3D plots latticescatterplot3d and misc3d provide a selection of plots for different kinds of 3D plotting. scatterplot3d is based on R’s base graphics system, while misc3d is based on rgl. The package onion for visualizing quaternions and octonions is well suited to display 3D graphics based on derived meshes.
  • Graphic Applications : This area is not much different from the plotting section except that these packages have tools that may not for display, but can aid in creating effective displays. Also included are packages with more esoteric plotting methods. For specific subject areas, like maps, or clustering the excellent task views contributed by other dedicated useRs is an excellent place to start.
    • Effect ordering : The gclus package focuses on the ordering of graphs to accentuate cluster structure or natural ordering in the data. While not for graphics directly cba and seriation have functions for creating 1 dimensional orderings from higher dimensional criteria. For ordering an array of displays, biclust can be useful.
    • Large Data Sets : Large data sets can present very different challenges from moderate and small datasets. Aside from overplotting, rendering 1,000,000 points can tax even modern GPU’s. For univariate datalvplot produces letter value boxplots which alleviate some of the problems that standard boxplots exhibit for large data sets. For bivariate data ash can produce a bivariate smoothed histogram very quickly, and hexbin, on Bioconductor , can bin bivariate data onto a hexagonal lattice, the advantage being that the irregular lines and orientation of hexagons do not create linear artifacts. For multivariate data, hexbin can be used to create a scatterplot matrix, combined with lattice. An alternative is to use scagnostics to produce a scaterplot matrix of “data about the data”, and look for interesting combinations of variables.
    • Trees and Graphs ape and ade4 have functions for plotting phylogenetic trees, which can be used for plotting dendrograms from clustering procedures. While these packages produce decent graphics, they do not use sophisticated algorithms for node placement, so may not be useful for very large trees. igraph has the Tilford-Rheingold algorithm implementead and is useful for plotting larger trees. diagram as facilities for flow diagrams and simple graphs. For more sophisticated graphs Rgraphviz and igraph have functions for plotting and layout, especially useful for representing large networks.
  • Graphics Systems lattice is built on top of the grid graphics system and is an R implementation of William Cleveland’s trellis system for S-PLUS. lattice allows for building many types of plots with sophisticated layouts based on conditioning. ggplot2 is an R implementation of the system described in “A Grammar of Graphics” by Leland Wilkinson. Like latticeggplot (also built on top of grid) assists in trellis-like graphics, but allows for much more. Since it is built on the idea of a semantics for graphics there is much more emphasis on reshaping data, transformation, and assembling the elements of a plot.
  • Devices : Whereas grid is built on top of the R graphics engine, many in the R community have found the R graphics engine somewhat inflexible and have written separate device drivers that either emphasize interactivity or plotting in various graphics formats. R base supplies devices for PostScript, PDF, JPEG and other formats. Devices on CRAN include cairoDevice which is a device based libcairo, which can actually render to many device types. The cairo device is desgned to work with RGTK2, which is an interface to the Gimp Tool Kit, similar to pyGTK2. GDD provides device drivers for several bitmap formats, including GIF and BMP. RSvgDevice is an SVG device driver and interfaces well with with vector drawing programs, or R web development packages, such as Rpad. When SVG devices are for web display developers should be aware that internet explorer does not support SVG, but has their own standard. Trust Microsoft. rgl provides a device driver based on OpenGL, and is good for 3D and interactive development. Lastly, the Augsburg group supplies a set of packages that includes a Java-based device, JavaGD.
  • Colors : The package colorspace provides a set of functions for transforming between color spaces and mixcolor() for mixing colors within a color space. Based on the HCL colors provided in colorspacevcdprovides a set of functions for choosing color palettes suitable for coding categorical variables ( rainbow_hcl()) and numerical information ( sequential_hcl()diverge_hcl()). Similar types of palettes are provided in RColorBrewer and dichromat is focused on palettes for color-impaired viewers.
  • Interactive Graphics : There are several efforts to implement interactive graphics systems that interface well with R. In an interactive system the user can interactively query the graphics on the screen with the mouse, or a moveable brush to zoom, pan and query on the device as well as link with other views of the data. rggobi embeds the GGobi interactive graphics system within R, so that one can display a data frame or several in GGobi directly from R. The package has functions to support longitudinal data, and graphs using GGobi’s edge set functionality. The RoSuDA repository maintained and developed by the University of Augsburg group has two packages, iplots and iwidgets as well as their Java development environment including a Java device, JavaGD. Their interactive graphics tools contain functions for alpha blending, which produces darker shading around areas with more data. This is exceptionally useful for parallel coordinate plots where many lines can quickly obscure patterns. playwith has facilities for building interactive versions of R graphics using the cairoDevice and RGtk2. Lastly, the rgl package has mechanisms for interactive manipulation of plots, especially 3D rotations and surfaces.
  • Development : For development of specialized graphics packages in R, grid should probably be the first consideration for any new plot type. rgl has better tools for 3D graphics, since the device is interactive, though it can be slow. An alternative is to use Java and the Java device in the RoSuDA packages, though Java has its own drawbacks. For porting plotting code to grid, using the package gridBase presents a nice intermediate step to embed base graphics in grid graphics and vice versa.

Cloud Computing with R

Illusion of Depth and Space (4/22) - Rotating ...
Image by Dominic's pics via Flickr

Here is a short list of resources and material I put together as starting points for R and Cloud Computing It’s a bit messy but overall should serve quite comprehensively.

Cloud computing is a commonly used expression to imply a generational change in computing from desktop-servers to remote and massive computing connections,shared computers, enabled by high bandwidth across the internet.

As per the National Institute of Standards and Technology Definition,
Cloud computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction.

(Citation: The NIST Definition of Cloud Computing

Authors: Peter Mell and Tim Grance
Version 15, 10-7-09
National Institute of Standards and Technology, Information Technology Laboratory
http://csrc.nist.gov/groups/SNS/cloud-computing/cloud-def-v15.doc)

R is an integrated suite of software facilities for data manipulation, calculation and graphical display.

From http://cran.r-project.org/doc/FAQ/R-FAQ.html#R-Web-Interfaces

R Web Interfaces

Rweb is developed and maintained by Jeff Banfield. The Rweb Home Page provides access to all three versions of Rweb—a simple text entry form that returns output and graphs, a more sophisticated JavaScript version that provides a multiple window environment, and a set of point and click modules that are useful for introductory statistics courses and require no knowledge of the R language. All of the Rweb versions can analyze Web accessible datasets if a URL is provided.
The paper “Rweb: Web-based Statistical Analysis”, providing a detailed explanation of the different versions of Rweb and an overview of how Rweb works, was published in the Journal of Statistical Software (http://www.jstatsoft.org/v04/i01/).

Ulf Bartel has developed R-Online, a simple on-line programming environment for R which intends to make the first steps in statistical programming with R (especially with time series) as easy as possible. There is no need for a local installation since the only requirement for the user is a JavaScript capable browser. See http://osvisions.com/r-online/ for more information.

Rcgi is a CGI WWW interface to R by MJ Ray. It had the ability to use “embedded code”: you could mix user input and code, allowing the HTMLauthor to do anything from load in data sets to enter most of the commands for users without writing CGI scripts. Graphical output was possible in PostScript or GIF formats and the executed code was presented to the user for revision. However, it is not clear if the project is still active.

Currently, a modified version of Rcgi by Mai Zhou (actually, two versions: one with (bitmap) graphics and one without) as well as the original code are available from http://www.ms.uky.edu/~statweb/.

CGI-based web access to R is also provided at http://hermes.sdu.dk/cgi-bin/go/. There are many additional examples of web interfaces to R which basically allow to submit R code to a remote server, see for example the collection of links available from http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/StatCompCourse.

David Firth has written CGIwithR, an R add-on package available from CRAN. It provides some simple extensions to R to facilitate running R scripts through the CGI interface to a web server, and allows submission of data using both GET and POST methods. It is easily installed using Apache under Linux and in principle should run on any platform that supports R and a web server provided that the installer has the necessary security permissions. David’s paper “CGIwithR: Facilities for Processing Web Forms Using R” was published in the Journal of Statistical Software (http://www.jstatsoft.org/v08/i10/). The package is now maintained by Duncan Temple Lang and has a web page athttp://www.omegahat.org/CGIwithR/.

Rpad, developed and actively maintained by Tom Short, provides a sophisticated environment which combines some of the features of the previous approaches with quite a bit of JavaScript, allowing for a GUI-like behavior (with sortable tables, clickable graphics, editable output), etc.
Jeff Horner is working on the R/Apache Integration Project which embeds the R interpreter inside Apache 2 (and beyond). A tutorial and presentation are available from the project web page at http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/RApacheProject.

Rserve is a project actively developed by Simon Urbanek. It implements a TCP/IP server which allows other programs to use facilities of R. Clients are available from the web site for Java and C++ (and could be written for other languages that support TCP/IP sockets).

OpenStatServer is being developed by a team lead by Greg Warnes; it aims “to provide clean access to computational modules defined in a variety of computational environments (R, SAS, Matlab, etc) via a single well-defined client interface” and to turn computational services into web services.

Two projects use PHP to provide a web interface to R. R_PHP_Online by Steve Chen (though it is unclear if this project is still active) is somewhat similar to the above Rcgi and Rweb. R-php is actively developed by Alfredo Pontillo and Angelo Mineo and provides both a web interface to R and a set of pre-specified analyses that need no R code input.

webbioc is “an integrated web interface for doing microarray analysis using several of the Bioconductor packages” and is designed to be installed at local sites as a shared computing resource.

Rwui is a web application to create user-friendly web interfaces for R scripts. All code for the web interface is created automatically. There is no need for the user to do any extra scripting or learn any new scripting techniques. Rwui can also be found at http://rwui.cryst.bbk.ac.uk.

Finally, the R.rsp package by Henrik Bengtsson introduces “R Server Pages”. Analogous to Java Server Pages, an R server page is typically HTMLwith embedded R code that gets evaluated when the page is requested. The package includes an internal cross-platform HTTP server implemented in Tcl, so provides a good framework for including web-based user interfaces in packages. The approach is similar to the use of the brew package withRapache with the advantage of cross-platform support and easy installation.

Also additional R Cloud Computing Use Cases
http://wwwdev.ebi.ac.uk/Tools/rcloud/

ArrayExpress R/Bioconductor Workbench

Remote access to R/Bioconductor on EBI’s 64-bit Linux Cluster

Start the workbench by downloading the package for your operating system (Macintosh or Windows), or via Java Web Start, and you will get access to an instance of R running on one of EBI’s powerful machines. You can install additional packages, upload your own data, work with graphics and collaborate with colleagues, all as if you are running R locally, but unlimited by your machine’s memory, processor or data storage capacity.

  • Most up-to-date R version built for multicore CPUs
  • Access to all Bioconductor packages
  • Access to our computing infrastructure
  • Fast access to data stored in EBI’s repositories (e.g., public microarray data in ArrayExpress)

Using R Google Docs
http://www.omegahat.org/RGoogleDocs/run.pdf
It uses the XML and RCurl packages and illustrates that it is relatively quick and easy
to use their primitives to interact with Web services.

Using R with Amazon
Citation
http://rgrossman.com/2009/05/17/running-r-on-amazons-ec2/

Amazon’s EC2 is a type of cloud that provides on demand computing infrastructures called an Amazon Machine Images or AMIs. In general, these types of cloud provide several benefits:

  • Simple and convenient to use. An AMI contains your applications, libraries, data and all associated configuration settings. You simply access it. You don’t need to configure it. This applies not only to applications like R, but also can include any third-party data that you require.
  • On-demand availability. AMIs are available over the Internet whenever you need them. You can configure the AMIs yourself without involving the service provider. You don’t need to order any hardware and set it up.
  • Elastic access. With elastic access, you can rapidly provision and access the additional resources you need. Again, no human intervention from the service provider is required. This type of elastic capacity can be used to handle surge requirements when you might need many machines for a short time in order to complete a computation.
  • Pay per use. The cost of 1 AMI for 100 hours and 100 AMI for 1 hour is the same. With pay per use pricing, which is sometimes called utility pricing, you simply pay for the resources that you use.

Connecting to R on Amazon EC2- Detailed tutorials
Ubuntu Linux version
https://decisionstats.com/2010/09/25/running-r-on-amazon-ec2/
and Windows R version
https://decisionstats.com/2010/10/02/running-r-on-amazon-ec2-windows/

Connecting R to Data on Google Storage and Computing on Google Prediction API
https://github.com/onertipaday/predictionapirwrapper
R wrapper for working with Google Prediction API

This package consists in a bunch of functions allowing the user to test Google Prediction API from R.
It requires the user to have access to both Google Storage for Developers and Google Prediction API:
see
http://code.google.com/apis/storage/ and http://code.google.com/apis/predict/ for details.

Example usage:

#This example requires you had previously created a bucket named data_language on your Google Storage and you had uploaded a CSV file named language_id.txt (your data) into this bucket – see for details
library(predictionapirwrapper)

and Elastic R for Cloud Computing
http://user2010.org/tutorials/Chine.html

Abstract

Elastic-R is a new portal built using the Biocep-R platform. It enables statisticians, computational scientists, financial analysts, educators and students to use cloud resources seamlessly; to work with R engines and use their full capabilities from within simple browsers; to collaborate, share and reuse functions, algorithms, user interfaces, R sessions, servers; and to perform elastic distributed computing with any number of virtual machines to solve computationally intensive problems.
Also see Karim Chine’s http://biocep-distrib.r-forge.r-project.org/

R for Salesforce.com

At the point of writing this, there seem to be zero R based apps on Salesforce.com This could be a big opportunity for developers as both Apex and R have similar structures Developers could write free code in R and charge for their translated version in Apex on Salesforce.com

Force.com and Salesforce have many (1009) apps at
http://sites.force.com/appexchange/home for cloud computing for
businesses, but very few forecasting and statistical simulation apps.

Example of Monte Carlo based app is here
http://sites.force.com/appexchange/listingDetail?listingId=a0N300000016cT9EAI#

These are like iPhone apps except meant for business purposes (I am
unaware if any university is offering salesforce.com integration
though google apps and amazon related research seems to be on)

Force.com uses a language called Apex  and you can see
http://wiki.developerforce.com/index.php/App_Logic and
http://wiki.developerforce.com/index.php/An_Introduction_to_Formulas
Apex is similar to R in that is OOPs

SAS Institute has an existing product for taking in Salesforce.com data.

A new SAS data surveyor is
available to access data from the Customer Relationship Management
(CRM) software vendor Salesforce.com. at
http://support.sas.com/documentation/cdl/en/whatsnew/62580/HTML/default/viewer.htm#datasurveyorwhatsnew902.htm)

Personal Note-Mentioning SAS in an email to a R list is a big no-no in terms of getting a response and love. Same for being careless about which R help list to email (like R devel or R packages or R help)

For python based cloud see http://pi-cloud.com

%d bloggers like this: