Topic Models in R: search documents for similarity by frequency

From the marvelous, lovely Journal of Statistical Software, ignored by mainstream corporatia but beloved by academia, here is one more interesting and very timely paper.

It can be used to grade students' homework, catch terrorists (as in plagiarists), and spot search engine spam linkers. Enjoy!
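
For a feel of what "similarity by frequency" means, here is a minimal base R sketch. It is not the topic-model approach from the paper; it only compares raw word counts using cosine similarity. The three toy documents and all variable names are made up for illustration.

```r
# Three toy documents (hypothetical data, for illustration only).
docs <- c(a = "the cat sat on the mat",
          b = "the cat lay on the mat",
          c = "stock prices fell sharply today")

# Split each document into lower-case word tokens.
tokens <- strsplit(tolower(docs), "[^a-z]+")

# Build a document-by-term frequency matrix over the shared vocabulary.
vocab <- sort(unique(unlist(tokens)))
tf <- t(sapply(tokens, function(w) table(factor(w, levels = vocab))))

# Cosine similarity: dot products divided by the product of vector lengths.
norms <- sqrt(rowSums(tf^2))
cos_sim <- (tf %*% t(tf)) / (norms %o% norms)

print(round(cos_sim, 3))
```

Documents a and b share most of their words, so they score high; a and c share none, so they score zero. A real plagiarism or spam check would need far more care (stop words, stemming, weighting) than this sketch shows.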

New book on BigData Analytics and Data mining using #Rstats with a GUI

I am hoping to put this on my pre-order or Amazon wish list. This is the book for the common people who wanted to do data mining with R, but were unable to ask aloud because they didn't know much. It is written by the seminal Australian authority on data mining, Dr Graham Williams, whom I interviewed here: https://decisionstats.com/2009/01/13/interview-dr-graham-williams/

Data Mining for the masses using an ergonomically designed Graphical User Interface.

Thank you Springer. Thank you Dr Graham Williams

http://www.springer.com/statistics/physical+%26+information+science/book/978-1-4419-9889-7

Data Mining with Rattle and R

The Art of Excavating Data for Knowledge Discovery

Series: Use R

Williams, Graham

1st Edition., 2011, XX, 409 p. 150 illus. in color.

  • Softcover, ISBN 978-1-4419-9889-7

    Due: August 29, 2011

    54,95 €
  • Encourages the concept of programming with data – more than just pushing data through tools, but learning to live and breathe the data
  • Accessible to many readers and not necessarily just those with strong backgrounds in computer science or statistics
  • Details some of the more popular algorithms for data mining, as well as covering model evaluation and model deployment

Data mining is the art and science of intelligent data analysis. By building knowledge from information, data mining adds considerable value to the ever increasing stores of electronic data that abound today. In performing data mining many decisions need to be made regarding the choice of methodology, the choice of data, the choice of tools, and the choice of algorithms.

Throughout this book the reader is introduced to the basic concepts and some of the more popular algorithms of data mining. With a focus on the hands-on end-to-end process for data mining, Williams guides the reader through various capabilities of the easy to use, free, and open source Rattle Data Mining Software built on the sophisticated R Statistical Software. The focus on doing data mining rather than just reading about data mining is refreshing.

The book covers data understanding, data preparation, data refinement, model building, model evaluation,  and practical deployment. The reader will learn to rapidly deliver a data mining project using software easily installed for free from the Internet. Coupling Rattle with R delivers a very sophisticated data mining environment with all the power, and more, of the many commercial offerings.

Content Level » Research

Keywords » Data mining

Related subjects » Physical & Information Science

Related- https://decisionstats.com/2009/01/13/interview-dr-graham-williams/

Revolution releases R Windows for Academics for free

From the official email sent out by Revolution Analytics (God bless the merry coders at Revo):

Revolution Analytics has just released Revolution R Enterprise 4.3 for 32-bit and 64-bit Windows, a significant step forward in enterprise data analytics.  It features an updated RevoScaleR package for scalable, fast (multicore), and extensible data analysis with R. Revolution R Enterprise 4.3 for Windows also provides R 2.12.2, and includes an enhanced R Productivity Environment (RPE), a full-featured integrated development environment with visual debugging capabilities. Also available is an updated Windows release of our deployment server solution, RevoDeployR 1.2, designed to help you deliver R analytics via the Web.

As a registered user of the Academic version of Revolution R Enterprise for Windows, you can take advantage of these improvements by downloading and installing Revolution R Enterprise 4.3 today. You can install Revolution R Enterprise 4.3 side-by-side with your existing Revolution R Enterprise installations; there is no need to uninstall previous versions.

 

High Performance Analytics

Marry Big Data Analytics to High Performance Computing, and you get the buzzword of this season: High Performance Analytics.

It basically consists of parallelized code running on custom hardware, in-database analytics for speed, and cloud computing or high performance computing environments. On an operational level, it consists of software (as in analytics) partnering with software (as in databases, MapReduce, Hadoop) plus some hardware (HP or IBM mostly). It is considered a high-margin, highly profitable business with a small number of deals compared to, say, desktop licenses.

As per HPC Wire, which is a great newsletter for keeping up to date on HPC, SAS Institute has been busy on this front, partnering with EMC Greenplum and Teradata (who also acquired SAS partner Aster Data to gain a much-needed foothold in the MR/SQL space).

Changes in R software

The newest version of R is now available for download. R 2.13 is ready!!

http://cran.at.r-project.org/bin/windows/base/CHANGES.R-2.13.0.html

Windows-specific changes to R

CHANGES IN R VERSION 2.13.0

WINDOWS VERSION

  • Windows 2000 is no longer supported. (It went end-of-life in July 2010.)

NEW FEATURES

  • win_iconv has been updated: this version has a change in the behaviour with BOMs on UTF-16 and UTF-32 files – it removes BOMs when reading and adds them when writing. (This is consistent with Microsoft applications, but Unix versions of iconv usually ignore them.)
  • Support for repository type win64.binary (used for 64-bit Windows binaries for R 2.11.x only) has been removed.
  • The installers no longer put an ‘Uninstall’ item on the start menu (to conform to current Microsoft UI guidelines).
  • Running R always sets the environment variable R_ARCH (as it does on a Unix-alike from the shell-script front-end).
  • The defaults for options("browser") and options("pdfviewer") are now set from environment variables R_BROWSER and R_PDFVIEWER respectively (as on a Unix-alike). A value of "false" suppresses display (even if there is no false.exe present on the path).
  • If options("install.lock") is set to TRUE, binary package installs are protected against failure similar to the way source package installs are protected.
  • file.exists() and unlink() have more support for files > 2GB.
  • The versions of R.exe in ‘R_HOME/bin/i386,x64/bin’ now support options such as R --vanilla CMD: there is no comparable interface for ‘Rcmd.exe’.
  • A few more file operations will now work with >2GB files.
  • The environment variable R_HOME in an R session now uses slash as the path separator (as it always has when set by Rcmd.exe).
  • Rgui has a new menu item for the PDF ‘Sweave User Manual’.

DEPRECATED

  • zip.unpack() is deprecated: use unzip().

INSTALLATION

  • There is support for libjpeg-turbo via setting JPEGDIR to that value in ‘MkRules.local’. Support for jpeg-6b has been removed.
  • The sources now work with libpng-1.5.1, jpegsrc.v8c (which are used in the CRAN builds) and tiff-4.0.0beta6 (CRAN builds use 3.9.1). It is possible that they no longer work with older versions than libpng-1.4.5.

BUG FIXES

  • Workaround for the incorrect values given by Windows’ casinh function on the branch cuts.
  • Bug fixes for drawing raster objects on windows(). The symptom was the occasional raster image not being drawn, especially when drawing multiple raster images in a single expression. Thanks to Michael Sumner for report and testing.
  • Printing extremely long string values could overflow the stack and cause the GUI to crash. (PR#14543)

Tonnes of changes!!

http://cran.at.r-project.org/src/base/NEWS

CHANGES IN R VERSION 2.13.0:

  SIGNIFICANT USER-VISIBLE CHANGES:

    • replicate() (by default) and vapply() (always) now return a
      higher-dimensional array instead of a matrix in the case where
      the inner function value is an array of dimension >= 2.

    • Printing and formatting of floating point numbers is now using
      the correct number of digits, where it previously rarely differed
      by a few digits. (See “scientific” entry below.)  This affects
      _many_ *.Rout.save checks in packages.
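
A quick sketch of the array-returning behaviour described in the first item above. It assumes a toy function (scaled_identity is a made-up name) that returns a 2 x 2 matrix on every call. The sapply() option simplify = "array" shown for comparison is the one listed under NEW FEATURES.

```r
# Toy function: every call returns a 2 x 2 matrix.
scaled_identity <- function(i) diag(2) * i

# Default sapply() flattens each 2 x 2 result into a column: a 4 x 3 matrix.
as_matrix <- sapply(1:3, scaled_identity)

# simplify = "array" keeps the matrix shape: a 2 x 2 x 3 array.
as_array <- sapply(1:3, scaled_identity, simplify = "array")

# vapply() always behaves this way; FUN.VALUE declares the expected shape.
via_vapply <- vapply(1:3, scaled_identity, FUN.VALUE = matrix(0, 2, 2))

# replicate() uses the array simplification by default.
via_replicate <- replicate(3, diag(2))

print(dim(as_matrix))
print(dim(as_array))
print(dim(via_vapply))
print(dim(via_replicate))
```

Code written for older versions that expected a matrix back from replicate() or vapply() in this situation may need adjusting, or can pass simplify = TRUE to replicate() to keep the old flattening.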

  NEW FEATURES:

    • normalizePath() has been moved to the base package (from utils):
      this is so it can be used by library() and friends.

      It now does tilde expansion.

      It gains new arguments winslash (to select the separator on
      Windows) and mustWork to control the action if a canonical path
      cannot be found.

    • The previously barely documented limit of 256 bytes on a symbol
      name has been raised to 10,000 bytes (a sanity check).  Long
      symbol names can sometimes occur when deparsing expressions (for
      example, in model.frame).

    • reformulate() gains an intercept argument.

    • cmdscale(add = FALSE) now uses the more common definition that
      there is a representation in n-1 or less dimensions, and only
      dimensions corresponding to positive eigenvalues are used.
      (Avoids confusion such as PR#14397.)

    • Names used by c(), unlist(), cbind() and rbind() are marked with
      an encoding when this can be ascertained.

    • R colours are now defined to refer to the sRGB color space.

      The PDF, PostScript, and Quartz graphics devices record this
      fact.  X11 (and Cairo) and Windows just assume that your screen
      conforms.

    • system.file() gains a mustWork argument (suggestion of Bill
      Dunlap).

    • new.env(hash = TRUE) is now the default.

    • list2env(envir = NULL) defaults to hashing (with a suitably sized
      environment) for lists of more than 100 elements.

    • text() gains a formula method.

    • IQR() now has a type argument which is passed to quantile().

    • as.vector(), as.double() etc duplicate less when they leave the
      mode unchanged but remove attributes.

      as.vector(mode = "any") no longer duplicates when it does not
      remove attributes.  This helps memory usage in matrix() and
      array().

      matrix() duplicates less if data is an atomic vector with
      attributes such as names (but no class).

      dim(x) <- NULL duplicates less if x has neither dimensions nor
      names (since this operation removes names and dimnames).

    • setRepositories() gains an addURLs argument.

    • chisq.test() now also returns a stdres component, for
      standardized residuals (which have unit variance, unlike the
      Pearson residuals).

    • write.table() and friends gain a fileEncoding argument, to
      simplify writing files for use on other OSes (e.g. a spreadsheet
      intended for Windows or Mac OS X Excel).

    • Assignment expressions of the form foo::bar(x) <- y and
      foo:::bar(x) <- y now work; the replacement functions used are
      foo::`bar<-` and foo:::`bar<-`.

    • Sys.getenv() gains a names argument so Sys.getenv(x, names =
      FALSE) can replace the common idiom of as.vector(Sys.getenv()).
      The default has been changed to not name a length-one result.

    • Lazy loading of environments now preserves attributes and locked
      status. (The locked status of bindings and active bindings are
      still not preserved; this may be addressed in the future).

    • options("install.lock") may be set to FALSE so that
      install.packages() defaults to --no-lock installs, or (on
      Windows) to TRUE so that binary installs implement locking.

    • sort(partial = p) for large p now tries Shellsort if quicksort is
      not appropriate and so works for non-numeric atomic vectors.

    • sapply() gets a new option simplify = "array" which returns a
      “higher rank” array instead of just a matrix when FUN() returns a
      dim() length of two or more.

      replicate() has this option set by default, and vapply() now
      behaves that way internally.

    • aperm() becomes S3 generic and gets a table method which
      preserves the class.

    • merge() and as.hclust() methods for objects of class "dendrogram"
      are now provided.

    • as.POSIXlt.factor() now passes ... to the character method
      (suggestion of Joshua Ulrich).

    • The character method of as.POSIXlt() now tries to find a format
      that works for all non-NA inputs, not just the first one.

    • str() now has a method for class "Date" analogous to that for
      class "POSIXt".

    • New function file.link() to create hard links on those file
      systems (POSIX, NTFS but not FAT) that support them.

    • New Summary() group method for class "ordered" implements min(),
      max() and range() for ordered factors.

    • mostattributes<-() now consults the "dim" attribute and not the
      dim() function, making it more useful for objects (such as data
      frames) from classes with methods for dim().  It also uses
      attr<-() in preference to the generics name<-(), dim<-() and
      dimnames<-().  (Related to PR#14469.)

    • There is a new option "browserNLdisabled" to disable the use of
      an empty (e.g. via the ‘Return’ key) as a synonym for c in
      browser() or n under debug().  (Wish of PR#14472.)

    • example() gains optional new arguments character.only and
      give.lines enabling programmatic exploration.

    • serialize() and unserialize() are no longer described as
      ‘experimental’.  The interface is now regarded as stable,
      although the serialization format may well change in future
      releases.  (serialize() has a new argument version which would
      allow the current format to be written if that happens.)

      New functions saveRDS() and readRDS() are public versions of the
      ‘internal’ functions .saveRDS() and .readRDS() made available for
      general use.  The dot-name versions remain available as several
      package authors have made use of them, despite the documentation.

      saveRDS() supports compress = "xz".

    • Many functions when called with a not-open connection will now
      ensure that the connection is left not-open in the event of
      error.  These include read.dcf(), dput(), dump(), load(),
      parse(), readBin(), readChar(), readLines(), save(), writeBin(),
      writeChar(), writeLines(), .readRDS(), .saveRDS() and
      tools::parse_Rd(), as well as functions calling these.

    • Public functions find.package() and path.package() replace the
      internal dot-name versions.

    • The default method for terms() now looks for a "terms" attribute
      if it does not find a "terms" component, and so works for model
      frames.

    • httpd() handlers receive an additional argument containing the
      full request headers as a raw vector (this can be used to parse
      cookies, multi-part forms etc.). The recommended full signature
      for handlers is therefore function(url, query, body, headers,
      ...).

    • file.edit() gains a fileEncoding argument to specify the encoding
      of the file(s).

    • The format of the HTML package listings has changed.  If there is
      more than one library tree, a table of links to libraries is
      provided at the top and bottom of the page.  Where a library
      contains more than 100 packages, an alphabetic index is given at
      the top of the section for that library.  (As a consequence,
      package names are now sorted case-insensitively whatever the
      locale.)

    • isSeekable() now returns FALSE on connections which have
      non-default encoding.  Although documented to record if ‘in
      principle’ the connection supports seeking, it seems safer to
      report FALSE when it may not work.

    • R CMD REMOVE and remove.packages() now remove file R.css when
      removing all remaining packages in a library tree.  (Related to
      the wish of PR#14475: note that this file is no longer
      installed.)

    • unzip() now has an unzip argument like zip.file.extract().  This
      allows an external unzip program to be used, which can be useful
      to access features supported by Info-ZIP's unzip version 6 which
      is now becoming more widely available.

    • There is a simple zip() function, as wrapper for an external zip
      command.

    • bzfile() connections can now read from concatenated bzip2 files
      (including files written with bzfile(open = "a")) and files
      created by some other compressors (such as the example of
      PR#14479).

    • The primitive function c() is now of type BUILTIN.

    • plot(<dendrogram>, .., nodePar=*) now obeys an optional xpd
      specification (allowing clipping to be turned off completely).

    • nls(algorithm="port") now shares more code with nlminb(), and is
      more consistent with the other nls() algorithms in its return
      value.

    • xz has been updated to 5.0.1 (very minor bugfix release).

    • image() has gained a logical useRaster argument allowing it to
      use a bitmap raster for plotting a regular grid instead of
      polygons. This can be more efficient, but may not be supported by
      all devices. The default is FALSE.

    • list.files()/dir() gains a new argument include.dirs to include
      directories in the listing when recursive = TRUE.

    • New function list.dirs() lists all directories (even empty
      ones).

    • file.copy() now (by default) copies read/write/execute
      permissions on files, moderated by the current setting of
      Sys.umask().

    • Sys.umask() now accepts mode = NA and returns the current umask
      value (visibly) without changing it.

    • There is a ! method for classes "octmode" and "hexmode": this
      allows xor(a, b) to work if both a and b are from one of those
      classes.

    • as.raster() no longer fails for vectors or matrices containing
      NAs.

    • New hook "before.new.plot" allows functions to be run just before
      advancing the frame in plot.new, which is potentially useful for
      custom figure layout implementations.

    • Package tools has a new function compactPDF() to try to reduce
      the size of PDF files _via_ qpdf or gs.

    • tar() has a new argument extra_flags.

    • dotchart() accepts more general objects x such as 1D tables which
      can be coerced by as.numeric() to a numeric vector, with a
      warning since that might not be appropriate.

    • The previously internal function create.post() is now exported
      from utils, and the documentation for bug.report() and
      help.request() now refer to that for create.post().

      It has a new method = "mailto" on Unix-alikes similar to that on
      Windows: it invokes a default mailer via open (Mac OS X) or
      xdg-open or the default browser (elsewhere).

      The default for ccaddress is now getOption("ccaddress") which is
      by default unset: using the username as a mailing address
      nowadays rarely works as expected.

    • The default for options("mailer") is now "mailto" on all
      platforms.

    • unlink() now does tilde-expansion (like most other file
      functions).

    • file.rename() now allows vector arguments (of the same length).

    • The "glm" method for logLik() now returns an "nobs" attribute
      (which stats4::BIC() assumed it did).

      The "nls" method for logLik() gave incorrect results for zero
      weights.

    • There is a new generic function nobs() in package stats, to
      extract from model objects a suitable value for use in BIC
      calculations.  An S4 generic derived from it is defined in
      package stats4.

    • Code for S4 reference-class methods is now examined for possible
      errors in non-local assignments.

    • findClasses, getGeneric, findMethods and hasMethods are revised
      to deal consistently with the package= argument and be consistent
      with soft namespace policy for finding objects.

    • tools::Rdiff() now has the option to return not only the status
      but a character vector of observed differences (which are still
      by default sent to stdout).

    • The startup environment variables R_ENVIRON_USER, R_ENVIRON,
      R_PROFILE_USER and R_PROFILE are now treated more consistently.
      In all cases an empty value is considered to be set and will stop
      the default being used, and for the last two tilde expansion is
      performed on the file name.  (Note that setting an empty value is
      probably impossible on Windows.)

    • Using R --no-environ CMD, R --no-site-file CMD or R
      --no-init-file CMD sets environment variables so these settings
      are passed on to child R processes, notably those run by INSTALL,
      check and build. R --vanilla CMD sets these three options (but
      not --no-restore).

    • smooth.spline() is somewhat faster.  With cv=NA it allows some
      leverage computations to be skipped.

    • The internal (C) function scientific(), at the heart of R's
      format.info(x), format(x), print(x), etc, for numeric x, has been
      re-written in order to provide slightly more correct results,
      fixing PR#14491, notably in border cases including when digits >=
      16, thanks to substantial contributions (code and experiments)
      from Petr Savicky.  This affects a noticeable amount of numeric
      output from R.

    • A new function grepRaw() has been introduced for finding subsets
      of raw vectors. It supports both literal searches and regular
      expressions.

    • Package compiler is now provided as a standard package.  See
      ?compiler::compile for information on how to use the compiler.
      This package implements a byte code compiler for R: by default
      the compiler is not used in this release.  See the ‘R
      Installation and Administration Manual’ for how to compile the
      base and recommended packages.

    • Providing an exportPattern directive in a NAMESPACE file now
      causes classes to be exported according to the same pattern, for
      example the default from package.skeleton() to specify all names
      starting with a letter.  An explicit directive to
      exportClassPattern will still over-ride.

    • There is an additional marked encoding "bytes" for character
      strings.  This is intended to be used for non-ASCII strings which
      should be treated as a set of bytes, and never re-encoded as if
      they were in the encoding of the current locale: useBytes = TRUE
      is automatically selected in functions such as writeBin(),
      writeLines(), grep() and strsplit().

      Only a few character operations are supported (such as substr()).

      Printing, format() and cat() will represent non-ASCII bytes in
      such strings by a \xab escape.

    • The new function removeSource() removes the internally stored
      source from a function.

    • "srcref" attributes now include two additional line number
      values, recording the line numbers in the order they were parsed.

    • New functions have been added for source reference access:
      getSrcFilename(), getSrcDirectory(), getSrcLocation() and
      getSrcref().

    • Sys.chmod() has an extra argument use_umask which defaults to
      true and restricts the file mode by the current setting of umask.
      This means that all the R functions which manipulate
      file/directory permissions by default respect umask, notably R
      CMD INSTALL.

    • tempfile() has an extra argument fileext to create a temporary
      filename with a specified extension.  (Suggestion and initial
      implementation by Dirk Eddelbuettel.)

      There are improvements in the way Sweave() and Stangle() handle
      non-ASCII vignette sources, especially in a UTF-8 locale: see
      ‘Writing R Extensions’ which now has a subsection on this topic.

    • factanal() now returns the rotation matrix if a rotation such as
      "promax" is used, and hence factor correlations are displayed.
      (Wish of PR#12754.)

    • The gctorture2() function provides a more refined interface to
      the GC torture process.  Environment variables R_GCTORTURE,
      R_GCTORTURE_WAIT, and R_GCTORTURE_INHIBIT_RELEASE can also be
      used to control the GC torture process.

    • file.copy(from, to) no longer regards it as an error to supply a
      zero-length from: it now simply does nothing.

    • rstandard.glm gains a type argument which can be used to request
      standardized Pearson residuals.

    • A start on a Turkish translation, thanks to Murat Alkan.

    • .libPaths() calls normalizePath(winslash = "/") on the paths:
      this helps (usually) present them in a user-friendly form and
      should detect duplicate paths accessed via different symbolic
      links.
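
To make a few of the NEW FEATURES entries concrete, here is a small hedged sketch that touches tempfile(fileext =), saveRDS()/readRDS(), namespace-qualified replacement calls, Sys.getenv(names = FALSE), nobs(), the stdres component of chisq.test(), and the Summary methods for ordered factors. The built-in mtcars data set, the toy count table and every variable name are my own choices for illustration, not part of the changelog.

```r
# tempfile() gains fileext; saveRDS()/readRDS() are the public serialization pair.
path <- tempfile(fileext = ".rds")
saveRDS(mtcars, path, compress = "xz")
restored <- readRDS(path)
round_trip_ok <- identical(restored, mtcars)
unlink(path)

# Replacement calls may be namespace-qualified: this uses base::`names<-`.
x <- 1:2
base::names(x) <- c("first", "second")

# names = FALSE returns a plain, unnamed character vector.
env_values <- Sys.getenv(c("HOME", "PATH"), names = FALSE)

# nobs() extracts the number of observations used in a model fit.
fit <- glm(am ~ wt, data = mtcars, family = binomial)
n_used <- nobs(fit)

# chisq.test() also returns standardized residuals as the stdres component.
counts <- matrix(c(12, 5, 7, 9), nrow = 2)
std_resid <- chisq.test(counts)$stdres

# min(), max() and range() work on ordered factors.
sizes <- factor(c("small", "large", "medium"),
                levels = c("small", "medium", "large"), ordered = TRUE)
largest <- max(sizes)

print(round_trip_ok)
print(x)
print(n_used)
print(std_resid)
print(largest)
```

Unlike save() and load(), readRDS() returns the object rather than restoring it under its original name, so the caller chooses what to call it.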

  SWEAVE CHANGES:

    • Sweave() has options to produce PNG and JPEG figures, and to use
      a custom function to open a graphics device (see ?RweaveLatex).
      (Based in part on the contribution of PR#14418.)

    • The default for Sweave() is to produce only PDF figures (rather
      than both EPS and PDF).

    • Environment variable SWEAVE_OPTIONS can be used to supply
      defaults for existing or new options to be applied after the
      Sweave driver setup has been run.

    • The Sweave manual is now included as a vignette in the utils
      package.

    • Sweave() handles keep.source=TRUE much better: it could duplicate
      some lines and omit comments. (Reported by John Maindonald and
      others.)

  C-LEVEL FACILITIES:

    • Because they use a C99 interface which a C++ compiler is not
      required to support, Rvprintf and REvprintf are only defined by
      R_ext/Print.h in C++ code if the macro R_USE_C99_IN_CXX is
      defined when it is included.

    • pythag duplicated the C99 function hypot.  It is no longer
      provided, but is used as a substitute for hypot in the very
      unlikely event that the latter is not available.

    • R_inspect(obj) and R_inspect3(obj, deep, pvec) are (hidden)
      C-level entry points to the internal inspect function and can be
      used for C-level debugging (e.g., in conjunction with the p
      command in gdb).

    • Compiling R with --enable-strict-barrier now also enables
      additional checking for use of unprotected objects. In
      combination with gctorture() or gctorture2() and a C-level
      debugger this can be useful for tracking down memory protection
      issues.

  UTILITIES:

    • R CMD Rdiff is now implemented in R on Unix-alikes (as it has
      been on Windows since R 2.12.0).

    • R CMD build no longer does any cleaning in the supplied package
      directory: all the cleaning is done in the copy.

      It has a new option --install-args to pass arguments to R CMD
      INSTALL for --build (but not when installing to rebuild
      vignettes).

      There is a new option, --resave-data, to call
      tools::resaveRdaFiles() on the data directory, to compress
      tabular files (.tab, .csv etc) and to convert .R files to .rda
      files.  The default, --resave-data=gzip, is to do so in a way
      compatible even with years-old versions of R, but better
      compression is given by --resave-data=best, requiring R >=
      2.10.0.

      It now adds a datalist file for data directories of more than
      1Mb.

      Patterns in .Rbuildignore are now also matched against all
      directory names (including those of empty directories).

      There is a new option, --compact-vignettes, to try reducing the
      size of PDF files in the inst/doc directory.  Currently this
      tries qpdf: other options may be used in future.

      When re-building vignettes and an inst/doc/Makefile file is found,
      make clean is run if the makefile has a clean: target.

      After re-building vignettes the default clean-up operation will
      remove any directories (and not just files) created during the
      process: e.g. one package created a .R_cache directory.

      Empty directories are now removed unless the option
      --keep-empty-dirs is given (and a few packages do deliberately
      include empty directories).

      If there is a field BuildVignettes in the package DESCRIPTION
      file with a false value, re-building the vignettes is skipped.

    • R CMD check now also checks for filenames that are
      case-insensitive matches to Windows' reserved file names with
      extensions, such as nul.Rd, as these have caused problems on some
      Windows systems.

      It checks for inefficiently saved data/*.rda and data/*.RData
      files, and reports on those larger than 100Kb.  A more complete
      check (including of the type of compression, but potentially much
      slower) can be switched on by setting environment variable
      _R_CHECK_COMPACT_DATA2_ to TRUE.

      The types of files in the data directory are now checked, as
      packages are _still_ misusing it for non-R data files.

      It now extracts and runs the R code for each vignette in a
      separate directory and R process: this is done in the package's
      declared encoding.  Rather than call tools::checkVignettes(), it
      calls tools::buildVignettes() to see if the vignettes can be
      re-built as they would be by R CMD build.  Option --use-valgrind
      now applies only to these runs, and not when running code to
      rebuild the vignettes.  This version does a much better job of
      suppressing output from successful vignette tests.

      The 00check.log file is a more complete record of what is output
      to stdout: in particular contains more details of the tests.

      It now checks all syntactically valid Rd usage entries, and warns
      about assignments (unless these give the usage of replacement
      functions).

      .tar.xz compressed tarballs are now allowed, if tar supports them
      (and setting environment variable TAR to internal ensures so on
      all platforms).

    • R CMD check now warns if it finds inst/doc/makefile, and R CMD
      build renames such a file to inst/doc/Makefile.

  INSTALLATION:

    • Installing R no longer tries to find perl, and R CMD no longer
      tries to substitute a full path for awk nor perl - this was a
      legacy from the days when they were used by R itself.  Because a
      couple of packages do use awk, it is set as the make (rather than
      environment) variable AWK.

    • make check will now fail if there are differences from the
      reference output when testing package examples and if environment
      variable R_STRICT_PACKAGE_CHECK is set to a true value.

    • The C99 double complex type is now required.

      The C99 complex trigonometric functions (such as csin) are not
      currently required (FreeBSD lacks most of them): substitutes are
      used if they are missing.

    • The C99 system call va_copy is now required.

    • If environment variable R_LD_LIBRARY_PATH is set during
      configuration (for example in config.site) it is used unchanged
      in file etc/ldpaths rather than being appended to.

    • configure looks for support for OpenMP and if found compiles R
      with appropriate flags and also makes them available for use in
      packages: see ‘Writing R Extensions’.

      This is currently experimental, and is only used in R with a
      single thread for colSums() and colMeans().  Expect it to be more
      widely used in later versions of R.

      This can be disabled by the --disable-openmp flag.

  PACKAGE INSTALLATION:

    • R CMD INSTALL --clean now removes copies of a src directory which
      are created when multiple sub-architectures are in use.
      (Following a comment from Berwin Turlach.)

    • File R.css is now installed on a per-package basis (in the
      package's html directory) rather than in each library tree, and
      this is used for all the HTML pages in the package.  This helps
      when installing packages with static HTML pages for use on a
      webserver.  It will also allow future versions of R to use
      different stylesheets for the packages they install.

    • A top-level file .Rinstignore in the package sources can list (in
      the same way as .Rbuildignore) files under inst that should not
      be installed.  (Why should there be any such files?  Because all
      the files needed to re-build vignettes need to be under inst/doc,
      but they may not need to be installed.)

    • R CMD INSTALL has a new option --compact-docs to compact any PDFs
      under the inst/doc directory.  Currently this uses qpdf, which
      must be installed (see ‘Writing R Extensions’).

    • There is a new option --lock which can be used to cancel the
      effect of --no-lock or --pkglock earlier on the command line.

    • Option --pkglock can now be used with more than one package, and
      is now the default if only one package is specified.

    • Argument lock of install.packages() can now be used for Mac binary
      installs as well as for Windows ones.  The value "pkglock" is now
      accepted, as well as TRUE and FALSE (the default).

    • There is a new option --no-clean-on-error for R CMD INSTALL to
      retain a partially installed package for forensic analysis.

    • Packages with names ending in . are not portable since Windows
      does not work correctly with such directory names.  This is now
      warned about in R CMD check, and will not be allowed in R 2.14.x.

    • The vignette indices are more comprehensive (in the style of
      browseVignettes()).

  DEPRECATED & DEFUNCT:

    • require(save = TRUE) is defunct, and use of the save argument is
      deprecated.

    • R CMD check --no-latex is defunct: use --no-manual instead.

    • R CMD Sd2Rd is defunct.

    • The gamma argument to hsv(), rainbow(), and rgb2hsv() is
      deprecated and no longer has any effect.

    • The previous options for R CMD build --binary (--auto-zip,
      --use-zip-data and --no-docs) are deprecated (or defunct): use
      the new option --install-args instead.

    • When a character value is used for the EXPR argument in switch(),
      only a single unnamed alternative value is now allowed.

    • The wrapper utils::link.html.help() is no longer available.

    • Zip-ing data sets in packages (and hence R CMD INSTALL options
      --use-zip-data and --auto-zip, as well as the ZipData: yes field
      in a DESCRIPTION file) is defunct.

      Installed packages with zip-ed data sets can still be used, but a
      warning that they should be re-installed will be given.

    • The ‘experimental’ alternative specification of a name space via
      .Export() etc is now defunct.

    • The option --unsafe to R CMD INSTALL is deprecated: use the
      identical option --no-lock instead.

    • The entry point pythag in Rmath.h is deprecated in favour of the
      C99 function hypot.  A wrapper for hypot is provided for R 2.13.x
      only.

    • Direct access to the "source" attribute of functions is
      deprecated; use deparse(fn, control="useSource") to access it,
      and removeSource(fn) to remove it.

    • R CMD build --binary is now formally deprecated: R CMD INSTALL
      --build has long been the preferred alternative.

    • Single-character package names are deprecated (and R is already
      disallowed to avoid confusion in Depends: fields).
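
To make the switch() item above concrete: with a character EXPR, the one unnamed alternative acts as the fall-through default. This is only a minimal sketch; the helper name pick_colour and its values are invented for illustration and do not come from the NEWS file.

```r
# Hypothetical helper (not part of R): maps a status label to a colour.
pick_colour <- function(label) {
  switch(label,
         ok   = "green",
         warn = "orange",
         "grey")   # the single unnamed alternative is used as the default
}

pick_colour("ok")        # matches the named alternative 'ok'
pick_colour("unknown")   # no name matches, so the default is returned
```

Keeping exactly one unnamed alternative, placed last, is the form the entry above describes as still allowed.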

  BUG FIXES:

    • drop.terms and the [ method for class "terms" no longer add back
      an intercept.  (Reported by Niels Hansen.)

    • aggregate preserves the class of a column (e.g. a date) under
      some circumstances where it discarded the class previously.

    • p.adjust() now always returns a vector result, as documented.  In
      previous versions it copied attributes (such as dimensions) from
      the p argument: now it only copies names.

    • On PDF and PostScript devices, a line width of zero was recorded
      verbatim and this caused problems for some viewers (a very thin
      line combined with a non-solid line dash pattern could also cause
      a problem).  On these devices, the line width is now limited at
      0.01 and for very thin lines with complex dash patterns the
      device may force the line dash pattern to be solid.  (Reported by
      Jari Oksanen.)

    • The str() method for class "POSIXt" now gives sensible output for
      0-length input.

    • The one- and two-argument complex maths functions failed to warn
      if NAs were generated (as their numeric analogues do).

    • Added .requireCachedGenerics to the dont.mind list for library()
      to avoid warnings about duplicates.

    • $<-.data.frame messed with the class attribute, breaking any S4
      subclass.  The S4 data.frame class now has its own $<- method,
      and turns dispatch on for this primitive.

    • Map() did not look up a character argument f in the correct
      frame, thanks to lazy evaluation.  (PR#14495)

    • file.copy() did not tilde-expand from and to when to was a
      directory.  (PR#14507)

    • It was possible (but very rare) for the loading test in R CMD
      INSTALL to crash a child R process and so leave around a lock
      directory and a partially installed package.  That test is now
      done in a separate process.

    • plot(<formula>, data=<matrix>,..) now works in more cases;
      similarly for points(), lines() and text().

    • edit.default() contained a manual dispatch for matrices (the
      "matrix" class didn't really exist when it was written).  This
      caused an infinite recursion in the no-GUI case and has now been
      removed.

    • data.frame(check.rows = TRUE) sometimes worked when it should
      have detected an error.  (PR#14530)

    • scan(sep= , strip.white=TRUE) sometimes stripped trailing spaces
      from within quoted strings.  (The real bug in PR#14522.)

    • The rank-correlation methods for cor() and cov() with use =
      "complete.obs" computed the ranks before removing missing values,
      whereas the documentation implied incomplete cases were removed
      first.  (PR#14488)

      They also failed for 1-row matrices.

    • The perpendicular adjustment used in placing text and expressions
      in the margins of plots was not scaled by par("mex"). (Part of
      PR#14532.)

    • Quartz Cocoa device now catches any Cocoa exceptions that occur
      during the creation of the device window to prevent crashes.  It
      also imposes a limit of 144 ft^2 on the area used by a window to
      catch user errors (unit misinterpretation) early.

    • The browser (invoked by debug(), browser() or otherwise) would
      display attributes such as "wholeSrcref" that were intended for
      internal use only.

    • R's internal filename completion now properly handles filenames
      with spaces in them even when the readline library is used.  This
      resolves PR#14452 provided the internal filename completion is
      used (e.g., by setting rc.settings(files = TRUE)).

    • Inside uniroot(f, ...), -Inf function values are now replaced by
      a maximally *negative* value.

    • rowsum() could silently over/underflow on integer inputs
      (reported by Bill Dunlap).

    • as.matrix() did not handle "dist" objects with zero rows.
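
Two of the fixes above (p.adjust() and the rank-correlation methods of cor()) are easy to check by hand. The sketch below uses made-up toy numbers and assumes an R version in which these fixes are present.

```r
# p.adjust(): names are kept, other attributes (such as dim) are dropped.
p <- c(a = 0.01, b = 0.04, c = 0.03)
adj <- p.adjust(p, method = "bonferroni")
names(adj)                        # names carried over from p

m <- matrix(c(0.01, 0.04, 0.03, 0.20), nrow = 2)
adj_m <- p.adjust(m, method = "bonferroni")
is.null(dim(adj_m))               # result is a plain vector, not a matrix

# Rank correlation with use = "complete.obs": incomplete cases are
# removed first, then ranks are computed on the remaining rows.
x <- c(1, 2, NA, 4, 5, 6)
y <- c(2, 1, 3, NA, 6, 5)
ok <- complete.cases(x, y)
by_hand  <- cor(rank(x[ok]), rank(y[ok]))
built_in <- cor(x, y, method = "spearman", use = "complete.obs")
all.equal(by_hand, built_in)
```

The by-hand version simply spells out the order of operations (drop, then rank) that the documentation implied.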

CHANGES IN R VERSION 2.12.2 patched:

  NEW FEATURES:

    • max() and min() work harder to ensure that NA has precedence over
      NaN, so e.g. min(NaN, NA) is NA.  (This was not previously
      documented except for within a single numeric vector, where
      compiler optimizations often defeated the code.)
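
A small way to see the NA-over-NaN rule described above, passing the two values as separate arguments. This is a sketch for an R version that includes the change; is.na() alone cannot tell the two apart, so is.nan() is printed as well.

```r
res_min <- min(NaN, NA)
res_max <- max(NA, NaN)

is.na(res_min)    # TRUE for NA and for NaN, so not conclusive on its own
is.nan(res_min)   # this is the check that separates NaN from NA
is.nan(res_max)
```

According to the entry above, both results should be NA rather than NaN.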

  BUG FIXES:

    • A change to the C function R_tryEval had broken error messages in
      S4 method selection; the error message is now printed.

    • PDF output with a non-RGB color model used RGB for the line
      stroke color.  (PR#14511)

    • stats4::BIC() assumed without checking that an object of class
      "logLik" has an "nobs" attribute: glm() fits did not and so BIC()
      failed for them.

    • In some circumstances a one-sided mantelhaen.test() reported the
      p-value for the wrong tail.  (PR#14514)

    • Passing the invalid value lty = NULL to axis() sent an invalid
      value to the graphics device, and might cause the device to
      segfault.

    • Sweave() with concordance=TRUE could lead to invalid PDF files;
      Sweave.sty has been updated to avoid this.

    • Non-ASCII characters in the titles of help pages were not
      rendered properly in some locales, and could cause errors or
      warnings.

    • checkRd() gave a spurious error if the \href macro was used.

 

 

The long tail of the internet

On a whim, I took the all-time stats of my blog posts (more than 1000 posts) and tried to plot their distribution.

Basically I copied and pasted all the data into a Google Docs spreadsheet, and I created dummy codes (like URL1, URL2… URL500).

Next I  downloaded the….

I wasn't in the mood for downloading and uploading stuff, so I decided to use ggplot2 through Jeroen’s application at http://www.stat.ucla.edu/~jeroen/

I used the mirror server that Dataspora provides as I have had latency issues with Jeroen’s website.

I got this error while trying to connect the Dataspora App to my Google spreadsheet

The page you have requested cannot be displayed. Another site was requesting access to your Google Account, but sent a malformed request. Please contact the site that you were trying to use when you received this message to inform them of the error. A detailed error message follows:

The site “http://dataspora.com” has not been registered.

Oh dear! Back to Jeroen’s /UCLA’s page.

http://rweb.stat.ucla.edu/ggplot2/

I get this warning but it still manages to log in

This website has not registered with Google to establish a secure connection for authorization requests. We recommend that you continue the process only if you trust the following destination:

http://rweb.stat.ucla.edu/R/googleLogin?domain=rweb.stat.ucla.edu

Wow, it works! That's cloud computing now, so I wonder why Google and Amazon continue to ignore rApache and Jeroen’s cloud app. Surely their Google Fusion Tables can always be improved or tweaked. Not to mention the next-gen version of R, which will have its own server.

Pretty cool screenshot (but click to see more)

I get the following pretty graph. Hadley Wickham would be ashamed of me by now.

What went wrong? Well, one page has 36,000 views. Scale is the key to graphical coherence. So I redo it: delete the home page in the Google spreadsheet, reimport, replot. (I didn't know how to modify data in the cloud app; maybe we need a cloud plyr.) Then I redo it again, as I have a big outlier: the Top 10 Statistical GUI article (which ironically has only 5 GUIs in it, but hush, don't tell the high-quality search engine).

So again, belatedly, I discover something called a layer in ggplot.

Base Graphics engine has really spoilt me to write short functions for plots.

I give up. I rather prefer hist(). I go to my favorite GUI, Rattle, but it has some dating issues with the DLL of GTK+.

So I go to John Fox’s simple GUI. R Commander- is the best GUI if you use Occam’s Razor, and I am using Occam’s Chainsaw now.

I get the analysis I want in 12 secs
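
For anyone who would rather type than click, the same quick look can be sketched in base R. The snippet uses only the ten largest view counts from the table further down; the variable names are mine, and pdf(NULL) is there only so the code runs without opening a plot window.

```r
# View counts of the ten most-viewed entries, copied from the table in this post.
views <- c(36185, 8264, 2166, 2162, 2118, 1902, 1770, 1446, 1386, 1204)

# Share of these views that belongs to the single largest entry (the home page).
top_share <- views[1] / sum(views)

pdf(NULL)   # null device: nothing is written to disk; drop this line to see the plots
# Raw scale: the outlier stretches the axis and squeezes everything else.
hist(views, main = "Views (raw scale)", xlab = "Views")
# Log scale: the outlier no longer dominates the picture.
hist(log10(views), main = "Views (log10 scale)", xlab = "log10(views)")
invisible(dev.off())
```

Plotting on a log scale is one alternative to deleting the outlier rows by hand, which is what I ended up doing in the spreadsheet.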


Summary- ggplot is more complicated than the base graphics engine.

Deducer GUI is not as simple either

R Commander is the best GUI because it retains simplicity

Ignore long tail of internet only at your peril

Almost two-thirds of my daily traffic of 400+ comes from old archived content. That is why Search Engine Optimization and Alerts for Keywords are CRITICAL for any poor soul trying to write on a blog (which has neither journal-like prestige nor rewards).

If you make life easier for the search engine, it being a fair chap, rewards you well

Existing web traffic estimates like Comscore and Google Trends ignore this long tail

Comments are welcome (the data, 500 rows x 2 columns, is pasted below in case you can come up with a better analysis)

Since SAS has ignored web analytics and Google Analytics is hmm hmm,  this could be an area of opportunity for R developers as well to create a web analytics package.

Title Views
Home page 36,185
Top 10 Graphical User Interfaces in Statistical Software 8,264
Matlab-Mathematica-R and GPU Computing 2,166
Wealth = function (numeracy, memory recall) 2,162
The Top Statistical Softwares (GUI) 2,118
About DecisionStats 1,902
Libre Office 1,770
Using Facebook Analytics (Updated) 1,446
Windows Azure vs Amazon EC2 (and Google Storage) 1,386
Interview Hadley Wickham R Project Data Visualization Guru 1,204
Test drive a Chrome notebook. 1,201
Interview Professor John Fox Creator R Commander 1,190
Top ten RRReasons R is bad for you ? 1,178
SAS Institute files first lawsuit against WPS- Episode 1 1,131
R Package Creating 1,104
Interfaces to R 1,039
Using Red R- R with a Visual Interface 950
Google Maps – Jet Ski across Pacific Ocean 922
Norman Nie: R GUI and More 851
Not so AWkward after all: R GUI RKWard 805
Running R on Amazon EC2 786
Startups for Geeks 785
Creating a Blog Aggregator for free 749
Cloud Computing with R 676
Rapid Miner- R Extension 671
Parallel Programming using R in Windows 664
Revolution R for Linux 645
Red R 1.8- Pretty GUI 638
John Sall sets JMP 9 free to tango with R 601
Wordle.net 597
Funny Images from India 571
R is an epic fail or is it just overhyped 568
Great article on Notepad++ and R in R Journal 564
Certifications in Analytics and Business Intelligence 548
R Excel :Updated 542
Enterprise Linux rises rapidly:New Report 537
So which software is the best analytical software? Sigh- It depends 520
Funny Photo :It happens only In India 518
Creating 3D Graphs with Data in R 507
SPSS /PASW Certification – Free until Sept 15 497
Interview :Dr Graham Williams 476
GNU PSPP- The Open Source SPSS 474
Professors and Patches: For a Betterrrr R 467
Running R on Amazon EC2 :Windows 462
WPS response to SAS Lawsuit 458
R language on the GPU 450
KXEN and a Data Mining Survey 449
News on R Commercial Development -Rattle- R Data Mining Tool 449
WPS ( Alternative SAS Language Software) Pricing 447
Kill R? Wait a sec 445
SAS Institute lawsuit against WPS Episode 2 The Clone Wars 442
How to be a BAD blogger? 435
ROC Curve 431
Bulls ,Bears ,Tigers and Asses 424
Trrrouble in land of R…and Open Source Suggestions 422
Interview- BI Dashboards dMINE Sanjay Patel 417
Top Seven Reasons :Why Outsourcing is Bad for India 408
Interviews @Decisionstats 407
Running a R GUI,and parallel programming on Amazon EC2 394
Unbreakable Oracle Linux- and Unshakable-Libre Office- 393
IBM SPSS 19: Marketing Analytics and RFM 387
Analyzing SAS Institute-WPS Lawsuit 377
Hive Tutorial: Cloud Computing 377
R and Hadoop 374
Graphics Presentations 373
Sector/ Sphere – Faster than Hadoop/Mapreduce at Terasort 370
Benchmarking GNU R: DirkE’s view and a Ninja wishlist 363
Webfocus RStat: Pervasive BI using R 363
Open Source Business Intelligence: Pentaho and Jaspersoft 362
How to do Logistic Regression 362
CommeRcial R- Integration in software 359
So what’s new in R 2.12.0 357
Interview Michael J. A. Berry Data Miners, Inc 356
Data Mining through the Android 352
Newer version of Alternative SAS / WPS 2.4 launched 350
How to Analyze Wikileaks Data – R SPARQL 348
JMP 9 releasing on Oct 12 343
The R Online WikiBook 340
Hadley’s tutorials on R Visualization 340
Interview Tasso Argyros CTO Aster Data Systems 339
Parsing XML files easily 337
A Software Called Rattle 335
Which software do we buy? -It depends 329
Jim Goodnight on Open Source- and why he is right -sigh 328
SAS/Blades/Servers/ GPU Benchmarks 326
R Commander Plugins-20 and growing! 324
10 iPhone Apps you can actually use ( and dont have to pay for) 316
R Modeling with huge data 315
The Popularity of Data Analysis Software 315
Interview Donald Farmer Microsoft 307
Learning SAS for free 305
Comparing Base SAS and SPSS 304
Towards better Statistical Interfaces 302
Making NeW R 301
Using Code Snippets in Revolution R 300
R Apache – The next frontier of R Computing 298
Using JMP 9 and R together 297
Doing Time Series using a R GUI 295
Amazon announces Micro Instances for cloud computing 295
Top 5 Free Music Websites 295
Web R- Elastic R and RevoDeploy R 291
R for Stats : Updated 290
Heritage Health Prize- Data Mining Contest for 3mill USD 289
Google AppInventor -Android and Business Intelligence 281
Top R Interviews 278
An Introduction to Data Mining-online book 272
Interview Jim Davis SAS Institute 272
Economic: Indian Caste System -Simplification 271
Rattle Re-Introduced 271
KXEN – Automated Regression Modeling 267
Movie Review- Inglorious Basterds 267
Interview :Doug Savage ,Creator SavageChickens.com 261
IPSUR – A Free R Textbook 258
SAS with the GUI Enterprise Guide (Updated) 256
Trying out Google Prediction API from R 256
Segmenting Models : When and Why 253
Using R and Excel Together 253
R Oracle Data Mining 253
KNIME 253
Using PostgreSQL and MySQL databases in R 2.12 for Windows 250
Fighting Back -The Net, Social Media, Spam, Identity Theft, Terrorism 249
Libre Office (Beta) 3 Launched 248
India to make own DoS -citing cyber security 247
Interview Dominic Pouzin Data Applied 242
R releases new version R 2.9.2 240
SAS to launch SAS/IML with R ( updated) 239
Playing with Playwith- R Package for Interactive Data Visualizations 234
Predictive Analytics World Conference 231
Analytics and BI for small biz 231
Interview Jeanne Harris Co-Author -Analytics at Work and Competing with Analytics 230
Using R for Time Series in SAS 228
General Electric ‘s breach of the spirit and letter of integrity 227
Interview Luis Torgo Author Data Mining with R 222
Browser Based Model Creation 222
Interview James Dixon Pentaho 221
Thoughts on WPS, SAS , R 220
Choosing R for business – What to consider? 220
Buying SAS Institute 219
Google: Prediction API and other cool stuff 218
Interview : R For Stata Users 216
Viva Libre Office 216
Top 10 Games on Linux -sudo update 214
When China overtook India- using DEDUCER 214
KDD 2009 : Demos 211
Interview Dean Abbott Abbott Analytics 210
Statistically Speaking 203
Data Visualization using Tableau 203
SAS and JMP : Visual Data Discovery 203
High Performance Computing and R 200
Troubleshooting Rattle Installation- Data Mining R GUI 194
Google Realtime Live Updates on Egypt Yemen Tunisia Jordan.. 192
New Deal in Statistical Training 191
Interview Ken O Connor Business Intelligence Consultant 190
Karmic Koala versus Windows 7 189
Interview Shawn Kung Sr Director Aster Data 189
Pun on Putin 189
Towards better analytical software 188
Dryad- Microsoft’s answer to MR 188
Analyzing Indian – Chinese Relationships 188
LibreOffice News and Google Musings 186
Special Issue of JSS on R GUIs 184
Using Google Docs for Web Scraping 181
Using Reshape2 for transposing datasets in R 180
IBM Buys Netezza 180
Libreoffice 3.3 released 180
Google moving on from MapReduce: rest of world still catching up 179
Linux= Who did what and how much? 176
Interview Carole Jesse Experienced Analytics Professional 176
HIRE ME 175
Test Drive a Google Chrome Notebook: Last Two Days left 174
Q&A with David Smith, Revolution Analytics. 174
R , Ubuntu, RCmdr Updates 173
Interview KNIME Fabian Dill 173
Big Data and R: New Product Release by Revolution Analytics 173
Automated Content Aggregation 173
R or SAS —– R and SAS ? 170
Graphs 169
How to use Oracle for Data Mining 169
Carolina and SAS 166
Interview John Sall Founder JMP/SAS Institute 165
Aster Data hires Quentin Gallivan as CEO 165
Oracle for possible takeover of REvolution Computing 164
The Best and Worst Graphs Ever 163
Statistical Analysis with R- by John M Quick 163
Growing Rapidly: Rapid Miner 4.5 161
SAP and BI on Demand 161
Google Snappy 161
Google Refine 161
Scoring SAS and SPSS Models in the cloud 159
Hey Professor, I am not a Monkey 157
REVolution Computing fails to create a Revolution 156
SAS Lawsuit against WPS- Application Dismissed 156
KDNuggets Poll on SAS: Churn in Analytics Users 154
SAS Early Days 154
Interview James Taylor Decision Management Expert (Updated) 151
Google Books Ngram Viewer 148
Review – R for SAS and SPSS Users 148
New R Journal Edition 146
Here comes PySpread- 85,899,345 rows and 14,316,555 columns 145
Interview Karl Rexer -Rexer Analytics 144
Poem: The Extroverted Engineer 144
Hearst DataMining Challenge 144
This Is It 142
Interview Timo Elliott SAP 141
The Blind Side – Movie Review 141
Data Mining Survey Results :Tools and Offshoring 140
Going Deap : Algols in Python 140
ADVERTISE 139
Interview Jeff Bass, Bass Institute (Part 2) 139
Interview Jim Harris Data Quality Expert OCDQ Blog 139
Do Monkeys Pay for Sex? 138
Privacy Browsing Extensions in Google Chrome 137
China biggest threat to Indian Software in 5 years: Indian Tech CEO 136
Software HIStory: Bass Institute Part 1 135
Grenier’s Theory for Competitiveness 134
Interview Charlie Berger Oracle Data Mining 134
Karmic Koala Ubuntu/Linux 9.2 Preview 133
Analytics and Journals 133
Using Code Editors in R 132
Interview Stephanie McReynolds Director Product Marketing, AsterData 132
Amcharts- Cool Charts Web Editor 130
Mapreduce Book 128
Interesting R competition at Reddit 127
Color of Statistics 127
Amazon goes free for users next month 127
#3443 (deleted) 127
Interview Sarah Blow – Girly Geekdom Founder 126
Social Network Analysis: Using R 126
Interview Thomas C. Redman Author Data Driven 126
Audio Interview Anne Milley , Part 1 124
Advanced Analytics on Multi-Terabyte Datasets- Conferences 123
Geek Humour 123
John M. Chambers Statistical Software Award – 2011 122
My friend -The Computer 120
M2009 Interview Peter Pawlowski AsterData 118
R Journal Dec 2010 and R for Business Analytics 118
Top ten RRReasons R is bad for you ? 116
Interview Michael Zeller,CEO Zementis on PMML 115
Fast R Graphics 114
New Google Ad Planner 114
Making Sense: Hadoop and MapReduce 114
Using SAS/IML with R 114
Facebook App by SAP Crystal Reports 113
Whats behind that pretty SAS Blog? 113
Interview Alison Bolen SAS.com 113
Ajay @ arts 112
My latest creation 112
Indian Crabs – A story 112
Open Source’s worst enemy is itself not Microsoft/SAS/SAP/Oracle 112
Google Cloud Print -print documents from the internet 111
WPS and SAS- A rah-rah comparison 110
Facebook Gmail Killer Threatens to commit Hara Kari live on AOL Techcrunch if unsucessful 110
Open Source and Software Strategy 109
Windows Azure and Amazon Free offer 108
R for Analytics is now live 108
Open Source Compiler for SAS language/ GNU -DAP 107
Using Chromium /Chrome on Ubuntu Linux 107
Interview John Moore CTO, Swimfish 106
Nice BI Tutorials 106
Creating Customized Packages in SAS Software 106
Business Analytics Analyst Relations /Ethics/White Papers 105
Web Crawling Automation 105
The SAS-WPS Lawsuit- Preliminary Hearing 105
Handling time and date in R 105
KXEN Update 104
MapReduce Analytics Apps- AsterData’s Developer Express Plugin 104
+ 1 your website -updated 103
Movie Review- Peepli Live 103
Better Data Visualization in WordPress.com Stats 102
Customizing your R software startup 102
LibreOffice Beta 2 (Office Fork off Oracle) launches! 102
KXEN Case Studies : Financial Sector 102
Deleting Twitter, Facebook,LinkedIn- Accepting Life 102
Google Street View shows Gladiators fighting 101
Carole-Ann’s 2011 Predictions for Decision Management 101
Amazon goes HPC and GPU: Dirk E to revise his R HPC book 101
Happy Thanksgiving Id 101
Interview Phil Rack WPS Consultant and Developer 100
SPSS launches two more PASWs 99
Interview David Smith REvolution Computing 99
Data Mining with R 97
Dataset too big for R ? 97
How Jesus saved my Butt 97
Interview Evan Levy Baseline Consulting 97
The Latest GUI for R- BioR 96
WPS Version 2.5.1 Released – can still run SAS language/data and R 96
SAS legal falls flat against WPS again: Technical Grounds 95
World Programming System:300 pounds for The power of SAS language 94
KNIME and Zementis shake hands 93
Interview Eric Siegel, Phd President Prediction Impact 93
Interview Sarah Burnett BI Analyst,Ovum group 92
Quantifying Analytics ROI 92
PSPP – SPSS ‘s Open Source Counterpart 91
PySpread Magic 91
Interview SPSS Olivier Jouve 91
Interesting Data Visualization:Friendwheels 91
R on Windows HPC Server 90
The declining market for Telecommunication Churn Models 90
Getting Inside R 90
The Big Data Summit Agenda 90
Review: Clash of the Titans 89
Red Hat worth 7.8 Billion now 89
Movie Review : Rajneeti (Politics) 89
3 Idiots: Insight to Indian Engineer Campus Life 89
The Comic Water Games (aka Common Wealth Games) 88
Computer Education grants from Google 88
Challenges of Analyzing a dataset (with R) 87
Input Data in R using the top 3 R GUI 86
Complex Event Processing- SASE Language 85
Interview with Anne Milley, SAS II 85
Data Mining Presentation at M2009 by Dr Vincent Granville 85
Brief Interview Timo Elliott 85
Mapping Health Statistics at CDC.gov 85
Amazon’s Turks Mturk.com 84
Business Intelligence and Stat Computing: The White Man’s Last Stand 84
Movie Review- Dabangg 84
Movie Review: Sherlock Holmes 84
SAS Data Mining 2009 Las Vegas 83
Chinese Fortune Cookies 83
SPSS and R 83
Manjunath- A Batchmate on my mine 82
Data Mining 2010:SAS Conference in Vegas 81
DirkE and JD swoon about Shane’s MOM in Room 106 while writing R code 81
SAS to R Challenge: Unique benchmarking 81
S A S GOOD LIFE UNDER SIEGE – NYT 81
Pentaho and R: working together 81
Interview John F Moore CEO The Lab 80
Ways to use both Windows and Linux together 80
Brief Interview with James G Kobielus 80
For R Writers- Inside R 79
Using Ipod and Iphone with your Ubuntu Laptop 79
Webcasts: Oracle Data Mining 79
The Cloud OS is finally here or is it?: Karmic Koala 79
Movie Review: Lafangey Parinday (Rouge Birds) 79
SAS announcement in education initiatives 78
Using R from within Python 78
Event: Predictive analytics with R, PMML and ADAPA 78
Interesting R and BI Web Event 78
Bruno Aziza, Microsoft Global BI Lead joins PAW Keynote 77
Common Analytical Tasks 77
RWui :Creating R Web Interfaces on the go 77
R Successor Language ‘Tea’ announced 76
Learning SPSS for SAS users 76
Protovis a graphical toolkit for visualization 76
Interview Paul van Eikeren Inference for R 75
Data Visualization: Central Banks 75
Oracle Data Mining 11 G R2 75
Interview Peter J Thomas -Award Winning BI Expert 75
Weak Security in Internet Databases for Statisticians 74
Open Source Cartoon 74
Top Ten Graphs for Business Analytics -Pie Charts (1/10) 74
SAS Sentiment Analysis wins Award 74
JMP Genomics 5 released 74
Short Interview Jill Dyche 73
Interview David Katz ,Dataspora /David Katz Consulting 73
PMML 4.0 73
Ponder This: IBM Research 72
PAW Videos 71
PASW 13 :The preview 71
Cisco SocialMiner 70
Review-The Dark knight 70
MapReduce Patent Granted 70
Cloud Computing and GPU ( and some stats softwares) 70
IBM Business Analytics Forum 70
And now- The Business Analytics Summit 70
Creating an Anonymous Bot 69
R and SAS in Twitter Land 69
Interview:Richard Schultz , CEO REvolution Computing 69
China -United States -The Third Opium War 68
Quick-R and Statmethods.net 68
R Node- and other Web Interfaces to R 68
Life Mojo – A Health Startup 68
Using Views in R and comparing functions across multiple packages 68
Another R Tutorial 67
Interview Karen Lopez Data Modeling Expert 67
QGIS and R 66
Christmas Carol: The Best Software (BI-Stats-Analytics) 66
Software Lawsuits :Ergo 66
STEM is cool 65
Date Night 65
More Advanced SAS Modeling Procs 65
The Big Data Event- Why am I here? 65
Interview Gary Cokins SAS Institute 65
Browser based Music Creation 64
Interview Steve Sarsfield Author The Data Governance Imperative 63
GrapheR 63
Google Web Intelligence (Beta) 61
Data Mining 2009 Interviews- Terry Whitlock, BlueCross BlueShield of TN 60
Audio Interviews -Dr. Colleen McCue National Security Expert 60
Red R- A new beginning 59
YouTube Features: Audio Swap, Mobile posts and Themes 59
R for Predictive Modeling:Workshop 59
KDD2009: Papers Research and Industrial 58
Chapman/Hall announces new series on R 58
Data Visualization and Politics 58
T Shirts Design 58
Jump to JMP: Using Data Analysis in a visual manner 58
Aster Analytics and MapReduce.org 57
OK Cupid Data Visualization- Flow Chart to your Heart 57
R for SAS and SPSS Users 57
Carbon Footprints in the snow 57
Summer School on Uncertainty Quantification 57
High Performance Computing within R: Tutorial 57
Running Stats Softwares on Clouds 57
Amazing Data Visualization- UN Counter Terrorism 56
Cloud MapReduce 56
Statistical Features in WPS 56
An R Package only for SAS Users 56
R is Ready for Business™ 55
A Google App for Sales- ERPLY 55
Rexer Analytics Annual Data Miner Survey 55
Cartoons on R 55
American Decline- Why outsourcing doesnt make sense 55
Friday Cartoon Series- New 55
What softwares do you plan to use/learn in the next one year? 54
Great App for Online Sketching 54
September Roundup by Revolution 54
Using Firesheep on Campus, Caltrain and beyond 54
Decisionstats Interview at Big Data Summit, AsterData 53
Learning Hadoop 53
The White Man’s Burden-Poem 53
Curt Monash on Analytics with MapReduce 53
To R or Not to R : Data Mining and CRM for Free 52
Algorithms and Ads: No Free Lunches and Hill Climbing More stats 52
Interview: Roger Haddad, Founder of KXEN Automated Modeling Software More stats 52
Google and Me on Privacy and Openness More stats 52
MapReduce.org More stats 52
Why do bloggers blog ? More stats 52
Live Streaming for Free : UStream More stats 51
Light Cycle of Tron review More stats 51
Lyx Releases 2 More stats 51
Interview – Anne Milley, SAS Part 1 More stats 51
SAS News More stats 51
KXEN EMEA User Conference 2010-Success in Business Analytics More stats 51
2011 Forecast-ying More stats 51
Kill Analytics More stats 50
Social Media Analysis Toolkit More stats 50
Multi State Models More stats 50
R and Cloud Computing More stats 50
Dataists shake up R community with a rocking contest More stats 50
Interview Anne Milley JMP More stats 49
Movie Review: Between the Folds More stats 49
Jokes in Economics More stats 49
Interview Ajay Ohri Decisionstats.com with DMR More stats 49
One more Y Tube Video More stats 49
Happy Diwali /Google Music More stats 48
SPSS Directions : Rexer Survey Results More stats 48
Redlining in Internet Access and notes on Regression Models More stats 48
Poem : A Poets Life More stats 48
Predictive Analytics World More stats 48
Interview- Phil Rack More stats 48
Building KXEN Models on Ubuntu More stats 48
New Year Resolution Presentation More stats 48
Adobe gulps Omniture More stats 47
SAS Modeling Procs More stats 47
Oracle Open World/ RODM package More stats 47
KDNuggets Survey on R More stats 47
IBM and Revolution team to create new in-database R More stats 47
SAS Institute invests in R project More stats 46
Not just a Cloud More stats 46
New Version of R released: R 2.10.1 More stats 46
Review- Iron Man2 More stats 46
Online Analytics: Monte Carlo Simulation More stats 45
Predictive Forecasting in Commercial Applications More stats 45
The Race -by D.H Groberg More stats 45
SAS Scoring Accelerators More stats 45
IBM launches Smart Analytics Cloud More stats 45
Reactions to IBM -SPSS takeover. More stats 45
Zementis partners with R Analytics Vendor- Revo More stats 44
A Missing Mandelbrot Who Dun It More stats 44
Downloading your Facebook Photos More stats 44
Android Tutorial More stats 44
The Mommy Track More stats 44
My First You Tube Video: Courtesy the competiton on VOLNIGHT by Univ of Tennessee More stats 44
Born in the USA? More stats 43
Interview Eric A. King President The Modeling Agency More stats 43
Interview Augusto Albeghi (Straycat) —Founder Straysoft More stats 43
Why Cloud? More stats 43
Innovative ways of Calculus: Gifting a comic set for Christmas More stats 43
To find the best chaat or paan shop More stats 43
Google unleashes Fusion Tables More stats 42
Using SAS and C/C++ together More stats 42
Whats new in the latest version of R More stats 42
Bollywood 101 More stats 42
Who will forecast for the forecasters? More stats 42
Learning R Easily :Two GUI’s More stats 41
Harvard DropOut Writes Open Letter- His Startup has 350m users More stats 41
BI Software More stats 41
How to read blogs in Indonesian and Chinese! More stats 41
Window to a Blue Cloud: Azure Pricing More stats 41
China bans Chinese Food for Googleplex More stats 41
SAS Program for Students More stats 41
The Year 2010 More stats 40
What do you want to know in data analytics? More stats 40
America’s Data Book: Census Abstract 2011 More stats 40
Big Data Management and Advanced Analytics More stats 40
AsterData partners with Tableau More stats 40
Using R from other Software More stats 40
SAS on Fraud More stats 40

Using Views in R and comparing functions across multiple packages


R has almost 3,000 available packages (2923 at the moment).

This makes searching among these packages, and comparing functions that perform the same analytical task across different packages, tedious. It usually means searching by hand (reading the PDF help files and vignettes of several packages) or sending an email to the R help list.
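As a rough illustration of that kind of search, here is a minimal base R sketch. The helper name where_defined is made up for this example; it relies only on apropos() and find(), and it only sees packages that are currently attached.

```r
# List functions on the search path whose names match a regular expression,
# together with the attached package (or environment) that defines each one.
# 'where_defined' is a hypothetical helper name, not part of R itself.
where_defined <- function(pattern) {
  fns <- apropos(pattern, mode = "function")
  where <- vapply(
    fns,
    function(f) paste(find(f, mode = "function"), collapse = ", "),
    character(1)
  )
  data.frame(fun = fns, where = unname(where), stringsAsFactors = FALSE)
}

# Example: every attached function whose name starts with "median"
print(where_defined("^median"))
```

For packages that are installed but not attached, help.search("median") (or ??median) searches the installed help pages instead.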

However, CRAN Task Views are a slightly better way of managing your analytical software requirements than sifting through the large number of individual packages (see the Graphics view below).

CRAN Task Views allow you to browse packages by topic and provide tools to automatically install all packages for special areas of interest. Currently, 28 views are available. http://cran.r-project.org/web/views/

Bayesian: Bayesian Inference
ChemPhys: Chemometrics and Computational Physics
ClinicalTrials: Clinical Trial Design, Monitoring, and Analysis
Cluster: Cluster Analysis & Finite Mixture Models
Distributions: Probability Distributions
Econometrics: Computational Econometrics
Environmetrics: Analysis of Ecological and Environmental Data
ExperimentalDesign: Design of Experiments (DoE) & Analysis of Experimental Data
Finance: Empirical Finance
Genetics: Statistical Genetics
Graphics: Graphic Displays & Dynamic Graphics & Graphic Devices & Visualization
gR: gRaphical Models in R
HighPerformanceComputing: High-Performance and Parallel Computing with R
MachineLearning: Machine Learning & Statistical Learning
MedicalImaging: Medical Image Analysis
Multivariate: Multivariate Statistics
NaturalLanguageProcessing: Natural Language Processing
OfficialStatistics: Official Statistics & Survey Methodology
Optimization: Optimization and Mathematical Programming
Pharmacokinetics: Analysis of Pharmacokinetic Data
Phylogenetics: Phylogenetics, Especially Comparative Methods
Psychometrics: Psychometric Models and Methods
ReproducibleResearch: Reproducible Research
Robust: Robust Statistical Methods
SocialSciences: Statistics for the Social Sciences
Spatial: Analysis of Spatial Data
Survival: Survival Analysis
TimeSeries: Time Series Analysis

To automatically install these views, the ctv package needs to be installed, e.g., via

install.packages("ctv")
library("ctv")

and then the views can be installed via install.views or update.views (which first assesses which of the packages are already installed and up-to-date), e.g.,

install.views("Econometrics")
update.views("Econometrics")
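
The "which packages are already installed" check can be sketched in base R. This is only an illustration of the idea, not how ctv implements it: it ignores version numbers entirely, and the helper name check_installed and the package name notARealPackage123 are made up for the example.

```r
# Split a vector of package names into those already installed and those missing.
# This mirrors only the "is it installed?" part of the check; it does not
# compare version numbers. 'check_installed' is a hypothetical helper name.
check_installed <- function(pkgs) {
  have <- rownames(installed.packages())
  list(installed = intersect(pkgs, have), missing = setdiff(pkgs, have))
}

# 'notARealPackage123' is a made-up name used only for illustration
status <- check_installed(c("stats", "grid", "notARealPackage123"))
print(status)

# install.packages(status$missing)  # would fetch anything missing from CRAN
```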

CRAN Task View: Graphic Displays & Dynamic Graphics & Graphic Devices & Visualization

Maintainer: Nicholas Lewin-Koh
Contact: nikko at hailmail.net
Version: 2009-10-28

R is rich with facilities for creating and developing interesting graphics. Base R contains functionality for many plot types including coplots, mosaic plots, biplots, and the list goes on. There are devices such as postscript, png, jpeg and pdf for outputting graphics as well as device drivers for all platforms running R. lattice and grid are supplied with R’s recommended packages and are included in every binary distribution. lattice is an R implementation of William Cleveland’s trellis graphics, while grid defines a much more flexible graphics environment than the base R graphics.

R’s base graphics are implemented in the same way as in the S3 system developed by Becker, Chambers, and Wilks. There is a static device, which is treated as a static canvas and objects are drawn on the device through R plotting commands. The device has a set of global parameters such as margins and layouts which can be manipulated by the user using par() commands. The R graphics engine does not maintain a user visible graphics list, and there is no system of double buffering, so objects cannot be easily edited without redrawing a whole plot. This situation may change in R 2.7.x, where developers are working on double buffering for R devices. Even so, the base R graphics can produce many plots with extremely fine graphics in many specialized instances.
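
A small sketch of the static-canvas model described above, using only base R and the built-in cars data set. The output file is a temporary file chosen for illustration; the layout and margin values are arbitrary.

```r
out <- tempfile(fileext = ".pdf")   # illustrative output file
pdf(out, width = 8, height = 4)     # open a static PDF device

old <- par(mfrow = c(1, 2),         # layout: one row, two panels
           mar = c(4, 4, 2, 1))     # margins in lines: bottom, left, top, right

# Each call draws onto the canvas; nothing drawn can be edited afterwards
plot(cars$speed, cars$dist, xlab = "Speed", ylab = "Stopping distance",
     main = "Scatterplot")
points(mean(cars$speed), mean(cars$dist), pch = 19, col = "red")

hist(cars$speed, xlab = "Speed", main = "Histogram")

par(old)                # restore the previous parameter settings
invisible(dev.off())    # close the device and write the file
```

To change the red point after the fact, the whole plot has to be redrawn, which is the limitation the paragraph above refers to.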

One can quickly run into trouble with R’s base graphic system if one wants to design complex layouts where scaling is maintained properly on resizing, nested graphs are desired or more interactivity is needed. grid was designed by Paul Murrell to overcome some of these limitations and as a result packages like lattice, ggplot2, vcd or hexbin (on Bioconductor) use grid for the underlying primitives. When using plots designed with grid one needs to keep in mind that grid is based on a system of viewports and graphic objects. To add objects one needs to use grid commands, e.g., grid.polygon() rather than polygon(). Also grid maintains a stack of viewports from the device and one needs to make sure the desired viewport is at the top of the stack. There is a great deal of explanatory documentation included with grid as vignettes.
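
The viewport idea can be sketched with the grid package that ships with R. The viewport names "left" and "right", the shapes and the temporary output file are arbitrary choices for this example.

```r
library(grid)

out <- tempfile(fileext = ".pdf")   # illustrative output file
pdf(out, width = 6, height = 4)
grid.newpage()

# Push a viewport covering the left half of the page; it becomes the current one
pushViewport(viewport(x = 0.25, width = 0.5, name = "left"))
grid.rect(gp = gpar(col = "grey"))
grid.polygon(x = c(0.2, 0.5, 0.8), y = c(0.2, 0.8, 0.2),
             gp = gpar(fill = "steelblue"))
upViewport()   # step back to the top level, leaving "left" in the tree

# A second viewport covering the right half
pushViewport(viewport(x = 0.75, width = 0.5, name = "right"))
grid.circle(r = 0.3)
upViewport()

# To add to the left panel later, navigate back to it by name first
downViewport("left")
grid.text("added later", y = 0.1)
current_name <- current.viewport()$name   # records which viewport is current
upViewport()

invisible(dev.off())
```

Calling polygon() here instead of grid.polygon() would draw in base graphics coordinates and ignore the viewport entirely.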

The graphics packages in R can be organized roughly into the following topics, which range from the more user oriented at the top to the more developer oriented at the bottom. The categories are not mutually exclusive but are for the convenience of presentation:

  • Plotting : Enhancements for specialized plots can be found in plotrix for polar plotting, vcd for categorical data, hexbin (on Bioconductor) for hexagon binning, gclus for ordering plots and gplots for some plotting enhancements. Some specialized graphs, like Chernoff faces, are implemented in aplpack, which also has a nice implementation of Tukey’s bag plot. For 3D plots, lattice, scatterplot3d and misc3d provide a selection of plots for different kinds of 3D plotting. scatterplot3d is based on R’s base graphics system, while misc3d is based on rgl. The package onion for visualizing quaternions and octonions is well suited to display 3D graphics based on derived meshes.
  • Graphic Applications : This area is not much different from the plotting section except that these packages have tools that may not be for display, but can aid in creating effective displays. Also included are packages with more esoteric plotting methods. For specific subject areas, like maps or clustering, the excellent task views contributed by other dedicated useRs are a good place to start.
    • Effect ordering : The gclus package focuses on the ordering of graphs to accentuate cluster structure or natural ordering in the data. While not for graphics directly, cba and seriation have functions for creating 1 dimensional orderings from higher dimensional criteria. For ordering an array of displays, biclust can be useful.
    • Large Data Sets : Large data sets can present very different challenges from moderate and small datasets. Aside from overplotting, rendering 1,000,000 points can tax even modern GPUs. For univariate data, lvplot produces letter value boxplots which alleviate some of the problems that standard boxplots exhibit for large data sets. For bivariate data ash can produce a bivariate smoothed histogram very quickly, and hexbin, on Bioconductor, can bin bivariate data onto a hexagonal lattice, the advantage being that the irregular lines and orientation of hexagons do not create linear artifacts. For multivariate data, hexbin can be used to create a scatterplot matrix, combined with lattice. An alternative is to use scagnostics to produce a scatterplot matrix of “data about the data”, and look for interesting combinations of variables.
    • Trees and Graphs : ape and ade4 have functions for plotting phylogenetic trees, which can be used for plotting dendrograms from clustering procedures. While these packages produce decent graphics, they do not use sophisticated algorithms for node placement, so may not be useful for very large trees. igraph has the Reingold-Tilford algorithm implemented and is useful for plotting larger trees. diagram has facilities for flow diagrams and simple graphs. For more sophisticated graphs Rgraphviz and igraph have functions for plotting and layout, especially useful for representing large networks.
  • Graphics Systems : lattice is built on top of the grid graphics system and is an R implementation of William Cleveland’s trellis system for S-PLUS. lattice allows for building many types of plots with sophisticated layouts based on conditioning. ggplot2 is an R implementation of the system described in “The Grammar of Graphics” by Leland Wilkinson. Like lattice, ggplot2 (also built on top of grid) assists in trellis-like graphics, but allows for much more. Since it is built on the idea of a semantics for graphics there is much more emphasis on reshaping data, transformation, and assembling the elements of a plot.
  • Devices : Whereas grid is built on top of the R graphics engine, many in the R community have found the R graphics engine somewhat inflexible and have written separate device drivers that either emphasize interactivity or plotting in various graphics formats. R base supplies devices for PostScript, PDF, JPEG and other formats. Devices on CRAN include cairoDevice, which is a device based on libcairo and can actually render to many device types. The cairo device is designed to work with RGtk2, which is an interface to the Gimp Tool Kit, similar to pyGTK2. GDD provides device drivers for several bitmap formats, including GIF and BMP. RSvgDevice is an SVG device driver and interfaces well with vector drawing programs, or R web development packages, such as Rpad. When SVG devices are used for web display, developers should be aware that Internet Explorer does not support SVG, but has its own standard. Trust Microsoft. rgl provides a device driver based on OpenGL, and is good for 3D and interactive development. Lastly, the Augsburg group supplies a set of packages that includes a Java-based device, JavaGD.
  • Colors : The package colorspace provides a set of functions for transforming between color spaces and mixcolor() for mixing colors within a color space. Based on the HCL colors provided in colorspace, vcd provides a set of functions for choosing color palettes suitable for coding categorical variables (rainbow_hcl()) and numerical information (sequential_hcl(), diverge_hcl()). Similar types of palettes are provided in RColorBrewer, and dichromat is focused on palettes for color-impaired viewers.
  • Interactive Graphics : There are several efforts to implement interactive graphics systems that interface well with R. In an interactive system the user can interactively query the graphics on the screen with the mouse, or a moveable brush to zoom, pan and query on the device as well as link with other views of the data. rggobi embeds the GGobi interactive graphics system within R, so that one can display a data frame or several in GGobi directly from R. The package has functions to support longitudinal data, and graphs using GGobi’s edge set functionality. The RoSuDA repository maintained and developed by the University of Augsburg group has two packages, iplots and iwidgets as well as their Java development environment including a Java device, JavaGD. Their interactive graphics tools contain functions for alpha blending, which produces darker shading around areas with more data. This is exceptionally useful for parallel coordinate plots where many lines can quickly obscure patterns. playwith has facilities for building interactive versions of R graphics using the cairoDevice and RGtk2. Lastly, the rgl package has mechanisms for interactive manipulation of plots, especially 3D rotations and surfaces.
  • Development : For development of specialized graphics packages in R, grid should probably be the first consideration for any new plot type. rgl has better tools for 3D graphics, since the device is interactive, though it can be slow. An alternative is to use Java and the Java device in the RoSuDA packages, though Java has its own drawbacks. For porting plotting code to grid, using the package gridBase presents a nice intermediate step to embed base graphics in grid graphics and vice versa.
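
The Large Data Sets item above recommends binning bivariate data rather than plotting every point. As a rough base R sketch of that idea, the code below uses plain rectangular bins, not the hexagonal binning of hexbin or the smoothed histogram of ash. The data are simulated and the bin count of 50 is an arbitrary choice.

```r
set.seed(1)
n <- 100000
x <- rnorm(n)          # simulated data, for illustration only
y <- x + rnorm(n)

# Cut each coordinate into nb equal-width intervals
nb <- 50
x_breaks <- seq(min(x), max(x), length.out = nb + 1)
y_breaks <- seq(min(y), max(y), length.out = nb + 1)
xb <- cut(x, breaks = x_breaks, include.lowest = TRUE)
yb <- cut(y, breaks = y_breaks, include.lowest = TRUE)

# Count the points falling in each of the nb x nb cells
counts <- table(xb, yb)
z <- matrix(as.numeric(counts), nrow = nb, ncol = nb)

# Draw one shaded rectangle per cell instead of 100,000 points
out <- tempfile(fileext = ".pdf")   # illustrative output file
pdf(out)
image(x = x_breaks, y = y_breaks, z = z,
      col = grey(seq(1, 0, length.out = 64)),   # white = empty, black = dense
      xlab = "x", ylab = "y")
invisible(dev.off())
```

Rectangular cells are the simplest thing to compute in base R, but their straight edges are exactly what can produce the linear artifacts that the hexagonal lattice is meant to avoid.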