Top Ten Graphs for Business Analytics - Pie Charts (1/10)

I have not posted anything worthwhile on the website for some time, as I am still busy writing “R for Business Analytics”, which I hope to get out before year end. However, while doing research for the book, I came across many types of graphs, and what struck me is that the actual usage of some kinds of graphs is very different in business analytics as compared to statistical computing.

The criteria for the top ten graphs are as follows-

1) Usage- The order in which they appear is not strictly in terms of desirability but of actual frequency of usage. So a frequently used graph like a box plot would be recommended above, say, a violin plot.

2) Adequacy- Data visualization paradigms change over time, but the need stays the same: conveying the maximum information accurately in a minimum of space, without overwhelming the reader or misleading data perceptions.

3) Ease of creation- A simpler graph created by a single function is preferable to writing 4-5 lines of code to create an elaborate graph.

4) Aesthetics- Aesthetics is relative, and in addition studies have shown that visual perception varies across cultures and geographies. However, beauty is universally appreciated, and a pretty graph is often preferred over a not-so-pretty graph. Here being pretty means visual appeal that does not compromise perceptual inference from graphical analysis.

 

So when do we use a bar chart versus a line graph versus a pie chart? When is a mosaic plot more handy, and when should histograms be used with density plots? The list tries to capture most of these practicalities.

Let me elaborate on some specific graphs-

1) Pie Chart- The pie chart is not really used much in statistical computing, and indeed it is considered a misleading example of data visualization, especially the skewed or two-dimensional charts. However, when it comes to evaluating market share at a particular instant, a pie chart is simple to understand. At most two pie charts are needed for comparing two different snapshots, but three or more pie charts on the same data at different points of time is definitely a bad case.

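In R you can create a pie chart by just using pie(dataset$variable), where the variable is a vector of non-negative numbers. For a categorical variable, tabulate it first, for example pie(table(dataset$variable)).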
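As a minimal sketch of the market-share use case, the snippet below draws a pie chart from a small made-up data frame. The data frame `sales`, its column names, and the figures in it are hypothetical and only serve as illustration; `pie()`, `prop.table()` and `sprintf()` are base R functions.

```r
# Hypothetical market-share data (illustrative numbers only)
sales <- data.frame(
  company = c("A", "B", "C", "D"),
  revenue = c(40, 30, 20, 10)
)

# Convert revenue to percentage shares
share <- round(100 * prop.table(sales$revenue), 1)

# Label each slice with the company name and its share
slice_labels <- sprintf("%s (%.1f%%)", sales$company, share)

# pie() expects a vector of non-negative numbers
pie(sales$revenue, labels = slice_labels,
    main = "Market share (hypothetical data)")
```

Putting the percentage in the label helps the reader, since slice angles alone are hard to compare.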

As per the official documentation, pie charts are not recommended at all.

http://stat.ethz.ch/R-manual/R-patched/library/graphics/html/pie.html

Pie charts are a very bad way of displaying information. The eye is good at judging linear measures and bad at judging relative areas. A bar chart or dot chart is a preferable way of displaying this type of data.

Cleveland (1985), page 264: “Data that can be shown by pie charts always can be shown by a dot chart. This means that judgements of position along a common scale can be made instead of the less accurate angle judgements.” This statement is based on the empirical investigations of Cleveland and McGill as well as investigations by perceptual psychologists.

—-
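The documentation's suggestion can be tried out directly in base R. The sketch below shows the same kind of share data as a dot chart and as a bar chart; the vector `share` and its values are hypothetical, while `dotchart()` and `barplot()` are standard functions in the graphics package.

```r
# Hypothetical market shares in percent (illustrative numbers only)
share <- c(A = 40, B = 30, C = 20, D = 10)

# Sort so that positions along the common scale are easy to compare
share_sorted <- sort(share)

# Dot chart: values are judged by position along a common scale
dotchart(share_sorted, xlab = "Market share (%)", main = "Dot chart")

# Bar chart: values are judged by length from a common baseline
barplot(share_sorted, horiz = TRUE, xlab = "Market share (%)",
        main = "Bar chart")
```

Both charts are still a single function call each, so they meet the ease-of-creation criterion as well as the pie chart does.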

Despite this, pie charts are frequently used, because an important metric they convey is market share. Market share remains an important analytical metric for business.

The pie3D( ) function in the plotrix package provides 3D exploded pie charts. An exploded pie chart remains a very commonly used (or misused) chart.

From http://lilt.ilstu.edu/jpda/charts/chart%20tips/Chartstip%202.htm#Rules

we see some rules for using Pie charts.

 

  1. Avoid using pie charts.
  2. Use pie charts only for data that add up to some meaningful total.
  3. Never ever use three-dimensional pie charts; they are even worse than two-dimensional pies.
  4. Avoid forcing comparisons across more than one pie chart

 

From the R Graph Gallery (a slightly outdated but still very comprehensive graphical repository)

http://addictedtor.free.fr/graphiques/RGraphGallery.php?graph=4

par(bg="gray")
pie(rep(1,24), col=rainbow(24), radius=0.9)
title(main="Color Wheel", cex.main=1.4, font.main=3)
title(xlab="(test)", cex.lab=0.8, font.lab=3)
(Note: adding a grey background is quite easy in the basic graphics device as well, without using an advanced graphical package.)

 

Jobs in Analytics

Here are some jobs from Vincent Granville, founder of AnalyticBridge. Please contact him directly; I just thought the Season of Joy should have better jobs than are currently on offer.

————————————————————————————–

Several job ads were recently posted on DataShaping / AnalyticBridge, across the United States and in Europe. Use the DataShaping search box to find more opportunities.

Job ads are posted at:

 

Selected opportunities:

Quantitative Modeling Consultants – Agilex (Alexandria, VA)
Sr. Software Development Engineers – Agilex (Alexandria, VA)
Actuary – FBL Financial Group (Des Moines, IA)
Relevance scientist – Yandex Labs (Palo Alto, CA)
Research Engineer, Search Ranking – Chomp (San Francisco, CA)
Mathematical Modeling and Optimization – Exxon (Clinton, NJ)
Data Analyst – DISH Network (Englewood, CO)
Sr Aviation Planning Research & Data Analyst – Port of Seattle (Seattle, WA)
Statistician / Quantitative Analyst – Indeed (Austin, TX)
Statistician – Pratt & Whitney (East Hartford, CT)
Biostatistician – The J. David Gladstone Institutes (San Francisco, CA)
Customer Service Representative (Oklahoma, OK)
Program Associate – Cambridge Systematics (Washington D.C., DC)
Sr Risk Analyst – Paypal (Omaha, NE)
Sr. Actuarial Analyst – Farmers (Simi Valley, CA)
Senior Statistician, Data Services – Equifax (Alpharetta, GA)
Business Intelligence Analyst – Burberry (NYC, NY)
Fact Extraction – Amazon (Seattle, WA)
Senior Researcher – Bing (Bellevue, WA)
Senior Statistical Research Analyst – Walt Disney (Lake Buena Vista, FL)
Statistician – Capital One (Nottingham, NH)
Lead Data Analyst – Barclays (Northampton, UK)
Analytical Data Scientist – Aviagen (Huntsville, AL or Edinburgh, UK)
VP of Engineering for Analytics (Bay Area, CA)
Senior Software Engineer – Numenta (Redwood City, CA)
Numenta Internship Program – Numenta (Redwood City, CA)
Director of Analytics – Mozilla Corporation (Mountain View, CA)
Senior Sales Engineer – Statsoft (NY, NY)

Facebook Gmail Killer Threatens to Commit Hara-Kiri Live on AOL Techcrunch if Unsuccessful


As per Techcrunch http://techcrunch.com/2010/11/11/facebook-gmail-titan/

Project Titan — a web-based email client that we hear is unofficially referred to internally as its “Gmail killer”. Now we’ve heard from sources that this is indeed what’s coming on Monday during Facebook’s special event, alongside personal @facebook.com email addresses for users.

Now Techcrunch always tells the Truth, and the Gospel as per Mike is always right, especially when he is talking of the gates of heaven and Angels.

Again, as per the newly rich Mike Arrington (who qualifies to be an Angel Investor himself, except AOL has locked in his, err, wings)

Our understanding is that this is more than just a UI refresh for Facebook’s existing messaging service with POP access tacked on. Rather, Facebook is building a full-fledged webmail client, and while it may only be in early stages come its launch Monday, there’s a huge amount of potential here.

Facebook has the world’s most popular photos product, the most popular events product, and soon will have a very popular local deals product as well.  It can tweak the design of its webmail client to display content from each of these in a seamless fashion (and don’t forget messages from games, or payments via Facebook Credits). And there’s also the social element: Facebook knows who your friends are and how closely you’re connected to them; it can probably do a pretty good job figuring out which personal emails you want to read most and prioritize them accordingly.

Oh, and assuming our sources prove accurate, this explains the timing of the Google/Facebook slap fight over contact information.

In an exclusive chat with Decisionstats, Senior VP Eduard Patel Bumberg said: “This is it. I am going to kill Gmail. In this movie I just had a small part in the men's room while they had the groupies. If we finally kill Gmail, I hope to get a much bigger part in Social Network 3.”

The new Facebook email gives you less spam (primarily) as it leans on its contacts in the Cosa Nostra of Spam and tells them: no spam to .fb books.

Yes, anyone who is someone in spam has had a connection in the spam pie in Facebook, like creating 50 million duplicate accounts just before the movie got launched, or inflating the number of daily Farmville players, invites, and links.

Arringutan even covered some of it in an earlier FB game called scamville.

Saint Mark and Mike would have approved of Senior VP Eduard Patel Bumberg's decision to either kill Gmail or commit hara-kiri live on Ustream. It is good for the sequel.

So what's new in R 2.12.0


and as per http://cran.r-project.org/src/base/NEWS

the answer is that plenty is new in the new R.

While you and I were busy writing and reading blogs, or generally writing code to earn more money, or doing our own research, Uncle Peter D and his band of merry men have been really busy building a much upgraded R.

————————————–

CHANGES————————-

NEW FEATURES:

    • Reading a package's CITATION file now defaults to ASCII rather
      than Latin-1: a package with a non-ASCII CITATION file should
      declare an encoding in its DESCRIPTION file and use that encoding
      for the CITATION file.

    • difftime() now defaults to the "tzone" attribute of "POSIXlt"
      objects rather than to the current timezone as set by the default
      for the tz argument.  (Wish of PR#14182.)

    • pretty() is now generic, with new methods for "Date" and "POSIXt"
      classes (based on code contributed by Felix Andrews).

    • unique() and match() are now faster on character vectors where
      all elements are in the global CHARSXP cache and have unmarked
      encoding (ASCII).  Thanks to Matthew Dowle for suggesting
      improvements to the way the hash code is generated in unique.c.

    • The enquote() utility, in use internally, is exported now.

    • .C() and .Fortran() now map non-zero return values (other than
      NA_LOGICAL) for logical vectors to TRUE: it has been an implicit
      assumption that they are treated as true.

    • The print() methods for "glm" and "lm" objects now insert
      linebreaks in long calls in the same way that the print() methods
      for "summary.[g]lm" objects have long done.  This does change the
      layout of the examples for a number of packages, e.g. MASS.
      (PR#14250)

    • constrOptim() can now be used with method "SANN".  (PR#14245)

      It gains an argument hessian to be passed to optim(), which
      allows all the ... arguments to be intended for f() and grad().
      (PR#14071)

    • curve() now allows expr to be an object of mode "expression" as
      well as "call" and "function".

    • The "POSIX[cl]t" methods for Axis() have been replaced by a
      single method for "POSIXt".

      There are no longer separate plot() methods for "POSIX[cl]t" and
      "Date": the default method has been able to handle those classes
      for a long time.  This _inter alia_ allows a single date-time
      object to be supplied, the wish of PR#14016.

      The methods had a different default ("") for xlab.

    • Classes "POSIXct", "POSIXlt" and "difftime" have generators
      .POSIXct(), .POSIXlt() and .difftime().  Package authors are
      advised to make use of them (they are available from R 2.11.0) to
      proof against planned future changes to the classes.

      The ordering of the classes has been changed, so "POSIXt" is now
      the second class.  See the document ‘Updating packages for
      changes in R 2.12.x’ on  for
      the consequences for a handful of CRAN packages.

    • The "POSIXct" method of as.Date() allows a timezone to be
      specified (but still defaults to UTC).

    • New list2env() utility function as an inverse of
      as.list() and for fast multi-assign() to existing
      environment.  as.environment() is now generic and uses list2env()
      as list method.

    • There are several small changes to output which ‘zap’ small
      numbers, e.g. in printing quantiles of residuals in summaries
      from "lm" and "glm" fits, and in test statistics in print.anova().

    • Special names such as "dim", "names", etc, are now allowed as
      slot names of S4 classes, with "class" the only remaining
      exception.

    • File .Renviron can have architecture-specific versions such as
      .Renviron.i386 on systems with sub-architectures.

    • installed.packages() has a new argument subarch to filter on
      sub-architecture.

    • The summary() method for packageStatus() now has a separate
      print() method.

    • The default summary() method returns an object inheriting from
      class "summaryDefault" which has a separate print() method that
      calls zapsmall() for numeric/complex values.

    • The startup message now includes the platform and if used,
      sub-architecture: this is useful where different
      (sub-)architectures run on the same OS.

    • The getGraphicsEvent() mechanism now allows multiple windows to
      return graphics events, through the new functions
      setGraphicsEventHandlers(), setGraphicsEventEnv(), and
      getGraphicsEventEnv().  (Currently implemented in the windows()
      and X11() devices.)

    • tools::texi2dvi() gains an index argument, mainly for use by R
      CMD Rd2pdf.

      It avoids the use of texindy by texinfo's texi2dvi >= 1.157,
      since that does not emulate 'makeindex' well enough to avoid
      problems with special characters (such as (, {, !) in indices.

    • The ability of readLines() and scan() to re-encode inputs to
      marked UTF-8 strings on Windows since R 2.7.0 is extended to
      non-UTF-8 locales on other OSes.

    • scan() gains a fileEncoding argument to match read.table().

    • points() and lines() gain "table" methods to match plot().  (Wish
      of PR#10472.)

    • Sys.chmod() allows argument mode to be a vector, recycled along
      paths.

    • There are |, & and xor() methods for classes "octmode" and
      "hexmode", which work bitwise.

    • Environment variables R_DVIPSCMD, R_LATEXCMD, R_MAKEINDEXCMD,
      R_PDFLATEXCMD are no longer used nor set in an R session.  (With
      the move to tools::texi2dvi(), the conventional environment
      variables LATEX, MAKEINDEX and PDFLATEX will be used.
      options("dvipscmd") defaults to the value of DVIPS, then to
      "dvips".)

    • New function isatty() to see if terminal connections are
      redirected.

    • summaryRprof() returns the sampling interval in component
      sample.interval and only returns in by.self data for functions
      with non-zero self times.

    • print(x) and str(x) now indicate if an empty list x is named.

    • install.packages() and remove.packages() with lib unspecified and
      multiple libraries in .libPaths() inform the user of the library
      location used with a message rather than a warning.

    • There is limited support for multiple compressed streams on a
      file: all of [bgx]zfile() allow streams to be appended to an
      existing file, but bzfile() reads only the first stream.

    • Function person() in package utils now uses a given/family scheme
      in preference to first/middle/last, is vectorized to handle an
      arbitrary number of persons, and gains a role argument to specify
      person roles using a controlled vocabulary (the MARC relator
      terms).

    • Package utils adds a new "bibentry" class for representing and
      manipulating bibliographic information in enhanced BibTeX style,
      unifying and enhancing the previously existing mechanisms.

    • A bibstyle() function has been added to the tools package with
      default JSS style for rendering "bibentry" objects, and a
      mechanism for registering other rendering styles.

    • Several aspects of the display of text help are now customizable
      using the new Rd2txt_options() function.
      options("help_text_width") is no longer used.

    • Added \href tag to the Rd format, to allow hyperlinks to URLs
      without displaying the full URL.

    • Added \newcommand and \renewcommand tags to the Rd format, to
      allow user-defined macros.

    • New toRd() generic in the tools package to convert objects to
      fragments of Rd code, and added "fragment" argument to Rd2txt(),
      Rd2HTML(), and Rd2latex() to support it.

    • Directory R_HOME/share/texmf now follows the TDS conventions, so
      can be set as a texmf tree (‘root directory’ in MiKTeX parlance).

    • S3 generic functions now use correct S4 inheritance when
      dispatching on an S4 object.  See ?Methods, section on “Methods
      for S3 Generic Functions” for recommendations and details.

    • format.pval() gains a ... argument to pass arguments such as
      nsmall to format().  (Wish of PR#9574)

    • legend() supports title.adj.  (Wish of PR#13415)

    • Added support for subsetting "raster" objects, plus assigning to
      a subset, conversion to a matrix (of colour strings), and
      comparisons (== and !=).

    • Added a new parseLatex() function (and related functions
      deparseLatex() and latexToUtf8()) to support conversion of
      bibliographic entries for display in R.

    • Text rendering of \itemize in help uses a Unicode bullet in UTF-8
      and most single-byte Windows locales.

    • Added support for polygons with holes to the graphics engine.
      This is implemented for the pdf(), postscript(),
      x11(type="cairo"), windows(), and quartz() devices (and
      associated raster formats), but not for x11(type="Xlib") or
      xfig() or pictex().  The user-level interface is the polypath()
      function in graphics and grid.path() in grid.

    • File NEWS is now generated at installation with a slightly
      different format: it will be in UTF-8 on platforms using UTF-8,
      and otherwise in ASCII.  There is also a PDF version, NEWS.pdf,
      installed at the top-level of the R distribution.

    • kmeans(x, 1) now works.  Further, kmeans now returns between and
      total sum of squares.

    • arrayInd() and which() gain an argument useNames.  For arrayInd,
      the default is now false, for speed reasons.

    • As is done for closures, the default print method for the formula
      class now displays the associated environment if it is not the
      global environment.

    • A new facility has been added for inserting code into a package
      without re-installing it, to facilitate testing changes which can
      be selectively added and backed out.  See ?insertSource.

    • New function readRenviron to (re-)read files in the format of
      ~/.Renviron and Renviron.site.

    • require() will now return FALSE (and not fail) if loading the
      package or one of its dependencies fails.

    • aperm() now allows argument perm to be a character vector when
      the array has named dimnames (as the results of table() calls
      do).  Similarly, array() allows MARGIN to be a character vector.
      (Based on suggestions of Michael Lachmann.)

    • Package utils now exports and documents functions
      aspell_package_Rd_files() and aspell_package_vignettes() for
      spell checking package Rd files and vignettes using Aspell,
      Ispell or Hunspell.

    • Package news can now be given in Rd format, and news() prefers
      these inst/NEWS.Rd files to old-style plain text NEWS or
      inst/NEWS files.

    • New simple function packageVersion().

    • The PCRE library has been updated to version 8.10.

    • The standard Unix-alike terminal interface declares its name to
      readline as 'R', so that can be used for conditional sections in
      ~/.inputrc files.

    • ‘Writing R Extensions’ now stresses that the standard sections in
      .Rd files (other than \alias, \keyword and \note) are intended to
      be unique, and the conversion tools now drop duplicates with a
      warning.

      The .Rd conversion tools also warn about an unrecognized type in
      a \docType section.

    • ecdf() objects now have a quantile() method.

    • format() methods for date-time objects now attempt to make use of
      a "tzone" attribute with "%Z" and "%z" formats, but it is not
      always possible.  (Wish of PR#14358.)

    • tools::texi2dvi(file, clean = TRUE) now works in more cases (e.g.
      where emulation is used and when file is not in the current
      directory).

    • New function droplevels() to remove unused factor levels.

    • system(command, intern = TRUE) now gives an error on a Unix-alike
      (as well as on Windows) if command cannot be run.  It reports a
      non-success exit status from running command as a warning.

      On a Unix-alike an attempt is made to return the actual exit
      status of the command in system(intern = FALSE): previously this
      had been system-dependent but on POSIX-compliant systems the
      value returned was 256 times the status.

    • system() has a new argument ignore.stdout which can be used to
      (portably) ignore standard output.

    • system(intern = TRUE) and pipe() connections are guaranteed to be
      available on all builds of R.

    • Sys.which() has been altered to return "" if the command is not
      found (even on Solaris).

    • A facility for defining reference-based S4 classes (in the OOP
      style of Java, C++, etc.) has been added experimentally to
      package methods; see ?ReferenceClasses.

    • The predict method for "loess" fits gains an na.action argument
      which defaults to na.pass rather than the previous default of
      na.omit.

      Predictions from "loess" fits are now named from the row names of
      newdata.

    • Parsing errors detected during Sweave() processing will now be
      reported referencing their original location in the source file.

    • New adjustcolor() utility, e.g., for simple translucent color
      schemes.

    • qr() now has a trivial lm method with a simple (fast) validity
      check.

    • An experimental new programming model has been added to package
      methods for reference (OOP-style) classes and methods.  See
      ?ReferenceClasses.

    • bzip2 has been updated to version 1.0.6 (bug-fix release).
      --with-system-bzlib now requires at least version 1.0.6.

    • R now provides jss.cls and jss.bst (the class and bib style file
      for the Journal of Statistical Software) as well as RJournal.bib
      and Rnews.bib, and R CMD ensures that the .bst and .bib files are
      found by BibTeX.

    • Functions using the TAR environment variable no longer quote the
      value when making system calls.  This allows values such as tar
      --force-local, but does require additional quotes in, e.g., TAR =
      "'/path with spaces/mytar'".

  DEPRECATED & DEFUNCT:

    • Supplying the parser with a character string containing both
      octal/hex and Unicode escapes is now an error.

    • File extension .C for C++ code files in packages is now defunct.

    • R CMD check no longer supports configuration files containing
      Perl configuration variables: use the environment variables
      documented in ‘R Internals’ instead.

    • The save argument of require() now defaults to FALSE and save =
      TRUE is now deprecated.  (This facility is very rarely actually
      used, and was superseded by the Depends field of the DESCRIPTION
      file long ago.)

    • R CMD check --no-latex is deprecated in favour of --no-manual.

    • R CMD Sd2Rd is formally deprecated and will be removed in R
      2.13.0.

  PACKAGE INSTALLATION:

    • install.packages() has a new argument libs_only to optionally
      pass --libs-only to R CMD INSTALL and works analogously for
      Windows binary installs (to add support for 64- or 32-bit
      Windows).

    • When sub-architectures are in use, the installed architectures
      are recorded in the Archs field of the DESCRIPTION file.  There
      is a new default filter, "subarch", in available.packages() to
      make use of this.

      Code is compiled in a copy of the src directory when a package is
      installed for more than one sub-architecture: this avoids problems
      with cleaning the sources between building sub-architectures.

    • R CMD INSTALL --libs-only no longer overrides the setting of
      locking, so a previous version of the package will be restored
      unless --no-lock is specified.

  UTILITIES:

    • R CMD Rprof|build|check are now based on R rather than Perl
      scripts.  The only remaining Perl scripts are the deprecated R
      CMD Sd2Rd and install-info.pl (used only if install-info is not
      found) as well as some maintainer-mode-only scripts.

      *NB:* because these have been completely rewritten, users should
      not expect undocumented details of previous implementations to
      have been duplicated.

      R CMD no longer manipulates the environment variables PERL5LIB
      and PERLLIB.

    • R CMD check has a new argument --extra-arch to confine tests to
      those needed to check an additional sub-architecture.

      Its check for “Subdirectory 'inst' contains no files” is more
      thorough: it looks for files, and warns if there are only empty
      directories.

      Environment variables such as R_LIBS and those used for
      customization can be set for the duration of checking _via_ a
      file ~/.R/check.Renviron (in the format used by .Renviron, and
      with sub-architecture specific versions such as
      ~/.R/check.Renviron.i386 taking precedence).

      There are new options --multiarch to check the package under all
      of the installed sub-architectures and --no-multiarch to confine
      checking to the sub-architecture under which check is invoked.
      If neither option is supplied, a test is done of installed
      sub-architectures and all those which can be run on the current
      OS are used.

      Unless multiple sub-architectures are selected, the install done
      by check for testing purposes is only of the current
      sub-architecture (_via_ R CMD INSTALL --no-multiarch).

      It will skip the check for non-ascii characters in code or data
      if the environment variables _R_CHECK_ASCII_CODE_ or
      _R_CHECK_ASCII_DATA_ are respectively set to FALSE.  (Suggestion
      of Vince Carey.)

    • R CMD build no longer creates an INDEX file (R CMD INSTALL does
      so), and --force removes (rather than overwrites) an existing
      INDEX file.

      It supports a file ~/.R/build.Renviron analogously to check.

      It now runs build-time \Sexpr expressions in help files.

    • R CMD Rd2dvi makes use of tools::texi2dvi() to process the
      package manual.  It is now implemented entirely in R (rather than
      partially as a shell script).

    • R CMD Rprof now uses utils::summaryRprof() rather than Perl.  It
      has new arguments to select one of the tables and to limit the
      number of entries printed.

    • R CMD Sweave now runs R with --vanilla so the environment setting
      of R_LIBS will always be used.

  C-LEVEL FACILITIES:

    • lang5() and lang6() (in addition to pre-existing lang[1-4]())
      convenience functions for easier construction of eval() calls.
      If you have your own definition, do wrap it inside #ifndef lang5
      .... #endif to keep it working with old and new R.

    • Header R.h now includes only the C headers it itself needs, hence
      no longer includes errno.h.  (This helps avoid problems when it
      is included from C++ source files.)

    • Headers Rinternals.h and R_ext/Print.h include the C++ versions
      of stdio.h and stdarg.h respectively if included from a C++
      source file.

  INSTALLATION:

    • A C99 compiler is now required, and more C99 language features
      will be used in the R sources.

    • Tcl/Tk >= 8.4 is now required (increased from 8.3).

    • System functions access, chdir and getcwd are now essential to
      configure R.  (In practice they have been required for some
      time.)

    • make check compares the output of the examples from several of
      the base packages to reference output rather than the previous
      output (if any).  Expect some differences due to differences in
      floating-point computations between platforms.

    • File NEWS is no longer in the sources, but generated as part of
      the installation.  The primary source for changes is now
      doc/NEWS.Rd.

    • The popen system call is now required to build R.  This ensures
      the availability of system(intern = TRUE), pipe() connections and
      printing from postscript().

    • The pkg-config file libR.pc now also works when R is installed
      using a sub-architecture.

    • R has always required a BLAS that conforms to IEC 60559 arithmetic,
      but after discovery of more real-world problems caused by a BLAS
      that did not, this is tested more thoroughly in this version.

  BUG FIXES:

    • Calls to selectMethod() by default no longer cache inherited
      methods.  This could previously corrupt methods used by as().

    • The densities of non-central chi-squared are now more accurate in
      some cases in the extreme tails, e.g. dchisq(2000, 2, 1000), as a
      series expansion was truncated too early.  (PR#14105)

    • pt() is more accurate in the left tail for ncp large, e.g.
      pt(-1000, 3, 200).  (PR#14069)

    • The default C function (R_binary) for binary ops now sets the S4
      bit in the result if either argument is an S4 object.  (PR#13209)

    • source(echo=TRUE) failed to echo comments that followed the last
      statement in a file.

    • S4 classes that contained one of "matrix", "array" or "ts" and
      also another class now accept superclass objects in new().  Also
      fixes failure to call validObject() for these classes.

    • Conditional inheritance defined by argument test in
      methods::setIs() will no longer be used in S4 method selection
      (caching these methods could give incorrect results).  See
      ?setIs.

    • The signature of an implicit generic is now used by setGeneric()
      when that does not use a definition nor explicitly set a
      signature.

    • A bug in callNextMethod() for some examples with "..." in the
      arguments has been fixed.  See file
      src/library/methods/tests/nextWithDots.R in the sources.

    • match(x, table) (and hence %in%) now treat "POSIXlt" consistently
      with, e.g., "POSIXct".

    • Built-in code dealing with environments (get(), assign(),
      parent.env(), is.environment() and others) now behave
      consistently to recognize S4 subclasses; is.name() also
      recognizes subclasses.

    • The abs.tol control parameter to nlminb() now defaults to 0.0 to
      avoid false declarations of convergence in objective functions
      that may go negative.

    • The standard Unix-alike termination dialog to ask whether to save
      the workspace takes a EOF response as n to avoid problems with a
      damaged terminal connection.  (PR#14332)

    • Added warn.unused argument to hist.default() to allow suppression
      of spurious warnings about graphical parameters used with
      plot=FALSE.  (PR#14341)

    • predict.lm(), summary.lm(), and indeed lm() itself had issues
      with residual DF in zero-weighted cases (the latter two only in
      connection with empty models). (Thanks to Bill Dunlap for
      spotting the predict() case.)

    • aperm() treated resize = NA as resize = TRUE.

    • constrOptim() now has an improved convergence criterion, notably
      for cases where the minimum was (very close to) zero; further,
      other tweaks inspired from code proposals by Ravi Varadhan.

    • Rendering of S3 and S4 methods in man pages has been corrected
      and made consistent across output formats.

    • Simple markup is now allowed in \title sections in .Rd files.

    • The behaviour of as.logical() on factors (to use the levels) was
      lost in R 2.6.0 and has been restored.

    • prompt() did not backquote some default arguments in the \usage
      section.  (Reported by Claudia Beleites.)

    • writeBin() disallows attempts to write 2GB or more in a single
      call. (PR#14362)

    • new() and getClass() will now work if Class is a subclass of
      "classRepresentation" and should also be faster in typical calls.

    • The summary() method for data frames makes a better job of names
      containing characters invalid in the current locale.

    • [[ sub-assignment for factors could create an invalid factor
      (reported by Bill Dunlap).

    • Negate(f) would not evaluate argument f until first use of
      returned function (reported by Olaf Mersmann).

    • quietly=FALSE is now also an optional argument of library(), and
      consequently, quietly is now propagated also for loading
      dependent packages, e.g., in require(*, quietly=TRUE).

    • If the loop variable in a for loop was deleted, it would be
      recreated as a global variable.  (Reported by Radford Neal; the
      fix includes his optimizations as well.)

    • Task callbacks could report the wrong expression when the task
      involved parsing new code. (PR#14368)

    • getNamespaceVersion() failed; this was an accidental change in
      2.11.0. (PR#14374)

    • identical() returned FALSE for external pointer objects even when
      the pointer addresses were the same.

    • L$a@x[] <- val did not duplicate in a case it should have.

    • tempfile() now always gives a random file name (even if the
      directory is specified) when called directly after startup and
      before the R RNG had been used.  (PR#14381)

    • quantile(type=6) behaved inconsistently.  (PR#14383)

    • backSpline(.) behaved incorrectly when the knot sequence was
      decreasing.  (PR#14386)

    • The reference BLAS included in R was assuming that 0*x and x*0
      were always zero (whereas they could be NA or NaN in IEC 60559
      arithmetic).  This was seen in results from tcrossprod, and for
      example that log(0) %*% 0 gave 0.

    • The calculation of whether text was completely outside the device
      region (in which case, you draw nothing) was wrong for screen
      devices (which have [0, 0] at top-left).  The symptom was (long)
      text disappearing when resizing a screen window (to make it
      smaller).  (PR#14391)

    • model.frame(drop.unused.levels = TRUE) did not take into account
      NA values of factors when deciding to drop levels. (PR#14393)

    • library.dynam.unload required an absolute path for libpath.
      (PR#14385)

      Both library() and loadNamespace() now record absolute paths for
      use by searchpaths() and getNamespaceInfo(ns, "path").

    • The self-starting model NLSstClosestX failed if some deviation
      was exactly zero.  (PR#14384)

    • X11(type = "cairo") (and other devices such as png using
      cairographics) and which use Pango font selection now work around
      a bug in Pango when very small fonts (those with sizes between 0
      and 1 in Pango's internal units) are requested.  (PR#14369)

    • Added workaround for the font problem with X11(type = "cairo")
      and similar on Mac OS X whereby italic and bold styles were
      interchanged.  (PR#13463 amongst many other reports.)

    • source(chdir = TRUE) failed to reset the working directory if it
      could not be determined - that is now an error.

    • Fix for crash of example(rasterImage) on x11(type="Xlib").

    • Force Quartz to bring the on-screen display up-to-date
      immediately before the snapshot is taken by grid.cap() in the
      Cocoa implementation. (PR#14260)

    • model.frame had an unstated 500 byte limit on variable names.
      (Example reported by Terry Therneau.)

    • The 256-byte limit on names is now documented.

    • Subassignment by [, [[ or $ on an expression object with value
      NULL coerced the object to a list.
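
A few of the behaviours listed above can be checked from the console. The lines below are a small sketch for a current R installation, not taken from the release notes; the helper names is_odd and is_even are made up for illustration.

```r
# as.logical() on a factor uses the level labels, not the integer codes.
f <- factor(c("TRUE", "FALSE", "TRUE"))
flags <- as.logical(f)

# Under IEC 60559 arithmetic, 0 * (-Inf) is NaN rather than 0.
prod_scalar <- log(0) * 0

# Negate() returns a function giving the logical complement of f.
is_odd  <- function(n) n %% 2 == 1   # hypothetical helper
is_even <- Negate(is_odd)
evens   <- Filter(is_even, 1:6)
```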

 

 

Libre Office

There is some ambiguity about Libre Office and why it needed to change from Open Office, just when Open Office seemed such a threat on the desktop.

FROM- http://www.documentfoundation.org/faq/

Q: So is this a breakaway project?

A: Not at all. The Document Foundation will continue to be focused on developing, supporting, and promoting the same software, and it’s very much business as usual. We are simply moving to a new and more appropriate organisational model for the next decade – a logical development from Sun’s inspirational launch a decade ago.

Q: Why are you calling yourselves “The Document Foundation”?

A: For ten years we have used the same name – “OpenOffice.org” – for both the Community and the software. We’ve decided it removes ambiguity to have a different name for the two, so the Community is now “The Document Foundation”, and the software “LibreOffice”. Note: there are other examples of this usage in the free software community – e.g. the Mozilla Foundation with the Firefox browser.

Q: Does this mean you intend to develop other pieces of software?

A: We would like to have that possibility open to us in the future…

Q: And why are you calling the software “LibreOffice” instead of “OpenOffice.org”?

A: The OpenOffice.org trademark is owned by Oracle Corporation. Our hope is that Oracle will donate this to the Foundation, along with the other assets it holds in trust for the Community, in due course, once legal etc issues are resolved. However, we need to continue work in the meantime – hence “LibreOffice” (“free office”).

Q: Why are you building a new web infrastructure?

A: Since Oracle’s takeover of Sun Microsystems, the Community has been under “notice to quit” from our previous Collabnet infrastructure. With today’s announcement of a Foundation, we now have an entity which can own our emerging new infrastructure.

Q: What does this announcement mean to other derivatives of OpenOffice.org?

A: We want The Document Foundation to be open to code contributions from as many people as possible. We are delighted to announce that the enhancements produced by the Go-OOo team will be merged into LibreOffice, effective immediately. We hope that others will follow suit.

Q: What difference will this make to the commercial products produced by Oracle Corporation, IBM, Novell, Red Flag, etc?

A: The Document Foundation cannot answer for other bodies. However, there is nothing in the licence arrangements to stop companies continuing to release commercial derivatives of LibreOffice. The new Foundation will also mean companies can contribute funds or resources without worries that they may be helping a commercial competitor.

Q: What difference will The Document Foundation make to developers?

A: The Document Foundation sets out deliberately to be as developer friendly as possible. We do not demand that contributors share their copyright with us. People will gain status in our community based on peer evaluation of their contributions – not by who their employer is.

Q: What difference will The Document Foundation make to users of LibreOffice?

A: LibreOffice is The Document Foundation’s reason for existence. We do not have and will not have a commercial product which receives preferential treatment. We only have one focus – delivering the best free office suite for our users – LibreOffice.

—————————————————————————————————-

Non-Microsoft and non-Oracle vendors are indeed going to find it useful to bundle a free Libre Office, since it reduces the total cost of ownership for analytics software. Right now, some of the best free advertising for Microsoft OS and Office is done by enterprise software vendors who create Windows-only products and enable MS Office integration better than Open Office integration. This is done citing user demand, but it is a chicken-and-egg dilemma, as functionality leads to enhanced demand. Microsoft, on the other hand, is aware of this dependence and has made SQL Server and SQL Analytics (besides investing in analytics startups like Revolution Analytics) along with its own infrastructure (Azure Cloud Platform/EC2 instances).

Microsoft Online Games

No, this is not about the Xbox kind of games. It is about Microsoft’s tactical shift in the online space from going it alone and building stuff itself to partnering, and sometimes investing in and exiting businesses.

In blogs, it recently announced a migration of MS Live Spaces to WordPress.com. It gives Automattic 30 million more users, no small change considering there were 26 million existing WP users.

Microsoft Messenger, which is the oldest online app in the suite, now provides instant messaging services to about 350 million users, and from now on Windows Live Writer works specifically with the WordPress.com blog service by default. Hopefully Skype and Google Voice will show MS the way to monetize that business app yet.

Google buying Blogger/Blogspot seems to have done little, other than give Biz Stone room to create another content disruption: Twitter.

With the round of lawsuits by proxy, in Android (Motorola), or for acquisitions, MS is just doing what Marc Andreessen (who’s apparently a better VC than Paul Allen was), Sun and co did to it in the nineties.

Google seems to be regretting putting a spade in the Yahoo acquisition; that would have tied up a big chunk of idle MS cash, leaving it little room for niche investments (like the 250 million that helped Facebook ramp up in time).

The real surprise here could be Apple. It has shown little interest in cloud computing, and it seems to be testing the waters with Ping. But Apple sure smells competition, and Android is doing to the iPhone what Windows did to the Mac in the early 1990s.

Google lacks presence in online gaming (despite its own Zynga investment) and needs to start monetizing properties like Android OS (say $10 for every phone license?), Google Maps (as an app for GPS) and Google Voice. Indeed it may be time for the big G to start thinking of spinning off at least some products, earning better returns while retaining control (dual stock splits) and killing those antitrust lawyer fees forever.

As the Ancient Chinese said, May you live in interesting times. Fun to watch the online games people play.

 

 

Interfaces to R

This is a fairly long post and is a basic collection of material for a book/paper. It is on interfaces to use R. If you feel I need to add more on a particular R interface, or if there is an error in this, please feel free to contact me on twitter @decisionstats or mail ohri2007 on google mail.

R Interfaces

There are multiple ways to use the R statistical language.

Command Line- The default method is to use the command prompt of the software installed from http://r-project.org
For Windows users there is a simple GUI which has menus for Packages (loading a package, installing a package, setting the CRAN mirror for downloading packages), Misc (useful for listing all objects loaded in the workspace as well as clearing objects to free up memory), and Help.

Using Click and Point- Besides the command prompt, there are many Graphical User Interfaces which enable the analyst to use click and point methods to analyze data without getting into the details of learning complex and at times overwhelming R syntax. R GUIs are very popular both as a mode of instruction in academia and in actual usage, as they cut down considerably on the time taken to adapt to the language. As with all command line and GUI software, the command prompt will still come in handy for advanced tweaks and techniques.

Advantages and Limitations of using Visual Programming Interfaces to R as compared to Command Line.

 

Advantages | Limitations
Faster learning for new programmers | Can create junk analysis by clicking menus in GUI
Easier creation of advanced models or graphics | Cannot create custom functions unless you use command line
Repeatability of analysis is better | Advanced techniques and custom flexibility of data handling in R need the command line
Syntax is auto-generated | Can limit scope and exposure in learning R syntax




A brief list of the notable Graphical User Interfaces is below-

1) R Commander- Basic statistics
2) Rattle- Data Mining
3) Deducer- Graphics (including GGPlot Integration) and also uses JGR (a Java based GUI)
4) RKWard- Comprehensive R GUI for customizable graphs
5) Red-R – Dataflow programming interface using widgets

1) R Commander- R Commander was primarily created by Professor John Fox of McMaster University to cover the content of a basic statistics course. However it is extensible, and many other packages can be added to it in menu form as R Commander Plugins. Notably, it is one of the most widely used R GUIs, and it also has a script window so you can write R code in combination with the menus.
As you point and click a particular menu item, the corresponding R code is automatically generated in the log window and executed.

It can be found on CRAN at http://cran.r-project.org/web/packages/Rcmdr/index.html
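
A minimal sketch of installing and starting R Commander from the command prompt follows. It assumes a working CRAN mirror; is_installed is a hypothetical helper, and the GUI part is guarded by interactive() because attaching the package opens windows.

```r
# Hypothetical helper: TRUE if a package is installed locally (does not attach it).
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

# Install R Commander from CRAN if needed, then start the GUI.
# Only runs in an interactive session, because it opens windows.
if (interactive()) {
  if (!is_installed("Rcmdr")) install.packages("Rcmdr")
  library(Rcmdr)  # attaching the package opens the R Commander window
}
```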



Advantages of Using R Commander-
1) Useful for beginners in the R language for basic graphs, analysis and model building.
2) Has a script window, output window and log window (called messages) on the same screen, which helps the user as code is auto-generated on clicking menus and can be customized easily, for example by changing labels and options in graphs. Graphical output is shown in a separate window from the output window.
3) Extensible through other R packages like qcc (for quality control), Teaching Demos (for training), survival analysis and Design of Experiments (DoE).
4) Easy to understand interface even for a first time user.
5) Menu items which are not relevant are automatically greyed out. If there are only two variables and you try to build a 3D scatterplot graph, that menu would simply not be available and is greyed out.

Comparative Disadvantages of using R Commander-
1) It is basically aimed at a statistical audience (originally students in statistics) and thus the terms as well as menus are accordingly labeled. Hence it is more of a statistical GUI rather than an analytics GUI.
2) Has limited ability to evaluate models from a business analyst's perspective (the ROC curve is not given as an option) even though it has extensive statistical tests for model evaluation in the model sub menu. Indeed creating a model is treated as a subsection of statistics rather than a separate menu item.
3) It is not suited for projects that do not involve advanced statistical testing, for users not proficient in statistics (particularly hypothesis testing), or for data miners.

Menu items in the R Commander window:
File Menu – For loading script files and saving Script files, Output and Workspace
It is also needed for changing the present working directory and for exiting R.
Edit Menu – For editing scripts and code in the script window.
Data Menu – For creating a new dataset, inputting or importing data, and manipulating data through variables. Data import can be from text, comma separated values, clipboard, datasets from SPSS, Stata, Minitab, Excel, dBase, Access files or from a URL.
Data manipulation includes deleting rows of data as well as manipulating variables.
Also this menu has the option for merging two datasets by row or columns.
Statistics Menu-This menu has options for descriptive statistics, hypothesis tests, factor analysis and clustering and also for creating models. Note there is a separate menu for evaluating the model so created.
Graphs Menu-It has options for creating various kinds of graphs including box-plot, histogram, line, pie charts and x-y plots.
The first option is color palette- it can be used for customizing the colors. It is recommended you adjust colors based on your need for publication or presentation.
A notable option is 3D graphs for evaluating 3 variables at a time. This is a really good and impressive feature and exposes the user to advanced graphs in R, all in a few clicks. You may want to dazzle a presentation using this graph.
Also consider scatterplot matrix graphs for graphical display of variables.
Graphical display of R surpasses any other statistical software in appeal as well as ease of creation- using GUI to create graphs can further help the user to get the most of data insights using R at a very minimum effort.
Models Menu-This is somewhat of a labeling peculiarity of R Commander as this menu is only for evaluating models which have been created using the statistics menu-model sub menu.
It includes options for graphical interpretation of model results,residuals,leverage and confidence intervals and adding back residuals to the data set.
Distributions Menu- is for cumulative probabilities, probability density, graphs of distributions, quantiles and features for standard distributions and can be used in lieu of standard statistical tables for the distributions. It has 13 standard statistical continuous distributions and 5 discrete distributions.
Tools Menu- allows you to load other packages and also load R Commander plugins (which are then added to the interface menu after the R Commander GUI is restarted). It also contains an options sub menu for fine tuning (like opting to send output to R Menu).
Help Menu- Standard documentation and help menu. Essential reading is the short 25 page manual in it called “Getting Started With the R Commander”.
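
For readers who want to see roughly what the Graphs and Distributions menus do, here is a sketch using base R functions on the built-in iris data. This is not the code R Commander itself generates (it uses its own helper functions); the dataset and the plot titles are just illustrative choices.

```r
# Use a null PDF device so the plots do not need a screen.
pdf(file = NULL)

# Graphs menu equivalents, drawn from the built-in iris data.
h <- hist(iris$Sepal.Length, main = "Sepal length")   # histogram
b <- boxplot(Sepal.Length ~ Species, data = iris)     # box plot
pie(table(iris$Species), main = "Species share")      # pie chart
pairs(iris[, 1:4], main = "Scatterplot matrix")       # scatterplot matrix

dev.off()

# Distributions menu equivalents, in lieu of statistical tables.
p_upper <- pnorm(1.96)                       # cumulative probability, about 0.975
q_crit  <- qnorm(0.975)                      # quantile, about 1.96
d_binom <- dbinom(3, size = 10, prob = 0.5)  # P(X = 3) for Binomial(10, 0.5)
```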

R Commander Plugins- There are twenty extensions to R Commander that greatly enhance its appeal. These include basic time series forecasting, survival analysis, qcc and more.

see a complete list at

  1. DoE – http://cran.r-project.org/web/packages/RcmdrPlugin.DoE/RcmdrPlugin.DoE.pdf
  2. doex
  3. EHESampling
  4. epack- http://cran.r-project.org/web/packages/RcmdrPlugin.epack/RcmdrPlugin.epack.pdf
  5. Export- http://cran.r-project.org/web/packages/RcmdrPlugin.Export/RcmdrPlugin.Export.pdf
  6. FactoMineR
  7. HH
  8. IPSUR
  9. MAc- http://cran.r-project.org/web/packages/RcmdrPlugin.MAc/RcmdrPlugin.MAc.pdf
  10. MAd
  11. orloca
  12. PT
  13. qcc- http://cran.r-project.org/web/packages/RcmdrPlugin.qcc/RcmdrPlugin.qcc.pdf and http://cran.r-project.org/web/packages/qcc/qcc.pdf
  14. qual
  15. SensoMineR
  16. SLC
  17. sos
  18. survival-http://cran.r-project.org/web/packages/RcmdrPlugin.survival/RcmdrPlugin.survival.pdf
  19. SurvivalT
  20. Teaching Demos

Note the naming convention for the above plugins is always the prefix “RcmdrPlugin.” followed by the names above.
Also, a plugin must already be installed locally to be visible in R Commander’s load-plugin list, and R Commander loads the plugin only after restarting. Hence it is advisable to load all R Commander plugins at the beginning of the analysis session.
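
The naming convention makes it easy to build the package names in code. The sketch below uses a subset of the list above; the variable names are made up, and the install step is guarded by interactive() since it needs a CRAN mirror.

```r
# Short plugin names taken from the list above (a subset).
short_names <- c("DoE", "epack", "Export", "qcc", "survival")

# Full CRAN package names follow the "RcmdrPlugin." prefix convention.
plugin_pkgs <- paste0("RcmdrPlugin.", short_names)

# Install any that are missing before starting R Commander.
# Guarded by interactive() because installation needs a CRAN mirror.
if (interactive()) {
  missing_pkgs <- setdiff(plugin_pkgs, rownames(installed.packages()))
  if (length(missing_pkgs) > 0) install.packages(missing_pkgs)
}
```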

However the notable plugins are
1) DoE for Design of Experiments-
Full factorial designs, orthogonal main effects designs, regular and non-regular 2-level fractional
factorial designs, central composite and Box-Behnken designs, latin hypercube samples, and simple D-optimal designs can currently be generated from the GUI. Extensions to cover further latin hypercube designs as well as more advanced D-optimal designs (with blocking) are planned for the future.
2) Survival- This package provides an R Commander plug-in for the survival package, with dialogs for Cox models, parametric survival regression models, estimation of survival curves, and testing for differences in survival curves, along with data-management facilities and a variety of tests, diagnostics and graphs.
3) qcc -GUI for  Shewhart quality control charts for continuous, attribute and count data. Cusum and EWMA charts. Operating characteristic curves. Process capability analysis. Pareto chart and cause-and-effect chart. Multivariate control charts
4) epack- an Rcmdr “plug-in” based on the time series functions. Depends also on packages like tseries, abind, MASS, xts and forecast. It covers Log-Exceptions garch
and the following models: Arima, garch, HoltWinters
5) Export- The package helps users to graphically export Rcmdr output to LaTeX or HTML code,
via xtable() or Hmisc::latex(). The plug-in was originally intended to facilitate exporting Rcmdr
output to formats other than ASCII text and to provide R novices with an easy-to-use,
easy-to-access reference on exporting R objects to formats suited for printed output. The
package documentation contains several pointers on creating reports, either by using
conventional word processors or LaTeX/LyX.
6) MAc- This is an R-Commander plug-in for the MAc package (Meta-Analysis with
Correlations). This package enables the user to conduct a meta-analysis in a menu-driven,
graphical user interface environment (e.g., SPSS), while having the full statistical capabilities of
R and the MAc package. The MAc package itself contains a variety of useful functions for
conducting a research synthesis with correlational data. One of the unique features of the MAc
package is in its integration of user-friendly functions to complete the majority of statistical steps
involved in a meta-analysis with correlations.
You can read more on R Commander Plugins at http://wp.me/p9q8Y-1Is
—————————————————————————————————————————-
Rattle- R Analytical Tool To Learn Easily (download from http://rattle.togaware.com/)
Rattle is a more advanced user interface than R Commander, though not as popular in academia. It has been designed explicitly for data mining and it also has a commercial version for sale by Togaware. Rattle uses tabs with radio buttons and check boxes rather than drop-down menus in its graphical design. Also, the Execute button needs to be clicked after checking certain options, just as a submit button is clicked after writing code. This is different from clicking on a drop-down menu.

Advantages of Using Rattle
1) Useful for beginners in the R language for building models, clustering and data mining.
2) Has separate tabs for data entry, summary, visualization, model building, clustering, association and evaluation. The design is intuitive and easy to understand even for users without a statistical background, as help is conveniently shown when each tab or button is clicked. Also the tabs are placed in a very sequential and logical order.
3) Uses a lot of other R packages to build a complete analytical platform. Very good for correlation graphs, clustering as well as decision trees.
4) Easy to understand interface even for a first time user.
5) A log of the R code is auto-generated and time stamped.
6) Complete solution for model building, from partitioning datasets randomly for testing and validation, to building the model, evaluating lift and ROC curve, and exporting PMML output of the model for scoring.
7) Has well documented online help as well as in-software documentation. The help explains terms even to non-statistical users and is highly useful for business users.

Example Documentation for Hypothesis Testing in Test Tab in Rattle is ”
Distribution of the Data
* Kolmogorov-Smirnov     Non-parametric Are the distributions the same?
* Wilcoxon Signed Rank    Non-parametric Do paired samples have the same distribution?
Location of the Average
* T-test               Parametric     Are the means the same?
* Wilcoxon Rank-Sum    Non-parametric Are the medians the same?
Variation in the Data
* F-test Parametric Are the variances the same?
Correlation
* Correlation    Pearsons Are the values from the paired samples correlated?”
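
Each test in that list has a counterpart in base R's stats package. The sketch below runs them on simulated paired samples; the data, the seed and the variable names are assumptions for illustration, and this is not the code Rattle writes to its log.

```r
set.seed(42)
x <- rnorm(40, mean = 10, sd = 2)       # simulated sample 1
y <- x + rnorm(40, mean = 0.5, sd = 1)  # simulated paired sample 2

tests <- list(
  ks          = ks.test(x, y),                      # same distribution?
  signed_rank = wilcox.test(x, y, paired = TRUE),   # paired samples, same distribution?
  t_test      = t.test(x, y),                       # same means?
  rank_sum    = wilcox.test(x, y),                  # same medians?
  f_test      = var.test(x, y),                     # same variances?
  correlation = cor.test(x, y, method = "pearson")  # paired samples correlated?
)

# Collect the p-value of every test into one named vector.
p_values <- sapply(tests, function(tt) tt$p.value)
round(p_values, 4)
```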

Comparative Disadvantages of using Rattle-
1) It is basically aimed at a data miner. Hence it is more of a data mining GUI rather than an analytics GUI.
2) Has limited ability to create different types of graphs from a business analyst's perspective. Numeric variables can be made into Box Plot, Histogram, Cumulative as well as Benford graphs. While interactivity using GGobi and Latticist is included, the number of graphical options is still smaller than in other GUIs.
3) It is not suited for projects that involve multiple graphical analyses and which do not have model building or data mining. For example, Data Plot is given in the Cluster tab but not in the general Explore tab.
4) Despite the fact that it is meant for data miners, the GUI offers no support for the biglm package or for parallel programming on bigger datasets, though these can be done from the R command line in conjunction with the Rattle GUI. Data mining is typically done on bigger datasets.
5) You may have some problems installing it, as it depends on GTK and has a lot of packages as dependencies.

Top Row-
This has the Execute button (shown as two gears), which has the keyboard shortcut F2. It is used to execute the options in the tabs and is the equivalent of a submit code button.
Other buttons include New, Save and Load for projects, which are files with the extension .rattle and which store all related information from Rattle.
It also has a button for exporting information in the current tab as an Open Office document, and buttons for interrupting the current process as well as exiting Rattle.

Data Tab-
It has the following options.
●        Data Type- These are radio buttons between Spreadsheet (and Comma Separated Values), ARFF files (Weka), ODBC (for Database Connections),Library (for Datasets from Packages),R Dataset or R datafile, Corpus (for Text Mining) and Script for generating the data by code.
●        The second row in the Data Tab in Rattle is Detail on Data Type, and its appearance shifts as per the radio button selection of data type in the previous step. For Spreadsheet, it will show Path of File, Delimiters, Header Row, while for ODBC it will show DSN, Tables, Rows, and for Library it will show a dropdown of all datasets in all R packages installed locally.
●        The third row is a Partition field for splitting dataset in training,testing,validation and it shows ratio. It also specifies a Random seed which can be customized for random partitions which can be replicated. This is very useful as model building requires model to be built and tested on random sub sets of full dataset.
●        The fourth row is used to specify the variable type of inputted data. The variable types are
○        Input: Used for modeling as independent variables
○        Target: Output for modeling or the dependent variable. Target is a categoric variable for classification, numeric for regression and for survival analysis both Time and Status need to be defined
○        Risk: A variable used in the Risk Chart
○        Ident: An identifier for unique observations in the data set like AccountId or Customer Id
○        Ignore: Variables that are to be ignored.
●        In addition the weight calculator can be used to perform mathematical operations on certain variables and identify certain variables as more important than others.
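
The partition step can be sketched in base R as below. partition_rows is a hypothetical helper, not a Rattle function, and the 70/15/15 split and seed 42 are assumed defaults chosen for illustration. Fixing the seed is what makes the random partition reproducible.

```r
# Split row indices into training / validation / testing subsets.
# `partition_rows` is a hypothetical helper, not a Rattle function.
partition_rows <- function(n,
                           ratios = c(train = 0.70, validate = 0.15, test = 0.15),
                           seed = 42) {
  stopifnot(abs(sum(ratios) - 1) < 1e-8)
  set.seed(seed)                        # same seed gives the same partition
  shuffled <- sample(n)                 # random permutation of 1..n
  n_train <- round(ratios[["train"]] * n)
  n_valid <- round(ratios[["validate"]] * n)
  n_test  <- n - n_train - n_valid      # remainder goes to the test set
  list(
    train    = shuffled[seq_len(n_train)],
    validate = shuffled[n_train + seq_len(n_valid)],
    test     = shuffled[n_train + n_valid + seq_len(n_test)]
  )
}

parts <- partition_rows(nrow(iris))
train_set <- iris[parts$train, ]
sapply(parts, length)
```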

Explore Tab-
Summary Sub-Tab has Summary for brief summary of variables, Describe for detailed summary and Kurtosis and Skewness for comparing them across numeric variables.
Distributions Sub-Tab allows plotting of histograms, box plots, and cumulative plots for numeric variables and for categorical variables Bar Plot and Dot Plot.
It also has Benford Plot for Benford’s Law on probability of distribution of digits.
Correlation Sub-Tab- This displays correlation between variables as a table and also as a very nice plot.
Principal Components Sub-Tab– This is for use with Principal Components Analysis including the SVD (singular value decomposition) and Eigen methods.
Interactive Sub-Tab- Allows interactive data exploration using GGobi and Latticist software. It is a powerful visual tool.
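
Several of the Explore sub-tabs map onto short base R calls. The sketch below assumes the built-in iris data as a stand-in dataset. It shows the correlation table, the two principal components routes (prcomp() works through SVD, princomp() through an eigen decomposition) and the digit frequencies expected under Benford's Law.

```r
num <- iris[, 1:4]                       # numeric columns of a built-in dataset

summary(num)                             # brief summary of each variable
corr <- cor(num)                         # correlation table

# Principal components: prcomp() uses SVD, princomp() uses eigen decomposition.
pca_svd   <- prcomp(num, scale. = TRUE)
pca_eigen <- princomp(num, cor = TRUE)

# Benford's Law: expected share of leading digits 1..9.
benford <- log10(1 + 1 / (1:9))
names(benford) <- 1:9
round(benford, 3)
```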

Test Tab-This has options for hypothesis testing of data for two sample tests.
Transform Tab-This has options for rescaling data, missing values treatment, and deleting invalid or missing values.
Cluster Tab-It gives options for KMeans, Hierarchical and Bi-Cluster clustering methods, with automated graphs and plots (including dendrogram, discriminant plot and data plot) and cluster results available. It is highly recommended for clustering projects, especially for people who are proficient in clustering but not in R.
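
For comparison, k-means and hierarchical clustering look like this at the command line. This is a sketch on the built-in iris data, with the number of clusters (3), the seed and the linkage method picked arbitrarily for the example.

```r
num <- scale(iris[, 1:4])                # standardise the numeric columns

set.seed(42)                             # k-means starts from random centres
km <- kmeans(num, centers = 3, nstart = 10)

hc <- hclust(dist(num), method = "ward.D2")  # hierarchical clustering
hc_groups <- cutree(hc, k = 3)               # cut the dendrogram into 3 groups

# Cross-tabulate the two cluster assignments.
table(kmeans = km$cluster, hclust = hc_groups)
# plot(hc) would draw the dendrogram
```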

Associate Tab-It helps in building association rules between categorical variables, which are in the form of “if then” statements. Example: if day is Thursday, and someone buys Milk, there is an 80% chance they will buy Diapers. These probabilities are generated from observed frequencies.
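
To make "generated from observed frequencies" concrete, the sketch below computes support, confidence and lift for that rule by hand. The eight baskets are made-up data chosen so that the confidence comes out at 80%; Rattle relies on a dedicated association rules package rather than code like this.

```r
# Made-up transactions: one row per basket.
baskets <- data.frame(
  day     = c("Thu", "Thu", "Thu", "Thu", "Thu", "Fri", "Fri", "Sat"),
  milk    = c(TRUE,  TRUE,  TRUE,  TRUE,  TRUE,  TRUE,  FALSE, TRUE),
  diapers = c(TRUE,  TRUE,  TRUE,  TRUE,  FALSE, FALSE, TRUE,  FALSE)
)

# Rule: {day = Thu, milk} => {diapers}
antecedent <- baskets$day == "Thu" & baskets$milk
both       <- antecedent & baskets$diapers

support    <- mean(both)                       # share of all baskets matching the whole rule
confidence <- sum(both) / sum(antecedent)      # share of antecedent baskets that also have diapers
lift       <- confidence / mean(baskets$diapers)  # confidence relative to the base rate
```

Here 5 baskets match the antecedent and 4 of them contain diapers, so the confidence is 4/5.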

Model Tab-The Model tab makes Rattle one of the most advanced data mining tools, as it incorporates decision trees (including boosted models and forest method), linear and logistic regression, SVM, neural net and survival models.
Evaluate Tab-It has functionality for evaluating models including lift, ROC, confusion matrix, cost curve, risk chart, precision, specificity and sensitivity, as well as scoring datasets with the built model or models. Example: a ROC curve generated by Rattle for survived passengers in Titanic (as a function of age, class, sex). This shows a comparison of the various models built.
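
A rough command line counterpart of the Model and Evaluate steps, using the Titanic table that ships with R, is sketched below. The logistic model and the 0.5 cut-off are assumptions made for the example, and it stops at the confusion matrix measures rather than drawing the ROC curve.

```r
tt <- as.data.frame(Titanic)   # columns: Class, Sex, Age, Survived, Freq

# Logistic regression of survival on class, sex and age, weighted by counts.
fit <- glm(Survived ~ Class + Sex + Age, family = binomial,
           weights = Freq, data = tt)

# Classify with a 0.5 cut-off on the predicted survival probability.
prob <- predict(fit, newdata = tt, type = "response")
tt$pred <- factor(ifelse(prob > 0.5, "Yes", "No"), levels = c("No", "Yes"))

# Confusion matrix: rows are actual outcomes, columns are predictions.
cm <- xtabs(Freq ~ Survived + pred, data = tt)

sensitivity <- cm["Yes", "Yes"] / sum(cm["Yes", ])  # true positive rate
specificity <- cm["No", "No"] / sum(cm["No", ])     # true negative rate
accuracy    <- sum(diag(cm)) / sum(cm)
```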

Log Tab- R Code is automatically generated by Rattle as the respective operation is executed. Also timestamp is done so it helps in reviewing error as well as evaluating speed for code optimization.
—————————————————————————————————————————-
JGR- Deducer- (see http://www.deducer.org/pmwiki/pmwiki.php?n=Main.DeducerManual)
JGR is a Java based GUI. Deducer is recommended for use with JGR.
Deducer has basically been made to implement ggplot2, an advanced graphics package based on the Grammar of Graphics, in a GUI, and was part of a Google Summer of Code project.

It first asks you to either open existing dataset or load a new dataset with just two icons. It has two initial views in Data Viewer- a Data view and Variable view which is quite similar to Base SPSS. The other Deducer options are loaded within the JGR console.

Advantages of Using  Deducer
1.      It has an option for factor as well as reliability analysis which is missing in other graphical user interfaces like R Commander and Rattle.
2.      The plot builder option gives very good graphics, perhaps the best among these GUIs. This includes a color by option which allows you to shade the colors based on variable value. An additional innovation is templates, which enable even a user not familiar with data visualization to choose among various graphs and click and drag them to the plot builder area.
3.      You can set the Java Gui for R (JGR) menu to automatically load some packages by default using an easy checkbox list.
4.      Even though Deducer is a very young package, it offers a way for building other R GUIs using Java Widgets.
5.      The overall feel is that of SPSS (Base GUI) in its drop down menus, and in selecting variables in the sub menu dialogue by clicking to transfer them to the other side. SPSS users should be more comfortable using this.
6.      A surprising thing is it rearranges the help documentation of all R in a very presentable and organized manner
7.      Very convenient to move between two or more datasets using dropdown.
8.      The most convenient GUI for merging two datasets using common variable.
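
For reference, merging two datasets on a common variable is a one-line call to merge() at the command line. The two data frames below are made up for the example.

```r
# Two made-up datasets sharing the key column `customer_id`.
customers <- data.frame(customer_id = c(1, 2, 3, 4),
                        region      = c("North", "South", "East", "West"))
orders    <- data.frame(customer_id = c(2, 3, 3, 5),
                        amount      = c(120, 75, 40, 300))

inner <- merge(customers, orders, by = "customer_id")                # keys present in both
left  <- merge(customers, orders, by = "customer_id", all.x = TRUE)  # keep all customers
```

Customers with no matching order get NA in the amount column of the second result.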

Disadvantages of Using Deducer
1.      Not able to save plots as images (the only options are .pdf and .eps); you can however copy as image.
2.      Basically a data visualization GUI. It does offer support for regression and descriptive statistics in the menu item Extras; however the menu suggests it is a work in progress.
3.      Website for help is outdated, and help documentation specific to Deducer lacks detail.



Components of Deducer-
Data Menu-Gives options for data manipulation including recoding variables, transforming variables (binning, mathematical operations), sorting a dataset, transposing a dataset and merging two datasets.
Analysis Menu-Gives options for frequency tables, descriptive statistics, cross tabs, one sample tests (with plots), two sample tests (with plots), k sample tests, correlation, linear and logistic models, and generalized linear models.
Plot Builder Menu- This allows plots of various kinds to be made in an interactive manner.

Correlation using Deducer.

————————————————————————————————————————–
Red-R - A dataflow user interface for R (see http://red-r.org/)

Red-R uses dataflow concepts as a user interface rather than menus and tabs. Thus it is more similar to Enterprise Miner or Rapid Miner in design. For repeatable analysis, dataflow programming is preferred by some analysts. Red-R is written in Python.


Advantages of using Red-R
1) The dataflow style makes it very convenient to use. It is the only dataflow GUI for R.
2) You can save the data as well as the analysis in the same file.
3) The user interface makes it easy to read the R code generated, and to commit code.
4) For repeatable analysis, like reports or creating models, it is very useful, as you can replace just one widget while the other widgets and operations remain the same.
5) It is very easy to zoom into data points by double clicking on graphs, and also to change colors and other options in graphs.
6) One minor feature- It asks you to set the CRAN location just once and stores it even for the next session.
7) Automated bug report submission.

Disadvantages of using Red-R
1) Current version is 1.8 and it needs a lot of improvement for building more modeling types as well as debugging errors.
2) Limited features presently.
———————————————————————————————————————-
RKWard (see http://rkward.sourceforge.net/)

It is primarily a KDE GUI for R, so it can be used on Ubuntu Linux. The Windows version is available but has some bugs.

Advantages of using RKWard
1) It is the only R GUI for time series at present.
In addition it seems like the only R GUI explicitly for Item Response Theory (which includes credit response models and logistic models), and its plots include Pareto charts.
2) It offers a lot of detail in analysis, especially in plots (13 types of plots), analysis and distribution analysis (8 tests of normality, 14 continuous and 6 discrete distributions). This detail makes it more suitable for advanced statisticians than for business analytics users.
3) Output can be easily copied to Office documents.

Disadvantages of using RKWard
1) It does not have a stable Windows GUI. Since a graphical user interface is aimed at making interaction easier for users, this is a major disadvantage.
2) It has a lot of dependencies, so there may be some issues in installing it.
3) The design categorization of analysis, plots and distributions seems a bit unbalanced, considering the other tabs are File, Edit, View, Workspace, Run, Settings, Windows and Help.
Some of the other tabs could be collapsed, while the three main tabs of analysis, plots and distributions could be better categorized (especially into modeling and non-modeling analysis).
4) There are not many options for data manipulation (like subset or transpose) in the GUI.
5) The documentation lacks detail, as it is still only on version 0.5.3.

Components-
Analysis, Plots and Distributions are the main components, and they are very extensive, covering perhaps the biggest range of plots, analysis or distribution analysis that can be done.
Thus RKWard is best combined with some other GUI when doing advanced statistical analysis.

 


GrapheR

GrapheR is a Graphical User Interface created for simple graphs.

Depends: R (>= 2.10.0), tcltk, mgcv
Description: GrapheR is a multiplatform user interface for drawing highly customizable graphs in R. It aims to be a valuable help to quickly draw publishable graphs without any knowledge of R commands. Six kinds of graphs are available: histogram, box-and-whisker plot, bar plot, pie chart, curve and scatter plot.
License: GPL-2
LazyLoad: yes
Packaged: 2011-01-24 17:47:17 UTC; Maxime
Repository: CRAN
Date/Publication: 2011-01-24 18:41:47

More information about GrapheR at CRAN

Advantages of using GrapheR

  • It is bilingual (English and French) and can import text and CSV files.
  • The intention is for even non-users of R to make the simple types of graphs.
  • The user interface is quite cleanly designed. It is thus aimed at being a data visualization GUI, but at a more basic level than Deducer.
  • It is easy to rename axes and graph titles, as well as to use sliders for changing line thickness and color.

Disadvantages of using GrapheR

  • Lack of documentation or help. In particular, mouseover tips would help for some options.
  • Some of the terms, like abscissa or ordinate axis, may not be easily understood by a business user.
  • Default values of color are quite plain (black font on white background).
  • It can flood the terminal with lots of repetitive warnings (although use of the warnings() function limits them to the top 50).
  • Some of the axis names could be auto-suggested based on which variable is being chosen for that axis.
  • The package name GrapheR resembles Grapher, a graphing calculator in Mac OS; this can hinder search engine results.

Using GrapheR

  • Data Input- Data input can be customized for CSV and text files.
  • GrapheR gives information on loaded variables (numeric versus factors).
  • It asks you to choose the type of graph.
  • It then asks for the usual graph inputs (see below). Note that colors can be customized (partial window). Also, the number of graphs per window can be easily customized.
  • The graph is ready for publication.




Summary of R GUIs


Using R from other software- Please note that interfaces to R exist from other software as well. These include software from SAS Institute, IBM SPSS, Rapid Miner, Knime and Oracle.

A brief list is shown below-

1) SAS/IML Interface to R- You can read about the SAS Institute’s SAS/IML Studio interface to R at http://www.sas.com/technologies/analytics/statistics/iml/index.html
2) Rapid Miner Extension to R- You can view the integration with Rapid Miner’s extension to R at http://www.youtube.com/watch?v=utKJzXc1Cow
3) IBM SPSS plugin for R- SPSS software has R integration in the form of a plugin. This was one of the earliest third party software packages to offer interaction with R, and you can read more at http://www.spss.com/software/statistics/developer/
4) Knime- Konstanz Information Miner also has R integration. You can view this at http://www.knime.org/downloads/extensions
5) Oracle Data Miner- Oracle has a data mining offering for its very popular database software which is integrated with the R language. The R Interface to Oracle Data Mining (R-ODM) allows R users to access the power of Oracle Data Mining’s in-database functions using the familiar R syntax. http://www.oracle.com/technetwork/database/options/odm/odm-r-integration-089013.html
6) JMP- JMP version 9 is the latest to offer an interface to R. You can read example scripts at http://blogs.sas.com/jmp/index.php?/archives/298-JMP-Into-R!.html

R Excel- Using R from Microsoft Excel

Microsoft Excel is the most widely used spreadsheet program for data manipulation, entry and graphics. Yet as dataset sizes have increased, Excel’s statistical capabilities have lagged, though its design has moved ahead in various product versions.

R Excel basically works by adding a .xla plugin to Excel, just like other plugins. It does so by connecting to R through R packages.

Basically it offers the functionality of R functions and capabilities to the most widely distributed spreadsheet program. All data summaries, reports and analysis end up in a spreadsheet.

R Excel enables R to be very useful for people who do not know R. In addition it adds (by option) the menus of R Commander as menus in the Excel spreadsheet.


Advantages-
It enables R and Excel to communicate, thus tying an advanced statistical tool to the most widely used business analytics tool.

Disadvantages-
There is no major disadvantage at all for a business user. For a statistical user, a Microsoft Excel worksheet is limited to 1,048,576 rows (65,536 in Excel 2003 and earlier), so R data needs to be summarized or reduced.
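
Before sending data from R to a spreadsheet, it can help to check whether an exported CSV file would fit within the worksheet row limit. The following is a minimal sketch, not part of R Excel itself: the function names csv_lines and fits_in_rows are hypothetical, the row limit is passed in by the caller because it depends on the Excel version, and the count assumes that every line ends with a newline and that no quoted field contains an embedded newline.

```shell
# Hypothetical helpers for illustration only; they are not part of R Excel.

# Count the lines in a CSV file, header included (the header also
# occupies one worksheet row). tr strips the padding some wc versions add.
csv_lines() {
    wc -l < "$1" | tr -d ' '
}

# Succeed (exit status 0) when the file has no more lines than the limit.
# $1 = CSV file, $2 = worksheet row limit for your Excel version.
fits_in_rows() {
    [ "$(csv_lines "$1")" -le "$2" ]
}
```

For example, calling fits_in_rows with a file name and the row limit of your Excel version tells you whether the data needs to be summarized or reduced in R first.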

The graphical capabilities of R are very useful, but to a new user, interactive graphics in Excel may be easier than, say, using ggplot2 or GGobi.
You can read more on this at http://rcom.univie.ac.at/ or in the complete Springer book http://www.springer.com/statistics/computanional+statistics/book/978-1-4419-0051-7

The combination of cloud computing and the internet makes a new kind of interaction possible for scientists as well as analysts.

Here is a way to use R on an Amazon EC2 machine. You rent hardware and computing resources by the hour, and these are scalable to massive levels, whereas the software is free.

Here is how you can connect to Amazon EC2 and run R.
Running R for Cloud Computing.
1) Log on to the Amazon Console at http://aws.amazon.com/ec2/
Note that you need your Amazon ID (the same ID which you use for buying books will do). The upper tab shows whether you are in Amazon EC2. Click the upper tab to get into Amazon EC2.
2) Choosing the right AMI- On the left margin, you can click AMIs under Images. Now you can search for the image. I chose Ubuntu images (Linux images are cheaper) and searched for the latest Ubuntu Lucid. You can choose whether you want a 32 bit or 64 bit image. 64 bit images will lead to faster processing of data. Click on launch instance in the upper tab (near the search feature). A pop up comes up, which shows the 5 step process to launch your computing.
3) Choose the right compute instance- There are various compute instances, and they are all priced at different multiples according to their compute units. They differ in terms of RAM and number of processors. After choosing the compute instance of your choice (extra large is highlighted), click on continue.
4) Instance Details- Do not choose CloudWatch monitoring if you are on a budget, as it has an extra charge. For critical production it would be advisable to choose CloudWatch monitoring once you have become comfortable with handling cloud computing.
5) Add Tag Details- If you are running a lot of instances, you need to create your own tags to help you manage them. It is advisable if you are going to run many instances.
6) Create a key pair- A key pair is used to authenticate your SSH login to the instance in place of a password. Click on create new pair and name it (note the name, as it will be handy in the coming steps).
7) After clicking and downloading the key pair, you come to security groups. A security group is just a set of rules to help keep your data transfer secure. You want to enable access to your cloud instance only from certain IP addresses (if you are going to connect from a fixed IP address) and to certain ports. It is necessary in the security group to enable SSH using port 22.
Last step- Review Details and click Launch.
8) On the left margin click on Instances (you were in Images > AMIs earlier).
It will take some 3-5 minutes to launch an instance. You can see the status as pending till then.
9) A pending instance is shown by a yellow light.
10) Once the instance is running, it is shown by a green light.
Click on the check box, and on the upper tab go to instance actions. Click on connect.
You see a popup with instructions like these-
· Open the SSH client of your choice (e.g., PuTTY, terminal).
·  Locate your private key, nameofkeypair.pem
·  Use chmod to make sure your key file isn’t publicly viewable, ssh won’t work otherwise:
chmod 400 decisionstats.pem
·  Connect to your instance using instance’s public DNS [ec2-75-101-182-203.compute-1.amazonaws.com].
Example
Enter the following command line:
ssh -i decisionstats.pem root@ec2-75-101-182-203.compute-1.amazonaws.com

Note- If the instance was launched from an Ubuntu image, as here, the login user is ubuntu rather than root, so you will need to change the above line from root@… to ubuntu@…

ssh -i yourkeypairname.pem -X ubuntu@ec2-75-101-182-203.compute-1.amazonaws.com

(Note- The -X option forwards X11 graphics, so the X11 package should be installed for Linux users. Windows users will use Remote Desktop.)
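
The key and connection steps above can be sketched as a few small shell helpers. This is only a sketch under stated assumptions: the function names lock_key, key_perms and ssh_command are made up for illustration, and ssh_command only prints the command line so that you can inspect it before running it yourself.

```shell
# Hypothetical helpers for illustration; the names are not from any tool.

# Make the private key readable by its owner only. ssh refuses to use
# a key file that other users can read.
lock_key() {
    chmod 400 "$1"
}

# Show the permission string of the key file as printed by ls -l,
# for example -r-------- after lock_key has been applied.
key_perms() {
    ls -l "$1" | cut -c1-10
}

# Print (not run) the ssh command line.
# $1 = key file, $2 = login user, $3 = public DNS name of the instance.
# -i selects the private key and -X requests X11 forwarding.
ssh_command() {
    printf 'ssh -i %s -X %s@%s\n' "$1" "$2" "$3"
}
```

For example, lock_key yourkeypairname.pem followed by ssh_command yourkeypairname.pem ubuntu and the public DNS name of your own instance prints a line in the same form as the command shown above.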

11) Install R Commander on the remote machine (which is running Ubuntu Linux) using the command

sudo apt-get install r-cran-rcmdr
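
On a freshly launched instance the package list may be out of date, so the install step can be sketched as below. The function names are hypothetical, and both the apt-get update line and the -y flag (which answers yes to the install prompt) are assumptions added here; only the install of r-cran-rcmdr comes from the step above. The first helper prints the commands rather than running them.

```shell
# Hypothetical helpers for illustration only.

# Print (not run) the setup commands for a fresh Ubuntu instance.
# "apt-get update" and "-y" are assumptions added to the original step.
rcmdr_setup_commands() {
    cat <<'EOF'
sudo apt-get update
sudo apt-get install -y r-cran-rcmdr
EOF
}

# Succeed when a command such as R is already on the PATH, which is a
# quick way to see whether R is installed before or after the setup.
have_command() {
    command -v "$1" >/dev/null 2>&1
}
```

After the install, have_command R should succeed in the SSH session on the remote machine.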