#SAS 9.3 and #Rstats 2.13.1 Released

A bit early but the latest editions of both SAS and R were released last week.

SAS 9.3 is clearly a major release with multiple enhancements to make SAS both relevant and pertinent in enterprise software in the age of big data. Also many more R specific, JMP specific and partners like Teradata specific enhancements.

http://support.sas.com/software/93/index.html

Features

Data management

  • Enhanced manageability for improved performance
  • In-database processing (EL-T pushdown)
  • Enhanced performance for loading oracle data
  • New ET-L transforms
  • Data access

Data quality

  • SAS® Data Integration Server includes DataFlux® Data Management Platform for enhanced data quality
  • Master Data Management (DataFlux® qMDM)
    • Provides support for master hub of trusted entity data.

Analytics

  • SAS® Enterprise Miner™
    • New survival analysis predicts when an event will happen, not just if it will happen.
    • New rate making capability for insurance predicts optimal insurance premium for individuals based on attributes known at application time.
    • Time Series Data Mining node (experimental) applies data mining techniques to transactional, time-stamped data.
    • Support Vector Machines node (experimental) provides a supervised machine learning method for prediction and classification.
  • SAS® Forecast Server
    • SAS Forecast Server is integrated with the SAP APO Demand Planning module to provide SAP users with access to a superior forecasting engine and automatic forecasting capabilities.
  • SAS® Model Manager
    • Seamless integration of R models with the ability to register and manage R models in SAS Model Manager.
    • Ability to perform champion/challenger side-by-side comparisons between SAS and R models to see which model performs best for a specific need.
  • SAS/OR® and SAS® Simulation Studio
    • Optimization
    • Simulation
      • Automatic input distribution fitting using JMP with SAS Simulation Studio.

Text analytics

  • SAS® Text Miner
  • SAS® Enterprise Content Categorization
  • SAS® Sentiment Analysis

Scalability and high-performance

  • SAS® Analytics Accelerator for Teradata (new product)
  • SAS® Grid Manager
 and latest from http://www.r-project.org/ I was a bit curious to know why the different licensing for R now (from GPL2 to GPL2- GPL 3)

LICENCE:

No parts of R are now licensed solely under GPL-2. The licences for packages rpart and survival have been changed, which means that the licence terms for R as distributed are GPL-2 | GPL-3.


This is a maintenance release to consolidate various minor fixes to 2.13.0.
CHANGES IN R VERSION 2.13.1:

  NEW FEATURES:

    • iconv() no longer translates NA strings as "NA".

    • persp(box = TRUE) now warns if the surface extends outside the
      box (since occlusion for the box and axes is computed assuming
      the box is a bounding box). (PR#202.)

    • RShowDoc() can now display the licences shipped with R, e.g.
      RShowDoc("GPL-3").

    • New wrapper function showNonASCIIfile() in package tools.

    • nobs() now has a "mle" method in package stats4.

    • trace() now deals correctly with S4 reference classes and
      corresponding reference methods (e.g., $trace()) have been added.

    • xz has been updated to 5.0.3 (very minor bugfix release).

    • tools::compactPDF() gets more compression (usually a little,
      sometimes a lot) by using the compressed object streams of PDF
      1.5.

    • cairo_ps(onefile = TRUE) generates encapsulated EPS on platforms
      with cairo >= 1.6.

    • Binary reads (e.g. by readChar() and readBin()) are now supported
      on clipboard connections.  (Wish of PR#14593.)

    • as.POSIXlt.factor() now passes ... to the character method
      (suggestion of Joshua Ulrich).  [Intended for R 2.13.0 but
      accidentally removed before release.]

    • vector() and its wrappers such as integer() and double() now warn
      if called with a length argument of more than one element.  This
      helps track down user errors such as calling double(x) instead of
      as.double(x).

  INSTALLATION:

    • Building the vignette PDFs in packages grid and utils is now part
      of running make from an SVN checkout on a Unix-alike: a separate
      make vignettes step is no longer required.

      These vignettes are now made with keep.source = TRUE and hence
      will be laid out differently.

    • make install-strip failed under some configuration options.

    • Packages can customize non-standard installation of compiled code
      via a src/install.libs.R script. This allows packages that have
      architecture-specific binaries (beyond the package's shared
      objects/DLLs) to be installed in a multi-architecture setting.

  SWEAVE & VIGNETTES:

    • Sweave() and Stangle() gain an encoding argument to specify the
      encoding of the vignette sources if the latter do not contain a
      \usepackage[]{inputenc} statement specifying a single input
      encoding.

    • There is a new Sweave option figs.only = TRUE to run each figure
      chunk only for each selected graphics device, and not first using
      the default graphics device.  This will become the default in R
      2.14.0.

    • Sweave custom graphics devices can have a custom function
      foo.off() to shut them down.

    • Warnings are issued when non-portable filenames are found for
      graphics files (and chunks if split = TRUE).  Portable names are
      regarded as alphanumeric plus hyphen, underscore, plus and hash
      (periods cause problems with recognizing file extensions).

    • The Rtangle() driver has a new option show.line.nos which is by
      default false; if true it annotates code chunks with a comment
      giving the line number of the first line in the sources (the
      behaviour of R >= 2.12.0).

    • Package installation tangles the vignette sources: this step now
      converts the vignette sources from the vignette/package encoding
      to the current encoding, and records the encoding (if not ASCII)
      in a comment line at the top of the installed .R file.

  DEPRECATED AND DEFUNCT:

    • The internal functions .readRDS() and .saveRDS() are now
      deprecated in favour of the public functions readRDS() and
      saveRDS() introduced in R 2.13.0.

    • Switching off lazy-loading of code _via_ the LazyLoad field of
      the DESCRIPTION file is now deprecated.  In future all packages
      will be lazy-loaded.

    • The off-line help() types "postscript" and "ps" are deprecated.

  UTILITIES:

    • R CMD check on a multi-architecture installation now skips the
      user's .Renviron file for the architecture-specific tests (which
      do read the architecture-specific Renviron.site files).  This is
      consistent with single-architecture checks, which use
      --no-environ.

    • R CMD build now looks for DESCRIPTION fields BuildResaveData and
      BuildKeepEmpty for per-package overrides.  See ‘Writing R
      Extensions’.

  BUG FIXES:

    • plot.lm(which = 5) was intended to order factor levels in
      increasing order of mean standardized residual.  It ordered the
      factor labels correctly, but could plot the wrong group of
      residuals against the label.  (PR#14545)

    • mosaicplot() could clip the factor labels, and could overlap them
      with the cells if a non-default value of cex.axis was used.
      (Related to PR#14550.)

    • dataframe[[row,col]] now dispatches on [[ methods for the
      selected column (spotted by Bill Dunlap).

    • sort.int() would strip the class of an object, but leave its
      object bit set.  (Reported by Bill Dunlap.)

    • pbirthday() and qbirthday() did not implement the algorithm
      exactly as given in their reference and so were unnecessarily
      inaccurate.

      pbirthday() now solves the approximate formula analytically
      rather than using uniroot() on a discontinuous function.

      The description of the problem was inaccurate: the probability is
      a tail probablity (‘2 _or more_ people share a birthday’)

    • Complex arithmetic sometimes warned incorrectly about producing
      NAs when there were NaNs in the input.

    • seek(origin = "current") incorrectly reported it was not
      implemented for a gzfile() connection.

    • c(), unlist(), cbind() and rbind() could silently overflow the
      maximum vector length and cause a segfault.  (PR#14571)

    • The fonts argument to X11(type = "Xlib") was being ignored.

    • Reading (e.g. with readBin()) from a raw connection was not
      advancing the pointer, so successive reads would read the same
      value.  (Spotted by Bill Dunlap.)

    • Parsed text containing embedded newlines was printed incorrectly
      by as.character.srcref().  (Reported by Hadley Wickham.)

    • decompose() used with a series of a non-integer number of periods
      returned a seasonal component shorter than the original series.
      (Reported by Rob Hyndman.)

    • fields = list() failed for setRefClass().  (Reported by Michael
      Lawrence.)

    • Reference classes could not redefine an inherited field which had
      class "ANY". (Reported by Janko Thyson.)

    • Methods that override previously loaded versions will now be
      installed and called.  (Reported by Iago Mosqueira.)

    • addmargins() called numeric(apos) rather than
      numeric(length(apos)).

    • The HTML help search sometimes produced bad links.  (PR#14608)

    • Command completion will no longer be broken if tail.default() is
      redefined by the user. (Problem reported by Henrik Bengtsson.)

    • LaTeX rendering of markup in titles of help pages has been
      improved; in particular, \eqn{} may be used there.

    • isClass() used its own namespace as the default of the where
      argument inadvertently.

    • Rd conversion to latex mis-handled multi-line titles (including
      cases where there was a blank line in the \title section).
Also see this interesting blog
Examples of tasks replicated in SAS and R

Analytics 2011 Conference

From http://www.sas.com/events/analytics/us/

The Analytics 2011 Conference Series combines the power of SAS’s M2010 Data Mining Conference and F2010 Business Forecasting Conference into one conference covering the latest trends and techniques in the field of analytics. Analytics 2011 Conference Series brings the brightest minds in the field of analytics together with hundreds of analytics practitioners. Join us as these leading conferences change names and locations. At Analytics 2011, you’ll learn through a series of case studies, technical presentations and hands-on training. If you are in the field of analytics, this is one conference you can’t afford to miss.

Conference Details

October 24-25, 2011
Grande Lakes Resort
Orlando, FL

Analytics 2011 topic areas include:

Lovely forecasting blog

Eight different random walks.
Image via Wikipedia

I really loved this simple, smart and yet elegant explanation of forecasting. even a high school quarterback could understand it, and maybe get a internship job building and running and re running code for Mars shot.

Despite my plea that you remain svelte in real life, I implore you to be naïve in business forecasting – and use a naïve forecasting model early and often. A naïve forecasting model is the most important model you will ever use in business forecasting.

and now the killer line

Purists may argue that the only true naïve forecast is the “no-change” forecast, meaning either a random walk (forecast = last known actual) or a seasonal random walk (e.g. forecast = actual from corresponding period last year). These are referred to as NF1 and NF2 in the Makridakis text (where NF = Naïve Forecast). In our 2006 SAS webseries Finding Flaws in Forecasting, an attendee asked “What about using a simple time series forecast with no intervention as the naïve forecast?” Is that allowed?

i did write a blog article on forecasting some time back, but back then I was a little blogger, with the website name being http://iwannacrib.com

great work in helping make forecasting easier to understand for people who have flower shops and dont have a bee, to help them with the forecasts, nor an geeky email list, not 4000$.

make it easier for the little guy to forecast his sales, so he cuts down on his supply chain inventory, lowering his carbon footprint.

Blog.sas.com take a bow, on labour day, helping workers with easy to understand models.

http://blogs.sas.com/forecasting/index.php?/archives/68-Which-Naive-Model-to-Use.html

Broad Guidelines for Graphs

Here are some broad guidelines for Graphs from EIA.gov , so you can say these are the official graphical guidelines of USA Gov

They can be really useful for sites planning to get into the Tableau Software/NYT /Guardian Infographic mode- or even for communities of blogs that have recurrent needs to display graphical plots- particularly since communication, statistical and design specialists are different areas/expertise/people.

Energy Information Administration Standard

Broad Guidelines for Graphs-I am reproducing an example from EIA ‘s guidelines for graphs-
http://www.eia.gov/about/eia_standards.cfm#Standard25

Energy Information Administration Standard 2009-25

Title: Statistical Graphs
Superseded Version: Standard 2002-25
Purpose: To ensure the utility (usefulness to intended users) and objectivity (accuracy, clarity, completeness, and lack of bias) of energy information presented in statistical graphs.
Applicability: All EIA information products.
Required Actions:

  1. Graphs should be used to show and compare changes, trends and/or relationships, and to assist users in visualizing the conclusions drawn from the data represented.
  2. A graph should contain sufficient Continue reading “Broad Guidelines for Graphs”

Choosing R for business – What to consider?

A composite of the GNU logo and the OSI logo, ...
Image via Wikipedia

Additional features in R over other analytical packages-

1) Source Code is given to ensure complete custom solution and embedding for a particular application. Open source code has an advantage that is extensively peer- reviewed in Journals and Scientific Literature.  This means bugs will found, shared and corrected transparently.

2) Wide literature of training material in the form of books is available for the R analytical platform.

3) Extensively the best data visualization tools in analytical software (apart from Tableau Software ‘s latest version). The extensive data visualization available in R is of the form a variety of customizable graphs, as well as animation. The principal reason third-party software initially started creating interfaces to R is because the graphical library of packages in R is more advanced as well as rapidly getting more features by the day.

4) Free in upfront license cost for academics and thus budget friendly for small and large analytical teams.

5) Flexible programming for your data environment. This includes having packages that ensure compatibility with Java, Python and C++.

 

6) Easy migration from other analytical platforms to R Platform. It is relatively easy for a non R platform user to migrate to R platform and there is no danger of vendor lock-in due to the GPL nature of source code and open community.

Statistics are numbers that tell (descriptive), advise ( prescriptive) or forecast (predictive). Analytics is a decision-making help tool. Analytics on which no decision is to be made or is being considered can be classified as purely statistical and non analytical. Thus ease of making a correct decision separates a good analytical platform from a not so good analytical platform. The distinction is likely to be disputed by people of either background- and business analysis requires more emphasis on how practical or actionable the results are and less emphasis on the statistical metrics in a particular data analysis task. I believe one clear reason between business analytics is different from statistical analysis is the cost of perfect information (data costs in real world) and the opportunity cost of delayed and distorted decision-making.

Specific to the following domains R has the following costs and benefits

  • Business Analytics
    • R is free per license and for download
    • It is one of the few analytical platforms that work on Mac OS
    • It’s results are credibly established in both journals like Journal of Statistical Software and in the work at LinkedIn, Google and Facebook’s analytical teams.
    • It has open source code for customization as per GPL
    • It also has a flexible option for commercial vendors like Revolution Analytics (who support 64 bit windows) as well as bigger datasets
    • It has interfaces from almost all other analytical software including SAS,SPSS, JMP, Oracle Data Mining, Rapid Miner. Existing license holders can thus invoke and use R from within these software
    • Huge library of packages for regression, time series, finance and modeling
    • High quality data visualization packages
    • Data Mining
      • R as a computing platform is better suited to the needs of data mining as it has a vast array of packages covering standard regression, decision trees, association rules, cluster analysis, machine learning, neural networks as well as exotic specialized algorithms like those based on chaos models.
      • Flexibility in tweaking a standard algorithm by seeing the source code
      • The RATTLE GUI remains the standard GUI for Data Miners using R. It was created and developed in Australia.
      • Business Dashboards and Reporting
      • Business Dashboards and Reporting are an essential piece of Business Intelligence and Decision making systems in organizations. R offers data visualization through GGPLOT, and GUI like Deducer and Red-R can help even non R users create a metrics dashboard
        • For online Dashboards- R has packages like RWeb, RServe and R Apache- which in combination with data visualization packages offer powerful dashboard capabilities.
        • R can be combined with MS Excel using the R Excel package – to enable R capabilities to be imported within Excel. Thus a MS Excel user with no knowledge of R can use the GUI within the R Excel plug-in to use powerful graphical and statistical capabilities.

Additional factors to consider in your R installation-

There are some more choices awaiting you now-
1) Licensing Choices-Academic Version or Free Version or Enterprise Version of R

2) Operating System Choices-Which Operating System to choose from? Unix, Windows or Mac OS.

3) Operating system sub choice- 32- bit or 64 bit.

4) Hardware choices-Cost -benefit trade-offs for additional hardware for R. Choices between local ,cluster and cloud computing.

5) Interface choices-Command Line versus GUI? Which GUI to choose as the default start-up option?

6) Software component choice- Which packages to install? There are almost 3000 packages, some of them are complimentary, some are dependent on each other, and almost all are free.

7) Additional Software choices- Which additional software do you need to achieve maximum accuracy, robustness and speed of computing- and how to use existing legacy software and hardware for best complementary results with R.

1) Licensing Choices-
You can choose between two kinds of R installations – one is free and open source from http://r-project.org The other R installation is commercial and is offered by many vendors including Revolution Analytics. However there are other commercial vendors too.

Commercial Vendors of R Language Products-
1) Revolution Analytics http://www.revolutionanalytics.com/
2) XL Solutions- http://www.experience-rplus.com/
3) Information Builder – Webfocus RStat -Rattle GUI http://www.informationbuilders.com/products/webfocus/PredictiveModeling.html
4) Blue Reference- Inference for R http://inferenceforr.com/default.aspx

  1. Choosing Operating System
      1. Windows

 

Windows remains the most widely used operating system on this planet. If you are experienced in Windows based computing and are active on analytical projects- it would not make sense for you to move to other operating systems. This is also based on the fact that compatibility problems are minimum for Microsoft Windows and the help is extensively documented. However there may be some R packages that would not function well under Windows- if that happens a multiple operating system is your next option.

        1. Enterprise R from Revolution Analytics- Enterprise R from Revolution Analytics has a complete R Development environment for Windows including the use of code snippets to make programming faster. Revolution is also expected to make a GUI available by 2011. Revolution Analytics claims several enhancements for it’s version of R including the use of optimized libraries for faster performance.
      1. MacOS

 

Reasons for choosing MacOS remains its considerable appeal in aesthetically designed software- but MacOS is not a standard Operating system for enterprise systems as well as statistical computing. However open source R claims to be quite optimized and it can be used for existing Mac users. However there seem to be no commercially available versions of R available as of now for this operating system.

      1. Linux

 

        1. Ubuntu
        2. Red Hat Enterprise Linux
        3. Other versions of Linux

 

Linux is considered a preferred operating system by R users due to it having the same open source credentials-much better fit for all R packages and it’s customizability for big data analytics.

Ubuntu Linux is recommended for people making the transition to Linux for the first time. Ubuntu Linux had an marketing agreement with revolution Analytics for an earlier version of Ubuntu- and many R packages can  installed in a straightforward way as Ubuntu/Debian packages are available. Red Hat Enterprise Linux is officially supported by Revolution Analytics for it’s enterprise module. Other versions of Linux popular are Open SUSE.

      1. Multiple operating systems-
        1. Virtualization vs Dual Boot-

 

You can also choose between having a VMware VM Player for a virtual partition on your computers that is dedicated to R based computing or having operating system choice at the startup or booting of your computer. A software program called wubi helps with the dual installation of Linux and Windows.

  1. 64 bit vs 32 bit – Given a choice between 32 bit versus 64 bit versions of the same operating system like Linux Ubuntu, the 64 bit version would speed up processing by an approximate factor of 2. However you need to check whether your current hardware can support 64 bit operating systems and if so- you may want to ask your Information Technology manager to upgrade atleast some operating systems in your analytics work environment to 64 bit operating systems.

 

  1. Hardware choices- At the time of writing this book, the dominant computing paradigm is workstation computing followed by server-client computing. However with the introduction of cloud computing, netbooks, tablet PCs, hardware choices are much more flexible in 2011 than just a couple of years back.

Hardware costs are a significant cost to an analytics environment and are also  remarkably depreciated over a short period of time. You may thus examine your legacy hardware, and your future analytical computing needs- and accordingly decide between the various hardware options available for R.
Unlike other analytical software which can charge by number of processors, or server pricing being higher than workstation pricing and grid computing pricing extremely high if available- R is well suited for all kinds of hardware environment with flexible costs. Given the fact that R is memory intensive (it limits the size of data analyzed to the RAM size of the machine unless special formats and /or chunking is used)- it depends on size of datasets used and number of concurrent users analyzing the dataset. Thus the defining issue is not R but size of the data being analyzed.

    1. Local Computing- This is meant to denote when the software is installed locally. For big data the data to be analyzed would be stored in the form of databases.
      1. Server version- Revolution Analytics has differential pricing for server -client versions but for the open source version it is free and the same for Server or Workstation versions.
      2. Workstation
    2. Cloud Computing- Cloud computing is defined as the delivery of data, processing, systems via remote computers. It is similar to server-client computing but the remote server (also called cloud) has flexible computing in terms of number of processors, memory, and data storage. Cloud computing in the form of public cloud enables people to do analytical tasks on massive datasets without investing in permanent hardware or software as most public clouds are priced on pay per usage. The biggest cloud computing provider is Amazon and many other vendors provide services on top of it. Google is also coming for data storage in the form of clouds (Google Storage), as well as using machine learning in the form of API (Google Prediction API)
      1. Amazon
      2. Google
      3. Cluster-Grid Computing/Parallel processing- In order to build a cluster, you would need the RMpi and the SNOW packages, among other packages that help with parallel processing.
    3. How much resources
      1. RAM-Hard Disk-Processors- for workstation computing
      2. Instances or API calls for cloud computing
  1. Interface Choices
    1. Command Line
    2. GUI
    3. Web Interfaces
  2. Software Component Choices
    1. R dependencies
    2. Packages to install
    3. Recommended Packages
  3. Additional software choices
    1. Additional legacy software
    2. Optimizing your R based computing
    3. Code Editors
      1. Code Analyzers
      2. Libraries to speed up R

citation-  R Development Core Team (2010). R: A language and environment for statistical computing. R Foundation for Statistical Computing,Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org.

(Note- this is a draft in progress)