Interview Dan Steinberg Founder Salford Systems

Here is an interview with Dan Steinberg, Founder and President of Salford Systems (http://www.salford-systems.com/).

Ajay- Describe your journey from academia to technology entrepreneurship. What are the key milestones or turning points that you remember?

Dan- When I was in graduate school studying econometrics at Harvard, a number of distinguished professors at Harvard (and MIT) were actively involved in substantial real world activities. Professors whom I interacted with, studied with, or whose software I used became involved in the creation of such companies as Sun Microsystems and Data Resources, Inc., or were heavily involved in business consulting through their own companies or other influential consultants. Some not involved in private sector consulting took on substantial roles in government, such as membership on the President’s Council of Economic Advisers. The atmosphere was one that encouraged free movement between academia and the private sector, so the idea of forming a consulting and software company was quite natural and did not seem in any way inconsistent with being devoted to the advancement of science.

Ajay- What are the latest products by Salford Systems? Any future product plans or modifications to work on Big Data analytics, mobile computing and cloud computing?

Dan- Our central data mining technologies are CART, MARS, TreeNet, RandomForests, and PRIM, and we have always maintained feature-rich logistic regression and linear regression modules. In our latest release, scheduled for January 2012, we will be including a new data mining approach to linear and logistic regression allowing for the rapid processing of massive numbers of predictors (e.g., one million columns), with powerful predictor selection and coefficient shrinkage. The new methods allow not only classic techniques such as ridge and lasso regression, but also sub-lasso model sizes. Clear tradeoff diagrams between model complexity (number of predictors) and predictive accuracy allow the modeler to select an ideal balance suitable for their requirements.

The new version of our data mining suite, Salford Predictive Modeler (SPM), also includes two important extensions to the boosted tree technology at the heart of TreeNet. The first, Importance Sampled Learning Ensembles (ISLE), is used for the compression of TreeNet tree ensembles. Starting with, say, a 1,000 tree ensemble, the ISLE compression might well reduce this down to 200 reweighted trees. Such compression will be valuable when models need to be executed in real time. The compression rate is always under the modeler’s control, meaning that if a deployed model may only contain, say, 30 trees, then the compression will deliver an optimal 30-tree weighted ensemble. Needless to say, compression of tree ensembles should be expected to be lossy, and how much accuracy is lost when extreme compression is desired will vary from case to case. Prior to ISLE, practitioners have simply truncated the ensemble to the maximum allowable size. The new methodology will substantially outperform truncation.
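To make the idea of reweighting versus truncating concrete, here is a minimal sketch in Python. It is not Salford's implementation: it assumes a toy boosted ensemble of one-split stumps on made-up data, and it compresses the ensemble by running a lasso regression on the per-tree predictions, which is one common way ISLE-style post-processing is described. All function names, the penalty value and the data are hypothetical choices for illustration.

```python
import math
import random

def fit_stump(xs, residuals):
    """Return (threshold, left_value, right_value) minimising squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:  # dropping the largest x keeps both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    return best[1:]

def boost(xs, ys, n_trees, rate):
    """Toy gradient boosting with stumps; a stand-in for a real tree ensemble."""
    preds, stumps = [0.0] * len(xs), []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        t, lm, rm = fit_stump(xs, residuals)
        stumps.append((t, lm, rm))
        preds = [p + rate * (lm if x <= t else rm) for x, p in zip(xs, preds)]
    return stumps

def soft_threshold(z, lam):
    """Shrink z towards zero by lam, returning zero if |z| <= lam."""
    return math.copysign(max(abs(z) - lam, 0.0), z)

def lasso_weights(cols, ys, lam, sweeps=200):
    """Coordinate-descent lasso of ys on the per-tree prediction columns."""
    n = len(ys)
    w = [0.0] * len(cols)
    resid = list(ys)
    scale = [sum(v * v for v in c) / n for c in cols]
    for _ in range(sweeps):
        for j, c in enumerate(cols):
            if scale[j] == 0.0:
                continue
            # Correlation of column j with the residual that excludes column j.
            rho = sum(v * (r + w[j] * v) for v, r in zip(c, resid)) / n
            new = soft_threshold(rho, lam) / scale[j]
            resid = [r - (new - w[j]) * v for v, r in zip(c, resid)]
            w[j] = new
    return w

def mse(cols, weights, ys):
    """Mean squared error of the weighted sum of tree predictions."""
    preds = [sum(w * c[i] for w, c in zip(weights, cols)) for i in range(len(ys))]
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

random.seed(0)
xs = [i / 60 for i in range(60)]
ys = [math.sin(2 * math.pi * x) + random.gauss(0, 0.1) for x in xs]
RATE, N_TREES = 0.1, 40
stumps = boost(xs, ys, N_TREES, RATE)
cols = [[lm if x <= t else rm for x in xs] for t, lm, rm in stumps]

weights = lasso_weights(cols, ys, lam=0.01)       # reweighted, sparse ensemble
kept = sum(1 for w in weights if w != 0.0)
truncated = [RATE if j < kept else 0.0 for j in range(N_TREES)]  # first `kept` trees as-is
print("trees kept:", kept, "of", N_TREES)
print("reweighted MSE:", mse(cols, weights, ys))
print("truncated MSE: ", mse(cols, truncated, ys))
```

Raising the penalty `lam` keeps fewer trees, which mirrors the compression rate being under the modeler's control. The printed errors are training errors on toy data, so they only illustrate the mechanics and say nothing about how a real TreeNet model would behave.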

The second major advance is RULEFIT, a rule extraction engine that starts with a TreeNet model and decomposes it into the most interesting and predictive rules. RULEFIT is also a tree ensemble post-processor and offers the possibility of improving on the original TreeNet predictive performance. One can think of the rule extraction as an alternative way to explain and interpret an otherwise complex multi-tree model. The rules extracted are similar conceptually to the terminal nodes of a CART tree but the various rules will not refer to mutually exclusive regions of the data.
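The point about rules not being mutually exclusive can be illustrated with a small Python sketch. This is not the RULEFIT engine itself: it assumes two tiny hand-written trees (hypothetical features `age` and `income`), and it follows the published RuleFit idea of turning every non-root node of every tree into a rule, so a single record can satisfy several rules at once.

```python
def extract_rules(node, path=()):
    """Yield one rule (a tuple of conditions) for every non-root node."""
    if path:
        yield path
    if "feature" in node:  # internal node: descend into both children
        f, t = node["feature"], node["threshold"]
        yield from extract_rules(node["left"], path + ((f, "<=", t),))
        yield from extract_rules(node["right"], path + ((f, ">", t),))

def satisfies(record, rule):
    """True if the record meets every condition in the rule."""
    return all(record[f] <= t if op == "<=" else record[f] > t
               for f, op, t in rule)

# Two hypothetical trees; empty dicts stand for terminal nodes.
tree_a = {"feature": "age", "threshold": 30,
          "left": {"feature": "income", "threshold": 50, "left": {}, "right": {}},
          "right": {}}
tree_b = {"feature": "income", "threshold": 40, "left": {}, "right": {}}

rules = [r for tree in (tree_a, tree_b) for r in extract_rules(tree)]
record = {"age": 25, "income": 60}
matched = [r for r in rules if satisfies(record, r)]

for rule in matched:
    print(" and ".join(f"{f} {op} {t}" for f, op, t in rule))
```

The record lands in exactly one terminal node of each tree, yet it matches three of the six extracted rules, because rules from internal nodes and from different trees overlap.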

Ajay- You have led teams that have won multiple data mining competitions. What are some of your favorite techniques or approaches to a data mining problem?

Dan- We only enter competitions involving problems for which our technology is suitable, generally, classification and regression. In these areas, we are partial to TreeNet because it is such a capable and robust learning machine. However, we always find great value in analyzing many aspects of a data set with CART, especially when we require a compact and easy to understand story about the data. CART is exceptionally well suited to the discovery of errors in data, often revealing errors created by the competition organizers themselves. More than once, our reports of data problems have been responsible for the competition organizer’s decision to issue a corrected version of the data and we have been the only group to discover the problem.

In general, tackling a data mining competition is no different than tackling any analytical challenge. You must start with a solid conceptual grasp of the problem and the actual objectives, and the nature and limitations of the data. Following that comes feature extraction, the selection of a modeling strategy (or strategies), and then extensive experimentation to learn what works best.

Ajay- I know you have created your own software. But is there other software that you use or like to use?

Dan- For analytics we frequently test open source software to make sure that our tools will in fact deliver the superior performance we advertise. In general, if a problem clearly requires technology other than that offered by Salford, we advise clients to seek other consultants expert in that other technology.

Ajay- Your software is installed at 3500 sites including 400 universities, as per http://www.salford-systems.com/company/aboutus/index.html. What is the key to managing and keeping so many customers happy?

Dan- First, we have taken great pains to make our software reliable and we make every effort to avoid problems related to bugs. Our testing procedures are extensive and we have experts dedicated to stress-testing software. Second, our interface is designed to be natural, intuitive, and easy to use, so the challenges to the new user are minimized. Also, clear documentation, help files, and training videos round out how we allow the user to look after themselves. Should a client need to contact us, we try to achieve 24-hour turnaround on tech support issues and monitor all tech support activity to ensure timeliness, accuracy, and helpfulness of our responses. WebEx/GoToMeeting and other internet-based contact permit real time interaction.

Ajay- What do you do to relax and unwind?

Dan- I am in the gym almost every day combining weight and cardio training. No matter how tired I am before the workout I always come out energized so locating a good gym during my extensive travels is a must. I am also actively learning Portuguese so I look to watch a Brazilian TV show or Portuguese dubbed movie when I have time; I almost never watch any form of video unless it is available in Portuguese.

Biography-

http://www.salford-systems.com/blog/dan-steinberg.html

Dan Steinberg, President and Founder of Salford Systems, is a well-respected member of the statistics and econometrics communities. In 1992, he developed the first PC-based implementation of the original CART procedure, working in concert with Leo Breiman, Richard Olshen, Charles Stone and Jerome Friedman. In addition, he has provided consulting services on a number of biomedical and market research projects, which have sparked further innovations in the CART program and methodology.

Dr. Steinberg received his Ph.D. in Economics from Harvard University, and has given full day presentations on data mining for the American Marketing Association, the Direct Marketing Association and the American Statistical Association. After earning a Ph.D. in Econometrics at Harvard, Steinberg began his professional career as a Member of the Technical Staff at Bell Labs, Murray Hill, and then as Assistant Professor of Economics at the University of California, San Diego. A book he co-authored on Classification and Regression Trees was awarded the 1999 Nikkei Quality Control Literature Prize in Japan for excellence in statistical literature promoting the improvement of industrial quality control and management.

His consulting experience at Salford Systems has included complex modeling projects for major banks worldwide, including Citibank, Chase, American Express, and Credit Suisse, and has included projects in Europe, Australia, New Zealand, Malaysia, Korea, Japan and Brazil. Steinberg led the teams that won first place awards in the KDDCup 2000 and the 2002 Duke/TeraData Churn modeling competition, and the teams that won awards in the PAKDD competitions of 2006 and 2007. He has published papers in economics, econometrics, and computer science journals, and contributes actively to the ongoing research and development at Salford.

#SAS 9.3 and #Rstats 2.13.1 Released

A bit early but the latest editions of both SAS and R were released last week.

SAS 9.3 is clearly a major release, with multiple enhancements to keep SAS relevant in enterprise software in the age of big data. There are also many enhancements specific to R, to JMP, and to partners such as Teradata.

http://support.sas.com/software/93/index.html

Features

Data management

Enhanced manageability for improved performance
In-database processing (EL-T pushdown)
Enhanced performance for loading Oracle data
New ET-L transforms
Data access

Data quality

SAS® Data Integration Server includes DataFlux® Data Management Platform for enhanced data quality
Master Data Management (DataFlux® qMDM)
- Provides support for master hub of trusted entity data.

Analytics

SAS® Enterprise Miner™
- New survival analysis predicts when an event will happen, not just if it will happen.
- New rate making capability for insurance predicts optimal insurance premium for individuals based on attributes known at application time.
- Time Series Data Mining node (experimental) applies data mining techniques to transactional, time-stamped data.
- Support Vector Machines node (experimental) provides a supervised machine learning method for prediction and classification.
SAS® Forecast Server
- SAS Forecast Server is integrated with the SAP APO Demand Planning module to provide SAP users with access to a superior forecasting engine and automatic forecasting capabilities.
SAS® Model Manager
- Seamless integration of R models with the ability to register and manage R models in SAS Model Manager.
- Ability to perform champion/challenger side-by-side comparisons between SAS and R models to see which model performs best for a specific need.
SAS/OR® and SAS® Simulation Studio
- Optimization
- Simulation
  - Automatic input distribution fitting using JMP with SAS Simulation Studio.

Text analytics

SAS® Text Miner
SAS® Enterprise Content Categorization
SAS® Sentiment Analysis

Scalability and high-performance

SAS® Analytics Accelerator for Teradata (new product)
SAS® Grid Manager

And the latest from http://www.r-project.org/. I was a bit curious to know why the licensing for R has now changed (from GPL-2 to GPL-2 | GPL-3).

http://www.gnu.org/licenses/gpl-2.0.html

and http://gplv3.fsf.org/

and http://www.gnu.org/licenses/quick-guide-gplv3.html

LICENCE:

• No parts of R are now licensed solely under GPL-2. The licences for packages rpart and survival have been changed, which means that the licence terms for R as distributed are GPL-2 | GPL-3.

https://stat.ethz.ch/pipermail/r-announce/2011/000541.html

This is a maintenance release to consolidate various minor fixes to 2.13.0.

CHANGES IN R VERSION 2.13.1:

  NEW FEATURES:

    • iconv() no longer translates NA strings as "NA".

    • persp(box = TRUE) now warns if the surface extends outside the
      box (since occlusion for the box and axes is computed assuming
      the box is a bounding box). (PR#202.)

    • RShowDoc() can now display the licences shipped with R, e.g.
      RShowDoc("GPL-3").

    • New wrapper function showNonASCIIfile() in package tools.

    • nobs() now has a "mle" method in package stats4.

    • trace() now deals correctly with S4 reference classes and
      corresponding reference methods (e.g., $trace()) have been added.

    • xz has been updated to 5.0.3 (very minor bugfix release).

    • tools::compactPDF() gets more compression (usually a little,
      sometimes a lot) by using the compressed object streams of PDF
      1.5.

    • cairo_ps(onefile = TRUE) generates encapsulated EPS on platforms
      with cairo >= 1.6.

    • Binary reads (e.g. by readChar() and readBin()) are now supported
      on clipboard connections.  (Wish of PR#14593.)

    • as.POSIXlt.factor() now passes ... to the character method
      (suggestion of Joshua Ulrich).  [Intended for R 2.13.0 but
      accidentally removed before release.]

    • vector() and its wrappers such as integer() and double() now warn
      if called with a length argument of more than one element.  This
      helps track down user errors such as calling double(x) instead of
      as.double(x).

  INSTALLATION:

    • Building the vignette PDFs in packages grid and utils is now part
      of running make from an SVN checkout on a Unix-alike: a separate
      make vignettes step is no longer required.

      These vignettes are now made with keep.source = TRUE and hence
      will be laid out differently.

    • make install-strip failed under some configuration options.

    • Packages can customize non-standard installation of compiled code
      via a src/install.libs.R script. This allows packages that have
      architecture-specific binaries (beyond the package's shared
      objects/DLLs) to be installed in a multi-architecture setting.

  SWEAVE & VIGNETTES:

    • Sweave() and Stangle() gain an encoding argument to specify the
      encoding of the vignette sources if the latter do not contain a
      \usepackage[]{inputenc} statement specifying a single input
      encoding.

    • There is a new Sweave option figs.only = TRUE to run each figure
      chunk only for each selected graphics device, and not first using
      the default graphics device.  This will become the default in R
      2.14.0.

    • Sweave custom graphics devices can have a custom function
      foo.off() to shut them down.

    • Warnings are issued when non-portable filenames are found for
      graphics files (and chunks if split = TRUE).  Portable names are
      regarded as alphanumeric plus hyphen, underscore, plus and hash
      (periods cause problems with recognizing file extensions).

    • The Rtangle() driver has a new option show.line.nos which is by
      default false; if true it annotates code chunks with a comment
      giving the line number of the first line in the sources (the
      behaviour of R >= 2.12.0).

    • Package installation tangles the vignette sources: this step now
      converts the vignette sources from the vignette/package encoding
      to the current encoding, and records the encoding (if not ASCII)
      in a comment line at the top of the installed .R file.

  DEPRECATED AND DEFUNCT:

    • The internal functions .readRDS() and .saveRDS() are now
      deprecated in favour of the public functions readRDS() and
      saveRDS() introduced in R 2.13.0.

    • Switching off lazy-loading of code _via_ the LazyLoad field of
      the DESCRIPTION file is now deprecated.  In future all packages
      will be lazy-loaded.

    • The off-line help() types "postscript" and "ps" are deprecated.

  UTILITIES:

    • R CMD check on a multi-architecture installation now skips the
      user's .Renviron file for the architecture-specific tests (which
      do read the architecture-specific Renviron.site files).  This is
      consistent with single-architecture checks, which use
      --no-environ.

    • R CMD build now looks for DESCRIPTION fields BuildResaveData and
      BuildKeepEmpty for per-package overrides.  See ‘Writing R
      Extensions’.

  BUG FIXES:

    • plot.lm(which = 5) was intended to order factor levels in
      increasing order of mean standardized residual.  It ordered the
      factor labels correctly, but could plot the wrong group of
      residuals against the label.  (PR#14545)

    • mosaicplot() could clip the factor labels, and could overlap them
      with the cells if a non-default value of cex.axis was used.
      (Related to PR#14550.)

    • dataframe[[row,col]] now dispatches on [[ methods for the
      selected column (spotted by Bill Dunlap).

    • sort.int() would strip the class of an object, but leave its
      object bit set.  (Reported by Bill Dunlap.)

    • pbirthday() and qbirthday() did not implement the algorithm
      exactly as given in their reference and so were unnecessarily
      inaccurate.

      pbirthday() now solves the approximate formula analytically
      rather than using uniroot() on a discontinuous function.

      The description of the problem was inaccurate: the probability is
      a tail probability (‘2 _or more_ people share a birthday’).

    • Complex arithmetic sometimes warned incorrectly about producing
      NAs when there were NaNs in the input.

    • seek(origin = "current") incorrectly reported it was not
      implemented for a gzfile() connection.

    • c(), unlist(), cbind() and rbind() could silently overflow the
      maximum vector length and cause a segfault.  (PR#14571)

    • The fonts argument to X11(type = "Xlib") was being ignored.

    • Reading (e.g. with readBin()) from a raw connection was not
      advancing the pointer, so successive reads would read the same
      value.  (Spotted by Bill Dunlap.)

    • Parsed text containing embedded newlines was printed incorrectly
      by as.character.srcref().  (Reported by Hadley Wickham.)

    • decompose() used with a series of a non-integer number of periods
      returned a seasonal component shorter than the original series.
      (Reported by Rob Hyndman.)

    • fields = list() failed for setRefClass().  (Reported by Michael
      Lawrence.)

    • Reference classes could not redefine an inherited field which had
      class "ANY". (Reported by Janko Thyson.)

    • Methods that override previously loaded versions will now be
      installed and called.  (Reported by Iago Mosqueira.)

    • addmargins() called numeric(apos) rather than
      numeric(length(apos)).

    • The HTML help search sometimes produced bad links.  (PR#14608)

    • Command completion will no longer be broken if tail.default() is
      redefined by the user. (Problem reported by Henrik Bengtsson.)

    • LaTeX rendering of markup in titles of help pages has been
      improved; in particular, \eqn{} may be used there.

    • isClass() used its own namespace as the default of the where
      argument inadvertently.

    • Rd conversion to latex mis-handled multi-line titles (including
      cases where there was a blank line in the \title section).

Also see this interesting blog

http://sas-and-r.blogspot.com/

Examples of tasks replicated in SAS and R

Changes in R software

The newest version of R is now available for download. R 2.13 is ready!!

http://cran.at.r-project.org/bin/windows/base/CHANGES.R-2.13.0.html

Windows-specific changes to R

CHANGES IN R VERSION 2.13.0

WINDOWS VERSION

Windows 2000 is no longer supported. (It went end-of-life in July 2010.)

NEW FEATURES

win_iconv has been updated: this version has a change in the behaviour with BOMs on UTF-16 and UTF-32 files – it removes BOMs when reading and adds them when writing. (This is consistent with Microsoft applications, but Unix versions of iconv usually ignore them.)
Support for repository type win64.binary (used for 64-bit Windows binaries for R 2.11.x only) has been removed.
The installers no longer put an ‘Uninstall’ item on the start menu (to conform to current Microsoft UI guidelines).
Running R always sets the environment variable R_ARCH (as it does on a Unix-alike from the shell-script front-end).
The defaults for options("browser") and options("pdfviewer") are now set from environment variables R_BROWSER and R_PDFVIEWER respectively (as on a Unix-alike). A value of "false" suppresses display (even if there is no false.exe present on the path).
If options("install.lock") is set to TRUE, binary package installs are protected against failure similar to the way source package installs are protected.
file.exists() and unlink() have more support for files > 2GB.
The versions of R.exe in ‘R_HOME/bin/i386,x64/bin’ now support options such as R --vanilla CMD: there is no comparable interface for ‘Rcmd.exe’.
A few more file operations will now work with >2GB files.
The environment variable R_HOME in an R session now uses slash as the path separator (as it always has when set by Rcmd.exe).
Rgui has a new menu item for the PDF ‘Sweave User Manual’.

DEPRECATED

zip.unpack() is deprecated: use unzip().

INSTALLATION

There is support for libjpeg-turbo via setting JPEGDIR to that value in ‘MkRules.local’.
Support for jpeg-6b has been removed.
The sources now work with libpng-1.5.1, jpegsrc.v8c (which are used in the CRAN builds) and tiff-4.0.0beta6 (CRAN builds use 3.9.1). It is possible that they no longer work with older versions than libpng-1.4.5.

BUG FIXES

Workaround for the incorrect values given by Windows’ casinh function on the branch cuts.
Bug fixes for drawing raster objects on windows(). The symptom was the occasional raster image not being drawn, especially when drawing multiple raster images in a single expression. Thanks to Michael Sumner for report and testing.
Printing extremely long string values could overflow the stack and cause the GUI to crash. (PR#14543)

Tonnes of changes!!

http://cran.at.r-project.org/src/base/NEWS

CHANGES IN R VERSION 2.13.0:

  SIGNIFICANT USER-VISIBLE CHANGES:

    • replicate() (by default) and vapply() (always) now return a
      higher-dimensional array instead of a matrix in the case where
      the inner function value is an array of dimension >= 2.

    • Printing and formatting of floating point numbers is now using
      the correct number of digits, where it previously rarely differed
      by a few digits. (See “scientific” entry below.)  This affects
      _many_ *.Rout.save checks in packages.

  NEW FEATURES:

    • normalizePath() has been moved to the base package (from utils):
      this is so it can be used by library() and friends.

      It now does tilde expansion.

      It gains new arguments winslash (to select the separator on
      Windows) and mustWork to control the action if a canonical path
      cannot be found.

    • The previously barely documented limit of 256 bytes on a symbol
      name has been raised to 10,000 bytes (a sanity check).  Long
      symbol names can sometimes occur when deparsing expressions (for
      example, in model.frame).

    • reformulate() gains an intercept argument.

    • cmdscale(add = FALSE) now uses the more common definition that
      there is a representation in n-1 or less dimensions, and only
      dimensions corresponding to positive eigenvalues are used.
      (Avoids confusion such as PR#14397.)

    • Names used by c(), unlist(), cbind() and rbind() are marked with
      an encoding when this can be ascertained.

    • R colours are now defined to refer to the sRGB color space.

      The PDF, PostScript, and Quartz graphics devices record this
      fact.  X11 (and Cairo) and Windows just assume that your screen
      conforms.

    • system.file() gains a mustWork argument (suggestion of Bill
      Dunlap).

    • new.env(hash = TRUE) is now the default.

    • list2env(envir = NULL) defaults to hashing (with a suitably sized
      environment) for lists of more than 100 elements.

    • text() gains a formula method.

    • IQR() now has a type argument which is passed to quantile().

    • as.vector(), as.double() etc duplicate less when they leave the
      mode unchanged but remove attributes.

      as.vector(mode = "any") no longer duplicates when it does not
      remove attributes.  This helps memory usage in matrix() and
      array().

      matrix() duplicates less if data is an atomic vector with
      attributes such as names (but no class).

      dim(x) <- NULL duplicates less if x has neither dimensions nor
      names (since this operation removes names and dimnames).

    • setRepositories() gains an addURLs argument.

    • chisq.test() now also returns a stdres component, for
      standardized residuals (which have unit variance, unlike the
      Pearson residuals).

    • write.table() and friends gain a fileEncoding argument, to
      simplify writing files for use on other OSes (e.g. a spreadsheet
      intended for Windows or Mac OS X Excel).

    • Assignment expressions of the form foo::bar(x) <- y and
      foo:::bar(x) <- y now work; the replacement functions used are
      foo::`bar<-` and foo:::`bar<-`.

    • Sys.getenv() gains a names argument so Sys.getenv(x, names =
      FALSE) can replace the common idiom of as.vector(Sys.getenv()).
      The default has been changed to not name a length-one result.

    • Lazy loading of environments now preserves attributes and locked
      status. (The locked status of bindings and active bindings are
      still not preserved; this may be addressed in the future).

    • options("install.lock") may be set to FALSE so that
      install.packages() defaults to --no-lock installs, or (on
      Windows) to TRUE so that binary installs implement locking.

    • sort(partial = p) for large p now tries Shellsort if quicksort is
      not appropriate and so works for non-numeric atomic vectors.

    • sapply() gets a new option simplify = "array" which returns a
      “higher rank” array instead of just a matrix when FUN() returns a
      dim() length of two or more.

      replicate() has this option set by default, and vapply() now
      behaves that way internally.

    • aperm() becomes S3 generic and gets a table method which
      preserves the class.

    • merge() and as.hclust() methods for objects of class "dendrogram"
      are now provided.

    • as.POSIXlt.factor() now passes ... to the character method
      (suggestion of Joshua Ulrich).

    • The character method of as.POSIXlt() now tries to find a format
      that works for all non-NA inputs, not just the first one.

    • str() now has a method for class "Date" analogous to that for
      class "POSIXt".

    • New function file.link() to create hard links on those file
      systems (POSIX, NTFS but not FAT) that support them.

    • New Summary() group method for class "ordered" implements min(),
      max() and range() for ordered factors.

    • mostattributes<-() now consults the "dim" attribute and not the
      dim() function, making it more useful for objects (such as data
      frames) from classes with methods for dim().  It also uses
      attr<-() in preference to the generics name<-(), dim<-() and
      dimnames<-().  (Related to PR#14469.)

    • There is a new option "browserNLdisabled" to disable the use of
      an empty (e.g. via the ‘Return’ key) as a synonym for c in
      browser() or n under debug().  (Wish of PR#14472.)

    • example() gains optional new arguments character.only and
      give.lines enabling programmatic exploration.

    • serialize() and unserialize() are no longer described as
      ‘experimental’.  The interface is now regarded as stable,
      although the serialization format may well change in future
      releases.  (serialize() has a new argument version which would
      allow the current format to be written if that happens.)

      New functions saveRDS() and readRDS() are public versions of the
      ‘internal’ functions .saveRDS() and .readRDS() made available for
      general use.  The dot-name versions remain available as several
      package authors have made use of them, despite the documentation.

      saveRDS() supports compress = "xz".

    • Many functions when called with a not-open connection will now
      ensure that the connection is left not-open in the event of
      error.  These include read.dcf(), dput(), dump(), load(),
      parse(), readBin(), readChar(), readLines(), save(), writeBin(),
      writeChar(), writeLines(), .readRDS(), .saveRDS() and
      tools::parse_Rd(), as well as functions calling these.

    • Public functions find.package() and path.package() replace the
      internal dot-name versions.

    • The default method for terms() now looks for a "terms" attribute
      if it does not find a "terms" component, and so works for model
      frames.

    • httpd() handlers receive an additional argument containing the
      full request headers as a raw vector (this can be used to parse
      cookies, multi-part forms etc.). The recommended full signature
      for handlers is therefore function(url, query, body, headers,
      ...).

    • file.edit() gains a fileEncoding argument to specify the encoding
      of the file(s).

    • The format of the HTML package listings has changed.  If there is
      more than one library tree, a table of links to libraries is
      provided at the top and bottom of the page.  Where a library
      contains more than 100 packages, an alphabetic index is given at
      the top of the section for that library.  (As a consequence,
      package names are now sorted case-insensitively whatever the
      locale.)

    • isSeekable() now returns FALSE on connections which have
      non-default encoding.  Although documented to record if ‘in
      principle’ the connection supports seeking, it seems safer to
      report FALSE when it may not work.

    • R CMD REMOVE and remove.packages() now remove file R.css when
      removing all remaining packages in a library tree.  (Related to
      the wish of PR#14475: note that this file is no longer
      installed.)

    • unzip() now has a unzip argument like zip.file.extract().  This
      allows an external unzip program to be used, which can be useful
      to access features supported by Info-ZIP's unzip version 6 which
      is now becoming more widely available.

    • There is a simple zip() function, as wrapper for an external zip
      command.

    • bzfile() connections can now read from concatenated bzip2 files
      (including files written with bzfile(open = "a")) and files
      created by some other compressors (such as the example of
      PR#14479).

    • The primitive function c() is now of type BUILTIN.

    • plot(<dendrogram>, .., nodePar=*) now obeys an optional xpd
      specification (allowing clipping to be turned off completely).

    • nls(algorithm="port") now shares more code with nlminb(), and is
      more consistent with the other nls() algorithms in its return
      value.

    • xz has been updated to 5.0.1 (very minor bugfix release).

    • image() has gained a logical useRaster argument allowing it to
      use a bitmap raster for plotting a regular grid instead of
      polygons. This can be more efficient, but may not be supported by
      all devices. The default is FALSE.

    • list.files()/dir() gains a new argument include.dirs() to include
      directories in the listing when recursive = TRUE.

    • New function list.dirs() lists all directories, (even empty
      ones).

    • file.copy() now (by default) copies read/write/execute
      permissions on files, moderated by the current setting of
      Sys.umask().

    • Sys.umask() now accepts mode = NA and returns the current umask
      value (visibly) without changing it.

    • There is a ! method for classes "octmode" and "hexmode": this
      allows xor(a, b) to work if both a and b are from one of those
      classes.

    • as.raster() no longer fails for vectors or matrices containing
      NAs.

    • New hook "before.new.plot" allows functions to be run just before
      advancing the frame in plot.new, which is potentially useful for
      custom figure layout implementations.

    • Package tools has a new function compactPDF() to try to reduce
      the size of PDF files _via_ qpdf or gs.

    • tar() has a new argument extra_flags.

    • dotchart() accepts more general objects x such as 1D tables which
      can be coerced by as.numeric() to a numeric vector, with a
      warning since that might not be appropriate.

    • The previously internal function create.post() is now exported
      from utils, and the documentation for bug.report() and
      help.request() now refer to that for create.post().

      It has a new method = "mailto" on Unix-alikes similar to that on
      Windows: it invokes a default mailer via open (Mac OS X) or
      xdg-open or the default browser (elsewhere).

      The default for ccaddress is now getOption("ccaddress") which is
      by default unset: using the username as a mailing address
      nowadays rarely works as expected.

    • The default for options("mailer") is now "mailto" on all
      platforms.

    • unlink() now does tilde-expansion (like most other file
      functions).

    • file.rename() now allows vector arguments (of the same length).

    • The "glm" method for logLik() now returns an "nobs" attribute
      (which stats4::BIC() assumed it did).

      The "nls" method for logLik() gave incorrect results for zero
      weights.

    • There is a new generic function nobs() in package stats, to
      extract from model objects a suitable value for use in BIC
      calculations.  An S4 generic derived from it is defined in
      package stats4.

    • Code for S4 reference-class methods is now examined for possible
      errors in non-local assignments.

    • findClasses, getGeneric, findMethods and hasMethods are revised
      to deal consistently with the package= argument and be consistent
      with soft namespace policy for finding objects.

    • tools::Rdiff() now has the option to return not only the status
      but a character vector of observed differences (which are still
      by default sent to stdout).

    • The startup environment variables R_ENVIRON_USER, R_ENVIRON,
      R_PROFILE_USER and R_PROFILE are now treated more consistently.
      In all cases an empty value is considered to be set and will stop
      the default being used, and for the last two tilde expansion is
      performed on the file name.  (Note that setting an empty value is
      probably impossible on Windows.)

    • Using R --no-environ CMD, R --no-site-file CMD or R
      --no-init-file CMD sets environment variables so these settings
      are passed on to child R processes, notably those run by INSTALL,
      check and build. R --vanilla CMD sets these three options (but
      not --no-restore).

    • smooth.spline() is somewhat faster.  With cv=NA it allows some
      leverage computations to be skipped.

    • The internal (C) function scientific(), at the heart of R's
      format.info(x), format(x), print(x), etc, for numeric x, has been
      re-written in order to provide slightly more correct results,
      fixing PR#14491, notably in border cases including when digits >=
      16, thanks to substantial contributions (code and experiments)
      from Petr Savicky.  This affects a noticeable amount of numeric
      output from R.

    • A new function grepRaw() has been introduced for finding subsets
      of raw vectors. It supports both literal searches and regular
      expressions.

    • Package compiler is now provided as a standard package.  See
      ?compiler::compile for information on how to use the compiler.
      This package implements a byte code compiler for R: by default
      the compiler is not used in this release.  See the ‘R
      Installation and Administration Manual’ for how to compile the
      base and recommended packages.

    • Providing an exportPattern directive in a NAMESPACE file now
      causes classes to be exported according to the same pattern, for
      example the default from package.skeleton() to specify all names
      starting with a letter.  An explicit directive to
      exportClassPattern will still over-ride.

    • There is an additional marked encoding "bytes" for character
      strings.  This is intended to be used for non-ASCII strings which
      should be treated as a set of bytes, and never re-encoded as if
      they were in the encoding of the current locale: useBytes = TRUE
      is automatically selected in functions such as writeBin(),
      writeLines(), grep() and strsplit().

      Only a few character operations are supported (such as substr()).

      Printing, format() and cat() will represent non-ASCII bytes in
      such strings by a \xab escape.

    • The new function removeSource() removes the internally stored
      source from a function.

    • "srcref" attributes now include two additional line number
      values, recording the line numbers in the order they were parsed.

    • New functions have been added for source reference access:
      getSrcFilename(), getSrcDirectory(), getSrcLocation() and
      getSrcref().

    • Sys.chmod() has an extra argument use_umask which defaults to
      true and restricts the file mode by the current setting of umask.
      This means that all the R functions which manipulate
      file/directory permissions by default respect umask, notably R
      CMD INSTALL.

    • tempfile() has an extra argument fileext to create a temporary
      filename with a specified extension.  (Suggestion and initial
      implementation by Dirk Eddelbuettel.)

      There are improvements in the way Sweave() and Stangle() handle
      non-ASCII vignette sources, especially in a UTF-8 locale: see
      ‘Writing R Extensions’ which now has a subsection on this topic.

    • factanal() now returns the rotation matrix if a rotation such as
      "promax" is used, and hence factor correlations are displayed.
      (Wish of PR#12754.)

    • The gctorture2() function provides a more refined interface to
      the GC torture process.  Environment variables R_GCTORTURE,
      R_GCTORTURE_WAIT, and R_GCTORTURE_INHIBIT_RELEASE can also be
      used to control the GC torture process.

    • file.copy(from, to) no longer regards it as an error to supply a
      zero-length from: it now simply does nothing.

    • rstandard.glm gains a type argument which can be used to request
      standardized Pearson residuals.

    • A start on a Turkish translation, thanks to Murat Alkan.

    • .libPaths() calls normalizePath(winslash = "/") on the paths:
      this helps (usually) present them in a user-friendly form and
      should detect duplicate paths accessed via different symbolic
      links.

  SWEAVE CHANGES:

    • Sweave() has options to produce PNG and JPEG figures, and to use
      a custom function to open a graphics device (see ?RweaveLatex).
      (Based in part on the contribution of PR#14418.)

    • The default for Sweave() is to produce only PDF figures (rather
      than both EPS and PDF).

    • Environment variable SWEAVE_OPTIONS can be used to supply
      defaults for existing or new options to be applied after the
      Sweave driver setup has been run.

    • The Sweave manual is now included as a vignette in the utils
      package.

    • Sweave() handles keep.source=TRUE much better: it could duplicate
      some lines and omit comments. (Reported by John Maindonald and
      others.)

  C-LEVEL FACILITIES:

    • Because they use a C99 interface which a C++ compiler is not
      required to support, Rvprintf and REvprintf are only defined by
      R_ext/Print.h in C++ code if the macro R_USE_C99_IN_CXX is
      defined when it is included.

    • pythag duplicated the C99 function hypot.  It is no longer
      provided, but is used as a substitute for hypot in the very
      unlikely event that the latter is not available.

    • R_inspect(obj) and R_inspect3(obj, deep, pvec) are (hidden)
      C-level entry points to the internal inspect function and can be
      used for C-level debugging (e.g., in conjunction with the p
      command in gdb).

    • Compiling R with --enable-strict-barrier now also enables
      additional checking for use of unprotected objects. In
      combination with gctorture() or gctorture2() and a C-level
      debugger this can be useful for tracking down memory protection
      issues.

  UTILITIES:

    • R CMD Rdiff is now implemented in R on Unix-alikes (as it has
      been on Windows since R 2.12.0).

    • R CMD build no longer does any cleaning in the supplied package
      directory: all the cleaning is done in the copy.

      It has a new option --install-args to pass arguments to R CMD
      INSTALL for --build (but not when installing to rebuild
      vignettes).

      There is a new option, --resave-data, to call
      tools::resaveRdaFiles() on the data directory, to compress
      tabular files (.tab, .csv etc) and to convert .R files to .rda
      files.  The default, --resave-data=gzip, is to do so in a way
      compatible even with years-old versions of R, but better
      compression is given by --resave-data=best, requiring R >=
      2.10.0.

      It now adds a datalist file for data directories of more than
      1Mb.

      Patterns in .Rbuildignore are now also matched against all
      directory names (including those of empty directories).

      There is a new option, --compact-vignettes, to try reducing the
      size of PDF files in the inst/doc directory.  Currently this
      tries qpdf: other options may be used in future.

      When re-building vignettes and an inst/doc/Makefile file is found,
      make clean is run if the makefile has a clean: target.

      After re-building vignettes the default clean-up operation will
      remove any directories (and not just files) created during the
      process: e.g. one package created a .R_cache directory.

      Empty directories are now removed unless the option
      --keep-empty-dirs is given (and a few packages do deliberately
      include empty directories).

      If there is a field BuildVignettes in the package DESCRIPTION
      file with a false value, re-building the vignettes is skipped.

    • R CMD check now also checks for filenames that are
      case-insensitive matches to Windows' reserved file names with
      extensions, such as nul.Rd, as these have caused problems on some
      Windows systems.

      It checks for inefficiently saved data/*.rda and data/*.RData
      files, and reports on those larger than 100Kb.  A more complete
      check (including of the type of compression, but potentially much
      slower) can be switched on by setting environment variable
      _R_CHECK_COMPACT_DATA2_ to TRUE.

      The types of files in the data directory are now checked, as
      packages are _still_ misusing it for non-R data files.

      It now extracts and runs the R code for each vignette in a
      separate directory and R process: this is done in the package's
      declared encoding.  Rather than call tools::checkVignettes(), it
      calls tools::buildVignettes() to see if the vignettes can be
      re-built as they would be by R CMD build.  Option --use-valgrind
      now applies only to these runs, and not when running code to
      rebuild the vignettes.  This version does a much better job of
      suppressing output from successful vignette tests.

      The 00check.log file is a more complete record of what is output
      to stdout: in particular contains more details of the tests.

      It now checks all syntactically valid Rd usage entries, and warns
      about assignments (unless these give the usage of replacement
      functions).

      .tar.xz compressed tarballs are now allowed, if tar supports them
      (and setting environment variable TAR to internal ensures so on
      all platforms).

    • R CMD check now warns if it finds inst/doc/makefile, and R CMD
      build renames such a file to inst/doc/Makefile.

  INSTALLATION:

    • Installing R no longer tries to find perl, and R CMD no longer
      tries to substitute a full path for awk nor perl - this was a
      legacy from the days when they were used by R itself.  Because a
      couple of packages do use awk, it is set as the make (rather than
      environment) variable AWK.

    • make check will now fail if there are differences from the
      reference output when testing package examples and if environment
      variable R_STRICT_PACKAGE_CHECK is set to a true value.

    • The C99 double complex type is now required.

      The C99 complex trigonometric functions (such as csin) are not
      currently required (FreeBSD lacks most of them): substitutes are
      used if they are missing.

    • The C99 system call va_copy is now required.

    • If environment variable R_LD_LIBRARY_PATH is set during
      configuration (for example in config.site) it is used unchanged
      in file etc/ldpaths rather than being appended to.

    • configure looks for support for OpenMP and if found compiles R
      with appropriate flags and also makes them available for use in
      packages: see ‘Writing R Extensions’.

      This is currently experimental, and is only used in R with a
      single thread for colSums() and colMeans().  Expect it to be more
      widely used in later versions of R.

      This can be disabled by the --disable-openmp flag.

  PACKAGE INSTALLATION:

    • R CMD INSTALL --clean now removes copies of a src directory which
      are created when multiple sub-architectures are in use.
      (Following a comment from Berwin Turlach.)

    • File R.css is now installed on a per-package basis (in the
      package's html directory) rather than in each library tree, and
      this is used for all the HTML pages in the package.  This helps
      when installing packages with static HTML pages for use on a
      webserver.  It will also allow future versions of R to use
      different stylesheets for the packages they install.

    • A top-level file .Rinstignore in the package sources can list (in
      the same way as .Rbuildignore) files under inst that should not
      be installed.  (Why should there be any such files?  Because all
      the files needed to re-build vignettes need to be under inst/doc,
      but they may not need to be installed.)

    • R CMD INSTALL has a new option --compact-docs to compact any PDFs
      under the inst/doc directory.  Currently this uses qpdf, which
      must be installed (see ‘Writing R Extensions’).

    • There is a new option --lock which can be used to cancel the
      effect of --no-lock or --pkglock earlier on the command line.

    • Option --pkglock can now be used with more than one package, and
      is now the default if only one package is specified.

    • Argument lock of install.packages() can now be used for Mac binary
      installs as well as for Windows ones.  The value "pkglock" is now
      accepted, as well as TRUE and FALSE (the default).

    • There is a new option --no-clean-on-error for R CMD INSTALL to
      retain a partially installed package for forensic analysis.

    • Packages with names ending in . are not portable since Windows
      does not work correctly with such directory names.  This is now
      warned about in R CMD check, and will not be allowed in R 2.14.x.

    • The vignette indices are more comprehensive (in the style of
      browseVignettes()).

  DEPRECATED & DEFUNCT:

    • require(save = TRUE) is defunct, and use of the save argument is
      deprecated.

    • R CMD check --no-latex is defunct: use --no-manual instead.

    • R CMD Sd2Rd is defunct.

    • The gamma argument to hsv(), rainbow(), and rgb2hsv() is
      deprecated and no longer has any effect.

    • The previous options for R CMD build --binary (--auto-zip,
      --use-zip-data and --no-docs) are deprecated (or defunct): use
      the new option --install-args instead.

    • When a character value is used for the EXPR argument in switch(),
      only a single unnamed alternative value is now allowed.

    • The wrapper utils::link.html.help() is no longer available.

    • Zip-ing data sets in packages (and hence R CMD INSTALL options
      --use-zip-data and --auto-zip, as well as the ZipData: yes field
      in a DESCRIPTION file) is defunct.

      Installed packages with zip-ed data sets can still be used, but a
      warning that they should be re-installed will be given.

    • The ‘experimental’ alternative specification of a name space via
      .Export() etc is now defunct.

    • The option --unsafe to R CMD INSTALL is deprecated: use the
      identical option --no-lock instead.

    • The entry point pythag in Rmath.h is deprecated in favour of the
      C99 function hypot.  A wrapper for hypot is provided for R 2.13.x
      only.

    • Direct access to the "source" attribute of functions is
      deprecated; use deparse(fn, control="useSource") to access it,
      and removeSource(fn) to remove it.

    • R CMD build --binary is now formally deprecated: R CMD INSTALL
      --build has long been the preferred alternative.

    • Single-character package names are deprecated (and R is already
      disallowed to avoid confusion in Depends: fields).

  BUG FIXES:

    • drop.terms and the [ method for class "terms" no longer add back
      an intercept.  (Reported by Niels Hansen.)

    • aggregate preserves the class of a column (e.g. a date) under
      some circumstances where it discarded the class previously.

    • p.adjust() now always returns a vector result, as documented.  In
      previous versions it copied attributes (such as dimensions) from
      the p argument: now it only copies names.

    • On PDF and PostScript devices, a line width of zero was recorded
      verbatim and this caused problems for some viewers (a very thin
      line combined with a non-solid line dash pattern could also cause
      a problem).  On these devices, the line width is now limited at
      0.01 and for very thin lines with complex dash patterns the
      device may force the line dash pattern to be solid.  (Reported by
      Jari Oksanen.)

    • The str() method for class "POSIXt" now gives sensible output for
      0-length input.

    • The one- and two-argument complex maths functions failed to warn
      if NAs were generated (as their numeric analogues do).

    • Added .requireCachedGenerics to the dont.mind list for library()
      to avoid warnings about duplicates.

    • $<-.data.frame messed with the class attribute, breaking any S4
      subclass.  The S4 data.frame class now has its own $<- method,
      and turns dispatch on for this primitive.

    • Map() did not look up a character argument f in the correct
      frame, thanks to lazy evaluation.  (PR#14495)

    • file.copy() did not tilde-expand from and to when to was a
      directory.  (PR#14507)

    • It was possible (but very rare) for the loading test in R CMD
      INSTALL to crash a child R process and so leave around a lock
      directory and a partially installed package.  That test is now
      done in a separate process.

    • plot(<formula>, data=<matrix>,..) now works in more cases;
      similarly for points(), lines() and text().

    • edit.default() contained a manual dispatch for matrices (the
      "matrix" class didn't really exist when it was written).  This
      caused an infinite recursion in the no-GUI case and has now been
      removed.

    • data.frame(check.rows = TRUE) sometimes worked when it should
      have detected an error.  (PR#14530)

    • scan(sep= , strip.white=TRUE) sometimes stripped trailing spaces
      from within quoted strings.  (The real bug in PR#14522.)

    • The rank-correlation methods for cor() and cov() with use =
      "complete.obs" computed the ranks before removing missing values,
      whereas the documentation implied incomplete cases were removed
      first.  (PR#14488)

      They also failed for 1-row matrices.

    • The perpendicular adjustment used in placing text and expressions
      in the margins of plots was not scaled by par("mex"). (Part of
      PR#14532.)

    • Quartz Cocoa device now catches any Cocoa exceptions that occur
      during the creation of the device window to prevent crashes.  It
      also imposes a limit of 144 ft^2 on the area used by a window to
      catch user errors (unit misinterpretation) early.

    • The browser (invoked by debug(), browser() or otherwise) would
      display attributes such as "wholeSrcref" that were intended for
      internal use only.

    • R's internal filename completion now properly handles filenames
      with spaces in them even when the readline library is used.  This
      resolves PR#14452 provided the internal filename completion is
      used (e.g., by setting rc.settings(files = TRUE)).

    • Inside uniroot(f, ...), -Inf function values are now replaced by
      a maximally *negative* value.

    • rowsum() could silently over/underflow on integer inputs
      (reported by Bill Dunlap).

    • as.matrix() did not handle "dist" objects with zero rows.

CHANGES IN R VERSION 2.12.2 patched:

  NEW FEATURES:

    • max() and min() work harder to ensure that NA has precedence over
      NaN, so e.g. min(NaN, NA) is NA.  (This was not previously
      documented except for within a single numeric vector, where
      compiler optimizations often defeated the code.)

  BUG FIXES:

    • A change to the C function R_tryEval had broken error messages in
      S4 method selection; the error message is now printed.

    • PDF output with a non-RGB color model used RGB for the line
      stroke color.  (PR#14511)

    • stats4::BIC() assumed without checking that an object of class
      "logLik" has an "nobs" attribute: glm() fits did not and so BIC()
      failed for them.

    • In some circumstances a one-sided mantelhaen.test() reported the
      p-value for the wrong tail.  (PR#14514)

    • Passing the invalid value lty = NULL to axis() sent an invalid
      value to the graphics device, and might cause the device to
      segfault.

    • Sweave() with concordance=TRUE could lead to invalid PDF files;
      Sweave.sty has been updated to avoid this.

    • Non-ASCII characters in the titles of help pages were not
      rendered properly in some locales, and could cause errors or
      warnings.

    • checkRd() gave a spurious error if the \href macro was used.

Google Snappy


Cool-sounding software, yet again from the guys in California: Snappy lets you compress and decompress Big Data much, much faster.

http://news.ycombinator.com/item?id=2356735

and

https://code.google.com/p/snappy/

Snappy is a compression/decompression library. It does not aim for maximum compression, or compatibility with any other compression library; instead, it aims for very high speeds and reasonable compression. For instance, compared to the fastest mode of zlib, Snappy is an order of magnitude faster for most inputs, but the resulting compressed files are anywhere from 20% to 100% bigger. On a single core of a Core i7 processor in 64-bit mode, Snappy compresses at about 250 MB/sec or more and decompresses at about 500 MB/sec or more.

Snappy is widely used inside Google, in everything from BigTable and MapReduce to our internal RPC systems. (Snappy has previously been referred to as “Zippy” in some presentations and the likes.)

For more information, please see the README. Benchmarks against a few other compression libraries (zlib, LZO, LZF, FastLZ, and QuickLZ) are included in the source code distribution.

Introduction
============

Snappy is a compression/decompression library. It does not aim for maximum
compression, or compatibility with any other compression library; instead,
it aims for very high speeds and reasonable compression. For instance,
compared to the fastest mode of zlib, Snappy is an order of magnitude faster
for most inputs, but the resulting compressed files are anywhere from 20% to
100% bigger. (For more information, see “Performance”, below.)

Snappy has the following properties:

* Fast: Compression speeds at 250 MB/sec and beyond, with no assembler code.
  See “Performance” below.
* Stable: Over the last few years, Snappy has compressed and decompressed
  petabytes of data in Google’s production environment. The Snappy bitstream
  format is stable and will not change between versions.
* Robust: The Snappy decompressor is designed not to crash in the face of
  corrupted or malicious input.
* Free and open source software: Snappy is licensed under the Apache license,
  version 2.0. For more information, see the included COPYING file.
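The "Robust" property above can be illustrated with a small sketch. Snappy bindings are not part of the Python standard library, so this example uses zlib as a stand-in; it only shows the general pattern of a decompressor reporting damaged input instead of crashing, and says nothing about how Snappy itself is implemented. The helper name safe_decompress is made up for this example.

```python
import zlib
from typing import Optional


def safe_decompress(blob: bytes) -> Optional[bytes]:
    """Return the decompressed bytes, or None if the input is damaged."""
    try:
        return zlib.decompress(blob)
    except zlib.error:
        # Damaged input is reported as an error, not a crash.
        return None


original = b"petabytes of production data " * 100
good = zlib.compress(original)

truncated = good[: len(good) // 2]  # stream cut off halfway
bad_header = b"\x00" + good[1:]     # first header byte overwritten

print(safe_decompress(good) == original)  # True
print(safe_decompress(truncated))         # None
print(safe_decompress(bad_header))        # None
```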

Snappy has previously been called “Zippy” in some Google presentations
and the like.

Performance
===========

Snappy is intended to be fast. On a single core of a Core i7 processor
in 64-bit mode, it compresses at about 250 MB/sec or more and decompresses at
about 500 MB/sec or more. (These numbers are for the slowest inputs in our
benchmark suite; others are much faster.) In our tests, Snappy usually
is faster than algorithms in the same class (e.g. LZO, LZF, FastLZ, QuickLZ,
etc.) while achieving comparable compression ratios.

Typical compression ratios (based on the benchmark suite) are about 1.5-1.7x
for plain text, about 2-4x for HTML, and of course 1.0x for JPEGs, PNGs and
other already-compressed data. Similar numbers for zlib in its fastest mode
are 2.6-2.8x, 3-7x and 1.0x, respectively. More sophisticated algorithms are
capable of achieving yet higher compression rates, although usually at the
expense of speed. Of course, compression ratio will vary significantly with
the input.
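To make the ratio and throughput figures above concrete, here is a minimal sketch of how such numbers can be measured. It uses Python's standard zlib module at its fastest and slowest levels as a stand-in, since Snappy is not in the standard library; the sample input and the measure helper are assumptions for illustration, and the numbers it prints will differ from the README's benchmark figures.

```python
import time
import zlib


def measure(data: bytes, level: int) -> dict:
    """Compress data at the given zlib level; report ratio and throughput."""
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    return {
        "level": level,
        # Compression ratio = original size / compressed size.
        "ratio": len(data) / len(compressed),
        "mb_per_sec": len(data) / 1e6 / elapsed if elapsed > 0 else float("inf"),
        "round_trip_ok": zlib.decompress(compressed) == data,
    }


# Hypothetical sample input: repetitive text, which compresses well.
sample = b"Snappy trades compression ratio for speed. " * 20000

fast = measure(sample, 1)   # zlib's fastest mode
small = measure(sample, 9)  # zlib's slowest, tightest mode

for result in (fast, small):
    print(result)
```

Running the same function over your own files gives a quick feel for how strongly both ratio and speed depend on the input.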

Although Snappy should be fairly portable, it is primarily optimized
for 64-bit x86-compatible processors, and may run slower in other environments.
In particular:

 - Snappy uses 64-bit operations in several places to process more data at
   once than would otherwise be possible.
 - Snappy assumes unaligned 32- and 64-bit loads and stores are cheap.
   On some platforms, these must be emulated with single-byte loads
   and stores, which is much slower.
 - Snappy assumes little-endian throughout, and needs to byte-swap data in
   several places if running on a big-endian platform.
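The little-endian point in the list above is easier to see with actual bytes. The sketch below uses Python's struct module to lay out one 32-bit integer in both byte orders; the value is arbitrary and chosen only for illustration, and the code is not taken from Snappy.

```python
import struct
import sys

value = 0x0A0B0C0D  # an arbitrary 32-bit integer chosen for illustration

little = struct.pack("<I", value)  # least significant byte first
big = struct.pack(">I", value)     # most significant byte first

print(little.hex())  # 0d0c0b0a
print(big.hex())     # 0a0b0c0d

# Byte-swapping converts one layout into the other.
swapped = little[::-1]
print(swapped == big)  # True

# Reading little-endian bytes as if they were big-endian gives a wrong value,
# which is why code that assumes one order must swap on the other.
decoded_wrong = struct.unpack(">I", little)[0]
print(hex(decoded_wrong))  # 0xd0c0b0a

print(sys.byteorder)  # byte order of the machine running this script
```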

Experience has shown that even heavily tuned code can be improved.
Performance optimizations, whether for 64-bit x86 or other platforms,
are of course most welcome; see “Contact”, below.

Usage
=====

Note that Snappy, both the implementation and the interface,
is written in C++.

To use Snappy from your own program, include the file "snappy.h" from
your calling file, and link against the compiled library.

There are many ways to call Snappy, but the simplest possible is

  snappy::Compress(input, &output);

and similarly

  snappy::Uncompress(input, &output);

where "input" and "output" are both instances of std::string.


Cutting Down Office Costs: Downloading by DAP and BitTorrent

Some of these tips may be familiar. Some may be surprisingly different.

Here are ways to 1) search for hard-to-find software and 2) download and queue downloads with resume/pause functions.

BitTorrent is the best way to download anything. To quote the OpenOffice.org website:

“BitTorrent is a P2P method where a central ‘tracker’ keeps track of who is downloading and sharing specific files.

When using BitTorrent to download OpenOffice.org, your computer automatically uses spare bandwidth to help share the file with others, and this means that you don’t have to put up with slower downloads during peak download times (such as just after a release), because the more people downloading, the more people sharing.

Also, your download is automatically checked for integrity to make sure that it is identical to the official version.

To use BitTorrent technology, you must have a BitTorrent “client” installed.

uTorrent (Windows)
Official BitTorrent Client (Cross-Platform)
Azureus (Cross-Platform)
ABC (Windows, Linux)
Shareaza (Windows)
Tomato Torrent (Mac OS X)
BitComet (Windows)
aria2 (Linux)”
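The integrity check mentioned in the quote works because a torrent file lists a SHA-1 hash for every fixed-size piece of the download, and the client compares each piece it receives against that list. Here is a rough sketch of the idea; the piece size, the sample data and the function names are made up for illustration, and real clients use much larger pieces.

```python
import hashlib

PIECE_SIZE = 16  # hypothetical, tiny piece size so the example is easy to follow


def piece_hashes(data: bytes, piece_size: int = PIECE_SIZE) -> list:
    """Split data into fixed-size pieces and return the SHA-1 digest of each."""
    return [
        hashlib.sha1(data[i : i + piece_size]).digest()
        for i in range(0, len(data), piece_size)
    ]


def verify_piece(piece: bytes, expected_digest: bytes) -> bool:
    """Accept a downloaded piece only if its hash matches the published one."""
    return hashlib.sha1(piece).digest() == expected_digest


official = b"contents of the official release file"
expected = piece_hashes(official)  # what the torrent file would publish

received_ok = official[0:PIECE_SIZE]
received_bad = b"X" + official[1:PIECE_SIZE]  # one byte altered in transit

print(verify_piece(received_ok, expected[0]))   # True
print(verify_piece(received_bad, expected[0]))  # False
```

A piece that fails the check is simply discarded and downloaded again from another peer.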

For normal downloads, use DAP (Download Accelerator Plus) from www.speedbit.com; it is best suited to the job. Also try compressing files before uploading them, since this reduces office bandwidth costs. To cut software costs for your organization, download Ubuntu Linux from http://www.ubuntu.com and OpenOffice from http://www.openoffice.org and roll them out either to the 10% of staff who are most technically qualified or to the 10% of computers that handle only simple tasks such as email, office documents and the front desk. Then expand or tweak the percentages based on the results and on user satisfaction.

To cut down on intranet costs, you can install the free software from www.wordpress.org on one computer and let the whole office use it as an intranet. To create an office newsletter, burn the feed at www.feedburner.com and use the email plugin to offer subscriptions to email users.

Compression Tips

1) Stuck with huge datasets in SAS?

Use the SAS system option:

options compress=yes;

2) Stuck with huge datasets on UNIX?

Run compress on the file:

compress filename.extension

3) Huge data on Windows? Use 7-Zip, which is open source.

You don’t need to register or pay for 7-Zip.

www.7-zip.org/
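If you would rather script the same idea than run a utility by hand, the sketch below compresses a file with Python's standard gzip module and reports the space saved. It is only an illustration: gzip is used in place of the UNIX compress command, and the file name and generated data are made up so the example is self-contained.

```python
import gzip
import os
import shutil
import tempfile


def gzip_file(path: str) -> str:
    """Write a gzip-compressed copy of path next to it and return its name."""
    target = path + ".gz"
    with open(path, "rb") as src, gzip.open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target


# Create a throwaway dataset so the example does not touch real files.
workdir = tempfile.mkdtemp()
dataset = os.path.join(workdir, "dataset.csv")  # hypothetical file name
with open(dataset, "w") as handle:
    for i in range(50000):
        handle.write(f"{i},customer_{i % 100},2011-03-{(i % 28) + 1:02d}\n")

compressed = gzip_file(dataset)
before = os.path.getsize(dataset)
after = os.path.getsize(compressed)
print(f"{before} bytes -> {after} bytes ({100 * (1 - after / before):.0f}% saved)")

shutil.rmtree(workdir)  # clean up the temporary files
```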

SAVE SPACE ON YOUR SYSTEMS 🙂
