parsing – DECISION STATS

Protected: Converting SAS language code to Java

Book Review- Machine Learning for Hackers

This is review of the fashionably named book Machine Learning for Hackers by Drew Conway and John Myles White (O’Reilly ). The book is about hacking code in R.

The preface introduces the reader to the authors conception of what machine learning and hacking is all about. If the name of the book was machine learning for business analytsts or data miners, I am sure the content would have been unchanged though the popularity (and ambiguity) of the word hacker can often substitute for its usefulness. Indeed the many wise and learned Professors of statistics departments through out the civilized world would be mildly surprised and bemused by their day to day activities as hacking or teaching hackers. The book follows a case study and example based approach and uses the GGPLOT2 package within R programming almost to the point of ignoring any other native graphics system based in R. It can be quite useful for the aspiring reader who wishes to understand and join the booming market for skilled talent in statistical computing.

Chapter 1 has a very useful set of functions for data cleansing and formatting. It walks you through the basics of formatting based on dates and conditions, missing value and outlier treatment and using ggplot package in R for graphical analysis. The case study used is an Infochimps dataset with 60,000 recordings of UFO sightings. The case study is lucid, and done at a extremely helpful pace illustrating the powerful and flexible nature of R functions that can be used for data cleansing.The chapter mentions text editors and IDEs but fails to list them in a tabular format, while listing several other tables like Packages used in the book. It also jumps straight from installation instructions to functions in R without getting into the various kinds of data types within R or specifying where these can be referenced from. It thus assumes a higher level of basic programming understanding for the reader than the average R book.

Chapter 2 discusses data exploration, and has a very clear set of diagrams that explain the various data summary operations that are performed routinely. This is an innovative approach and will help students or newcomers to the field of data analysis. It introduces the reader to type determination functions, as well different kinds of encoding. The introduction to creating functions is quite elegant and simple , and numerical summary methods are explained adequately. While the chapter explains data exploration with the help of various histogram options in ggplot2 , it fails to create a more generic framework for data exploration or rules to assist the reader in visual data exploration in non standard data situations. While the examples are very helpful for a reader , there needs to be slightly more depth to step out of the example and into a framework for visual data exploration (or references for the same). A couple of case studies however elaborately explained cannot do justice to the vast field of data exploration and especially visual data exploration.

Chapter 3 discussed binary classification for the specific purpose for spam filtering using a dataset from SpamAssassin. It introduces the reader to the naïve Bayes classifier and the principles of text mining suing the tm package in R. Some of the example codes could have been better commented for easier readability in the book. Overall it is quite a easy tutorial for creating a naïve Bayes classifier even for beginners.

Chapter 4 discusses the issues in importance ranking and creating recommendation systems specifically in the case of ordering email messages into important and not important. It introduces the useful grepl, gsub, strsplit, strptime ,difftime and strtrim functions for parsing data. The chapter further introduces the reader to the concept of log (and affine) transformations in a lucid and clear way that can help even beginners learn this powerful transformation concept. Again the coding within this chapter is sparsely commented which can cause difficulties to people not used to learn reams of code. ( it may have been part of the code attached with the book, but I am reading an electronic book and I did not find an easy way to go back and forth between the code and the book). The readability of the chapters would be further enhanced by the use of flow charts explaining the path and process followed than overtly verbose textual descriptions running into multiple pages. The chapters are quite clearly written, but a helpful visual summary can help in both revising the concepts and elucidate the approach taken further.A suggestion for the authors could be to compile the list of useful functions they introduce in this book as a sort of reference card (or Ref Card) for R Hackers or atleast have a chapter wise summary of functions, datasets and packages used.

Chapter 5 discusses linear regression , and it is a surprising and not very good explanation of regression theory in the introduction to regression. However the chapter makes up in practical example what it oversimplifies in theory. The chapter on regression is not the finest chapter written in this otherwise excellent book. Part of this is because of relative lack of organization- correlation is explained after linear regression is explained. Once again the lack of a function summary and a process flow diagram hinders readability and a separate section on regression metrics that help make a regression result good or not so good could be a welcome addition. Functions introduced include lm.

Chapter 6 showcases Generalized Additive Model (GAM) and Polynomial Regression, including an introduction to singularity and of over-fitting. Functions included in this chapter are transform, and poly while the package glmnet is also used here. The chapter also introduces the reader formally to the concept of cross validation (though examples of cross validation had been introduced in earlier chapters) and regularization. Logistic regression is also introduced at the end in this chapter.

Chapter 7 is about optimization. It describes error metric in a very easy to understand way. It creates a grid by using nested loops for various values of intercept and slope of a regression equation and computing the sum of square of errors. It then describes the optim function in detail including how it works and it’s various parameters. It introduces the curve function. The chapter then describes ridge regression including definition and hyperparameter lamda. The use of optim function to optimize the error in regression is useful learning for the aspiring hacker. Lastly it describes a case study of breaking codes using the simplistic Caesar cipher, a lexical database and the Metropolis method. Functions introduced in this chapter include .Machine$double.eps .

Chapter 8 deals with Principal Component Analysis and unsupervised learning. It uses the ymd function from lubridate package to convert string to date objects, and the cast function from reshape package to further manipulate the structure of data. Using the princomp functions enables PCA in R.The case study creates a stock market index and compares the results with the Dow Jones index.

Chapter 9 deals with Multidimensional Scaling as well as clustering US senators on the basis of similarity in voting records on legislation .It showcases matrix multiplication using %*% and also the dist function to compute distance matrix.

Chapter 10 has the subject of K Nearest Neighbors for recommendation systems. Packages used include class ,reshape and and functions used include cor, function and log. It also demonstrates creating a custom kNN function for calculating Euclidean distance between center of centroids and data. The case study used is the R package recommendation contest on Kaggle. Overall a simplistic introduction to creating a recommendation system using K nearest neighbors, without getting into any of the prepackaged packages within R that deal with association analysis , clustering or recommendation systems.

Chapter 11 introduces the reader to social network analysis (and elements of graph theory) using the example of Erdos Number as an interesting example of social networks of mathematicians. The example of Social Graph API by Google for hacking are quite new and intriguing (though a bit obsolete by changes, and should be rectified in either the errata or next edition) . However there exists packages within R that should be atleast referenced or used within this chapter (like TwitteR package that use the Twitter API and ROauth package for other social networks). Packages used within this chapter include Rcurl, RJSONIO, and igraph packages of R and functions used include rbind and ifelse. It also introduces the reader to the advanced software Gephi. The last example is to build a recommendation engine for whom to follow in Twitter using R.

Chapter 12 is about model comparison and introduces the concept of Support Vector Machines. It uses the package e1071 and shows the svm function. It also introduces the concept of tuning hyper parameters within default algorithms . A small problem in understanding the concepts is the misalignment of diagram pages with the relevant code. It lastly concludes with using mean square error as a method for comparing models built with different algorithms.

Overall the book is a welcome addition in the library of books based on R programming language, and the refreshing nature of the flow of material and the practicality of it’s case studies make this a recommended addition to both academic and corporate business analysts trying to derive insights by hacking lots of heterogeneous data.

Have a look for yourself at-

http://shop.oreilly.com/product/0636920018483.do

Google Speed Test

Here is a new service in beta for helping test your website for speed. It is continuing series of initiatives for Google to help the internet (and their own computing resources)

If you want to request access to the limited beta

Request Access – Here

https://docs.google.com/spreadsheet/viewform?hl=en_US&formkey=dDdjcmNBZFZsX2c0SkJPQnR3aGdnd0E6MQ&ifq

Continue reading “Google Speed Test”

Getting Inside R

Forums and Minerals, the new Internet tools — Image via Wikipedia

I loved the new upgraded design of Inside-R, Revo’s new(?) community.

And promptly shot up a blog application.

What makes Inside- R- slightly better than SDC, Analyticbridge,PlanetR and R _bloggers (with due respects)

Open Id logins (I think thats a new and good step)
Options for automated feed parsing for blogs
More than just a blog aggregator- includes sections on other stuff- thus more like a community than a big feed
Abbreviated feeds- just gives you two-three lines of summary per post than the whole big schmakaround -thats a time saver for me —(D Smith is the only -lonely blogger atm there)
The more the merrier- One more place to read and write R.

btw is the name insider (as in guy who knows inside stuff) or Inside- R (as in get inside the R box)- just kidding. With PlyR, ManipulatR, ApplyR and now Inside R- the pun gets MerrieR

If my blog app gets rejected- these views may change ,grr

R-bloggers for other languages – wanna join? (r-bloggers.com)
National Thank a Blogger Day (mom-101.com)
WoW Insider is available on your Amazon Kindle (wow.joystiq.com)
Revolution Analytics Launches Inside-R.org for the Open Source R Statistics Community (eon.businesswire.com)
Chasing Value: Oracle Insiders Selling as Stock Hits 52-Week High (bloggingstocks.com)
Is it me or does blogging suck up a lot of time? from Gothridge Manor (gothridgemanor.blogspot.com)
Inside-R.org, a new community site for R (revolutionanalytics.com)

So what's new in R 2.12.0

and as per http://cran.r-project.org/src/base/NEWS

the answer is plenty is new in the newR.

While you and me, were busy writing and reading blogs, or generally writing code for earning more money, or our own research- Uncle Peter D and his band of merry men have been really busy in a much more upgraded R.

————————————–

CHANGES————————-

NEW FEATURES:

    â€¢ Reading a packages's CITATION file now defaults to ASCII rather
      than Latin-1: a package with a non-ASCII CITATION file should
      declare an encoding in its DESCRIPTION file and use that encoding
      for the CITATION file.

    â€¢ difftime() now defaults to the "tzone" attribute of "POSIXlt"
      objects rather than to the current timezone as set by the default
      for the tz argument.  (Wish of PR#14182.)

    â€¢ pretty() is now generic, with new methods for "Date" and "POSIXt"
      classes (based on code contributed by Felix Andrews).

    â€¢ unique() and match() are now faster on character vectors where
      all elements are in the global CHARSXP cache and have unmarked
      encoding (ASCII).  Thanks to Matthew Dowle for suggesting
      improvements to the way the hash code is generated in unique.c.

    â€¢ The enquote() utility, in use internally, is exported now.

    â€¢ .C() and .Fortran() now map non-zero return values (other than
      NA_LOGICAL) for logical vectors to TRUE: it has been an implicit
      assumption that they are treated as true.

    â€¢ The print() methods for "glm" and "lm" objects now insert
      linebreaks in long calls in the same way that the print() methods
      for "summary.[g]lm" objects have long done.  This does change the
      layout of the examples for a number of packages, e.g. MASS.
      (PR#14250)

    â€¢ constrOptim() can now be used with method "SANN".  (PR#14245)

      It gains an argument hessian to be passed to optim(), which
      allows all the ... arguments to be intended for f() and grad().
      (PR#14071)

    â€¢ curve() now allows expr to be an object of mode "expression" as
      well as "call" and "function".

    â€¢ The "POSIX[cl]t" methods for Axis() have been replaced by a
      single method for "POSIXt".

      There are no longer separate plot() methods for "POSIX[cl]t" and
      "Date": the default method has been able to handle those classes
      for a long time.  This _inter alia_ allows a single date-time
      object to be supplied, the wish of PR#14016.

      The methods had a different default ("") for xlab.

    â€¢ Classes "POSIXct", "POSIXlt" and "difftime" have generators
      .POSIXct(), .POSIXlt() and .difftime().  Package authors are
      advised to make use of them (they are available from R 2.11.0) to
      proof against planned future changes to the classes.

      The ordering of the classes has been changed, so "POSIXt" is now
      the second class.  See the document â€˜Updating packages for
      changes in R 2.12.xâ€™ on  for
      the consequences for a handful of CRAN packages.

    â€¢ The "POSIXct" method of as.Date() allows a timezone to be
      specified (but still defaults to UTC).

    â€¢ New list2env() utility function as an inverse of
      as.list() and for fast multi-assign() to existing
      environment.  as.environment() is now generic and uses list2env()
      as list method.

    â€¢ There are several small changes to output which â€˜zapâ€™ small
      numbers, e.g. in printing quantiles of residuals in summaries
      from "lm" and "glm" fits, and in test statisics in print.anova().

    â€¢ Special names such as "dim", "names", etc, are now allowed as
      slot names of S4 classes, with "class" the only remaining
      exception.

    â€¢ File .Renviron can have architecture-specific versions such as
      .Renviron.i386 on systems with sub-architectures.

    â€¢ installed.packages() has a new argument subarch to filter on
      sub-architecture.

    â€¢ The summary() method for packageStatus() now has a separate
      print() method.

    â€¢ The default summary() method returns an object inheriting from
      class "summaryDefault" which has a separate print() method that
      calls zapsmall() for numeric/complex values.

    â€¢ The startup message now includes the platform and if used,
      sub-architecture: this is useful where different
      (sub-)architectures run on the same OS.

    â€¢ The getGraphicsEvent() mechanism now allows multiple windows to
      return graphics events, through the new functions
      setGraphicsEventHandlers(), setGraphicsEventEnv(), and
      getGraphicsEventEnv().  (Currently implemented in the windows()
      and X11() devices.)

    â€¢ tools::texi2dvi() gains an index argument, mainly for use by R
      CMD Rd2pdf.

      It avoids the use of texindy by texinfo's texi2dvi >= 1.157,
      since that does not emulate 'makeindex' well enough to avoid
      problems with special characters (such as (, {, !) in indices.

    â€¢ The ability of readLines() and scan() to re-encode inputs to
      marked UTF-8 strings on Windows since R 2.7.0 is extended to
      non-UTF-8 locales on other OSes.

    â€¢ scan() gains a fileEncoding argument to match read.table().

    â€¢ points() and lines() gain "table" methods to match plot().  (Wish
      of PR#10472.)

    â€¢ Sys.chmod() allows argument mode to be a vector, recycled along
      paths.

    â€¢ There are |, & and xor() methods for classes "octmode" and
      "hexmode", which work bitwise.

    â€¢ Environment variables R_DVIPSCMD, R_LATEXCMD, R_MAKEINDEXCMD,
      R_PDFLATEXCMD are no longer used nor set in an R session.  (With
      the move to tools::texi2dvi(), the conventional environment
      variables LATEX, MAKEINDEX and PDFLATEX will be used.
      options("dvipscmd") defaults to the value of DVIPS, then to
      "dvips".)

    â€¢ New function isatty() to see if terminal connections are
      redirected.

    â€¢ summaryRprof() returns the sampling interval in component
      sample.interval and only returns in by.self data for functions
      with non-zero self times.

    â€¢ print(x) and str(x) now indicate if an empty list x is named.

    â€¢ install.packages() and remove.packages() with lib unspecified and
      multiple libraries in .libPaths() inform the user of the library
      location used with a message rather than a warning.

    â€¢ There is limited support for multiple compressed streams on a
      file: all of [bgx]zfile() allow streams to be appended to an
      existing file, but bzfile() reads only the first stream.

    â€¢ Function person() in package utils now uses a given/family scheme
      in preference to first/middle/last, is vectorized to handle an
      arbitrary number of persons, and gains a role argument to specify
      person roles using a controlled vocabulary (the MARC relator
      terms).

    â€¢ Package utils adds a new "bibentry" class for representing and
      manipulating bibliographic information in enhanced BibTeX style,
      unifying and enhancing the previously existing mechanisms.

    â€¢ A bibstyle() function has been added to the tools package with
      default JSS style for rendering "bibentry" objects, and a
      mechanism for registering other rendering styles.

    â€¢ Several aspects of the display of text help are now customizable
      using the new Rd2txt_options() function.
      options("help_text_width") is no longer used.

    â€¢ Added \href tag to the Rd format, to allow hyperlinks to URLs
      without displaying the full URL.

    â€¢ Added \newcommand and \renewcommand tags to the Rd format, to
      allow user-defined macros.

    â€¢ New toRd() generic in the tools package to convert objects to
      fragments of Rd code, and added "fragment" argument to Rd2txt(),
      Rd2HTML(), and Rd2latex() to support it.

    â€¢ Directory R_HOME/share/texmf now follows the TDS conventions, so
      can be set as a texmf tree (â€˜root directoryâ€™ in MiKTeX parlance).

    â€¢ S3 generic functions now use correct S4 inheritance when
      dispatching on an S4 object.  See ?Methods, section on â€œMethods
      for S3 Generic Functionsâ€ for recommendations and details.

    â€¢ format.pval() gains a ... argument to pass arguments such as
      nsmall to format().  (Wish of PR#9574)

    â€¢ legend() supports title.adj.  (Wish of PR#13415)

    â€¢ Added support for subsetting "raster" objects, plus assigning to
      a subset, conversion to a matrix (of colour strings), and
      comparisons (== and !=).

    â€¢ Added a new parseLatex() function (and related functions
      deparseLatex() and latexToUtf8()) to support conversion of
      bibliographic entries for display in R.

    â€¢ Text rendering of \itemize in help uses a Unicode bullet in UTF-8
      and most single-byte Windows locales.

    â€¢ Added support for polygons with holes to the graphics engine.
      This is implemented for the pdf(), postscript(),
      x11(type="cairo"), windows(), and quartz() devices (and
      associated raster formats), but not for x11(type="Xlib") or
      xfig() or pictex().  The user-level interface is the polypath()
      function in graphics and grid.path() in grid.

    â€¢ File NEWS is now generated at installation with a slightly
      different format: it will be in UTF-8 on platforms using UTF-8,
      and otherwise in ASCII.  There is also a PDF version, NEWS.pdf,
      installed at the top-level of the R distribution.

    â€¢ kmeans(x, 1) now works.  Further, kmeans now returns between and
      total sum of squares.

    â€¢ arrayInd() and which() gain an argument useNames.  For arrayInd,
      the default is now false, for speed reasons.

    â€¢ As is done for closures, the default print method for the formula
      class now displays the associated environment if it is not the
      global environment.

    â€¢ A new facility has been added for inserting code into a package
      without re-installing it, to facilitate testing changes which can
      be selectively added and backed out.  See ?insertSource.

    â€¢ New function readRenviron to (re-)read files in the format of
      ~/.Renviron and Renviron.site.

    â€¢ require() will now return FALSE (and not fail) if loading the
      package or one of its dependencies fails.

    â€¢ aperm() now allows argument perm to be a character vector when
      the array has named dimnames (as the results of table() calls
      do).  Similarly, array() allows MARGIN to be a character vector.
      (Based on suggestions of Michael Lachmann.)

    â€¢ Package utils now exports and documents functions
      aspell_package_Rd_files() and aspell_package_vignettes() for
      spell checking package Rd files and vignettes using Aspell,
      Ispell or Hunspell.

    â€¢ Package news can now be given in Rd format, and news() prefers
      these inst/NEWS.Rd files to old-style plain text NEWS or
      inst/NEWS files.

    â€¢ New simple function packageVersion().

    â€¢ The PCRE library has been updated to version 8.10.

    â€¢ The standard Unix-alike terminal interface declares its name to
      readline as 'R', so that can be used for conditional sections in
      ~/.inputrc files.

    â€¢ â€˜Writing R Extensionsâ€™ now stresses that the standard sections in
      .Rd files (other than \alias, \keyword and \note) are intended to
      be unique, and the conversion tools now drop duplicates with a
      warning.

      The .Rd conversion tools also warn about an unrecognized type in
      a \docType section.

    â€¢ ecdf() objects now have a quantile() method.

    â€¢ format() methods for date-time objects now attempt to make use of
      a "tzone" attribute with "%Z" and "%z" formats, but it is not
      always possible.  (Wish of PR#14358.)

    â€¢ tools::texi2dvi(file, clean = TRUE) now works in more cases (e.g.
      where emulation is used and when file is not in the current
      directory).

    â€¢ New function droplevels() to remove unused factor levels.

    â€¢ system(command, intern = TRUE) now gives an error on a Unix-alike
      (as well as on Windows) if command cannot be run.  It reports a
      non-success exit status from running command as a warning.

      On a Unix-alike an attempt is made to return the actual exit
      status of the command in system(intern = FALSE): previously this
      had been system-dependent but on POSIX-compliant systems the
      value return was 256 times the status.

    â€¢ system() has a new argument ignore.stdout which can be used to
      (portably) ignore standard output.

    â€¢ system(intern = TRUE) and pipe() connections are guaranteed to be
      avaliable on all builds of R.

    â€¢ Sys.which() has been altered to return "" if the command is not
      found (even on Solaris).

    â€¢ A facility for defining reference-based S4 classes (in the OOP
      style of Java, C++, etc.) has been added experimentally to
      package methods; see ?ReferenceClasses.

    â€¢ The predict method for "loess" fits gains an na.action argument
      which defaults to na.pass rather than the previous default of
      na.omit.

      Predictions from "loess" fits are now named from the row names of
      newdata.

    â€¢ Parsing errors detected during Sweave() processing will now be
      reported referencing their original location in the source file.

    â€¢ New adjustcolor() utility, e.g., for simple translucent color
      schemes.

    â€¢ qr() now has a trivial lm method with a simple (fast) validity
      check.

    â€¢ An experimental new programming model has been added to package
      methods for reference (OOP-style) classes and methods.  See
      ?ReferenceClasses.

    â€¢ bzip2 has been updated to version 1.0.6 (bug-fix release).
      --with-system-bzlib now requires at least version 1.0.6.

    â€¢ R now provides jss.cls and jss.bst (the class and bib style file
      for the Journal of Statistical Software) as well as RJournal.bib
      and Rnews.bib, and R CMD ensures that the .bst and .bib files are
      found by BibTeX.

    â€¢ Functions using the TAR environment variable no longer quote the
      value when making system calls.  This allows values such as tar
      --force-local, but does require additional quotes in, e.g., TAR =
      "'/path with spaces/mytar'".

  DEPRECATED & DEFUNCT:

    â€¢ Supplying the parser with a character string containing both
      octal/hex and Unicode escapes is now an error.

    â€¢ File extension .C for C++ code files in packages is now defunct.

    â€¢ R CMD check no longer supports configuration files containing
      Perl configuration variables: use the environment variables
      documented in â€˜R Internalsâ€™ instead.

    â€¢ The save argument of require() now defaults to FALSE and save =
      TRUE is now deprecated.  (This facility is very rarely actually
      used, and was superseded by the Depends field of the DESCRIPTION
      file long ago.)

    â€¢ R CMD check --no-latex is deprecated in favour of --no-manual.

    â€¢ R CMD Sd2Rd is formally deprecated and will be removed in R
      2.13.0.

  PACKAGE INSTALLATION:

    â€¢ install.packages() has a new argument libs_only to optionally
      pass --libs-only to R CMD INSTALL and works analogously for
      Windows binary installs (to add support for 64- or 32-bit
      Windows).

    â€¢ When sub-architectures are in use, the installed architectures
      are recorded in the Archs field of the DESCRIPTION file.  There
      is a new default filter, "subarch", in available.packages() to
      make use of this.

      Code is compiled in a copy of the src directory when a package is
      installed for more than one sub-architecture: this avoid problems
      with cleaning the sources between building sub-architectures.

    â€¢ R CMD INSTALL --libs-only no longer overrides the setting of
      locking, so a previous version of the package will be restored
      unless --no-lock is specified.

  UTILITIES:

    â€¢ R CMD Rprof|build|check are now based on R rather than Perl
      scripts.  The only remaining Perl scripts are the deprecated R
      CMD Sd2Rd and install-info.pl (used only if install-info is not
      found) as well as some maintainer-mode-only scripts.

      *NB:* because these have been completely rewritten, users should
      not expect undocumented details of previous implementations to
      have been duplicated.

      R CMD no longer manipulates the environment variables PERL5LIB
      and PERLLIB.

    â€¢ R CMD check has a new argument --extra-arch to confine tests to
      those needed to check an additional sub-architecture.

      Its check for â€œSubdirectory 'inst' contains no filesâ€ is more
      thorough: it looks for files, and warns if there are only empty
      directories.

      Environment variables such as R_LIBS and those used for
      customization can be set for the duration of checking _via_ a
      file ~/.R/check.Renviron (in the format used by .Renviron, and
      with sub-architecture specific versions such as
      ~/.R/check.Renviron.i386 taking precedence).

      There are new options --multiarch to check the package under all
      of the installed sub-architectures and --no-multiarch to confine
      checking to the sub-architecture under which check is invoked.
      If neither option is supplied, a test is done of installed
      sub-architectures and all those which can be run on the current
      OS are used.

      Unless multiple sub-architectures are selected, the install done
      by check for testing purposes is only of the current
      sub-architecture (_via_ R CMD INSTALL --no-multiarch).

      It will skip the check for non-ascii characters in code or data
      if the environment variables _R_CHECK_ASCII_CODE_ or
      _R_CHECK_ASCII_DATA_ are respectively set to FALSE.  (Suggestion
      of Vince Carey.)

    â€¢ R CMD build no longer creates an INDEX file (R CMD INSTALL does
      so), and --force removes (rather than overwrites) an existing
      INDEX file.

      It supports a file ~/.R/build.Renviron analogously to check.

      It now runs build-time \Sexpr expressions in help files.

    â€¢ R CMD Rd2dvi makes use of tools::texi2dvi() to process the
      package manual.  It is now implemented entirely in R (rather than
      partially as a shell script).

    â€¢ R CMD Rprof now uses utils::summaryRprof() rather than Perl.  It
      has new arguments to select one of the tables and to limit the
      number of entries printed.

    â€¢ R CMD Sweave now runs R with --vanilla so the environment setting
      of R_LIBS will always be used.

  C-LEVEL FACILITIES:

    â€¢ lang5() and lang6() (in addition to pre-existing lang[1-4]())
      convenience functions for easier construction of eval() calls.
      If you have your own definition, do wrap it inside #ifndef lang5
      .... #endif to keep it working with old and new R.

    â€¢ Header R.h now includes only the C headers it itself needs, hence
      no longer includes errno.h.  (This helps avoid problems when it
      is included from C++ source files.)

    â€¢ Headers Rinternals.h and R_ext/Print.h include the C++ versions
      of stdio.h and stdarg.h respectively if included from a C++
      source file.

  INSTALLATION:

    â€¢ A C99 compiler is now required, and more C99 language features
      will be used in the R sources.

    â€¢ Tcl/Tk >= 8.4 is now required (increased from 8.3).

    â€¢ System functions access, chdir and getcwd are now essential to
      configure R.  (In practice they have been required for some
      time.)

    â€¢ make check compares the output of the examples from several of
      the base packages to reference output rather than the previous
      output (if any).  Expect some differences due to differences in
      floating-point computations between platforms.

    â€¢ File NEWS is no longer in the sources, but generated as part of
      the installation.  The primary source for changes is now
      doc/NEWS.Rd.

    â€¢ The popen system call is now required to build R.  This ensures
      the availability of system(intern = TRUE), pipe() connections and
      printing from postscript().

    â€¢ The pkg-config file libR.pc now also works when R is installed
      using a sub-architecture.

    â€¢ R has always required a BLAS that conforms to IE60559 arithmetic,
      but after discovery of more real-world problems caused by a BLAS
      that did not, this is tested more thoroughly in this version.

  BUG FIXES:

    â€¢ Calls to selectMethod() by default no longer cache inherited
      methods.  This could previously corrupt methods used by as().

    â€¢ The densities of non-central chi-squared are now more accurate in
      some cases in the extreme tails, e.g. dchisq(2000, 2, 1000), as a
      series expansion was truncated too early.  (PR#14105)

    â€¢ pt() is more accurate in the left tail for ncp large, e.g.
      pt(-1000, 3, 200).  (PR#14069)

    â€¢ The default C function (R_binary) for binary ops now sets the S4
      bit in the result if either argument is an S4 object.  (PR#13209)

    â€¢ source(echo=TRUE) failed to echo comments that followed the last
      statement in a file.

    â€¢ S4 classes that contained one of "matrix", "array" or "ts" and
      also another class now accept superclass objects in new().  Also
      fixes failure to call validObject() for these classes.

    â€¢ Conditional inheritance defined by argument test in
      methods::setIs() will no longer be used in S4 method selection
      (caching these methods could give incorrect results).  See
      ?setIs.

    â€¢ The signature of an implicit generic is now used by setGeneric()
      when that does not use a definition nor explicitly set a
      signature.

    â€¢ A bug in callNextMethod() for some examples with "..." in the
      arguments has been fixed.  See file
      src/library/methods/tests/nextWithDots.R in the sources.

    â€¢ match(x, table) (and hence %in%) now treat "POSIXlt" consistently
      with, e.g., "POSIXct".

    â€¢ Built-in code dealing with environments (get(), assign(),
      parent.env(), is.environment() and others) now behave
      consistently to recognize S4 subclasses; is.name() also
      recognizes subclasses.

    â€¢ The abs.tol control parameter to nlminb() now defaults to 0.0 to
      avoid false declarations of convergence in objective functions
      that may go negative.

    â€¢ The standard Unix-alike termination dialog to ask whether to save
      the workspace takes a EOF response as n to avoid problems with a
      damaged terminal connection.  (PR#14332)

    â€¢ Added warn.unused argument to hist.default() to allow suppression
      of spurious warnings about graphical parameters used with
      plot=FALSE.  (PR#14341)

    â€¢ predict.lm(), summary.lm(), and indeed lm() itself had issues
      with residual DF in zero-weighted cases (the latter two only in
      connection with empty models). (Thanks to Bill Dunlap for
      spotting the predict() case.)

    â€¢ aperm() treated resize = NA as resize = TRUE.

    â€¢ constrOptim() now has an improved convergence criterion, notably
      for cases where the minimum was (very close to) zero; further,
      other tweaks inspired from code proposals by Ravi Varadhan.

    â€¢ Rendering of S3 and S4 methods in man pages has been corrected
      and made consistent across output formats.

    â€¢ Simple markup is now allowed in \title sections in .Rd files.

    â€¢ The behaviour of as.logical() on factors (to use the levels) was
      lost in R 2.6.0 and has been restored.

    â€¢ prompt() did not backquote some default arguments in the \usage
      section.  (Reported by Claudia Beleites.)

    â€¢ writeBin() disallows attempts to write 2GB or more in a single
      call. (PR#14362)

    â€¢ new() and getClass() will now work if Class is a subclass of
      "classRepresentation" and should also be faster in typical calls.

    â€¢ The summary() method for data frames makes a better job of names
      containing characters invalid in the current locale.

    â€¢ [[ sub-assignment for factors could create an invalid factor
      (reported by Bill Dunlap).

    â€¢ Negate(f) would not evaluate argument f until first use of
      returned function (reported by Olaf Mersmann).

    â€¢ quietly=FALSE is now also an optional argument of library(), and
      consequently, quietly is now propagated also for loading
      dependent packages, e.g., in require(*, quietly=TRUE).

    â€¢ If the loop variable in a for loop was deleted, it would be
      recreated as a global variable.  (Reported by Radford Neal; the
      fix includes his optimizations as well.)

    â€¢ Task callbacks could report the wrong expression when the task
      involved parsing new code. (PR#14368)

    â€¢ getNamespaceVersion() failed; this was an accidental change in
      2.11.0. (PR#14374)

    â€¢ identical() returned FALSE for external pointer objects even when
      the pointer addresses were the same.

    â€¢ L$a@x[] <- val did not duplicate in a case it should have.

    â€¢ tempfile() now always gives a random file name (even if the
      directory is specified) when called directly after startup and
      before the R RNG had been used.  (PR#14381)

    â€¢ quantile(type=6) behaved inconsistently.  (PR#14383)

    â€¢ backSpline(.) behaved incorrectly when the knot sequence was
      decreasing.  (PR#14386)

    â€¢ The reference BLAS included in R was assuming that 0*x and x*0
      were always zero (whereas they could be NA or NaN in IEC 60559
      arithmetic).  This was seen in results from tcrossprod, and for
      example that log(0) %*% 0 gave 0.

    â€¢ The calculation of whether text was completely outside the device
      region (in which case, you draw nothing) was wrong for screen
      devices (which have [0, 0] at top-left).  The symptom was (long)
      text disappearing when resizing a screen window (to make it
      smaller).  (PR#14391)

    â€¢ model.frame(drop.unused.levels = TRUE) did not take into account
      NA values of factors when deciding to drop levels. (PR#14393)

    â€¢ library.dynam.unload required an absolute path for libpath.
      (PR#14385)

      Both library() and loadNamespace() now record absolute paths for
      use by searchpaths() and getNamespaceInfo(ns, "path").

    â€¢ The self-starting model NLSstClosestX failed if some deviation
      was exactly zero.  (PR#14384)

    â€¢ X11(type = "cairo") (and other devices such as png using
      cairographics) and which use Pango font selection now work around
      a bug in Pango when very small fonts (those with sizes between 0
      and 1 in Pango's internal units) are requested.  (PR#14369)

    â€¢ Added workaround for the font problem with X11(type = "cairo")
      and similar on Mac OS X whereby italic and bold styles were
      interchanged.  (PR#13463 amongst many other reports.)

    â€¢ source(chdir = TRUE) failed to reset the working directory if it
      could not be determined - that is now an error.

    â€¢ Fix for crash of example(rasterImage) on x11(type="Xlib").

    â€¢ Force Quartz to bring the on-screen display up-to-date
      immediately before the snapshot is taken by grid.cap() in the
      Cocoa implementation. (PR#14260)

    â€¢ model.frame had an unstated 500 byte limit on variable names.
      (Example reported by Terry Therneau.)

    â€¢ The 256-byte limit on names is now documented.    â€¢ Subassignment by [, [[ or $ on an expression object with value
      NULL coerced the object to a list.

R 2.12.0 released (r-bloggers.com)
Rcpp 0.8.7 (r-bloggers.com)
R 2.12.0 released (revolutionanalytics.com)
RcppArmadillo 0.2.7 (dirk.eddelbuettel.com)

Parsing XML files easily

To parse a XML (or KML or PMML) file easily without using any complicated softwares, here is a piece of code that fits right in your excel sheet.

Just import this file using Excel, and then use the function getElement, after pasting the XML code in 1 cell.

It is used for simply reading the xml/kml code as a text string. Just pasted all the xml code in one cell, and used the start ,end function (for example start=<constraints> and end=</constraints> to get the value of constraints in the xml code).

Simply read into the value in another cell using the getElement function.

heres the code if you ever need it.Just paste it into the VB editor of Excel to create the GetElement function (if not there already) or simply import the file in the link above.

Attribute VB_Name = “Module1”
Public Function getElement(xml As String, start As String, finish As String)
For i = 1 To Len(xml)
If Mid(xml, i, Len(start)) = start Then
For j = i + Len(start) To Len(xml)
If Mid(xml, j, Len(finish)) = finish Then
getElement = Mid(xml, i + Len(start), j – i – Len(start))
Exit Function
End If
Next j
End If
Next i
End Function

FOR Using the R Package for parsing XML …………………………reference this site –

http://www.omegahat.org/RSXML/Overview.html

or this thread from R -Help

> Lines <- ‘
+ <root>
+ <data loc=”1″>
+ <val i=”t1″> 22 </val>
+ <val i=”t2″> 45 </val>
+ </data>
+ <data loc=”2″>
+ <val i=”t1″> 44 </val>
+ <val i=”t2″> 11 </val>
+ </data>
+ </root>
+ ‘
>
> library(XML)
> doc <- xmlTreeParse(Lines, asText = TRUE, trim = TRUE, useInternalNodes = TRUE)
> root <- xmlRoot(doc)
>
> data1 <- getNodeSet(root, “//data”)[[1]]
> xmlValue(getNodeSet(data1, “//val”)[[1]])
[1] ” 22 “

Please share:

Please share:

Related Articles

Please share:

Related Articles

Please share:

Please share: