RapidMiner - R Extension

Here is a new video which shows exactly how you can use RapidMiner and R together. The advantage of using both together is combining RapidMiner’s GUI (including the flowchart style for data mining) with R’s statistical functionality.

From http://rapid-i.com/content/view/219/1/

The web site features a video showing how easily R models and scripts can be integrated into RapidMiner analysis processes. RapidMiner offers a new R perspective consisting of the familiar R console together with the great plotting facilities of R. All variables as well as R scripts can be stored in the RapidMiner Repository and used from there, which helps to organize the usually large number of scripts. Furthermore, widely used modeling methods are directly integrated as RapidMiner operators, as usual.

“This is a huge step for open source data analysis. RapidMiner offers a great user interface, a clear process structure and lots of ETL and analysis capabilities necessary for real-world problems. R adds a lot of flexibility and many analysis and data manipulation methods. The result is the by far most powerful data transformation and analysis solution worldwide. And this analysis power is now combined with the ease-of-use already known from RapidMiner.” states Dr. Ingo Mierswa, CEO of Rapid-I.

Visit RCOMM 2010 and learn more about how to integrate analysis and preprocessing methods offered by R, as well as how to use the new R perspective offering a full R console and access to all R plotters.

Thus RapidMiner is one more mainstream analytics tool (after SPSS, SAS, etc.) to add R functionality.

Dryad - Microsoft's Answer to MapReduce

While reading across the internet I came across Microsoft’s answer to MapReduce, called Dryad, which has been around for some time but has not generated quite the buzz that Hadoop or MapReduce have.

http://research.microsoft.com/en-us/projects/dryadlinq/

DryadLINQ

DryadLINQ is a simple, powerful, and elegant programming environment for writing large-scale data parallel applications running on large PC clusters.

Overview

New! An academic release of Dryad/DryadLINQ is now available for public download.

The goal of DryadLINQ is to make distributed computing on large compute clusters simple enough for every programmer. DryadLINQ combines two important pieces of Microsoft technology: the Dryad distributed execution engine and the .NET Language Integrated Query (LINQ).

Dryad provides reliable, distributed computing on thousands of servers for large-scale data parallel applications. LINQ enables developers to write and debug their applications in a SQL-like query language, relying on the entire .NET library and using Visual Studio.

DryadLINQ translates LINQ programs into distributed Dryad computations:

  • C# and LINQ data objects become distributed partitioned files.
  • LINQ queries become distributed Dryad jobs.
  • C# methods become code running on the vertices of a Dryad job.

DryadLINQ has the following features:

  • Declarative programming: computations are expressed in a high-level language similar to SQL.
  • Automatic parallelization: from sequential declarative code, the DryadLINQ compiler generates highly parallel query plans spanning large computer clusters. For exploiting multi-core parallelism on each machine, DryadLINQ relies on the PLINQ parallelization framework.
  • Integration with Visual Studio: DryadLINQ programmers take advantage of the comprehensive set of Visual Studio tools: IntelliSense, code refactoring, integrated debugging, build, and source code management.
  • Integration with .NET: all .NET libraries and languages, including Visual Basic and dynamic languages, are available.
  • Conciseness: the following code is a complete implementation of the Map-Reduce computation framework in DryadLINQ:
    public static IQueryable<R> MapReduce<S,M,K,R>(
        this IQueryable<S> source,
        Expression<Func<S,IEnumerable<M>>> mapper,
        Expression<Func<M,K>> keySelector,
        Expression<Func<K,IEnumerable<M>,R>> reducer)
    {
        return source.SelectMany(mapper).GroupBy(keySelector, reducer);
    }
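
For readers more at home in R than C#, the same map/group/reduce shape can be sketched in a few lines of base R. This is a toy, single-machine analogue of the pattern (not Dryad code; the function and variable names are made up for the illustration):

    # Flatten the mapper output (SelectMany), group the mapped items by key
    # (GroupBy), then apply the reducer to each group.
    map_reduce <- function(source, mapper, key_selector, reducer) {
      mapped <- unlist(lapply(source, mapper), recursive = FALSE)
      keys   <- vapply(mapped, key_selector, character(1))
      groups <- split(mapped, keys)
      Map(reducer, names(groups), groups)
    }

    # The canonical word count:
    lines <- c("a b a", "b c")
    map_reduce(lines,
               mapper       = function(line) strsplit(line, " ")[[1]],
               key_selector = identity,
               reducer      = function(key, words) length(words))
    # returns a list: a = 2, b = 2, c = 1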

From http://research.microsoft.com/en-us/projects/dryad/

Dryad

The Dryad Project is investigating programming models for writing parallel and distributed programs to scale from a small cluster to a large data-center.

Overview

New! An academic release of DryadLINQ is now available for public download.

Dryad is an infrastructure which allows a programmer to use the resources of a computer cluster or a data center for running data-parallel programs. A Dryad programmer can use thousands of machines, each of them with multiple processors or cores, without knowing anything about concurrent programming.

The Structure of Dryad Jobs

A Dryad programmer writes several sequential programs and connects them using one-way channels. The computation is structured as a directed graph: programs are graph vertices, while the channels are graph edges. A Dryad job is a graph generator which can synthesize any directed acyclic graph. These graphs can even change during execution, in response to important events in the computation.

Dryad is quite expressive. It completely subsumes other computation frameworks, such as Google’s map-reduce or the relational algebra. Moreover, Dryad handles job creation and management, resource management, job monitoring and visualization, fault tolerance, re-execution, scheduling, and accounting.

The Dryad Software Stack

As a proof of Dryad’s versatility, a rich software ecosystem has been built on top of Dryad:

  • SSIS on Dryad executes many instances of SQL Server, each in a separate Dryad vertex, taking advantage of Dryad’s fault tolerance and scheduling. This system is currently deployed in a live production system as part of one of Microsoft’s AdCenter log processing pipelines.
  • DryadLINQ generates Dryad computations from the LINQ Language-Integrated Query extensions to C#.
  • The distributed shell is a generalization of the pipe concept from the Unix shell. Where Unix pipes allow the construction of one-dimensional (1-D) process structures, the distributed shell allows the programmer to build 2-D structures in a scripting language. It generalizes Unix pipes in three ways:
    1. It allows each process to easily connect multiple file descriptors, hence the 2-D aspect.
    2. It allows the construction of pipes spanning multiple machines, across a cluster.
    3. It virtualizes the pipelines, allowing the execution of pipelines with many more processes than available machines, by time-multiplexing processors and buffering results.
  • Several languages are compiled to distributed shell processes. PSQL is an early version, recently replaced with Scope.

Publications

Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks
Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly
European Conference on Computer Systems (EuroSys), Lisbon, Portugal, March 21-23, 2007

Video of a presentation on Dryad at the Google Campus, given by Michael Isard, Nov 1, 2007.

Also interesting to read:

Why does Dryad use a DAG?

The basic computational model we decided to adopt for Dryad is the directed-acyclic graph (DAG). Each node in the graph is a computation, and each edge in the graph is a stream of data traveling in the direction of the edge. The amount of data on any given edge is assumed to be finite, the computations are assumed to be deterministic, and the inputs are assumed to be immutable. This isn’t by any means a new way of structuring a distributed computation (for example Condor had DAGMan long before Dryad came along), but it seemed like a sweet spot in the design space given our other constraints.

So, why is this a sweet spot? A DAG is very convenient because it induces an ordering on the nodes in the graph. That makes it easy to design scheduling policies, since you can define a node to be ready when its inputs are available, and at any time you can choose to schedule as many ready nodes as you like in whatever order you like, and as long as you always have at least one scheduled you will continue to make progress and never deadlock. It also makes fault-tolerance easy, since given our determinism and immutability assumptions you can backtrack as far as you want in the DAG and re-execute as many nodes as you like to regenerate intermediate data that has been lost or is unavailable due to cluster failures.

From http://blogs.msdn.com/b/dryad/archive/2010/07/23/why-does-dryad-use-a-dag.aspx
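
The "ready when its inputs are available" scheduling rule is easy to see in code. Here is a minimal sketch in R of the idea described above (a toy illustration, not Dryad code; the vertex names are invented):

    # Each vertex lists the vertices whose output it consumes.
    inputs <- list(A = character(0),
                   B = character(0),
                   C = c("A", "B"),
                   D = "C")

    done <- character(0)
    while (length(done) < length(inputs)) {
      # A vertex is ready when it is unfinished and all of its inputs are done.
      ready <- names(inputs)[!(names(inputs) %in% done) &
                             vapply(inputs, function(x) all(x %in% done), logical(1))]
      if (length(ready) == 0) stop("cycle detected: not a DAG")
      # A real engine would run the ready vertices in parallel, and could
      # re-execute any of them after a failure, thanks to the determinism
      # and immutability assumptions.
      message("scheduling: ", paste(ready, collapse = ", "))
      done <- c(done, ready)
    }
    # scheduling: A, B
    # scheduling: C
    # scheduling: D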

Software Lawsuits: Ergo

The latest round of software lawsuits makes things more interesting, especially for Google. There are two notable developments:

1) Google’s pact with Verizon for an even more open Internet. From

http://googlepublicpolicy.blogspot.com/2010/08/joint-policy-proposal-for-open-internet.html

A provider that offers a broadband Internet access service complying with the above principles could offer any other additional or differentiated services. Such other services would have to be distinguishable in scope and purpose from broadband Internet access service, but could make use of or access Internet content, applications or services and could include traffic prioritization.

2) Oracle’s lawsuit against Google for intellectual property enforcement of Java in Android (read here: http://news.cnet.com/8301-30685_3-20013549-264.html )

I once joked that nothing remains cool forever, not even Google (see https://decisionstats.wordpress.com/2008/08/05/11-ways-to-beat-up-google/ ), and I did not foresee the big G tying itself into knots on its own.

It is hard to sympathize with Google (or Oracle or Verizon), but this is the mess that gets created when lawyers with a briefcase can steal more value than a thousand engineers can create.

Interestingly, Google owns the IP for MapReduce. Could it someday sue the Hadoop community over royalty terms, like Oracle did with Java? Hmm, an interesting revenue stream.

All in all, I would be happy to see zero tiers on the Internet (wireless or wired), and even Java developers making some money from writing code. Open source is not free source.

GNU PSPP - The Open Source SPSS

If you are an SPSS user (for statistics, not data mining) you can also try out GNU PSPP, the open source equivalent, which is quite eerily impressive in performance. It is available at http://www.gnu.org/software/pspp/ or http://pspp.awardspace.com/ and you can also read more at http://en.wikipedia.org/wiki/PSPP

PSPP is a program for statistical analysis of sampled data. It is a Free replacement for the proprietary program SPSS, and appears very similar to it with a few exceptions.

[Image: the PSPP variable sheet]

The most important of these exceptions are that there are no “time bombs”: your copy of PSPP will not “expire” or deliberately stop working in the future. Neither are there any artificial limits on the number of cases or variables which you can use. There are no additional packages to purchase in order to get “advanced” functions; all functionality that PSPP currently supports is in the core package.

PSPP can perform descriptive statistics, T-tests, linear regression and non-parametric tests. Its backend is designed to perform its analyses as fast as possible, regardless of the size of the input data. You can use PSPP with its graphical interface or the more traditional syntax commands.

A brief list of some of the features of PSPP follows:

  • Supports over 1 billion cases.
  • Supports over 1 billion variables.
  • Syntax and data files are compatible with SPSS.
  • Choice of terminal or graphical user interface.
  • Choice of text, PostScript or HTML output formats.
  • Inter-operates with Gnumeric, OpenOffice.org and other free software.
  • Easy data import from spreadsheets, text files and database sources.
  • Fast statistical procedures, even on very large data sets.
  • No license fees.
  • No expiration period.
  • No unethical “end user license agreements”.
  • Fully indexed user manual.
  • Free Software; licensed under GPLv3 or later.
  • Cross-platform; runs on many different computers and many different operating systems.

PSPP is particularly aimed at statisticians, social scientists and students requiring fast, convenient analysis of sampled data.

and

Features

This software provides a basic set of capabilities: frequencies, cross-tabs, comparison of means (T-tests and one-way ANOVA), linear regression, reliability (Cronbach’s Alpha, not failure or Weibull), re-ordering data, non-parametric tests, factor analysis and more.

At the user’s choice, statistical output and graphics are done in ASCII, PDF, PostScript or HTML formats. A limited range of statistical graphs can be produced, such as histograms, pie-charts and np-charts.

PSPP can import Gnumeric, OpenDocument and Excel spreadsheets, Postgres databases, comma-separated values and ASCII files. It can export files in the SPSS ‘portable’ and ‘system’ file formats and to ASCII files. Some of the libraries used by PSPP can be accessed programmatically; PSPP-Perl provides an interface to the libraries used by PSPP.

Origins

The PSPP project (originally called “Fiasco”) is a free, open-source alternative to the proprietary statistics package SPSS. SPSS is closed-source and includes a restrictive licence and digital rights management. The author of PSPP considered this ethically unacceptable, and decided to write a program which might with time become functionally identical to SPSS, except that there would be no licence expiry, and everyone would be permitted to copy, modify and share the program.

Release history

  • 0.7.5 June 2010 http://pspp.awardspace.com/
  • 0.6.2 October 2009
  • 0.6.1 October 2008
  • 0.6.0 June 2008
  • 0.4.0.1 August 2007
  • 0.4.0 August 2005
  • 0.3.0 April 2004
  • 0.2.4 January 2000
  • 0.1.0 August 1998

Third Party Reviews

In the book “SPSS For Dummies”, the author discusses PSPP under the heading of “Ten Useful Things You Can Find on the Internet” [1]. In 2006, the South African Statistical Association presented a conference which included an analysis of how PSPP can be used as a free replacement to SPSS [2].

Citation:

Copyright © 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007 Free Software Foundation, Inc., 51 Franklin St – Suite 330, Boston, MA 02110, USA. Verbatim copying and distribution of this entire article are permitted worldwide, without royalty, in any medium, provided this notice, and the copyright notice, are preserved.

      Q&A with David Smith, Revolution Analytics.

      Here’s a group of questions and answers that David Smith of Revolution Analytics was kind enough to answer post the launch of the new R Package which integrates Hadoop and R-                         RevoScaleR

Ajay- How does RevoScaleR work from a technical viewpoint in terms of Hadoop integration?

David- The point isn’t that there’s a deep technical integration between Revolution R and Hadoop; rather, we see them as complementary (not competing) technologies. Hadoop is amazing at reliably (if slowly) processing huge volumes of distributed data; the RevoScaleR package complements Hadoop by providing statistical algorithms to analyze the data processed by Hadoop. The analogy I use is to compare a freight train with a race car: use Hadoop to slog through a distributed data set and use Map/Reduce to output an aggregated, rectangular data file; then use RevoScaleR to perform statistical analysis on the processed data (and use the speed of RevoScaleR to iterate through many model options to find the best one).

Ajay- How is it different from MapReduce and Rhipe, the existing R Hadoop packages?

David- They’re complementary. In fact, we’ll be publishing a white paper soon by Saptarshi Guha, author of the Rhipe R/Hadoop integration, showing how he uses Hadoop to process vast volumes of packet-level VOIP data to identify call time/duration from the packets, and then do a regression on the table of calls using RevoScaleR. There’s a little more detail in this blog post: http://blog.revolutionanalytics.com/2010/08/announcing-big-data-for-revolution-r.html

Ajay- Is it going to be proprietary, free or licensable (open source)?

David- RevoScaleR is a proprietary package, available to paid subscribers (or free to academics) with Revolution R Enterprise. (If you haven’t seen it, you might be interested in this Q&A I did with Matt Shotwell: http://biostatmatt.com/archives/533 )

Ajay- Any existing client case studies for terabyte-level analysis using R?

David- The VOIP example above gets close, but most of the case studies we’ve seen in beta testing have been in the 10s to 100s of GB range. We’ve tested RevoScaleR on larger data sets internally, but we’re eager to hear about real-life use cases in the terabyte range.

Ajay- How can I use RevoScaleR on my dual-core Windows Intel laptop for, say, 5 GB of data?

David- One of the great things about RevoScaleR is that it’s designed to work on commodity hardware like a dual-core laptop. You won’t be constrained by the limited RAM available, and the parallel processing algorithms will make use of all cores available to speed up the analysis even further. There’s an example in this white paper (http://info.revolutionanalytics.com/bigdata.html) of doing linear regression on 13 GB of data on a simple dual-core laptop in less than 5 seconds.
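
To make the freight-train/race-car workflow concrete, here is a minimal sketch, assuming Revolution R Enterprise with the RevoScaleR package and a rectangular CSV already produced by a Hadoop Map/Reduce job; the file names and columns are hypothetical:

    library(RevoScaleR)

    # Convert the aggregated Hadoop output once into RevoScaleR's chunked
    # .xdf format, so analyses stream through it block by block instead of
    # loading everything into RAM.
    rxImport(inData = "calls.csv", outFile = "calls.xdf", overwrite = TRUE)

    # Fit a linear model on the full file; RevoScaleR processes the data in
    # chunks and uses all available cores, which is why a multi-gigabyte
    # file is workable even on a dual-core laptop.
    fit <- rxLinMod(duration ~ hour + region, data = "calls.xdf")
    summary(fit)
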
AJ- Thanks to David Smith for this fast response, and congratulations to him, Saptarshi Guha, Dr Norman Nie and the rest of the team at Revolution Analytics on this new product launch.

R Oracle Data Mining

Here is a new package called RODM, an interface for doing data mining on Oracle tables through R. You can read more here http://www.oracle.com/technetwork/database/options/odm/odm-r-integration-089013.html and here http://cran.fhcrc.org/web/packages/RODM/RODM.pdf . Also, there is a contest for creative use of R and ODM.

R Interface to Oracle Data Mining

The R Interface to Oracle Data Mining (R-ODM) allows R users to access the power of Oracle Data Mining’s in-database functions using the familiar R syntax. R-ODM provides a powerful environment for prototyping data analysis and data mining methodologies.

R-ODM is especially useful for:

  • Quick prototyping of vertical or domain-based applications where the Oracle Database supports the application
  • Scripting of “production” data mining methodologies
  • Customizing graphics of ODM data mining results (examples: classification, regression, anomaly detection)

The R-ODM interface allows R users to mine data using Oracle Data Mining from the R programming environment. It consists of a set of function wrappers written in source R language that pass data and parameters from the R environment to the Oracle RDBMS Enterprise Edition as standard user PL/SQL queries via an ODBC interface. The R-ODM interface code is a thin layer of logic and SQL that calls through an ODBC interface. R-ODM does not use or expose any Oracle product code, as it is completely an external interface and not part of any Oracle product. R-ODM is similar to the example scripts (e.g., the PL/SQL demo code) that illustrate the use of Oracle Data Mining, for example, how to create data mining models, pass arguments, retrieve results, etc.

R-ODM is packaged as a standard R source package and is distributed freely as part of the R environment’s Comprehensive R Archive Network (CRAN). For information about the R environment, R packages and CRAN, see www.r-project.org.
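
Going by the RODM manual linked above, a session looks roughly like the sketch below. The DSN, credentials and data are placeholders, and the exact argument names should be checked against the RODM.pdf reference:

    library(RODM)

    # Open an ODBC connection to an Oracle database that has the Data Mining
    # option (DSN, user and password are hypothetical).
    DB <- RODM_open_dbms_connection(dsn = "orcl11g", uid = "dmuser",
                                    pwd = "dmuser")

    # Push a small R data frame into an Oracle table, then build an
    # in-database SVM classification model on it via Oracle Data Mining.
    titanic <- as.data.frame(Titanic)
    RODM_create_dbms_table(DB, "titanic")
    svm_model <- RODM_create_svm_model(database = DB,
                                       data_table_name = "titanic",
                                       target_column_name = "Survived")

    RODM_close_dbms_connection(DB)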

and

Present and win an Apple iPod Touch!
The BI, Warehousing and Analytics (BIWA) SIG is giving an Apple iPod Touch to the best new presenter. Be part of the TechCast series and get a chance to win!

Consider highlighting a creative use of R and ODM.

BIWA invites all Oracle professionals (experts, end users, managers, DBAs, developers, data analysts, ISVs, partners, etc.) to submit abstracts for 45-minute technical webcasts to our Oracle BIWA (IOUG SIG) Community in our Wednesday TechCast series. Note that the contest is limited to new presenters to encourage fresh participation by the BIWA community.

Also see an interview with the head of Oracle Data Mining, Charlie Berger: https://decisionstats.wordpress.com/2009/09/02/oracle/

RExcel: Updated

It was really nice to see the latest version of RExcel at http://rcom.univie.ac.at/ , bundled together in an aptly named package called RAndFriends.

The look and feel of the package as well as the ease of installation are really professional. I also liked the commercial equivalent at http://www.statconn.com/

However, the old guard and die-hards of the command line may feel that a GUI is like putting lipstick on a pig, but we respectfully demur.

What does RExcel do? Well, for one it can put the R Commander interface INSIDE your Excel spreadsheet. That makes for an easy-to-use and familiar interface even if you are a newbie to R (assuming you have done some Excel).
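
If you have not seen R Commander on its own, it is two lines away in plain R (the Rcmdr package on CRAN); RExcel embeds this same menu-driven interface inside the spreadsheet:

    # Install and launch the R Commander GUI.
    install.packages("Rcmdr")
    library(Rcmdr)   # loading the package opens the R Commander window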

Download the latest version here:

RAndFriends

This package will automatically install and configure

  • R 2.11.1
  • rscproxy 1.3-1
  • rcom 2.2-1

It will also download and install a suitable version of the statconnDCOM server and of RExcel during installation. Therefore you will need a working Internet connection during the installation process.
This version of RAndFriends was created 20100516.

Download RAndFriendsSetup2111V3.1-5-1

We also give you information on how to download all sources for R and the R packages included in RAndFriends.

Also read a paper on R and SAS interoperability (using the Hmisc package from Dr Harrell) at Holland Numerics:

http://www.hollandnumerics.co.uk/pdf/SAS2R2SAS_paper.pdf