Dryad- Microsoft's Answer to MapReduce

While reading across the internet I came across Microsoft’s answer to MapReduce, called Dryad. It has been around for some time but has not generated quite the buzz that Hadoop and MapReduce have.

http://research.microsoft.com/en-us/projects/dryadlinq/

DryadLINQ

DryadLINQ is a simple, powerful, and elegant programming environment for writing large-scale data parallel applications running on large PC clusters.

Overview

New! An academic release of Dryad/DryadLINQ is now available for public download.

The goal of DryadLINQ is to make distributed computing on large compute clusters simple enough for every programmer. DryadLINQ combines two important pieces of Microsoft technology: the Dryad distributed execution engine and the .NET Language Integrated Query (LINQ).

Dryad provides reliable, distributed computing on thousands of servers for large-scale data parallel applications. LINQ enables developers to write and debug their applications in a SQL-like query language, relying on the entire .NET library and using Visual Studio.

DryadLINQ translates LINQ programs into distributed Dryad computations:

  • C# and LINQ data objects become distributed partitioned files.
  • LINQ queries become distributed Dryad jobs.
  • C# methods become code running on the vertices of a Dryad job.

DryadLINQ has the following features:

  • Declarative programming: computations are expressed in a high-level language similar to SQL
  • Automatic parallelization: from sequential declarative code, the DryadLINQ compiler generates highly parallel query plans spanning large computer clusters. For exploiting multi-core parallelism on each machine, DryadLINQ relies on the PLINQ parallelization framework.
  • Integration with Visual Studio: DryadLINQ programmers take advantage of the comprehensive Visual Studio set of tools: IntelliSense, code refactoring, integrated debugging, build, source code management.
  • Integration with .NET: all .NET libraries, including Visual Basic and dynamic languages, are available.
  • Conciseness: the following code is a complete implementation of the Map-Reduce computation framework in DryadLINQ:
      public static IQueryable<R>
      MapReduce<S, M, K, R>(this IQueryable<S> source,
                            Expression<Func<S, IEnumerable<M>>> mapper,
                            Expression<Func<M, K>> keySelector,
                            Expression<Func<K, IEnumerable<M>, R>> reducer)
      {
          // Map each element to a sequence and flatten (SelectMany), then
          // group by key and reduce each group to a single result (GroupBy).
          return source.SelectMany(mapper).GroupBy(keySelector, reducer);
      }
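
    For readers who think in R rather than C#, here is a rough sequential analogue of the same map-reduce shape (an illustrative sketch only: the map_reduce function below is mine, and unlike DryadLINQ nothing in it is distributed):

      # A sequential R analogue of the MapReduce pattern above.
      # Illustrative sketch only; DryadLINQ runs this shape of computation
      # across a cluster, this toy version does not.
      map_reduce <- function(source, mapper, key_selector, reducer) {
        mapped <- unlist(lapply(source, mapper), recursive = FALSE)  # SelectMany
        groups <- split(mapped, sapply(mapped, key_selector))        # GroupBy
        Map(reducer, names(groups), groups)                          # reduce per key
      }

      # Example: word count over two tiny "documents"
      docs <- list("a b a", "b c")
      map_reduce(docs,
                 mapper       = function(d) as.list(strsplit(d, " ")[[1]]),
                 key_selector = identity,
                 reducer      = function(k, vs) length(vs))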

    and http://research.microsoft.com/en-us/projects/dryad/

    Dryad

    The Dryad Project is investigating programming models for writing parallel and distributed programs to scale from a small cluster to a large data-center.

    Overview

    New! An academic release of DryadLINQ is now available for public download.

    Dryad is an infrastructure which allows a programmer to use the resources of a computer cluster or a data center for running data-parallel programs. A Dryad programmer can use thousands of machines, each of them with multiple processors or cores, without knowing anything about concurrent programming.

    The Structure of Dryad Jobs

    A Dryad programmer writes several sequential programs and connects them using one-way channels. The computation is structured as a directed graph: programs are graph vertices, while the channels are graph edges. A Dryad job is a graph generator which can synthesize any directed acyclic graph. These graphs can even change during execution, in response to important events in the computation.

    Dryad is quite expressive. It completely subsumes other computation frameworks, such as Google’s map-reduce, or the relational algebra. Moreover, Dryad handles job creation and management, resource management, job monitoring and visualization, fault tolerance, re-execution, scheduling, and accounting.

    The Dryad Software Stack

    As a proof of Dryad’s versatility, a rich software ecosystem has been built on top of Dryad:

    • SSIS on Dryad executes many instances of SQL server, each in a separate Dryad vertex, taking advantage of Dryad’s fault tolerance and scheduling. This system is currently deployed in a live production system as part of one of Microsoft’s AdCenter log processing pipelines.
    • DryadLINQ generates Dryad computations from the LINQ Language-Integrated Query extensions to C#.
    • The distributed shell generalizes the pipe concept from the Unix shell. Where Unix pipes allow the construction of one-dimensional (1-D) process structures, the distributed shell allows the programmer to build 2-D structures in a scripting language. It generalizes Unix pipes in three ways:
      1. It allows the connection of multiple file descriptors of each process, hence the 2-D aspect.
      2. It allows the construction of pipes spanning multiple machines, across a cluster.
      3. It virtualizes the pipelines, allowing the execution of pipelines with many more processes than available machines, by time-multiplexing processors and buffering results.
    • Several languages are compiled to distributed shell processes. PSQL is an early version, recently replaced with Scope.

    Publications

    Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks
    Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly
    European Conference on Computer Systems (EuroSys), Lisbon, Portugal, March 21-23, 2007

    Video of a presentation on Dryad at the Google Campus, given by Michael Isard, Nov 1, 2007.

    Also interesting to read-

    Why does Dryad use a DAG?

    The basic computational model we decided to adopt for Dryad is the directed-acyclic graph (DAG). Each node in the graph is a computation, and each edge in the graph is a stream of data traveling in the direction of the edge. The amount of data on any given edge is assumed to be finite, the computations are assumed to be deterministic, and the inputs are assumed to be immutable. This isn’t by any means a new way of structuring a distributed computation (for example Condor had DAGMan long before Dryad came along), but it seemed like a sweet spot in the design space given our other constraints.

    So, why is this a sweet spot? A DAG is very convenient because it induces an ordering on the nodes in the graph. That makes it easy to design scheduling policies, since you can define a node to be ready when its inputs are available, and at any time you can choose to schedule as many ready nodes as you like in whatever order you like, and as long as you always have at least one scheduled you will continue to make progress and never deadlock. It also makes fault-tolerance easy, since given our determinism and immutability assumptions you can backtrack as far as you want in the DAG and re-execute as many nodes as you like to regenerate intermediate data that has been lost or is unavailable due to cluster failures.

    from

    http://blogs.msdn.com/b/dryad/archive/2010/07/23/why-does-dryad-use-a-dag.aspx
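
    As a toy illustration of the “ready when its inputs are available” rule described above, here is a minimal R sketch (my own, not Dryad code) of scheduling a DAG in waves:

      # Minimal DAG-scheduling sketch: a vertex is ready once all of its
      # predecessors are done, and any subset of ready vertices may be
      # scheduled in any order. (Illustrative only; not Dryad's scheduler.)
      preds <- list(A = character(0), B = "A", C = "A", D = c("B", "C"))

      done <- character(0)
      while (length(done) < length(preds)) {
        ready <- setdiff(names(preds)[sapply(preds, function(p) all(p %in% done))],
                         done)
        message("scheduling: ", paste(ready, collapse = ", "))
        done <- c(done, ready)  # pretend the scheduled vertices have finished
      }

    Because the graph is acyclic, at least one vertex is always ready, so the loop never deadlocks; and since inputs are immutable and vertices deterministic, any lost intermediate data can be regenerated by re-running vertices.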

      Software Lawsuits: Ergo

      The latest round of software lawsuits makes things more interesting, especially for Google. There are two notable developments:

      1) Google’s pact with Verizon for an even more open Internet. From

      http://googlepublicpolicy.blogspot.com/2010/08/joint-policy-proposal-for-open-internet.html

      A provider that offers a broadband Internet access service complying with the above principles could offer any other additional or differentiated services. Such other services would have to be distinguishable in scope and purpose from broadband Internet access service, but could make use of or access Internet content, applications or services and could include traffic prioritization.

      2) Oracle’s lawsuit against Google for intellectual property enforcement of Java in Android (read here: http://news.cnet.com/8301-30685_3-20013549-264.html )

      I once joked that nothing remains cool forever, not even Google (see https://decisionstats.wordpress.com/2008/08/05/11-ways-to-beat-up-google/ ), but I did not foresee the big G tying itself into knots on its own.

      It is hard to sympathize with Google (or Oracle or Verizon), but this is the mess that is created when lawyers with a briefcase steal more value than a thousand engineers can create.

      Interestingly, Google owns the IP for MapReduce. So could it someday sue the Hadoop community over royalty terms, as Oracle did with Java? Hmm, an interesting revenue stream.

      All in all, I would be happy to see zero tiers on the Internet (wireless or wired), and even Java developers making some money from writing code. Open source is not free source.

      GNU PSPP- The Open Source SPSS

      If you are an SPSS user (for statistics, not data mining) you can also try out GNU PSPP, the open source equivalent, which is eerily impressive in performance. It is available at http://www.gnu.org/software/pspp/ or http://pspp.awardspace.com/ and you can also read more at http://en.wikipedia.org/wiki/PSPP

      PSPP is a program for statistical analysis of sampled data. It is a Free replacement for the proprietary program SPSS, and appears very similar to it with a few exceptions.

      [Image: PSPP variable sheet]
      The most important of these exceptions are that there are no “time bombs”; your copy of PSPP will not “expire” or deliberately stop working in the future. Neither are there any artificial limits on the number of cases or variables which you can use. There are no additional packages to purchase in order to get “advanced” functions; all functionality that PSPP currently supports is in the core package.

      PSPP can perform descriptive statistics, T-tests, linear regression and non-parametric tests. Its backend is designed to perform its analyses as fast as possible, regardless of the size of the input data. You can use PSPP with its graphical interface or the more traditional syntax commands.

      A brief list of some of the features of PSPP follows:

      • Supports over 1 billion cases.
      • Supports over 1 billion variables.
      • Syntax and data files are compatible with SPSS.
      • Choice of terminal or graphical user interface.
      • Choice of text, postscript or html output formats.
      • Inter-operates with Gnumeric, OpenOffice.org and other free software.
      • Easy data import from spreadsheets, text files and database sources.
      • Fast statistical procedures, even on very large data sets.
      • No license fees.
      • No expiration period.
      • No unethical “end user license agreements”.
      • Fully indexed user manual.
      • Free Software; licensed under GPLv3 or later.
      • Cross platform; Runs on many different computers and many different operating systems.

      PSPP is particularly aimed at statisticians, social scientists and students requiring fast convenient analysis of sampled data.

      and

      Features

      This software provides a basic set of capabilities: frequencies, cross-tabs, comparison of means (t-tests and one-way ANOVA), linear regression, reliability (Cronbach’s alpha, not failure or Weibull), re-ordering of data, non-parametric tests, factor analysis and more.

      At the user’s choice, statistical output and graphics are produced in ascii, pdf, postscript or html formats. A limited range of statistical graphs can be produced, such as histograms, pie-charts and np-charts.

      PSPP can import Gnumeric, OpenDocument and Excel spreadsheets, Postgres databases, comma-separated values and ASCII files. It can export files in the SPSS ‘portable’ and ‘system’ file formats and to ASCII files. Some of the libraries used by PSPP can be accessed programmatically; PSPP-Perl provides an interface to the libraries used by PSPP.

      Origins

      The PSPP project (originally called “Fiasco”) is a free, open-source alternative to the proprietary statistics package SPSS. SPSS is closed-source and includes a restrictive licence and digital rights management. The author of PSPP considered this ethically unacceptable, and decided to write a program which might with time become functionally identical to SPSS, except that there would be no licence expiry, and everyone would be permitted to copy, modify and share the program.

      Release history

      • 0.7.5 June 2010 http://pspp.awardspace.com/
      • 0.6.2 October 2009
      • 0.6.1 October 2008
      • 0.6.0 June 2008
      • 0.4.0.1 August 2007
      • 0.4.0 August 2005
      • 0.3.0 April 2004
      • 0.2.4 January 2000
      • 0.1.0 August 1998

      Third Party Reviews

      In the book “SPSS For Dummies”, the author discusses PSPP under the heading of “Ten Useful Things You Can Find on the Internet” [1]. In 2006, the South African Statistical Association presented a conference which included an analysis of how PSPP can be used as a free replacement for SPSS [2].

      Citation-

      Please send FSF & GNU inquiries to gnu@gnu.org. There are also other ways to contact the FSF. Please send broken links and other corrections (or suggestions) to bug-gnu-pspp@gnu.org.

      Copyright © 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007 Free Software Foundation, Inc., 51 Franklin St – Suite 330, Boston, MA 02110, USA – Verbatim copying and distribution of this entire article are permitted worldwide, without royalty, in any medium, provided this notice, and the copyright notice, are preserved.

      Big Data and R: New Product Release by Revolution Analytics

      Press release by the guys at Revolution Analytics, this time claiming to enable terabyte-level analytics with R. Interesting stuff, but techie details are awaited.

      Revolution Analytics Brings Big Data Analysis to R

      The world’s most powerful statistics language can now tackle terabyte-class data sets using Revolution R Enterprise, at a fraction of the cost of legacy analytics products


      JSM 2010 – VANCOUVER (August 3, 2010) — Revolution Analytics today introduced ‘Big Data’ analysis to its Revolution R Enterprise software, taking the popular R statistics language to unprecedented new levels of capacity and performance for analyzing very large data sets. For the first time, R users will be able to process, visualize and model terabyte-class data sets in a fraction of the time of legacy products—without employing expensive or specialized hardware.

      The new version of Revolution R Enterprise introduces an add-on package called RevoScaleR that provides a new framework for fast and efficient multi-core processing of large data sets. It includes:

      • The XDF file format, a new binary ‘Big Data’ file format with an interface to the R language that provides high-speed access to arbitrary rows, blocks and columns of data.
      • A collection of widely-used statistical algorithms optimized for Big Data, including high-performance implementations of Summary Statistics, Linear Regression, Binomial Logistic Regression and Crosstabs, with more to be added in the near future (see the sketch after this list).
      • Data Reading & Transformation tools that allow users to interactively explore and prepare large data sets for analysis.
      • Extensibility: expert R users can develop and extend their own statistical algorithms to take advantage of Revolution R Enterprise’s new speed and scalability capabilities.
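
      To make that concrete, here is a hedged sketch of what XDF import plus a big-data regression might look like. The function names rxTextToXdf and rxLinMod appear in RevoScaleR’s documentation, but the file names, variable names and exact arguments below are my illustrative assumptions, not code from the press release:

        # Hedged sketch; file names, variables and exact arguments are assumed.
        library(RevoScaleR)

        # Convert a large CSV into the XDF binary format for fast block access
        rxTextToXdf(inFile = "airline.csv", outFile = "airline.xdf")

        # Fit a linear regression over the XDF file, processed in chunks
        # rather than loaded into memory all at once
        fit <- rxLinMod(ArrDelay ~ DayOfWeek, data = "airline.xdf")
        summary(fit)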

      “The R language’s inherent power and extensibility has driven its explosive adoption as the modern system for predictive analytics,” said Norman H. Nie, president and CEO of Revolution Analytics. “We believe that this new Big Data scalability will help R transition from an amazing research and prototyping tool to a production-ready platform for enterprise applications such as quantitative finance and risk management, social media, bioinformatics and telecommunications data analysis.”

      Sage Bionetworks is the nonprofit force behind the open-source collaborative effort, Sage Commons, a place where data and disease models can be shared by scientists to better understand disease biology. David Henderson, Director of Scientific Computing at Sage, commented: “At Sage Bionetworks, we need to analyze genomic databases hundreds of gigabytes in size with R. We’re looking forward to using the high-speed data-analysis features of RevoScaleR to dramatically reduce the times it takes us to process these data sets.”

      Take Hadoop and Other Big Data Sources to the Next Level

      Revolution R Enterprise fits well within the modern ‘Big Data’ architecture by leveraging popular sources such as Hadoop, NoSQL or key value databases, relational databases and data warehouses. These products can be used to store, regularize and do basic manipulation on very large datasets—while Revolution R Enterprise now provides advanced analytics at unparalleled speed and scale: producing speed on speed.

      “Together, Hadoop and R can store and analyze massive, complex data,” said Saptarshi Guha, developer of the popular RHIPE R package that integrates the Hadoop framework with R in an automatically distributed computing environment. “Employing the new capabilities of Revolution R Enterprise, we will be able to go even further and compute Big Data regressions and more.”

      Platforms and Availability

      The new RevoScaleR package will be delivered as part of Revolution R Enterprise 4.0, which will be available for 32- and 64-bit Microsoft Windows in the next 30 days. Support for Red Hat Enterprise Linux (RHEL 5) is planned for later this year.

      On its website (http://www.revolutionanalytics.com/bigdata), Revolution Analytics has published performance and scalability benchmarks for Revolution R Enterprise analyzing a 13.2 gigabyte data set of commercial airline information containing more than 123 million rows, and 29 columns.

      Additionally, the company will showcase its new Big Data solution in a free webinar on August 25 at 9:00 a.m. Pacific.

      Additional Resources

      • Big Data Benchmark whitepaper
      • The Revolution Analytics Roadmap whitepaper
      • Revolutions Blog
      • Download free academic copy of Revolution R Enterprise
      • Visit Inside-R.org for the most comprehensive set of information on R
      • Spread the word: Add a “Download R!” badge on your website
      • Follow @RevolutionR on Twitter

      About Revolution Analytics

      Revolution Analytics (http://www.revolutionanalytics.com) is the leading commercial provider of software and support for the popular open source R statistics language. Its Revolution R products help make predictive analytics accessible to every type of user and budget. The company is headquartered in Palo Alto, Calif. and backed by North Bridge Venture Partners and Intel Capital.

      Media Contact

      Chantal Yang
      Page One PR, for Revolution Analytics
      Tel: +1 415-875-7494

      Email:  revolution@pageonepr.com

      R Oracle Data Mining

      Here is a new package called RODM, an interface for doing data mining on Oracle tables from R. You can read more here http://www.oracle.com/technetwork/database/options/odm/odm-r-integration-089013.html and here http://cran.fhcrc.org/web/packages/RODM/RODM.pdf . There is also a contest for creative use of R and ODM.

      R Interface to Oracle Data Mining

      The R Interface to Oracle Data Mining (R-ODM) allows R users to access the power of Oracle Data Mining’s in-database functions using the familiar R syntax. R-ODM provides a powerful environment for prototyping data analysis and data mining methodologies.

      R-ODM is especially useful for:

      • Quick prototyping of vertical or domain-based applications where the Oracle Database supports the application
      • Scripting of “production” data mining methodologies
      • Customizing graphics of ODM data mining results (examples: classification, regression, anomaly detection)

      The R-ODM interface allows R users to mine data using Oracle Data Mining from the R programming environment. It consists of a set of function wrappers written in source R language that pass data and parameters from the R environment to the Oracle RDBMS enterprise edition as standard user PL/SQL queries via an ODBC interface. The R-ODM interface code is a thin layer of logic and SQL that calls through an ODBC interface. R-ODM does not use or expose any Oracle product code as it is completely an external interface and not part of any Oracle product. R-ODM is similar to the example scripts (e.g., the PL/SQL demo code) that illustrates the use of Oracle Data Mining, for example, how to create Data Mining models, pass arguments, retrieve results etc.

      R-ODM is packaged as a standard R source package and is distributed freely as part of the R environment’s Comprehensive R Archive Network (CRAN). For information about the R environment, R packages and CRAN, see www.r-project.org.
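
      For flavor, here is a hedged sketch of an R-ODM session. The function names follow the RODM package’s CRAN documentation, but the DSN, credentials, data set and arguments are illustrative assumptions:

        # Hedged sketch; connection details, data and arguments are assumed.
        library(RODM)

        # Open an ODBC connection to the Oracle database
        db <- RODM_open_dbms_connection(dsn = "orcl", uid = "scott", pwd = "tiger")

        # Copy the R data frame 'titanic' into an Oracle table of the same name
        RODM_create_dbms_table(db, data_table_name = "titanic")

        # Train a Support Vector Machine model inside the database
        svm_fit <- RODM_create_svm_model(database = db,
                                         data_table_name = "titanic",
                                         target_column_name = "survived")

        # Close the connection when done
        RODM_close_dbms_connection(db)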

      and

      Present and win an Apple iPod Touch!
      The BI, Warehousing and Analytics (BIWA) SIG is giving an Apple iPod Touch to the best new presenter. Be part of the TechCast series and get a chance to win!

      Consider highlighting a creative use of R and ODM.

      BIWA invites all Oracle professionals (experts, end users, managers, DBAs, developers, data analysts, ISVs, partners, etc.) to submit abstracts for 45 minute technical webcasts to our Oracle BIWA (IOUG SIG) Community in our Wednesday TechCast series. Note that the contest is limited to new presenters to encourage fresh participation by the BIWA community.

      Also see an interview with the Oracle Data Mining head, Charlie Berger: https://decisionstats.wordpress.com/2009/09/02/oracle/

      Special Issue of JSS on R GUIs

      An announcement by the Journal of Statistical Software: a call for papers on R GUIs. The initial deadline is December 2010, with final versions published during 2011.

      Announce

      Special issue of the Journal of Statistical Software on

      Graphical User Interfaces for R

      Editors: Pedro Valero-Mora and Ruben Ledesma

      Since the original paper by Gentleman and Ihaka was published, R has managed to win over an ever-increasing percentage of academic and professional statisticians, but the spread of its use among novice and occasional users of statistics has not progressed at the same pace. Among the reasons for this relative lack of impact, the lack of a GUI or point-and-click interface is one of the causes most widely mentioned. However, in the last few years this situation has been quietly changing, and a number of projects have equipped R with different GUIs, ranging from the very simple to the more advanced, providing the casual user with what could still be a new source of trouble: choosing which GUI is right for them. We may have moved from the “too few” situation to the “too many” situation.

      This special issue of the JSS intends, as one of its main goals, to offer a general overview of the different GUIs currently available for R. Thus, we think that somebody trying to find their way among the different alternatives may find it useful as a starting point. However, we do not want to stop at a mere listing; we want to offer a somewhat more general discussion about what could make a good GUI for R (and how to build one). Therefore, we want to see papers submitted that discuss the whole concept of a GUI for R: what elements it should include (or not), how this could be achieved, and, why not, whether it is actually needed at all. Finally, despite the great success of R, this does not mean that other systems do not hold important features that we would like to see in R. Indeed, descriptions of these nice features that we do not have in R but that exist in other systems could be another way of driving the future progress of GUIs for R.

      In summary, we envision papers for this special issue on GUIs for R in the following categories:

      – General discussions on GUIs for statistics, and for R.

      – Implementing GUI toolboxes for R so others can program GUIs with them.

      – R GUIs examples (with two subcategories, in the desktop or in the cloud).

      – Is there life beyond R? What features do other systems have that R does not, and why does R need them?

      Papers can be sent directly to Pedro Valero-Mora (valerop@uv.es) or Ruben Ledesma (rdledesma@gmail.com) and they will follow the usual JSS reviewing procedure. Initial deadline is December 2010 with final versions published along 2011.

      ====================================================
      Jan de Leeuw; Distinguished Professor and Chair, UCLA Department of Statistics;
      Director: UCLA Center for Environmental Statistics (CES);
      Editor: Journal of Multivariate Analysis, Journal of Statistical Software;

      Protected: Analyzing SAS Institute-WPS Lawsuit

      This content is password protected.
