Interview Stephanie McReynolds Director Product Marketing, AsterData

Here is an interview with Stephanie McReynolds who works as as Director of Product Marketing with AsterData. I asked her a couple of questions about the new product releases from AsterData in analytics and MapReduce.

Ajay – How does the new Eclipse Plugin help people who are already working with huge datasets but are new to AsterData’s platform?

Stephanie- Aster Data Developer Express, our new SQL-MapReduce development plug-in for Eclipse, makes MapReduce applications easy to develop. With Aster Data Developer Express, developers can develop, test and deploy a complete SQL-MapReduce application in under an hour. This is a significant increase in productivity over the traditional analytic application development process for Big Data applications, which requires significant time coding applications in low-level code and testing applications on sample data.

Ajay – What are the various analytical functions that are introduced by you recently- list say the top 10.

Stephanie- At Aster Data, we have an intense focus on making the development process easier for SQL-MapReduce applications. Aster Developer Express is a part of this initiative, as is the release of pre-defined analytic functions. We recently launched both a suite of analytic modules and a partnership program dedicated to delivering pre-defined analytic functions for the Aster Data nCluster platform. Pre-defined analytic functions delivered by Aster Data’s engineering team are delivered as modules within the Aster Data Analytic Foundation offering and include analytics in the areas of pattern matching, clustering, statistics, and text analysis– just to name a few areas. Partners like Fuzzy Logix and Cobi Systems are extending this library by delivering industry-focused analytics like Monte Carlo Simulations for Financial Services and geospatial analytics for Public Sector– to give you a few examples.

Ajay – So okay I want to do a K Means Cluster on say a million rows (and say 200 columns) using the Aster method. How do I go about it using the new plug-in as well as your product.

Stephanie- The power of the Aster Data environment for analytic application development is in SQL-MapReduce. SQL is a powerful analytic query standard because it is a declarative language. MapReduce is a powerful programming framework because it can support high performance parallel processing of Big Data and extreme expressiveness, by supporting a wide variety of programming languages, including Java, C/C#/C++, .Net, Python, etc. Aster Data has taken the performance and expressiveness of MapReduce and combined it with the familiar declarativeness of SQL. This unique combination ensures that anyone who knows standard SQL can access advanced analytic functions programmed for Big Data analysis using MapReduce techniques.

kMeans is a good example of an analytic function that we pre-package for developers as part of the Aster Data Analytic Foundation. What does that mean? It means that the MapReduce portion of the development cycle has been completed for you. Each pre-packaged Aster Data function can be called using standard SQL, and executes the defined analytic in a fully parallelized manner in the Aster Data database using MapReduce techniques. The result? High performance analytics with the expressiveness of low-level languages accessed through declarative SQL.

Ajay – I see an an increasing focus on Analytics. Is this part of your product strategy and how do you see yourself competing with pure analytics vendors.

Stephanie – Aster Data is an infrastructure provider. Our core product is a massively parallel processing database called nCluster that performs at or beyond the capabilities of any other analytic database in the market today. We developed our analytics strategy as a response to demand from our customers who were looking beyond the price/performance wars being fought today and wanted support for richer analytics from their database provider. Aster Data analytics are delivered in nCluster to enable analytic applications that are not possible in more traditional database architectures.

Ajay – Name some recent case studies in Analytics of implementation of MR-SQL with Analytical functions

Stephanie – There are three new classes of applications that Aster Data Express and Aster Analytic Foundation support: iterative analytics, prediction and optimization, and ad hoc analysis.

Aster Data customers are uncovering critical business patterns in Big Data by performing hypothesis-driven, iterative analytics. They are exploring interactively massive volumes of data—terabytes to petabytes—in a top-down deductive manner. ComScore, an Aster Data customer that performs website experience analysis is a good example of an Aster Data customer performing this type of analysis.

Other Aster Data customers are building applications for prediction and optimization that discover trends, patterns, and outliers in data sets. Examples of these types of applications are propensity to churn in telecommunications, proactive product and service recommendations in retail, and pricing and retention strategies in financial services. Full Tilt Poker, who is using Aster Data for fraud prevention is a good example of a customer in this space.

The final class of application that I would like to highlight is ad hoc analysis. Examples of ad hoc analysis that can be performed includes social network analysis, advanced click stream analysis, graph analysis, cluster analysis and a wide variety of mathematical, trigonometry, and statistical functions. LinkedIn, whose analysts and data scientists have access to all of their customer data in Aster Data are a good example of a customer using the system in this manner.

While Aster Data customers are using nCluster in a number of other ways, these three new classes of applications are areas in which we are seeing particularly innovative application development.

Biography-

Stephanie McReynolds is Director of Product Marketing at Aster Data, where she is an evangelist for Aster Data’s massively parallel data-analytics server product. Stephanie has over a decade of experience in product management and marketing for business intelligence, data warehouse, and complex event processing products at companies such as Oracle, Peoplesoft, and Business Objects. She holds both a master’s and undergraduate degree from Stanford University.

Dryad- Microsoft's answer to MR

While reading across the internet I came across Microsoft’s version to MapReduce called Dryad- which has been around for some time, but has not generated quite the buzz that Hadoop or MapReduce are doing.

http://research.microsoft.com/en-us/projects/dryadlinq/

DryadLINQ

DryadLINQ is a simple, powerful, and elegant programming environment for writing large-scale data parallel applications running on large PC clusters.

Overview

New! An academic release of Dryad/DryadLINQ is now available for public download.

The goal of DryadLINQ is to make distributed computing on large compute cluster simple enough for every programmers. DryadLINQ combines two important pieces of Microsoft technology: the Dryad distributed execution engine and the .NET Language Integrated Query (LINQ).

Dryad provides reliable, distributed computing on thousands of servers for large-scale data parallel applications. LINQ enables developers to write and debug their applications in a SQL-like query language, relying on the entire .NET library and using Visual Studio.

DryadLINQ translates LINQ programs into distributed Dryad computations:

  • C# and LINQ data objects become distributed partitioned files.
  • LINQ queries become distributed Dryad jobs.
  • C# methods become code running on the vertices of a Dryad job.

DryadLINQ has the following features:

  • Declarative programming: computations are expressed in a high-level language similar to SQL
  • Automatic parallelization: from sequential declarative code the DryadLINQ compiler generates highly parallel query plans spanning large computer clusters. For exploiting multi-core parallelism on each machine DryadLINQ relies on the PLINQ parallelization framework.
  • Integration with Visual Studio: programmers in DryadLINQ take advantage of the comprehensive VS set of tools: Intellisense, code refactoring, integrated debugging, build, source code management.
  • Integration with .Net: all .Net libraries, including Visual Basic, and dynamic languages are available.
  • and
  • Conciseness: the following line of code is a complete implementation of the Map-Reduce computation framework in DryadLINQ:
    • public static IQueryable<R>
      MapReduce<S,M,K,R>(this IQueryable<S> source,
      Expression<Func<S,IEnumerable<M>>> mapper,
      Expression<Func<M,K>> keySelector,
      Expression<Func<K,IEnumerable<M>,R>> reducer)
      {
      return source.SelectMany(mapper).GroupBy(keySelector, reducer);
      }

    and http://research.microsoft.com/en-us/projects/dryad/

    Dryad

    The Dryad Project is investigating programming models for writing parallel and distributed programs to scale from a small cluster to a large data-center.

    Overview

    New! An academic release of DryadLINQ is now available for public download.

    Dryad is an infrastructure which allows a programmer to use the resources of a computer cluster or a data center for running data-parallel programs. A Dryad programmer can use thousands of machines, each of them with multiple processors or cores, without knowing anything about concurrent programming.

    The Structure of Dryad Jobs

    A Dryad programmer writes several sequential programs and connects them using one-way channels. The computation is structured as a directed graph: programs are graph vertices, while the channels are graph edges. A Dryad job is a graph generator which can synthesize any directed acyclic graph. These graphs can even change during execution, in response to important events in the computation.

    Dryad is quite expressive. It completely subsumes other computation frameworks, such as Google’s map-reduce, or the relational algebra. Moreover, Dryad handles job creation and management, resource management, job monitoring and visualization, fault tolerance, re-execution, scheduling, and accounting.

    The Dryad Software Stack

    As a proof of Dryad’s versatility, a rich software ecosystem has been built on top Dryad:

    • SSIS on Dryad executes many instances of SQL server, each in a separate Dryad vertex, taking advantage of Dryad’s fault tolerance and scheduling. This system is currently deployed in a live production system as part of one of Microsoft’s AdCenter log processing pipelines.
    • DryadLINQ generates Dryad computations from the LINQ Language-Integrated Query extensions to C#.
    • The distributed shell is a generalization of the pipe concept from the Unix shell in three ways. If Unix pipes allow the construction of one-dimensional (1-D) process structures, the distributed shell allows the programmer to build 2-D structures in a scripting language. The distributed shell generalizes Unix pipes in three ways:
      1. It allows processes to easily connect multiple file descriptors of each process — hence the 2-D aspect.
      2. It allows the construction of pipes spanning multiple machines, across a cluster.
      3. It virtualizes the pipelines, allowing the execution of pipelines with many more processes than available machines, by time-multiplexing processors and buffering results.
    • Several languages are compiled to distributed shell processes. PSQL is an early version, recently replaced with Scope.

    Publications

    Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks
    Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly
    European Conference on Computer Systems (EuroSys), Lisbon, Portugal, March 21-23, 2007

    Video of a presentation on Dryad at the Google Campus, given by Michael Isard, Nov 1, 2007.

    Also interesting to read-

    Why does Dryad use a DAG?

    he basic computational model we decided to adopt for Dryad is the directed-acyclic graph (DAG). Each node in the graph is a computation, and each edge in the graph is a stream of data traveling in the direction of the edge. The amount of data on any given edge is assumed to be finite, the computations are assumed to be deterministic, and the inputs are assumed to be immutable. This isn’t by any means a new way of structuring a distributed computation (for example Condor had DAGMan long before Dryad came along), but it seemed like a sweet spot in the design space given our other constraints.

    So, why is this a sweet spot? A DAG is very convenient because it induces an ordering on the nodes in the graph. That makes it easy to design scheduling policies, since you can define a node to be ready when its inputs are available, and at any time you can choose to schedule as many ready nodes as you like in whatever order you like, and as long as you always have at least one scheduled you will continue to make progress and never deadlock. It also makes fault-tolerance easy, since given our determinism and immutability assumptions you can backtrack as far as you want in the DAG and re-execute as many nodes as you like to regenerate intermediate data that has been lost or is unavailable due to cluster failures.

    from

    http://blogs.msdn.com/b/dryad/archive/2010/07/23/why-does-dryad-use-a-dag.aspx