Interview James G Kobielus IBM Big Data

Here is an interview with James G Kobielus, who is the Senior Program Director, Product Marketing, Big Data Analytics Solutions at IBM. Special thanks to Payal Patel Cudia of IBM’s communication team for helping with the logistics for this.

Ajay- What are the specific parts of the IBM platform that deal with the three layers of Big Data- variety, velocity and volume?

James- Well, first of all, let’s talk about the IBM Information Management portfolio. Our big data platform addresses the three layers of big data to varying degrees, either all together in one product, or two of the three, or just one of the three aspects. We don’t have separate products for variety, velocity and volume.

Let us define these three layers. Volume refers to the hundreds of terabytes and petabytes of stored data inside organizations today. Velocity refers to the whole continuum from batch to real-time continuous and streaming data.

Variety refers to multi-structured data, from structured to unstructured files, managed and stored in a common platform and analyzed through common tooling.

For Volume- IBM has a highly scalable Big Data platform. This includes the Netezza and InfoSphere groups of products, and Watson-like technologies that can support petabyte volumes of data for analytics. But really the support for volume ranges across IBM’s Information Management portfolio, both on the database side and the advanced analytics side.

For real-time Velocity, we have real-time data acquisition. We have a product, IBM InfoSphere Streams, part of our Big Data platform, that is specifically built for streaming real-time data acquisition and delivery through complex event processing. We also have a very rich range of offerings that help clients build a Hadoop environment that can scale.

Our Hadoop platform is the most real-time capable in the industry. We are differentiated by the sheer breadth, sophistication and functional depth of the tooling integrated into our Hadoop platform, and by our streaming offering integrated into that platform. We also offer a greater range of modeling and analysis tools than pretty much any other offering in the Big Data space.

Attached- Jim’s slides from Hadoop World

Ajay- Any plans for Mahout for Hadoop?

Jim- I can’t speak about product plans. We have plans but I can’t tell you anything more. We do have a feature in BigInsights called SystemML, a library for machine learning.

Ajay- How integral are acquisitions for IBM in the Big Data space (Netezza, Cognos, SPSS etc.)? Is it true that everything that you have in Big Data is acquired, or is the famous IBM R&D contributing here? (See a partial list of IBM acquisitions at http://www.ibm.com/investor/strategy/acquisitions.wss )

Jim- We have developed a lot on our own. We have the deepest R&D of anybody in the industry in all things Big Data.

For example, Watson has BigInsights Hadoop at its core. Apache Hadoop is the heart and soul of Big Data (see http://www-01.ibm.com/software/data/infosphere/hadoop/ ). A great deal of what makes BigInsights so differentiated is that not everything in it was built by the Hadoop community.

Out of necessity, we have built security, modeling, monitoring, and governance capabilities into BigInsights to make it truly enterprise-ready. That is one example of where we have leveraged open source and then built our own tools and technologies and layered them on top of the open source code.

Yes, of course we have done many strategic acquisitions over the last several years related to Big Data management, and we continue to do so. This quarter we have done three acquisitions with strong relevance to Big Data. One of them is Vivisimo (http://www-03.ibm.com/press/us/en/pressrelease/37491.wss ).

Vivisimo provides federated Big Data discovery, search and profiling capabilities to help you figure out what data is out there and what the relevance of that data is to your data science project- to help you answer the question of which data you should bring into your Hadoop cluster.

We also did Varicent, which is more about performance management, and we did TeaLeaf, a customer experience solution provider- customer experience management and optimization is one of the hot killer apps for Hadoop in the cloud. We have done a great many acquisitions that have a clear relevance to Big Data.

Netezza already had a massively parallel analytics database product with an embedded library of models called Netezza Analytics, and in-database capabilities to massively parallelize MapReduce and other analytics management functions inside the database. In many ways, Netezza provided capabilities similar to those IBM had provided for many years under the Smart Analytics Platform (http://www-01.ibm.com/software/data/infosphere/what-is-advanced-analytics/ ).

There is a difference between Netezza and ISAS (the IBM Smart Analytics System).

ISAS was built predominantly in-house over several years. If you go back a decade, IBM acquired Ascential Software, a product portfolio that became the heart and soul of IBM InfoSphere Information Manager, which is core to our Big Data platform. In addition to Netezza, IBM bought SPSS two years back. We already had data mining tools and predictive modeling in the InfoSphere portfolio, but we realized we needed the best of breed; SPSS provided that, and so IBM acquired them.

Cognos- We had some BI reporting capabilities in the InfoSphere portfolio that we had built ourselves and also acquired to varying degrees from prior acquisitions. But clearly Cognos was one of the best BI vendors, and we were lacking such a rich tool set for visualization and cubing in our product, so for that reason we acquired Cognos.

There is also Unica, a marketing campaign optimization solution, which in many ways is a killer app for Hadoop. Projects like that are driving many enterprises.

Ajay- How would you rank order these acquisitions in terms of strategic importance, rather than date of acquisition or price paid?

Jim- Think of Big Data as an ecosystem that has components fitted to particular functions for data analytics and data management. Is the database the core, or the modeling tool, or the governance tools, or is the hardware platform the core? Everything is critically important. We would love to hear from you what you think has been most important. Each acquisition has played a critical role in building the deepest and broadest solution offering in Big Data. We offer the hardware, software, professional services, and the hosting service. I don’t think there is any validity to a rank order system.

Ajay- What are the initiatives regarding open source that the Big Data group has done or is planning?

Jim- What we are doing now- we are very much involved with the Apache Hadoop community. We continue to evolve the open source code that everyone leverages. We have built BigInsights on Apache Hadoop. Of all commercial distributions, our BigInsights 1.4 is the closest and most up to date, in terms of version numbers, to Apache Hadoop (HBase, HDFS, Pig etc.).

We have an R library integrated with BigInsights. We have an R library integrated with Netezza Analytics. There is support for R models within the SPSS portfolio. We already have a fair amount of support for R across the portfolio.

Ajay- What are some of the concerns (privacy, security, regulation) that you think can dampen the promise of Big Data?

Jim- There are no showstoppers; there is really strong momentum. Some of the concerns within the Hadoop space are the immaturity of the technology, the immaturity of some of the commercial offerings out there that implement Hadoop, and the lack of standardization, in a formal sense, for Hadoop.

There is no open standards body that declares and ratifies the latest versions of Mahout, MapReduce, HDFS etc. There is no industry consensus reference framework for layering these different sub-projects. There are no open APIs. There are no certifications or interoperability standards, or organizations to certify different vendors’ interoperability around a common API or framework.

The lack of standardization is troubling in this whole market. That creates risks for users, because users are adopting multiple Hadoop products. There are lots of Hadoop deployments in the corporate world built around Apache Hadoop (purely open source). There may be no assurance that these multiple platforms will interoperate seamlessly. That’s a huge issue in terms of magnifying the risk. And it increases the need for the end user to develop their own custom integration code if they want to move data between platforms, or move MapReduce jobs between multiple distributions.

Governance is also a consideration. Right now Hadoop is used for high-volume ETL on multi-structured and unstructured data sources, or as an exploratory sandbox for data scientists. These important applications make up the majority of Hadoop deployments. Some Hadoop deployments are standalone unstructured data marts for specific applications like sentiment analysis.

Hadoop is not yet ready for data warehousing. We don’t see a lot of Hadoop being used as an alternative to data warehouses for managing the single version of the truth of system-of-record data. That day will come, but there needs to be a broader range of mature data governance mechanisms, master data management and data profiling products out there in the marketplace that enterprises can use to make sure the data inside their Hadoop clusters is clean and is the single version of the truth. That day has not arrived yet.

One of the great things about IBM’s acquisition of Vivisimo is that a piece of that overall governance picture is discovery and profiling for unstructured data, and that is something Vivisimo has done very well for several years.

What we will see is that vendors such as IBM will continue to evolve security features inside our Hadoop platform. We will beef up our data governance capabilities for this new world of Hadoop as the core of Big Data, and we will continue to build up our ability to integrate multiple databases in our Hadoop platform, so that customers can use some data from Hadoop, some data from a traditional relational data warehouse, and maybe some NoSQL technology, for different roles within a very complex Big Data environment.

That latter hybrid deployment model is becoming standard across many enterprises for Big Data. A cause for concern is that when your Big Data deployment has a bit of Hadoop, a bit of NoSQL, a bit of EDW and a bit of in-memory, there are no open standards or frameworks for putting it all together, not just for interoperability but also for deployment.

There needs to be a virtualization or abstraction layer for unified access to all these different Big Data platforms- for the users and developers writing the queries, and for administrators so they can manage data, resources and jobs across all these disparate platforms in a seamless, unified way with visual tooling. That grand scenario, the virtualization layer, is not there yet in any standard way across the big data market. It will evolve; it may take 5-10 years, but it will evolve.

So, that’s the concern that can dampen some of the enthusiasm for Big Data Analytics.

About-

You can read more about Jim at http://www.linkedin.com/pub/james-kobielus/6/ab2/8b0 or

follow him on Twitter at http://twitter.com/jameskobielus

You can read more about IBM Big Data at http://www-01.ibm.com/software/data/bigdata/

Dryad- Microsoft's answer to MapReduce

While reading across the internet I came across Microsoft’s answer to MapReduce, called Dryad- which has been around for some time, but has not generated quite the buzz that Hadoop or MapReduce are generating.

http://research.microsoft.com/en-us/projects/dryadlinq/

DryadLINQ

DryadLINQ is a simple, powerful, and elegant programming environment for writing large-scale data parallel applications running on large PC clusters.

Overview

New! An academic release of Dryad/DryadLINQ is now available for public download.

The goal of DryadLINQ is to make distributed computing on large compute clusters simple enough for all programmers. DryadLINQ combines two important pieces of Microsoft technology: the Dryad distributed execution engine and .NET Language Integrated Query (LINQ).

Dryad provides reliable, distributed computing on thousands of servers for large-scale data parallel applications. LINQ enables developers to write and debug their applications in a SQL-like query language, relying on the entire .NET library and using Visual Studio.

DryadLINQ translates LINQ programs into distributed Dryad computations:

  • C# and LINQ data objects become distributed partitioned files.
  • LINQ queries become distributed Dryad jobs.
  • C# methods become code running on the vertices of a Dryad job.

DryadLINQ has the following features:

  • Declarative programming: computations are expressed in a high-level language similar to SQL
  • Automatic parallelization: from sequential declarative code the DryadLINQ compiler generates highly parallel query plans spanning large computer clusters. For exploiting multi-core parallelism on each machine DryadLINQ relies on the PLINQ parallelization framework.
  • Integration with Visual Studio: programmers in DryadLINQ take advantage of the comprehensive VS set of tools: Intellisense, code refactoring, integrated debugging, build, source code management.
  • Integration with .Net: all .Net libraries, including Visual Basic, and dynamic languages are available.
  • Conciseness: the following line of code is a complete implementation of the Map-Reduce computation framework in DryadLINQ:
      public static IQueryable<R> MapReduce<S,M,K,R>(this IQueryable<S> source,
          Expression<Func<S,IEnumerable<M>>> mapper,
          Expression<Func<M,K>> keySelector,
          Expression<Func<K,IEnumerable<M>,R>> reducer)
      {
          return source.SelectMany(mapper).GroupBy(keySelector, reducer);
      }
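
    For concreteness, here is a small self-contained sketch of my own (not part of the Microsoft page): it re-declares the operator above inside a static class so it compiles as an extension method, and runs a word count against plain LINQ-to-Objects. Under DryadLINQ the same query expression would instead be compiled into a distributed Dryad job over partitioned files.

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Linq.Expressions;

    // The MapReduce operator quoted above, wrapped in a static class so it can
    // be called as an extension method. Here it runs against an in-memory
    // IQueryable; DryadLINQ would execute the same expression on a cluster.
    static class MapReduceOperator
    {
        public static IQueryable<R> MapReduce<S, M, K, R>(
            this IQueryable<S> source,
            Expression<Func<S, IEnumerable<M>>> mapper,
            Expression<Func<M, K>> keySelector,
            Expression<Func<K, IEnumerable<M>, R>> reducer)
        {
            return source.SelectMany(mapper).GroupBy(keySelector, reducer);
        }

        static void Main()
        {
            // Toy input; with DryadLINQ this would be a partitioned file on the cluster.
            IQueryable<string> lines = new[]
            {
                "the quick brown fox",
                "the lazy dog saw the fox"
            }.AsQueryable();

            var counts = lines.MapReduce(
                line => line.Split(new[] { ' ' }),                              // map: line -> words
                word => word,                                                   // key: the word itself
                (word, group) => new { Word = word, Count = group.Count() });   // reduce: count per word

            foreach (var c in counts)
                Console.WriteLine($"{c.Word}: {c.Count}");   // the: 3, quick: 1, ...
        }
    }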

    and, from http://research.microsoft.com/en-us/projects/dryad/ -

    Dryad

    The Dryad Project is investigating programming models for writing parallel and distributed programs to scale from a small cluster to a large data-center.

    Overview

    New! An academic release of DryadLINQ is now available for public download.

    Dryad is an infrastructure which allows a programmer to use the resources of a computer cluster or a data center for running data-parallel programs. A Dryad programmer can use thousands of machines, each of them with multiple processors or cores, without knowing anything about concurrent programming.

    The Structure of Dryad Jobs

    A Dryad programmer writes several sequential programs and connects them using one-way channels. The computation is structured as a directed graph: programs are graph vertices, while the channels are graph edges. A Dryad job is a graph generator which can synthesize any directed acyclic graph. These graphs can even change during execution, in response to important events in the computation.

    Dryad is quite expressive. It completely subsumes other computation frameworks, such as Google’s map-reduce, or the relational algebra. Moreover, Dryad handles job creation and management, resource management, job monitoring and visualization, fault tolerance, re-execution, scheduling, and accounting.

    The Dryad Software Stack

    As a proof of Dryad’s versatility, a rich software ecosystem has been built on top of Dryad:

    • SSIS on Dryad executes many instances of SQL server, each in a separate Dryad vertex, taking advantage of Dryad’s fault tolerance and scheduling. This system is currently deployed in a live production system as part of one of Microsoft’s AdCenter log processing pipelines.
    • DryadLINQ generates Dryad computations from the LINQ Language-Integrated Query extensions to C#.
    • The distributed shell is a generalization of the pipe concept from the Unix shell. Where Unix pipes allow the construction of one-dimensional (1-D) process structures, the distributed shell allows the programmer to build 2-D structures in a scripting language. It generalizes Unix pipes in three ways:
      1. It allows processes to easily connect multiple file descriptors of each process — hence the 2-D aspect.
      2. It allows the construction of pipes spanning multiple machines, across a cluster.
      3. It virtualizes the pipelines, allowing the execution of pipelines with many more processes than available machines, by time-multiplexing processors and buffering results.
    • Several languages are compiled to distributed shell processes. PSQL is an early version, recently replaced with Scope.

    Publications

    Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks
    Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly
    European Conference on Computer Systems (EuroSys), Lisbon, Portugal, March 21-23, 2007

    Video of a presentation on Dryad at the Google Campus, given by Michael Isard, Nov 1, 2007.

    Also interesting to read-

    Why does Dryad use a DAG?

    The basic computational model we decided to adopt for Dryad is the directed-acyclic graph (DAG). Each node in the graph is a computation, and each edge in the graph is a stream of data traveling in the direction of the edge. The amount of data on any given edge is assumed to be finite, the computations are assumed to be deterministic, and the inputs are assumed to be immutable. This isn’t by any means a new way of structuring a distributed computation (for example Condor had DAGMan long before Dryad came along), but it seemed like a sweet spot in the design space given our other constraints.

    So, why is this a sweet spot? A DAG is very convenient because it induces an ordering on the nodes in the graph. That makes it easy to design scheduling policies, since you can define a node to be ready when its inputs are available, and at any time you can choose to schedule as many ready nodes as you like in whatever order you like, and as long as you always have at least one scheduled you will continue to make progress and never deadlock. It also makes fault-tolerance easy, since given our determinism and immutability assumptions you can backtrack as far as you want in the DAG and re-execute as many nodes as you like to regenerate intermediate data that has been lost or is unavailable due to cluster failures.
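
    To make the scheduling argument concrete, here is a minimal single-process sketch of my own (not Dryad code) of the ready-node policy described above: count each vertex's unsatisfied inputs, treat a vertex as ready when that count reaches zero, and dispatch ready vertices in any order. Because the graph is acyclic, this always makes progress and never deadlocks.

    using System;
    using System.Collections.Generic;
    using System.Linq;

    // Toy DAG scheduler: a vertex becomes "ready" once every input edge has
    // been produced; ready vertices may be executed in any order.
    class DagScheduler
    {
        private readonly Dictionary<string, List<string>> downstream = new Dictionary<string, List<string>>();
        private readonly Dictionary<string, int> pendingInputs = new Dictionary<string, int>();

        public void AddVertex(string name)
        {
            if (!downstream.ContainsKey(name)) downstream[name] = new List<string>();
            if (!pendingInputs.ContainsKey(name)) pendingInputs[name] = 0;
        }

        public void AddEdge(string from, string to)       // data flows from -> to
        {
            AddVertex(from); AddVertex(to);
            downstream[from].Add(to);
            pendingInputs[to]++;
        }

        public void Run(Action<string> execute)
        {
            // Source vertices (no pending inputs) are ready immediately.
            var ready = new Queue<string>(pendingInputs.Where(p => p.Value == 0).Select(p => p.Key));
            while (ready.Count > 0)
            {
                string vertex = ready.Dequeue();
                execute(vertex);                          // run (or re-run, after a failure) the vertex
                foreach (string next in downstream[vertex])
                    if (--pendingInputs[next] == 0)       // all of next's inputs are now available
                        ready.Enqueue(next);
            }
        }

        static void Main()
        {
            var dag = new DagScheduler();
            dag.AddEdge("read", "map1");
            dag.AddEdge("read", "map2");
            dag.AddEdge("map1", "reduce");
            dag.AddEdge("map2", "reduce");
            dag.Run(v => Console.WriteLine("executing " + v));  // read, map1, map2, reduce
        }
    }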

    from

    http://blogs.msdn.com/b/dryad/archive/2010/07/23/why-does-dryad-use-a-dag.aspx

      Towards better analytical software

      Here are some thoughts on using existing statistical software for better analytics and/or business intelligence (reporting)-

      1) User Interface Design Matters- Most stats software has a legacy approach to user interface design, while graphical user interfaces need to be more business friendly and user friendly. For example, you can call a button T Test, or you can call it Compare > Means of Samples (with a highlight called T Test). You can call a button Chi Square Test, or call it Compare > Counts Data. Also, excessive reliance on drop-down menus ignores next-generation advances in operating systems- namely the touchscreen instead of mouse click and point.

      Given that base statistical procedures are the same across software packages, a more thoughtfully designed (or revamped) user interface can give a package an edge over legacy designs.

      2) Branding of Software Matters- One notable whine against SAS Institute products is their premium price. But really that software is actually inexpensive if you compare it with other reporting software. What separates a Cognos from a Crystal Reports from a SAS BI is often branding (and user interface design). This is where branding efforts play a role- social media is often the least expensive branding and marketing channel. The same goes for WPS and Revolution Analytics.

      3) Alliances matter- The alliances of parent companies are reflected in the sales of bundled software. For a complete solution, you need a database plus reporting plus analytical software. If you are not making all three of the above, you need to partner and cross-sell. Technically this means that your software (whether DB, reporting or analytics) needs to talk to as many other kinds of software and formats as possible. This is why ODBC in R is important, and alliances for small companies like Revolution Analytics, WPS and Netezza are just as important as for bigger companies like IBM SPSS, SAS Institute or SAP. Tie-ins with Hadoop (like R and the Netezza appliance) or Teradata and SAS also help create better usage.

      4) Cloud Computing Interfaces could be the edge- Maybe cloud computing is all hot air, but prudent business planning demands that any software maker in analytics or business intelligence have an extremely easy-to-load interface (whether a dedicated on-demand website or an Amazon EC2 image). Easier interfaces win, and with the cloud still in its early stages they can help create an early lead. For R software makers this is critical, since R handles larger datasets poorly on the PC compared to its counterparts; on the cloud that disadvantage vanishes. An easy-to-understand cloud interface framework is here (it is 2 years old but should still be okay): http://knol.google.com/k/data-mining-through-cloud-computing#

      5) Platforms matter- Software should either natively embrace all possible platforms or bundle in middleware itself.

      Here is a case study: SAS stopped supporting the Apple OS after Base SAS 7. Today the Apple OS is strong (3.47 million Macs sold in the most recent quarter), and the only way to use SAS on a Mac is to either follow

      http://goo.gl/QAs2

      or install Ubuntu on the Mac ( https://help.ubuntu.com/community/MacBook ) and then follow

      http://ubuntuforums.org/showthread.php?t=1494027

      Why does this matter? Well, SAS is free to academics and students from this year, but the Mac is a preferred computer there. WPS, by contrast, can be run straight away on the Mac (though they have curiously not been able to provide academic or discounted student copies 😉 ) as per

      http://goo.gl/aVKu

      Does this give a disadvantage based on platform? Yes. However, JMP continues to be supported on the Mac. This is also noteworthy given the upcoming Chromium OS by Google and the Windows Azure platform for cloud computing.

      Advanced Analytics on Multi-Terabyte Datasets- Conferences

      Some news on Data Mining 2009 by Aster Data –

      SAS and Aster Data to Present “Advanced Analytics on Multi-Terabyte Datasets” at M2009 in Las Vegas – Oct. 26-27
      Learn how the tight coupling of SQL and MapReduce provided by Aster Data creates new ‘big data’ analytics opportunities when combined with SAS. Aster Data will exhibit throughout the event.

      And also a nice  webcast by Curt Monash on the same Big Data topic-

      Mastering MapReduce Webinar Series, Session 1
      “Big Data Reality: The Role of MapReduce in Big Data Management and Analysis”- Oct. 15
      Industry analyst Curt Monash explains the basics of MapReduce, key use cases, and which industries and applications are heavily using MapReduce. Topics include recommendations for integrating MapReduce into an enterprise business intelligence and data warehousing environment.

      Also,

      Here is a brief synopsis of the Aster Data ( http://www.facebook.com/pages/Aster-Data-Systems/5601042375 ) sponsored Big Data Summit ( http://www.facebook.com/pages/Big-Data-Summit/143312171156 ), which I attended-

      • A Plan for Large Scale Data Analytics: How to Utilize Aster nCluster and Hadoop in a Symbiotic
        Relationship to Support Processing in Excess of 100 Billion Rows Per Month
        – Michael Brown and Will Duckworth
        (EVP, Software Engineering, comScore, Inc. and Director, Software Engineering, comScore, Inc.)

      This talk covered the special needs of comScore in handling big data, and why MapReduce and Hadoop seem to be the cost-effective solutions for big big data while the RDBMS seems stuck in the middle of middle data. It was broadly informative on the statistical challenges of the future, given the explosion of data, as well.

      • Making Sense of Hadoop – Its Fit With Data Warehouses – Colin White
        (President and Founder of BI Research)

      Colin brought a nice perspective on open source Hadoop vis-a-vis the proprietary packages and the traditional DBMS. His take on the solution: no software is perfect for all needs, every software package that sells has its own good points, and the converging solution could be a heterogeneous mix of the above.

      • MapReduce Inside a Database System – When and How Case Studies from ShareThis, Specific Media, and Other – Tasso Argyros (Chief Technology Officer and Co-Founder of Aster Data)

      This was a more detailed look at the big product launch (the Hadoop Connector) by Tasso, and an interesting look at time series analysis using nPath rather than SQL. Interesting, given the ongoing convergence of analytics and business intelligence.

      Also Tasso lived up to his presenting charm with an excellent pitch on nPath (as his interview said ).

      • Large-Scale Analytics at LinkedIn – Jonathan Goldman
        (Former Principal Scientist at LinkedIn)

      This was nice given Jonathan’s perspective (he has a PhD in physics, and now does consulting for LinkedIn while maintaining his interest in education)- the special needs of social media websites, designing experiments on the fly with huge real-time datasets, as well as some interesting visualizations (like the fact that India and America have the second biggest cross-country LinkedIn connections after USA-UK). Apparently LinkedIn ( http://www.facebook.com/group.php?gid=2211231478 ) does not sound so good when translated into Chinese (at dinner I learnt from a fellow Chinese student that China censors Facebook- sigh!).

      • Networking Mixer: Beer, wine, hot hors d’oeuvres

      I got interviewed (AFTER I had mixed some beer and wine for myself). The video interview was the first video interview I have given (you know, I have taken SOME interviews by email and plan to do some more while in Vegas for Data Mining 2009 with SAS http://www.facebook.com/group.php?gid=2227381262).

      They are still editing that interview 😉

      —That was all – you need to send me a Facebook invite to see the rest of the NY trip or better still just join the Facebook page of Decision Stats at

      http://www.facebook.com/pages/DecisionStats/191421035186

      After two weeks I hope to have some more coverage on Data Mining 2009, while at the same time enjoying my much needed Fall Break- life at the University of Tennessee is looking up (since we beat Georgia 45-19 🙂 )


      The Big Data Event- Why am I here?

      I am here braving New York’s cold weather as I prepare for this evening’s events. If you follow this blog closely (including the poems), it is a welcome change- New York is a nice city, people are friendly if you ask them nicely, the bus is a great way to watch the city, and best of all I like the crowds, which I have grown used to while living in India.

      Why Am I here?

      Because the topics discussed here are cutting edge, to the point that I cannot find anyone willing to teach me Hadoop and MapReduce while in university and, at the same time, teach me statistics on them as well (as in: how do we do a K-means clustering on a 1 terabyte dataset?).
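
      Incidentally, the reason K-means fits the MapReduce mould is that each iteration splits into a map step (assign every point to its nearest centroid) and a reduce step (average the points assigned to each centroid). Below is a toy single-machine sketch of one such iteration, purely my own illustration and nothing to do with the event:

      using System;
      using System.Linq;

      static class KMeansSketch
      {
          // One K-means iteration over 1-D points, written as map -> group -> reduce.
          // On Hadoop the same two steps run over partitions of a huge dataset.
          static double[] Iterate(double[] points, double[] centroids)
          {
              return points
                  .Select(p => (Key: NearestIndex(p, centroids), Point: p)) // map: emit (centroidIndex, point)
                  .GroupBy(kv => kv.Key, kv => kv.Point)                    // shuffle: group by centroid index
                  .OrderBy(g => g.Key)
                  .Select(g => g.Average())                                 // reduce: new centroid = mean of its points
                  .ToArray();
          }

          static int NearestIndex(double p, double[] centroids) =>
              Array.IndexOf(centroids, centroids.OrderBy(c => Math.Abs(c - p)).First());

          static void Main()
          {
              var points = new[] { 1.0, 1.2, 0.8, 9.7, 10.1, 10.4 };
              var centroids = new[] { 0.0, 5.0 };
              for (int i = 0; i < 5; i++)                   // a few iterations converge on this toy data
                  centroids = Iterate(points, centroids);
              Console.WriteLine(string.Join(", ", centroids));  // ~1.0, ~10.07
          }
      }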

      I asked the organizers what makes the event special (every event promises special Mojo, after all).

      This is what they said-

      What is the unique value proposition of the event that will help developers and both current and potential customers-

      The essence of the event is to explore new innovations in massively-parallel processing data warehousing technology and how it can help companies gain more insight from their data.  Applications include fraud detection, behavioral targeting, social network analysis, better predictions/forecasting, bioinformatics, etc.  We are exploring how MapReduce and Hadoop can be integrated into the enterprise IT system to help evolve data warehousing/BI/data mining

      and to put it even more nicely-

      “The industry’s first big data event, Big Data Summit ‘09, being held this evening in New York City, will showcase Hadoop’s fit with MPP data warehouses. Aster Data will be presenting alongside Colin White, President and Founder of BI Research, Mike Brown of comScore Inc., and Jonathan Goldman, who represents LinkedIn.”

      That’s good enough for me to drop into the Roosevelt Hotel on East 45th Street at around 6 pm for some reluctant networking (read: beers). Five years ago, while working for GE, I used to run queries using SAS on a 147 million row database (the size of the DB) and wait 3 hours for them to come back. Today that much data fits very snugly on my laptop. How soon will we have terabyte-level personal computing and petabyte-level business computing, and what challenges will they pose to standard statistical assumptions and the synching of hardware and software? Big Big Data is an interesting area to watch.

      Interview Shawn Kung Sr Director Aster Data

      Here is an interview with Shawn Kung, Senior Director of Product Management at Aster Data. Shawn explains the differences between the various database technologies, attributes Aster’s rising appeal to its unique technological approach, and touches upon various other topics of interest to people in the BI and technology space.


      Ajay- Describe your career journey from a high school student of science till today. Do you think science is a more lucrative career?

      Shawn: My career journey has spanned over a decade in several Silicon Valley technology companies.  In both high school and my college studies at Princeton, I had a fervent interest in math and quantitative economics.  Silicon Valley drew me to companies like upstart procurement software maker Ariba and database giant Oracle.  I continued my studies by returning to get a Master’s in Management Science at Stanford before going on to lead core storage systems for nearly 5 years at NetApp and subsequently Aster.

      Science (whether it is math, physics, economics, or the hard engineering sciences) provides a solid foundation.  It teaches you to think and test your assumptions- those are valuable skills that can lead to both a financially lucrative and personally inspiring career.

      Ajay- How would you describe the difference between MapReduce and Hadoop, Oracle and SAS, DBMS and Teradata, and Aster Data products to a class of undergraduate engineers?

      Shawn: Let’s start with the database guys – Oracle and Teradata.  They focus on structured data – data that has a logical schema and is manipulated via a standards-based structured query language (SQL).  Oracle tries to be everything to everyone – it does OLTP (low-latency transactions like credit card or stock trade execution apps) and some data warehousing (typically summary reporting).  Oracle’s data warehouse is not known for large-scale data warehousing and is more often used for back-office reporting.

      Teradata is focused on data warehousing and scales very well, but is extremely expensive – it runs on high-end custom hardware and takes a mainframe approach to data processing.  This approach makes less sense as commodity hardware becomes more compute-rich and better software comes along to support large-scale MPP data warehousing.

      SAS is very different- it’s not a relational database. It really offers an application platform for data analysis, specifically data mining.  Unlike Oracle and Teradata, which are used by SQL developers and managed by DBAs, SAS is typically run in business units by data analysts- for example a quantitative marketing analyst, a statistician/mathematician, or a savvy engineer with a data mining/math background.  SAS is used to try to find patterns, understand behaviors, and offer predictive analytics that enable businesses to identify trends and make smarter decisions than their competitors.

      Hadoop offers an open-source framework for large-scale data processing.  MapReduce is a component of Hadoop, which also contains multiple other modules including a distributed filesystem (HDFS).  MapReduce offers a programming paradigm for distributed computing (a parallel data flow processing framework).

      Both Hadoop and MapReduce are catered toward the application developer or programmer; they are not catered to enterprise data centers or IT.  If you have a finite project in a line of business and want to get it done, Hadoop offers a low-cost way to do this.  For example, if you want to do large-scale data munging like aggregations, transformations and manipulations of unstructured data, Hadoop offers a solution for this without compromising the performance of your main data warehouse.  Once the data munging is finished, the post-processed data set can be loaded into a database for interactive analysis or analytics. It is a great combination of big data technologies for certain use cases.

      Aster takes a very unique approach.  Our Aster nCluster software offers the best of all worlds – we offer the potential for deep analytics of SAS, the low-cost scalability and parallel processing of Hadoop/MapReduce, and the structured data advantages (schema, SQL, ACID compliance and transactional integrity, indexes, etc) of a relational database like Teradata and Oracle.  Often, we find complementary approaches and therefore view SAS and Hadoop/MapReduce as synergistic to a complete solution.  Data warehouses like Teradata and Oracle tend to be more competitive.

      Ajay- What exciting products have you launched so far, and what makes them unique both from a technical developer perspective and a business owner perspective?

      Shawn: Aster was the first to market to offer In-Database MapReduce, which provides the standards and familiarity of SQL and databases with the analytic power of MapReduce.  This is very unique, as it allows technical developers and application programmers to write embedded procedural algorithms once, upload them, and let business analysts or IT folks (SQL developers, DBAs, etc.) invoke these SQL-MapReduce functions forever.

      It is highly polymorphic (re-usable), highly fault-tolerant, highly flexible (any language – Java, Python, Ruby, Perl, R statistical language, C# in the .NET world, etc) and natively massively parallel – all of which differentiate these SQL extensions from traditional dumb user-defined functions (UDFs).

      Ajay- “I am happy with my databases and I don’t need too much diversity or experimentation in my systems”, says a CEO to you.

      How do you convince him using quantitative numbers and not marketing adjectives?

      Shawn: Aster has dozens of production customers including big-names like MySpace, LinkedIn, Akamai, Full Tilt Poker, comScore, and several yet-to-be-named retail and financial service accounts.  We have quantified proof points that show orders of magnitude improvements in scalability, performance, and analytic insights compared to incumbent or competitor solutions.  Our highly referenceable customers would be happy to discuss their positive experiences with the CEO.

      But taking a step back, there’s a fundamental concept that this CEO needs to understand first.  The world is changing- data growth is proliferating due to the digitization of so many applications and the emergence of unstructured data and new data types.  As the book “Competing on Analytics” argues, the world is shifting to a paradigm where companies that don’t take risks and push the limits on analytics will die like the dinosaurs.

      IDC is projecting 10x+ growth in data over the next few years, to zettabytes of aggregate data, driven by digitization (the Internet, digital television, RFID, etc.).  The data is there, and in order to compete effectively and understand your customers more intimately, you need a large-scale analytics solution like the one Aster nCluster offers.  If you hold off on experimentation and innovation, it will be too late by the time you realize you have a problem at hand.

      Ajay- How important is work life balance for you?

      Shawn: Very important.  I hang out with my wife most weekends – we do a lot of outdoors activities like hiking and gardening.  In Silicon Valley, it’s all too easy to get caught up in the rush of things.  Taking breaks, especially during the weekend, is important to recharge and re-energize to be as productive as possible.

      Ajay- Are you looking for college interns and new hires? What makes Aster exciting for you, so that you are pumped up every day to go to work?

      Shawn: We’re always looking for smart, innovative, and entrepreneurial new college grads and interns, especially on the technical side.  So if you are a computer science major or recent grad or graduate student, feel free to contact us for opportunities.

      What makes Aster exciting is two things-

      first, the people.  Everyone is very smart and innovative so you learn a tremendous amount, which is personally gratifying and professionally useful long-term.

      Second, Aster is changing the world!

      Distributed systems computing focused on big data processing and analytics- these are massive game-changers that will fundamentally change the landscape in data warehousing and analytics.  Traditional databases have been an oligopoly for over a generation- they haven’t been challenged, and so 1970s-based technology has stuck around.  The emergence of big data and low-cost commodity hardware has created a unique opportunity to carve out a brand new market…

      What gets me pumped every day is that I have the ability to contribute to a pioneer that is quickly becoming Silicon Valley’s next great success story!

      Biography-

      Over the past decade, Shawn has led product management for some of Silicon Valley’s most successful and innovative technology companies.  Most recently, he spent nearly 5 years at Network Appliance leading Core Systems storage product management, where he oversaw the development of high availability software and Storage Systems hardware products that grew in annual revenue from $200M to nearly $800M.  Prior to NetApp, Shawn held senior product management and corporate strategy roles at Oracle Corporation and Ariba Inc.

      Shawn holds an M.S. in Management Science and Engineering from Stanford University, where he was awarded the Valentine Fellowship (endowed by Don Valentine of Sequoia Capital).  He also received a B.A. with high honors from Princeton University.

      About Aster

      Aster Data Systems is a proven leader in high-performance database systems for data warehousing and analytics- the first DBMS to tightly integrate SQL with MapReduce- providing deep insights on data analyzed on clusters of low-cost commodity hardware. The Aster nCluster database cost-effectively powers frontline analytic applications for companies such as MySpace, aCerno (an Akamai company), and ShareThis.

      Running on low-cost off-the-shelf hardware, and providing ‘hands-free’ administration, Aster enables enterprises to meet their data warehousing needs within their budget. Aster is headquartered in San Carlos, California and is backed by Sequoia Capital, JAFCO Ventures, IVP, Cambrian Ventures, and First-Round Capital, as well as industry visionaries including David Cheriton and Ron Conway.