SQL and Hadoop: What is this cloud thing?

Here is a very good, in fact brilliant, post from Joe Hellerstein, a Professor of Computer Science at UC Berkeley: http://radar.oreilly.com/2008/11/the-commoditization-of-massive.html

It explains the difference between the two approaches to large-scale data analysis: relational databases queried with SQL, and MapReduce as implemented in Hadoop.


Enterprise IT camp tends to favor relational databases and the SQL language, while the web upstarts have rallied around the MapReduce programming model popularized at Google, and cloned in open source as Apache Hadoop. Hadoop is in wide use at companies like Yahoo! and Facebook, and gets a lot of attention in tech blogs as the next big open source project. But if you mention Hadoop in a corporate IT shop you are often met with blank stares — SQL is ubiquitous in those environments.


Setting aside the trash talk, the usual cases made for the two technologies can be summarized as follows:

Relational Databases

  • multipurpose: useful for analysis and data update, batch and interactive tasks
  • high data integrity via ACID transactions
  • lots of compatible tools, e.g. for loading, management, reporting, data visualization and mining
  • support for SQL, the most widely-used language for data analysis
  • automatic SQL query optimization, which can radically improve performance
  • integration of SQL with familiar programming languages via connectivity protocols, mapping layers and user-defined functions
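
To make the ACID and "SQL from a familiar programming language" points concrete, here is a minimal sketch using Python's built-in sqlite3 module. The accounts table, the names in it and the transfer helper are made up for illustration and are not from Hellerstein's post. The credit is deliberately applied before the debit, so that when the debit violates the CHECK constraint the rollback has something to undo.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one atomic unit (hypothetical helper)."""
    try:
        # `with conn` commits if the block succeeds and rolls back if it raises.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected a negative balance; the credit above
        # has already been rolled back by the context manager.
        return False

# Illustrative in-memory database; the schema is an assumption for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

ok = transfer(conn, "alice", "bob", 30)       # fits within alice's balance
failed = transfer(conn, "alice", "bob", 500)  # would overdraw alice
balances = dict(conn.execute("SELECT name, balance FROM accounts ORDER BY name"))
print(ok, failed, balances)
```

After the failed transfer, bob's balance does not keep the 500 that was briefly credited to him, which is the "all or nothing" behaviour the integrity bullet refers to.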

MapReduce (Hadoop)

  • designed for large clusters: 1000+ computers
  • very high availability, keeping long jobs running efficiently even when individual computers break or slow down
  • data is accessed in "native format" from a filesystem — no need to transform data into tables at load time
  • no special query language; programmers use familiar languages like Java, Python, and Perl
  • programmers retain control over performance, rather than counting on a query optimizer
  • the open-source Hadoop implementation is funded by corporate donors, and will mature over time as Linux and Apache did
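
For corporate SQL types, the MapReduce model is perhaps easiest to see next to a query. The sketch below is plain single-process Python, not the Hadoop API: the function names (map_phase, shuffle, reduce_phase), the sample log lines and their layout are all assumptions made for illustration. It computes what a SQL user would write as SELECT path, COUNT(*) FROM hits GROUP BY path, but it reads raw text lines directly instead of loading them into a table first.

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the mapper to every raw record, yielding (key, value) pairs."""
    for record in records:
        yield from mapper(record)

def shuffle(pairs):
    """Group values by key; a real cluster does this step across machines."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    """Collapse each key's list of values into a single result."""
    return {key: reducer(key, values) for key, values in groups.items()}

# Raw web-server log lines in their "native format" (made-up sample data).
log_lines = [
    "2008-11-01 /index.html 200",
    "2008-11-01 /about.html 200",
    "2008-11-02 /index.html 404",
    "2008-11-02 /index.html 200",
]

def hits_mapper(line):
    # Parse the line at read time and emit one (path, 1) pair per request.
    _date, path, _status = line.split()
    yield path, 1

def sum_reducer(_key, values):
    # Equivalent to COUNT(*) within one GROUP BY group.
    return sum(values)

hits = reduce_phase(shuffle(map_phase(log_lines, hits_mapper)), sum_reducer)
print(hits)
```

The programmer writes the mapper and reducer by hand and decides how the data is parsed and grouped, which is the "control over performance" trade-off in the list above; a relational database would leave those decisions to its query optimizer.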

Hadoop is still relatively young, and by all reports much slower and more resource intensive than Google’s MapReduce implementation.

What I liked about the article was that it explains Hadoop in simple terms to corporate SQL types like me.

It would be interesting to see how Hadoop could be configured on the NVidia Tesla supercomputer (at 10000 USD).


– Update – Mathematica is already being adapted to run on GPUs rather than CPUs, and there was an interesting discussion about this on the R-help list today.

Mathematica is launching a version that works with Nvidia GPUs. It is claimed that this would make it roughly 10-100x faster!

Author: Ajay Ohri

