Jill Dyche on 2012

In part 3 of the series of predictions for 2012, here is Jill Dyche of Baseline Consulting/DataFlux.

Part 2 was Timo Elliott, SAP at http://www.decisionstats.com/timo-elliott-on-2012/ and Part 1 was Jim Kobielus, Forrester at http://www.decisionstats.com/jim-kobielus-on-2012/

Ajay: What are the top trends you saw happening in 2011?


Well, I hate to say I saw them coming, but I did. A lot of managers committed some pretty predictable mistakes in 2011. Here are a few we witnessed live and up close:


1. In the spirit of “size matters,” data warehouse teams continued to trumpet the volumes of data stored on their enterprise data warehouses. But a peek under the covers of these warehouses reveals that the data isn’t integrated; essentially, it’s a variety of heterogeneous virtual data marts co-located on a single server. Neat. Big. Maybe even worthy of a magazine article about how many petabytes you’ve got. But it’s not efficient, and hardly the example of data standardization and re-use that everyone expects from analytical platforms these days.


2. Development teams still didn’t factor data integration and provisioning into their project plans in 2011. So we saw multiple projects spawn duplicate efforts around data profiling, cleansing, and standardization, not to mention conflicting policies and business rules for the same information. Bummer, since IT managers should know better by now. The problem is that no one owns the problem. Which brings me to the next mistake…


3. No one’s accountable for data governance. Yeah, there’s a council. And they meet. And they talk. Sometimes there’s lunch. And then nothing happens, because no one’s really rewarded—or penalized, for that matter—on data quality improvements or new policies. And so the reports spewing from the data mart are still fraught with errors, and no one trusts the resulting decisions.


But all is not lost since we’re seeing some encouraging signs already in 2012. And yes, I’d classify some of them as bona-fide trends.


Ajay: What are some of those trends?


Job descriptions for data stewards, data architects, Chief Data Officers, and other information-enabling roles are becoming crisper, and the KPIs for these roles are becoming more specific. Data management organizations are being divorced from specific lines of business and from IT, becoming specialty organizations—okay, COEs if you must—in their own right. The value proposition for master data management now includes not just the reconciliation of heterogeneous data elements but the support of key business strategies. And C-level executives are holding the data people accountable for improving speed to market and driving down costs—not just delivering cleaner data. In short, data is becoming a business enabler. Which, I have to say editorially, is better late than never!


Ajay: Anything surprise you, Jill?


I have to say that Obama mentioning data management in his State of the Union speech was an unexpected but pretty powerful endorsement of the importance of information in both the private and public sector.


I’m also sort of surprised that data governance isn’t being driven more frequently by the need for internal and external privacy policies. Our clients are constantly asking us how to tightly couple privacy policies with their applications and data sources. The need to protect PCI data and other highly sensitive data elements has made executives twitchy. But they’re still not linking that need to data governance.


I should also mention that I’ve been impressed with the people who call me who’ve had their “aha!” moment and realize that data transcends analytic systems. It’s operational, it’s pervasive, and it’s dynamic. I figured this epiphany would happen in a few years once data quality tools became a commodity (they’re far from it). But it’s happening now. And that’s good for all types of businesses.



Jill Dyché has written three books and numerous articles on the business value of information technology. She advises clients and executive teams on leveraging technology and information to enable strategic business initiatives. Last year her company Baseline Consulting was acquired by DataFlux Corporation, where she is currently Vice President of Thought Leadership. Find her blog posts on www.dataroundtable.com.

Jump to JMP- the best statistical GUI software as per Google Search

This book just won an international award

…producing graphs alongside results. In most cases, each page or two-page spread completes a JMP task, which maximizes the book’s utility as a reference.


SAS to R Challenge: Unique benchmarking

[Image: Flag of the Town of Cary, via Wikipedia]

An interesting announcement from Revolution Analytics promises to convert your legacy SAS-language code not just cheaper but faster. It’s a very, very interesting challenge, and I wonder how SAS users, corporates, and customers, as well as the Institute itself, will react.


Take the SAS to R Challenge

Are you paying for expensive software licenses and hardware to run time-consuming statistical analyses on big data sets?

If you’re doing linear regressions, logistic regressions, predictions, or multivariate crosstabulations*, there’s something you should know: Revolution Analytics can get the same results for a substantially lower cost, and faster, than SAS®.

For a limited time only, Revolution Analytics invites you to take the SAS to R Challenge. Let us prove that we can deliver on our promise of replicating your results in R, faster and cheaper than SAS.

Take the challenge

Here’s how it works:

Fill out the short form below, and one of our conversion experts will contact you to discuss the SAS code you want to convert. If we think Revolution R Enterprise can get the same results faster than SAS, we’ll convert your code to R free of charge. Our goal is to demonstrate that Revolution R Enterprise will produce the same results in less time. There’s no obligation, but if you choose to convert, we guarantee that your license cost for Revolution R Enterprise will be less than half what you’re currently paying for the equivalent SAS software.**

It’s that simple.

We’ll show you that you don’t need expensive hardware and software to do high quality statistical analysis of big data. And we’ll show that you don’t need to tie up your computing resources with long running operations. With Revolution R Enterprise, you can run analyses on commodity hardware using Linux or Windows, scale to terabyte-class data problems and do it at processing speeds you would never have thought possible.

Sign up now, and we will be in touch shortly.

Take the challenge



SAS is a registered trademark of the SAS Institute, Cary, NC, in the US and other countries.

*Additional statistical algorithms are being rapidly added to Revolution R Enterprise. Custom development services are also available.

**Revolution Analytics retains the right to determine eligibility for this offer. Offer available until March 31, 2011.
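To give readers a flavor of the R side of such a conversion: the analyses the challenge names are one-liners in open-source R. A minimal sketch with simulated data (the dataset, variable names, and SAS-procedure analogies below are my own illustration, not Revolution's):

```r
# Simulated dataset standing in for a SAS dataset
set.seed(42)
df <- data.frame(x1 = rnorm(1000), x2 = rnorm(1000))
df$y  <- 2 * df$x1 - df$x2 + rnorm(1000)   # continuous outcome
df$yb <- rbinom(1000, 1, plogis(df$x1))    # binary outcome

# Linear regression (roughly PROC REG territory)
fit_lm <- lm(y ~ x1 + x2, data = df)
summary(fit_lm)

# Logistic regression (roughly PROC LOGISTIC territory)
fit_glm <- glm(yb ~ x1 + x2, data = df, family = binomial)
summary(fit_glm)

# Predictions on new data (what SAS typically does with a SCORE step)
newdata <- data.frame(x1 = c(0, 1), x2 = c(0, 1))
predict(fit_lm, newdata)

# Crosstabulation (roughly PROC FREQ territory)
table(cut(df$x1, 3), df$yb)
```

Revolution R Enterprise layers its own large-data functions on top of open-source R; the sketch above is plain base R.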

Data Visualization: Central Banks

[Image: Iron Ore Company of Canada, via Wikipedia]

Trying to compare the transparency of two very different central banks via their data visualizations.

One is the Reserve Bank of India and the other is the Federal Reserve Bank of New York.

Here are some points-

1) The Federal Reserve bank gives you a huge clutter of charts to choose from, and some of the charts are very difficult to understand.

see http://www.newyorkfed.org/research/global_economy/usecon_charts.html

and http://www.newyorkfed.org/research/directors_charts/us18chart.pdf


2) The Reserve Bank of India chose Business Objects and gives you a proper drilldown kind of graph and tables. (That's a lot of heavy metal and iron ore that China needs from India 😉 😉 )

Foreign Trade – Export (Time-line: ALL)

Month           Status   Country   Commodity   Value     Quantity (TON)
2010:07 (JUL)   P        China     IRON ORE    205.06    1878456
2010:06 (JUN)   P        China     IRON ORE    427.68    6808528
2010:05 (MAY)   P        China     IRON ORE    550.67    5290450
2010:04 (APR)   P        China     IRON ORE    922.46    9931500
2010:03 (MAR)   P        China     IRON ORE    829.75    13177672
2010:02 (FEB)   P        China     IRON ORE    706.04    10141259
2010:01 (JAN)   P        China     IRON ORE    577.13    8498784
2009:12 (DEC)   P        China     IRON ORE    545.68    9264544
2009:11 (NOV)   P        China     IRON ORE    508.17    9509213
2009:10 (OCT)   P        China     IRON ORE    422.6     7691652
2009:09 (SEP)   P        China     IRON ORE    278.04    4577943
2009:08 (AUG)   P        China     IRON ORE    276.96    4371847
2009:07 (JUL)            China     IRON ORE    266.11    4642237
2009:06 (JUN)            China     IRON ORE    241.08    4584354

(P = provisional figures; the currency unit of the value column is not stated in the source.)
Source : DGCI & S, Ministry of Commerce & Industry, GoI
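A few lines of base R are enough to re-plot the quantity column of the table above; a minimal sketch (the date handling and labels are mine):

```r
# Quantity (TON) column from the DGCI&S table above, Jun 2009 through Jul 2010
tons  <- c(4584354, 4642237, 4371847, 4577943, 7691652, 9509213, 9264544,
           8498784, 10141259, 13177672, 9931500, 5290450, 6808528, 1878456)
month <- seq(as.Date("2009-06-01"), by = "month", length.out = length(tons))

# Simple time-series view of India's iron ore exports to China
plot(month, tons / 1e6, type = "b",
     xlab = "", ylab = "Million tons",
     main = "India iron ore exports to China (DGCI&S)")
```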


You can see the screenshots of the various visualization tools of the New York Federal Reserve Bank and the Reserve Bank of India. If the US Fed is serious about cutting the debt, maybe it should start publishing better visuals.

Sector/Sphere – Faster than Hadoop/MapReduce at Terasort

Here is a preview of a relatively young software stack, Sector and Sphere, which is claimed to beat Hadoop/MapReduce on the TeraSort benchmark, among others.


System Overview

The Sector/Sphere stack consists of the Sector distributed file system and the Sphere parallel data processing framework. The objective is to support highly effective and efficient large data storage and processing over commodity computer clusters.

Sector/Sphere Architecture

Sector consists of four parts: a security server, master servers, slave nodes, and the client. The security server maintains system security configurations such as user accounts, data I/O permissions, and IP access control lists. The master servers maintain file system metadata, schedule jobs, and respond to users’ requests; Sector supports multiple active masters, which can join and leave at run time while all actively serving requests. The slave nodes are racks of computers that store and process data, and they can sit within a single data center or span multiple data centers connected by high-speed networks. Finally, the client includes tools and programming APIs to access and process Sector data.

Sphere: Parallel Data Processing Framework

Sphere allows developers to write parallel data processing applications with a very simple set of APIs. It applies user-defined functions (UDFs) to all input data segments in parallel. In a Sphere application, both inputs and outputs are Sector files. Multiple Sphere processing stages can be combined to support more complicated applications, with inputs/outputs exchanged/shared via the Sector file system.

Data segments are processed at their storage locations whenever possible (data locality). Failed data segments may be restarted on other nodes to achieve fault tolerance.

The Sphere framework can be compared to MapReduce, as both enforce data locality and provide simplified programming interfaces. In fact, Sphere can simulate any MapReduce operation, but Sphere is more efficient and flexible. Sphere can provide better data locality for applications that process files (or multiple files) as their minimum input units, and for applications involving iterative/combinative processing, which requires coordinating multiple UDFs to obtain the final result.

A Sphere application includes two parts: the client program that organizes inputs (including certain parameters), outputs, and UDFs; and the UDFs that process data segments. Data segmentation, load balancing, and fault tolerance are transparent to developers.
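Sphere's actual API is C++ and is not reproduced here; purely to illustrate the "one UDF applied to every data segment in parallel" model described above, here is a conceptual sketch using R's parallel package (the directory name and UDF are hypothetical):

```r
library(parallel)

# Hypothetical input: one file per data segment, as in a Sector directory
segments <- list.files("input_segments", full.names = TRUE)

# A user-defined function applied independently to each segment;
# Sphere-style, both input and output are files
udf <- function(seg) {
  recs <- readLines(seg)
  out  <- toupper(recs)               # stand-in for real per-segment processing
  writeLines(out, paste0(seg, ".out"))
  length(out)                         # per-segment record count
}

# Run the UDF on all segments in parallel (mclapply forks worker processes;
# on Windows use parLapply instead). Sphere goes further: it runs the UDF on
# the node where each segment is stored (data locality) and restarts failed
# segments elsewhere (fault tolerance).
counts <- mclapply(segments, udf, mc.cores = 4)
```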

Space: Column-based Distributed Data Table

Space stores data tables in Sector and uses Sphere for parallel query processing. Space is similar to BigTable: tables are stored by column and segmented onto multiple slave nodes. Tables are independent, and relationships between tables are not supported. A reduced set of SQL operations is supported, including but not limited to table creation and modification, key-value update and lookup, and select operations based on UDFs.

Supported by the Sector data placement mechanism and the Sphere parallel processing framework, Space can support efficient key-value lookup and certain SQL queries on very large data tables.
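The excerpt does not show Space's own interfaces, but the column-oriented lookup idea is easy to picture. A toy sketch in R, entirely my own illustration and not Space's API:

```r
# A table stored column-wise: each column is an independent vector,
# which a system like Space would segment across slave nodes
users <- list(
  id    = c(101L, 102L, 103L),
  name  = c("ann", "bob", "cal"),
  score = c(0.9, 0.4, 0.7)
)

# Key-value lookup: scan only the key column, then fetch the matching
# position from whichever columns the query actually needs
lookup <- function(tbl, key_col, key, cols) {
  i <- match(key, tbl[[key_col]])
  lapply(tbl[cols], `[`, i)
}

lookup(users, "id", 102L, c("name", "score"))   # $name "bob", $score 0.4
```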

Space is currently still in development.

And just when you thought Hadoop was the only way to be on the cloud.


The Terasort Benchmark

The table below lists the performance (total processing time in seconds) of the Terasort benchmark of both Sphere and Hadoop. (Terasort benchmark: suppose there are N nodes in the system, the benchmark generates a 10GB file on each node and sorts the total N*10GB data. Data generation time is excluded.) Note that it is normal to see a longer processing time for more nodes because the total amount of data also increases proportionally.

The performance values listed on this page were achieved using the Open Cloud Testbed. Currently the testbed consists of 4 racks. Each rack has 32 nodes, including 1 NFS server, 1 head node, and 30 compute/slave nodes. The head node is a Dell 1950 with dual dual-core Xeon 3.0GHz CPUs and 16GB RAM. The compute nodes are Dell 1435s with a single dual-core AMD Opteron 2.0GHz, 4GB RAM, and a single 1TB disk. The 4 racks are located at JHU (Baltimore), StarLight (Chicago), UIC (Chicago), and Calit2 (San Diego). The inter-rack bandwidth is 10GE, supported by CiscoWave deployed over National Lambda Rail.

Racks                             Sphere    Hadoop (3 replicas)    Hadoop (1 replica)
UIC                               1265      2889                   2252
UIC + StarLight                   1361      2896                   2617
UIC + StarLight + Calit2          1430      4341                   3069
UIC + StarLight + Calit2 + JHU    1526      6675                   3702

The benchmark uses the testfs/testdc examples of Sphere and randomwriter/sort examples of Hadoop. Hadoop parameters were tuned to reach good results.

Updated on Sep. 22, 2009: We have benchmarked the most recent versions of Sector/Sphere (1.24a) and Hadoop (0.20.1) on a new set of servers. Each server node costs $2,200 and consists of a single Intel Xeon E5410 2.4GHz CPU, 16GB RAM, 4*1TB RAID0 disk, and a 1Gb/s NIC. The 120 nodes are hosted on 4 racks within the same data center, and the inter-rack bandwidth is 20Gb/s.

The table below lists the performance of sorting 1TB of data using Sector/Sphere version 1.24a and Hadoop 0.20.1. Related Hadoop parameters have been tuned for better performance (e.g., a large block size), while Sector/Sphere does not require tuning. In addition, to achieve the highest performance, replication is disabled in both systems (note that replication does not affect the performance of Sphere but will significantly decrease the performance of Hadoop).

Number of Racks    Sphere (1.24a)    Hadoop (0.20.1)
1                  28m 25s           85m 49s
2                  15m 20s           37m 0s
3                  10m 19s           25m 14s
4                  7m 56s            17m 45s

AsterData releases nCluster 4.6

From the press release

Aster Data nCluster 4.6 includes a column data store, making it the first platform with a unified SQL-MapReduce analytic framework on a hybrid row-and-column massively parallel processing (MPP) database management system (DBMS). The unified framework, together with Aster Data’s suite of 1000+ MapReduce-ready analytic functions, delivers a substantial breakthrough: richer, high-performance analytics on large data volumes, with data stored in either row or column format.

With Aster Data nCluster 4.6, customers can choose the data format best suited to their needs and benefit from the power of Aster Data’s SQL-MapReduce analytic capabilities, achieving maximum query performance through row-only, column-only, or hybrid storage strategies. Aster Data makes selecting the appropriate storage strategy easy with the new Data Model Express tool, which determines the optimal data model based on a customer’s query workloads.

Both row and column stores in Aster Data nCluster 4.6 benefit from platform-level services, including Online Precision Scaling™ on commodity hardware, dynamic workload management, and always-on availability. All 1000+ MapReduce-ready analytic functions released previously through Aster Data Analytic Foundation (a powerful suite of pre-built MapReduce analytic software building blocks) now run on the hybrid row and column architecture. Aster Data nCluster 4.6 also includes new pre-built analytic functions, including decision trees and histograms. For custom analytic application development, the Aster Data IDE, Aster Data Developer Express, fully and seamlessly supports the hybrid row and column store in Aster Data nCluster 4.6.

More advanced analytics infrastructure.

MapReduce Book

Here is a new book on learning MapReduce, and it has a free downloadable version as well.

Data-Intensive Text Processing with MapReduce

Jimmy Lin and Chris Dyer


Our world is being revolutionized by data-driven methods: access to large amounts of data has generated new insights and opened exciting new opportunities in commerce, science, and computing applications. Processing the enormous quantities of data necessary for these advances requires large clusters, making distributed computing paradigms more crucial than ever. MapReduce is a programming model for expressing distributed computations on massive datasets and an execution framework for large-scale data processing on clusters of commodity servers. The programming model provides an easy-to-understand abstraction for designing scalable algorithms, while the execution framework transparently handles many system-level details, ranging from scheduling to synchronization to fault tolerance. This book focuses on MapReduce algorithm design, with an emphasis on text processing algorithms common in natural language processing, information retrieval, and machine learning. We introduce the notion of MapReduce design patterns, which represent general reusable solutions to commonly occurring problems across a variety of problem domains. This book not only intends to help the reader “think in MapReduce”, but also discusses limitations of the programming model as well.
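To make the programming model concrete, here is the canonical word-count example sketched in plain R, using base R's Map and Reduce in place of a real cluster (an illustration only; the book itself works at the level of algorithm design, not any particular implementation):

```r
docs <- c("the quick brown fox", "the lazy dog", "the fox")

# Map phase: each document emits (word, 1) pairs
words <- unlist(Map(function(doc) strsplit(doc, " ")[[1]], docs))

# Shuffle phase: group the emitted pairs by key (word)
grouped <- split(rep(1, length(words)), words)

# Reduce phase: sum the counts for each word
counts <- vapply(grouped, function(v) Reduce(`+`, v), numeric(1))
counts   # brown 1, dog 1, fox 2, lazy 1, quick 1, the 3
```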

You can download the book here

This book is part of the Morgan & Claypool Synthesis Lectures on Human Language Technologies. If you’re at a university, your institution may already subscribe to the series, in which case you can access the electronic version directly without cost (see this page for a list of institutional subscribers). Otherwise, to purchase:

Quite explicitly, this book focuses on MapReduce algorithm design, not Hadoop programming. Tom White’s Hadoop: The Definitive Guide is a great resource for learning Hadoop.

Want to be notified of updates? Interested in MapReduce algorithm design? Follow @lintool on Twitter here!