MapReduce Analytics Apps- AsterData's Developer Express Plugin

AsterData continues to wow with its efforts on bridging MapReduce and analytics, this time with its new Developer Express plug-in for Eclipse. As any Eclipse user knows, a good plug-in greatly improves your ability to write code (similar to creating Android apps, if you have tried that). I did my winter internship at AsterData in San Carlos last December, and it's an amazing place with giga-level bright people.

Here are some details (note: I plan to play a bit more with the plugin once my currently down Ubuntu setup is back, and will let you know).

http://marketplace.eclipse.org/content/aster-data-developer-express-plug-eclipse

Aster Data Developer Express provides an integrated set of tools for development of SQL and MapReduce analytics for Aster Data nCluster, a massively parallel database with an integrated analytics engine.

The Aster Data Developer Express plug-in for Eclipse enables developers to easily create new analytic application projects with the help of an intuitive set of wizards, immediately test their applications on their desktop, and push down their applications into the nCluster database with a single click.

Using Developer Express, analysts can significantly reduce the complexity and time needed to create advanced analytic applications so that they can more rapidly deliver deeper and richer analytic insights from their data.

and from the Press Release

Now, any developer or analyst that is familiar with the Java programming language can complete a rich analytic application in under an hour using the simple yet powerful Aster Data Developer Express environment in Eclipse. Aster Data Developer Express delivers both rapid development and local testing of advanced analytic applications for any project, regardless of size.

The free, downloadable Aster Data Developer Express IDE now brings the power of SQL-MapReduce to any organization that is looking to build richer analytic applications that can leverage massive data volumes. Much of the MapReduce coding, including programming concepts like parallelization and distributed data analysis, is addressed by the IDE without the developer or analyst needing to have expertise in these areas. This simplification makes it much easier for developers to be successful quickly and eliminates the need for them to have any deep knowledge of the MapReduce parallel processing framework. Google first published MapReduce in 2004 for parallel processing of big data sets. Aster Data has coupled SQL with MapReduce and brought SQL-MapReduce to market, making it significantly easier for any organization to leverage the power of MapReduce. The Aster Developer Express IDE simplifies application development even further with an intuitive point-and-click development environment that speeds development of rich analytic applications. Applications can be validated locally on the desktop or ultimately within Aster Data nCluster, a massive parallel processing (MPP) database with a fully integrated analytics engine that is powered by MapReduce—known as a data-analytics server.

Rich analytic applications that can be easily built with Aster Data’s downloadable IDE include:

Iterative Analytics: Uncovering critical business patterns in your data requires hypothesis-driven, iterative analysis.  This class of applications is defined by the exploratory navigation of massive volumes of data in a top-down, deductive manner.  Aster Data’s IDE makes this easy to develop and to validate the algorithms and functions required to deliver these advanced analytic applications.

Prediction and Optimization: For this class of applications, the process is inductive. Rather than starting with a hypothesis, developers and analysts can easily build analytic applications that discover the trends, patterns, and outliers in data sets.  Examples include propensity to churn in telecommunications, proactive product and service recommendations in retail, and pricing and retention strategies in financial services.

Ad Hoc Analysis: Examples of ad hoc analysis that can be performed includes social network analysis, advanced click stream analysis, graph analysis, cluster analysis, and a wide variety of mathematical, trigonometry, and statistical functions.

“Aster Data’s IDE and SQL-MapReduce significantly eases development of advanced analytic applications on big data. We have now built over 350 analytic functions in SQL-MapReduce on Aster Data nCluster that are available for customers to purchase,” said Partha Sen, CEO and Founder of Fuzzy Logix. “Aster Data’s implementation of MapReduce with SQL-MapReduce goes beyond the capabilities of general analytic development APIs and provides us with the excellent control and flexibility needed to implement even the most complex analytic algorithms.”

Richer analytics on big data volumes is the new competitive frontier. Organizations have always generated reports to guide their decision-making. Although reports are important, they are historical sets of information generally arranged around predefined metrics and generated on a periodic basis.

Advanced analytics begins where reporting leaves off. Reporting often answers historical questions such as “what happened?” However, analytics addresses “why it happened” and, increasingly, “what will happen next?” To that end, solutions like Aster Data Developer Express ease the development of powerful ad hoc, predictive analytics and enables analysts to quickly and deeply explore terabytes to petabytes of data.

“We are in the midst of a new age in analytics. Organizations today can harness the power of big data regardless of scale or complexity”, said Don Watters, Chief Data Architect for MySpace. “Solutions like the Aster Data Developer Express visual development environment make it even easier by enabling us to automate aspects of development that currently take days, allowing us to build rich analytic applications significantly faster. Making Developer Express openly available for download opens the power of MapReduce to a broader audience, making big data analytics much faster and easier than ever before.”

“Our delivery of SQL coupled with MapReduce has clearly made it easier for customers to build highly advanced analytic applications that leverage the power of MapReduce. The visual IDE, Aster Data Developer Express, introduced earlier this year, made application development even easier and the great response we have had to it has driven us to make this open and freely available to any organization looking to build rich analytic applications,” said Tasso Argyros, Founder and CTO, Aster Data. “We are excited about today’s announcement as it allows companies of all sizes who need richer analytics to easily build powerful analytic applications and experience the power of MapReduce without having to learn any new skills.”

You can have a look here at http://www.asterdata.com/download_developer_express/
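
For readers who have not met MapReduce before, here is a minimal single-machine sketch of the pattern in plain Python. It is not Aster Data's SQL-MapReduce API, nor anything the plug-in generates; the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the sample click records are made up by me for illustration.

```python
from collections import defaultdict

# Hypothetical click-stream records as (user, page). Made-up sample data.
CLICKS = [
    ("alice", "/home"),
    ("bob", "/home"),
    ("alice", "/pricing"),
    ("carol", "/home"),
    ("bob", "/pricing"),
    ("alice", "/home"),
]


def map_phase(record):
    """Emit (key, value) pairs for one input record: one hit per page."""
    _user, page = record
    yield page, 1


def shuffle(pairs):
    """Group all emitted values by key, as a framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse all values for one key into a single result."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle and reduce in sequence on one machine."""
    pairs = (pair for record in records for pair in map_phase(record))
    groups = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in groups.items())


if __name__ == "__main__":
    print(run_job(CLICKS))
```

On a real cluster the map and reduce calls run in parallel on different nodes and the shuffle moves data between them; that parallelization is the part the press release says the IDE takes care of.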

Indian Offshoring IPOs' dismal performance

Using Yahoo Finance, I plotted the past three years of stock prices for Indian offshorers (Genpact, WNS, EXL) in comparison with Indian software companies (Infosys, Wipro, TCS, Sify) and the market index.

The following insights emerge-

1) Indian software companies have consistently created wealth.

2) Indian offshoring companies have consistently lost market value, perhaps because they were able to sell their IPO shares at much higher prices by creating hype.

3) You are much better off investing in the Indian stock market or a blue chip Indian software company than taking part in an Indian offshorer's IPO.

4) SIFY lost the most value, and its founder CEO is now in jail for fraud. The fraud was that he added phantom employees and phantom revenue to boost the balance sheet. The auditors from PwC (who were jailed) included a board member of Indian Chartered Accountants, and Satyam (SIFY) had won awards for corporate governance. It makes sense to do rigorous cash flow due diligence on this side of the pond.

5) I own no stock in any of these companies (not surprisingly) but do have a portfolio of mutual funds (index).

So the next time you are promised the moon by an Indian IPO- KPO, remember to do the math 😉
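
Doing the math here just means putting every price series on the same footing before comparing. Below is a minimal sketch of how that could be done in Python; the helper names (`rebase`, `total_return`, `annualised_return`) are mine, and the two price series are made-up placeholders, not actual quotes for any of the companies above.

```python
def rebase(prices, base=100.0):
    """Rescale a price series so that it starts at `base` (e.g. 100)."""
    first = prices[0]
    return [base * price / first for price in prices]


def total_return(prices):
    """Fractional gain or loss from the first price to the last."""
    return prices[-1] / prices[0] - 1.0


def annualised_return(prices, years):
    """Compound annual growth rate over `years` years."""
    return (prices[-1] / prices[0]) ** (1.0 / years) - 1.0


# Placeholder year-end prices over three years. NOT real market data.
SERIES = {
    "EXAMPLE_OFFSHORER_IPO": [20.0, 15.0, 10.0, 12.0],
    "EXAMPLE_MARKET_INDEX": [100.0, 110.0, 90.0, 121.0],
}

if __name__ == "__main__":
    for name, prices in SERIES.items():
        print(
            name,
            [round(p, 1) for p in rebase(prices)],
            f"total {total_return(prices):+.1%}",
            f"annualised {annualised_return(prices, years=3):+.1%}",
        )
```

Rebasing to 100 is what makes a stock trading at 20 and an index at several thousand readable on one chart.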

Review: Once upon a time in Mumbaai

This is a 70s-era Bollywood movie with two fine actors pitted against each other in a classic genre: the stylish mafia drama. It has an ensemble supporting cast and pretty images of a classic, not-so-crowded Bombay (as it was then called), and it actually draws inspiration from real-life gangsters. With fine music and good action as well, this movie is worth your time.

GNU PSPP- The Open Source SPSS

If you are an SPSS user (for statistics, not data mining), you can also try out GNU PSPP, which is the open source equivalent and quite eerily impressive in performance. It is available at http://www.gnu.org/software/pspp/ or http://pspp.awardspace.com/ and you can also read more at http://en.wikipedia.org/wiki/PSPP

PSPP is a program for statistical analysis of sampled data. It is a Free replacement for the proprietary program SPSS, and appears very similar to it with a few exceptions.

The most important of these exceptions are, that there are no “time bombs”; your copy of PSPP will not “expire” or deliberately stop working in the future. Neither are there any artificial limits on the number of cases or variables which you can use. There are no additional packages to purchase in order to get “advanced” functions; all functionality that PSPP currently supports is in the core package.

PSPP can perform descriptive statistics, T-tests, linear regression and non-parametric tests. Its backend is designed to perform its analyses as fast as possible, regardless of the size of the input data. You can use PSPP with its graphical interface or the more traditional syntax commands.

A brief list of some of the features of PSPP follows:

  • Supports over 1 billion cases.
  • Supports over 1 billion variables.
  • Syntax and data files are compatible with SPSS.
  • Choice of terminal or graphical user interface.
  • Choice of text, postscript or html output formats.
  • Inter-operates with Gnumeric, OpenOffice.Org and other free software.
  • Easy data import from spreadsheets, text files and database sources.
  • Fast statistical procedures, even on very large data sets.
  • No license fees.
  • No expiration period.
  • No unethical “end user license agreements”.
  • Fully indexed user manual.
  • Free Software; licensed under GPLv3 or later.
  • Cross platform; Runs on many different computers and many different operating systems.

PSPP is particularly aimed at statisticians, social scientists and students requiring fast convenient analysis of sampled data.

and

Features

This software provides a basic set of capabilities: frequencies, cross-tabs, comparison of means (T-tests and one-way ANOVA); linear regression, reliability (Cronbach’s Alpha, not failure or Weibull), and re-ordering data, non-parametric tests, factor analysis and more.
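
(An aside from me, not from the quoted text: for anyone unsure what the reliability statistic in that list measures, here is a minimal Python sketch of Cronbach's alpha using the textbook formula, alpha = k/(k-1) × (1 − sum of item variances / variance of total scores). It is not PSPP's code; the function name `cronbach_alpha` and the sample scores are made up for illustration.)

```python
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's alpha for a questionnaire.

    `items` is a list of columns: one list of scores per item, each holding
    one score per respondent, all of the same length.
    """
    k = len(items)
    if k < 2:
        raise ValueError("need at least two items")
    n = len(items[0])
    if any(len(column) != n for column in items):
        raise ValueError("every item needs one score per respondent")

    # Sum of the sample variances of the individual items.
    item_variance_sum = sum(variance(column) for column in items)
    # Sample variance of each respondent's total score across items.
    totals = [sum(scores) for scores in zip(*items)]
    total_variance = variance(totals)

    return (k / (k - 1)) * (1 - item_variance_sum / total_variance)


if __name__ == "__main__":
    # Made-up scores: three items answered by five respondents.
    scores = [
        [3.0, 4.0, 2.0, 5.0, 4.0],
        [3.0, 5.0, 2.0, 4.0, 4.0],
        [2.0, 4.0, 3.0, 5.0, 5.0],
    ]
    print(round(cronbach_alpha(scores), 3))
```

Values close to 1 suggest the items move together; the sketch makes no attempt to handle the degenerate case where every respondent has the same total score.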

At the user’s choice, statistical output and graphics are done in ascii, pdf, postscript or html formats. A limited range of statistical graphs can be produced, such as histograms, pie-charts and np-charts.

PSPP can import Gnumeric, OpenDocument and Excel spreadsheets, Postgres databases, comma-separated values- and ASCII-files. It can export files in the SPSS ‘portable’ and ‘system’ file formats and to ASCII files. Some of the libraries used by PSPP can be accessed programmatically; PSPP-Perl provides an interface to the libraries used by PSPP.

Origins

The PSPP project (originally called “Fiasco”) is a free, open-source alternative to the proprietary statistics package SPSS. SPSS is closed-source and includes a restrictive licence and digital rights management. The author of PSPP considered this ethically unacceptable, and decided to write a program which might with time become functionally identical to SPSS, except that there would be no licence expiry, and everyone would be permitted to copy, modify and share the program.

Release history

  • 0.7.5 June 2010 http://pspp.awardspace.com/
  • 0.6.2 October 2009
  • 0.6.1 October 2008
  • 0.6.0 June 2008
  • 0.4.0.1 August 2007
  • 0.4.0 August 2005
  • 0.3.0 April 2004
  • 0.2.4 January 2000
  • 0.1.0 August 1998

Third Party Reviews

In the book “SPSS For Dummies”, the author discusses PSPP under the heading of “Ten Useful Things You Can Find on the Internet” [1]. In 2006, the South African Statistical Association presented a conference which included an analysis of how PSPP can be used as a free replacement to SPSS [2].

Citation-

Please send FSF & GNU inquiries to gnu@gnu.org. There are also other ways to contact the FSF. Please send broken links and other corrections (or suggestions) to bug-gnu-pspp@gnu.org.

Copyright © 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007 Free Software Foundation, Inc., 51 Franklin St – Suite 330, Boston, MA 02110, USA – Verbatim copying and distribution of this entire article are permitted worldwide, without royalty, in any medium, provided this notice, and the copyright notice, are preserved.

Big Data and R: New Product Release by Revolution Analytics

A press release from the guys at Revolution Analytics, this time claiming to enable terabyte-level analytics with R. Interesting stuff, but the techie details are awaited.

Revolution Analytics Brings Big Data Analysis to R

The world’s most powerful statistics language can now tackle terabyte-class data sets using Revolution R Enterprise at a fraction of the cost of legacy analytics products

JSM 2010 – VANCOUVER (August 3, 2010) — Revolution Analytics today introduced ‘Big Data’ analysis to its Revolution R Enterprise software, taking the popular R statistics language to unprecedented new levels of capacity and performance for analyzing very large data sets. For the first time, R users will be able to process, visualize and model terabyte-class data sets in a fraction of the time of legacy products—without employing expensive or specialized hardware.

The new version of Revolution R Enterprise introduces an add-on package called RevoScaleR that provides a new framework for fast and efficient multi-core processing of large data sets. It includes:

  • The XDF file format, a new binary ‘Big Data’ file format with an interface to the R language that provides high-speed access to arbitrary rows, blocks and columns of data.
  • A collection of widely-used statistical algorithms optimized for Big Data, including high-performance implementations of Summary Statistics, Linear Regression, Binomial Logistic Regression and Crosstabs—with more to be added in the near future.
  • Data Reading & Transformation tools that allow users to interactively explore and prepare large data sets for analysis.
  • Extensibility, expert R users can develop and extend their own statistical algorithms to take advantage of Revolution R Enterprise’s new speed and scalability capabilities.
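
(An aside from me, not from the press release: since the techie details are still awaited, here is a generic sketch of how summary statistics can be computed block by block on data too large to hold in memory. This is not RevoScaleR's code or API and has nothing to do with the XDF format; `chunk_stats`, `merge` and `chunked_mean_variance` are hypothetical names, and the merge step is the standard pairwise update for combining means and sums of squared deviations.)

```python
def chunk_stats(chunk):
    """Return (count, mean, sum of squared deviations) for one block."""
    n = len(chunk)
    mean = sum(chunk) / n
    m2 = sum((x - mean) ** 2 for x in chunk)
    return n, mean, m2


def merge(a, b):
    """Combine the statistics of two blocks without revisiting their rows."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    if n_a == 0:
        return b
    if n_b == 0:
        return a
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return n, mean, m2


def chunked_mean_variance(chunks):
    """Stream over blocks of numbers; return (count, mean, sample variance)."""
    total = (0, 0.0, 0.0)
    for chunk in chunks:
        if chunk:  # skip empty blocks
            total = merge(total, chunk_stats(chunk))
    n, mean, m2 = total
    variance = m2 / (n - 1) if n > 1 else float("nan")
    return n, mean, variance


if __name__ == "__main__":
    # In practice each block would be read from disk; here they are inline.
    blocks = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0, 7.0, 8.0, 9.0, 10.0]]
    print(chunked_mean_variance(blocks))
```

Only three numbers are carried from block to block, so memory use does not grow with the data set, and because `merge` does not care about order, blocks could in principle be summarised on separate cores and combined afterwards.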

“The R language’s inherent power and extensibility has driven its explosive adoption as the modern system for predictive analytics,” said Norman H. Nie, president and CEO of Revolution Analytics. “We believe that this new Big Data scalability will help R transition from an amazing research and prototyping tool to a production-ready platform for enterprise applications such as quantitative finance and risk management, social media, bioinformatics and telecommunications data analysis.”

Sage Bionetworks is the nonprofit force behind the open-source collaborative effort, Sage Commons, a place where data and disease models can be shared by scientists to better understand disease biology. David Henderson, Director of Scientific Computing at Sage, commented: “At Sage Bionetworks, we need to analyze genomic databases hundreds of gigabytes in size with R. We’re looking forward to using the high-speed data-analysis features of RevoScaleR to dramatically reduce the times it takes us to process these data sets.”

Take Hadoop and Other Big Data Sources to the Next Level

Revolution R Enterprise fits well within the modern ‘Big Data’ architecture by leveraging popular sources such as Hadoop, NoSQL or key value databases, relational databases and data warehouses. These products can be used to store, regularize and do basic manipulation on very large datasets—while Revolution R Enterprise now provides advanced analytics at unparalleled speed and scale: producing speed on speed.

“Together, Hadoop and R can store and analyze massive, complex data,” said Saptarshi Guha, developer of the popular RHIPE R package that integrates the Hadoop framework with R in an automatically distributed computing environment. “Employing the new capabilities of Revolution R Enterprise, we will be able to go even further and compute Big Data regressions and more.”

Platforms and Availability

The new RevoScaleR package will be delivered as part of Revolution R Enterprise 4.0, which will be available for 32- and 64-bit Microsoft Windows in the next 30 days. Support for Red Hat Enterprise Linux (RHEL 5) is planned for later this year.

On its website (http://www.revolutionanalytics.com/bigdata), Revolution Analytics has published performance and scalability benchmarks for Revolution R Enterprise analyzing a 13.2 gigabyte data set of commercial airline information containing more than 123 million rows, and 29 columns.

Additionally, the company will showcase its new Big Data solution in a free webinar on August 25 at 9:00 a.m. Pacific.

Additional Resources

  • Big Data Benchmark whitepaper
  • The Revolution Analytics Roadmap whitepaper
  • Revolutions Blog
  • Download free academic copy of Revolution R Enterprise
  • Visit Inside-R.org for the most comprehensive set of information on R
  • Spread the word: Add a “Download R!” badge on your website
  • Follow @RevolutionR on Twitter

About Revolution Analytics

Revolution Analytics (http://www.revolutionanalytics.com) is the leading commercial provider of software and support for the popular open source R statistics language. Its Revolution R products help make predictive analytics accessible to every type of user and budget. The company is headquartered in Palo Alto, Calif. and backed by North Bridge Venture Partners and Intel Capital.

Media Contact

Chantal Yang
Page One PR, for Revolution Analytics
Tel: +1 415-875-7494
Email: revolution@pageonepr.com

Business Analytics Analyst Relations /Ethics/White Papers

Curt Monash, whom I respect and have tried (unsuccessfully) to interview, points out several ethical dilemmas and gray areas in analyst relations in business intelligence here at http://www.dbms2.com/2010/07/30/advice-for-some-non-clients/

If you don't know what analyst relations are, well, it's like credit rating agencies for BI software. Read Curt and his landscaping of the field here (I am quoting a summary) at http://www.strategicmessaging.com/the-ethics-of-white-papers/2010/08/01/

Vendors typically pay for white papers for one or more of these reasons:

  1. They want to connect with sales prospects.
  2. They want general endorsement from the analyst.
  3. They specifically want endorsement from the analyst for their marketing claims.
  4. They want the analyst to do a better job of explaining something than they think they could do themselves.
  5. They want to give the analyst some money to enhance the relationship.

Merv Adrian (I interviewed Merv here at http://www.dudeofdata.com/?p=2505) has responded well here at http://www.enterpriseirregulars.com/23040/white-paper-sponsorship-and-labeling/

None of the sites I checked clearly identify the work as having been sponsored in any way I found obvious in my (admittedly) quick scan. So this is an issue, but it’s not confined to Oracle.

My 2 cents (not being so well paid 😉) are-

I think Curt was calling out Oracle (which didn't respond) and not Merv (whose subsequent blog post does much to clarify).

As a comparatively new/younger blogger in this field, I applaud both Curt for trying to bell the cat (or point out what everyone in AR winks at) and Merv for standing by him.

In the long run, it would strengthen analyst relations as a channel if payment for content were kept separate from bias in that content. An example is the credit rating agencies, who forgot to do so in BFSI, and see what happened.

Customers invest millions of dollars in BI systems, trusting marketing collateral, white papers, webinars, tests and so on. Perhaps it’s time for an industry association for analysts so that individual analysts don’t knuckle under to vendor pressure.

It is easier for someone of Curt's or Merv’s stature to declare an editing policy and disclosures before they write a white paper. It is much harder for everyone else who is not so well established.

White papers can take as much as $25,000 to produce, and I know people in Business Analytics (as opposed to Business Intelligence) who slog for cents per hour cranking out books on R and SAS, webinars and trainings, yet there are almost no white papers in BA. Are there any independent analytics analysts who are not biased towards R or SAS or SPSS and so on? I am not sure, but this looks like a good line to pursue 😉, provided ethical checks and balances are established.

Personally, I know of many so-called analytics communities that go all out to please their sponsors, so bias in writing does exist (you can't praise SAS on an R blogging forum or R users' meet, and you can't write about WPS at SAS Community.org).

At the same time, someone once told me that it is tough to make a living as a writer, and that the choice between easy money and credible writing needs to be respected.

Most sponsored white papers I read are pure advertisements, directed at CEOs rather than the techie community at large.

Almost every BI vendor claims to have the fastest database with 5X speed, and benchmarking in technical terms is something they could do too.

Unlike gadget sites, which benchmark products freely, you cannot benchmark BI or even BA products, because many licensing terms forbid it.

That is probably the reason billions are spent on BI while the positive claims remain doubtful (except to the sellers). Similarly, in analytics, many vendors would have difficulty justifying their claims or prices if they were subjected to a side-by-side comparison. Unfortunately, the resulting confusion lets shoddy technology come out stronger thanks to more aggressive marketing.
