Open Source Business Intelligence: Pentaho and Jaspersoft

Here are two products that are used widely for Business Intelligence_ They are open source and both have free preview.

Jaspersoft-For the Enterprise version click on the screenshot while for the free community version you can go to

http://jasperforge.org/projects/jasperserver

Interestingly (and not surprisingly) Revolution Analytics is teaming up with Jaspersoft to use R for reporting along with the Jaspersoft BI stack.

ADVANCED ANALYTICS ON DEMAND IN APPLICATIONS, IN DASHBOARDS, AND ON THE WEB

FREE WEBINAR WEDNESDAY, SEPTEMBER 22ND @9AM PACIFIC

DEPLOYING R: ADVANCED ANALYTICS ON DEMAND IN APPLICATIONS, IN DASHBOARDS, AND ON THE WEB

A JOINT WEBINAR FROM REVOLUTION ANALYTICS AND JASPERSOFT

Date: Wednesday, September 22, 2010
Time: 9:00am PDT (12:00pm EDT; 4:00pm GMT)
Presenters: David Smith, Vice President of Marketing, Revolution Analytics
Andrew Lampitt, Senior Director of Technology Alliances, Jaspersoft
Matthew Dahlman, Business Development Engineer, Jaspersoft
Registration: Click here to register now!

R is a popular and powerful system for creating custom data analysis, statistical models, and data visualizations. But how can you make the results of these R-based computations easily accessible to others? A PhD statistician could use R directly to run the forecasting model on the latest sales data, and email a report on request, but then the process is just going to have to be repeated again next month, even if the model hasn’t changed. Wouldn’t it be better to empower the Sales manager to run the model on demand from within the BI application she already uses—daily, even!—and free up the statistician to build newer, better models for others?

In this webinar, David Smith (VP of Marketing, Revolution Analytics) will introduce the new “RevoDeployR” Web Services framework for Revolution R Enterprise, which is designed to make it easy to integrate dynamic R-based computations into applications for business users. RevoDeployR empowers data analysts working in R to publish R scripts to a server-based installation of Revolution R Enterprise. Application developers can then use the RevoDeployR Web Services API to securely and scalably integrate the results of these scripts into any application, without needing to learn the R language. With RevoDeployR, authorized users of hosted or cloud-based interactive Web applications, desktop applications such as Microsoft Excel, and BI applications like Jaspersoft can all benefit from on-demand analytics and visualizations developed by expert R users.

To demonstrate the power of deploying R-based computations to business users, Andrew Lampitt will introduce Jaspersoft commercial open source business intelligence, the world’s most widely used BI software. In a live demonstration, Matt Dahlman will show how to supercharge the BI process by combining Jaspersoft and Revolution R Enterprise, giving business users on-demand access to advanced forecasts and visualizations developed by expert analysts.

Click here to register for the webinar.

Speaker Biographies:

David Smith is the Vice President of Marketing at Revolution Analytics, the leading commercial provider of software and support for the open source “R” statistical computing language. David is the co-author (with Bill Venables) of the official R manual An Introduction to R. He is also the editor of Revolutions (http://blog.revolutionanalytics.com), the leading blog focused on “R” language, and one of the originating developers of ESS: Emacs Speaks Statistics. You can follow David on Twitter as @revodavid.

Andrew Lampitt is Senior Director of Technology Alliances at Jaspersoft. Andrew is responsible for strategic initiatives and partnerships including cloud business intelligence, advanced analytics, and analytic databases. Prior to Jaspersoft, Andrew held other business positions with Sunopsis (Oracle), Business Objects (SAP), and Sybase (SAP). Andrew earned a BS in engineering from the University of Illinois at Urbana Champaign.

Matthew Dahlman is Jaspersoft’s Business Development Engineer, responsible for technical aspects of technology alliances and regional business development. Matt has held a wide range of technical positions including quality assurance, pre-sales, and technical evangelism with enterprise software companies including Sybase, Netonomy (Comverse), and Sunopsis (Oracle). Matt earned a BA in mathematics from Carleton College in Northfield, Minnesota.


The second widely used BI stack in open source is Pentaho.

You can download it here to evaluate it or click on screenshot to read more at

http://community.pentaho.com/

http://sourceforge.net/projects/pentaho/files/Business%20Intelligence%20Server/

Dryad- Microsoft's answer to MR

While reading across the internet I came across Microsoft’s version to MapReduce called Dryad- which has been around for some time, but has not generated quite the buzz that Hadoop or MapReduce are doing.

http://research.microsoft.com/en-us/projects/dryadlinq/

DryadLINQ

DryadLINQ is a simple, powerful, and elegant programming environment for writing large-scale data parallel applications running on large PC clusters.

Overview

New! An academic release of Dryad/DryadLINQ is now available for public download.

The goal of DryadLINQ is to make distributed computing on large compute cluster simple enough for every programmers. DryadLINQ combines two important pieces of Microsoft technology: the Dryad distributed execution engine and the .NET Language Integrated Query (LINQ).

Dryad provides reliable, distributed computing on thousands of servers for large-scale data parallel applications. LINQ enables developers to write and debug their applications in a SQL-like query language, relying on the entire .NET library and using Visual Studio.

DryadLINQ translates LINQ programs into distributed Dryad computations:

  • C# and LINQ data objects become distributed partitioned files.
  • LINQ queries become distributed Dryad jobs.
  • C# methods become code running on the vertices of a Dryad job.

DryadLINQ has the following features:

  • Declarative programming: computations are expressed in a high-level language similar to SQL
  • Automatic parallelization: from sequential declarative code the DryadLINQ compiler generates highly parallel query plans spanning large computer clusters. For exploiting multi-core parallelism on each machine DryadLINQ relies on the PLINQ parallelization framework.
  • Integration with Visual Studio: programmers in DryadLINQ take advantage of the comprehensive VS set of tools: Intellisense, code refactoring, integrated debugging, build, source code management.
  • Integration with .Net: all .Net libraries, including Visual Basic, and dynamic languages are available.
  • and
  • Conciseness: the following line of code is a complete implementation of the Map-Reduce computation framework in DryadLINQ:
    • public static IQueryable<R>
      MapReduce<S,M,K,R>(this IQueryable<S> source,
      Expression<Func<S,IEnumerable<M>>> mapper,
      Expression<Func<M,K>> keySelector,
      Expression<Func<K,IEnumerable<M>,R>> reducer)
      {
      return source.SelectMany(mapper).GroupBy(keySelector, reducer);
      }

    and http://research.microsoft.com/en-us/projects/dryad/

    Dryad

    The Dryad Project is investigating programming models for writing parallel and distributed programs to scale from a small cluster to a large data-center.

    Overview

    New! An academic release of DryadLINQ is now available for public download.

    Dryad is an infrastructure which allows a programmer to use the resources of a computer cluster or a data center for running data-parallel programs. A Dryad programmer can use thousands of machines, each of them with multiple processors or cores, without knowing anything about concurrent programming.

    The Structure of Dryad Jobs

    A Dryad programmer writes several sequential programs and connects them using one-way channels. The computation is structured as a directed graph: programs are graph vertices, while the channels are graph edges. A Dryad job is a graph generator which can synthesize any directed acyclic graph. These graphs can even change during execution, in response to important events in the computation.

    Dryad is quite expressive. It completely subsumes other computation frameworks, such as Google’s map-reduce, or the relational algebra. Moreover, Dryad handles job creation and management, resource management, job monitoring and visualization, fault tolerance, re-execution, scheduling, and accounting.

    The Dryad Software Stack

    As a proof of Dryad’s versatility, a rich software ecosystem has been built on top Dryad:

    • SSIS on Dryad executes many instances of SQL server, each in a separate Dryad vertex, taking advantage of Dryad’s fault tolerance and scheduling. This system is currently deployed in a live production system as part of one of Microsoft’s AdCenter log processing pipelines.
    • DryadLINQ generates Dryad computations from the LINQ Language-Integrated Query extensions to C#.
    • The distributed shell is a generalization of the pipe concept from the Unix shell in three ways. If Unix pipes allow the construction of one-dimensional (1-D) process structures, the distributed shell allows the programmer to build 2-D structures in a scripting language. The distributed shell generalizes Unix pipes in three ways:
      1. It allows processes to easily connect multiple file descriptors of each process — hence the 2-D aspect.
      2. It allows the construction of pipes spanning multiple machines, across a cluster.
      3. It virtualizes the pipelines, allowing the execution of pipelines with many more processes than available machines, by time-multiplexing processors and buffering results.
    • Several languages are compiled to distributed shell processes. PSQL is an early version, recently replaced with Scope.

    Publications

    Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks
    Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly
    European Conference on Computer Systems (EuroSys), Lisbon, Portugal, March 21-23, 2007

    Video of a presentation on Dryad at the Google Campus, given by Michael Isard, Nov 1, 2007.

    Also interesting to read-

    Why does Dryad use a DAG?

    he basic computational model we decided to adopt for Dryad is the directed-acyclic graph (DAG). Each node in the graph is a computation, and each edge in the graph is a stream of data traveling in the direction of the edge. The amount of data on any given edge is assumed to be finite, the computations are assumed to be deterministic, and the inputs are assumed to be immutable. This isn’t by any means a new way of structuring a distributed computation (for example Condor had DAGMan long before Dryad came along), but it seemed like a sweet spot in the design space given our other constraints.

    So, why is this a sweet spot? A DAG is very convenient because it induces an ordering on the nodes in the graph. That makes it easy to design scheduling policies, since you can define a node to be ready when its inputs are available, and at any time you can choose to schedule as many ready nodes as you like in whatever order you like, and as long as you always have at least one scheduled you will continue to make progress and never deadlock. It also makes fault-tolerance easy, since given our determinism and immutability assumptions you can backtrack as far as you want in the DAG and re-execute as many nodes as you like to regenerate intermediate data that has been lost or is unavailable due to cluster failures.

    from

    http://blogs.msdn.com/b/dryad/archive/2010/07/23/why-does-dryad-use-a-dag.aspx

      Mapreduce Book

      Here is a new book on learning MapReduce and it has a free downloadable version as well.

      Data-Intensive Text Processing with MapReduce

      Jimmy Lin and Chris Dyer

      ABSTRACT

      Our world is being revolutionized by data-driven methods: access to large amounts of data has generated new insights and opened exciting new opportunities in commerce, science, and computing applications. Processing the enormous quantities of data necessary for these advances requires large clusters, making distributed computing paradigms more crucial than ever. MapReduce is a programming model for expressing distributed computations on massive datasets and an execution framework for large-scale data processing on clusters of commodity servers. The programming model provides an easy-to-understand abstraction for designing scalable algorithms, while the execution framework transparently handles many system-level details, ranging from scheduling to synchronization to fault tolerance. This book focuses on MapReduce algorithm design, with an emphasis on text processing algorithms common in natural language processing, information retrieval, and machine learning. We introduce the notion of MapReduce design patterns, which represent general reusable solutions to commonly occurring problems across a variety of problem domains. This book not only intends to help the reader “think in MapReduce”, but also discusses limitations of the programming model as well.

      You can download the book here

      This book is part of the Morgan & Claypool Synthesis Lectures on Human Language Technologies. If you’re at a university, your institution may already subscribe to the series, in which case you can access the electronic version directly without cost (see this page for a list of institutional subscribers). Otherwise, to purchase:

      Quite explicitly, this book focuses on MapReduce algorithm design, not Hadoop programming. Tom White’s Hadoop: The Definitive Guide is a great resource for learning Hadoop.

      Want to be notified of updates? Interested in MapReduce algorithm design? Follow @lintool on Twitter here!

      IPSUR – A Free R Textbook

      Here is a free R textbook called IPSUR-

      http://ipsur.r-forge.r-project.org/book/index.php

      IPSUR stands for Introduction to Probability and Statistics Using R, ISBN: 978-0-557-24979-4, which is a textbook written for an undergraduate course in probability and statistics. The approximate prerequisites are two or three semesters of calculus and some linear algebra in a few places. Attendees of the class include mathematics, engineering, and computer science majors.

      IPSUR is FREE, in the GNU sense of the word. Hard copies are available for purchase here from Lulu and will be available (coming soon) from the other standard online retailers worldwide. The price of the book is exactly the manufacturing cost plus the retailers’ markup. You may be able to get it even cheaper by downloading an electronic copy and printing it yourself, but if you elect this route then be sure to get the publisher-quality PDF from theDownloads page. And double check the price. It was cheaper for my students to buy a perfect-bound paperback from Lulu and have it shipped to their door than it was to upload the PDF to Fed-Ex Kinkos and Xerox a coil-bound copy (and on top of that go pick it up at the store).

      If you are going to buy from anywhere other than Lulu then be sure to check the time-stamp on the copyright page. There is a 6 to 8 week delay from Lulu to Amazon and you may not be getting the absolute latest version available.

      Refer to the Installation page for instructions to install an electronic copy of IPSUR on your personal computer. See the Feedback page for guidance about questions or comments you may have about IPSUR.

      Also see http://ipsur.r-forge.r-project.org/rcmdrplugin/index.php for the R Cmdr Plugin

      This plugin for the R Commander accompanies the text Introduction to Probability and Statistics Using R by G. Jay Kerns. The plugin contributes functions unique to the book as well as specific configuration and functionality to R Commander, the pioneering work by John Fox of McMaster University.

      RcmdrPlugin.IPSUR’s primary goal is to provide a user-friendly graphical user interface (GUI) to the open-source and freely available R statistical computing environment. RcmdrPlugin.IPSUR is equipped to handle many of the statistical analyses and graphical displays usually encountered by upper division undergraduate mathematics, statistics, and engineering majors. Available features are comparable to many expensive commercial packages such as Minitab, SPSS, and JMP-IN.

      Since the audience of RcmdrPlugin.IPSUR is slightly different than Rcmdr’s, certain functionality has been added and selected error-checks have been disabled to permit the student to explore alternative regions of the statistical landscape. The resulting benefit of increased flexibility is balanced by somewhat increased vulnerability to syntax errors and misuse; the instructor should keep this and the academic audience in mind when usingRcmdrPlugin.IPSUR in the classroom

      Big Data and R: New Product Release by Revolution Analytics

      Press Release by the Guys in Revolution Analytics- this time claiming to enable terabyte level analytics with R. Interesting stuff but techie details are awaited.

      Revolution Analytics Brings

      Big Data Analysis to R

      The world’s most powerful statistics language can now tackle terabyte-class data sets using

      Revolution R Enterpriseat a fraction of the cost of legacy analytics products


      JSM 2010 – VANCOUVER (August 3, 2010) — Revolution Analytics today introduced ‘Big Data’ analysis to its Revolution R Enterprise software, taking the popular R statistics language to unprecedented new levels of capacity and performance for analyzing very large data sets. For the first time, R users will be able to process, visualize and model terabyte-class data sets in a fraction of the time of legacy products—without employing expensive or specialized hardware.

      The new version of Revolution R Enterprise introduces an add-on package called RevoScaleR that provides a new framework for fast and efficient multi-core processing of large data sets. It includes:

      • The XDF file format, a new binary ‘Big Data’ file format with an interface to the R language that provides high-speed access to arbitrary rows, blocks and columns of data.
      • A collection of widely-used statistical algorithms optimized for Big Data, including high-performance implementations of Summary Statistics, Linear Regression, Binomial Logistic Regressionand Crosstabs—with more to be added in the near future.
      • Data Reading & Transformation tools that allow users to interactively explore and prepare large data sets for analysis.
      • Extensibility, expert R users can develop and extend their own statistical algorithms to take advantage of Revolution R Enterprise’s new speed and scalability capabilities.

      “The R language’s inherent power and extensibility has driven its explosive adoption as the modern system for predictive analytics,” said Norman H. Nie, president and CEO of Revolution Analytics. “We believe that this new Big Data scalability will help R transition from an amazing research and prototyping tool to a production-ready platform for enterprise applications such as quantitative finance and risk management, social media, bioinformatics and telecommunications data analysis.”

      Sage Bionetworks is the nonprofit force behind the open-source collaborative effort, Sage Commons, a place where data and disease models can be shared by scientists to better understand disease biology. David Henderson, Director of Scientific Computing at Sage, commented: “At Sage Bionetworks, we need to analyze genomic databases hundreds of gigabytes in size with R. We’re looking forward to using the high-speed data-analysis features of RevoScaleR to dramatically reduce the times it takes us to process these data sets.”

      Take Hadoop and Other Big Data Sources to the Next Level

      Revolution R Enterprise fits well within the modern ‘Big Data’ architecture by leveraging popular sources such as Hadoop, NoSQL or key value databases, relational databases and data warehouses. These products can be used to store, regularize and do basic manipulation on very large datasets—while Revolution R Enterprise now provides advanced analytics at unparalleled speed and scale: producing speed on speed.

      “Together, Hadoop and R can store and analyze massive, complex data,” said Saptarshi Guha, developer of the popular RHIPE R package that integrates the Hadoop framework with R in an automatically distributed computing environment. “Employing the new capabilities of Revolution R Enterprise, we will be able to go even further and compute Big Data regressions and more.”

      Platforms and Availability

      The new RevoScaleR package will be delivered as part of Revolution R Enterprise 4.0, which will be available for 32-and 64-bit Microsoft Windows in the next 30 days. Support for Red Hat Enterprise Linux (RHEL 5) is planned for later this year.

      On its website (http://www.revolutionanalytics.com/bigdata), Revolution Analytics has published performance and scalability benchmarks for Revolution R Enterprise analyzing a 13.2 gigabyte data set of commercial airline information containing more than 123 million rows, and 29 columns.

      Additionally, the company will showcase its new Big Data solution in a free webinar on August 25 at 9:00 a.m. Pacific.

      Additional Resources

      •      Big Data Benchmark whitepaper

      •      The Revolution Analytics Roadmap whitepaper

      •      Revolutions Blog

      •      Download free academic copy of Revolution R Enterprise

      •      Visit Inside-R.org for the most comprehensive set of information on R

      •      Spread the word: Add a “Download R!” badge on your website

      •      Follow @RevolutionR on Twitter

      About Revolution Analytics

      Revolution Analytics (http://www.revolutionanalytics.com) is the leading commercial provider of software and support for the popular open source R statistics language. Its Revolution R products help make predictive analytics accessible to every type of user and budget. The company is headquartered in Palo Alto, Calif. and backed by North Bridge Venture Partners and Intel Capital.

      Media Contact

      Chantal Yang
      Page One PR, for Revolution Analytics
      Tel: +1 415-875-7494

      Email:  revolution@pageonepr.com

      R Excel :Updated

      It was really nice to see the latest version of R Excel at http://rcom.univie.ac.at/ and bundled together in an aptly named package called R and Friends.

      The look and feel of the package as well as ease of installing are really professional. I also liked the commercial equivalent at http://www.statconn.com/

      However much older-guardians and  die- hards of command line,  feel that GUI is like putting lipstick on a pig, but we respectfully demur.

      What does R Excel do? Well for one it can put the R Commander Interface INSIDE your Excel Spreadsheet. That makes it easy to use and a familiar interface even if you are newbie to R- (assuming you have done some Excel)

      Download the latest version here

      RAndFriends

      This package will automatically install and configure

      • R 2.11.1
      • rscproxy 1.3-1
      • rcom 2.2-1

      It will also download and install a suitable version of the statconnDCOM server and of RExcel during installation. Therefore you will need a working Internet connection during the installation process.
      This version of RAndFriends was created 20100516.

      Download RAndFriendsSetup2111V3.1-5-1

      We also give you information how to download all sources for R and the R packages included in RAndFriends.

      Also read a paper on R and SAS interoperability (using HMisc package from Dr Harrell) at Holland Numerics

      http://www.hollandnumerics.co.uk/pdf/SAS2R2SAS_paper.pdf