#Rstats for Business Intelligence

This is a short list of several known as well as lesser known R ( #rstats) language codes, packages and tricks to build a business intelligence application. It will be slightly Messy (and not Messi) but I hope to refine it someday when the cows come home.

It assumes that BI is basically-

a Database, a Document Database, a Report creation/Dashboard pulling software as well unique R packages for business intelligence.

What is business intelligence?

Seamless dissemination of data in the organization. In short let it flow- from raw transactional data to aggregate dashboards, to control and test experiments, to new and legacy data mining models- a business intelligence enabled organization allows information to flow easily AND capture insights and feedback for further action.

BI software has lately meant to be just reporting software- and Business Analytics has meant to be primarily predictive analytics. the terms are interchangeable in my opinion -as BI reports can also be called descriptive aggregated statistics or descriptive analytics, and predictive analytics is useless and incomplete unless you measure the effect in dashboards and summary reports.

Data Mining- is a bit more than predictive analytics- it includes pattern recognizability as well as black box machine learning algorithms. To further aggravate these divides, students mostly learn data mining in computer science, predictive analytics (if at all) in business departments and statistics, and no one teaches metrics , dashboards, reporting  in mainstream academia even though a large number of graduates will end up fiddling with spreadsheets or dashboards in real careers.

Using R with

1) Databases-

I created a short list of database connectivity with R here at https://rforanalytics.wordpress.com/odbc-databases-for-r/ but R has released 3 new versions since then.

The RODBC package remains the package of choice for connecting to SQL Databases.

http://cran.r-project.org/web/packages/RODBC/RODBC.pdf

Details on creating DSN and connecting to Databases are given at  https://rforanalytics.wordpress.com/odbc-databases-for-r/

For document databases like MongoDB and CouchDB

( what is the difference between traditional RDBMS and NoSQL if you ever need to explain it in a cocktail conversation http://dba.stackexchange.com/questions/5/what-are-the-differences-between-nosql-and-a-traditional-rdbms

Basically dispensing with the relational setup, with primary and foreign keys, and with the additional overhead involved in keeping transactional safety, often gives you extreme increases in performance

NoSQL is a kind of database that doesn’t have a fixed schema like a traditional RDBMS does. With the NoSQL databases the schema is defined by the developer at run time. They don’t write normal SQL statements against the database, but instead use an API to get the data that they need.

instead relating data in one table to another you store things as key value pairs and there is no database schema, it is handled instead in code.)

I believe any corporation with data driven decision making would need to both have atleast one RDBMS and one NoSQL for unstructured data-Ajay. This is a sweeping generic statement 😉 , and is an opinion on future technologies.

  • Use RMongo

From- http://tommy.chheng.com/2010/11/03/rmongo-accessing-mongodb-in-r/

http://plindenbaum.blogspot.com/2010/09/connecting-to-mongodb-database-from-r.html

Connecting to a MongoDB database from R using Java

http://nsaunders.wordpress.com/2010/09/24/connecting-to-a-mongodb-database-from-r-using-java/

Also see a nice basic analysis using R Mongo from

http://pseudofish.com/blog/2011/05/25/analysis-of-data-with-mongodb-and-r/

For CouchDB

please see https://github.com/wactbprot/R4CouchDB and

http://digitheadslabnotebook.blogspot.com/2010/10/couchdb-and-r.html

  • First install RCurl and RJSONIO. You’ll have to download the tar.gz’s if you’re on a Mac. For the second part, we’ll need to installR4CouchDB,

2) External Report Creating Software-

Jaspersoft- It has good integration with R and is a certified Revolution Analytics partner (who seem to be the only ones with a coherent #Rstats go to market strategy- which begs the question – why is the freest and finest stats software having only ONE vendor- if it was so great lots of companies would make exclusive products for it – (and some do -see https://rforanalytics.wordpress.com/r-business-solutions/ and https://rforanalytics.wordpress.com/using-r-from-other-software/)

From

http://www.jaspersoft.com/sites/default/files/downloads/events/Analytics%20-Jaspersoft-SEP2010.pdf

we see

http://jasperforge.org/projects/rrevodeployrbyrevolutionanalytics

RevoConnectR for JasperReports Server

RevoConnectR for JasperReports Server RevoConnectR for JasperReports Server is a Java library interface between JasperReports Server and Revolution R Enterprise’s RevoDeployR, a standardized collection of web services that integrates security, APIs, scripts and libraries for R into a single server. JasperReports Server dashboards can retrieve R charts and result sets from RevoDeployR.

http://jasperforge.org/plugins/esp_frs/optional_download.php?group_id=409

 

Using R and Pentaho
Extending Pentaho with R analytics”R” is a popular open source statistical and analytical language that academics and commercial organizations alike have used for years to get maximum insight out of information using advanced analytic techniques. In this twelve-minute video, David Reinke from Pentaho Certified Partner OpenBI provides an overview of R, as well as a demonstration of integration between R and Pentaho.
and from
R and BI – Integrating R with Open Source Business
Intelligence Platforms Pentaho and Jaspersoft
David Reinke, Steve Miller
Keywords: business intelligence
Increasingly, R is becoming the tool of choice for statistical analysis, optimization, machine learning and
visualization in the business world. This trend will only escalate as more R analysts transition to business
from academia. But whereas in academia R is often the central tool for analytics, in business R must coexist
with and enhance mainstream business intelligence (BI) technologies. A modern BI portfolio already includes
relational databeses, data integration (extract, transform, load – ETL), query and reporting, online analytical
processing (OLAP), dashboards, and advanced visualization. The opportunity to extend traditional BI with
R analytics revolves on the introduction of advanced statistical modeling and visualizations native to R. The
challenge is to seamlessly integrate R capabilities within the existing BI space. This presentation will explain
and demo an initial approach to integrating R with two comprehensive open source BI (OSBI) platforms –
Pentaho and Jaspersoft. Our efforts will be successful if we stimulate additional progress, transparency and
innovation by combining the R and BI worlds.
The demonstration will show how we integrated the OSBI platforms with R through use of RServe and
its Java API. The BI platforms provide an end user web application which include application security,
data provisioning and BI functionality. Our integration will demonstrate a process by which BI components
can be created that prompt the user for parameters, acquire data from a relational database and pass into
RServer, invoke R commands for processing, and display the resulting R generated statistics and/or graphs
within the BI platform. Discussion will include concepts related to creating a reusable java class library of
commonly used processes to speed additional development.

If you know Java- try http://ramanareddyg.blog.com/2010/07/03/integrating-r-and-pentaho-data-integration/

 

and I like this list by two venerable powerhouses of the BI Open Source Movement

http://www.openbi.com/demosarticles.html

Open Source BI as disruptive technology

http://www.openbi.biz/articles/osbi_disruption_openbi.pdf

Open Source Punditry

TITLE AUTHOR COMMENTS
Commercial Open Source BI Redux Dave Reinke & Steve Miller An review and update on the predictions made in our 2007 article focused on the current state of the commercial open source BI market. Also included is a brief analysis of potential options for commercial open source business models and our take on their applicability.
Open Source BI as Disruptive Technology Dave Reinke & Steve Miller Reprint of May 2007 DM Review article explaining how and why Commercial Open Source BI (COSBI) will disrupt the traditional proprietary market.

Spotlight on R

TITLE AUTHOR COMMENTS
R You Ready for Open Source Statistics? Steve Miller R has become the “lingua franca” for academic statistical analysis and modeling, and is now rapidly gaining exposure in the commercial world. Steve examines the R technology and community and its relevancy to mainstream BI.
R and BI (Part 1): Data Analysis with R Steve Miller An introduction to R and its myriad statistical graphing techniques.
R and BI (Part 2): A Statistical Look at Detail Data Steve Miller The usage of R’s graphical building blocks – dotplots, stripplots and xyplots – to create dashboards which require little ink yet tell a big story.
R and BI (Part 3): The Grooming of Box and Whiskers Steve Miller Boxplots and variants (e.g. Violin Plot) are explored as an essential graphical technique to summarize data distributions by categories and dimensions of other attributes.
R and BI (Part 4): Embellishing Graphs Steve Miller Lattices and logarithmic data transformations are used to illuminate data density and distribution and find patterns otherwise missed using classic charting techniques.
R and BI (Part 5): Predictive Modelling Steve Miller An introduction to basic predictive modelling terminology and techniques with graphical examples created using R.
R and BI (Part 6) :
Re-expressing Data
Steve Miller How do you deal with highly skewed data distributions? Standard charting techniques on this “deviant” data often fail to illuminate relationships. This article explains techniques to re-express skewed data so that it is more understandable.
The Stock Market, 2007 Steve Miller R-based dashboards are presented to demonstrate the return performance of various asset classes during 2007.
Bootstrapping for Portfolio Returns: The Practice of Statistical Analysis Steve Miller Steve uses the R open source stats package and Monte Carlo simulations to examine alternative investment portfolio returns…a good example of applied statistics using R.
Statistical Graphs for Portfolio Returns Steve Miller Steve uses the R open source stats package to analyze market returns by asset class with some very provocative embedded trellis charts.
Frank Harrell, Iowa State and useR!2007 Steve Miller In August, Steve attended the 2007 Internation R User conference (useR!2007). This article details his experiences, including his meeting with long-time R community expert, Frank Harrell.
An Open Source Statistical “Dashboard” for Investment Performance Steve Miller The newly launched Dashboard Insight web site is focused on the most useful of BI tools: dashboards. With this article discussing the use of R and trellis graphics, OpenBI brings the realm of open source to this forum.
Unsexy Graphics for Business Intelligence Steve Miller Utilizing Tufte’s philosophy of maximizing the data to ink ratio of graphics, Steve demonstrates the value in dot plot diagramming. The R open source statistical/analytics software is showcased.
I think that the report generation package Brew would also qualify as a BI package, but large scale implementation remains to be seen in
a commercial business environment
  • brew: Creating Repetitive Reports
 brew: Templating Framework for Report Generation

brew implements a templating framework for mixing text and R code for report generation. brew template syntax is similar to PHP, Ruby's erb module, Java Server Pages, and Python's psp module. http://bit.ly/jINmaI
  • Yarr- creating reports in R
to be continued ( when I have more time and the temperature goes down from 110F in Delhi, India)

Google unleashes Fusion Tables

I just discovered Fusion Tables. There is life beyond the amazing Jeff’s Amazon Ec2/s3 after all!

Check out http://www.google.com/fusiontables/public/tour/index.html

Gather, visualize and share data online

Don’t have a Google Account?
Create one now

  • Visualize and publish your data as maps, timelines and charts
  • Host your data tables online
  • Combine data from multiple people

data table turns into map

Google Fusion Tables is a modern data management and publishing web application that makes it easy
to host, manage, collaborate on, visualize, and publish data tables online.

What can I do with Google Fusion Tables?

Import your own data
Upload data tables from spreadsheets or CSV files, even KML. Developers can use the Fusion Tables API to insert, update, delete and query data programmatically. You can export your data as CSV or KML too.

Visualize it instantly
See the data on a map or as a chart immediately. Use filters for more selective visualizations.

Publish your visualization on other web properties
Now that you’ve got that nice map or chart of your data, you can embed it in a web page or blog post. Or send a link by email or IM. It will always display the latest data values from your table and helps you communicate your story more easily.

Look at the Fusion Tables Example Gallery

at https://sites.google.com/site/fusiontablestalks/stories

If you are worried about data.gov closing down, heres a snapshot of Fusion Table Public datasets.


 

Google Refine

An interesting data cleaning software from Google at

https://code.google.com/p/google-refine/

From the page at

https://code.google.com/p/google-refine/wiki/UserGuide

The Basics

First, although Google Refine might start out looking like a spreadsheet program (Microsoft Excel, Google Spreadsheets, etc.), don’t expect it to work like a spreadsheet program. That’s almost like expecting a database to work like a text editor.

Google Refine is NOT for entering new data one cell at a time. It is NOT for doing accounting.

Google Refine is for applying transformations over many existing cells in bulk, for the purpose of cleaning up the data, extending it with more data from other sources, and getting it to some form that other tools can consume.

To use Google Refine, think in big patterns. For example, to spot errors, think

  • Show me every row where the string length of the customer’s name is longer than 50 characters (because I suspect that the customer’s address is mistakenly included in the name field)
  • Show me every row where the contract fee is less than 1 (because I suspect the fee was entered in unit of thousand dollars rather than dollars)
  • Show me every row where the description field (scraped from some web site) contains “&” (because I suspect it wasn’t decoded properly)

To edit data, think

  • For every row where the contract fee is less than 1, multiply the fee by 1000.
  • For every row where the customer name contains a comma (it has been entered as “last_name, first_name”), split the name by the comma, reverse the array, and join it back with a space (producing “first_name last_name”)

To specify patterns, use filters and facets. Typically, you create a filter or facet on a particular column. For example, you can create a numeric facet on the “contract fee” column and adjust its range selector to select values less than 1. If the default facet doesn’t do what you want, you can configure it (by clicking “change” on the facet’s header). For example, you can create a text facet with on the same “contract fee” column with this expression:

  value < 1

It will show 2 choices: true and false. Just select true. Then, invoke the Transform command on that same column and enter the expression

  value * 1000

That Transform command affects only rows where the “contract fee” cell contains a value less than 1.

You can use several filters and facets together. Only rows that are selected by all facets and filters will be shown in the data table. For example, say you have two text facets, one on the “contract fee” column with the expression

  value < 1

and another on the “state” column (with the default expression). If you select “true” in the first facet and “Nevada” in the second, then you will only see rows for contracts in Nevada with fees less than 1.

Analogies

Databases

If you have programmed databases before (performing SQL queries), then what Google Refine works should be quite familiar to you. Creating filters and facets and selecting something in them is like performing this SELECT statement:

  SELECT *
  WHERE ... constraints determined by selection in facets and filters ...

And invoking the Transform command on a column while having some filters and facets selected is like performing this UPDATE statement

  UPDATE whole_table SET column_X = ... expression ...
  WHERE ... constraints determined by selection in facets and filters ...

The difference between Google Refine and databases is that the facets show you choices that you can select, whereas databases assume that you already know what’s in the data.

 

Data Visualization using Tableau

Image representing Tableau Software as depicte...
Image via CrunchBase

Here is a great piece of software for data visualization– the public version is free.

And you can use it for Desktop Analytics as well as BI /server versions at very low cost.

About Tableau Software

http://www.tableausoftware.com/press_release/tableau-massive-growth-hiring-q3-2010

Tableau was named by Software Magazine as the fastest growing software company in the $10 million to $30 million range in the world, and the second fastest growing software company worldwide overall. The ranking stems from the publication’s 28th annual Software 500 ranking of the world’s largest software service providers.

“We’re growing fast because the market is starving for easy-to-use products that deliver rapid-fire business intelligence to everyone. Our customers want ways to unlock their databases and produce engaging reports and dashboards,” said Christian Chabot CEO and co-founder of Tableau.

http://www.tableausoftware.com/about/who-we-are

History in the Making

Put together an Academy-Award winning professor from the nation’s most prestigious university, a savvy business leader with a passion for data, and a brilliant computer scientist. Add in one of the most challenging problems in software – making databases and spreadsheets understandable to ordinary people. You have just recreated the fundamental ingredients for Tableau.

The catalyst? A Department of Defense (DOD) project aimed at increasing people’s ability to analyze information and brought to famed Stanford professor, Pat Hanrahan. A founding member of Pixar and later its chief architect for RenderMan, Pat invented the technology that changed the world of animated film. If you know Buzz and Woody of “Toy Story”, you have Pat to thank.

Under Pat’s leadership, a team of Stanford Ph.D.s got together just down the hall from the Google folks. Pat and Chris Stolte, the brilliant computer scientist, realized that data visualization could produce large gains in people’s ability to understand information. Rather than analyzing data in text form and then creating visualizations of those findings, Pat and Chris invented a technology called VizQL™ by which visualization is part of the journey and not just the destination. Fast analytics and visualization for everyone was born.

While satisfying the DOD project, Pat and Chris met Christian Chabot, a former data analyst who turned into Jello when he saw what had been invented. The three formed a company and spun out of Stanford like so many before them (Yahoo, Google, VMWare, SUN). With Christian on board as CEO, Tableau rapidly hit one success after another: its first customer (now Tableau’s VP, Operations, Tom Walker), an OEM deal with Hyperion (now Oracle), funding from New Enterprise Associates, a PC Magazine award for “Product of the Year” just one year after launch, and now over 50,000 people in 50+ countries benefiting from the breakthrough.

also see http://www.tableausoftware.com/about/leadership

http://www.tableausoftware.com/about/board

—————————————————————————-

and now  a demo I ran on the Kaggle contest data (it is a csv dataset with 95000 rows)

I found Tableau works extremely good at pivoting data and visualizing it -almost like Excel on  Steroids. Download the free version here ( I dont know about an academic program (see links below) but software is not expensive at all)

http://buy.tableausoftware.com/

Desktop Personal Edition

The Personal Edition is a visual analysis and reporting solution for data stored in Excel, MS Access or Text Files. Available via download.

Product Information

$999*

Desktop Professional Edition

The Professional Edition is a visual analysis and reporting solution for data stored in MS SQL Server, MS Analysis Services, Oracle, IBM DB2, Netezza, Hyperion Essbase, Teradata, Vertica, MySQL, PostgreSQL, Firebird, Excel, MS Access or Text Files. Available via download.

Product Information

$1800*

Tableau Server

Tableau Server enables users of Tableau Desktop Professional to publish workbooks and visualizations to a server where users with web browsers can access and interact with the results. Available via download.

Product Information

Contact Us

* Price is per Named User and includes one year of maintenance (upgrades and support). Products are made available as a download immediately after purchase. You may revisit the download site at any time during your current maintenance period to access the latest releases.

 

 

Awesome new features in Doc Googles

I really liked some awesome new features in Google Docs, and I am mentioning just some of the features I like because they are not there in Windows Office mostly.

Sourcehttp://www.google.com/google-d-s/whatsnew.html

List View and Mobile View Improvements
Now you can see your spreadsheets with all their formatting in List View and on your mobile device, this includes background/foreground colors, borders and text formatting!

Themes for forms
Add a splash of color to your surveys and questionnaires. When you create and edit a form, simply apply one of the 70 themes

  • Forms improvements
    We’ve added a new question type (grid), support for right-to-left languages in forms, and a new color scheme for the forms summary. Also, you can now pre-populate form fields with URL parameters, and if you use Google Apps, you can create forms which require sign-in to access. Learn more

  • Translate document
    You can now translate an entire document into over 40 languages.

    Translate and detect languages in Google spreadsheets
    =GoogleTranslate(“Hola, ¿cómo estás?”,”es”,”en”) gives “Hi, how are you?” (or leave out “en” and we’ll automatically choose the default language of your spreadsheet) What if you don’t know the language? =DetectLanguage(“Hola, ¿cómo estás?”) gives “es”.

    A new curve tool in drawings
    Create smooth curves based on a series of points with this new tool.

    Optical character recognition (OCR)
    You can now upload and convert PDF or image files to text.

     

    You can read the awesome new ones athttp://www.google.com/google-d-s/whatsnew.html but these are the ones I felt were missing in Windows Office.

    Coming up- a Review of newly forked Libre Office

Comparing Big SpreadSheet A to Big SpreadSheet B

Many organizations have pre-fixed formats for their reporting needs.  These formats or Management Information Reports are updated at monthly and quarterly intervals at exactly the same format. However when the spreadsheets become big, analysis becomes tedious in comparing two big spreadsheets due to the sheer number of cells involved.

Using SAS , we can automate this process almost instantly.

We will use proc import to import data from the spreadsheets in such a manner that top row imported consist column headings (sas dataset variables).Note both spreadsheets are exactly in same format.

We will then use proc compare to compare these two datasets.

We can then use the integrated approach to automated reporting in SAS (See Archives- Category Analytics) to further reduce this to a simple batch process.

The relevant codes are –

%let pathfile = “C:\Documents and Settings\” ;
run;

/*CREATING LIBRARY NAME */

libname auto &pathfile;

run;

/*TO CONSERVE SPACE*/

options compress=yes;

/*TO MAKE LOG READABLE */

options macrogen symbolgen;

PROC IMPORT OUT= auto.TEST1
DATAFILE= “C:\Documents and Settings\excel1-full.xls”
DBMS=EXCEL2000 REPLACE;
SHEET=”‘Sales$'”;

/*SPECIFYING WORKSHEET FOR MULTIPLE SHEETS */
GETNAMES=YES;

/*TO TAKE VARIABLE NAMES FROM TOP ROW */

   RANGE=”A4:AB2000″;

/*SPECIFYING RANGE OF CELLS  IN SPREADSHEET TO BE READ */

RUN;

PROC IMPORT OUT= auto.TEST2
DATAFILE= “C:\Documents and Settings\excel2-full.xls”
DBMS=EXCEL2000 REPLACE;
SHEET=”‘Sales$'”;

/*SPECIFYING WORKSHEET FOR MULTIPLE SHEETS */
GETNAMES=YES;

/*TO TAKE VARIABLE NAMES FROM TOP ROW */

   RANGE=”A4:AB2000″;

/*SPECIFYING RANGE OF CELLS  IN SPREADSHEET TO BE READ */

RUN;

/* COMPARING THE TWO SPREADSHEETS */

proc compare base=auto.test1 compare=auto.test2;
var

/*SPECIFYING WHAT VARIABLES TO BE COMPARED */
Applications

Approvals

Disbursals

30dayplus

60dayplus

90dayplus

;
with Branch;

/*SPECIFYING VARIABLE FOR COMPARISON

FOR SAME BRANCH IN THIS CASE */
 run;
The output will simply compare and compute the cell by cell difference.

You can then use ods to ouput this in another big spreadsheet 🙂

This is particularly relevant in telecommunications and banks, where they need to compare a lot of metrics across timely intervals.