## Top Ten Business Analytics Graphs-Line Charts (2/10)

A line chart is one of the most commonly used charts in business analytics and metrics reporting. It basically consists of two variables plotted along the axes with the adjacent points being joined by line segments. Most often used with time series on the x-axis, line charts are simple to understand and use.
Variations on the line graph can include fan charts in time series which include joining line chart of historic data with ranges of future projections. Another common variation is to plot the linear regression or trend line between the two variables  and superimpose it on the graph.
The slope of the line chart shows the rate of change at that particular point , and can also be used to highlight areas of discontinuity or irregular change between two variables.

The basic syntax of line graph is created by first using Plot() function to plot the points and then lines () function to plot the lines between the points.

> str(cars)
‘data.frame’:   50 obs. of  2 variables:
\$ speed: num  4 4 7 7 8 9 10 10 10 11 …
\$ dist : num  2 10 4 22 16 10 18 26 34 17 …
> plot(cars)
> lines(cars,type=”o”, pch=20, lty=2, col=”green”)
> title(main=”Example Automobiles”, col.main=”blue”, font.main=2)

An example of Time Series Forecasting graph  or fan chart is http://addictedtor.free.fr/graphiques/RGraphGallery.php?graph=51

## Open Source Compiler for SAS language/ GNU -DAP

I am still testing this out.

But if you know bit more about make and .compile in Ubuntu check out

http://www.gnu.org/software/dap/

I loved the humorous introduction

Dap is a small statistics and graphics package based on C. Version 3.0 and later of Dap can read SBS programs (based on the utterly famous, industry standard statistics system with similar initials – you know the one I mean)! The user wishing to perform basic statistical analyses is now freed from learning and using C syntax for straightforward tasks, while retaining access to the C-style graphics and statistics features provided by the original implementation. Dap provides core methods of data management, analysis, and graphics that are commonly used in statistical consulting practice (univariate statistics, correlations and regression, ANOVA, categorical data analysis, logistic regression, and nonparametric analyses).

Anyone familiar with the basic syntax of C programs can learn to use the C-style features of Dap quickly and easily from the manual and the examples contained in it; advanced features of C are not necessary, although they are available. (The manual contains a brief introduction to the C syntax needed for Dap.) Because Dap processes files one line at a time, rather than reading entire files into memory, it can be, and has been, used on data sets that have very many lines and/or very many variables.

I wrote Dap to use in my statistical consulting practice because the aforementioned utterly famous, industry standard statistics system is (or at least was) not available on GNU/Linux and costs a bundle every year under a lease arrangement. And now you can run programs written for that system directly on Dap! I was generally happy with that system, except for the graphics, which are all but impossible to use,  but there were a number of clumsy constructs left over from its ancient origins.

http://www.gnu.org/software/dap/#Sample output

• Unbalanced ANOVA
• Crossed, nested ANOVA
• Random model, unbalanced
• Mixed model, balanced
• Mixed model, unbalanced
• Split plot
• Latin square
• Missing treatment combinations
• Linear regression
• Linear regression, model building
• Ordinal cross-classification
• Stratified 2×2 tables
• Loglinear models
• Logit  model for linear-by-linear association
• Logistic regression
• Copyright © 2001, 2002, 2003, 2004 Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA

sounds too good to be true- GNU /DAP joins WPS workbench and Dulles Open’s Carolina as the third SAS language compiler (besides the now defunct BASS software) see http://en.wikipedia.org/wiki/SAS_language#Controversy

Dap was written to be a free replacement for SAS, but users are assumed to have a basic familiarity with the C programming language in order to permit greater flexibility. Unlike R it has been designed to be used on large data sets.

It has been designed so as to cope with very large data sets; even when the size of the data exceeds the size of the computer’s memory

## Protovis a graphical toolkit for visualization

I just found about a new data visualization tool called Protovis http://vis.stanford.edu/protovis/ex/

Protovis composes custom views of data with simple marks such as bars and dots. Unlike low-level graphics libraries that quickly become tedious for visualization, Protovis defines marks through dynamic properties that encode data, allowing inheritancescales and layouts to simplify construction.

Protovis is free and open-source and is a Stanford project. It has been used in web interface R Node (which I will talk later )

http://squirelove.net/r-node/doku.php

### Conventional

While Protovis is designed for custom visualization, it is still easy to create many standard chart types. These simpler examples serve as an introduction to the language, demonstrating key abstractions such as quantitative and ordinal scales, while hinting at more advanced features, including stack layout.

### Custom

Many charting libraries provide stock chart designs, but offer only limited customization; Protovis excels at custom visualization design through a concise representation and precise control over graphical marks. These examples, including a few recreations of unusual historical designs, demonstrate the language’s expressiveness.

Try Protovis today 🙂 http://vis.stanford.edu/protovis/

It uses JavaScript and SVG for web-native visualizations; no plugin required (though you will need a modern web browser)! Although programming experience is helpful, Protovis is mostly declarative and designed to be learned by example.

## Handling time and date in R

One of the most frustrating things I had to do while working as financial business analysts was working with Data Time Formats in Base SAS. The syntax was simple enough and SAS was quite good with handing queries to the Oracle data base that the client was using, but remembering the different types of formats in SAS language was a challenge (there was a date9. and date6 and mmddyy etc )

Data and Time variables are particularly important variables in financial industry as almost everything is derived variable from the time (which varies) while other inputs are mostly constants. This includes interest as well as late fees and finance fees.

In R, date and time are handled quite simply-

Use the strptime( dataset, format) function to convert the character into string

For example if the variable dob is “01/04/1977) then following will convert into a date object

z=strptime(dob,”%d/%m/%Y”)

and if the same date is 01Apr1977

`z=strptime(dob,"%d%b%Y")`

does the same

For troubleshooting help with date and time, remember to enclose the formats

%d,%b,%m and % Y in the same exact order as the original string- and if there are any delimiters like ” -” or “/” then these delimiters are entered in exactly the same order in the format statement of the strptime

Sys.time() gives you the current date-time while the function difftime(time1,time2) gives you the time intervals( say if you have two columns as date-time variables)

What are the various formats for inputs in date time?

`%a`
Abbreviated weekday name in the current locale. (Also matches full name on input.)
`%A`
Full weekday name in the current locale. (Also matches abbreviated name on input.)
`%b`
Abbreviated month name in the current locale. (Also matches full name on input.)
`%B`
Full month name in the current locale. (Also matches abbreviated name on input.)
`%c`
Date and time. Locale-specific on output, `"%a %b %e %H:%M:%S %Y"` on input.
`%d`
Day of the month as decimal number (01–31).
`%H`
Hours as decimal number (00–23).
`%I`
Hours as decimal number (01–12).
`%j`
Day of year as decimal number (001–366).
`%m`
Month as decimal number (01–12).
`%M`
Minute as decimal number (00–59).
`%p`
AM/PM indicator in the locale. Used in conjunction with `%I` and not with `%H`. An empty string in some locales.
`%S`
Second as decimal number (00–61), allowing for up to two leap-seconds (but POSIX-compliant implementations will ignore leap seconds).
`%U`
Week of the year as decimal number (00–53) using Sunday as the first day 1 of the week (and typically with the first Sunday of the year as day 1 of week 1). The US convention.
`%w`
Weekday as decimal number (0–6, Sunday is 0).
`%W`
Week of the year as decimal number (00–53) using Monday as the first day of week (and typically with the first Monday of the year as day 1 of week 1). The UK convention.
`%x`
Date. Locale-specific on output, `"%y/%m/%d"` on input.
`%X`
Time. Locale-specific on output, `"%H:%M:%S"` on input.
`%y`
Year without century (00–99). Values 00 to 68 are prefixed by 20 and 69 to 99 by 19 – that is the behaviour specified by the 2004 POSIX standard, but it does also say ‘it is expected that in a future version the default century inferred from a 2-digit year will change’.
`%Y`
Year with century.
`%z`
Signed offset in hours and minutes from UTC, so `-0800` is 8 hours behind UTC.
`%Z`
(output only.) Time zone as a character string (empty if not available).

Also to read the helpful documentation (especially for time zone level, and leap year seconds and differences)
http://stat.ethz.ch/R-manual/R-patched/library/base/html/difftime.html
http://stat.ethz.ch/R-manual/R-patched/library/base/html/strptime.html
http://stat.ethz.ch/R-manual/R-patched/library/base/html/Ops.Date.html
http://stat.ethz.ch/R-manual/R-patched/library/base/html/Dates.html

## Challenges of Analyzing a dataset (with R)

Analyzing data can have many challenges associated with it. In the case of business analytics data, these challenges or constraints can have a marked effect on the quality and timeliness of the analysis as well as the expected versus actual payoff from the analytical results.

Challenges of Analytical Data Processing-

1) Data Formats- Reading in complete data, without losing any part (or meta data), or adding in superfluous details (that increase the scope). Technical constraints of data formats are relatively easy to navigate thanks to ODBC and well documented and easily search-able syntax and language.

The costs of additional data augmentation (should we pay for additional credit bureau data to be appended) , time of storing and processing the data (every column needed for analysis can add in as many rows as whole dataset, which can be a time enhancing problem if you are considering an extra 100 variables with a few million rows), but above all that of business relevance and quality guidelines will ensure basic data input and massaging are considerable parts of whole analytical project timeline.

2) Data Quality-Perfect data exists in a perfect world. The price of perfect information is one business will mostly never budget or wait for. To deliver inferences and results based on summaries of data which has missing, invalid, outlier data embedded within it makes the role of an analyst just as important as which ever tool is chosen to remove outliers, replace missing values, or treat invalid data.

3) Project Scope-

How much data? How much Analytical detail versus High Level Summary? Timelines for delivery as well as refresh of data analysis? Checks (statistical as well as business)?

How easy is it to load and implement the new analysis in existing Information Technology Infrastructure? These are some of the outer parameters that can limit both your analytical project scope, your analytical tool choice, and your processing methodology.
4) Output Results vis a vis stakeholder expectation management-

Stakeholders like to see results, not constraints, hypothesis ,assumptions , p-value, or chi -square value. Output results need to be streamlined to a decision management process to justify the investment of human time and effort in an analytical project, choice,training and navigating analytical tool complexities and constraints are subset of it. Optimum use of graphical display is a part of aligning results to a more palatable form to stakeholders, provided graphics are done nicely.

Eg Marketing wants to get more sales so they need a clear campaign, to target certain customers via specific channels with specified collateral. In order to base their business judgement, business analytics needs to validate , cross validate and sometimes invalidate this business decision making with clear transparent methods and processes.

Given a dataset- the basic analytical steps that an analyst will do with R are as follows. This is meant as a note for analysts at a beginner level with R.

Package -specific syntax

install.packages(package1) #This installs a package locally, a one time event
library(package1) #This loads a specified package in the current R session, which needs to be done every R session

CRAN________LOCAL HARD DISK_________R SESSION is the top to bottom hierarchy of package storage and invocation.

ls() #This lists all objects or datasets currently active in the R session

> names(assetsCorr)  #This gives the names of variables within a dataframe
[1] “AssetClass”            “LargeStocksUS”         “SmallStocksUS”
[4] “CorporateBondsUS”      “TreasuryBondsUS”       “RealEstateUS”
[10] “StocksSwitzerland”     “StocksEmergingMarkets”

> str(assetsCorr) #gives complete structure of dataset
‘data.frame’:    12 obs. of  11 variables:
\$ AssetClass           : Factor w/ 12 levels “CorporateBondsUS”,..: 4 5 2 6 1 12 3 7 11 9 …
\$ LargeStocksUS        : num  15.3 16.4 1 0 0 …
\$ SmallStocksUS        : num  13.49 16.64 0.66 1 0 …
\$ CorporateBondsUS     : num  9.26 6.74 0.38 0.46 1 0 0 0 0 0 …
\$ TreasuryBondsUS      : num  8.44 6.26 0.33 0.27 0.95 1 0 0 0 0 …
\$ RealEstateUS         : num  10.6 17.32 0.08 0.59 0.35 …
\$ StocksCanada         : num  10.25 19.78 0.56 0.53 -0.12 …
\$ StocksUK             : num  10.66 13.63 0.81 0.41 0.24 …
\$ StocksGermany        : num  12.1 20.32 0.76 0.39 0.15 …
\$ StocksSwitzerland    : num  15.01 20.8 0.64 0.43 0.55 …
\$ StocksEmergingMarkets: num  16.5 36.92 0.3 0.6 0.12 …

> dim(assetsCorr) #gives dimensions observations and variable number
[1] 12 11

str(Dataset) – This gives the structure of the dataset (note structure gives both the names of variables within dataset as well as dimensions of the dataset)

head(dataset,n1) gives the first n1 rows of dataset while
tail(dataset,n2) gives the last n2 rows of a dataset where n1,n2 are numbers and dataset is the name of the object (here a data frame that is being considered)

summary(dataset) gives you a brief summary of all variables while

library(Hmisc)
describe(dataset) gives a detailed description on the variables

simple graphics can be given by

hist(Dataset1)
and
plot(Dataset1)

As you can see in above cases, there are multiple ways to get even basic analysis about data in R- however most of the syntax commands are intutively understood (like hist for histogram, t.test for t test, plot for plot).

For detailed analysis throughout the scope of analysis, for a business analytics user it is recommended to using multiple GUI, and multiple packages. Even for highly specific and specialized analytical tasks it is recommended to check for a GUI that incorporates the required package.

## Trying out Google Prediction API from R

So I saw the news at NY R Meetup and decided to have a go at Prediction API Package (which first started off as a blog post at

1)My OS was Ubuntu 10.10 Netbook

Ubuntu has a slight glitch plus workaround for installing the RCurl package on which the Google Prediction API is dependent- you need to first install this Ubuntu package for RCurl to install libcurl4-gnutls-dev

Once you install that using Synaptic,

Simply start R

2) Install Packages rjson and Rcurl using install.packages and choosing CRAN

Since GooglePredictionAPI is not yet on CRAN

,

When you start R, simply run

.libPaths()[1]

5) Now the following line works

1. Under R prompt,
2. `> install.packages("googlepredictionapi_0.1.tar.gz", repos=NULL, type="source")`

and thats the Google Storage manager

Notes on Training Data-

Use a csv file

The first column is the score column (like 1,0 or prediction score)

There are no headers- so delete headers from data file and move the dependent variable to the first column  (Note I used data from the kaggle contest for R package recommendation at

6) The good stuff:

It then starts showing you time elapsed for training.

Now you can disconnect and go off (actually I got disconnected by accident before coming back in a say 5 minutes so this is the part where I think this is what happened is why it happened, dont blame me, test it for yourself) –

and when you come back (hopefully before token expires)  you can see status of your request (see below)

```> library(rjson)
> library(RCurl)
> my.model <- PredictionApiTrain(data="gs://numtraindata/training_data")
The request for training has sent, now trying to check if training is completed
Training on numtraindata/training_data: time:2.09 seconds
Training on numtraindata/training_data: time:7.00 seconds```

7)

Note I changed the format from the URL where my data is located- simply go to your Google Storage Manager and right click on the file name for link address  ( https://sandbox.google.com/storage/numtraindata/training_data.csv)

to gs://numtraindata/training_data  (that kind of helps in any syntax error)

8) From the kind of high level instructions at  https://code.google.com/p/google-prediction-api-r-client/, you could also try this on a local file

## Usage

```## Load googlepredictionapi and dependent libraries
library(rjson)
library(RCurl)