Top Ten Business Analytics Graphs-Line Charts (2/10)

A line chart is one of the most commonly used charts in business analytics and metrics reporting. It basically consists of two variables plotted along the axes with the adjacent points being joined by line segments. Most often used with time series on the x-axis, line charts are simple to understand and use.
Variations on the line graph can include fan charts in time series which include joining line chart of historic data with ranges of future projections. Another common variation is to plot the linear regression or trend line between the two variables and superimpose it on the graph.
The slope of the line chart shows the rate of change at that particular point , and can also be used to highlight areas of discontinuity or irregular change between two variables.

The basic syntax of line graph is created by first using Plot() function to plot the points and then lines () function to plot the lines between the points.

> str(cars)
‘data.frame’: 50 obs. of 2 variables:
$ speed: num 4 4 7 7 8 9 10 10 10 11 …
$ dist : num 2 10 4 22 16 10 18 26 34 17 …
> plot(cars)
> lines(cars,type=”o”, pch=20, lty=2, col=”green”)
> title(main=”Example Automobiles”, col.main=”blue”, font.main=2)

An example of Time Series Forecasting graph or fan chart is http://addictedtor.free.fr/graphiques/RGraphGallery.php?graph=51

Additional tutorials for creating line graphs are at http://www.statmethods.net/graphs/line.html ,http://www.phaget4.org/R/plot.html and http://www.harding.edu/fmccown/r/#linecharts

Top Ten Graphs for Business Analytics -Pie Charts (1/10) (decisionstats.com)

Open Source Compiler for SAS language/ GNU -DAP

I am still testing this out.

But if you know bit more about make and .compile in Ubuntu check out

http://www.gnu.org/software/dap/

I loved the humorous introduction

Dap is a small statistics and graphics package based on C. Version 3.0 and later of Dap can read SBS programs (based on the utterly famous, industry standard statistics system with similar initials – you know the one I mean)! The user wishing to perform basic statistical analyses is now freed from learning and using C syntax for straightforward tasks, while retaining access to the C-style graphics and statistics features provided by the original implementation. Dap provides core methods of data management, analysis, and graphics that are commonly used in statistical consulting practice (univariate statistics, correlations and regression, ANOVA, categorical data analysis, logistic regression, and nonparametric analyses).

Anyone familiar with the basic syntax of C programs can learn to use the C-style features of Dap quickly and easily from the manual and the examples contained in it; advanced features of C are not necessary, although they are available. (The manual contains a brief introduction to the C syntax needed for Dap.) Because Dap processes files one line at a time, rather than reading entire files into memory, it can be, and has been, used on data sets that have very many lines and/or very many variables.

I wrote Dap to use in my statistical consulting practice because the aforementioned utterly famous, industry standard statistics system is (or at least was) not available on GNU/Linux and costs a bundle every year under a lease arrangement. And now you can run programs written for that system directly on Dap! I was generally happy with that system, except for the graphics, which are all but impossible to use, but there were a number of clumsy constructs left over from its ancient origins.

http://www.gnu.org/software/dap/#Sample output

Unbalanced ANOVA

Crossed, nested ANOVA

Random model, unbalanced

Mixed model, balanced

Mixed model, unbalanced

Split plot

Latin square

Missing treatment combinations

Linear regression

Linear regression, model building

Ordinal cross-classification

Stratified 2×2 tables

Loglinear models

Logit model for linear-by-linear association

Logistic regression

sounds too good to be true- GNU /DAP joins WPS workbench and Dulles Open’s Carolina as the third SAS language compiler (besides the now defunct BASS software) see http://en.wikipedia.org/wiki/SAS_language#Controversy

Also see http://en.wikipedia.org/wiki/DAP_(software)

Dap was written to be a free replacement for SAS, but users are assumed to have a basic familiarity with the C programming language in order to permit greater flexibility. Unlike R it has been designed to be used on large data sets.

It has been designed so as to cope with very large data sets; even when the size of the data exceeds the size of the computer’s memory

R courses from Statistics.com (revolutionanalytics.com)
Categorical Data Analysis for the Behavioral and Social Sciences (psypress.com)
Skills of a good data miner (zyxo.wordpress.com)
GNU Octave 3.4 has just been released (gnu.org)
Revolution R Enterprise 4.2 now available (revolutionanalytics.com)

Protovis a graphical toolkit for visualization

I just found about a new data visualization tool called Protovis http://vis.stanford.edu/protovis/ex/

Protovis composes custom views of data with simple marks such as bars and dots. Unlike low-level graphics libraries that quickly become tedious for visualization, Protovis defines marks through dynamic properties that encode data, allowing inheritance, scales and layouts to simplify construction.

Protovis is free and open-source and is a Stanford project. It has been used in web interface R Node (which I will talk later )

http://squirelove.net/r-node/doku.php

Conventional

While Protovis is designed for custom visualization, it is still easy to create many standard chart types. These simpler examples serve as an introduction to the language, demonstrating key abstractions such as quantitative and ordinal scales, while hinting at more advanced features, including stack layout.

Custom

Many charting libraries provide stock chart designs, but offer only limited customization; Protovis excels at custom visualization design through a concise representation and precise control over graphical marks. These examples, including a few recreations of unusual historical designs, demonstrate the language’s expressiveness.

Try Protovis today 🙂 http://vis.stanford.edu/protovis/

It uses JavaScript and SVG for web-native visualizations; no plugin required (though you will need a modern web browser)! Although programming experience is helpful, Protovis is mostly declarative and designed to be learned by example.

Linking Petterson – Visualising FRBR data with Protovis (home.hio.no)
The Stanford Visualization Group Debuts Visual Tool for Cleaning Up Data (readwriteweb.com)
Roll your own JavaScript lambda syntax (strobe.cc)

Handling time and date in R

One of the most frustrating things I had to do while working as financial business analysts was working with Data Time Formats in Base SAS. The syntax was simple enough and SAS was quite good with handing queries to the Oracle data base that the client was using, but remembering the different types of formats in SAS language was a challenge (there was a date9. and date6 and mmddyy etc )

Data and Time variables are particularly important variables in financial industry as almost everything is derived variable from the time (which varies) while other inputs are mostly constants. This includes interest as well as late fees and finance fees.

In R, date and time are handled quite simply-

Use the strptime( dataset, format) function to convert the character into string

For example if the variable dob is “01/04/1977) then following will convert into a date object

z=strptime(dob,”%d/%m/%Y”)

and if the same date is 01Apr1977

z=strptime(dob,"%d%b%Y")

does the same

For troubleshooting help with date and time, remember to enclose the formats

%d,%b,%m and % Y in the same exact order as the original string- and if there are any delimiters like ” -” or “/” then these delimiters are entered in exactly the same order in the format statement of the strptime

Sys.time() gives you the current date-time while the function difftime(time1,time2) gives you the time intervals( say if you have two columns as date-time variables)

What are the various formats for inputs in date time?

%a: Abbreviated weekday name in the current locale. (Also matches full name on input.)
%A: Full weekday name in the current locale. (Also matches abbreviated name on input.)
%b: Abbreviated month name in the current locale. (Also matches full name on input.)
%B: Full month name in the current locale. (Also matches abbreviated name on input.)
%c: Date and time. Locale-specific on output, "%a %b %e %H:%M:%S %Y" on input.
%d: Day of the month as decimal number (01–31).
%H: Hours as decimal number (00–23).
%I: Hours as decimal number (01–12).
%j: Day of year as decimal number (001–366).
%m: Month as decimal number (01–12).
%M: Minute as decimal number (00–59).
%p: AM/PM indicator in the locale. Used in conjunction with %I and not with %H. An empty string in some locales.
%S: Second as decimal number (00–61), allowing for up to two leap-seconds (but POSIX-compliant implementations will ignore leap seconds).

%U: Week of the year as decimal number (00–53) using Sunday as the first day 1 of the week (and typically with the first Sunday of the year as day 1 of week 1). The US convention.
%w: Weekday as decimal number (0–6, Sunday is 0).
%W: Week of the year as decimal number (00–53) using Monday as the first day of week (and typically with the first Monday of the year as day 1 of week 1). The UK convention.
%x: Date. Locale-specific on output, "%y/%m/%d" on input.
%X: Time. Locale-specific on output, "%H:%M:%S" on input.
%y: Year without century (00–99). Values 00 to 68 are prefixed by 20 and 69 to 99 by 19 – that is the behaviour specified by the 2004 POSIX standard, but it does also say ‘it is expected that in a future version the default century inferred from a 2-digit year will change’.
%Y: Year with century.
%z: Signed offset in hours and minutes from UTC, so -0800 is 8 hours behind UTC.
%Z: (output only.) Time zone as a character string (empty if not available).; Also to read the helpful documentation (especially for time zone level, and leap year seconds and differences); http://stat.ethz.ch/R-manual/R-patched/library/base/html/difftime.html; http://stat.ethz.ch/R-manual/R-patched/library/base/html/strptime.html; http://stat.ethz.ch/R-manual/R-patched/library/base/html/Ops.Date.html; http://stat.ethz.ch/R-manual/R-patched/library/base/html/Dates.html

Mark-up Your Events Online with Microformats (seomoz.org)
How do you convert octal numbers to decimal numbers (wiki.answers.com)
The Rollover of Doom: a Trap for Good Programmers (esr.ibiblio.org)
Formatting Dates, Times and Numbers in ASP.NET (4guysfromrolla.com)
JavaScript Date Format (stevenlevithan.com)
Rcpp 0.9.0 and RcppClassic 0.9.0 (dirk.eddelbuettel.com)
Comparing times and dates in Ruby (nofluffjuststuff.com)
C#: Programatically Convert between ASCII, Decimal, and Hexidecimal (lockergnome.com)
Scale and Scalability: Rethinking the Most Overused IT System Selling Point for the Cloud Era (itexpertvoice.com)
Coding Horror: A Visual Explanation of SQL Joins (codinghorror.com)

Challenges of Analyzing a dataset (with R)

GIF-animation showing a moving echocardiogram;... — Image via Wikipedia

Analyzing data can have many challenges associated with it. In the case of business analytics data, these challenges or constraints can have a marked effect on the quality and timeliness of the analysis as well as the expected versus actual payoff from the analytical results.

Challenges of Analytical Data Processing-

1) Data Formats- Reading in complete data, without losing any part (or meta data), or adding in superfluous details (that increase the scope). Technical constraints of data formats are relatively easy to navigate thanks to ODBC and well documented and easily search-able syntax and language.

The costs of additional data augmentation (should we pay for additional credit bureau data to be appended) , time of storing and processing the data (every column needed for analysis can add in as many rows as whole dataset, which can be a time enhancing problem if you are considering an extra 100 variables with a few million rows), but above all that of business relevance and quality guidelines will ensure basic data input and massaging are considerable parts of whole analytical project timeline.

2) Data Quality-Perfect data exists in a perfect world. The price of perfect information is one business will mostly never budget or wait for. To deliver inferences and results based on summaries of data which has missing, invalid, outlier data embedded within it makes the role of an analyst just as important as which ever tool is chosen to remove outliers, replace missing values, or treat invalid data.

3) Project Scope-

How much data? How much Analytical detail versus High Level Summary? Timelines for delivery as well as refresh of data analysis? Checks (statistical as well as business)?

How easy is it to load and implement the new analysis in existing Information Technology Infrastructure? These are some of the outer parameters that can limit both your analytical project scope, your analytical tool choice, and your processing methodology.
4) Output Results vis a vis stakeholder expectation management-

Stakeholders like to see results, not constraints, hypothesis ,assumptions , p-value, or chi -square value. Output results need to be streamlined to a decision management process to justify the investment of human time and effort in an analytical project, choice,training and navigating analytical tool complexities and constraints are subset of it. Optimum use of graphical display is a part of aligning results to a more palatable form to stakeholders, provided graphics are done nicely.

Eg Marketing wants to get more sales so they need a clear campaign, to target certain customers via specific channels with specified collateral. In order to base their business judgement, business analytics needs to validate , cross validate and sometimes invalidate this business decision making with clear transparent methods and processes.

Given a dataset- the basic analytical steps that an analyst will do with R are as follows. This is meant as a note for analysts at a beginner level with R.

Package -specific syntax

update.packages() #This updates all packages
install.packages(package1) #This installs a package locally, a one time event
library(package1) #This loads a specified package in the current R session, which needs to be done every R session

CRAN________LOCAL HARD DISK_________R SESSION is the top to bottom hierarchy of package storage and invocation.

ls() #This lists all objects or datasets currently active in the R session

> names(assetsCorr) #This gives the names of variables within a dataframe
[1] “AssetClass”            “LargeStocksUS”         “SmallStocksUS”
[4] “CorporateBondsUS”      “TreasuryBondsUS”       “RealEstateUS”
[7] “StocksCanada”          “StocksUK”              “StocksGermany”
[10] “StocksSwitzerland”     “StocksEmergingMarkets”

> str(assetsCorr) #gives complete structure of dataset
‘data.frame’:    12 obs. of 11 variables:
$ AssetClass           : Factor w/ 12 levels “CorporateBondsUS”,..: 4 5 2 6 1 12 3 7 11 9 …
$ LargeStocksUS        : num 15.3 16.4 1 0 0 …
$ SmallStocksUS        : num 13.49 16.64 0.66 1 0 …
$ CorporateBondsUS     : num 9.26 6.74 0.38 0.46 1 0 0 0 0 0 …
$ TreasuryBondsUS      : num 8.44 6.26 0.33 0.27 0.95 1 0 0 0 0 …
$ RealEstateUS         : num 10.6 17.32 0.08 0.59 0.35 …
$ StocksCanada         : num 10.25 19.78 0.56 0.53 -0.12 …
$ StocksUK             : num 10.66 13.63 0.81 0.41 0.24 …
$ StocksGermany        : num 12.1 20.32 0.76 0.39 0.15 …
$ StocksSwitzerland    : num 15.01 20.8 0.64 0.43 0.55 …
$ StocksEmergingMarkets: num 16.5 36.92 0.3 0.6 0.12 …

> dim(assetsCorr) #gives dimensions observations and variable number
[1] 12 11

str(Dataset) – This gives the structure of the dataset (note structure gives both the names of variables within dataset as well as dimensions of the dataset)

head(dataset,n1) gives the first n1 rows of dataset while
tail(dataset,n2) gives the last n2 rows of a dataset where n1,n2 are numbers and dataset is the name of the object (here a data frame that is being considered)

summary(dataset) gives you a brief summary of all variables while

library(Hmisc)
describe(dataset) gives a detailed description on the variables

simple graphics can be given by

hist(Dataset1)
and
plot(Dataset1)

As you can see in above cases, there are multiple ways to get even basic analysis about data in R- however most of the syntax commands are intutively understood (like hist for histogram, t.test for t test, plot for plot).

For detailed analysis throughout the scope of analysis, for a business analytics user it is recommended to using multiple GUI, and multiple packages. Even for highly specific and specialized analytical tasks it is recommended to check for a GUI that incorporates the required package.

The data analysis path is built on curiosity, followed by action (radar.oreilly.com)
Using Datasets in KRL (Flickr RSS) (code.kynetx.com)
R interface to Google Chart Tools (r-bloggers.com)
How To Get Experience Working With Large Datasets (highscalability.com)
A portal for European government data: PublicData.eu plans (onlinejournalismblog.com)
5 Datasets You Can Buy and Use for SEO (and a few for free!) (seomoz.org)
Integrated Longitudinal Database Available in Census Centers (kauffman.org)

Quick-R and Statmethods.net

Image via Wikipedia

I was searching for some basic syntax in R (basically cross tabs and density plots) and I came across the Quick R site.

http://www.statmethods.net/

Its really a nice site for R beginners and anyone trying to remember some syntax.

R syntax can be very simple- a histoigram is just hist(), boxplot is just boxplot() and t test is just t.test(dataset)

Here is an example from the site-

http://www.statmethods.net/graphs/density.html

# Simple Histogram hist(mtcars$mpg)

click to view

# Colored Histogram with Different Number of Bins hist(mtcars$mpg, breaks=12, col="red")

click to view

# Add a Normal Curve (Thanks to Peter Dalgaard) x <- mtcars$mpg h<-hist(x, breaks=10, col="red", xlab="Miles Per Gallon", main="Histogram with Normal Curve") xfit<-seq(min(x),max(x),length=40) yfit<-dnorm(xfit,mean=mean(x),sd=sd(x)) yfit <- yfit*diff(h$mids[1:2])*length(x) lines(xfit, yfit, col="blue", lwd=2)

click to view

Histograms can be a poor method for determining the shape of a distribution because it is so strongly affected by the number of bins used.

KERNEL DENSITY PLOTS

Kernal density plots are usually a much more effective way to view the distribution of a variable. Create the plot using plot(density(x)) where x is a numeric vector.

# Kernel Density Plot d <- density(mtcars$mpg) # returns the density data plot(d) # plots the results

click to view

# Filled Density Plot d <- density(mtcars$mpg) plot(d, main="Kernel Density of Miles Per Gallon") polygon(d, col="red", border="blue")

click to view

COMPARING GROUPS VIA KERNAL DENSITY

The sm.density.compare( ) function in the sm package allows you to superimpose the kernal density plots of two or more groups. The format is sm.density.compare(x, factor) where x is a numeric vector and factor is the grouping variable.

# Compare MPG distributions for cars with # 4,6, or 8 cylinders library(sm) attach(mtcars)


# create value labels

cyl.f <- factor(cyl, levels= c(4,6,8),

labels = c("4 cylinder", "6 cylinder", "8 cylinder"))
# plot densities

sm.density.compare(mpg, cyl, xlab="Miles Per Gallon")

title(main="MPG Distribution by Car Cylinders")

# add legend via mouse click colfill<-c(2:(2+length(levels(cyl.f)))) legend(locator(1), levels(cyl.f), fill=colfill)

click to view

It is not as exhaustive as http://cran.r-project.org/doc/manuals/R-intro.html

but it is much more simpler and easy to follow.

The site is created by Robert I. Kabacoff, Ph.D.

and he is working on a book called “R in Action”

I have received numerous requests for a hardcopy version of this site, so over the past year I have been writing a book that takes the material here and significantly expands upon it. If you are interested, early access is available.

If you have not been to that website, I recommend it highly (though the tagline or logo of R for SAS/SPSS/Stata users seems a bit familiar)-http://www.statmethods.net/index.html

Quick-R

for SAS/SPSS/Stata Users

Two Thoughts on Lisp Syntax. (kazimirmajorinc.blogspot.com)
Some Basics about Stats (psipsychologytutor.org)
Bone Density Tests: A Clue to Your Future (webmd.com)
Net Access Corporation Unveils 50,000 Square Foot, State-of-the-Art Data Center in Parsippany, New Jersey (prweb.com)
programming languages – What makes lisp macros so special – Stack Overflow (stackoverflow.com)
Thinking about Syntax (latenightpc.com)
Our minds use syntax to understand actions, just like with language [Mad Psychology] (io9.com)
Syntax highlighting for Django using Pygments (ofbrooklyn.com)
People of HTML5 – Bruce Lawson (hacks.mozilla.org)
Haskell syntax vs. Lisp syntax | LispCast (lispcast.com)

Trying out Google Prediction API from R

So I saw the news at NY R Meetup and decided to have a go at Prediction API Package (which first started off as a blog post at

http://onertipaday.blogspot.com/2010/11/r-wrapper-for-google-prediction-api.html

1)My OS was Ubuntu 10.10 Netbook

Ubuntu has a slight glitch plus workaround for installing the RCurl package on which the Google Prediction API is dependent- you need to first install this Ubuntu package for RCurl to install libcurl4-gnutls-dev

Once you install that using Synaptic,

Simply start R

2) Install Packages rjson and Rcurl using install.packages and choosing CRAN

Since GooglePredictionAPI is not yet on CRAN

3) Download that package from

https://code.google.com/p/google-prediction-api-r-client/downloads/detail?name=googlepredictionapi_0.1.tar.gz&can=2&q=

You need to copy this downloaded package to your “first library ” folder

When you start R, simply run

.libPaths()[1]

and thats the folder you copy the GooglePredictionAPI package you downloaded.

5) Now the following line works

Under R prompt,

> install.packages("googlepredictionapi_0.1.tar.gz", repos=NULL, type="source")

6) Uploading data to Google Storage using the GUI (rather than gs util)

Just go to https://sandbox.google.com/storage/

and thats the Google Storage manager

Notes on Training Data-

Use a csv file

The first column is the score column (like 1,0 or prediction score)

There are no headers- so delete headers from data file and move the dependent variable to the first column (Note I used data from the kaggle contest for R package recommendation at

http://kaggle.com/R?viewtype=data )

6) The good stuff:

Once you type in the basic syntax, the first time it will ask for your Google Credentials (email and password)

It then starts showing you time elapsed for training.

Now you can disconnect and go off (actually I got disconnected by accident before coming back in a say 5 minutes so this is the part where I think this is what happened is why it happened, dont blame me, test it for yourself) –

and when you come back (hopefully before token expires) you can see status of your request (see below)

> library(rjson)
> library(RCurl)
Loading required package: bitops
> library(googlepredictionapi)
> my.model <- PredictionApiTrain(data="gs://numtraindata/training_data")
The request for training has sent, now trying to check if training is completed
Training on numtraindata/training_data: time:2.09 seconds
Training on numtraindata/training_data: time:7.00 seconds

Note I changed the format from the URL where my data is located- simply go to your Google Storage Manager and right click on the file name for link address ( https://sandbox.google.com/storage/numtraindata/training_data.csv)

to gs://numtraindata/training_data (that kind of helps in any syntax error)

8) From the kind of high level instructions at https://code.google.com/p/google-prediction-api-r-client/, you could also try this on a local file

Usage

## Load googlepredictionapi and dependent libraries
library(rjson)
library(RCurl)
library(googlepredictionapi)

## Make a training call to the Prediction API against data in the Google Storage.
## Replace MYBUCKET and MYDATA with your data.
my.model <- PredictionApiTrain(data="gs://MYBUCKET/MYDATA")

## Alternatively, make a training call against training data stored locally as a CSV file.
## Replace MYPATH and MYFILE with your data.
my.model <- PredictionApiTrain(data="MYPATH/MYFILE.csv")

At the time of writing my data was still getting trained, so I will keep you posted on what happens.

An R interface to the Google Prediction API (revolutionanalytics.com)
Google Prediction Goes to the Movies (technoverseblog.com)
11 new APIs: Google Predictions, Amazon User Management (programmableweb.com)
R at Google (r-bloggers.com)
Google API Console Opens Up Millions of Queries Daily (programmableweb.com)
Canonical Design Team: So, you want to provide an API for the world to use? (design.canonical.com)

Related Articles

Please share:

Related Articles

Please share:

Conventional

Custom

Related Articles

Please share:

Related Articles

Please share:

Related Articles

Please share:

KERNEL DENSITY PLOTS

COMPARING GROUPS VIA KERNAL DENSITY

Quick-R

for SAS/SPSS/Stata Users

Related Articles

Please share:

Usage

Related Articles

Please share: