Graphs in Statistical Analysis

One of the seminal papers establishing the importance of data visualization (as it is now called) was the 1973 paper by F J Anscombe in http://www.sjsu.edu/faculty/gerstman/StatPrimer/anscombe1973.pdf

It has probably the most elegant introduction to an advanced statistical analysis paper that I have ever seen-

1. Usefulness of graphs

Most textbooks on statistical methods, and most statistical computer programs, pay too little attention to graphs. Few of us escape being indoctrinated with these notions:

(1) numerical calculations are exact, but graphs are rough;

(2) for any particular kind of statistical data there is just one set of calculations constituting a correct statistical analysis;

(3) performing intricate calculations is virtuous, whereas actually looking at the data is cheating.

A computer should make both calculations and graphs. Both sorts of output should be studied; each will contribute to understanding.

Of course the dataset makes it very very interesting for people who dont like graphical analysis too much.

From http://en.wikipedia.org/wiki/Anscombe%27s_quartet

The x values are the same for the first three datasets.

Anscombe’s Quartet
I		II		III		IV
x	y	x	y	x	y	x	y
10.0	8.04	10.0	9.14	10.0	7.46	8.0	6.58
8.0	6.95	8.0	8.14	8.0	6.77	8.0	5.76
13.0	7.58	13.0	8.74	13.0	12.74	8.0	7.71
9.0	8.81	9.0	8.77	9.0	7.11	8.0	8.84
11.0	8.33	11.0	9.26	11.0	7.81	8.0	8.47
14.0	9.96	14.0	8.10	14.0	8.84	8.0	7.04
6.0	7.24	6.0	6.13	6.0	6.08	8.0	5.25
4.0	4.26	4.0	3.10	4.0	5.39	19.0	12.50
12.0	10.84	12.0	9.13	12.0	8.15	8.0	5.56
7.0	4.82	7.0	7.26	7.0	6.42	8.0	7.91
5.0	5.68	5.0	4.74	5.0	5.73	8.0	6.89

For all four datasets:

Property	Value
Mean of x in each case	9 exact
Variance of x in each case	11 exact
Mean of y in each case	7.50 (to 2 decimal places)
Variance of y in each case	4.122 or 4.127 (to 3 d.p.)
Correlation between x and y in each case	0.816 (to 3 d.p.)
Linear regression line in each case	y = 3.00 + 0.500x (to 2 d.p. and 3 d.p. resp.)

But see the graphical analysis –

While R has always been great in emphasizing graphical analysis, thanks in part due to work by H Wickham and others, SAS products and language has also modified its approach at http://www.sas.com/technologies/analytics/statistics/datadiscovery/

SAS Visual Data Discovery combines top-selling SAS products (Base SAS, SAS/STAT® and SAS/GRAPH®), along with two interfaces (SAS® Enterprise Guide® for guided tasks and batch analysis and JMP® software for discovery and exploratory analysis).

and ODS Statistical Graphs at

http://support.sas.com/resources/papers/76822_ODSGraph2011.pdf

While ODS Statistical graphs is still not as smooth as say R’s GGPLOT2 http://tinyurl.com/ggplot2-book, it still is a progressive step

Pretty graphs make for better decisions too !