Writing Task 1 (a report)
The three pie charts below show what young Australian secondary school leavers did in 1980, 1990 and 2000. Each pie shows the proportion of school leavers who continued studying, were employed or were unemployed. Write a report for a university lecturer describing the pie charts.
Here are some broad guidelines for graphs from EIA.gov, so you could say these are official graphics guidelines from the US government.
They can be really useful for sites planning to get into the Tableau Software/NYT/Guardian infographic mode, or even for communities of blogs that have recurrent needs to display graphical plots, particularly since communication, statistics and design are different areas of expertise, usually held by different people.
Energy Information Administration Standard 2009-25
Title: Statistical Graphs
Superseded Version: Standard 2002-25
Purpose: To ensure the utility (usefulness to intended users) and objectivity (accuracy, clarity, completeness, and lack of bias) of energy information presented in statistical graphs.
Applicability: All EIA information products.
Graphs should be used to show and compare changes, trends and/or relationships, and to assist users in visualizing the conclusions drawn from the data represented.
Bar Charts and Histograms
Bar charts are one of the most widely used types of business charts. Even the ever-popular histograms are special cases of bar charts: they are simply bar charts of frequencies, with each bar showing how often values fall into a given range.
Basically, a bar chart shows rectangular bars with lengths proportional to the quantities being described. It helps the reader compare relative quantities across categories.
The barplot() command is used for making bar plots, while hist() is used for histograms. You can also use the plot() command with type="h" to draw histogram-like vertical lines. The official R manual also suggests that dot plots, drawn with dotchart(), are a reasonable substitute for bar plots.
A very simple, easy-to-understand tutorial for basic bar plots is at http://msenux.redwoods.edu/math/R/barplot.php
The differences between the three main functions that can be used for these charts are shown below-
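As a rough illustration of how barplot(), hist() and plot() with type="h" differ, here is a minimal sketch in base R. The sales and order_values data are made up for this example and are not from any real dataset.

```r
# Hypothetical data, invented for illustration only
sales <- c(North = 120, South = 95, East = 140, West = 80)  # one value per category
set.seed(42)
order_values <- rnorm(200, mean = 50, sd = 10)               # a continuous variable

op <- par(mfrow = c(1, 3))   # three panels side by side

# barplot(): one bar per category, height equal to the supplied value
bar_mids <- barplot(sales, main = "barplot()", ylab = "Units sold")

# hist(): bins a continuous variable and draws the count in each bin
h <- hist(order_values, main = "hist()", xlab = "Order value")

# plot(type = "h"): vertical lines from zero up to each y value;
# here the y values are the bin counts already computed by hist()
plot(h$mids, h$counts, type = "h", lwd = 4,
     main = 'plot(type = "h")', xlab = "Order value", ylab = "Frequency")

par(op)                      # restore the previous layout
```

Note that barplot() and plot(type="h") draw whatever values you hand them, while hist() does the binning and counting itself.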
A line chart is one of the most commonly used charts in business analytics and metrics reporting. It basically consists of two variables plotted along the axes with the adjacent points being joined by line segments. Most often used with time series on the x-axis, line charts are simple to understand and use.
Variations on the line graph include fan charts for time series, which join a line chart of historical data to ranges of future projections. Another common variation is to fit a linear regression or trend line between the two variables and superimpose it on the graph.
The slope of the line chart shows the rate of change at that particular point, and can also be used to highlight areas of discontinuity or irregular change between two variables.
A basic line graph is created by first using the plot() function to plot the points and then the lines() function to draw the lines between the points.
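A minimal sketch of that two-step approach, with a superimposed trend line as described above. The monthly revenue figures are hypothetical and only there to make the example runnable.

```r
# Hypothetical monthly revenue figures, invented for illustration
month   <- 1:12
revenue <- c(10, 12, 11, 15, 14, 18, 17, 21, 20, 24, 23, 27)

# Step 1: plot() draws the points
plot(month, revenue, pch = 16,
     main = "Monthly revenue", xlab = "Month", ylab = "Revenue")

# Step 2: lines() joins adjacent points with line segments
lines(month, revenue)

# Variation: superimpose a linear regression (trend) line
fit <- lm(revenue ~ month)
abline(fit, lty = 2, col = "red")

# The slope of each segment is the rate of change between adjacent points
segment_slopes <- diff(revenue) / diff(month)
```

You can also get points and lines in one call with plot(month, revenue, type = "b"), but keeping the two steps separate makes it easier to overlay several series on the same axes.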
I just checked out this new software for making PMML models. It is called Augustus and is created by the Open Data Group (http://opendatagroup.com/), which is headed by Robert Grossman, who was the first proponent of using R on Amazon EC2.
Probably someone like Zementis (http://adapasupport.zementis.com/) can use this to further test, enhance or benchmark on EC2. They did have a joint webinar with Revolution Analytics recently.
Augustus is a PMML 4-compliant scoring engine that works with segmented models. Augustus is designed for use with statistical and data mining models. The new release provides Baseline, Tree and Naive-Bayes producers and consumers.
There is also a version for use with PMML 3 models. It is able to produce and consume models with 10,000s of segments and conforms to a PMML draft RFC for segmented models and ensembles of models. It supports Baseline, Regression, Tree and Naive-Bayes.
Augustus is written in Python and is freely available under the GNU General Public License, version 2.
Predictive Model Markup Language (PMML) is an XML mark up language to describe statistical and data mining models. PMML describes the inputs to data mining models, the transformations used to prepare data for data mining, and the parameters which define the models themselves. It is used for a wide variety of applications, including applications in finance, e-business, direct marketing, manufacturing, and defense. PMML is often used so that systems which create statistical and data mining models (“PMML Producers”) can easily inter-operate with systems which deploy PMML models for scoring or other operational purposes (“PMML Consumers”).
Change Detection using Augustus
For information regarding using Augustus with Change Detection and Health and Status Monitoring, please see change-detection.
Open Data Group provides management consulting services, outsourced analytical services, analytic staffing, and expert witnesses broadly related to data and analytics. It has experience with customer data, supplier data, financial and trading data, and data from internal business processes.
It has staff in Chicago and San Francisco and clients throughout the U.S. Open Data Group began operations in 2002.
The above example contains plots generated in R of scoring results from Augustus. Each point on the graph represents a use of the scoring engine and a chart is an aggregation of multiple Augustus runs. A Baseline (Change Detection) model was used to score data with multiple segments.
Augustus is typically used to construct models and score data with models. Augustus includes a dedicated application for creating, or producing, predictive models rendered as PMML-compliant files. Scoring is accomplished by consuming PMML-compliant files describing an appropriate model. Augustus provides a dedicated application for scoring data with four classes of models, Baseline (Change Detection) Models, Tree Models, Regression Models and Naive Bayes Models. The typical model development and use cycle with Augustus is as follows:
Identify suitable data with which to construct a new model.
Provide a model schema which prescribes the requirements for the model.
Run the Augustus producer to obtain a new model.
Run the Augustus consumer on new data to effect scoring.
Separate consumer and producer applications are supplied for Baseline (Change Detection) models, Tree models, Regression models and for Naive Bayes models. The producer and consumer applications require configuration with XML-formatted files. The specification of the configuration files and model schema is detailed below. The consumers provide for some configurability of the output but users will often provide additional post-processing to render the output according to their needs. A variety of mechanisms exist for transmitting data but users may need to provide their own preprocessing to accommodate their particular data source.
In addition to the producer and consumer applications, Augustus is conceptually structured and provided with libraries which are relevant to the development and use of Predictive Models. Broadly speaking, these consist of components that address the use of PMML and components that are specific to Augustus.
Augustus can accommodate a post-processing step. While not necessary, it is often useful to:
Re-normalize the scoring results or perform an additional transformation.
Supplement the results with global metadata such as timestamps.
Format the results.
Select certain interesting values from the results.
Restructure the data for use with other applications.
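Since the scoring plots mentioned earlier were generated in R, here is a rough sketch of what such post-processing steps could look like in base R. The results data frame and its column names are assumptions made up for this example; they are not Augustus's actual output format, and no Augustus API is used.

```r
# Hypothetical scoring output: one row per scored event.
# The column names are assumptions, not Augustus's actual output format.
results <- data.frame(
  segment = c("A", "A", "B", "B", "C"),
  score   = c(2.0, 4.0, 1.0, 3.0, 10.0),
  stringsAsFactors = FALSE
)

# 1. Re-normalize the scores to the [0, 1] range
rng <- range(results$score)
results$score_norm <- (results$score - rng[1]) / (rng[2] - rng[1])

# 2. Supplement the results with global metadata such as a timestamp
results$processed_at <- format(Sys.time(), "%Y-%m-%d %H:%M:%S")

# 3. Format the results (two decimal places, as text)
results$score_label <- sprintf("%.2f", results$score_norm)

# 4. Select the interesting values (here: scores in the top half of the range)
flagged <- results[results$score_norm >= 0.5, ]

# 5. Restructure for another application: mean normalized score per segment
by_segment <- aggregate(score_norm ~ segment, data = results, FUN = mean)
```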
Import your own data
Upload data tables from spreadsheets or CSV files, even KML. Developers can use the Fusion Tables API to insert, update, delete and query data programmatically. You can export your data as CSV or KML too.
Visualize it instantly
See the data on a map or as a chart immediately. Use filters for more selective visualizations.
Publish your visualization on other web properties
Now that you’ve got that nice map or chart of your data, you can embed it in a web page or blog post. Or send a link by email or IM. It will always display the latest data values from your table and helps you communicate your story more easily.
I have not really been posting or writing anything worthwhile on the website for some time, as I am still busy writing "R for Business Analytics", which I hope to get out before year end. However, while doing research for it, I came across many types of graphs, and what struck me is that the actual usage of some kinds of graphs is very different in business analytics as compared to statistical computing.
The criteria for the top ten graphs are as follows-
1) Usage- The order in which they appear is not strictly in terms of desirability but of actual frequency of usage. So a frequently used graph like a box plot would be recommended above, say, a violin plot.
2) Adequacy- Data visualization paradigms change over time, but the need remains to convey the maximum information accurately in a minimum of space, without overwhelming the reader or creating misleading perceptions of the data.
3) Ease of creation- A simpler graph created by a single function is preferable to writing 4-5 lines of code to create an elaborate graph.
4) Aesthetics- Aesthetics is relative, and in addition studies have shown that visual perception varies across cultures and geographies. However, beauty is universally appreciated, and a pretty graph is often preferred over a not-so-pretty graph. Here, being pretty means visual appeal that does not compromise perceptual inference from graphical analysis.
So when do we use a bar chart versus a line graph versus a pie chart? When is a mosaic plot more handy, and when should histograms be used with density plots? The list tries to capture most of these practicalities.
Let me elaborate on some specific graphs-
1) Pie Chart- The pie chart is not really used much in statistical computing, and indeed it is considered a misleading form of data visualization, especially the skewed or two-dimensional charts. However, when it comes to evaluating market share at a particular instant, a pie chart is simple to understand. At most, two pie charts are needed for comparing two different snapshots, but three or more pie charts of the same data at different points in time is definitely a bad case.
In R you can create a pie chart just by using pie(dataset$variable)
As per the official R documentation for pie(), pie charts are not recommended at all:
Pie charts are a very bad way of displaying information. The eye is good at judging linear measures and bad at judging relative areas. A bar chart or dot chart is a preferable way of displaying this type of data.
Cleveland (1985), page 264: “Data that can be shown by pie charts always can be shown by a dot chart. This means that judgements of position along a common scale can be made instead of the less accurate angle judgements.” This statement is based on the empirical investigations of Cleveland and McGill as well as investigations by perceptual psychologists.
Despite this, pie charts are frequently used, because an important metric they convey well is market share. Market share remains an important analytical metric for business.
The pie3D() function in the plotrix package provides 3D exploded pie charts. An exploded pie chart remains a very commonly used (or misused) chart.
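To make the market share case concrete, here is a small sketch that draws the same hypothetical market shares both as a pie chart and as the dot chart that the R documentation prefers. The company names and percentages are invented for illustration.

```r
# Hypothetical market shares (percent), invented for illustration
share <- c(AlphaCo = 38, BetaCo = 27, GammaCo = 20, Others = 15)

op <- par(mfrow = c(1, 2))   # two panels side by side

# Pie chart: each slice's angle is proportional to the share
pie(share, labels = paste0(names(share), " ", share, "%"),
    main = "Market share (pie)")

# Dot chart of the same data: shares are read as positions on a common scale
dotchart(sort(share), xlab = "Market share (%)",
         main = "Market share (dot chart)")

par(op)                      # restore the previous layout
```

With only four categories both versions are readable; the dot chart's advantage grows as the number of categories increases or as the shares get closer together.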
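# draw 24 equal slices, each in a different rainbow colour
pie(rep(1,24), col=rainbow(24), radius=0.9)
# add an italic main title and a smaller italic x-axis label
title(main="Color Wheel", cex.main=1.4, font.main=3)
title(xlab="(test)", cex.lab=0.8, font.lab=3)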
(Note: adding a grey background is quite easy in the base graphics device as well, without using an advanced graphics package.)
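One way to get that grey background, as a minimal sketch using only the base graphics parameter bg set through par():

```r
op <- par(bg = "gray")   # set the device background before drawing
pie(rep(1, 24), col = rainbow(24), radius = 0.9)
title(main = "Color Wheel", cex.main = 1.4, font.main = 3)
par(op)                  # restore the previous background setting
```

The background has to be set before the plot is drawn, since it is applied when a new page is started.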