Graphs in Statistical Analysis

One of the seminal papers establishing the importance of data visualization (as it is now called) was the 1973 paper by F. J. Anscombe, available at http://www.sjsu.edu/faculty/gerstman/StatPrimer/anscombe1973.pdf

It has probably the most elegant introduction to an advanced statistical analysis paper that I have ever seen:

1. Usefulness of graphs

Most textbooks on statistical methods, and most statistical computer programs, pay too little attention to graphs. Few of us escape being indoctrinated with these notions:

(1) numerical calculations are exact, but graphs are rough;

(2) for any particular kind of statistical data there is just one set of calculations constituting a correct statistical analysis;

(3) performing intricate calculations is virtuous, whereas actually looking at the data is cheating.

A computer should make both calculations and graphs. Both sorts of output should be studied; each will contribute to understanding.

Of course, the dataset makes it very interesting even for people who don’t like graphical analysis too much.

From http://en.wikipedia.org/wiki/Anscombe%27s_quartet

The x values are the same for the first three datasets.

Anscombe’s Quartet
I		II		III		IV
x	y	x	y	x	y	x	y
10.0	8.04	10.0	9.14	10.0	7.46	8.0	6.58
8.0	6.95	8.0	8.14	8.0	6.77	8.0	5.76
13.0	7.58	13.0	8.74	13.0	12.74	8.0	7.71
9.0	8.81	9.0	8.77	9.0	7.11	8.0	8.84
11.0	8.33	11.0	9.26	11.0	7.81	8.0	8.47
14.0	9.96	14.0	8.10	14.0	8.84	8.0	7.04
6.0	7.24	6.0	6.13	6.0	6.08	8.0	5.25
4.0	4.26	4.0	3.10	4.0	5.39	19.0	12.50
12.0	10.84	12.0	9.13	12.0	8.15	8.0	5.56
7.0	4.82	7.0	7.26	7.0	6.42	8.0	7.91
5.0	5.68	5.0	4.74	5.0	5.73	8.0	6.89

For all four datasets:

Property	Value
Mean of x in each case	9 exact
Variance of x in each case	11 exact
Mean of y in each case	7.50 (to 2 decimal places)
Variance of y in each case	4.122 or 4.127 (to 3 d.p.)
Correlation between x and y in each case	0.816 (to 3 d.p.)
Linear regression line in each case	y = 3.00 + 0.500x (to 2 d.p. and 3 d.p. resp.)
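The summary table above can be checked with a few lines of code. Here is a minimal Python sketch, assuming the values are typed in exactly as in the quartet table; the names quartet and summarize are my own, not from the paper or from Wikipedia.

```python
from statistics import mean, variance

# Anscombe's quartet, copied from the table above.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]   # shared by datasets I, II and III
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Return the summary statistics listed in the table for one dataset."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx                 # least-squares slope
    intercept = my - slope * mx       # least-squares intercept
    return {
        "mean_x": mx,
        "var_x": variance(x),         # sample variance (n - 1 denominator)
        "mean_y": my,
        "var_y": variance(y),
        "corr": sxy / (sxx * syy) ** 0.5,
        "slope": slope,
        "intercept": intercept,
    }


for name, (x, y) in quartet.items():
    s = summarize(x, y)
    print(name, {k: round(v, 3) for k, v in s.items()})
```

The numbers come out nearly identical across all four datasets, which is exactly why the summary statistics alone tell you so little here.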

But plotted, the four datasets look nothing alike.

While R has always been great at emphasizing graphical analysis, thanks in part to work by H. Wickham and others, SAS has also modified the approach of its products and language; see http://www.sas.com/technologies/analytics/statistics/datadiscovery/

SAS Visual Data Discovery combines top-selling SAS products (Base SAS, SAS/STAT® and SAS/GRAPH®), along with two interfaces (SAS® Enterprise Guide® for guided tasks and batch analysis and JMP® software for discovery and exploratory analysis).

and ODS Statistical Graphs at

http://support.sas.com/resources/papers/76822_ODSGraph2011.pdf

While ODS Statistical Graphs is still not as smooth as, say, R’s ggplot2 (http://tinyurl.com/ggplot2-book), it is still a progressive step.

Pretty graphs make for better decisions too!

US-CERT Incident Reporting System

Here are some resources if your cyber resources have been breached. Note that the form does not use CAPTCHA at all.

US-CERT Incident Reporting System (their head Randy Vickers quit last week)

https://forms.us-cert.gov/report/

Using the US-CERT Incident Reporting System

In order for us to respond appropriately, please answer the questions as completely and accurately as possible. Questions that must be answered are labeled “Required”. As always, we will protect your sensitive information. This web site uses Secure Sockets Layer (SSL) to provide secure communications. Your browser must allow at least 40-bit encryption. This method of communication is much more secure than unencrypted email.

RStudio 3- Making R as simple as possible but no simpler

From the nice shiny blog at http://blog.rstudio.org/, a shiny new upgraded software (and I used the Cobalt theme)–this is nice!

awesome coding!!!

http://www.rstudio.org/download/

Download RStudio v0.94

If you run R on your desktop:

OR

If you run R on a Linux server and want to enable users to remotely access RStudio using a web browser:

RStudio v0.94 — Release Notes

June 15th, 2011

New Features and Enhancements

Source Editor and Console

Run code:
- Run all lines in source file
- Run to current line
- Run from current line
- Redefine current function
- Re-run previous region
- Code is now run line-by-line in the console
Brace, paren, and quote matching
Improved cursor placement after newlines
Support for regex find and replace
Optional syntax highlighting for console input
Press F1 for help on current selection
Function navigation / jump to function
Column and line number display
Manually set/switch document type
New themes: Solarized and Solarized Dark

Plots

Improved image export:
- Formats: PNG, JPEG, TIFF, SVG, BMP, Metafile, and Postscript
- Dynamic resize with preview
- Option to maintain aspect ratio when resizing
- Copy to clipboard as bitmap or metafile
Improved PDF export:
- Specify custom sizes
- Preview before exporting
Remove individual plots from history
Resizable plot zoom window

History

History tab synced to loaded .Rhistory file
New commands:
- Load and save history
- Remove individual items from history
- Clear all history
New options:
- Load history from working directory or global history file
- Save history always or only when saving .RData
- Remove duplicate entries in history
Shortcut keys for inserting into console or source

Packages

Check for package updates
Filter displayed packages
Install multiple packages
Remove packages
New options:
- Install from repository or local archive file
- Target library
- Install dependencies

Miscellaneous

Find text within help topic
Sort file listing by name, type, size, or modified
Set working directory based on source file, files pane, or browsed for directory.
Console titlebar button to view current working directory in files pane
Source file menu command
Replace space and dash with dot (.) in import dataset generated variable names
Add decimal separator preference for import dataset
Added .tar.gz (Linux) and .zip (Windows) distributions for non-admin installs
Read /etc/paths.d on OS X to ensure RStudio has the same path as terminal sessions do
Added manifest to rsession.exe to prevent unwanted program files and registry virtualization

Server

Break PAM auth into its own binary for improved compatibility with 3rd party PAM authorization modules.
Ensure that AppArmor profile is enforced even after reboot
Ability to add custom LD library path for all sessions
Improved R discovery:
- Use which R then fallback to scanning for R script
- Run R discovery unconfined then switch into restricted profile
Default to uncompressed save.image output if the administrator or user hasn’t specified their own options (improved suspend/resume performance)
Ensure all running sessions are automatically updated during server version upgrade
Added verify-installation command to rstudio-server utility for easily capturing configuration and startup related errors

Bug Fixes

Source Editor

Undo to unedited state clears now dirty bit
Extract function now captures free variables used on lhs
Selected variable highlight now visible in all themes
Syncing to source file updates made outside of RStudio now happens immediately at startup and does not cause a scroll to the bottom of the document.
Fixed various issues related to copying and pasting into word processors
Fixed incorrect syntax highlighting issues in .Rd files
Make sure font size for printed source files matches current editor setting
Eliminate conflict with Ctrl+F shortcut key on OS X
Zoomed Google Chrome browser no longer causes cursor position to be off
Don’t prevent opening of unknown file types in the editor

Console

Fixed sporadic missing underscores (and other bottom clipping of text) in console
Make sure console history is never displayed offscreen
Page Up and Page Down now work properly in the console
Substantially improved console performance for both rapid output and large quantities of output

Miscellaneous

Install successfully on Windows with special characters in home directory name
make install more tolerant of configurations where it can’t write into /usr/share
Eliminate spurious stderr output in forked children of multicore package
Ensure that file modified times always update in the files pane after a save
Always default to installing packages into first writeable path of .libPaths()
Ensure that LaTeX log files are always preserved after compilePdf
Fix conflicts with zap function from epicalc package
Eliminate shortcut key conflicts with Ubuntu desktop workspace switching shortcuts
Always prompt when attempting to save files of the same name
Maximized main window now properly restored when reopening RStudio
PAM authorization works correctly even if account has password expiration warning
Correct display of manipulate panel when Plots pane is on the left

Previous Release Notes

RStudio v0.93 — April 11th, 2011

LibreOffice Stable Release launched

Non-Oracle OpenOffice completes an important milestone. From the press release:

The Document Foundation launches LibreOffice 3.3

The first stable release of the free office suite is available for download

The Internet, January 25, 2011 – The Document Foundation launches LibreOffice 3.3, the first stable release of the free office suite developed by the community. In less than four months, the number of developers hacking LibreOffice has grown from less than twenty in late September 2010, to well over one hundred today. This has allowed us to release ahead of the aggressive schedule set by the project.

Not only does it ship a number of new and original features, LibreOffice 3.3 is also a significant achievement for a number of reasons:

– the developer community has been able to build their own and independent process, and get up and running in a very short time (with respect to the size of the code base and the project’s strong ambitions);

– thanks to the high number of new contributors having been attracted into the project, the source code is quickly undergoing a major clean-up to provide a better foundation for future development of LibreOffice;

– the Windows installer, which is going to impact the largest and most diverse user base, has been integrated into a single build containing all language versions, thus reducing the size for download sites from 75 to 11GB, making it easier for us to deploy new versions more rapidly and lowering the carbon footprint of the entire infrastructure.

Caolán McNamara from RedHat, one of the developer community leaders, comments, “We are excited: this is our very first stable release, and therefore we are eager to get user feedback, which will be integrated as soon as possible into the code, with the first enhancements being released in February. Starting from March, we will be moving to a real time-based, predictable, transparent and public release schedule, in accordance with Engineering Steering Committee’s goals and users’ requests”. The LibreOffice development roadmap is available at http://wiki.documentfoundation.org/ReleasePlan

LibreOffice 3.3 brings several unique new features. The 10 most-popular among community members are, in no particular order:

the ability to import and work with SVG files;
an easy way to format title pages and their numbering in Writer;
a more-helpful Navigator Tool for Writer;
improved ergonomics in Calc for sheet and cell management;
and Microsoft Works and Lotus Word Pro document import filters.

In addition, many great extensions are now bundled, providing

PDF import,

a slide-show presenter console,

a much improved report builder, and more besides.

A more-complete and detailed list of all the new features offered by LibreOffice 3.3 is viewable on the following web page: http://www.libreoffice.org/download/new-features-and-fixes/

LibreOffice 3.3 also provides all the new features of OpenOffice.org 3.3, such as new custom properties handling; embedding of standard PDF fonts in PDF documents; new Liberation Narrow font; increased document protection in Writer and Calc; auto decimal digits for “General” format in Calc; 1 million rows in a spreadsheet; new options for CSV import in Calc; insert drawing objects in Charts; hierarchical axis labels for Charts; improved slide layout handling in Impress; a new easier-to-use print interface; more options for changing case; and colored sheet tabs in Calc. Several of these new features were contributed by members of the LibreOffice team prior to the formation of The Document Foundation.

LibreOffice hackers will be meeting at FOSDEM in Brussels on February 5 and 6, and will be presenting their work during a one-day workshop on February 6, with speeches and hacking sessions coordinated by several members of the project.

The home of The Document Foundation is at http://www.documentfoundation.org

The home of LibreOffice is at http://www.libreoffice.org where the download page has been redesigned by the community to be more user-friendly.

*** About The Document Foundation

The Document Foundation has the mission of facilitating the evolution of the OOo Community into a new, open, independent, and meritocratic organization within the next few months. An independent Foundation is a better reflection of the values of our contributors, users and supporters, and will enable a more effective, efficient and transparent community. TDF will protect past investments by building on the achievements of the first decade, will encourage wide participation within the community, and will co-ordinate activity across the community.

*** Media Contacts for TDF

Florian Effenberger (Germany)

Mobile: +49 151 14424108 – E-mail: floeff@documentfoundation.org

Olivier Hallot (Brazil)

Mobile: +55 21 88228812 – E-mail: olivier.hallot@documentfoundation.org

Charles H. Schulz (France)

Mobile: +33 6 98655424 – E-mail: charles.schulz@documentfoundation.org

Italo Vignoli (Italy)

Mobile: +39 348 5653829 – E-mail: italo.vignoli@documentfoundation.org

LibreOffice now default Office Suite in Ubuntu 11.04 (omgubuntu.co.uk)
Ubuntu 11.04 switches to LibreOffice in latest daily builds (downloadsquad.switched.com)
Ubuntu Ditches OpenOffice For LibreOffice (informationweek.com)
Ubuntu opts for LibreOffice over Oracle’s OpenOffice (zdnet.com)
LibreOffice Is Taking Shape With Third Beta (pcworld.com)
Ubuntu 11 Switches To Libre Office (lockergnome.com)

Handling time and date in R

One of the most frustrating things I had to do while working as a financial business analyst was working with date-time formats in Base SAS. The syntax was simple enough, and SAS was quite good at handing queries to the Oracle database that the client was using, but remembering the different types of formats in the SAS language was a challenge (there were date9., date6, mmddyy, etc.).

Date and time variables are particularly important in the financial industry, as almost everything is a variable derived from time (which varies), while other inputs are mostly constants. This includes interest as well as late fees and finance fees.

In R, date and time are handled quite simply.

Use the strptime(x, format) function to convert a character string into a date-time object.

For example, if the variable dob is “01/04/1977”, then the following will convert it into a date-time object:

z=strptime(dob,"%d/%m/%Y")

and if the same date is 01Apr1977

z=strptime(dob,"%d%b%Y")

does the same

For troubleshooting help with date and time, remember to put the format codes (%d, %b, %m and %Y) in exactly the same order as the parts of the original string. If there are any delimiters, like “-” or “/”, enter them at exactly the same positions in the format argument of strptime.

Sys.time() gives you the current date-time, while the function difftime(time1, time2) gives you the time interval between two date-times (say, if you have two columns of date-time variables).
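If you also work in Python, the same idea carries over, since datetime.strptime there accepts the same style of % codes. The sketch below is a Python analogue of the R lines above, not R code; the variable names and the 2011 reference date are just illustrative assumptions.

```python
from datetime import datetime

dob_slash = "01/04/1977"   # day/month/year, as in the first R example
dob_text = "01Apr1977"     # day, abbreviated month name, year

# The format string mirrors the input: same order, same delimiters.
z1 = datetime.strptime(dob_slash, "%d/%m/%Y")
z2 = datetime.strptime(dob_text, "%d%b%Y")   # %b depends on the locale (English month names assumed)

# Subtracting two datetimes gives a timedelta, roughly what difftime() gives in R.
reference = datetime(2011, 6, 15)            # arbitrary illustrative date
interval = reference - z1

print(z1, z2, z1 == z2)
print(interval.days, "days")
```

Both strings parse to the same date, and a mismatch in order or delimiters raises a ValueError instead of silently giving a wrong date.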

What are the various formats for inputs in date time?

%a: Abbreviated weekday name in the current locale. (Also matches full name on input.)
%A: Full weekday name in the current locale. (Also matches abbreviated name on input.)
%b: Abbreviated month name in the current locale. (Also matches full name on input.)
%B: Full month name in the current locale. (Also matches abbreviated name on input.)
%c: Date and time. Locale-specific on output, "%a %b %e %H:%M:%S %Y" on input.
%d: Day of the month as decimal number (01–31).
%H: Hours as decimal number (00–23).
%I: Hours as decimal number (01–12).
%j: Day of year as decimal number (001–366).
%m: Month as decimal number (01–12).
%M: Minute as decimal number (00–59).
%p: AM/PM indicator in the locale. Used in conjunction with %I and not with %H. An empty string in some locales.
%S: Second as decimal number (00–61), allowing for up to two leap-seconds (but POSIX-compliant implementations will ignore leap seconds).

%U: Week of the year as decimal number (00–53) using Sunday as the first day 1 of the week (and typically with the first Sunday of the year as day 1 of week 1). The US convention.
%w: Weekday as decimal number (0–6, Sunday is 0).
%W: Week of the year as decimal number (00–53) using Monday as the first day of week (and typically with the first Monday of the year as day 1 of week 1). The UK convention.
%x: Date. Locale-specific on output, "%y/%m/%d" on input.
%X: Time. Locale-specific on output, "%H:%M:%S" on input.
%y: Year without century (00–99). Values 00 to 68 are prefixed by 20 and 69 to 99 by 19 – that is the behaviour specified by the 2004 POSIX standard, but it does also say ‘it is expected that in a future version the default century inferred from a 2-digit year will change’.
%Y: Year with century.
%z: Signed offset in hours and minutes from UTC, so -0800 is 8 hours behind UTC.
%Z: (output only.) Time zone as a character string (empty if not available).

Also read the helpful documentation (especially on time zones, leap seconds and date differences):

http://stat.ethz.ch/R-manual/R-patched/library/base/html/difftime.html
http://stat.ethz.ch/R-manual/R-patched/library/base/html/strptime.html
http://stat.ethz.ch/R-manual/R-patched/library/base/html/Ops.Date.html
http://stat.ethz.ch/R-manual/R-patched/library/base/html/Dates.html
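A few of the codes in the list are easier to trust once you have seen them run. This is a small Python sketch (again an analogue, assuming Python's datetime follows the same POSIX conventions described above for %y, %I, %p and %j); the sample strings are made up for illustration.

```python
from datetime import datetime

# %y: two-digit years 00-68 land in the 2000s, 69-99 in the 1900s.
pivot_years = {s: datetime.strptime(s, "%y").year for s in ("68", "69")}
print(pivot_years)

# %I goes with %p (12-hour clock); %H would be used for a 24-hour clock.
stamp = datetime.strptime("01Apr1977 11:45 PM", "%d%b%Y %I:%M %p")
print(stamp.hour)

# %j on output: day of the year as a zero-padded three-digit number.
day_of_year = stamp.strftime("%j")
print(day_of_year)
```

The %y pivot is the one to watch in financial data, since a two-digit birth year such as 45 would be read as 2045.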


PySpread Magic


Just working with PySpread, and worked on a 1 million by 1 million spreadsheet. Python sure looks promising as the way ahead for stat computing. To install it, you need to run

sudo apt-get install python-numpy python-rpy python-scipy python-gmpy wxpython*

then cd to the untarred bz2 file from http://pyspread.sourceforge.net/download.html and run the installer, like this:

~/Downloads$ cd pyspread-0.1.2

~/Downloads/pyspread-0.1.2$ sudo python setup.py install

http://pyspread.sourceforge.net/

by Martin Manns

about

Pyspread is a cross-platform Python spreadsheet application. It is based on and written in the programming language Python.

Instead of spreadsheet formulas, Python expressions are entered into the spreadsheet cells. Each expression returns a Python object that can be accessed from other cells. These objects can represent anything including lists or matrices.

features

Three dimensional grid with up to 85,899,345 rows and 14,316,555 columns (64 bit systems, depends on row height and column width). Note that a million cells require about 500 MB of memory.
Complex data types such as lists, trees or matrices within a single cell.
Macros for functionalities that are too complex for a single Python expression.
Python module access from each cell, which allows:
- Arbitrary size rational numbers (via gmpy),
- Fixed point decimal numbers for business calculations, (via the decimal module from the standard library)
- Advanced statistics including plotting functions (via RPy)
- Much more via <your favourite module>.
CSV import and export
Clipboard access
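To see why exact rationals and fixed point decimals matter for business calculations, here is a short Python sketch of the kind of expression you could type into a cell. It uses fractions.Fraction from the standard library as a stand-in for the gmpy rationals, so nothing extra needs installing; the balance and fee rate are made-up numbers.

```python
from decimal import Decimal, ROUND_HALF_UP
from fractions import Fraction

# Binary floats cannot represent 0.1 exactly, so cents can drift.
float_ok = (0.1 + 0.2 == 0.3)
decimal_ok = (Decimal("0.10") + Decimal("0.20") == Decimal("0.30"))
print(float_ok, decimal_ok)

# A late fee of 1.5% on a balance, rounded to whole cents (illustrative numbers).
balance = Decimal("1234.56")
late_fee = (balance * Decimal("0.015")).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
print(late_fee)

# Exact rational arithmetic: no rounding at all until you ask for it.
share = Fraction(1, 3) + Fraction(1, 6)
print(share)
```

Note that the Decimal values are built from strings; building them from floats would carry the float error along.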

warning

The concept of pyspread allows doing everything from each cell that a Python script can do. This powerful feature has its drawbacks. A spreadsheet may very well delete your hard drive or send your data via the Internet. Of course this is a non-issue if you sandbox properly or if you only use self developed spreadsheets.

Since this is not the case for everyone (see discussion at lwn.net), a GPG signature based trust model for spreadsheet files has been introduced. It ensures that only your own trusted files are executed on loading. Untrusted files are displayed in safe mode. You can approve a file manually. Inspect carefully.


Here comes PySpread- 85,899,345 rows and 14,316,555 columns

What’s new: one more open source analytics package, built like a spreadsheet, with the ability to import a million cells.

From http://pyspread.sourceforge.net/index.html

about	Pyspread is a cross-platform Python spreadsheet application. It is based on and written in the programming language Python. Instead of spreadsheet formulas, Python expressions are entered into the spreadsheet cells. Each expression returns a Python object that can be accessed from other cells. These objects can represent anything including lists or matrices.
features	In pyspread, cells expect Python expressions and return Python objects. Therefore, complex data types such as lists, trees or matrices can be handled within a single cell. Macros can be used for functions that are too complex for a single expression. Since Python modules can be easily used without external scripts, arbitrary size rational numbers (via gmpy), fixed point decimal numbers for business calculations, (via the decimal module from the standard library) and advanced statistics including plotting functions (via RPy) can be used in the spreadsheet. Everything is directly available from each cell. Just use the grid. Data can be imported and exported using csv files or the clipboard. Other forms of data exchange is possible using external Python modules. In order to simplify sparse matrix editing, pyspread features a three dimensional grid that can be sized up to 85,899,345 rows and 14,316,555 columns (64 bit-systems, depends on row height and column width). Note that importing a million cells requires about 500 MB of memory. The concept of pyspread allows doing everything from each cell that a Python script can do. This may very well include deleting your hard drive or sending your data via the Internet. Of course this is a non-issue if you sandbox properly or if you only use self developed spreadsheets. Since this is not the case for everyone (see the discussion at lwn.net), a GPG signature based trust model for spreadsheet files has been introduced. It ensures that only your own trusted files are executed on loading. Untrusted files are displayed in safe mode. You can trust a file manually. Inspect carefully.
requirements	Pyspread runs on Linux, Windows and *nix platforms with GTK+ support. There are reports that it works with MacOS X as well. If you would like to contribute by testing on OS X please contact me. Dependencies: Python >=2.4 <3.0, numpy >=1.1.0 and wxPython >=2.8.10.1. Highly recommended for full functionality: PyMe >=0.8.1, gmpy >=1.1.0 and rpy >=1.0.3. Note for Windows™ users: If you want to use signatures without compiling PyMe, try out Gpg4win.
maturity	Pyspread is in early Beta release. This means that the core functionality is fully implemented but the program needs testing and polish.

and from the wiki

http://sourceforge.net/apps/mediawiki/pyspread/index.php?title=Main_Page

a spreadsheet with more powerful functions and data structures that are accessible inside each cell. Something like Python that empowers you to do things quickly. And yes, it should be free and it should run on Linux as well as on Windows. I looked around and found nothing that suited me. Therefore, I started pyspread.

Concept

Each cell accepts any input that works in a Python command line.
The inputs are parsed and evaluated by Python’s eval command.
The result objects are accessible via a 3D numpy object array.
String representations of the result objects are displayed in the cells.
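The four points of the concept can be tried out in a toy model. The sketch below is purely hypothetical and is not pyspread's code: it keeps cell sources in a plain dict keyed by (row, column, table) instead of a 3D numpy object array, and the class name TinyGrid and the name S for the grid inside cells are my own assumptions.

```python
class TinyGrid:
    """Hypothetical toy model of an eval-based spreadsheet (not pyspread itself)."""

    def __init__(self):
        self.code = {}  # (row, col, table) -> Python expression as text

    def __setitem__(self, key, expression):
        # Cells store source text, like typing into a spreadsheet cell.
        self.code[key] = expression

    def __getitem__(self, key):
        # Evaluate on access; the grid is exposed to cells as S,
        # so one cell can use the result object of another.
        return eval(self.code[key], {"S": self})

    def display(self, key):
        # What the cell would show: the string form of the result object.
        return str(self[key])


grid = TinyGrid()
grid[0, 0, 0] = "[1, 2, 3]"                      # a whole list in one cell
grid[1, 0, 0] = "sum(S[0, 0, 0])"                # refers to the cell above
grid[2, 0, 0] = "[x ** 2 for x in S[0, 0, 0]]"   # comprehension over another cell

for row in range(3):
    print(row, grid.display((row, 0, 0)))
```

Because every cell goes through eval, a cell can run anything a Python script can, so only load code you trust.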

Benefits

Each cell returns a Python object. This object can be anything including arrays and third party library objects.
Generator expressions can be used efficiently for data manipulation.
Efficient numpy slicing is used.
numpy methods are accessible for the data.

Installation

Download the pyspread tarball or zip and unzip at a convenient place
In case you do not have it already get and install Python, wxpython and numpy

If you want the examples to work, install gmpy, R and rpy

Really do check the version requirements that are mentioned on http://pyspread.sf.net

Get install privileges (e.g. become root)
Change into the directory and type

python setup.py install

Windows: Replace “python” with your Python interpreter (absolute path)

Become normal user again
Start pyspread by typing

pyspread

Enjoy
