
Data Frame in Python

Exploring some Python packages and R packages that let you move between, and work with, both Python and R without melting your brain or exceeding your project deadline

—————————————

If you like the data.frame structure in R, there are several ways to work with a similar structure in Python at a faster processing speed.

Here are three packages that enable you to do so:

(1) pydataframe http://code.google.com/p/pydataframe/

An implementation of an almost R-like DataFrame object. (Install via PyPI/pip: “pip install pydataframe”)

Usage:

        u = DataFrame({"Field1": [1, 2, 3],
                       "Field2": ['abc', 'def', 'hgi']},
                      # optional: column order, then row names
                      ['Field1', 'Field2'],
                      ["rowOne", "rowTwo", "thirdRow"])

A DataFrame is basically a table with rows and columns.

Columns are named, rows are numbered (but can be named), and both can be easily selected and calculated upon. Internally, columns are stored as 1d numpy arrays. If you set row names, they're converted into a dictionary for fast access. There is a rich subselection/slicing API, see help(DataFrame.get_item) (it also works for setting values). Please note that any slice gets you another DataFrame; to access individual entries, use get_row(), get_column(), get_value().

DataFrames also understand basic arithmetic and you can either add (multiply,...) a constant value, or another DataFrame of the same size / with the same column names, like this:

#multiply every value in ColumnA that is smaller than 5 by 6.
my_df[my_df[:,'ColumnA'] < 5, 'ColumnA'] *= 6

#you always need to specify both row and column selectors, use : to mean everything
my_df[:, 'ColumnB'] = my_df[:,'ColumnA'] + my_df[:, 'ColumnC']

#let's take every row that starts with Shu in ColumnA and replace it with a new list (comprehension)
select = my_df.where(lambda row: row['ColumnA'].startswith('Shu'))
my_df[select, 'ColumnA'] = [row['ColumnA'].replace('Shu', 'Sha') for row in my_df[select,:].iter_rows()]

DataFrames talk directly to R via rpy2 (rpy2 is not a prerequisite for the library!)

 

(2) pandas http://pandas.pydata.org/

Library Highlights

  • A fast and efficient DataFrame object for data manipulation with integrated indexing;
  • Tools for reading and writing data between in-memory data structures and different formats: CSV and text files, Microsoft Excel, SQL databases, and the fast HDF5 format;
  • Intelligent data alignment and integrated handling of missing data: gain automatic label-based alignment in computations and easily manipulate messy data into an orderly form;
  • Flexible reshaping and pivoting of data sets;
  • Intelligent label-based slicing, fancy indexing, and subsetting of large data sets;
  • Columns can be inserted and deleted from data structures for size mutability;
  • Aggregating or transforming data with a powerful group by engine allowing split-apply-combine operations on data sets;
  • High performance merging and joining of data sets;
  • Hierarchical axis indexing provides an intuitive way of working with high-dimensional data in a lower-dimensional data structure;
  • Time series-functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging. Even create domain-specific time offsets and join time series without losing data;
  • The library has been ruthlessly optimized for performance, with critical code paths compiled to C;
  • Python with pandas is in use in a wide variety of academic and commercial domains, including Finance, Neuroscience, Economics, Statistics, Advertising, Web Analytics, and more.
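Two of the highlights above, group-by and label-based alignment, are easier to see in a few lines of code. This is only a minimal sketch, assuming pandas is installed; the column names and numbers are made up for illustration.

```python
import pandas as pd

# Hypothetical data, invented for this illustration.
df = pd.DataFrame(
    {"city": ["Delhi", "Pune", "Delhi", "Pune"],
     "sales": [10, 20, 30, 40]},
    index=["r1", "r2", "r3", "r4"],
)

# Split-apply-combine: total sales per city.
totals = df.groupby("city")["sales"].sum()

# Label-based alignment: the two Series share only the label "b",
# so the labels "a" and "c" come out as missing values (NaN).
s1 = pd.Series([1.0, 2.0], index=["a", "b"])
s2 = pd.Series([10.0, 20.0], index=["b", "c"])
aligned = s1 + s2

print(totals)
print(aligned)
```

The addition lines values up by index label rather than by position, which is roughly the behaviour the "intelligent data alignment" bullet describes.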

Why not R?

First of all, we love open source R! It is the most widely-used open source environment for statistical modeling and graphics, and it provided some early inspiration for pandas features. R users will be pleased to find this library adopts some of the best concepts of R, like the foundational DataFrame (one user familiar with R has described pandas as “R data.frame on steroids”). But pandas also seeks to solve some frustrations common to R users:

  • R has barebones data alignment and indexing functionality, leaving much work to the user. pandas makes it easy and intuitive to work with messy, irregularly indexed data, like time series data. pandas also provides rich tools, like hierarchical indexing, not found in R;
  • R is not well-suited to general purpose programming and system development. pandas enables you to do large-scale data processing seamlessly when developing your production applications;
  • Hybrid systems connecting R to a low-productivity systems language like Java, C++, or C# suffer from significantly reduced agility and maintainability, and you’re still stuck developing the system components in a low-productivity language;
  • The “copyleft” GPL license of R can create concerns for commercial software vendors who want to distribute R with their software under another license. Python and pandas use more permissive licenses.

(3) datamatrix http://pypi.python.org/pypi/datamatrix/0.8

datamatrix 0.8

A Pythonic implementation of R's data.frame structure.

Latest Version: 0.9

This module allows access to comma- or other delimiter separated files as if they were tables, using a dictionary-like syntax. DataMatrix objects can be manipulated, rows and columns added and removed, or even transposed

-----------------------------------------------------------------

Modeling in Python

Also, if you like to build models and you are an R user, the Python package patsy can really help.

(4) http://patsy.readthedocs.org/en/latest/overview.html

patsy is a Python package for describing statistical models and building design matrices. It is closely inspired by and compatible with the ‘formula’ mini-language used in R and S.

For instance, if we have some variable y, and we want to regress it against some other variables x, a, b, and the interaction of a and b, then we simply write:

patsy.dmatrices("y ~ x + a + b + a:b", data)

and Patsy takes care of building appropriate matrices. Furthermore, it:

  • Allows data transformations to be specified using arbitrary Python code: instead of x, we could have written log(x), (x > 0), or even log(x) if x > 1e-5 else log(1e-5),
  • Provides a range of convenient options for coding categorical variables, including automatic detection and removal of redundancies,
  • Knows how to apply ‘the same’ transformation used on original data to new data, even for tricky transformations like centering or standardization (critical if you want to use your model to make predictions),
  • Has an incremental mode to handle data sets which are too large to fit into memory at one time,
  • Provides a language for symbolic, human-readable specification of linear constraint matrices,
  • Has a thorough test suite and solid underlying theory, allowing it to correctly handle corner cases that even R gets wrong, and
  • Features a simple API for integration into statistical packages.
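To get a feel for what "building a design matrix" means for the formula y ~ x + a + b + a:b, here is a hand-rolled NumPy sketch. It is not patsy's implementation and it assumes every variable is numeric (patsy also handles categorical coding, transformations and much more); the data values are invented.

```python
import numpy as np

# Hypothetical numeric data, invented for this illustration.
x = np.array([1.0, 2.0, 3.0, 4.0])
a = np.array([0.5, 1.5, 2.5, 3.5])
b = np.array([2.0, 1.0, 0.0, -1.0])

# Right-hand side of "y ~ x + a + b + a:b" for numeric variables:
# an intercept column, x, a, b, and the interaction a:b,
# which is the elementwise product of a and b.
X = np.column_stack([np.ones_like(x), x, a, b, a * b])

print(X.shape)
print(X)
```

Each row of X is one observation and each column is one term of the formula, which is the shape a regression routine expects.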

 

For more models, you should use the statsmodels Python package.

(5) Statsmodels http://statsmodels.sourceforge.net/

Statsmodels is a Python module that allows users to explore data, estimate statistical models, and perform statistical tests. An extensive list of descriptive statistics, statistical tests, plotting functions, and result statistics are available for different types of data and each estimator. Researchers across fields may find that statsmodels fully meets their needs for statistical computing and data analysis in Python. Features include:

  • Linear regression models
  • Generalized linear models
  • Discrete choice models
  • Robust linear models
  • Many models and functions for time series analysis
  • Nonparametric estimators
  • A collection of datasets for examples
  • A wide range of statistical tests
  • Input-output tools for producing tables in a number of formats (Text, LaTex, HTML) and for reading Stata files into NumPy and Pandas.
  • Plotting functions
  • Extensive unit tests to ensure correctness of results
  • Many more models and extensions in development

-----------------------------------------------------------------------

Graphing

For graphing you can see the wiki at http://wiki.python.org/moin/NumericAndScientific/Plotting

Apparently the best option is the Python package (6) matplotlib http://matplotlib.sourceforge.net/ (or better, see the gallery at http://matplotlib.sourceforge.net/gallery.html).

------------------------------------------------------------------------------

Using R and Python together

(7) just switch to R using Rpy2 (and use ggplot) http://rpy.sourceforge.net/rpy2.html

rpy2 is a redesign and rewrite of rpy. It provides a low-level interface to R, a proposed high-level interface, including wrappers to graphical libraries, as well as R-like structures and functions.

 

(8) Also, if you want to switch between NumPy and R, you can use Dirk E's new package at http://cran.r-project.org/web/packages/RcppCNPy/RcppCNPy.pdf

(9) If you are totally new to Python, you may wonder what NumPy and SciPy are.

http://numpy.scipy.org/ NumPy is the fundamental package for scientific computing with Python. It contains among other things:

  • a powerful N-dimensional array object
  • sophisticated (broadcasting) functions
  • tools for integrating C/C++ and Fortran code
  • useful linear algebra, Fourier transform, and random number capabilities
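If "broadcasting" in the list above sounds mysterious, this small sketch (assuming only that NumPy is installed) shows the idea: a smaller array is stretched across a larger one without writing a loop.

```python
import numpy as np

m = np.arange(6).reshape(2, 3)   # [[0 1 2], [3 4 5]]
row = np.array([10, 20, 30])     # shape (3,)
col = np.array([[100], [200]])   # shape (2, 1)

# Broadcasting stretches the smaller operand across the larger one.
plus_row = m + row               # row is added to every row of m
plus_col = m + col               # col is added to every column of m

print(plus_row)
print(plus_col)
```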

and

(10) SciPy: umbrella project which includes a variety of high level science and engineering modules together as a single package. SciPy includes modules for linear algebra (including wrappers to BLAS and LAPACK), optimization, integration, special functions, FFTs, signal and image processing, ODE solvers, and others.

Also check out http://www.scipy.org/Topical_Software if you are into scientific research.

-----------------------------------------------------------------------

You can read some excellent reference cards at http://mathesaurus.sourceforge.net

 

Matlab Python Rstats Xref

NumPy for R (and S-Plus) users

Concatenation (vectors)

R/S-Plus | Python | Description
c(a,a) | concatenate((a,a)) | Concatenate two vectors
c(1:4,a) | concatenate((range(1,5),a)) |

Repeating

R/S-Plus | Python | Description
rep(a,times=2) | concatenate((a,a)) | 1 2 3, 1 2 3
rep(a,each=3) | a.repeat(3) | 1 1 1, 2 2 2, 3 3 3
rep(a,a) | a.repeat(a) | 1, 2 2, 3 3 3
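The concatenation and repetition rows above can be tried directly. A minimal sketch, assuming NumPy is imported as np and a is the vector 1 2 3 used in the Description column:

```python
import numpy as np

a = np.array([1, 2, 3])

twice = np.concatenate((a, a))                     # like rep(a, times=2)
each = a.repeat(3)                                 # like rep(a, each=3)
ragged = a.repeat(a)                               # like rep(a, a)
with_range = np.concatenate((np.arange(1, 5), a))  # like c(1:4, a)

print(twice, each, ragged, with_range)
```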

Miss those elements out

R/S-Plus | Python | Description
a[-1] | a[1:] | miss the first element
a[-10] | | miss the tenth element
a[-seq(1,50,3)] | | miss 1,4,7, ...
| a[-1] | last element
| a[-2:] | last two elements
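Negative indices are an easy trap when moving from R: in R, a[-1] drops the first element, while in Python a[-1] is the last element. The sketch below uses np.delete as one possible stand-in for R's "drop these positions"; that choice is mine and is not part of the reference card.

```python
import numpy as np

a = np.arange(1, 11)                 # the values 1..10

last = a[-1]                         # Python: the last element
without_first = a[1:]                # R: a[-1]
without_tenth = np.delete(a, 9)      # R: a[-10] (position 10 is index 9)
# R: a[-seq(1,10,3)] drops positions 1, 4, 7, 10 (indices 0, 3, 6, 9)
without_some = np.delete(a, np.arange(0, 10, 3))

print(last, without_first, without_tenth, without_some)
```

np.delete returns a new array and leaves a untouched, and its positions are zero-based.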

Maximum and minimum

R/S-Plus | Python | Description
pmax(a,b) | maximum(a,b) | pairwise max
max(a,b) | concatenate((a,b)).max() | max of all values in two vectors
v <- max(a) ; i <- which.max(a) | v,i = a.max(0),a.argmax(0) |

Vector multiplication

R/S-Plus | Python | Description
a*a | a*a | Multiply two vectors
| dot(u,v) | Vector dot product, $u \cdot v$

Matrices

R/S-Plus | Python | Description
rbind(c(2,3),c(4,5)); array(c(2,3,4,5), dim=c(2,2)) | a = array([[2,3],[4,5]]) | Define a matrix

Concatenation (matrices); rbind and cbind

R/S-Plus | Python | Description
rbind(a,b) | concatenate((a,b), axis=0) or vstack((a,b)) | Bind rows
cbind(a,b) | concatenate((a,b), axis=1) or hstack((a,b)) | Bind columns
| concatenate((a,b), axis=2) or dstack((a,b)) | Bind slices (three-way arrays)
| concatenate((a,b), axis=None) | Concatenate matrices into one vector
rbind(1:4,1:4) | concatenate((r_[1:5],r_[1:5])).reshape(2,-1) or vstack((r_[1:5],r_[1:5])) | Bind rows (from vectors)
cbind(1:4,1:4) | | Bind columns (from vectors)

Array creation

R/S-Plus | Python | Description
matrix(0,3,5) or array(0,c(3,5)) | zeros((3,5)) | 0 filled array (floats)
| zeros((3,5), dtype=int) | 0 filled array of integers
matrix(1,3,5) or array(1,c(3,5)) | ones((3,5)) | 1 filled array
matrix(9,3,5) or array(9,c(3,5)) | full((3,5), 9) | Any number filled array
diag(1,3) | identity(3) | Identity matrix
diag(c(4,5,6)) | diag((4,5,6)) | Diagonal
| a = empty((3,3)) | Empty array

Reshape and flatten matrices

R/S-Plus | Python | Description
matrix(1:6,nrow=3,byrow=T) | arange(1,7).reshape(3,-1) or a.shape = (3,2) | Reshaping (rows first)
matrix(1:6,nrow=2) or array(1:6,c(2,3)) | arange(1,7).reshape(-1,2).transpose() | Reshaping (columns first)
as.vector(t(a)) | a.flatten() | Flatten to vector (by rows, like comics)
as.vector(a) | a.flatten('F') | Flatten to vector (by columns)
a[row(a) <= col(a)] | | Flatten upper triangle (by columns)
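R fills matrices column by column while NumPy fills row by row unless told otherwise, so reshaping is where results most often differ. A minimal sketch, assuming NumPy; order="F" is NumPy's keyword for column-first (Fortran-style) layout:

```python
import numpy as np

v = np.arange(1, 7)                       # 1..6

by_rows = v.reshape(3, 2)                 # like matrix(1:6, nrow=3, byrow=T)
by_cols = v.reshape((2, 3), order="F")    # like matrix(1:6, nrow=2)

flat_rows = by_cols.flatten()             # like as.vector(t(a))
flat_cols = by_cols.flatten(order="F")    # like as.vector(a)

print(by_rows)
print(by_cols)
print(flat_rows, flat_cols)
```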

Shared data (slicing)

R/S-Plus | Python | Description
b = a | b = a.copy() | Copy of a
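The reason the Python column needs .copy() is that plain assignment and slicing in NumPy share data with the original array. A small sketch of the difference, assuming NumPy:

```python
import numpy as np

a = np.array([1, 2, 3, 4])

alias = a          # same object, nothing is copied
view = a[1:3]      # a slice is a view onto a's data
copy = a.copy()    # independent copy, like b = a in R

a[1] = 99          # change the original

print(alias[1], view[0], copy[1])
```

After the assignment, the alias and the view both see 99, while the copy still holds the old value.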

Indexing and accessing elements (Python: slicing)

R/S-Plus | Python | Description
a <- rbind(c(11, 12, 13, 14), c(21, 22, 23, 24), c(31, 32, 33, 34)) | a = array([[ 11, 12, 13, 14 ], [ 21, 22, 23, 24 ], [ 31, 32, 33, 34 ]]) | Input is a 3,4 array
a[2,3] | a[1,2] | Element 2,3 (row,col)
a[1,] | a[0,] | First row
a[,1] | a[:,0] | First column
| a.take([0,2], axis=0).take([0,3], axis=1) | Array as indices
a[-1,] | a[1:,] | All, except first row
| a[-2:,] | Last two rows
| a[::2,:] | Strides: Every other row
| a[...,2] | Third in last dimension (axis)
a[-2,-3] | | All, except row,column (2,3)
a[,-2] | a.take([0,2,3],axis=1) | Remove one column
| a.diagonal(offset=0) | Diagonal

Assignment

R/S-Plus | Python | Description
a[,1] <- 99 | a[:,0] = 99 |
a[,1] <- c(99,98,97) | a[:,0] = array([99,98,97]) |
a[a>90] <- 90 | (a>90).choose(a,90) or a.clip(min=None, max=90) | Clipping: Replace all elements over 90
| a.clip(min=2, max=5) | Clip upper and lower values
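Here is a short sketch of the clipping row, assuming NumPy and a small made-up array. The boolean-mask assignment at the end is not in the card, but it is the form that reads closest to R's a[a>90] <- 90.

```python
import numpy as np

# Hypothetical values, invented for this illustration.
a = np.array([[11, 95], [91, 34]])

capped = a.clip(max=90)                 # new array, a itself is unchanged
also_capped = np.where(a > 90, 90, a)   # same result via a condition

b = a.copy()
b[b > 90] = 90                          # in place, like R's a[a>90] <- 90

print(capped)
print(b)
```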

Transpose and inverse

R/S-Plus | Python | Description
t(a) | a.conj().transpose() | Transpose
| a.transpose() | Non-conjugate transpose
det(a) | linalg.det(a) | Determinant
solve(a) | linalg.inv(a) | Inverse
ginv(a) | linalg.pinv(a) | Pseudo-inverse
| linalg.norm(a) | Norms
eigen(a)$values | linalg.eig(a)[0] | Eigenvalues
svd(a)$d | linalg.svd(a)[1] | Singular values
| linalg.cholesky(a) | Cholesky factorization
eigen(a)$vectors | linalg.eig(a)[1] | Eigenvectors
qr(a)$rank | linalg.matrix_rank(a) | Rank

Sum

R/S-Plus | Python | Description
apply(a,2,sum) | a.sum(axis=0) | Sum of each column
apply(a,1,sum) | a.sum(axis=1) | Sum of each row
sum(a) | a.sum() | Sum of all elements
| a.trace(offset=0) | Sum along diagonal
apply(a,2,cumsum) | a.cumsum(axis=0) | Cumulative sum (columns)

Sorting

R/S-Plus | Python | Description
| a = array([[4,3,2],[2,8,6],[1,4,7]]) | Example data
t(sort(a)) | sort(a.ravel()) | Flat and sorted
apply(a,2,sort) | a.sort(axis=0) | Sort each column
t(apply(a,1,sort)) | a.sort(axis=1) | Sort each row
| a[a[:,0].argsort(),] | Sort rows (by first column)
order(a) | a.ravel().argsort() | Sort, return indices
| a.argsort(axis=0) | Sort each column, return indices
| a.argsort(axis=1) | Sort each row, return indices
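One thing the sorting card does not make obvious: the method a.sort() sorts in place and returns None, while the function np.sort(a) returns a sorted copy. The sketch below uses the copying form on the card's example data so that each result can be inspected; NumPy is assumed.

```python
import numpy as np

a = np.array([[4, 3, 2], [2, 8, 6], [1, 4, 7]])

flat_sorted = np.sort(a, axis=None)    # flatten, then sort
cols_sorted = np.sort(a, axis=0)       # sort within each column
rows_sorted = np.sort(a, axis=1)       # sort within each row
by_first_col = a[a[:, 0].argsort()]    # reorder whole rows by column 0

print(flat_sorted)
print(cols_sorted)
print(rows_sorted)
print(by_first_col)
```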

Plotting

Basic x-y plots

R/S-Plus | Python | Description
plot(a, type="l") | plot(a) | 1d line plot
plot(x[,1],x[,2]) | plot(x[:,0],x[:,1],'o') | 2d scatter plot
| plot(x1,y1,'bo', x2,y2,'go') | Two graphs in one plot
plot(x1,y1); matplot(x2,y2,add=T) | plot(x1,y1,'o'); plot(x2,y2,'o'); show() | Overplotting: Add new plots to current
| subplot(211) | subplots
plot(x,y,type="b",col="red") | plot(x,y,'ro-') | Plotting symbols and color

Axes and titles

R/S-Plus | Python | Description
grid() | grid() | Turn on grid lines
plot(c(1:10,10:1), asp=1) | figure(figsize=(6,6)) | 1:1 aspect ratio
plot(x,y, xlim=c(0,10), ylim=c(0,5)) | axis([ 0, 10, 0, 5 ]) | Set axes manually
plot(1:10, main="title", xlab="x-axis", ylab="y-axis") | | Axis labels and titles
| text(2,25,'hello') | Insert text

Log plots

R/S-Plus | Python | Description
plot(x,y, log="y") | semilogy(a) | logarithmic y-axis
plot(x,y, log="x") | semilogx(a) | logarithmic x-axis
plot(x,y, log="xy") | loglog(a) | logarithmic x and y axes

Filled plots and bar plots

R/S-Plus | Python | Description
plot(t,s, type="n", xlab="", ylab=""); polygon(t,s, col="lightblue"); polygon(t,c, col="lightgreen") | fill(t,s,'b', t,c,'g', alpha=0.2) | Filled plot
stem(x[,3]) | | Stem-and-Leaf plot

Functions

R/S-Plus | Python | Description
f <- function(x) sin(x/3) - cos(x/5) | | Defining functions
plot(f, xlim=c(0,40), type='p') | x = arange(0,40,.5); y = sin(x/3) - cos(x/5); plot(x,y, 'o') | Plot a function for given range

Polar plots

R/S-Plus | Python | Description
| theta = arange(0,2*pi,0.001); r = sin(2*theta); polar(theta, r) |

Histogram plots

R/S-Plus | Python | Description
hist(rnorm(1000)) | |
hist(rnorm(1000), breaks= -4:4) | |
hist(rnorm(1000), breaks=c(seq(-5,0,0.25), seq(0.5,5,0.5)), freq=F) | |
plot(apply(a,1,sort),type="l") | |

3d data

Contour and image plots

R/S-Plus | Python | Description
contour(z) | levels, colls = contour(Z, V, origin='lower', extent=(-3,3,-3,3)); clabel(colls, levels, inline=1, fmt='%1.1f', fontsize=10) | Contour plot
filled.contour(x,y,z, nlevels=7, color=gray.colors) | contourf(Z, V, cmap=cm.gray, origin='lower', extent=(-3,3,-3,3)) | Filled contour plot
image(z, col=gray.colors(256)) | im = imshow(Z, interpolation='bilinear', origin='lower', extent=(-3,3,-3,3)) | Plot image data
| # imshow() and contour() as above | Image with contours
| quiver() | Direction field vectors

Perspective plots of surfaces over the x-y plane

R/S-Plus | Python | Description
f <- function(x,y) x*exp(-x^2-y^2); n <- seq(-2,2, length=40); z <- outer(n,n,f) | n=arange(-2,2,.1); [x,y] = meshgrid(n,n); z = x*power(math.e,-x**2-y**2) |
persp(x,y,z, theta=30, phi=30, expand=0.6, ticktype='detailed') | | Mesh plot
persp(x,y,z, theta=30, phi=30, expand=0.6, col='lightblue', shade=0.75, ltheta=120, ticktype='detailed') | | Surface plot

Scatter (cloud) plots

R/S-Plus | Python | Description
cloud(z~x*y) | | 3d scatter plot

Save plot to a graphics file

R/S-Plus | Python | Description
postscript(file="foo.eps"); plot(1:10); dev.off() | savefig('foo.eps') | PostScript
pdf(file='foo.pdf') | savefig('foo.pdf') | PDF
devSVG(file='foo.svg') | savefig('foo.svg') | SVG (vector graphics for www)
png(filename = "Rplot%03d.png") | savefig('foo.png') | PNG (raster graphics)

Data analysis

Set membership operators

R/S-Plus | Python | Description
a <- c(1,2,2,5,2); b <- c(2,3,4) | a = array([1,2,2,5,2]); b = array([2,3,4]) or a = set([1,2,2,5,2]); b = set([2,3,4]) | Create sets
unique(a) | unique(a) or set(a) | Set unique
union(a,b) | union1d(a,b) or a.union(b) | Set union
intersect(a,b) | intersect1d(a,b) or a.intersection(b) | Set intersection
setdiff(a,b) | setdiff1d(a,b) or a.difference(b) | Set difference
setdiff(union(a,b),intersect(a,b)) | setxor1d(a,b) or a.symmetric_difference(b) | Set exclusion
is.element(2,a) or 2 %in% a | 2 in a or isin(2,a) or contains(a,2) | True for set member
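The set rows above mix two styles: NumPy functions that work on arrays, and methods of Python's built-in set type. A minimal sketch of the NumPy style on the card's own example data, with the built-in set shown once for comparison:

```python
import numpy as np

a = np.array([1, 2, 2, 5, 2])
b = np.array([2, 3, 4])

uniq = np.unique(a)            # sorted unique values
union = np.union1d(a, b)
inter = np.intersect1d(a, b)
diff = np.setdiff1d(a, b)      # in a but not in b
xor = np.setxor1d(a, b)        # in exactly one of a, b
is_member = 2 in a             # True if any element equals 2

# The same "exclusion" with Python's built-in sets.
py_xor = set(a.tolist()) ^ set(b.tolist())

print(uniq, union, inter, diff, xor, is_member, py_xor)
```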

Statistics

R/S-Plus | Python | Description
apply(a,2,mean) | a.mean(axis=0) or mean(a [,axis=0]) | Average
apply(a,2,median) | median(a) or median(a [,axis=0]) | Median
apply(a,2,sd) | a.std(axis=0) or std(a [,axis=0]) | Standard deviation
apply(a,2,var) | a.var(axis=0) or var(a) | Variance
cor(x,y) | corrcoef(x,y) | Correlation coefficient
cov(x,y) | cov(x,y) | Covariance

Interpolation and regression

R/S-Plus | Python | Description
z <- lm(y~x); plot(x,y); abline(z) | (a,b) = polyfit(x,y,1); plot(x,y,'o', x,a*x+b,'-') | Straight line fit
solve(a,b) | linalg.lstsq(x,y) | Linear least squares $y = ax + b$
| polyfit(x,y,3) | Polynomial fit
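As a rough illustration of the straight-line fit rows, here is a sketch that fits $y = ax + b$ two ways, assuming NumPy and data invented so that the true line is known. Note that linalg.lstsq needs a column of ones in the matrix to estimate the intercept; that detail is my addition and is not spelled out in the card.

```python
import numpy as np

# Hypothetical data lying exactly on y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0

# Degree-1 polynomial fit: returns slope first, then intercept.
slope, intercept = np.polyfit(x, y, 1)

# The same fit as a least-squares problem A @ [a, b] = y.
A = np.column_stack([x, np.ones_like(x)])
coef, residuals, rank, sv = np.linalg.lstsq(A, y, rcond=None)

print(slope, intercept, coef)
```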

Loops

R/S-Plus | Python | Description
for(i in 1:5) print(i) | for i in range(1,6): print(i) | for-statement
for(i in 1:5) { print(i); print(i*2) } | for i in range(1,6): print(i); print(i*2) | Multiline for statements

Conditionals

R/S-Plus | Python | Description
if (1>0) a <- 100 | if 1>0: a=100 | if-statement
ifelse(a>0,a,0) | where(a>0,a,0) | Ternary operator (if?true:false)
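R's ifelse works on whole vectors, so the nearest NumPy counterpart is np.where, while a single value uses Python's conditional expression. A small sketch, assuming NumPy and made-up values:

```python
import numpy as np

a = np.array([-2, 0, 3])

# Elementwise, like R's ifelse(a > 0, a, 0).
clipped = np.where(a > 0, a, 0)

# For one value, Python's conditional expression plays the role of cond ? x : y.
n = -5
sign_label = "positive" if n > 0 else "non-positive"

print(clipped, sign_label)
```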

Debugging

R/S-Plus | Python | Description
.Last.value | | Most recent evaluated expression
objects() | | List variables loaded into memory
rm(x) | | Clear variable $x$ from memory
print(a) | print(a) | Print

©2006 Vidar Bronken Gundersen, /mathesaurus.sf.net
Permission is granted to copy, distribute and/or modify this document as long as the above attribution is retained.


