Exploring some Python and R packages to move between, or work with, both Python and R without melting your brain or exceeding your project deadline
—————————————
If you liked the data.frame structure in R, you have several ways to work with a similar structure, at a faster processing speed, in Python.
Here are three packages that enable you to do so:
(1) pydataframe http://code.google.com/p/pydataframe/
An implementation of an almost R-like DataFrame object. (Install via PyPI/pip: “pip install pydataframe”)
Usage:
u = DataFrame({"Field1": [1, 2, 3], "Field2": ['abc', 'def', 'hgi']}, ['Field1', 'Field2'], ["rowOne", "rowTwo", "thirdRow"])
(The two trailing lists, the column order and the row names, are optional.)
A DataFrame is basically a table with rows and columns.
Columns are named, rows are numbered (but can be named) and can be easily selected and calculated upon. Internally, columns are stored as 1d numpy arrays. If you set row names, they’re converted into a dictionary for fast access. There is a rich subselection/slicing API, see help(DataFrame.get_item) (it also works for setting values). Please note that any slice gets you another DataFrame; to access individual entries use get_row(), get_column(), get_value().
DataFrames also understand basic arithmetic and you can either add (multiply,…) a constant value, or another DataFrame of the same size / with the same column names, like this:
# multiply every value in ColumnA that is smaller than 5 by 6.
my_df[my_df[:,'ColumnA'] < 5, 'ColumnA'] *= 6
# you always need to specify both row and column selectors, use : to mean everything
my_df[:, 'ColumnB'] = my_df[:,'ColumnA'] + my_df[:, 'ColumnC']
# let's take every row that starts with Shu in ColumnA and replace it with a new list (comprehension)
select = my_df.where(lambda row: row['ColumnA'].startswith('Shu'))
my_df[select, 'ColumnA'] = [row['ColumnA'].replace('Shu', 'Sha') for row in my_df[select,:].iter_rows()]
DataFrames talk directly to R via rpy2 (rpy2 is not a prerequisite for the library!)
(2) pandas http://pandas.pydata.org/
Library Highlights
 A fast and efficient DataFrame object for data manipulation with integrated indexing;
 Tools for reading and writing data between in-memory data structures and different formats: CSV and text files, Microsoft Excel, SQL databases, and the fast HDF5 format;
 Intelligent data alignment and integrated handling of missing data: gain automatic label-based alignment in computations and easily manipulate messy data into an orderly form;
 Flexible reshaping and pivoting of data sets;
 Intelligent label-based slicing, fancy indexing, and subsetting of large data sets;
 Columns can be inserted and deleted from data structures for size mutability;
 Aggregating or transforming data with a powerful group by engine allowing split-apply-combine operations on data sets;
 High performance merging and joining of data sets;
 Hierarchical axis indexing provides an intuitive way of working with high-dimensional data in a lower-dimensional data structure;
 Time series functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging. Even create domain-specific time offsets and join time series without losing data;
 The library has been ruthlessly optimized for performance, with critical code paths compiled to C;
 Python with pandas is in use in a wide variety of academic and commercial domains, including Finance, Neuroscience, Economics, Statistics, Advertising, Web Analytics, and more.
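To make a couple of these highlights concrete, here is a minimal sketch of label-based alignment and a split-apply-combine aggregation. It assumes pandas is installed; the labels, column names and values are made up for illustration.

```python
import pandas as pd

# Hypothetical data: two series indexed by label, each missing one label.
s1 = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
s2 = pd.Series([10.0, 20.0, 30.0], index=["b", "c", "d"])

# Arithmetic aligns on the index labels; a label present in only one
# series yields NaN (missing) in the result.
total = s1 + s2

# Split-apply-combine: group rows by a key column, then aggregate each group.
df = pd.DataFrame({"group": ["x", "y", "x", "y"], "value": [1, 2, 3, 4]})
group_sums = df.groupby("group")["value"].sum()

print(total)
print(group_sums)
```

The alignment step is the part that has no direct counterpart in a plain R vector addition, which recycles by position rather than matching by name.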
Why not R?
First of all, we love open source R! It is the most widely used open source environment for statistical modeling and graphics, and it provided some early inspiration for pandas features. R users will be pleased to find this library adopts some of the best concepts of R, like the foundational DataFrame (one user familiar with R has described pandas as “R data.frame on steroids”). But pandas also seeks to solve some frustrations common to R users:
 R has bare-bones data alignment and indexing functionality, leaving much work to the user. pandas makes it easy and intuitive to work with messy, irregularly indexed data, like time series data. pandas also provides rich tools, like hierarchical indexing, not found in R;
 R is not well-suited to general purpose programming and system development. pandas enables you to do large-scale data processing seamlessly when developing your production applications;
 Hybrid systems connecting R to a low-productivity systems language like Java, C++, or C# suffer from significantly reduced agility and maintainability, and you’re still stuck developing the system components in a low-productivity language;
 The “copyleft” GPL license of R can create concerns for commercial software vendors who want to distribute R with their software under another license. Python and pandas use more permissive licenses.
(3) datamatrix http://pypi.python.org/pypi/datamatrix/0.8
datamatrix 0.8
A Pythonic implementation of R’s data.frame structure.
Latest Version: 0.9
This module allows access to comma or other delimiter separated files as if they were tables, using a dictionary-like syntax. DataMatrix objects can be manipulated, rows and columns added and removed, or even transposed.
—————————————————————–
Modeling in Python
Also, if you like to build models and you are an R user, the Python package patsy can really help.
(4) http://patsy.readthedocs.org/en/latest/overview.html
patsy is a Python package for describing statistical models and building design matrices. It is closely inspired by and compatible with the ‘formula’ mini-language used in R and S.
For instance, if we have some variable y, and we want to regress it against some other variables x, a, b, and the interaction of a and b, then we simply write:
patsy.dmatrices("y ~ x + a + b + a:b", data)
and Patsy takes care of building appropriate matrices. Furthermore, it:
 Allows data transformations to be specified using arbitrary Python code: instead of x, we could have written log(x), (x > 0), or even log(x) if x > 1e-5 else log(1e-5),
 Provides a range of convenient options for coding categorical variables, including automatic detection and removal of redundancies,
 Knows how to apply ‘the same’ transformation used on original data to new data, even for tricky transformations like centering or standardization (critical if you want to use your model to make predictions),
 Has an incremental mode to handle data sets which are too large to fit into memory at one time,
 Provides a language for symbolic, human-readable specification of linear constraint matrices,
 Has a thorough test suite and solid underlying theory, allowing it to correctly handle corner cases that even R gets wrong, and
 Features a simple API for integration into statistical packages.
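To show what "building appropriate matrices" means for the formula above, here is a hand-built NumPy sketch of the design matrix that "y ~ x + a + b + a:b" describes. This is not patsy's implementation: it assumes every variable is numeric (patsy would expand categorical variables into indicator columns instead), and the data values are invented.

```python
import numpy as np

# Hypothetical numeric data for the variables named in the formula.
data = {
    "y": np.array([1.0, 2.0, 3.0, 4.0]),
    "x": np.array([0.5, 1.5, 2.5, 3.5]),
    "a": np.array([1.0, 0.0, 1.0, 0.0]),
    "b": np.array([2.0, 2.0, 3.0, 3.0]),
}

# Right-hand side of "y ~ x + a + b + a:b": an implicit intercept column,
# one column per term, and the a:b interaction as an elementwise product.
X = np.column_stack([
    np.ones_like(data["x"]),   # Intercept
    data["x"],                 # x
    data["a"],                 # a
    data["b"],                 # b
    data["a"] * data["b"],     # a:b
])

# Left-hand side: the response as a single column.
Y = data["y"].reshape(-1, 1)

print(X.shape, Y.shape)
```

Writing the columns out by hand like this is exactly the bookkeeping a formula language is meant to remove.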
For more models, you should use the statsmodels Python package.
(5) Statsmodels http://statsmodels.sourceforge.net/
Statsmodels is a Python module that allows users to explore data, estimate statistical models, and perform statistical tests. An extensive list of descriptive statistics, statistical tests, plotting functions, and result statistics are available for different types of data and each estimator. Researchers across fields may find that statsmodels fully meets their needs for statistical computing and data analysis in Python. Features include:
 Linear regression models
 Generalized linear models
 Discrete choice models
 Robust linear models
 Many models and functions for time series analysis
 Nonparametric estimators
 A collection of datasets for examples
 A wide range of statistical tests
 Input-output tools for producing tables in a number of formats (Text, LaTeX, HTML) and for reading Stata files into NumPy and Pandas.
 Plotting functions
 Extensive unit tests to ensure correctness of results
 Many more models and extensions in development
———————————————————————–
Graphing
For graphing, you can see the wiki at http://wiki.python.org/moin/NumericAndScientific/Plotting
Apparently the best option among the Python packages is (6) matplotlib http://matplotlib.sourceforge.net/ (or better, see the gallery at http://matplotlib.sourceforge.net/gallery.html).
——————————————————————————
Using R and Python together
(7) Just switch to R using rpy2 (and use ggplot) http://rpy.sourceforge.net/rpy2.html
rpy2 is a redesign and rewrite of rpy. It provides a low-level interface to R, a proposed high-level interface, including wrappers to graphical libraries, as well as R-like structures and functions.
(8) Also, if you want to switch between NumPy and R, you can use Dirk E's new package RcppCNPy at http://cran.r-project.org/web/packages/RcppCNPy/RcppCNPy.pdf
(9) If you are totally new to Python, you may wonder what NumPy and SciPy are.
http://numpy.scipy.org/ NumPy is the fundamental package for scientific computing with Python. It contains among other things:
 a powerful N-dimensional array object
 sophisticated (broadcasting) functions
 tools for integrating C/C++ and Fortran code
 useful linear algebra, Fourier transform, and random number capabilities
and
(10) SciPy: umbrella project which includes a variety of high level science and engineering modules together as a single package. SciPy includes modules for linear algebra (including wrappers to BLAS and LAPACK), optimization, integration, special functions, FFTs, signal and image processing, ODE solvers, and others.
Also check out http://www.scipy.org/Topical_Software if you are into scientific research
———————————————————————–
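You can read some excellent reference cards at http://mathesaurus.sourceforge.net
The card reproduced below dates from 2006, so some of its Python entries use names from older NumPy releases.
NumPy for R (and S-Plus) users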
Concatenation (vectors)
| R/S-Plus | Python | Description |
| c(a,a) | concatenate((a,a)) | Concatenate two vectors |
| c(1:4,a) | concatenate((range(1,5),a), axis=1) |  |
Repeating
| R/S-Plus | Python | Description |
| rep(a,times=2) | concatenate((a,a)) | 1 2 3, 1 2 3 |
| rep(a,each=3) | a.repeat(3) | 1 1 1, 2 2 2, 3 3 3 |
| rep(a,a) | a.repeat(a) | 1, 2 2, 3 3 3 |
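The two tables above can be tried out as follows. This is a small sketch using current NumPy names with an explicit `np.` prefix (the card assumes a star import); the vector `a` is an arbitrary example.

```python
import numpy as np

a = np.array([1, 2, 3])

both = np.concatenate((a, a))                    # like c(a,a) or rep(a,times=2)
prefixed = np.concatenate((np.arange(1, 5), a))  # like c(1:4,a)
each = a.repeat(3)                               # like rep(a,each=3)
by_count = a.repeat(a)                           # like rep(a,a)

print(both, prefixed, each, by_count)
```

Note that `np.arange(1, 5)` stops before 5, so it matches R's inclusive `1:4`.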
Miss those elements out
| R/S-Plus | Python | Description |
| a[-1] | a[1:] | miss the first element |
| a[-10] |  | miss the tenth element |
| a[-seq(1,50,3)] |  | miss 1,4,7, … |
|  | a[-1] | last element |
|  | a[-2:] | last two elements |
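Negative indices are a common trap when moving between the two languages: in R a negative index drops elements, while in Python it counts from the end. The sketch below shows one way to get the R behaviour in NumPy, using `np.delete` for the rows the card leaves blank; the ten-element vector is only an example.

```python
import numpy as np

a = np.arange(1, 11)   # 1..10, like 1:10 in R

without_first = a[1:]                                  # R: a[-1]
without_tenth = np.delete(a, 9)                        # R: a[-10]; position 10 is index 9
without_every_third = np.delete(a, np.arange(0, a.size, 3))  # R: a[-seq(1,10,3)]
last = a[-1]                                           # last element
last_two = a[-2:]                                      # last two elements

print(without_first, without_tenth, without_every_third, last, last_two)
```

`np.delete` returns a new array and leaves `a` unchanged, which matches how R's negative indexing behaves.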
Maximum and minimum
| R/S-Plus | Python | Description |
| pmax(a,b) | maximum(a,b) | pairwise max |
| max(a,b) | concatenate((a,b)).max() | max of all values in two vectors |
| v <- max(a) ; i <- which.max(a) | v,i = a.max(0),a.argmax(0) |  |
Vector multiplication
| R/S-Plus | Python | Description |
| a*a | a*a | Multiply two vectors |
|  | dot(u,v) | Vector dot product, $u \cdot v$ |
Matrices
| R/S-Plus | Python | Description |
| rbind(c(2,3),c(4,5)) or array(c(2,3,4,5), dim=c(2,2)) | a = array([[2,3],[4,5]]) | Define a matrix |
Concatenation (matrices); rbind and cbind
| R/S-Plus | Python | Description |
| rbind(a,b) | concatenate((a,b), axis=0) or vstack((a,b)) | Bind rows |
| cbind(a,b) | concatenate((a,b), axis=1) or hstack((a,b)) | Bind columns |
|  | concatenate((a,b), axis=2) or dstack((a,b)) | Bind slices (three-way arrays) |
|  | concatenate((a,b), axis=None) | Concatenate matrices into one vector |
| rbind(1:4,1:4) | concatenate((r_[1:5],r_[1:5])).reshape(2,-1) or vstack((r_[1:5],r_[1:5])) | Bind rows (from vectors) |
| cbind(1:4,1:4) |  | Bind columns (from vectors) |
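Here is a short sketch of the row and column binding entries with current NumPy. It also fills in the blank `cbind(1:4,1:4)` row with `np.column_stack`, which is one possible choice rather than something the card itself lists.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

rows = np.vstack((a, b))                  # like rbind(a,b); shape (4, 2)
cols = np.hstack((a, b))                  # like cbind(a,b); shape (2, 4)
slices = np.dstack((a, b))                # stack along a third axis; shape (2, 2, 2)
flat = np.concatenate((a, b), axis=None)  # one flat vector

v = np.arange(1, 5)                       # 1..4, like 1:4
from_rows = np.vstack((v, v))             # like rbind(1:4,1:4); shape (2, 4)
from_cols = np.column_stack((v, v))       # like cbind(1:4,1:4); shape (4, 2)

print(rows.shape, cols.shape, slices.shape, flat, from_rows.shape, from_cols.shape)
```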
Array creation
| R/S-Plus | Python | Description |
| matrix(0,3,5) or array(0,c(3,5)) | zeros((3,5),Float) | 0 filled array |
|  | zeros((3,5)) | 0 filled array of integers |
| matrix(1,3,5) or array(1,c(3,5)) | ones((3,5),Float) | 1 filled array |
| matrix(9,3,5) or array(9,c(3,5)) |  | Any number filled array |
| diag(1,3) | identity(3) | Identity matrix |
| diag(c(4,5,6)) | diag((4,5,6)) | Diagonal |
|  | a = empty((3,3)) | Empty array |
Reshape and flatten matrices
| R/S-Plus | Python | Description |
| matrix(1:6,nrow=3,byrow=T) | arange(1,7).reshape(2,-1) or a.setshape(2,3) | Reshaping (rows first) |
| matrix(1:6,nrow=2) or array(1:6,c(2,3)) | arange(1,7).reshape(-1,2).transpose() | Reshaping (columns first) |
| as.vector(t(a)) | a.flatten() | Flatten to vector (by rows, like comics) |
| as.vector(a) | a.flatten(1) | Flatten to vector (by columns) |
| a[row(a) <= col(a)] |  | Flatten upper triangle (by columns) |
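The row-first versus column-first distinction is easier to see with the `order` argument that current NumPy provides. The sketch below is one way to write the entries above; the upper-triangle line is my own suggestion for the row the card leaves blank.

```python
import numpy as np

by_rows = np.arange(1, 7).reshape(3, -1)              # like matrix(1:6,nrow=3,byrow=T)
by_cols = np.arange(1, 7).reshape((2, 3), order="F")  # like matrix(1:6,nrow=2)

flat_rows = by_rows.flatten()             # row by row, like as.vector(t(a))
flat_cols = by_rows.flatten(order="F")    # column by column, like as.vector(a)

# Upper triangle including the diagonal, read column by column
# like R's a[row(a) <= col(a)].
sq = np.arange(1, 10).reshape(3, 3)
mask = np.triu(np.ones(sq.shape, dtype=bool))
upper = sq.T[mask.T]

print(by_rows, by_cols, flat_rows, flat_cols, upper, sep="\n")
```

R stores matrices column by column while NumPy defaults to row by row, so `order="F"` (Fortran order) is usually what reproduces R's results.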
Shared data (slicing)
| R/S-Plus | Python | Description |
| b = a | b = a.copy() | Copy of a |
Indexing and accessing elements (Python: slicing)
| R/S-Plus | Python | Description |
| a <- rbind(c(11, 12, 13, 14), c(21, 22, 23, 24), c(31, 32, 33, 34)) | a = array([[ 11, 12, 13, 14 ], [ 21, 22, 23, 24 ], [ 31, 32, 33, 34 ]]) | Input is a 3,4 array |
| a[2,3] | a[1,2] | Element 2,3 (row,col) |
| a[1,] | a[0,] | First row |
| a[,1] | a[:,0] | First column |
|  | a.take([0,2]).take([0,3], axis=1) | Array as indices |
| a[-1,] | a[1:,] | All, except first row |
|  | a[-2:,] | Last two rows |
|  | a[::2,:] | Strides: Every other row |
|  | a[...,2] | Third in last dimension (axis) |
| a[-2,-3] |  | All, except row,column (2,3) |
| a[,-2] | a.take([0,2,3],axis=1) | Remove one column |
|  | a.diagonal(offset=0) | Diagonal |
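A few of these indexing entries are sketched below with current NumPy. The use of `np.ix_` and `np.delete` is my own choice for the "array as indices" and "all except" rows, not something taken from the card.

```python
import numpy as np

a = np.array([[11, 12, 13, 14],
              [21, 22, 23, 24],
              [31, 32, 33, 34]])

element = a[1, 2]                  # R: a[2,3]
first_row = a[0, :]                # R: a[1,]
first_col = a[:, 0]                # R: a[,1]
sub = a[np.ix_([0, 2], [0, 3])]    # rows 1,3 and columns 1,4
no_first_row = a[1:, :]            # R: a[-1,]
no_second_col = np.delete(a, 1, axis=1)   # R: a[,-2]
# R: a[-2,-3] drops row 2 and column 3.
no_row2_col3 = np.delete(np.delete(a, 1, axis=0), 2, axis=1)

print(element, first_row, first_col, sub, no_second_col, no_row2_col3, sep="\n")
```

The main thing to keep in mind is the shift by one: R counts from 1 and NumPy from 0.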
Assignment
| R/S-Plus | Python | Description |
| a[,1] <- 99 | a[:,0] = 99 |  |
| a[,1] <- c(99,98,97) | a[:,0] = array([99,98,97]) |  |
| a[a>90] <- 90 | (a>90).choose(a,90) or a.clip(min=None, max=90) | Clipping: Replace all elements over 90 |
|  | a.clip(min=2, max=5) | Clip upper and lower values |
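The assignment and clipping entries can be sketched like this. Boolean-mask assignment is used here as the closest NumPy counterpart to `a[a>90] <- 90`; the array values are invented for the example.

```python
import numpy as np

a = np.array([[11, 95, 13],
              [99, 22, 91]])

b = a.copy()
b[:, 0] = 99                 # R: a[,1] <- 99

c = a.copy()
c[c > 90] = 90               # R: a[a>90] <- 90 (in place, via a boolean mask)

d = a.clip(max=90)           # same values as c, returned as a new array
e = a.clip(min=20, max=50)   # clip both ends

print(b, c, d, e, sep="\n")
```

Working on copies keeps the original `a` intact, which matters because NumPy slices and in-place assignments otherwise modify shared data.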
Transpose and inverse
| R/S-Plus | Python | Description |
| t(a) | a.conj().transpose() | Transpose |
|  | a.transpose() | Non-conjugate transpose |
| det(a) | linalg.det(a) | Determinant |
| solve(a) | linalg.inv(a) | Inverse |
| ginv(a) | linalg.pinv(a) | Pseudo-inverse |
|  | norm(a) | Norms |
| eigen(a)$values | linalg.eig(a)[0] | Eigenvalues |
| svd(a)$d | linalg.svd(a) | Singular values |
|  | linalg.cholesky(a) | Cholesky factorization |
| eigen(a)$vectors | linalg.eig(a)[1] | Eigenvectors |
| rank(a) | rank(a) | Rank |
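The linear algebra rows are sketched below with the `np.linalg` names in current NumPy. The diagonal matrix is chosen only so the results are easy to check by eye; `matrix_rank` is used for the rank, since R's `rank()` ranks values rather than computing a matrix rank.

```python
import numpy as np

a = np.array([[2.0, 0.0],
              [0.0, 4.0]])

det = np.linalg.det(a)                          # R: det(a)
inv = np.linalg.inv(a)                          # R: solve(a)
pinv = np.linalg.pinv(a)                        # R: ginv(a)
values, vectors = np.linalg.eig(a)              # R: eigen(a)$values, eigen(a)$vectors
singular = np.linalg.svd(a, compute_uv=False)   # R: svd(a)$d
rank = np.linalg.matrix_rank(a)                 # matrix rank
chol = np.linalg.cholesky(a)                    # lower-triangular factor L with L @ L.T == a

print(det, inv, values, singular, rank, chol, sep="\n")
```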
Sum
| R/S-Plus | Python | Description |
| apply(a,2,sum) | a.sum(axis=0) | Sum of each column |
| apply(a,1,sum) | a.sum(axis=1) | Sum of each row |
| sum(a) | a.sum() | Sum of all elements |
|  | a.trace(offset=0) | Sum along diagonal |
| apply(a,2,cumsum) | a.cumsum(axis=0) | Cumulative sum (columns) |
Sorting
| R/S-Plus | Python | Description |
|  | a = array([[4,3,2],[2,8,6],[1,4,7]]) | Example data |
| t(sort(a)) | a.ravel().sort() | Flat and sorted |
| apply(a,2,sort) | a.sort(axis=0) or msort(a) | Sort each column |
| t(apply(a,1,sort)) | a.sort(axis=1) | Sort each row |
|  | a[a[:,0].argsort(),] | Sort rows (by first column) |
| order(a) | a.ravel().argsort() | Sort, return indices |
|  | a.argsort(axis=0) | Sort each column, return indices |
|  | a.argsort(axis=1) | Sort each row, return indices |
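One detail worth knowing when trying the sorting entries: the array method `a.sort()` sorts in place and returns `None`, whereas the function `np.sort(a)` returns a sorted copy, which is closer to how R's `sort()` behaves. The sketch below uses the function form on the card's example data.

```python
import numpy as np

a = np.array([[4, 3, 2],
              [2, 8, 6],
              [1, 4, 7]])

flat_sorted = np.sort(a, axis=None)    # flattened and sorted
col_sorted = np.sort(a, axis=0)        # sort within each column
row_sorted = np.sort(a, axis=1)        # sort within each row
by_first_col = a[a[:, 0].argsort()]    # reorder whole rows by the first column
order = a.ravel().argsort()            # indices that would sort the flat array

print(flat_sorted, col_sorted, row_sorted, by_first_col, order, sep="\n")
```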
Plotting
Basic x-y plots
| R/S-Plus | Python | Description |
| plot(a, type="l") | plot(a) | 1d line plot |
| plot(x[,1],x[,2]) | plot(x[:,0],x[:,1],'o') | 2d scatter plot |
|  | plot(x1,y1,'bo', x2,y2,'go') | Two graphs in one plot |
| plot(x1,y1); matplot(x2,y2,add=T) | plot(x1,y1,'o'); plot(x2,y2,'o'); show() # as normal | Overplotting: Add new plots to current |
|  | subplot(211) | subplots |
| plot(x,y,type="b",col="red") | plot(x,y,'ro-') | Plotting symbols and color |
Axes and titles
| R/S-Plus | Python | Description |
| grid() | grid() | Turn on grid lines |
| plot(c(1:10,10:1), asp=1) | figure(figsize=(6,6)) | 1:1 aspect ratio |
| plot(x,y, xlim=c(0,10), ylim=c(0,5)) | axis([ 0, 10, 0, 5 ]) | Set axes manually |
| plot(1:10, main="title", xlab="x-axis", ylab="y-axis") |  | Axis labels and titles |
|  | text(2,25,'hello') | Insert text |
Log plots
| R/S-Plus | Python | Description |
| plot(x,y, log="y") | semilogy(a) | logarithmic y-axis |
| plot(x,y, log="x") | semilogx(a) | logarithmic x-axis |
| plot(x,y, log="xy") | loglog(a) | logarithmic x and y axes |
Filled plots and bar plots
| R/S-Plus | Python | Description |
| plot(t,s, type="n", xlab="", ylab=""); polygon(t,s, col="lightblue"); polygon(t,c, col="lightgreen") | fill(t,s,'b', t,c,'g', alpha=0.2) | Filled plot |
| stem(x[,3]) |  | Stem-and-Leaf plot |
Functions
| R/S-Plus | Python | Description |
| f <- function(x) sin(x/3) - cos(x/5) |  | Defining functions |
| plot(f, xlim=c(0,40), type='p') | x = arrayrange(0,40,.5); y = sin(x/3) - cos(x/5); plot(x,y, 'o') | Plot a function for given range |
Polar plots
| R/S-Plus | Python | Description |
|  | theta = arange(0,2*pi,0.001); r = sin(2*theta) |  |
|  | polar(theta, rho) |  |
Histogram plots
| R/S-Plus | Python | Description |
| hist(rnorm(1000)) |  |  |
| hist(rnorm(1000), breaks= -4:4) |  |  |
| hist(rnorm(1000), breaks=c(seq(-5,0,0.25), seq(0.5,5,0.5)), freq=F) |  |  |
| plot(apply(a,1,sort),type="l") |  |  |
3d data
Contour and image plots
| R/S-Plus | Python | Description |
| contour(z) | levels, colls = contour(Z, V, origin='lower', extent=(-3,3,-3,3)); clabel(colls, levels, inline=1, fmt='%1.1f', fontsize=10) | Contour plot |
| filled.contour(x,y,z, nlevels=7, color=gray.colors) | contourf(Z, V, cmap=cm.gray, origin='lower', extent=(-3,3,-3,3)) | Filled contour plot |
| image(z, col=gray.colors(256)) | im = imshow(Z, interpolation='bilinear', origin='lower', extent=(-3,3,-3,3)) | Plot image data |
|  | # imshow() and contour() as above | Image with contours |
|  | quiver() | Direction field vectors |
Perspective plots of surfaces over the x-y plane
| R/S-Plus | Python | Description |
| f <- function(x,y) x*exp(-x^2-y^2); n <- seq(-2,2, length=40); z <- outer(n,n,f) | n=arrayrange(-2,2,.1); [x,y] = meshgrid(n,n); z = x*power(math.e,-x**2-y**2) |  |
| persp(x,y,z, theta=30, phi=30, expand=0.6, ticktype='detailed') |  | Mesh plot |
| persp(x,y,z, theta=30, phi=30, expand=0.6, col='lightblue', shade=0.75, ltheta=120, ticktype='detailed') |  | Surface plot |
Scatter (cloud) plots
| R/S-Plus | Python | Description |
| cloud(z~x*y) |  | 3d scatter plot |
Save plot to a graphics file
| R/S-Plus | Python | Description |
| postscript(file="foo.eps"); plot(1:10); dev.off() | savefig('foo.eps') | PostScript |
| pdf(file='foo.pdf') | savefig('foo.pdf') |  |
| devSVG(file='foo.svg') | savefig('foo.svg') | SVG (vector graphics for www) |
| png(filename = "Rplot%03d.png") | savefig('foo.png') | PNG (raster graphics) |
Data analysis
Set membership operators
| R/S-Plus | Python | Description |
| a <- c(1,2,2,5,2); b <- c(2,3,4) | a = array([1,2,2,5,2]); b = array([2,3,4]) or a = set([1,2,2,5,2]); b = set([2,3,4]) | Create sets |
| unique(a) | unique1d(a) or unique(a) or set(a) | Set unique |
| union(a,b) | union1d(a,b) or a.union(b) | Set union |
| intersect(a,b) | intersect1d(a,b) or a.intersection(b) | Set intersection |
| setdiff(a,b) | setdiff1d(a,b) or a.difference(b) | Set difference |
| setdiff(union(a,b),intersect(a,b)) | setxor1d(a,b) or a.symmetric_difference(b) | Set exclusion |
| is.element(2,a) or 2 %in% a | 2 in a, setmember1d(2,a) or contains(a,2) | True for set member |
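Some names in this table, such as `unique1d` and `setmember1d`, come from older NumPy releases. Below is a sketch of the same operations with names I know exist in current NumPy (`np.unique`, `np.isin`); it uses the card's own example vectors.

```python
import numpy as np

a = np.array([1, 2, 2, 5, 2])
b = np.array([2, 3, 4])

uniq = np.unique(a)             # R: unique(a); NumPy also sorts the result
union = np.union1d(a, b)        # R: union(a,b)
inter = np.intersect1d(a, b)    # R: intersect(a,b)
diff = np.setdiff1d(a, b)       # R: setdiff(a,b)
xor = np.setxor1d(a, b)         # elements found in exactly one of the two
member = np.isin(2, a)          # R: 2 %in% a

print(uniq, union, inter, diff, xor, member)
```

Plain Python `set` objects offer the same operations through `.union()`, `.intersection()` and so on, as the table shows, but they do not keep any order.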
Statistics
| R/S-Plus | Python | Description |
| apply(a,2,mean) | a.mean(axis=0) or mean(a [,axis=0]) | Average |
| apply(a,2,median) | median(a) or median(a [,axis=0]) | Median |
| apply(a,2,sd) | a.std(axis=0) or std(a [,axis=0]) | Standard deviation |
| apply(a,2,var) | a.var(axis=0) or var(a) | Variance |
| cor(x,y) | correlate(x,y) or corrcoef(x,y) | Correlation coefficient |
| cov(x,y) | cov(x,y) | Covariance |
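A caveat when comparing numbers between the two languages: R's `sd()` and `var()` divide by n-1, while NumPy's `std()` and `var()` divide by n unless you pass `ddof=1`. Also, `np.correlate` computes a cross-correlation, so `np.corrcoef` is the one that corresponds to R's `cor()`. A small sketch with made-up data:

```python
import numpy as np

a = np.array([[1.0, 2.0],
              [3.0, 6.0],
              [5.0, 10.0]])

col_mean = a.mean(axis=0)            # R: apply(a,2,mean)
col_median = np.median(a, axis=0)    # R: apply(a,2,median)
col_sd = a.std(axis=0, ddof=1)       # R: apply(a,2,sd); ddof=1 gives the n-1 denominator
col_var = a.var(axis=0, ddof=1)      # R: apply(a,2,var)

x, y = a[:, 0], a[:, 1]
r = np.corrcoef(x, y)[0, 1]          # R: cor(x,y)
c = np.cov(x, y)[0, 1]               # R: cov(x,y); np.cov already divides by n-1

print(col_mean, col_median, col_sd, col_var, r, c)
```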
Interpolation and regression
| R/S-Plus | Python | Description |
| z <- lm(y~x); plot(x,y); abline(z) | (a,b) = polyfit(x,y,1); plot(x,y,'o', x,a*x+b,'-') | Straight line fit |
| solve(a,b) | linalg.lstsq(x,y) | Linear least squares $y = ax + b$ |
|  | polyfit(x,y,3) | Polynomial fit |
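The straight-line fit can be checked without plotting. The sketch below fits an exact line with `np.polyfit` and then repeats the fit as an explicit least-squares problem with `np.linalg.lstsq`; the data are synthetic, and the design matrix `A` is my own construction for the illustration.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0                        # exact line, so the fit is easy to check

slope, intercept = np.polyfit(x, y, 1)   # degree-1 fit, highest power first

# The same fit as a least-squares problem: columns are x and a constant 1.
A = np.column_stack([x, np.ones_like(x)])
coef, residuals, rank, sv = np.linalg.lstsq(A, y, rcond=None)

print(slope, intercept, coef)
```

Note that `lstsq` needs the column of ones to estimate an intercept, much as R's `lm(y~x)` adds one implicitly.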
Loops
| R/S-Plus | Python | Description |
| for(i in 1:5) print(i) | for i in range(1,6): print(i) | for-statement |
| for(i in 1:5) { print(i); print(i*2) } | for i in range(1,6): print(i); print(i*2) | Multiline for statements |
Conditionals
| R/S-Plus | Python | Description |
| if (1>0) a <- 100 | if 1>0: a=100 | if-statement |
| ifelse(a>0,a,0) |  | Ternary operator (if?true:false) |
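The card leaves the Python side of `ifelse` blank. One possible counterpart, sketched here with invented values, is `np.where` for arrays and a conditional expression for single values.

```python
import numpy as np

a = np.array([-2, 0, 3, -1, 5])

# Elementwise choice, like R's ifelse(a>0,a,0).
positive_part = np.where(a > 0, a, 0)

# For a single value, Python's conditional expression plays the same role.
x = -4
scalar = x if x > 0 else 0

print(positive_part, scalar)
```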
Debugging
| R/S-Plus | Python | Description |
| .Last.value |  | Most recent evaluated expression |
| objects() |  | List variables loaded into memory |
| rm(x) |  | Clear variable $x$ from memory |
| print(a) | print a |  |
©2006 Vidar Bronken Gundersen, /mathesaurus.sf.net
Permission is granted to copy, distribute and/or modify this document as long as the above attribution is retained.
I have run the following code to create a dataframe of daily prices. I have 72 CSV files generated for 72 currency pairs. The code below generates a dataframe of only 67 columns and 5 rows, the rows being the time series and the columns the closing prices for the various coins.
Code as follows:
crypto_df = pd.DataFrame()
for ticker in tickers:
    crypto_df[ticker] = pd.read_csv(ticker + '.csv', index_col='date')['close']
crypto_df.dropna(inplace=True)
crypto_df.head()
Why is the time series restricted to only 5 rows? How can I get all the data into the dataframe?
Not all files contain the same number of rows, but they have identical columns. I ran code to export the dataframe to CSV and it replicated the illustrated dataframe with limited data.
Your assistance would be appreciated!
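One possible explanation, offered as a guess since the actual files are not shown: `dropna()` removes every date that is missing for at least one ticker, so only dates common to all files survive, and `head()` only ever displays the first five rows anyway. The sketch below uses two tiny in-memory CSVs as hypothetical stand-ins for the real files and joins the closing prices with `pd.concat`, which keeps every date.

```python
import io
import pandas as pd

# Hypothetical stand-ins for two of the ticker CSV files.
files = {
    "AAA": "date,close\n2020-01-01,1.0\n2020-01-02,1.1\n2020-01-03,1.2\n",
    "BBB": "date,close\n2020-01-02,5.0\n2020-01-03,5.5\n",
}

# Read each file's 'close' column, indexed by date.
closes = {
    ticker: pd.read_csv(io.StringIO(text), index_col="date")["close"]
    for ticker, text in files.items()
}

# Join side by side; the default outer join keeps every date and
# fills the gaps with NaN.
crypto_df = pd.concat(closes, axis=1)

# For comparison: dropping incomplete rows keeps only the shared dates.
only_complete = crypto_df.dropna()

print(crypto_df)
print(only_complete)
```

With real files you would pass the file name to `pd.read_csv` instead of `io.StringIO(text)`, and decide afterwards whether to drop, fill or keep the missing values.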
Thanks for sharing, it's useful…