Data Frame in Python

Exploring some Python packages and R packages to move between and work with both Python and R without melting your brain or exceeding your project deadline

—————————————

If you liked the data.frame structure in R, there are ways to work with a similar structure, at a faster processing speed, in Python.

Here are three packages that enable you to do so:

(1) pydataframe http://code.google.com/p/pydataframe/

An implementation of an almost R-like DataFrame object. (Install via PyPI/pip: “pip install pydataframe”)

Usage:

        u = DataFrame({"Field1": [1, 2, 3],
                       "Field2": ['abc', 'def', 'hgi']},
                      # optional: column order
                      ['Field1', 'Field2'],
                      # optional: row names
                      ["rowOne", "rowTwo", "thirdRow"])

A DataFrame is basically a table with rows and columns.

Columns are named, rows are numbered (but can be named), and both can be easily selected and calculated upon. Internally, columns are stored as 1-D NumPy arrays. If you set row names, they’re converted into a dictionary for fast access. There is a rich subselection/slicing API, see help(DataFrame.get_item) (it also works for setting values). Please note that any slice gets you another DataFrame; to access individual entries use get_row(), get_column(), get_value().

DataFrames also understand basic arithmetic and you can either add (multiply,…) a constant value, or another DataFrame of the same size / with the same column names, like this:

#multiply every value in ColumnA that is smaller than 5 by 6.
my_df[my_df[:,'ColumnA'] < 5, 'ColumnA'] *= 6

#you always need to specify both row and column selectors, use : to mean everything
my_df[:, 'ColumnB'] = my_df[:,'ColumnA'] + my_df[:, 'ColumnC']

#let's take every row that starts with Shu in ColumnA and replace it with a new list (comprehension)
select = my_df.where(lambda row: row['ColumnA'].startswith('Shu'))
my_df[select, 'ColumnA'] = [row['ColumnA'].replace('Shu', 'Sha') for row in my_df[select,:].iter_rows()]

DataFrames talk directly to R via rpy2 (rpy2 is not a prerequisite for the library!)

 

(2) pandas http://pandas.pydata.org/

Library Highlights

  • A fast and efficient DataFrame object for data manipulation with integrated indexing;
  • Tools for reading and writing data between in-memory data structures and different formats: CSV and text files, Microsoft Excel, SQL databases, and the fast HDF5 format;
  • Intelligent data alignment and integrated handling of missing data: gain automatic label-based alignment in computations and easily manipulate messy data into an orderly form;
  • Flexible reshaping and pivoting of data sets;
  • Intelligent label-based slicing, fancy indexing, and subsetting of large data sets;
  • Columns can be inserted and deleted from data structures for size mutability;
  • Aggregating or transforming data with a powerful group by engine allowing split-apply-combine operations on data sets;
  • High performance merging and joining of data sets;
  • Hierarchical axis indexing provides an intuitive way of working with high-dimensional data in a lower-dimensional data structure;
  • Time series-functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging. Even create domain-specific time offsets and join time series without losing data;
  • The library has been ruthlessly optimized for performance, with critical code paths compiled to C;
  • Python with pandas is in use in a wide variety of academic and commercial domains, including Finance, Neuroscience, Economics, Statistics, Advertising, Web Analytics, and more.
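A few of the highlights above (split-apply-combine with group by, merging, and label-based alignment) can be sketched in a handful of lines. This is only a minimal illustration; the table contents and the column names (region, units, price) are made up for the example and do not come from the pandas site.

```python
import pandas as pd

# Two small tables with made-up data (hypothetical column names).
sales = pd.DataFrame({
    "region": ["east", "west", "east", "west"],
    "units": [10, 7, 3, 5],
})
prices = pd.DataFrame({
    "region": ["east", "west"],
    "price": [2.0, 4.0],
})

# Split-apply-combine: total units per region.
totals = sales.groupby("region", as_index=False)["units"].sum()

# Join the aggregated table against the price table.
merged = totals.merge(prices, on="region", how="left")
merged["revenue"] = merged["units"] * merged["price"]

# Label-based alignment: values are matched by index label, and a
# label missing on either side yields NaN instead of an error.
a = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])
b = pd.Series([10.0, 20.0], index=["y", "z"])
aligned = a + b

print(merged)
print(aligned)
```

Note that the Series addition never asks for positions: "x" has no partner in b, so it comes back as a missing value rather than being silently paired with the wrong row.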

Why not R?

First of all, we love open source R! It is the most widely-used open source environment for statistical modeling and graphics, and it provided some early inspiration for pandas features. R users will be pleased to find this library adopts some of the best concepts of R, like the foundational DataFrame (one user familiar with R has described pandas as “R data.frame on steroids”). But pandas also seeks to solve some frustrations common to R users:

  • R has barebones data alignment and indexing functionality, leaving much work to the user. pandas makes it easy and intuitive to work with messy, irregularly indexed data, like time series data. pandas also provides rich tools, like hierarchical indexing, not found in R;
  • R is not well-suited to general purpose programming and system development. pandas enables you to do large-scale data processing seamlessly when developing your production applications;
  • Hybrid systems connecting R to a low-productivity systems language like Java, C++, or C# suffer from significantly reduced agility and maintainability, and you’re still stuck developing the system components in a low-productivity language;
  • The “copyleft” GPL license of R can create concerns for commercial software vendors who want to distribute R with their software under another license. Python and pandas use more permissive licenses.

(3) datamatrix http://pypi.python.org/pypi/datamatrix/0.8

datamatrix 0.8

A Pythonic implementation of R’s data.frame structure.

Latest Version: 0.9

This module allows access to comma- or other delimiter-separated files as if they were tables, using a dictionary-like syntax. DataMatrix objects can be manipulated, rows and columns added and removed, or even transposed.
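To give a feel for what dictionary-like access to a delimited file means, here is a rough sketch using only the Python standard library. It does not use the datamatrix API (which is not shown here); read_columns and transpose are hypothetical helper names, and the CSV content is made up.

```python
import csv
import io

# Made-up CSV content standing in for a file on disk.
raw = "name,height,weight\nann,160,55\nbob,175,70\n"

def read_columns(handle, delimiter=","):
    """Read a delimited file into a dict mapping column name -> list of values."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            columns[name].append(value)
    return columns

def transpose(columns, key):
    """Turn the table on its side: one dict per former row, keyed by `key`."""
    names = [n for n in columns if n != key]
    return {
        row_id: {n: columns[n][i] for n in names}
        for i, row_id in enumerate(columns[key])
    }

table = read_columns(io.StringIO(raw))

# Add a derived column and remove an existing one, dictionary-style.
table["height_m"] = [int(h) / 100 for h in table["height"]]
del table["weight"]

by_name = transpose(table, "name")
print(by_name["bob"])
```

Values come back as strings, as they do from any CSV reader, so numeric columns need an explicit conversion as in the height_m line.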

—————————————————————–

Modeling in Python


Different kinds of Clouds

Some slides I liked on cloud computing infrastructure as offered by Amazon, IBM, Google, Windows and Oracle

 

 

Time Series for Web Analytics

I am mostly language agnostic, though I dislike shoddy design in software (like SAS Enterprise Guide), shoddy websites (like the outdated design of the http://www.r-project.org/ site), and dishonest marketing that invents buzzwords (or, as they say, excessively dishonest marketing).

At the same time I love nicely designed software (Rattle, Rapid Miner, JMP), great websites for software (like http://rstudio.org/ ) and suitably targeted marketing (like IBM’s), and I appreciate intellectual honesty in a field where honest men are rare to find ( http://www.nytimes.com/2012/08/12/business/how-big-data-became-so-big-unboxed.html?_r=1&hpw ).

I digress. Here are some papers I find interesting to read.

RevoDeployR and commercial BI using R and R based cloud computing using Open CPU

Revolution Analytics has of course had RevoDeployR for a while, and in a webinar it strives to bring it back to the spotlight.

BI is a good, lucrative market, and visualization is a strength in R, so it is a matter of time before we have more R-based BI solutions. I really liked the two slides below for explaining RevoDeployR better to newbies like me (and many others!)

Integrating R into 3rd party and Web applications using RevoDeployR


(I still think someone should make a commercial version of Jeroen Ooms’s web interfaces and Jeff Horner’s web infrastructure (see below) for making customized Business Intelligence (BI) / data visualization solutions; UCLA and Vanderbilt are not exactly Stanford when it comes to deploying great academic solutions in the startup-tech world.) I kind of think Google or someone at Revolution should at least have a dekko at OpenCPU as a credible cloud solution in R.

I still can’t figure out whether Revolution Analytics has a cloud computing strategy, and Google seems to be working mysteriously, as usual, on broadening access to the Google Compute Cloud for the rest of the R community.

OpenCPU provides a free and open platform for statistical computing in the cloud. It is meant as an open, social analysis environment where people can share and run R functions and objects. For more details, visit the website: www.opencpu.org

and especially see

https://public.opencpu.org/userapps/opencpu/opencpu.demo/runcode/

Jeff Horner’s

http://rapache.net/

Jeroen Ooms’s

Latest R Journal

Including juicy stuff on using a cluster of Apple machines for grid computing, and seasonality forecasting (Yet Another Package For Time Series)

But I kind of liked Sumo too-

https://code.google.com/p/sumo/

Sumo is a fully-functional web application template that exposes an authenticated user’s R session within java server pages.

Sumo: An Authenticating Web Application with an Embedded R Session, by Timothy T. Bergsma and Michael S. Smith

Abstract: Sumo is a web application intended as a template for developers. It is distributed as a Java ‘war’ file that deploys automatically when placed in a Servlet container’s ‘webapps’ directory. If a user supplies proper credentials, Sumo creates a session-specific Secure Shell connection to the host and a user-specific R session over that connection. Developers may write dynamic server pages that make use of the persistent R session and user-specific file space.

and for Apple fanboys-

We created the xgrid package (Horton and Anoke, 2012) to provide a simple interface to this distributed computing system. The package facilitates use of an Apple Xgrid for distributed processing of a simulation with many independent repetitions, by simplifying job submission (or grid stuffing) and collation of results. It provides a relatively thin but useful layer between R and Apple’s ‘xgrid’ shell command, where the user constructs input scripts to be run remotely. A similar set of routines, optimized for parallel estimation of JAGS (just another Gibbs sampler) models is available within the runjags package (Denwood, 2010). However, with the exception of runjags, none of the previously mentioned packages support parallel computation over an Apple Xgrid.
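Since this blog hops between R and Python, here is a rough Python analogue of the pattern described above: a simulation with many independent repetitions, each submitted as a job and then collated. It is not the xgrid package or its API. Local threads stand in for grid nodes, and one_repetition and run_simulation are hypothetical names invented for the sketch.

```python
import random
import statistics
from concurrent.futures import ThreadPoolExecutor

def one_repetition(seed, n=1000):
    """One independent simulation run: the mean of n uniform draws."""
    rng = random.Random(seed)  # per-job seed keeps each run reproducible
    return statistics.fmean(rng.random() for _ in range(n))

def run_simulation(repetitions, workers=4):
    """Submit every repetition as a job, then collate results in job order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(one_repetition, range(repetitions)))

results = run_simulation(repetitions=20)
overall = statistics.fmean(results)
print(len(results), round(overall, 3))
```

Because every repetition carries its own seed and shares no state, the jobs can run in any order on any worker and still collate to the same list, which is the property that makes this kind of workload easy to farm out to a grid.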

Hmm, I guess parallel computing enabled by Wi-Fi on mobile phones would be awesome too! So would anything using iOS. See the rest of the R Journal at http://journal.r-project.org/current.html


SAS and Hadoop

Awesomely informative post in sascom magazine (whose editor I have interviewed before, here at http://www.decisionstats.com/interview-alison-bolen-sas-com/ )

Great piece by Michael Ames, SAS Data Integration Product Manager.

http://www.sas.com/news/sascom/hadoop-tips.html

 

Also see SAS’s big data thingys here at

http://www.sas.com/software/high-performance-analytics/in-memory-analytics/index.html

Solutions and Capabilities Using SAS® In-Memory Analytics

  • High-Performance Analytics – Get near-real-time insights with appliance-ready analytics software designed to tackle big data and complex problems.
  • High-Performance Risk – Faster, better risk management decisions based on the most up-to-date views of your overall risk exposure.
  • High-Performance Liquidity Risk Management – Take quick, decisive actions to secure adequate funding, especially in times of volatility.
  • High-Performance Stress Testing – Make faster, more precise decisions to protect the health of the firm.
  • Visual Analytics – Explore big data using in-memory capabilities to better understand all of your data, discover new patterns and publish reports to the Web and iPad®.

(Ajay: I liked the Visual Analytics piece especially for Big Data.)

Note-