3rd Meeting for New Delhi R users group #rstats

We continued for the third month without a sponsor for the R User Meetup group. I wish analytics companies in Gurgaon, Mumbai or even Hyderabad would help sponsor R user meetups, as we need more of a knowledge-sharing culture here.

I am a bit disappointed by people who RSVP yes but cancel without letting us know.

Some additional insights:

1) Never hold a meetup on a cricket match day.

2) The location should be very close to a Metro station.

3) Saturdays will be the only meetup day in future. If I am unavailable, Dr Amir will lead the meetup.

Some more action points:

1) Ajay and Dr Amir to work together on a blog / site called R for Medical Researchers, focused on Indian issues.

2) Web application developers, please see the shiny package. I have written about the latest version of Shiny, and I would also like you to click on each of the 18 links in that article. Shiny is great for statistically enabling web apps.


3) We will be uploading files and creating a SlideShare account for Delhi R Users. This will be our digital archive.

4) To start using R immediately, use Rattle (http://rattle.togaware.com/), especially if you build models or need to do data mining.

5) To start using R for GIS and spatial analysis, use Deducer with the Spatial Plot Builder plugin.

6) To integrate R with C++, see the Rcpp gallery at http://gallery.rcpp.org/ . Almost 100 packages depend on Rcpp. This is useful for developers who want faster R code by moving parts of it to C++.

7) Web Analytics using R is now updated for changes in the Google APIs: http://decisionstats.com/2013/01/28/google-analytics-using-rstats-updated/

8) If you are new to programming, or to Ruby or Python, please go to http://codeacdemy.com .

9) If you are a beginner in R, use http://tryr.codeschool.com

Minutes of Meeting, courtesy Dr Amir

Four people (Ajay, myself, Ashish and Nagesh) joined the 3rd meetup today at around 4 pm at Hauz Khas Village. Ashish and Nagesh are two new members who attended the meetup for the first time. Ashish is in web development, with knowledge of Python and Perl. Today he introduced the group to scraping data from the web using Perl. Nagesh has a background in management and currently works with an insurance company. He currently uses SAS and is planning to switch to R because it is free. Ajay introduced GUIs such as Rattle and Deducer; books (R for Business Analytics, which he authored himself, and others such as R for SPSS and SAS Users, Data Mining with R, Rattle); and GADM for mapping. I introduced a Coursera course that is currently ongoing and uses R, as well as the basics of R such as packages, functions and the strengths of R.
Ashish has volunteered to find a place for meetup in Green Park near the metro station as we are still struggling to get a decent place for our meetups.
Important links:
https://www.coursera.org/course/dataanalysis (coursera course which uses R)
http://www.statmethods.net/index.html  (Quick R site for help)

Also – here is the meetup group for New Delhi R Users


Data Frame in Python

Exploring some Python and R packages that let you move between and work with both Python and R without melting your brain or exceeding your project deadline


If you like the data.frame structure in R, you have several ways to work with a similar structure, at faster processing speeds, in Python.

Here are three packages that enable you to do so:

(1) pydataframe http://code.google.com/p/pydataframe/

An implementation of an almost R-like DataFrame object. (Install via PyPI/pip: “pip install pydataframe”)


        # Arguments: column data, column order, row names
        u = DataFrame({"Field1": [1, 2, 3],
                       "Field2": ['abc', 'def', 'hgi']},
                      ['Field1', 'Field2'],
                      ["rowOne", "rowTwo", "thirdRow"])

A DataFrame is basically a table with rows and columns.

Columns are named; rows are numbered (but can be named) and can be easily selected and calculated upon. Internally, columns are stored as 1-D numpy arrays. If you set row names, they are converted into a dictionary for fast access. There is a rich subselection/slicing API, see help(DataFrame.get_item) (it also works for setting values). Please note that any slice gets you another DataFrame; to access individual entries, use get_row(), get_column() or get_value().

DataFrames also understand basic arithmetic: you can add (multiply, ...) either a constant value or another DataFrame of the same size with the same column names, like this:

#multiply every value in ColumnA that is smaller than 5 by 6.
my_df[my_df[:,'ColumnA'] < 5, 'ColumnA'] *= 6

#you always need to specify both row and column selectors, use : to mean everything
my_df[:, 'ColumnB'] = my_df[:,'ColumnA'] + my_df[:, 'ColumnC']

#let's take every row that starts with Shu in ColumnA and replace it with a new list (comprehension)
select = my_df.where(lambda row: row['ColumnA'].startswith('Shu'))
my_df[select, 'ColumnA'] = [row['ColumnA'].replace('Shu', 'Sha') for row in my_df[select,:].iter_rows()]

DataFrames talk directly to R via rpy2 (rpy2 is not a prerequisite for the library!)


(2) pandas http://pandas.pydata.org/

Library Highlights

  • A fast and efficient DataFrame object for data manipulation with integrated indexing;
  • Tools for reading and writing data between in-memory data structures and different formats: CSV and text files, Microsoft Excel, SQL databases, and the fast HDF5 format;
  • Intelligent data alignment and integrated handling of missing data: gain automatic label-based alignment in computations and easily manipulate messy data into an orderly form;
  • Flexible reshaping and pivoting of data sets;
  • Intelligent label-based slicing, fancy indexing, and subsetting of large data sets;
  • Columns can be inserted and deleted from data structures for size mutability;
  • Aggregating or transforming data with a powerful group by engine allowing split-apply-combine operations on data sets;
  • High performance merging and joining of data sets;
  • Hierarchical axis indexing provides an intuitive way of working with high-dimensional data in a lower-dimensional data structure;
  • Time series-functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging. Even create domain-specific time offsets and join time series without losing data;
  • The library has been ruthlessly optimized for performance, with critical code paths compiled to C;
  • Python with pandas is in use in a wide variety of academic and commercial domains, including Finance, Neuroscience, Economics, Statistics, Advertising, Web Analytics, and more.
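A few of the highlights above can be illustrated with a minimal sketch. It mirrors the pydataframe operations shown earlier (conditional update, column arithmetic) and adds a group-by. The data and column names here are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Hypothetical example data; the column names are made up for illustration.
df = pd.DataFrame(
    {
        "city": ["Delhi", "Mumbai", "Delhi", "Mumbai"],
        "visits": [3, 8, 4, 10],
        "signups": [1, 2, 2, 5],
    }
)

# Multiply every value in "visits" that is smaller than 5 by 6.
df.loc[df["visits"] < 5, "visits"] *= 6

# Create a new column from two existing ones.
df["total"] = df["visits"] + df["signups"]

# Split-apply-combine: sum each numeric column per city.
per_city = df.groupby("city")[["visits", "signups", "total"]].sum()
print(per_city)
```

Unlike pydataframe, pandas does not require both a row and a column selector in every expression; `df["visits"]` alone selects a whole column.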

Why not R?

First of all, we love open source R! It is the most widely-used open source environment for statistical modeling and graphics, and it provided some early inspiration for pandas features. R users will be pleased to find this library adopts some of the best concepts of R, like the foundational DataFrame (one user familiar with R has described pandas as “R data.frame on steroids”). But pandas also seeks to solve some frustrations common to R users:

  • R has barebones data alignment and indexing functionality, leaving much work to the user. pandas makes it easy and intuitive to work with messy, irregularly indexed data, like time series data. pandas also provides rich tools, like hierarchical indexing, not found in R;
  • R is not well-suited to general purpose programming and system development. pandas enables you to do large-scale data processing seamlessly when developing your production applications;
  • Hybrid systems connecting R to a low-productivity systems language like Java, C++, or C# suffer from significantly reduced agility and maintainability, and you’re still stuck developing the system components in a low-productivity language;
  • The “copyleft” GPL license of R can create concerns for commercial software vendors who want to distribute R with their software under another license. Python and pandas use more permissive licenses.
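The data alignment point in the list above can be sketched as follows, assuming two small, hypothetical time series whose dates only partly overlap. The values and dates are invented for illustration.

```python
import pandas as pd

# Two hypothetical daily series that cover different, partly overlapping dates.
a = pd.Series(
    [1.0, 2.0, 3.0],
    index=pd.to_datetime(["2013-01-01", "2013-01-02", "2013-01-03"]),
)
b = pd.Series(
    [10.0, 20.0],
    index=pd.to_datetime(["2013-01-02", "2013-01-04"]),
)

# Addition aligns on the date labels, not on position;
# dates present in only one series become NaN in the result.
aligned = a + b

# Treat a missing observation as 0 instead of propagating NaN.
filled = a.add(b, fill_value=0)

print(aligned)
print(filled)
```

The result covers all four dates, and only 2013-01-02, the one date present in both series, gets a real sum in `aligned`.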

(3) datamatrix http://pypi.python.org/pypi/datamatrix/0.8

A Pythonic implementation of R’s data.frame structure.

Latest Version: 0.9

This module allows access to comma-separated or otherwise delimited files as if they were tables, using a dictionary-like syntax. DataMatrix objects can be manipulated: rows and columns can be added and removed, and the whole table can even be transposed.
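To make the idea concrete, here is a rough sketch of dictionary-like access to a delimited table using only the Python standard library. It does not use datamatrix's own API, which is not shown here; the data and all names are hypothetical.

```python
import csv
import io

# Hypothetical tab-separated data standing in for a file on disk.
raw = "name\tgroup\tscore\nalpha\ta\t1\nbeta\tb\t2\n"

# Read the rows, then store the table as {column name: list of values}.
reader = csv.DictReader(io.StringIO(raw), delimiter="\t")
rows = list(reader)
table = {name: [row[name] for row in rows] for name in reader.fieldnames}

# Dictionary-like column access.
print(table["group"])

# Add a derived column and remove an existing one.
table["is_a"] = [g == "a" for g in table["group"]]
del table["score"]

# Transpose: one list per row instead of one list per column.
transposed = [list(record) for record in zip(*table.values())]
print(transposed)
```

Storing the table column-wise makes column access and column addition cheap, while the `zip(*...)` idiom gives the row-wise view when needed.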

