Try and learn R – for Free

A free online course for learning R (sponsored by O’Reilly)

http://tryr.codeschool.com/

Table of Contents

  1. R Syntax: A gentle introduction to R expressions, variables, and functions
  2. Vectors: Grouping values into vectors, then doing arithmetic and graphs with them
  3. Matrices: Creating and graphing two-dimensional data sets
  4. Summary Statistics: Calculating and plotting some basic statistics: mean, median, and standard deviation
  5. Factors: Creating and plotting categorized data
  6. Data Frames: Organizing values into data frames, loading frames from files and merging them
  7. Working With Real-World Data: Testing for correlation between data sets, linear models and installing additional packages


 

New Delhi R User group meets up

Inspired by David Smith’s blog post at http://blog.revolutionanalytics.com/2012/10/r-user-group-sponsorship-applications-open-for-2013.html I set up a meetup group for New Delhi at http://www.meetup.com/New-Delhi-R-UseR-Group/ (to my surprise, India had only one R user meetup group before this, in Bangalore). The first meeting was awesome. We met in a cafe, and the plan going forward is to cover cross-domain learning and collaboration on tools, startups, mashups and training.

Hopefully we can reach out to analytics enthusiasts in Mumbai and Chennai to help kickstart R user groups there. Indian companies like Mu Sigma have been using R more and more in analytics (offshoring). You can even use the sponsorship from Revolution Analytics to start your own meetup group, and Meetup.com gives you a 50% discount if you pay 6 months in advance. Given the big Indian presence of Oracle, IBM and Google, I hope they lend a hand to R user groups in India as well.

Running R on Windows Azure #rstats #cloud

Here is a brief tutorial on running R on the Windows Azure cloud (the OS is Windows in this case, but four kinds of Linux are also available).

There is a free 90-day trial, so you can run R on the cloud for free (since Google Cloud Compute is still in a closed, hush-hush beta).

Go to https://www.windowsazure.com/en-us/pricing/free-trial/

Data Frame in Python

Exploring some Python packages and R packages that let you move between, and work with, both Python and R without melting your brain or exceeding your project deadline

—————————————

If you liked the data.frame structure in R, there are ways to work with a similar structure in Python, at a faster processing speed.

Here are three packages that enable you to do so:

(1) pydataframe http://code.google.com/p/pydataframe/

An implementation of an almost R-like DataFrame object (install via PyPI/pip: “pip install pydataframe”).

Usage:

        u = DataFrame({"Field1": [1, 2, 3],
                       "Field2": ['abc', 'def', 'hgi']},
                      # the next two arguments are optional:
                      ['Field1', 'Field2'],                  # column order
                      ["rowOne", "rowTwo", "thirdRow"])      # row names

A DataFrame is basically a table with rows and columns.

Columns are named, rows are numbered (but can be named), and both can be easily selected and calculated upon. Internally, columns are stored as 1-D NumPy arrays. If you set row names, they’re converted into a dictionary for fast access. There is a rich subselection/slicing API; see help(DataFrame.get_item) (it also works for setting values). Please note that any slice gets you another DataFrame; to access individual entries use get_row(), get_column(), get_value().

DataFrames also understand basic arithmetic and you can either add (multiply,…) a constant value, or another DataFrame of the same size / with the same column names, like this:

#multiply every value in ColumnA that is smaller than 5 by 6.
my_df[my_df[:,'ColumnA'] < 5, 'ColumnA'] *= 6

#you always need to specify both row and column selectors, use : to mean everything
my_df[:, 'ColumnB'] = my_df[:,'ColumnA'] + my_df[:, 'ColumnC']

#let's take every row that starts with Shu in ColumnA and replace it with a new list (comprehension)
select = my_df.where(lambda row: row['ColumnA'].startswith('Shu'))
my_df[select, 'ColumnA'] = [row['ColumnA'].replace('Shu', 'Sha') for row in my_df[select,:].iter_rows()]

DataFrames talk directly to R via rpy2 (rpy2 is not a prerequisite for the library!)

 

(2) pandas http://pandas.pydata.org/

Library Highlights

  • A fast and efficient DataFrame object for data manipulation with integrated indexing;
  • Tools for reading and writing data between in-memory data structures and different formats: CSV and text files, Microsoft Excel, SQL databases, and the fast HDF5 format;
  • Intelligent data alignment and integrated handling of missing data: gain automatic label-based alignment in computations and easily manipulate messy data into an orderly form;
  • Flexible reshaping and pivoting of data sets;
  • Intelligent label-based slicing, fancy indexing, and subsetting of large data sets;
  • Columns can be inserted and deleted from data structures for size mutability;
  • Aggregating or transforming data with a powerful group by engine allowing split-apply-combine operations on data sets;
  • High performance merging and joining of data sets;
  • Hierarchical axis indexing provides an intuitive way of working with high-dimensional data in a lower-dimensional data structure;
  • Time series-functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging. Even create domain-specific time offsets and join time series without losing data;
  • The library has been ruthlessly optimized for performance, with critical code paths compiled to C;
  • Python with pandas is in use in a wide variety of academic and commercial domains, including Finance, Neuroscience, Economics, Statistics, Advertising, Web Analytics, and more.

Why not R?

First of all, we love open source R! It is the most widely-used open source environment for statistical modeling and graphics, and it provided some early inspiration for pandas features. R users will be pleased to find this library adopts some of the best concepts of R, like the foundational DataFrame (one user familiar with R has described pandas as “R data.frame on steroids”). But pandas also seeks to solve some frustrations common to R users:

  • R has barebones data alignment and indexing functionality, leaving much work to the user. pandas makes it easy and intuitive to work with messy, irregularly indexed data, like time series data. pandas also provides rich tools, like hierarchical indexing, not found in R;
  • R is not well-suited to general purpose programming and system development. pandas enables you to do large-scale data processing seamlessly when developing your production applications;
  • Hybrid systems connecting R to a low-productivity systems language like Java, C++, or C# suffer from significantly reduced agility and maintainability, and you’re still stuck developing the system components in a low-productivity language;
  • The “copyleft” GPL license of R can create concerns for commercial software vendors who want to distribute R with their software under another license. Python and pandas use more permissive licenses.
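A few of the highlights listed above can be tried in a handful of lines. This is only a minimal sketch, assuming pandas is installed; the sample data and the column names (city, sales) are made up for illustration and do not come from the pandas site.

```python
import pandas as pd

# Hypothetical sample data; column names and row labels are illustrative only.
df = pd.DataFrame(
    {"city": ["Delhi", "Mumbai", "Delhi", "Chennai"],
     "sales": [3, 8, 4, 10]},
    index=["r1", "r2", "r3", "r4"],
)

# Label-based selection: multiply every sales value smaller than 5 by 6.
df.loc[df["sales"] < 5, "sales"] *= 6

# Split-apply-combine with group by: total sales per city.
totals = df.groupby("city")["sales"].sum()

# Automatic label alignment: a label present on only one side yields NaN.
a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20], index=["y", "z"])
aligned = a + b

print(df)
print(totals)
print(aligned)
```

The conditional update mirrors the first pydataframe example shown earlier, so R users can compare the two styles side by side.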

(3) datamatrix http://pypi.python.org/pypi/datamatrix/0.8


A Pythonic implementation of R’s data.frame structure.

Latest Version: 0.9

This module allows access to comma- or other delimiter-separated files as if they were tables, using a dictionary-like syntax. DataMatrix objects can be manipulated, rows and columns added and removed, or even transposed.
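To get a feel for the idea of treating a delimited file as a table with dictionary-like access, here is a rough sketch using only Python's standard csv module. It is not the datamatrix API (which is not shown here); the sample data and the column names are hypothetical.

```python
import csv
import io

# Hypothetical CSV content standing in for a file on disk.
raw = "name,score\nasha,7\nravi,9\n"

# Read each row as a dictionary keyed by the header line.
rows = list(csv.DictReader(io.StringIO(raw)))

# Column access by name.
scores = [int(row["score"]) for row in rows]

# Add a derived column to every row.
for row in rows:
    row["passed"] = int(row["score"]) >= 8

# Transpose: turn the list of rows into a dictionary of columns.
columns = {key: [row[key] for row in rows] for key in rows[0]}

print(scores)
print(columns)
```

A dedicated package saves you from writing this plumbing yourself, which is the point of the three packages above.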

—————————————————————–

Modeling in Python


R and Hadoop #rstats

Lovely ppt from the formidable Jeffrey Breen, whose lucid style in explaining R has made me a big fan of his awesome work!

Take a look at his extensive collection of Big Data with R slides at http://jeffreybreen.wordpress.com/2012/03/10/big-data-step-by-step-slides/ – they are both very comprehensive and a delightful resource for anyone wishing to go the cloud, Hadoop, R route.
His blog at http://jeffreybreen.wordpress.com/ covers lots of very relevant topics.

JMP Student Edition

I really liked the initiatives at JMP/Academic. They offer the software bundled with a textbook, which is both good common sense and good business sense, given how fast students can get confused.

(Rant 1: Bundling with textbooks is something I think Revolution Analytics should think of doing, instead of just offering the academic version as a free download. It would be interesting to compare the penetration of the academic market by Revolution’s version and the open source version of R under the existing strategy.)

From http://www.jmp.com/academic/textbooks.shtml

Major publishers of introductory statistics textbooks offer a 12-month license to JMP Student Edition, a streamlined version of JMP, with their textbooks.

A glance through http://www.jmp.com/academic/pdf/jmp_se_comparison.pdf shows it is a credible version, not an extremely whittled-down one, which would be just dishonest.

And I loved this Reference Card at http://www.jmp.com/academic/pdf/jmp10_se_quick_guide.pdf

 

Oracle, SAP HANA, Revolution Analytics and even SAS/STAT itself could make more reference cards like this: elegant solutions for students and new learners!

More creative rants: honestly, why do corporate sites still use PDFs when they can use Instapaper, or any of the SlideShare/Scribd formats, to show information in a better way without diverting the user from the main webpage?

But I digress; back to JMP.

 

Resources for Faculty Using JMP® Student Edition

Faculty who select a JMP Student Edition bundle for their courses may be eligible for additional resources, including course materials and training.

Special JMP® Student Edition for AP Statistics

JMP Student Edition is available in a convenient five-year license for qualified Advanced Placement statistics programs.

Try and have a look yourself at http://www.jmp.com/academic/student.shtml

 

 

 
