SAS for R Users

I recently managed to get a copy of SAS University Edition.

1) Here are some problems I had to resolve. The download is a 1.5 GB zipped file (a virtual machine image). Since my broadband connection is based in India, it took many failed attempts before I could get it. The unzipped file is almost 3.5 GB. You can get the download file here: http://www.sas.com/en_us/software/university-edition/download-software.html

Secondly, the hardware needs to be 64-bit, so I upgraded my Dell computer. This was a useful upgrade for me anyway.

2) You can use an Internet download manager to resume downloading in case your Internet connection has trouble fetching a 1.5 GB file in one go. For Linux you can see http://flareget.com/download/

and for Windows http://www.internetdownloadmanager.com/download.html

3) I chose VM Player for Linux (the free desktop version) because I am much more comfortable with it. I got it (~200 MB) from here: https://my.vmware.com/web/vmware/free#desktop_end_user_computing/vmware_player/6_0

4) Finally I installed VM Player and used Open an Existing Virtual Machine to boot up SAS University Edition

I was able to open the SAS Studio at the IP Address provided.

5) I downloaded a dataset from this collection:

https://archive.ics.uci.edu/ml/datasets/Adult

6) Then I uploaded it into SAS Studio

7) Lastly I was able to run some basic commands

I was really impressed by the enhancements made to the interface: the ability to search command help through a drop-down, the color-coded editor and, of course, the case-insensitive SAS language. Though I am not a fan of the semicolon, I loved using Ctrl + / for easy commenting and uncommenting.

For a SAS-turned-R-turned-SAS coder, here are some views:
SAS has different windows for code, log and output. R generally has one.
SAS is case insensitive while R is case sensitive. This is a blessing, especially for variable and dataset names.
SAS deals with datasets, which can be considered the same as R's data frames.
R's flexibility in data types is not really matched by SAS, though SAS is quite fast.
SAS has a Macro Language for repeatable tasks.
SQL is embedded within SAS as Proc SQL, and in R through the sqldf package.
You have to pay for each upgrade in the SAS ecosystem. I am not clear on the pricing, on which component does what, or on whether there is a cloud option for renting by the hour. How about one web page that lists product descriptions and prices?
SAS University Edition is an OS-agnostic tool. For that alone it is quite impressive compared to, say, the Academic Edition of Revolution Analytics.
R is object oriented and uses [] and $ notation for sub-objects. SAS is divided into two main parts, data steps and proc steps, and uses the . notation and the var system.
The SAS language has a few basic procs but many, many options.
How good a SAS coder you are often depends on what you can do with data manipulation in the SAS data step.
Graphics are still better in R with ggplot. But the speed of SAS is thrilling.
RAM is limited to 1 GB in the University Edition, but I still found it quite fast. However, I can upload only a 10 MB file to SAS Studio in the University Edition, which I found reasonable for teaching purposes.
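
For readers coming from SAS, a few of the R-side points above (data frames, the [] and $ notation, and case sensitivity) can be sketched in base R. The data frame and its column names are made up for illustration. This is only a minimal sketch, not a translation of any SAS program.

```r
# Hypothetical data frame, playing the role of a small SAS dataset
flowers <- data.frame(
  Name   = c("a", "b", "c"),
  Length = c(4.9, 6.1, 5.8),
  stringsAsFactors = FALSE
)

# $ picks a column (a sub-object) by name
lengths_dollar <- flowers$Length

# [] picks rows and columns by position, name or condition
lengths_bracket <- flowers[, "Length"]
long_rows <- flowers[flowers$Length > 5, ]

# R is case sensitive: "length" is not the same column name as "Length"
has_upper <- "Length" %in% names(flowers)
has_lower <- "length" %in% names(flowers)
```

In SAS the two spellings of the variable name would refer to the same variable. In R only the exact spelling matches.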

NumFocus- The Python Statistical Community

I really liked the mature design and foundation of this charitable organization. While it is similar to FOAS in many ways (http://www.foastat.org/projects.html), I like the projects. They are excellent, and some of them should, I think, be featured in the Journal of Statistical Software (since there is a separate R Journal), unless it wants to be overtly R focused.

In the same manner, I think some non-Python projects should try to reach out to NumFOCUS (unless it wants to be so PyFocus-ed).

Here it is NumFocus

NumFOCUS supports and promotes world-class, innovative, open source scientific software. Most individual projects, even the wildly successful ones, find the overhead of a non-profit to be too large for their community to bear. NumFOCUS provides a critical service as an umbrella organization which removes the burden from the projects themselves to raise money.

Money donated through NumFOCUS goes to sponsor things like:

Coding sprints (food and travel)
Technical fellowships (sponsored students and mentors to work on code)
Equipment grants (to developers and projects)
Conference attendance for students (to PyData, SciPy, and other conferences)
Fees for continuous integration and other software engineering tools
Documentation development
Web-page hosting and bandwidth fees for projects

Core Projects

NumPy

NumPy is the fundamental package needed for scientific computing with Python. Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data. Arbitrary data-types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases. Repositories for NumPy binaries: http://www.lfd.uci.edu/~gohlke/pythonlibs/#numpy, a variety of versions – http://sourceforge.net/projects/numpy/files/NumPy/, version 1.6.1 – http://sourceforge.net/projects/numpy/files/NumPy/1.6.1/.

SciPy

SciPy is open-source software for mathematics, science, and engineering. It is also the name of a very popular conference on scientific programming with Python. The SciPy library depends on NumPy, which provides convenient and fast N-dimensional array manipulation. The SciPy library is built to work with NumPy arrays, and provides many user-friendly and efficient numerical routines such as routines for numerical integration and optimization.

Matplotlib

2D plotting library for Python that produces high quality figures that can be used in various hardcopy and interactive environments. matplotlib is compatible with Python scripts and the Python and IPython shells.

IPython

High quality open source Python shell that includes tools for high level and interactive parallel computing.

SymPy

SymPy is a Python library for symbolic mathematics. It aims to become a full-featured computer algebra system (CAS) while keeping the code as simple as possible in order to be comprehensible and easily extensible. SymPy is written entirely in Python and does not require any external libraries.

Other Projects

Cython

Cython is a language based on Pyrex that makes writing C extensions for Python as easy as writing them in Python itself. Cython supports calling C functions and declaring C types on variables and class attributes, allowing the compiler to generate very efficient C code from Cython code.

pandas

pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

PyTables

PyTables is a package for managing hierarchical datasets, designed to cope efficiently and easily with extremely large amounts of data. PyTables is built on top of the HDF5 library, using the Python language and the NumPy package. It features a Pythonic interface combined with C / Cython extensions for the performance-critical parts of the code. This makes it a fast, yet extremely easy to use tool for very large amounts of data. http://pytables.github.com/

scikit-image

Free, high-quality, peer-reviewed, volunteer-produced collection of algorithms for image processing.

scikit-learn

Module designed for scientific Python that provides accessible solutions to machine learning problems.

Scikits-Statsmodels

Statsmodels is a Python package that provides a complement to SciPy for statistical computations including descriptive statistics and estimation of statistical models.

Spyder

Interactive development environment for Python that features advanced editing, interactive testing, debugging and introspection capabilities, as well as a numerical computing environment made possible through the support of IPython, NumPy, SciPy, and matplotlib.

Theano

Theano is a Python library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently.

Associated Projects

NumFOCUS is currently looking for representatives to enable us to promote the following projects. For information contact us at: info@NumFOCUS.org.

Sage

Open source mathematics software system that combines existing open-source packages into a Python-based interface.

NetworkX

NetworkX is a Python language software package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks.

Python(X,Y)

Free scientific and engineering development software used for numerical computations, and analysis and visualization of data using the Python programming language.

Iris for Big Data #rstats #bigdata

Quote of the Day-

it is impossible to be a data scientist without knowing iris

#Anonymous #Quotes

Revolution Analytics has been nice enough to provide both datasets and code for analyzing Big Data in R.

~~http://www.revolutionanalytics.com/subscriptions/datasets/~~

http://packages.revolutionanalytics.com/datasets/

Site was updated so here are the new links

While the datasets collection is still elementary, as an R instructor I find this list extremely useful. However, I wish they would look at some other repositories and make .xdf and “tidy” csv versions. A little bit of RODBC usage should help, and so would some descriptions. Maybe they should partner with Quandl, DataMarket, or Infochimps on this initiative rather than do it alone.

Overall, there could be an R package for this (like a Big Data version of the famous datasets package in R).

But a nice and very useful effort

Revolution R Datasets

../
AirOnTime87to12/ 09-Nov-2013 00:46 –
AirOnTimeCSV2012/ 09-Nov-2013 00:30 –
AirOnTime2012.xdf 08-Nov-2013 18:08 190110335
AirOnTime7Pct.xdf 08-Nov-2013 17:42 103317987
AirlineData87to08.tar.gz 03-May-2013 21:05 5521408
AirlineData87to08.zip 09-May-2013 14:59 1802240
AirlineData87to08_11811.tar.gz 08-Nov-2013 03:27 1428527359
AirlineData87to08_83010.zip 08-Nov-2013 06:37 1477052425
AirlineDataSubsample.xdf 08-Nov-2013 07:27 390789536
Census5PCT2000.tar.gz 08-Nov-2013 10:55 871208970
Census5PCT2000.zip 08-Nov-2013 12:52 925929427
CensusUS5Pct2000.xdf 08-Nov-2013 21:27 1204906764
ccFraud.csv 23-Apr-2013 20:57 291737157
ccFraudScore.csv 23-Apr-2013 21:10 273848249
ccFraudScore10_CreateLoadTableQuotedColumns.fas..> 23-Apr-2013 21:10 981
ccFraud_CreateLoadTable_QuotedColumns.fastload 23-Apr-2013 21:10 984
index.php.txt 09-May-2013 22:17 3983
mortDefault.tar.gz 08-Nov-2013 12:59 61585580
mortDefault.zip 08-Nov-2013 13:08 63968310

More code-

http://blog.revolutionanalytics.com/2013/08/big-data-sets-for-r.html

Also a recent project made by a student of mine on Revolution Datasets and using their blog posts.

Forecasting analysis on us flights v1

Note how much better the above project is than one using the mini and super-clean datasets within R (like Boston).

Mini project boston housing dataset v1

Hat TIP- R’s very own Mr Smith

Unrelated-

For more on IRIS

Using ifelse in R for creating new variables #rstats #data #manipulation

The ifelse function is simple and powerful and can help with data manipulation in R. Here I create a categorical variable from specific values of a numeric variable.

> data(iris)

> iris$Type=ifelse(iris$Sepal.Length<5.8,"Small Flower","Big Flower")
> table(iris$Type)
  Big Flower Small Flower
          77           73

The parameters of ifelse are quite simple:

Usage

ifelse(test, yes, no)
Arguments

test
an object which can be coerced to logical mode.

yes
return values for true elements of test.

no
return values for false elements of test.
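
As a small sketch of the test, yes and no arguments, the example below uses a made-up numeric vector (not from the iris data) with arbitrary cut-offs. It also shows two behaviours worth knowing: a missing value in test stays missing in the result, and ifelse calls can be nested when more than two categories are needed.

```r
# Made-up values; the cut-offs below are arbitrary
x <- c(4.5, 5.8, 7.0, NA)

# Two categories: elements where the test is TRUE get "Small", the rest "Big"
two_way <- ifelse(x < 5.8, "Small", "Big")

# Three categories: nest a second ifelse in the 'no' argument
three_way <- ifelse(x < 5, "Small", ifelse(x < 6.5, "Medium", "Big"))

print(two_way)
print(three_way)
```

For many categories, nesting gets hard to read, and a function such as cut may be a better fit. That is a design choice rather than something covered in this post.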

So many R Packages Everywhere, which one do I use? #rstats

Some thoughts on R Packages

CRAN is no longer the sole repository for many useful R packages. Others include R-Forge, Google Code and, increasingly, GitHub.
CRAN lacks the flexibility and social aspect of GitHub.
CRAN Task Views are the only subject-wise listing of R packages. The categorization, however, is based more on methods than on use cases or business domains.
There are multiple R packages for the same thing. Which one do I use? Only Stack Overflow helps with that. There is no rating and no recommendation system.
The "packages suggested by an R package" feature needs better, automatic association analysis. Right now it is manual and depends on the package author and maintainer.
Quis custodiet ipsos custodes? Who guards the guardians of R packages? In an era of cyber security, we need better transparency on security measures within R packages, especially given the international nature of the project. I am very sure I (or anyone) can create R code to communicate discreetly, especially on Windows.
I would rather not install anything on my local machine, and instead read the package directly from CRAN. CRAN was designed in an era of low bandwidth, and this needs to be upgraded.
Note I am respectfully refraining from commenting on the atrocious aesthetics of the home website. Many statisticians see no use in making R user friendly. My professors at U Tenn (from which I dropped out after 2 semesters) were horrified when I took courses in graphic design, as I wanted to know more about the A and B that make up the A/B testing of statistical design. Now that I am getting older, I am horrified by the lack of HTML, CSS and jQuery skills among some of the brightest programmers in this project.
Please comment below.

Using R for Cricket Analysis #rstats #IPL

#Downloading the Data for batting across all formats of cricket
library(XML)
url="http://stats.espncricinfo.com/ci/engine/stats/index.html?class=11;template=results;type=batting"
tables=readHTMLTable(url,stringsAsFactors = F)
#Note we wrote stringsAsFactors=F in this to avoid getting factor variables, 
#since we will need to convert these variables to numeric variables
table2=tables$"Overall figures"
rm(tables)
#Creating new variables from Span
table2$Debut=as.numeric(substr(table2$Span,1,4))
table2$LastYr=as.numeric(substr(table2$Span,6,10))
table2$YrsPlayed=table2$LastYr-table2$Debut
#Creating New Variables. In cricket a not out score is denoted by * which can cause data quality errors.
#This is treated by grepl for finding and gsub for removing the *.
#Note the double \ to escape the regex character
table2$HSNotOut=grepl("\\*",table2$HS)
table2$HS2=gsub("\\*","",table2$HS)
#Creating a FOR Loop (!) to convert variables to numeric variables
for (i in 3:17) {
    table2[, i] <- as.numeric(table2[, i])
}

and we see why Sachin Tendulkar is the best (by using ggplot via Deducer)
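
The cleaning steps in the script above can also be tried offline on a toy table, without scraping anything. The rows below are made up and only mimic the shape of the scraped data (every column arrives as a string). The column names Span and HS come from the script, while the values and the name toy are assumptions for illustration.

```r
# Hypothetical rows, shaped like the scraped table (all columns are strings)
toy <- data.frame(
  Span = c("1989-2013", "1995-2012"),
  HS   = c("120*", "95"),
  stringsAsFactors = FALSE
)

# Split Span into debut year, last year and years played
toy$Debut     <- as.numeric(substr(toy$Span, 1, 4))
toy$LastYr    <- as.numeric(substr(toy$Span, 6, 9))
toy$YrsPlayed <- toy$LastYr - toy$Debut

# Flag not-out scores, then strip the * before converting to numeric
toy$HSNotOut <- grepl("\\*", toy$HS)
toy$HS2      <- as.numeric(gsub("\\*", "", toy$HS))

print(toy)
```

Converting the cleaned HS2 column, rather than the raw HS column, avoids the NA values that as.numeric produces for strings that still contain the *.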


Freakonomics Challenge-
1. Prove match fixing does not and cannot exist in IPL
2. Create an ideal fantasy team
