Home » Posts tagged 'datasets'

Tag Archives: datasets

Iris for Big Data #rstats #bigdata

Quote of the Day-

it is impossible to be a data scientist without knowing iris 

#Anonymous #Quotes

 

Revolution Analytics has been nice enough to provide both datasets and code for analyzing Big Data in R.

http://www.revolutionanalytics.com/subscriptions/datasets/

http://packages.revolutionanalytics.com/datasets/

Site was updated so here are the new links

 

while the Datasets collection is still elementary, as a R Instructor I find this list extremely useful. However I wish they look at some other repositories and make .xdf and “tidy” csv versions. A little bit of RODBC usage should help, and so will some descriptions. Maybe they should partner with Quandl, DataMarket, or Infochimps on this initiative than do it alone.

 

Overall there can be a R package (like a Big Data version of the famous datasets package in R)

But a nice and very useful effort

Revolution R Datasets

More code-

http://blog.revolutionanalytics.com/2013/08/big-data-sets-for-r.html

Also a recent project made by a student of mine on Revolution Datasets and using their blog posts.

Note how much more better the above project is than use the mini and super clean datasets within R (like Boston)

 

Hat TIP- R’s very own Mr Smith
Unrelated-
For more on IRIS

 

Data Frame in Python

Exploring some Python Packages and R packages to move /work with both Python and R without melting your brain or exceeding your project deadline

—————————————

If you liked the data.frame structure in R, you have some way to work with them at a faster processing speed in Python.

Here are three packages that enable you to do so-

(1) pydataframe http://code.google.com/p/pydataframe/

An implemention of an almost R like DataFrame object. (install via Pypi/Pip: “pip install pydataframe”)

Usage:

        u = DataFrame( { "Field1": [1, 2, 3],
                        "Field2": ['abc', 'def', 'hgi']},
                        optional:
                         ['Field1', 'Field2']
                         ["rowOne", "rowTwo", "thirdRow"])

A DataFrame is basically a table with rows and columns.

Columns are named, rows are numbered (but can be named) and can be easily selected and calculated upon. Internally, columns are stored as 1d numpy arrays. If you set row names, they’re converted into a dictionary for fast access. There is a rich subselection/slicing API, see help(DataFrame.get_item) (it also works for setting values). Please note that any slice get’s you another DataFrame, to access individual entries use get_row(), get_column(), get_value().

DataFrames also understand basic arithmetic and you can either add (multiply,…) a constant value, or another DataFrame of the same size / with the same column names, like this:

#multiply every value in ColumnA that is smaller than 5 by 6.
my_df[my_df[:,'ColumnA'] < 5, 'ColumnA'] *= 6

#you always need to specify both row and column selectors, use : to mean everything
my_df[:, 'ColumnB'] = my_df[:,'ColumnA'] + my_df[:, 'ColumnC']

#let's take every row that starts with Shu in ColumnA and replace it with a new list (comprehension)
select = my_df.where(lambda row: row['ColumnA'].startswith('Shu'))
my_df[select, 'ColumnA'] = [row['ColumnA'].replace('Shu', 'Sha') for row in my_df[select,:].iter_rows()]

Dataframes talk directly to R via rpy2 (rpy2 is not a prerequiste for the library!)

 

(2) pandas http://pandas.pydata.org/

Library Highlights

  • A fast and efficient DataFrame object for data manipulation with integrated indexing;
  • Tools for reading and writing data between in-memory data structures and different formats: CSV and text files, Microsoft Excel, SQL databases, and the fast HDF5 format;
  • Intelligent data alignment and integrated handling of missing data: gain automatic label-based alignment in computations and easily manipulate messy data into an orderly form;
  • Flexible reshaping and pivoting of data sets;
  • Intelligent label-based slicing, fancy indexing, and subsetting of large data sets;
  • Columns can be inserted and deleted from data structures for size mutability;
  • Aggregating or transforming data with a powerful group by engine allowing split-apply-combine operations on data sets;
  • High performance merging and joining of data sets;
  • Hierarchical axis indexing provides an intuitive way of working with high-dimensional data in a lower-dimensional data structure;
  • Time series-functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging. Even create domain-specific time offsets and join time series without losing data;
  • The library has been ruthlessly optimized for performance, with critical code paths compiled to C;
  • Python with pandas is in use in a wide variety of academic and commercial domains, including Finance, Neuroscience, Economics, Statistics, Advertising, Web Analytics, and more.

Why not R?

First of all, we love open source R! It is the most widely-used open source environment for statistical modeling and graphics, and it provided some early inspiration for pandas features. R users will be pleased to find this library adopts some of the best concepts of R, like the foundational DataFrame (one user familiar with R has described pandas as “R data.frame on steroids”). But pandas also seeks to solve some frustrations common to R users:

  • R has barebones data alignment and indexing functionality, leaving much work to the user. pandas makes it easy and intuitive to work with messy, irregularly indexed data, like time series data. pandas also provides rich tools, like hierarchical indexing, not found in R;
  • R is not well-suited to general purpose programming and system development. pandas enables you to do large-scale data processing seamlessly when developing your production applications;
  • Hybrid systems connecting R to a low-productivity systems language like Java, C++, or C# suffer from significantly reduced agility and maintainability, and you’re still stuck developing the system components in a low-productivity language;
  • The “copyleft” GPL license of R can create concerns for commercial software vendors who want to distribute R with their software under another license. Python and pandas use more permissive licenses.

(3) datamatrix http://pypi.python.org/pypi/datamatrix/0.8

datamatrix 0.8

A Pythonic implementation of R’s data.frame structure.

Latest Version: 0.9

This module allows access to comma- or other delimiter separated files as if they were tables, using a dictionary-like syntax. DataMatrix objects can be manipulated, rows and columns added and removed, or even transposed

—————————————————————–

Modeling in Python

(more…)

Google Cloud is finally here

Amazon gets some competition, and customers should see some relief, unless Google withdraws commitment on these products after a few years of trying (like it often does now!)

 

http://cloud.google.com/products/index.html

Machine Type Pricing
Configuration Virtual Cores Memory GCEU * Local disk Price/Hour $/GCEU/hour
n1-standard-1-d 1 3.75GB *** 2.75 420GB *** $0.145 0.053
n1-standard-2-d 2 7.5GB 5.5 870GB $0.29 0.053
n1-standard-4-d 4 15GB 11 1770GB $0.58 0.053
n1-standard-8-d 8 30GB 22 2 x 1770GB $1.16 0.053
Network Pricing
Ingress Free
Egress to the same Zone. Free
Egress to a different Cloud service within the same Region. Free
Egress to a different Zone in the same Region (per GB) $0.01
Egress to a different Region within the US $0.01 ****
Inter-continental Egress At Internet Egress Rate
Internet Egress (Americas/EMEA destination) per GB
0-1 TB in a month $0.12
1-10 TB $0.11
10+ TB $0.08
Internet Egress (APAC destination) per GB
0-1 TB in a month $0.21
1-10 TB $0.18
10+ TB $0.15
Persistent Disk Pricing
Provisioned space $0.10 GB/month
Snapshot storage** $0.125 GB/month
IO Operations $0.10 per million
IP Address Pricing
Static IP address (assigned but unused) $0.01 per hour
Ephemeral IP address (attached to instance) Free
* GCEU is Google Compute Engine Unit — a measure of computational power of our instances based on industry benchmarks; review the GCEU definition for more information
** coming soon
*** 1GB is defined as 2^30 bytes
**** promotional pricing; eventually will be charged at internet download rates

Google Prediction API

Tap into Google’s machine learning algorithms to analyze data and predict future outcomes.

Leverage machine learning without the complexity
Use the familiar RESTful interface
Enter input in any format – numeric or text

Build smart apps

Learn how you can use Prediction API to build customer sentiment analysis, spam detection or document and email classification.

Google Translation API

Use Google Translate API to build multilingual apps and programmatically translate text in your webpage or application.

Translate text into other languages programmatically
Use the familiar RESTful interface
Take advantage of Google’s powerful translation algorithms

Build multilingual apps

Learn how you can use Translate API to build apps that can programmatically translate text in your applications or websites.

Google BigQuery

Analyze Big Data in the cloud using SQL and get real-time business insights in seconds using Google BigQuery. Use a fully-managed data analysis service with no servers to install or maintain.
Figure

Reliable & Secure

Complete peace of mind as your data is automatically replicated across multiple sites and secured using access control lists.
Scale infinitely

You can store up to hundreds of terabytes, paying only for what you use.
Blazing fast

Run ad hoc SQL queries on
multi-terabyte datasets in seconds.

Google App Engine

Create apps on Google’s platform that are easy to manage and scale. Benefit from the same systems and infrastructure that power Google’s applications.

Focus on your apps

Let us worry about the underlying infrastructure and systems.
Scale infinitely

See your applications scale seamlessly from hundreds to millions of users.
Business ready

Premium paid support and 99.95% SLA for business users

New RCommander with ggplot #rstats

 

My favorite GUI (or one of them) R Commander has a relatively new plugin called KMGGplot2. Until now Deducer was the only GUI with ggplot features , but the much lighter and more popular R Commander has been a long champion in people wanting to pick up R quickly.

 

http://cran.r-project.org/web/packages/RcmdrPlugin.KMggplot2/

RcmdrPlugin.KMggplot2: Rcmdr Plug-In for Kaplan-Meier Plot and Other Plots by Using the ggplot2 Package

 

As you can see by the screenshot- it makes ggplot even easier for people (like R  newbies and experienced folks alike)

 

This package is an R Commander plug-in for Kaplan-Meier plot and other plots by using the ggplot2 package.

Version: 0.1-0
Depends: R (≥ 2.15.0), stats, methods, grid, Rcmdr (≥ 1.8-4), ggplot2 (≥ 0.9.1)
Imports: tcltk2 (≥ 1.2-3), RColorBrewer (≥ 1.0-5), scales (≥ 0.2.1), survival (≥ 2.36-14)
Published: 2012-05-18
Author: Triad sou. and Kengo NAGASHIMA
Maintainer: Triad sou. <triadsou at gmail.com>
License: GPL-2
CRAN checks: RcmdrPlugin.KMggplot2 results

 

----------------------------------------------------------------
NEWS file for the RcmdrPlugin.KMggplot2 package
----------------------------------------------------------------

----------------------------------------------------------------

Changes in version 0.1-0 (2012-05-18)

 o Restructuring implementation approach for efficient
   maintenance.
 o Added options() for storing package specific options (e.g.,
   font size, font family, ...).
 o Added a theme: theme_simple().
 o Added a theme element: theme_rect2().
 o Added a list box for facet_xx() functions in some menus
   (Thanks to Professor Murtaza Haider).
 o Kaplan-Meier plot: added confidence intervals.
 o Box plot: added violin plots.
 o Bar chart for discrete variables: deleted dynamite plots.
 o Bar chart for discrete variables: added stacked bar charts.
 o Scatter plot matrix: added univariate plots at diagonal
   positions (ggplot2::plotmatrix).
 o Deleted the dummy data for histograms, which is large in
   size.

----------------------------------------------------------------

Changes in version 0.0-4 (2011-07-28)

 o Fixed "scale_y_continuous(formatter = "percent")" to
   "scale_y_continuous(labels = percent)" for ggplot2
   (>= 0.9.0).
 o Fixed "legend = FALSE" to "show_guide = FALSE" for
   ggplot2 (>= 0.9.0).
 o Fixed the DESCRIPTION file for ggplot2 (>= 0.9.0) dependency.

----------------------------------------------------------------

Changes in version 0.0-3 (2011-07-28; FIRST RELEASE VERSION)

 o Kaplan-Meier plot: Show no. at risk table on outside.
 o Histogram: Color coding.
 o Histogram: Density estimation.
 o Q-Q plot: Create plots based on a maximum likelihood estimate
   for the parameters of the selected theoretical distribution.
 o Q-Q plot: Create plots based on a user-specified theoretical
   distribution.
 o Box plot / Errorbar plot: Box plot.
 o Box plot / Errorbar plot: Mean plus/minus S.D.
 o Box plot / Errorbar plot: Mean plus/minus S.D. (Bar plot).
 o Box plot / Errorbar plot: 95 percent Confidence interval
   (t distribution).
 o Box plot / Errorbar plot: 95 percent Confidence interval
   (bootstrap).
 o Scatter plot: Fitting a linear regression.
 o Scatter plot: Smoothing with LOESS for small datasets or GAM
   with a cubic regression basis for large data.
 o Scatter plot matrix: Fitting a linear regression.
 o Scatter plot matrix: Smoothing with LOESS for small datasets
   or GAM with a cubic regression basis for large data.
 o Line chart: Normal line chart.
 o Line chart: Line char with a step function.
 o Line chart: Area plot.
 o Pie chart: Pie chart.
 o Bar chart for discrete variables: Bar chart for discrete
   variables.
 o Contour plot: Color coding.
 o Contour plot: Heat map.
 o Distribution plot: Normal distribution.
 o Distribution plot: t distribution.
 o Distribution plot: Chi-square distribution.
 o Distribution plot: F distribution.
 o Distribution plot: Exponential distribution.
 o Distribution plot: Uniform distribution.
 o Distribution plot: Beta distribution.
 o Distribution plot: Cauchy distribution.
 o Distribution plot: Logistic distribution.
 o Distribution plot: Log-normal distribution.
 o Distribution plot: Gamma distribution.
 o Distribution plot: Weibull distribution.
 o Distribution plot: Binomial distribution.
 o Distribution plot: Poisson distribution.
 o Distribution plot: Geometric distribution.
 o Distribution plot: Hypergeometric distribution.
 o Distribution plot: Negative binomial distribution.

Software Review- BigML.com – Machine Learning meets the Cloud

I had a chance to dekko the new startup BigML https://bigml.com/ and was suitably impressed by the briefing and my own puttering around the site. Here is my review-

1) The website is very intutively designed- You can create a dataset from an uploaded file in one click and you can create a Decision Tree model in one click as well. I wish other cloud computing websites like  Google Prediction API make design so intutive and easy to understand. Also unlike Google Prediction API, the models are not black box models, but have a description which can be understood.

2) It includes some well known data sources for people trying it out. They were kind enough to offer 5 invite codes for readers of Decisionstats ( if you want to check it yourself, use the codes below the post, note they are one time only , so the first five get the invites.

BigML is still invite only but plan to get into open release soon.

3) Data Sources can only be by uploading files (csv) but they plan to change this hopefully to get data from buckets (s3? or Google?) and from URLs.

4) The one click operation to convert data source into a dataset shows a histogram (distribution) of individual variables.The back end is clojure , because the team explained it made the easiest sense and fit with Java. The good news (?) is you would never see the clojure code at the back end. You can read about it from http://clojure.org/

As cloud computing takes off (someday) I expect clojure popularity to take off as well.

Clojure is a dynamic programming language that targets the Java Virtual Machine (and the CLR, and JavaScript). It is designed to be a general-purpose language, combining the approachability and interactive development of a scripting language with an efficient and robust infrastructure for multithreaded programming. Clojure is a compiled language – it compiles directly to JVM bytecode, yet remains completely dynamic. Every feature supported by Clojure is supported at runtime. Clojure provides easy access to the Java frameworks, with optional type hints and type inference, to ensure that calls to Java can avoid reflection.

Clojure is a dialect of Lisp

 

5) As of now decision trees is the only distributed algol, but they expect to roll out other machine learning stuff soon. Hopefully this includes regression (as logit and linear) and k means clustering. The trees are created and pruned in real time which gives a slightly animated (and impressive effect). and yes model building is an one click operation.

The real time -live pruning is really impressive and I wonder why /how it can ever be replicated in other software based on desktop, because of the sheer interactive nature.

 

Making the model is just half the work. Creating predictions and scoring the model is what is really the money-earner. It is one click and customization is quite intuitive. It is not quite PMML compliant yet so I hope some Zemanta like functionality can be added so huge amounts of models can be applied to predictions or score data in real time.

 

If you are a developer/data hacker, you should check out this section too- it is quite impressive that the designers of BigML have planned for API access so early.

https://bigml.com/developers

BigML.io gives you:

  • Secure programmatic access to all your BigML resources.
  • Fully white-box access to your datasets and models.
  • Asynchronous creation of datasets and models.
  • Near real-time predictions.

 

Note: For your convenience, some of the snippets below include your real username and API key.

Please keep them secret.

REST API

BigML.io conforms to the design principles of Representational State Transfer (REST)BigML.io is enterely HTTP-based.

BigML.io gives you access to four basic resources: SourceDatasetModel and Prediction. You cancreatereadupdate, and delete resources using the respective standard HTTP methods: POSTGET,PUT and DELETE.

All communication with BigML.io is JSON formatted except for source creation. Source creation is handled with a HTTP PUT using the “multipart/form-data” content-type

HTTPS

All access to BigML.io must be performed over HTTPS

and https://bigml.com/developers/quick_start ( In think an R package which uses JSON ,RCurl  would further help in enhancing ease of usage).

 

Summary-

Overall a welcome addition to make software in the real of cloud computing and statistical computation/business analytics both easy to use and easy to deploy with fail safe mechanisms built in.

Check out https://bigml.com/ for yourself to see.

The invite codes are here -one time use only- first five get the invites- so click and try your luck, machine learning on the cloud.

If you dont get an invite (or it is already used, just leave your email there and wait a couple of days to get approval)

  1. https://bigml.com/accounts/register/?code=E1FE7
  2. https://bigml.com/accounts/register/?code=09991
  3. https://bigml.com/accounts/register/?code=5367D
  4. https://bigml.com/accounts/register/?code=76EEF
  5. https://bigml.com/accounts/register/?code=742FD

Book Review- Machine Learning for Hackers

This is review of the fashionably named book Machine Learning for Hackers by Drew Conway and John Myles White (O’Reilly ). The book is about hacking code in R.

 

The preface introduces the reader to the authors conception of what machine learning and hacking is all about. If the name of the book was machine learning for business analytsts or data miners, I am sure the content would have been unchanged though the popularity (and ambiguity) of the word hacker can often substitute for its usefulness. Indeed the many wise and learned Professors of statistics departments through out the civilized world would be mildly surprised and bemused by their day to day activities as hacking or teaching hackers. The book follows a case study and example based approach and uses the GGPLOT2 package within R programming almost to the point of ignoring any other native graphics system based in R. It can be quite useful for the aspiring reader who wishes to understand and join the booming market for skilled talent in statistical computing.

Chapter 1 has a very useful set of functions for data cleansing and formatting. It walks you through the basics of formatting based on dates and conditions, missing value and outlier treatment and using ggplot package in R for graphical analysis. The case study used is an Infochimps dataset with 60,000 recordings of UFO sightings. The case study is lucid, and done at a extremely helpful pace illustrating the powerful and flexible nature of R functions that can be used for data cleansing.The chapter mentions text editors and IDEs but fails to list them in a tabular format, while listing several other tables like Packages used in the book. It also jumps straight from installation instructions to functions in R without getting into the various kinds of data types within R or specifying where these can be referenced from. It thus assumes a higher level of basic programming understanding for the reader than the average R book.

Chapter 2 discusses data exploration, and has a very clear set of diagrams that explain the various data summary operations that are performed routinely. This is an innovative approach and will help students or newcomers to the field of data analysis. It introduces the reader to type determination functions, as well different kinds of encoding. The introduction to creating functions is quite elegant and simple , and numerical summary methods are explained adequately. While the chapter explains data exploration with the help of various histogram options in ggplot2 , it fails to create a more generic framework for data exploration or rules to assist the reader in visual data exploration in non standard data situations. While the examples are very helpful for a reader , there needs to be slightly more depth to step out of the example and into a framework for visual data exploration (or references for the same). A couple of case studies however elaborately explained cannot do justice to the vast field of data exploration and especially visual data exploration.

Chapter 3 discussed binary classification for the specific purpose for spam filtering using a dataset from SpamAssassin. It introduces the reader to the naïve Bayes classifier and the principles of text mining suing the tm package in R. Some of the example codes could have been better commented for easier readability in the book. Overall it is quite a easy tutorial for creating a naïve Bayes classifier even for beginners.

Chapter 4 discusses the issues in importance ranking and creating recommendation systems specifically in the case of ordering email messages into important and not important. It introduces the useful grepl, gsub, strsplit, strptime ,difftime and strtrim functions for parsing data. The chapter further introduces the reader to the concept of log (and affine) transformations in a lucid and clear way that can help even beginners learn this powerful transformation concept. Again the coding within this chapter is sparsely commented which can cause difficulties to people not used to learn reams of code. ( it may have been part of the code attached with the book, but I am reading an electronic book and I did not find an easy way to go back and forth between the code and the book). The readability of the chapters would be further enhanced by the use of flow charts explaining the path and process followed than overtly verbose textual descriptions running into multiple pages. The chapters are quite clearly written, but a helpful visual summary can help in both revising the concepts and elucidate the approach taken further.A suggestion for the authors could be to compile the list of useful functions they introduce in this book as a sort of reference card (or Ref Card) for R Hackers or atleast have a chapter wise summary of functions, datasets and packages used.

Chapter 5 discusses linear regression , and it is a surprising and not very good explanation of regression theory in the introduction to regression. However the chapter makes up in practical example what it oversimplifies in theory. The chapter on regression is not the finest chapter written in this otherwise excellent book. Part of this is because of relative lack of organization- correlation is explained after linear regression is explained. Once again the lack of a function summary and a process flow diagram hinders readability and a separate section on regression metrics that help make a regression result good or not so good could be a welcome addition. Functions introduced include lm.

Chapter 6 showcases Generalized Additive Model (GAM) and Polynomial Regression, including an introduction to singularity and of over-fitting. Functions included in this chapter are transform, and poly while the package glmnet is also used here. The chapter also introduces the reader formally to the concept of cross validation (though examples of cross validation had been introduced in earlier chapters) and regularization. Logistic regression is also introduced at the end in this chapter.

Chapter 7 is about optimization. It describes error metric in a very easy to understand way. It creates a grid by using nested loops for various values of intercept and slope of a regression equation and computing the sum of square of errors. It then describes the optim function in detail including how it works and it’s various parameters. It introduces the curve function. The chapter then describes ridge regression including definition and hyperparameter lamda. The use of optim function to optimize the error in regression is useful learning for the aspiring hacker. Lastly it describes a case study of breaking codes using the simplistic Caesar cipher, a lexical database and the Metropolis method. Functions introduced in this chapter include .Machine$double.eps .

Chapter 8 deals with Principal Component Analysis and unsupervised learning. It uses the ymd function from lubridate package to convert string to date objects, and the cast function from reshape package to further manipulate the structure of data. Using the princomp functions enables PCA in R.The case study creates a stock market index and compares the results with the Dow Jones index.

Chapter 9 deals with Multidimensional Scaling as well as clustering US senators on the basis of similarity in voting records on legislation .It showcases matrix multiplication using %*% and also the dist function to compute distance matrix.

Chapter 10 has the subject of K Nearest Neighbors for recommendation systems. Packages used include class ,reshape and and functions used include cor, function and log. It also demonstrates creating a custom kNN function for calculating Euclidean distance between center of centroids and data. The case study used is the R package recommendation contest on Kaggle. Overall a simplistic introduction to creating a recommendation system using K nearest neighbors, without getting into any of the prepackaged packages within R that deal with association analysis , clustering or recommendation systems.

Chapter 11 introduces the reader to social network analysis (and elements of graph theory) using the example of Erdos Number as an interesting example of social networks of mathematicians. The example of Social Graph API by Google for hacking are quite new and intriguing (though a bit obsolete by changes, and should be rectified in either the errata or next edition) . However there exists packages within R that should be atleast referenced or used within this chapter (like TwitteR package that use the Twitter API and ROauth package for other social networks). Packages used within this chapter include Rcurl, RJSONIO, and igraph packages of R and functions used include rbind and ifelse. It also introduces the reader to the advanced software Gephi. The last example is to build a recommendation engine for whom to follow in Twitter using R.

Chapter 12 is about model comparison and introduces the concept of Support Vector Machines. It uses the package e1071 and shows the svm function. It also introduces the concept of tuning hyper parameters within default algorithms . A small problem in understanding the concepts is the misalignment of diagram pages with the relevant code. It lastly concludes with using mean square error as a method for comparing models built with different algorithms.

 

Overall the book is a welcome addition in the library of books based on R programming language, and the refreshing nature of the flow of material and the practicality of it’s case studies make this a recommended addition to both academic and corporate business analysts trying to derive insights by hacking lots of heterogeneous data.

Have a look for yourself at-
http://shop.oreilly.com/product/0636920018483.do

Random Sampling a Dataset in R

A common example in business  analytics data is to take a random sample of a very large dataset, to test your analytics code. Note most business analytics datasets are data.frame ( records as rows and variables as columns)  in structure or database bound.This is partly due to a legacy of traditional analytics software.

Here is how we do it in R-

• Refering to parts of data.frame rather than whole dataset.

Using square brackets to reference variable columns and rows

The notation dataset[i,k] refers to element in the ith row and jth column.

The notation dataset[i,] refers to all elements in the ith row .or a record for a data.frame

The notation dataset[,j] refers to all elements in the jth column- or a variable for a data.frame.

For a data.frame dataset

> nrow(dataset) #This gives number of rows

> ncol(dataset) #This gives number of columns

An example for corelation between only a few variables in a data.frame.

> cor(dataset1[,4:6])

Splitting a dataset into test and control.

ts.test=dataset2[1:200] #First 200 rows

ts.control=dataset2[201:275] #Next 75 rows

• Sampling

Random sampling enables us to work on a smaller size of the whole dataset.

use sample to create a random permutation of the vector x.

Suppose we want to take a 5% sample of a data frame with no replacement.

Let us create a dataset ajay of random numbers

ajay=matrix( round(rnorm(200, 5,15)), ncol=10)

#This is the kind of code line that frightens most MBAs!!

Note we use the round function to round off values.

ajay=as.data.frame(ajay)

 nrow(ajay)

[1] 20

> ncol(ajay)

[1] 10

This is a typical business data scenario when we want to select only a few records to do our analysis (or test our code), but have all the columns for those records. Let  us assume we want to sample only 5% of the whole data so we can run our code on it

Then the number of rows in the new object will be 0.05*nrow(ajay).That will be the size of the sample.

The new object can be referenced to choose only a sample of all rows in original object using the size parameter.

We also use the replace=FALSE or F , to not the same row again and again. The new_rows is thus a 5% sample of the existing rows.

Then using the square backets and ajay[new_rows,] to get-

b=ajay[sample(nrow(ajay),replace=F,size=0.05*nrow(ajay)),]

 

You can change the percentage from 5 % to whatever you want accordingly.

Follow

Get every new post delivered to your Inbox.

Join 839 other followers