#rstats -Basic Data Manipulation using R

Continuing my series of basic data manipulation using R. For people knowing analytics and
new to R.
1 Keeping only some variables

Using subset we can keep only the variables we want-

Sitka89 <- subset(Sitka89, select=c(size,Time,treat))

Will keep only the variables we have selected (size,Time,treat).

2 Dropping some variables

Harman23.cor$cov.arm.span <- NULL
This deletes the variable named cov.arm.span in the dataset Harman23.cor

3 Keeping records based on character condition


Note the double equal-to sign
4 Keeping records based on date/time condition

subset(DF, as.Date(Date) >= '2009-09-02' & as.Date(Date) <= '2009-09-04')

5 Converting Date Time Formats into other formats

if the variable dob is “01/04/1977) then following will convert into a date object


and if the same date is 01Apr1977


6 Difference in Date Time Values and Using Current Time

The difftime function helps in creating differences in two date time variables.

difftime(time1, time2, units='secs')


difftime(time1, time2, tz = "", units = c("auto", "secs", "mins", "hours", "days", "weeks"))

For current system date time values you can use



This value can be put in the difftime function shown above to calculate age or time elapsed.

7 Keeping records based on numerical condition

Titanic.sub1<-subset(Titanic,Freq >37)

For enhanced usage-
you can also use the R Commander GUI with the sub menu Data > Active Dataset

8 Sorting Data

Sorting A Data Frame in Ascending Order by a variable

AggregatedData<- sort(AggregatedData, by=~ Package)

Sorting a Data Frame in Descending Order by a variable

AggregatedData<- sort(AggregatedData, by=~ -Installed)

9 Transforming a Dataset Structure around a single variable

Using the Reshape2 Package we can use melt and acast functions


tDat.m<- melt(tDat)

tDatCast<- acast(tDat.m,Subject~Item)

If we choose not to use Reshape package, we can use the default reshape method in R. Please do note this takes longer processing time for bigger datasets.

df.wide <- reshape(df, idvar="Subject", timevar="Item", direction="wide")

10 Type in Data

Using scan() function we can type in data in a list

11 Using Diff for lags and Cum Sum function forCumulative Sums

We can use the diff function to calculate difference between two successive values of a variable.


Cumsum function helps to give cumulative sum


> x=rnorm(10,20) #This gives 10 Randomly distributed numbers with Mean 20

> x

[1] 20.76078 19.21374 18.28483 20.18920 21.65696 19.54178 18.90592 20.67585

[9] 20.02222 18.99311

> diff(x)

[1] -1.5470415 -0.9289122 1.9043664 1.4677589 -2.1151783 -0.6358585 1.7699296

[8] -0.6536232 -1.0291181 >


[1] 20.76078 39.97453 58.25936 78.44855 100.10551 119.64728 138.55320

[8] 159.22905 179.25128 198.24438

> diff(x,2) # The diff function can be used as diff(x, lag = 1, differences = 1, ...) where differences is the order of differencing

[1] -2.4759536 0.9754542 3.3721252 -0.6474195 -2.7510368 1.1340711 1.1163064

[8] -1.6827413

Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) The New S Language. Wadsworth & Brooks/Cole.

12 Merging Data

Deducer GUI makes it much simpler to merge datasets. The simplest syntax for a merge statement is

totalDataframeZ <- merge(dataframeX,dataframeY,by=c("AccountId","Region"))

13 Aggregating and group processing of a variable

We can use multiple methods for aggregating and by group processing of variables.
Two functions we explore here are aggregate and Tapply.

Refering to the R Online Manual at

## Compute the averages for the variables in 'state.x77', grouped

## according to the region (Northeast, South, North Central, West) that

## each state belongs to

aggregate(state.x77, list(Region = state.region), mean)

Using TApply

## tapply(Summary Variable, Group Variable, Function)



We can also use specialized packages for data manipulation.

For additional By-group processing you can see the doBy package as well as Plyr package
 for data manipulation.Doby contains a variety of utilities including:
 1) Facilities for groupwise computations of summary statistics and other facilities for working with grouped data.
 2) General linear contrasts and LSMEANS (least-squares-means also known as population means),
 3) HTMLreport for autmatic generation of HTML file from R-script with a minimum of markup, 4) various other utilities and is available at[ http://cran.r-project.org/web/packages/doBy/index.html]
Also Available at [http://cran.r-project.org/web/packages/plyr/index.html],
Plyr is a set of tools that solves a common set of problems:
you need to break a big problem down into manageable pieces,
operate on each pieces and then put all the pieces back together.
 For example, you might want to fit a model to each spatial location or
 time point in your study, summarise data by panels or collapse high-dimensional arrays
 to simpler summary statistics.

Author: Ajay Ohri


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s