This post continues my series on basic data manipulation using R. It is aimed at people
who know analytics but are new to R.
1 Keeping only some variables
Using subset we can keep only the variables we want:
Sitka89 <- subset(Sitka89, select=c(size,Time,treat))
This keeps only the variables we have selected (size, Time, treat).
2 Dropping some variables
Harman23.cor$cov.arm.span <- NULL
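Assigning NULL to a data frame column (or list element) deletes it. Here it removes the variable named cov.arm.span from the dataset Harman23.cor, if an element of that name exists.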
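A minimal sketch of the same idiom on a small made-up data frame. The names scores, id, height and weight are hypothetical and are not from the datasets above.

```r
# Hypothetical data frame used only for illustration
scores <- data.frame(id = 1:3, height = c(150, 160, 170), weight = c(50, 60, 70))

# Assigning NULL removes the column
scores$weight <- NULL

# Columns can also be dropped by name with a logical index
scores_small <- scores[, !(names(scores) %in% c("height")), drop = FALSE]

print(names(scores))
print(names(scores_small))
```

The drop = FALSE argument keeps the result a data frame even when only one column is left.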
3 Keeping records based on character condition
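Note the double equal-to sign (==), which tests for equality. A single = inside subset is read as a named argument, not as a comparison.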
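As an illustration, here is a sketch on a made-up data frame. The names customers, name and region are assumptions for the example only.

```r
# Hypothetical data frame used only for illustration
customers <- data.frame(
  name   = c("Ann", "Bob", "Cid"),
  region = c("East", "West", "East"),
  stringsAsFactors = FALSE
)

# == compares each value of region with the string "East"
east <- subset(customers, region == "East")
print(east)
```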
4 Keeping records based on date/time condition
subset(DF, as.Date(Date) >= '2009-09-02' & as.Date(Date) <= '2009-09-04')
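A small runnable sketch of this filter, assuming a data frame DF whose Date column is stored as text in year-month-day form. The rows and the value column are made up for illustration.

```r
# Hypothetical data frame; Date is stored as text
DF <- data.frame(
  Date  = c("2009-09-01", "2009-09-02", "2009-09-03", "2009-09-05"),
  value = c(10, 20, 30, 40),
  stringsAsFactors = FALSE
)

# Convert both sides to Date so the comparison is explicit
from <- as.Date("2009-09-02")
to   <- as.Date("2009-09-04")
in_range <- subset(DF, as.Date(Date) >= from & as.Date(Date) <= to)
print(in_range)
```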
5 Converting Date Time Formats into other formats
If the variable dob is "01/04/1977" (day/month/year), then as.Date with the format string "%d/%m/%Y" will convert it into a date object,
and if the same date is written as 01Apr1977, the format string is "%d%b%Y".
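A sketch of both conversions, assuming the date is 1 April 1977 written in day/month/year order. The variable names d1, d2 and out are hypothetical.

```r
dob <- "01/04/1977"
d1 <- as.Date(dob, format = "%d/%m/%Y")

# %b matches the abbreviated month name in the current locale (assumed English here)
d2 <- as.Date("01Apr1977", format = "%d%b%Y")

# Convert the Date back into text in another format
out <- format(d1, "%Y-%m-%d")
print(out)
```

Because %b depends on the locale, the second conversion may return NA on a system whose month abbreviations are not English.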
6 Difference in Date Time Values and Using Current Time
The difftime function computes the difference between two date-time values.
difftime(time1, time2, units='secs')
difftime(time1, time2, tz = "", units = c("auto", "secs", "mins", "hours", "days", "weeks"))
For the current system date and time you can use Sys.time() (or Sys.Date() for the date alone).
This value can be put in the difftime function shown above to calculate age or time elapsed.
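For instance, a rough age calculation might look like the sketch below. The date of birth is made up, and dividing by 365.25 days per year is a simplification rather than an exact calendar calculation.

```r
# Hypothetical date of birth, for illustration only
dob <- as.POSIXct("1977-04-01 00:00:00", tz = "UTC")
now <- Sys.time()

# Elapsed time between now and dob, expressed in days
elapsed <- difftime(now, dob, units = "days")

# Approximate age in years (365.25 days per year is a simplification)
age_years <- as.numeric(elapsed) / 365.25
print(floor(age_years))
```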
7 Keeping records based on numerical condition
For enhanced usage, you can also use the R Commander GUI with the submenu Data > Active Dataset.
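In code, a numerical condition can be sketched with subset as follows. The data frame sales and its columns are hypothetical.

```r
# Hypothetical data frame used only for illustration
sales <- data.frame(id = 1:5, amount = c(120, 80, 300, 45, 210))

# Keep rows where amount exceeds 100
big <- subset(sales, amount > 100)

# Conditions can be combined with & (and) and | (or)
mid <- subset(sales, amount > 50 & amount < 250)
print(big)
print(mid)
```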
8 Sorting Data
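Sorting a data frame in ascending order by a variable (the by= formula form of sort relies on the Deducer package being loaded; base R's sort does not provide it for data frames)
AggregatedData <- sort(AggregatedData, by = ~ Package)
Sorting a data frame in descending order by a variable
AggregatedData <- sort(AggregatedData, by = ~ -Installed)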
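In base R, without any add-on package, the same two sorts can be sketched with order(). The contents of AggregatedData below are made up, and asc and desc are hypothetical names.

```r
# Hypothetical stand-in for AggregatedData
AggregatedData <- data.frame(
  Package   = c("plyr", "doBy", "reshape2"),
  Installed = c(30, 10, 20),
  stringsAsFactors = FALSE
)

# Ascending by Package
asc <- AggregatedData[order(AggregatedData$Package), ]

# Descending by Installed
desc <- AggregatedData[order(AggregatedData$Installed, decreasing = TRUE), ]
print(asc)
print(desc)
```

order() returns the row positions in sorted order, so indexing the data frame with it reorders whole rows rather than a single column.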
9 Transforming a Dataset Structure around a single variable
Using the reshape2 package we can use the melt and acast functions.
If we choose not to use the reshape2 package, we can use the default reshape function in base R. Please note that this takes longer to process bigger datasets.
df.wide <- reshape(df, idvar="Subject", timevar="Item", direction="wide")
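A small sketch of what this call does, assuming a long data frame with one row per Subject and Item. The Score column and its values are made up for illustration.

```r
# Hypothetical long-format data: one row per Subject/Item pair
df <- data.frame(
  Subject = c(1, 1, 2, 2),
  Item    = c("A", "B", "A", "B"),
  Score   = c(10, 20, 30, 40)
)

# One row per Subject, one Score column per Item
df.wide <- reshape(df, idvar = "Subject", timevar = "Item", direction = "wide")
print(df.wide)
```

The new columns are named by joining the measured variable and the Item value, giving Score.A and Score.B here.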
10 Type in Data
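Using the scan() function we can type in data at the console. By default the values are returned as a vector; a list is returned when the what argument is a list.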
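Called with no arguments, scan() waits for input and stops at a blank line. The sketch below uses the text argument instead so that it runs without typing; the values are made up.

```r
# scan() with no arguments reads from the console until a blank line is entered.
# The text= argument is used here so the example runs non-interactively.
x <- scan(text = "3 5 8 13")
print(x)

# what = "" reads character values instead of numbers
words <- scan(text = "red green blue", what = "")
print(words)
```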
11 Using diff for lags and cumsum for cumulative sums
We can use the diff function to calculate the difference between successive values of a variable.
The cumsum function gives the cumulative sum.
> x=rnorm(10,20) # 10 normally distributed random numbers with mean 20 (default sd 1)
> x
 [1] 20.76078 19.21374 18.28483 20.18920 21.65696 19.54178 18.90592 20.67585
 [9] 20.02222 18.99311
> diff(x)
[1] -1.5470415 -0.9289122  1.9043664  1.4677589 -2.1151783 -0.6358585  1.7699296
[8] -0.6536232 -1.0291181
> cumsum(x)
 [1]  20.76078  39.97453  58.25936  78.44855 100.10551 119.64728 138.55320
 [8] 159.22905 179.25128 198.24438
> diff(x,2) # General form: diff(x, lag = 1, differences = 1, ...). Here lag = 2; differences is the order of differencing
[1] -2.4759536  0.9754542  3.3721252 -0.6474195 -2.7510368  1.1340711  1.1163064
Reference: Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) The New S Language. Wadsworth & Brooks/Cole.
12 Merging Data
The Deducer GUI makes it much simpler to merge datasets. The simplest syntax for a merge statement is:
totalDataframeZ <- merge(dataframeX,dataframeY,by=c("AccountId","Region"))
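A runnable sketch of this merge on two made-up data frames. The Sales and Visits columns and the leftJoin name are hypothetical. It also shows the all.x argument, which turns the default inner join into a left join.

```r
# Hypothetical data frames sharing the keys AccountId and Region
dataframeX <- data.frame(
  AccountId = c(1, 2, 3),
  Region    = c("East", "West", "East"),
  Sales     = c(100, 200, 300),
  stringsAsFactors = FALSE
)
dataframeY <- data.frame(
  AccountId = c(1, 2, 4),
  Region    = c("East", "West", "North"),
  Visits    = c(5, 7, 9),
  stringsAsFactors = FALSE
)

# Default merge keeps only rows whose keys appear in both data frames
totalDataframeZ <- merge(dataframeX, dataframeY, by = c("AccountId", "Region"))

# all.x = TRUE keeps every row of dataframeX; unmatched Visits become NA
leftJoin <- merge(dataframeX, dataframeY, by = c("AccountId", "Region"), all.x = TRUE)
print(totalDataframeZ)
print(leftJoin)
```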
13 Aggregating and group processing of a variable
We can use multiple methods for aggregating and by group processing of variables.
Two functions we explore here are aggregate and tapply.
The following example is taken from the R online manual:
## Compute the averages for the variables in 'state.x77', grouped
## according to the region (Northeast, South, North Central, West) that
## each state belongs to
aggregate(state.x77, list(Region = state.region), mean)
## tapply(Summary Variable, Group Variable, Function)
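To make the tapply pattern concrete, here is a sketch on a made-up data frame. The names spend, region and amount are hypothetical. The aggregate call at the end gives the same group means as a data frame.

```r
# Hypothetical data used only for illustration
spend <- data.frame(
  region = c("East", "West", "East", "West"),
  amount = c(10, 20, 30, 60)
)

# tapply(summary variable, group variable, function)
by_region <- tapply(spend$amount, spend$region, mean)
print(by_region)

# The same group means via aggregate's formula interface
agg <- aggregate(amount ~ region, data = spend, FUN = mean)
print(agg)
```

tapply returns a named array indexed by group, while aggregate returns a data frame, which is usually easier to merge back onto other data.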
We can also use specialized packages for data manipulation.
For additional by-group processing, see the doBy package as well as the plyr package.
doBy contains a variety of utilities, including:
1) Facilities for groupwise computations of summary statistics and other facilities for working with grouped data.
2) General linear contrasts and LSMEANS (least-squares means, also known as population means).
3) HTMLreport for automatic generation of an HTML file from an R script with a minimum of markup.
4) Various other utilities.
doBy is available at [http://cran.r-project.org/web/packages/doBy/index.html]
plyr, available at [http://cran.r-project.org/web/packages/plyr/index.html],
is a set of tools that solves a common set of problems:
you need to break a big problem down into manageable pieces,
operate on each piece and then put all the pieces back together.
For example, you might want to fit a model to each spatial location or
time point in your study, summarise data by panels or collapse high-dimensional arrays
to simpler summary statistics.
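The split, apply and combine steps can be sketched in base R as follows. This uses split, lapply and rbind rather than plyr's own functions, and the data frame obs with its site, x and y columns is made up for illustration.

```r
# Hypothetical data used only for illustration
obs <- data.frame(
  site = c("A", "A", "A", "B", "B", "B"),
  x    = c(1, 2, 3, 1, 2, 3),
  y    = c(2, 4, 6, 3, 3, 3)
)

# Split: one data frame per site
pieces <- split(obs, obs$site)

# Apply: fit a linear model to each piece and keep its slope
slopes <- lapply(pieces, function(d) {
  data.frame(site = d$site[1], slope = coef(lm(y ~ x, data = d))[["x"]])
})

# Combine: stack the per-site results back into one data frame
result <- do.call(rbind, slopes)
print(result)
```

plyr wraps these three steps in a single call, which is its main convenience over the base R version shown here.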