Continuing my series on basic data manipulation in R, for people who know analytics but are
new to R.

1 Keeping only some variables
Using subset we can keep only the variables we want:
Sitka89 <- subset(Sitka89, select=c(size,Time,treat))
This keeps only the variables we have selected (size, Time, treat).
2 Dropping some variables
Harman23.cor$cov.arm.span <- NULL
This deletes the variable named cov.arm.span in the dataset Harman23.cor
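Both approaches (keeping with subset and dropping with NULL) can be tried on a toy data frame. The sketch below is only an illustration: the data frame trees and its values are made up for the example and merely stand in for Sitka89.

```r
# Hypothetical toy data frame (values invented for illustration)
trees <- data.frame(size  = c(4.5, 5.1, 6.0),
                    Time  = c(152, 174, 201),
                    tree  = c(1, 1, 2),
                    treat = c("ozone", "ozone", "control"))

# Keep only the listed columns
kept <- subset(trees, select = c(size, Time, treat))

# Drop a single column by assigning NULL to it
dropped <- trees
dropped$tree <- NULL

# Alternative: drop a column with a minus sign inside select
dropped2 <- subset(trees, select = -tree)

names(kept)
names(dropped)
```

Assigning NULL changes the data frame in place, while subset returns a new copy, so the second form is safer when you want to keep the original around.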
3 Keeping records based on character condition
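Titanic.sub1<-subset(as.data.frame(Titanic),Sex=="Male")
Note the double equals sign (==), which tests for equality. The built-in Titanic object is a table rather than a data frame, so it is converted with as.data.frame first.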
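A short sketch of character-based filtering, using the Titanic data that ships with R. The object names titanic_df, males and crew_or_first are just illustrative choices.

```r
# The built-in Titanic object is a 4-way table, so convert it first
titanic_df <- as.data.frame(Titanic)

# Character condition: keep only the rows where Sex is "Male"
males <- subset(titanic_df, Sex == "Male")

# %in% is handy when several values are acceptable
crew_or_first <- subset(titanic_df, Class %in% c("Crew", "1st"))

unique(as.character(males$Sex))
nrow(males)
```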
4 Keeping records based on date/time condition
subset(DF, as.Date(Date) >= '2009-09-02' & as.Date(Date) <= '2009-09-04')
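To see the date filter in action, here is a minimal sketch on a made-up data frame. DF and its column Date follow the line above; the value column and the rows themselves are assumptions for the example.

```r
# Hypothetical data: one row per day, dates stored as text
DF <- data.frame(Date  = c("2009-09-01", "2009-09-02", "2009-09-03",
                           "2009-09-04", "2009-09-05"),
                 value = 1:5,
                 stringsAsFactors = FALSE)

# Convert once, so the comparison is between real dates and not strings
DF$Date <- as.Date(DF$Date)

# Keep rows from 2 Sep 2009 to 4 Sep 2009, inclusive at both ends
win <- subset(DF, Date >= as.Date("2009-09-02") & Date <= as.Date("2009-09-04"))
win
```

Wrapping the boundary strings in as.Date makes the intent explicit instead of relying on automatic conversion.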
5 Converting Date Time Formats into other formats
If the variable dob is "01/04/1977", then the following will convert it into a date-time object (note that R needs straight quotes, not curly ones):
z=strptime(dob,"%d/%m/%Y")
and if the same date is written as 01Apr1977:
z=strptime(dob,"%d%b%Y")
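A small sketch of the round trip: parse a text date, then print it in another format. Only dob comes from the text above; the other names are illustrative, and month names (%b, %B) depend on the locale of your R session.

```r
dob <- "01/04/1977"

# strptime returns a date-time object of class POSIXlt
z <- strptime(dob, "%d/%m/%Y")

# as.Date with a format gives a plain Date (no time-of-day part)
d <- as.Date(dob, format = "%d/%m/%Y")

# Convert to other text formats with format()
iso  <- format(d, "%Y-%m-%d")   # year-month-day
long <- format(d, "%d %B %Y")   # full month name, locale dependent

# The abbreviated-month form; returns NA if the locale's month
# abbreviations are not English
z2 <- strptime("01Apr1977", "%d%b%Y")

iso
```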
6 Difference in Date Time Values and Using Current Time
The difftime function computes the difference between two date-time values.
difftime(time1, time2, units='secs')
or, in its general form, with the units that can be requested:
difftime(time1, time2, tz = "", units = c("auto", "secs", "mins", "hours", "days", "weeks"))
For the current system date and time you can use
Sys.time()
Sys.Date()
Either value can be passed to the difftime function shown above to calculate an age or the time elapsed.
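As a rough sketch of both uses, the example below measures a fixed interval and then an approximate age. The timestamps, the date of birth and the 365.25 days-per-year shortcut are assumptions made for illustration.

```r
# A fixed interval between two time stamps
start <- as.POSIXct("2009-09-02 08:00:00", tz = "UTC")
end   <- as.POSIXct("2009-09-04 08:00:00", tz = "UTC")
gap   <- difftime(end, start, units = "hours")
gap_hours <- as.numeric(gap)   # strip the difftime class to get a plain number

# Approximate age in years from a date of birth up to today
dob       <- as.Date("1977-04-01")
age_days  <- difftime(Sys.Date(), dob, units = "days")
age_years <- as.numeric(age_days) / 365.25   # rough conversion, ignores calendar details

gap
```

Asking for explicit units is usually safer than the default "auto", which picks a unit depending on the size of the difference.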
7 Keeping records based on numerical condition
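Titanic.sub1<-subset(as.data.frame(Titanic),Freq >37)
(The built-in Titanic object is a table, so it is converted to a data frame first.)
For more convenient use, you can also do this from the R Commander GUI, using the submenu Data > Active Dataset.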
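Numeric and character conditions can also be combined in one call. The sketch below again uses the built-in Titanic data; the threshold of 37 comes from the line above, and the object names are only illustrative.

```r
titanic_df <- as.data.frame(Titanic)

# Numeric condition only
big <- subset(titanic_df, Freq > 37)

# Numeric and character conditions together, keeping a few columns
big_survivors <- subset(titanic_df,
                        Freq > 37 & Survived == "Yes",
                        select = c(Class, Sex, Age, Freq))
big_survivors
```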
8 Sorting Data
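Sorting a data frame in ascending order by a variable
AggregatedData<- sort(AggregatedData, by=~ Package)
Sorting a data frame in descending order by a variable
AggregatedData<- sort(AggregatedData, by=~ -Installed)
Note that the by= formula syntax is not part of base R's sort function; it needs a sort method for data frames from an add-on package, such as the one the Deducer package provides.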
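If no add-on package is loaded, the same result can be had in base R with order(). This is a sketch on made-up data: the column names Package and Installed follow the lines above, but the values are invented.

```r
# Hypothetical data (values invented for illustration)
AggregatedData <- data.frame(Package   = c("plyr", "doBy", "reshape2"),
                             Installed = c(120, 45, 300),
                             stringsAsFactors = FALSE)

# Ascending by Package: order() returns the row positions in sorted order
asc <- AggregatedData[order(AggregatedData$Package), ]

# Descending by Installed: negate a numeric column ...
desc <- AggregatedData[order(-AggregatedData$Installed), ]

# ... or use decreasing = TRUE, which also works for character columns
desc2 <- AggregatedData[order(AggregatedData$Installed, decreasing = TRUE), ]

asc
desc
```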
9 Transforming a Dataset Structure around a single variable
Using the reshape2 package we can use the melt and acast functions
library("reshape2")
tDat.m<- melt(tDat)
tDatCast<- acast(tDat.m,Subject~Item)
If we choose not to use the reshape2 package, we can use the built-in reshape function in R. Please note that it takes longer to process bigger datasets.
df.wide <- reshape(df, idvar="Subject", timevar="Item", direction="wide")
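To make the long-to-wide step concrete, here is a sketch using only the base reshape function, so nothing needs to be installed. The column names Subject and Item follow the call above; the value column and the numbers are made up.

```r
# Hypothetical long-format data: one row per Subject/Item pair
df <- data.frame(Subject = c(1, 1, 2, 2),
                 Item    = c("A", "B", "A", "B"),
                 value   = c(10, 20, 30, 40),
                 stringsAsFactors = FALSE)

# One row per Subject, one column per Item
df.wide <- reshape(df, idvar = "Subject", timevar = "Item", direction = "wide")

names(df.wide)   # the new columns are named value.<Item>
df.wide
```

The new column names are built from the measured variable and the Item value, separated by a dot.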
10 Type in Data
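Using the scan() function we can type in data at the console. By default it returns a numeric vector; pass a list to its what argument to read the values into a list instead.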
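Since typing at the console cannot be shown on a page, the sketch below feeds scan() the same kind of input through its text argument. Called as plain scan(), it would instead wait for you to type values and finish on an empty line. The sample values are invented.

```r
# Numbers: the default is a numeric vector
nums <- scan(text = "12 7 3.5 40", quiet = TRUE)

# Character data: tell scan what type to expect
words <- scan(text = "red green blue", what = character(), quiet = TRUE)

# A list template reads the fields in turn into named components
rec <- scan(text = "Ann 31 Bob 27", what = list(name = "", age = 0), quiet = TRUE)

nums
rec$name
rec$age
```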
11 Using diff for lagged differences and cumsum for cumulative sums
We can use the diff function to calculate the difference between two successive values of a variable. R is case sensitive, so the function names are lower case.
diff(Dataset$X)
The cumsum function gives the cumulative sum
cumsum(Dataset$X)
> x=rnorm(10,20) #This gives 10 normally distributed random numbers with mean 20 (and the default standard deviation of 1)
> x
[1] 20.76078 19.21374 18.28483 20.18920 21.65696 19.54178 18.90592 20.67585
[9] 20.02222 18.99311
> diff(x)
[1] -1.5470415 -0.9289122 1.9043664 1.4677589 -2.1151783 -0.6358585 1.7699296
[8] -0.6536232 -1.0291181
> cumsum(x)
[1] 20.76078 39.97453 58.25936 78.44855 100.10551 119.64728 138.55320
[8] 159.22905 179.25128 198.24438
> diff(x,2) # The full form is diff(x, lag = 1, differences = 1, ...). The second argument is lag, so this gives x[i+2] - x[i]; differences is the order of differencing
[1] -2.4759536 0.9754542 3.3721252 -0.6474195 -2.7510368 1.1340711 1.1163064
[8] -1.6827413
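Because lag and differences are easy to mix up, here is a small sketch on a fixed vector where the results can be checked by hand. The vector of squares is just a convenient made-up example.

```r
x <- c(1, 4, 9, 16, 25)

d1   <- diff(x)                    # successive differences: 3 5 7 9
lag2 <- diff(x, lag = 2)           # x[i+2] - x[i]: 8 12 16
ord2 <- diff(x, differences = 2)   # differences of the differences: 2 2 2
cs   <- cumsum(x)                  # running total: 1 5 14 30 55

lag2
ord2
```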
Reference: Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) The New S Language. Wadsworth & Brooks/Cole.
12 Merging Data
The Deducer GUI makes it much simpler to merge datasets. The simplest syntax for a merge statement is
totalDataframeZ <- merge(dataframeX,dataframeY,by=c("AccountId","Region"))
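A minimal sketch of what this merge does, on two made-up data frames. The key columns AccountId and Region follow the statement above; the Sales and Visits columns and all values are assumptions.

```r
# Hypothetical inputs sharing the key columns AccountId and Region
dataframeX <- data.frame(AccountId = c(1, 2, 3),
                         Region    = c("East", "West", "East"),
                         Sales     = c(100, 250, 75),
                         stringsAsFactors = FALSE)
dataframeY <- data.frame(AccountId = c(1, 2, 4),
                         Region    = c("East", "West", "North"),
                         Visits    = c(3, 5, 2),
                         stringsAsFactors = FALSE)

# Default merge: only rows whose keys appear in both data frames
inner <- merge(dataframeX, dataframeY, by = c("AccountId", "Region"))

# all.x = TRUE keeps every row of the first data frame, filling gaps with NA
left <- merge(dataframeX, dataframeY, by = c("AccountId", "Region"), all.x = TRUE)

inner
left
```

The all.x, all.y and all arguments control which unmatched rows survive, much like left, right and full outer joins in SQL.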
13 Aggregating and group processing of a variable
We can use several methods for aggregating variables and processing them by group.
The two functions we explore here are aggregate and tapply.
Referring to the R Online Manual at
[http://stat.ethz.ch/R-manual/R-patched/library/stats/html/aggregate.html]
## Compute the averages for the variables in 'state.x77', grouped
## according to the region (Northeast, South, North Central, West) that
## each state belongs to
aggregate(state.x77, list(Region = state.region), mean)
Using tapply
## tapply(Summary Variable, Group Variable, Function)
Reference
[http://www.ats.ucla.edu/stat/r/library/advanced_function_r.htm#tapply]
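Here is a sketch of both functions on the same made-up data, so the two result shapes can be compared. The data frame sales and its values are invented for the example.

```r
# Hypothetical data (values invented for illustration)
sales <- data.frame(region = c("East", "West", "East", "West", "East"),
                    amount = c(10, 20, 30, 60, 50),
                    stringsAsFactors = FALSE)

# tapply(summary variable, group variable, function) returns a named vector
by_region <- tapply(sales$amount, sales$region, mean)

# aggregate with a formula returns a data frame, one row per group
agg <- aggregate(amount ~ region, data = sales, FUN = sum)

by_region
agg
```

tapply is compact when you want a quick lookup by group name, while the data frame from aggregate is easier to merge back or sort.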
We can also use specialized packages for data manipulation.
For additional by-group processing, see the doBy package as well as the plyr package.
doBy contains a variety of utilities, including:
1) Facilities for groupwise computations of summary statistics and other facilities for working with grouped data,
2) General linear contrasts and LSMEANS (least-squares means, also known as population means),
3) HTMLreport for automatic generation of an HTML file from an R script with a minimum of markup,
4) various other utilities.
It is available at [http://cran.r-project.org/web/packages/doBy/index.html]
plyr, available at [http://cran.r-project.org/web/packages/plyr/index.html],
is a set of tools that solves a common set of problems:
you need to break a big problem down into manageable pieces,
operate on each piece and then put all the pieces back together.
For example, you might want to fit a model to each spatial location or
time point in your study, summarise data by panels or collapse high-dimensional arrays
to simpler summary statistics.
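The break-apart, operate, put-back-together idea can be sketched without installing anything. Note that the code below is plain base R (split, lapply, do.call) and not plyr's own functions; the choice of the built-in mtcars data and of a model of mpg on wt per cylinder group is only an assumption for illustration.

```r
# 1. Split: one data frame per number of cylinders
pieces <- split(mtcars, mtcars$cyl)

# 2. Apply: fit a small linear model to each piece and keep its coefficients
fits <- lapply(pieces, function(d) coef(lm(mpg ~ wt, data = d)))

# 3. Combine: stack the per-group results into one matrix
result <- do.call(rbind, fits)

result
```

Packages such as plyr and doBy wrap these three steps in a single call, which is where they save typing on larger jobs.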