Continuing my series on basic data manipulation using R, for people who know analytics
but are new to R.

1 Keeping only some variables
Using subset we can keep only the variables we want:
    library(MASS)  # Sitka89 ships with the MASS package
    Sitka89 <- subset(Sitka89, select=c(size,Time,treat))
This will keep only the variables we have selected (size, Time, treat).

2 Dropping some variables
    Harman23.cor$cov.arm.span <- NULL
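This deletes the variable named cov.arm.span in the dataset Harman23.cor. In general, assigning NULL to dataset$variable drops that column from a data frame.

3 Keeping records based on character condition
    Titanic.sub1 <- subset(as.data.frame(Titanic), Sex=="Male")
Note the double equal-to sign. The built-in Titanic object is a table rather than a data frame, so it is converted with as.data.frame() before subsetting.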
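
To see the first three tips working together, here is a minimal sketch on the built-in Titanic table. The object names (titanic_df, kept, dropped, males) are made up for illustration, and the choice of columns is arbitrary.

```r
# Titanic ships with R as a 4-way table; convert it to a data frame first.
# Resulting columns: Class, Sex, Age, Survived, Freq
titanic_df <- as.data.frame(Titanic)

# Tip 1: keep only some variables
kept <- subset(titanic_df, select = c(Class, Sex, Freq))

# Tip 2: drop one variable by assigning NULL to it (done on a copy here)
dropped <- titanic_df
dropped$Age <- NULL

# Tip 3: keep records matching a character condition (note the double ==)
males <- subset(titanic_df, Sex == "Male")

names(kept)
names(dropped)
nrow(males)
```

Working on a copy in the second step keeps the original data frame intact, which is handy while you are still exploring.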

4 Keeping records based on date/time condition
    subset(DF, as.Date(Date) >= as.Date('2009-09-02') & as.Date(Date) <= as.Date('2009-09-04'))

5 Converting Date Time Formats into other formats
If the variable dob is "01/04/1977", then the following will convert it into a date-time object (class POSIXlt):
    z=strptime(dob,"%d/%m/%Y")
and if the same date is written as 01Apr1977:
    z=strptime(dob,"%d%b%Y")

6 Difference in Date Time Values and Using Current Time
The difftime function computes the difference between two date-time variables:
    difftime(time1, time2, units='secs')
or, with the full argument list,
    difftime(time1, time2, tz = "", units = c("auto", "secs", "mins", "hours", "days", "weeks"))
For the current system date and time you can use
    Sys.time()
    Sys.Date()
These values can be passed to the difftime function shown above to calculate age or time elapsed.

7 Keeping records based on numerical condition
    # as.data.frame() is needed because Titanic is a table, not a data frame
    Titanic.sub1 <- subset(as.data.frame(Titanic), Freq > 37)
For enhanced usage,
you can also use the R Commander GUI with the submenu Data > Active Dataset.

8 Sorting Data
Sorting a data frame in ascending order by a variable:
    AggregatedData <- sort(AggregatedData, by=~ Package)
Sorting a data frame in descending order by a variable:
    AggregatedData <- sort(AggregatedData, by=~ -Installed)
Note that the by formula interface is not part of base R's sort(); it appears to come from a package-supplied data frame method such as the one in Deducer. In base R the same result comes from order():
    AggregatedData <- AggregatedData[order(AggregatedData$Package), ]
    AggregatedData <- AggregatedData[order(AggregatedData$Installed, decreasing=TRUE), ]

9 Transforming a Dataset Structure around a single variable
Using the reshape2 package we can use the melt and acast functions:
    library("reshape2")
    tDat.m <- melt(tDat)
    tDatCast <- acast(tDat.m,Subject~Item)
If we choose not to use the reshape2 package, we can use the default reshape function in R. Please note that this takes longer to process bigger datasets.
    df.wide <- reshape(df, idvar="Subject", timevar="Item", direction="wide")

10 Type in Data
Using the scan() function we can type in data directly at the console. Calling x <- scan() reads the numbers you type until a blank line is entered and returns them as a vector; passing a list as the what argument returns a list instead.

11 Using diff for lags and cumsum for Cumulative Sums
We can use the diff function to calculate the difference between successive values of a variable:
    diff(Dataset$X)
The cumsum function gives the cumulative sum:
    cumsum(Dataset$X)
R is case-sensitive, so both function names must be written in lower case.
    > x=rnorm(10,20) # This gives 10 normally distributed random numbers with mean 20 (default standard deviation 1)
    > x
     [1] 20.76078 19.21374 18.28483 20.18920 21.65696 19.54178 18.90592 20.67585
     [9] 20.02222 18.99311
    > diff(x)
    [1] -1.5470415 -0.9289122 1.9043664 1.4677589 -2.1151783 -0.6358585 1.7699296
    [8] -0.6536232 -1.0291181
    > cumsum(x)
    [1] 20.76078 39.97453 58.25936 78.44855 100.10551 119.64728 138.55320
    [8] 159.22905 179.25128 198.24438
    > diff(x,2) # The full form is diff(x, lag = 1, differences = 1, ...), where differences is the order of differencing; here the second argument sets lag = 2
    [1] -2.4759536 0.9754542 3.3721252 -0.6474195 -2.7510368 1.1340711 1.1163064
    [8] -1.6827413
Reference: Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) The New S Language. Wadsworth & Brooks/Cole.

12 Merging Data
The Deducer GUI makes it much simpler to merge datasets.
The simplest syntax for a merge statement is
    totalDataframeZ <- merge(dataframeX,dataframeY,by=c("AccountId","Region"))

13 Aggregating and group processing of a variable
We can use multiple methods for aggregating and by-group processing of variables.
Two functions we explore here are aggregate and tapply. Referring to the R online manual at
[http://stat.ethz.ch/R-manual/R-patched/library/stats/html/aggregate.html]
    ## Compute the averages for the variables in 'state.x77', grouped
    ## according to the region (Northeast, South, North Central, West) that
    ## each state belongs to
    aggregate(state.x77, list(Region = state.region), mean)
Using tapply
    ## tapply(Summary Variable, Group Variable, Function)
Reference [http://www.ats.ucla.edu/stat/r/library/advanced_function_r.htm#tapply]
We can also use specialized packages for data manipulation. For additional by-group processing you can see the doBy package as well as the plyr package
for data manipulation. doBy contains a variety of utilities including:
1) Facilities for groupwise computations of summary statistics and other facilities for working with grouped data,
2) General linear contrasts and LSMEANS (least-squares means, also known as population means),
3) HTMLreport for automatic generation of an HTML file from an R script with a minimum of markup,
4) various other utilities,
and is available at [http://cran.r-project.org/web/packages/doBy/index.html]
plyr, available at [http://cran.r-project.org/web/packages/plyr/index.html],
is a set of tools that solves a common set of problems:
you need to break a big problem down into manageable pieces,
operate on each piece and then put all the pieces back together.
For example, you might want to fit a model to each spatial location or
time point in your study, summarise data by panels or collapse high-dimensional arrays
to simpler summary statistics.
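
The break-apart, operate, put-back-together idea can be tried without installing anything. The sketch below uses only base R (split, lapply, rbind) rather than plyr itself, on a small made-up data frame; the names measurements, site, and value are hypothetical. It then checks the result against tapply, one of the two functions from tip 13.

```r
# Hypothetical toy data: one row per measurement, grouped by site
measurements <- data.frame(
  site  = c("A", "A", "B", "B", "B", "C"),
  value = c(10, 14, 3, 5, 7, 20)
)

# Break the problem into pieces: one data frame per site
pieces <- split(measurements, measurements$site)

# Operate on each piece: count the rows and average the values
summaries <- lapply(pieces, function(piece) {
  data.frame(site = piece$site[1],
             n = nrow(piece),
             mean_value = mean(piece$value))
})

# Put the pieces back together into a single data frame
combined <- do.call(rbind, summaries)
rownames(combined) <- NULL
print(combined)

# The same group means in one call:
# tapply(Summary Variable, Group Variable, Function)
group_means <- tapply(measurements$value, measurements$site, mean)
print(group_means)
```

For a single summary statistic tapply or aggregate is shorter; the explicit split and lapply form starts to pay off when each piece needs something more involved, such as fitting a model per group.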