A common task in business analytics is to take a random sample of a very large dataset in order to test your analytics code. Note that most business analytics datasets are either structured as a data.frame (records as rows and variables as columns) or are database bound. This is partly due to a legacy of traditional analytics software.
Here is how we do it in R:
• Referring to parts of a data.frame rather than the whole dataset.
We use square brackets to reference rows and variable columns.
The notation dataset[i,j] refers to the element in the ith row and jth column.
The notation dataset[i,] refers to all elements in the ith row, that is, a record for a data.frame.
The notation dataset[,j] refers to all elements in the jth column, that is, a variable for a data.frame.
For a data.frame dataset:
> nrow(dataset) #This gives number of rows
> ncol(dataset) #This gives number of columns
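The bracket notation above can be tried on a small example. This is a minimal sketch: the data.frame and its column names (id, sales, region) are hypothetical and exist only for illustration.

```r
# Hypothetical toy data.frame, used only for illustration
dataset <- data.frame(id = 1:5,
                      sales = c(10, 20, 30, 40, 50),
                      region = c("N", "S", "E", "W", "N"))

one_cell   <- dataset[2, 2]   # element in the 2nd row and 2nd column
one_record <- dataset[2, ]    # all columns of the 2nd row (a one-row data.frame)
one_column <- dataset[, 2]    # all rows of the 2nd column (a plain vector)

n_rows <- nrow(dataset)       # number of rows
n_cols <- ncol(dataset)       # number of columns
```

Note that selecting a single column this way returns a vector, while selecting a row returns a data.frame, because a record can mix variable types.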
An example of computing the correlation between only a few variables in a data.frame:
> cor(dataset1[,4:6])
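Since dataset1 is not defined here, the same call can be tried on mtcars, a dataset that ships with R. This is only a stand-in chosen for illustration; its columns 4 to 6 happen to be hp, drat and wt.

```r
# mtcars ships with R (datasets package); it stands in for dataset1 here
cor_matrix <- cor(mtcars[, 4:6])   # columns 4 to 6: hp, drat, wt
print(round(cor_matrix, 2))        # 3 x 3 correlation matrix
```

All selected columns must be numeric for cor to work, which is why a column range is picked rather than the whole data.frame.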
Splitting a dataset into test and control.
> ts.test=dataset2[1:200,] #First 200 rows
> ts.control=dataset2[201:275,] #Next 75 rows
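A minimal sketch of such a split, assuming dataset2 is a data.frame with 275 rows (the one built here is hypothetical). For a data.frame the row index goes before the comma; an index without a comma would select columns instead of rows.

```r
# Hypothetical dataset2 with 275 rows, built only for illustration
dataset2 <- data.frame(id = 1:275, value = (1:275) * 2)

ts.test    <- dataset2[1:200, ]    # first 200 rows, all columns
ts.control <- dataset2[201:275, ]  # next 75 rows, all columns
```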
• Sampling
Random sampling enables us to work on a smaller subset of the whole dataset.
Use sample(x) to create a random permutation of the vector x.
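As a small illustration of the permutation behaviour, here is a sketch with a hypothetical vector x. The seed is arbitrary and only makes the run repeatable.

```r
set.seed(42)            # arbitrary seed, only for repeatability
x <- 1:10
shuffled <- sample(x)   # a random permutation of x, drawn without replacement
print(shuffled)
```

The result contains exactly the same values as x, in a random order.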
Suppose we want to take a 5% sample of a data frame without replacement.
Let us create a dataset ajay of random numbers.
#This is the kind of code line that frightens most MBAs!!
Note we use the round function to round off values.
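The line that creates ajay is not shown here. A minimal sketch of one way to do it, assuming normally distributed values (the mean of 5, the standard deviation of 15 and the seed are arbitrary assumptions), which gives a matrix with 20 rows and 10 columns:

```r
set.seed(1)  # arbitrary seed, only for repeatability
# 200 random normal values, rounded to whole numbers, arranged in 10 columns
ajay <- matrix(round(rnorm(200, mean = 5, sd = 15)), ncol = 10)
```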
> ajay=as.data.frame(ajay)
> nrow(ajay)
[1] 20
> ncol(ajay)
[1] 10
This is a typical business data scenario, in which we want to select only a few records to do our analysis (or test our code) but keep all the columns for those records. Let us assume we want to sample only 5% of the whole data so that we can run our code on it.
Then the number of rows in the new object will be 0.05*nrow(ajay). That will be the size of the sample.
We pass this number to sample through the size parameter, so that the new object holds only a sample of the row numbers of the original object.
We also use replace=FALSE (or F) so that the same row is not picked again and again. The new_rows object is thus a 5% sample of the existing rows.
Then we use square brackets, as in ajay[new_rows,], to get the sampled records with all their columns.
You can change the percentage from 5% to whatever you want.
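The sampling steps described above can be sketched end to end as follows. The way ajay is generated, the seed, and the names sample_fraction, sample_size and ajay_sample are assumptions made for illustration; new_rows follows the name used in the text.

```r
set.seed(123)  # arbitrary seed, only for repeatability
# Assumed construction of ajay: 20 rows by 10 columns of rounded random numbers
ajay <- as.data.frame(matrix(round(rnorm(200, mean = 5, sd = 15)), ncol = 10))

sample_fraction <- 0.05
# Round so that size is a whole number of rows (0.05 * 20 gives 1 row here)
sample_size <- round(sample_fraction * nrow(ajay))

# Pick row numbers at random, never choosing the same row twice
new_rows <- sample(nrow(ajay), size = sample_size, replace = FALSE)

# Keep only the sampled rows, with all of their columns
ajay_sample <- ajay[new_rows, ]
```

With only 20 rows, a 5% sample is a single record; on a large dataset the same lines would return a proportionally larger sample. Changing sample_fraction changes the percentage.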