50 functions to clear a basic interview for Business Analytics #rstats

With due respect to all the cheat sheets and ref cards, these are the functions that I use, in sequence, to analyze a business data set.

Packages

install.packages("Hmisc") installs package Hmisc
library(Hmisc) - loads package Hmisc
update.packages() Updates all packages

Data Input

getwd() - Gets you the current working directory
setwd("C:/Path") - Sets the working directory to the new path, here C:/Path
dir() - Lists all the files present in the current working directory
a=read.csv("1.csv",header=T,sep=",",stringsAsFactors=T)

here

a= read.csv (assigns the object name a to whatever comes on the right side)

You can also explicitly assign a character name to a data object using assign()

read.csv is an input command derived from read.table; it reads in a rectangular object (rows and columns)

header specifies whether the first line has variable names or not

sep denotes the separator (generally a comma for CSV files, but it can be a space or tab for text files)

stringsAsFactors=T reads in strings as factor levels and not character values; use stringsAsFactors=F to keep them as character values
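
To see the effect of stringsAsFactors for yourself, here is a minimal sketch. The file contents, the column names (city, sales) and the object name sales_data are made up for illustration, and the file is written to a temporary location so the snippet runs on its own.

```r
# Hypothetical data written to a temporary file so the example is self-contained
tmp <- tempfile(fileext = ".csv")
writeLines(c("city,sales", "Delhi,10", "Mumbai,20", "Delhi,30"), tmp)

# stringsAsFactors = TRUE: text columns are read in as factors
a <- read.csv(tmp, header = TRUE, sep = ",", stringsAsFactors = TRUE)
class(a$city)    # "factor"

# stringsAsFactors = FALSE: text columns stay as character values
a2 <- read.csv(tmp, header = TRUE, sep = ",", stringsAsFactors = FALSE)
class(a2$city)   # "character"

# assign() creates an object whose name is given as a string
assign("sales_data", a)
```

Writing TRUE and FALSE in full is a little safer than T and F, since T and F are ordinary variables that can be overwritten.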

Object Inspection

str(a) Gives the structure of the object, including class, dimensions, type of variables, names of variables and a few values of each variable as well. The only drawback is that it can throw up a lot of information for a big data set
names(a) Gives the names of variables of the object
class(a) Gives the class of an object, like data.frame, list, matrix, vector etc
dim(a) Gives the dimensions of the object (rows, then columns)
nrow(a) Gives the number of rows of object a - useful when used as an input for another function
ncol(a) Gives the number of columns of object a
length(a) Gives the length of the object - useful for vectors; for a data frame it is the same as ncol
a[i,j] Gives the value in the ith row and jth column
a$var1 Gives the variable named var1 in object a. This can be treated as a separate object on its own for inspection or analysis
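
A quick sketch of these inspection functions on a tiny data frame. The data and the column names var1 and var2 are made up for illustration.

```r
# Illustrative data frame; the values and column names are invented
a <- data.frame(var1 = c(5, 7, 9), var2 = c("x", "y", "z"))

str(a)       # compact overview: class, dimensions, column types, first values
names(a)     # "var1" "var2"
class(a)     # "data.frame"
dim(a)       # 3 2  (rows, then columns)
nrow(a)      # 3
ncol(a)      # 2
length(a)    # 2, the same as ncol(a) for a data frame
a[2, 1]      # 7, the value in row 2, column 1
a$var1       # 5 7 9
```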

Data Inspection

head(a,10) gives the first ten rows of object a
tail(a,20) gives the last twenty rows of object a
b=ajay[sample(nrow(ajay),replace=F,size=0.05*nrow(ajay)),]

Let's get a 5% random sample of the object ajay

[] subsets the object: the part before the comma picks the rows, and leaving the part after the comma empty keeps all the columns

sample is the command for drawing a random sample

Here sample draws from the row numbers 1 to nrow(ajay), the total number of rows to be sampled from. size= is the size of the sample, here 5% of the rows, and the row numbers drawn are what is returned. replace=F means each row number is selected only once
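
The same sampling step, broken into pieces, might look like the sketch below. The data frame is made up, set.seed is there only to make the illustration repeatable, and floor() is my own addition so that size is always a whole number.

```r
set.seed(42)                                        # only to make the example repeatable
ajay <- data.frame(id = 1:200, value = rnorm(200))  # made-up data, 200 rows

n_keep <- floor(0.05 * nrow(ajay))                  # 5% of 200 rows = 10
rows   <- sample(nrow(ajay), size = n_keep, replace = FALSE)  # 10 distinct row numbers
b      <- ajay[rows, ]                              # those rows, all columns

nrow(b)               # 10
anyDuplicated(rows)   # 0, no row number was drawn twice
```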

Math Functions

Basic-

sum - sum
sqrt - square root
sd - standard deviation
log - log (natural log by default)
mean - mean
median - median

Additional–

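cumsum - Cumulative sum of a vector (for example, a column)
diff - Differencing (differences between successive values)
lag - Lag (shifts the time base of a time series; it does not shift a plain vector)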
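
A small sketch of the math functions on a made-up vector. The lag part assumes the base stats::lag, which works on time series (ts) objects.

```r
x <- c(2, 4, 8, 16)   # made-up values

sum(x)         # 30
mean(x)        # 7.5
median(x)      # 6
sd(x)          # sample standard deviation
sqrt(16)       # 4
log(exp(1))    # 1, because log() is the natural log by default

cumsum(x)      # 2 6 14 30
diff(x)        # 2 4 8

# stats::lag() moves the time index of a ts object, not the values of a vector
s <- ts(x, start = 1)
time(lag(s, -1))   # the same four values now sit at times 2 to 5
```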

Data Manipulation

paste(a$Var) converts Var from a Factor/Numeric variable to a Character variable (as.character(a$Var) does the same)
as.numeric(a$Var2) Converts a character variable into a numeric variable (for a factor it returns the internal level codes, so convert with as.character first)
is.na(a) returns TRUE wherever it encounters a Missing Value within the object
na.omit(a) Deletes all rows that contain missing values (denoted by NA within R)
na.rm=T (this option enables you to calculate values ignoring Missing Values, for example mean(a$Var,na.rm=T))
nchar(abc) gives the number of characters in a character value
substr("ajay",1,3) gives the substring from starting place 1 to ending place 3. Note that in R the index starts from 1 for the first element
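
Here is a minimal sketch of these functions working together on a made-up data frame with one missing value and one unparseable string. The column contents are invented for illustration.

```r
# Made-up data: Var has a missing value, Var2 holds numbers stored as text
a <- data.frame(Var  = c(10, NA, 30),
                Var2 = c("1.5", "2.5", "abc"),
                stringsAsFactors = FALSE)

is.na(a$Var)                # FALSE TRUE FALSE
mean(a$Var)                 # NA, because one value is missing
mean(a$Var, na.rm = TRUE)   # 20
nrow(na.omit(a))            # 2, the row containing NA is dropped

as.character(a$Var)         # "10" NA "30"

# "abc" cannot be parsed as a number, so it becomes NA (with a warning)
num <- suppressWarnings(as.numeric(a$Var2))
num                         # 1.5 2.5 NA

nchar("ajay")               # 4
substr("ajay", 1, 3)        # "aja"
```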

Date Manipulation

library(lubridate)
> a="20Mar1987"
> dmy(a)
[1] "1987-03-20 UTC"
> b="20/7/89"
> dmy(b)
[1] "1989-07-20 UTC"
> c="October 12 93"
> mdy(c)
[1] "1993-10-12 UTC"

Data Analysis

summary(a) Gives a summary of the object (min, max, median, mean, 1st quartile, 3rd quartile) for numeric variables and a frequency analysis of Factor variables
table(a) Gives Frequency Analysis of variable or object
table(a$var1,a$var2) Gives cross tabs of var1 with respect to var2 of object a

library(Hmisc) loads Hmisc, which enables us to use the describe and summarize functions

describe(a$var1) gives a much more elaborate summary of the variable var1; it's a more detailed version of summary
summarize(a$var1,a$var2,FUN) applies a function (like sum, median, summary or mean) on var1, as GROUPED by var2
cor(a) gives the correlation between all variables of a (every column must be numeric, so subset to the numeric columns first)
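
A rough base-R sketch of the same ideas on made-up data. Note that tapply is used here as a base-R stand-in for a grouped summary; it is not the Hmisc summarize function, and the column names are invented.

```r
# Made-up data: two numeric columns and one grouping factor
a <- data.frame(var1 = c(10, 20, 30, 40),
                var2 = factor(c("east", "west", "east", "west")),
                var3 = c(1, 2, 3, 4))

summary(a$var1)               # min, quartiles, median, mean, max
table(a$var2)                 # frequency of each level: east 2, west 2
table(a$var2, a$var1 > 15)    # cross tab of group against a condition

# Grouped mean of var1 by var2 (base-R alternative, not Hmisc::summarize)
tapply(a$var1, a$var2, mean)  # east 20, west 30

# cor() needs numeric columns only, so pick them explicitly
cor(a[, c("var1", "var3")])   # var1 and var3 move together exactly in this toy data
```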

Data Visualization

plot(a$var1,a$var2) Plots Var 1 with Var 2
boxplot(a) boxplot
hist(a$Var1) Histogram
plot(density(a$Var1)) Density Plot
pie (pie chart)

Modeling

a=lm(y~x) creates a linear model of y on x
vif(a) gives Variance Inflation Factors (needs library(car) and a model with at least two predictors)
outlierTest(a) tests for Outliers (also from library(car))

summary(a) gives model summary including parameter estimates
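
A minimal base-R sketch of fitting and summarizing a model on simulated data. The data are made up (a true intercept of 3 and slope of 2 plus noise), and the model is stored in an object called fit, a name of my choosing, so that it does not overwrite a data object named a. vif and outlierTest are left out because they need the car package.

```r
set.seed(1)                              # repeatable made-up data
x <- 1:20
y <- 3 + 2 * x + rnorm(20, sd = 0.5)     # true intercept 3, true slope 2, plus noise

fit <- lm(y ~ x)                         # fit y as a linear function of x

coef(fit)                                # estimated intercept and slope
summary(fit)                             # estimates, standard errors, R-squared
summary(fit)$r.squared                   # R-squared on its own
```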

Write Output

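write.csv(a,file="output.csv") Writes a as a csv file (output.csv is just an example name; without file= the output is printed to the console)
png("graph.png") Opens a png file for the next plot; call dev.off() after plotting to finish writing it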
q() –Quits R Session
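
Put together, writing a data frame and a plot to disk might look like this sketch. The data are made up and the files go to a temporary directory, with output.csv and graph.png used only as example names.

```r
a <- data.frame(x = 1:3, y = c(2, 4, 6))          # made-up data

# Write the data frame as CSV (row.names = FALSE leaves out the row-number column)
csv_path <- file.path(tempdir(), "output.csv")    # example file name
write.csv(a, file = csv_path, row.names = FALSE)

# Write a plot as PNG: open the device, draw, then close the device
png_path <- file.path(tempdir(), "graph.png")     # example file name
png(png_path)
plot(a$x, a$y)
dev.off()                                         # the file is complete only after this

file.exists(csv_path)
file.exists(png_path)
```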

——————————————————————————————————————————————————-

R is a huge language with 5000 packages and hundreds of thousands of functions.

But if you memorize these ~50 functions, I assure you that you will make a much more positive impression in your business analytics interview!

Bonus-

Sys.time() and Sys.Date() give the current time and date (note the change in case)
while system.time(expression) gives the time taken to evaluate an expression

Citation- http://www.ats.ucla.edu/stat/r/faq/timing_code.htm

Also

yo=function(a,b){ a*b*12} creates a custom function called yo, which can then be invoked as yo(2,3)
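
A short sketch of the bonus functions in action. The loop being timed is a throwaway computation chosen only for illustration, and the timings will differ from machine to machine.

```r
# The custom function from above
yo <- function(a, b) { a * b * 12 }
yo(2, 3)        # 72

Sys.Date()      # today's date
Sys.time()      # current date and time

# Time a throwaway computation; the numbers vary by machine
timing <- system.time(for (i in 1:1e5) sqrt(i))
timing[["elapsed"]]   # wall-clock seconds taken
```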

Coming up-

the apply family, the Hadley collective  and GUIs  for the second round of interview ;)

Author: Ajay Ohri

http://about.me/ajayohri

3 thoughts on “50 functions to clear a basic interview for Business Analytics #rstats”

khushboo30jain says:

June 8, 2018 at 2:03 am

For the 4th function under the section data input, the code includes stringsasfactors=T i.e. it reads strings as factor levels instead of character values.
