With due respect to all cheat sheets and reference cards, these are the functions that I use, in sequence, to analyze a business data set.

**Packages**

- **install.packages("Hmisc")** installs the package Hmisc
- **library(Hmisc)** loads the package Hmisc
- **update.packages()** updates all installed packages

**Data Input**

- **getwd()** gets the current working directory
- **setwd("C:/Path")** sets the working directory to a new path, here C:/Path
- **dir()** lists all the files present in the current working directory
- a = **read.csv**("1.csv", header=T, sep=",", stringsAsFactors=T)

Here

a **=** **read.csv**(...) **assigns** the object name a to whatever is on the right-hand side.

(You can also explicitly assign a name, given as a character string, to a data object using **assign**.)

read.csv is an input command derived from read.table that reads in a rectangular object (rows and columns).

**header** specifies whether the first line has variable names or not.

**sep** denotes the separator (generally a comma for CSV files, but it can be a space or tab for text files).

**stringsAsFactors**=T reads in strings as factor levels and not as character values; use stringsAsFactors=F to keep them as character values.
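A minimal sketch of the input step, assuming a tiny made-up CSV written to a temporary file (the column names `city` and `sales` and the object name `sales_data` are invented for illustration):

```r
# Write a tiny CSV to a temporary file so the example is self-contained
# (the file contents are made up for illustration).
path <- tempfile(fileext = ".csv")
writeLines(c("city,sales", "Delhi,10", "Mumbai,20", "Delhi,30"), path)

# stringsAsFactors = TRUE turns the text column into a factor
a <- read.csv(path, header = TRUE, sep = ",", stringsAsFactors = TRUE)
class(a$city)

# stringsAsFactors = FALSE keeps it as plain character strings
b <- read.csv(path, header = TRUE, sep = ",", stringsAsFactors = FALSE)
class(b$city)

# assign() creates an object whose name is given as a character string
assign("sales_data", a)
identical(sales_data, a)
```

Comparing `class(a$city)` with `class(b$city)` is a quick way to check which behaviour you got.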

**Object Inspection**

- **str(a)** gives the structure of object a, including class, dimensions, types of variables, names of variables and a few values of each variable as well. The only drawback is that it can throw up a lot of information for a big data set
- **names(a)** gives the names of the variables of the object
- **class(a)** gives the class of an object, like data.frame, list, matrix, vector etc.
- **dim(a)** gives the dimensions (rows and columns) of object a
- **nrow(a)** gives the number of rows of object a; useful when used as an input for another function
- **ncol(a)** gives the number of columns of object a
- **length(a)** gives the length of the object; useful for vectors, and for a data frame it is the same as ncol
- **a[i,j]** gives the value in the ith row and jth column
- **a$var1** gives the variable named var1 in object a. This can be treated as a separate object on its own for inspection or analysis
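The inspection functions above can be tried on a small data frame. This is only a sketch; the data frame and its column names are made up:

```r
# A small made-up data frame to inspect
a <- data.frame(var1 = c(10, 20, 30), var2 = c("x", "y", "z"))

str(a)        # compact overview of the structure
names(a)      # "var1" "var2"
class(a)      # "data.frame"
dim(a)        # 3 2
nrow(a)       # 3
ncol(a)       # 2
length(a)     # 2, the same as ncol(a) for a data frame
a[2, 1]       # 20: row 2, column 1
a$var1        # the whole var1 column as a vector
```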

**Data Inspection**

- **head(a,10)** gives the first ten rows of object a
- **tail(a,20)** gives the last twenty rows of object a
- b = ajay[**sample**(nrow(ajay), **replace**=F, **size**=0.05*nrow(ajay)), ]

This gets a 5% random sample of the rows of the object ajay.

[ ] subsets the object, returning the rows whose numbers are given before the comma.

sample is the function that draws the random sample.

Here nrow(ajay) is the total number of rows to sample from, and **size** is the size of the sample, 5% of that number; the values returned are row numbers. replace=F means each row number is selected only once.
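A short sketch of the sampling step, assuming a made-up 200-row stand-in for ajay (the seed is set only so the example is repeatable):

```r
set.seed(42)   # only to make the sketch repeatable
ajay <- data.frame(id = 1:200, value = rnorm(200))   # made-up stand-in data

# Draw 5% of the row numbers without replacement, then subset by them
rows <- sample(nrow(ajay), replace = FALSE, size = 0.05 * nrow(ajay))
b <- ajay[rows, ]

nrow(b)               # 10 rows, i.e. 5% of 200
anyDuplicated(rows)   # 0, because replace = FALSE
```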

**Math Functions**

**Basic**

- **sum**: sum
- **sqrt**: square root
- **sd**: standard deviation
- **log**: log
- **mean**: mean
- **median**: median

**Additional**

- **cumsum**: cumulative sum for a column
- **diff**: differencing
- **lag**: lag
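A quick sketch of these functions on a made-up numeric vector. The note on lag reflects base R's stats::lag, which works on time-series objects:

```r
x <- c(4, 9, 16, 25)   # made-up values

sum(x)       # 54
sqrt(x)      # 2 3 4 5
mean(x)      # 13.5
median(x)    # 12.5
sd(x)        # sample standard deviation
log(x)       # natural logarithm by default

cumsum(x)    # 4 13 29 54
diff(x)      # 5 7 9

# stats::lag() is meant for time-series objects: it shifts the time
# index of the series rather than moving values within a plain vector
stats::lag(ts(x), 1)
```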

**Data Manipulation**

- **paste(a$Var)** converts Var from a factor or numeric variable to a character variable
- **as.numeric(a$Var2)** converts a character variable into a numeric variable
- **is.na(a)** returns TRUE wherever it encounters a missing value within the object
- **na.omit(a)** removes all rows with missing values (denoted by NA within R)
- **na.rm=T** this option enables you to calculate values while ignoring missing values
- **nchar("abc")** gives the number of characters in a character value
- **substr("ajay",1,3)** gives the substring from starting place 1 to ending place 3. Note that in R the index starts from 1 for the first element
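These can be sketched on a small made-up data frame with one missing value:

```r
# Made-up data: Var has a missing value, Var2 holds numbers stored as text
a <- data.frame(Var  = c(1.5, NA, 3),
                Var2 = c("10", "20", "30"),
                stringsAsFactors = FALSE)

paste(a$Var)                # numeric values converted to character strings
as.numeric(a$Var2)          # 10 20 30
is.na(a$Var)                # FALSE TRUE FALSE
na.omit(a)                  # drops the row that contains NA
mean(a$Var, na.rm = TRUE)   # 2.25
nchar("abc")                # 3
substr("ajay", 1, 3)        # "aja"
```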

**Date Manipulation**

library(**lubridate**)

> a="20Mar1987"

> **dmy**(a)

[1] "1987-03-20 UTC"

> b="20/7/89"

> **dmy**(b)

[1] "1989-07-20 UTC"

> c="October 12 93"

> **mdy**(c)

[1] "1993-10-12 UTC"
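If lubridate is not installed, a rough base-R stand-in is as.Date with an explicit format string. This is a swapped-in technique rather than the lubridate approach shown above, and the second input string is made up:

```r
# Base-R alternative: the format has to be spelled out by hand
d1 <- as.Date("20/7/89", format = "%d/%m/%y")      # two-digit year
d2 <- as.Date("12-10-1993", format = "%d-%m-%Y")   # made-up input string

format(d1)         # "1989-07-20"
format(d2, "%Y")   # "1993"
d2 - d1            # difference between the two dates, in days
```

The convenience of dmy and mdy is that they guess the separators and layout; as.Date needs the layout stated up front.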

**Data Analysis**

- **summary(a)** gives a summary of the object, including min, max, median, mean, 1st quartile and 3rd quartile for numeric variables, and a frequency analysis of factor variables
- **table(a)** gives a frequency analysis of a variable or object
- **table(a$var1,a$var2)** gives cross tabs of var1 with respect to var2 of object a

*library(Hmisc) loads Hmisc, which enables us to use the describe and summarize functions*

- **describe(a$var1)** gives a much more detailed summary of the variable var1; it's a better version of summary
- **summarize(a$var1,a$var2,FUN)** applies a function (like sum, median, summary or mean) on var1, as GROUPED by var2
- **cor(a)** gives the correlation between all numeric variables of a
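A sketch of the base-R analysis calls, using the built-in mtcars data set as a stand-in. The last line uses tapply as a base-R substitute for a grouped summary; it is not Hmisc's summarize itself:

```r
# mtcars ships with R; cyl is stored as a number, so convert it to a factor
a <- mtcars
a$cyl <- factor(a$cyl)

summary(a$mpg)          # min, quartiles, median, mean, max
summary(a$cyl)          # counts per factor level
table(a$cyl)            # frequency table
table(a$cyl, a$gear)    # cross tab of cyl against gear
cor(a[, c("mpg", "wt", "hp")])   # correlation matrix of numeric columns

# Base-R stand-in for a grouped summary: mean mpg within each cyl group
tapply(a$mpg, a$cyl, mean)
```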

**Data Visualization**

- **plot(a$var1,a$var2)** plots var1 against var2
- **boxplot(a)** box plot
- **hist(a$Var1)** histogram
- **plot(density(a$Var1))** density plot
- **pie** pie chart
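A sketch of the plotting calls on the built-in mtcars data. The plots are sent to a temporary PDF file only so that the example runs without a screen; drop the pdf() and dev.off() lines to draw in the plot window instead:

```r
a <- mtcars   # built-in data set, used here as a stand-in

out <- tempfile(fileext = ".pdf")   # temporary, made-up file name
pdf(out)                  # send plots to a file; omit to draw on screen
plot(a$wt, a$mpg)         # scatter plot of two variables
boxplot(a$mpg)            # box plot
hist(a$mpg)               # histogram
plot(density(a$mpg))      # density plot
pie(table(a$cyl))         # pie chart of counts per group
dev.off()                 # close the file so it is fully written
```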

**Modeling**

- a = **lm**(y~x) creates a linear model
- **vif**(a) gives variance inflation (**library(car)** may help)
- **outlierTest**(a) gives outliers (also from car)
- **summary**(a) gives the model summary, including parameter estimates
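A minimal modeling sketch on the built-in mtcars data; the choice of variables is arbitrary. The car calls are guarded because that package may not be installed:

```r
# Fit a linear model of mpg on weight and horsepower (arbitrary choice)
fit <- lm(mpg ~ wt + hp, data = mtcars)

summary(fit)   # estimates, standard errors, R-squared
coef(fit)      # just the parameter estimates

# vif() and outlierTest() come from the car package
if (requireNamespace("car", quietly = TRUE)) {
  print(car::vif(fit))           # variance inflation factors
  print(car::outlierTest(fit))   # test for outlying observations
}
```

Two predictors are used here because variance inflation only makes sense when there is more than one.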

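**Write Output**

- **write.csv(a)** writes output in CSV format; pass a file name as the second argument to write it to a file
- **png("graph.png")** writes a plot as a PNG file; call dev.off() after plotting to finish the file
- **q()** quits the R session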
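A sketch of writing a data frame out and reading it back, assuming a temporary file in place of a real output path:

```r
a <- head(mtcars)   # first six rows of a built-in data set

csv_path <- tempfile(fileext = ".csv")      # temporary, made-up file name
write.csv(a, csv_path, row.names = FALSE)   # skip the row-name column

back <- read.csv(csv_path)   # read it back to check the round trip
dim(back)                    # 6 11
```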

——————————————————————————————————————————————————-

R is a huge language with 5000 packages and hundreds of thousands of functions.

But if you memorize these ~50 functions, I assure you that you will make a much more positive impression in your business analytics interview!

Bonus-

Sys.time() and Sys.Date() give the current time and date (note the change in case), while system.time(expression) gives the time taken to evaluate an expression. Citation- http://www.ats.ucla.edu/stat/r/faq/timing_code.htm

Also, yo=function(a,b){ a*b*12} creates a custom function called yo, which can then be invoked as yo(2,3).

Coming up- the apply family, the Hadley collective and GUIs for the second round of interview ;)
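The bonus functions in one short sketch; the loop being timed is an arbitrary example:

```r
Sys.time()   # current date and time
Sys.Date()   # current date

# system.time() reports how long an expression takes to evaluate
timing <- system.time(for (i in 1:1e5) sqrt(i))
timing["elapsed"]

# The custom function from the text
yo <- function(a, b) {
  a * b * 12
}
yo(2, 3)     # 72
```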

Note on the 4th function under the Data Input section: the code includes stringsAsFactors=T, i.e. it reads strings as factor levels instead of character values.
