Book on Analytics

Basics of Data Handling for R beginners #rstats

  • Assigning Objects

We can create new data objects and variables quite easily within R. We use the = or the <- operator to assign an object to its name. For the purpose of this article we will use = to assign objects to object names. This is very useful when we are doing data manipulation, as we can reuse the manipulated data as inputs for other steps in our analysis.
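As a minimal sketch of assignment (the object names x, y and x_doubled are illustrative and not from the text above), both operators bind a value to a name that later steps can reuse:

```r
# Both operators bind a value to a name; <- is the more conventional form.
x = c(10, 20, 30)
y <- c(10, 20, 30)

# The assigned object can be reused as the input to a later step.
x_doubled = x * 2

identical(x, y)   # TRUE
x_doubled         # 20 40 60
```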

 

Types of Data Objects in R

  • Lists

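A list, as the term is used in this article, is simply a one-dimensional collection of data. We create it using the c function. Strictly speaking, c returns what R calls a vector; R's own list type is created with the list function.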

The following code creates a list named numlist from 6 input numeric data

numlist=c(1,2,3,4,5,78)

 

The following code creates a list named charlist from 5 input character data

charlist=c("John","Peter","Simon","Paul","Francis")

 

The following code creates a list named mixlist from both numeric and character data. Note that c stores a single data type, so the numbers are converted to character strings.

mixlist=c(1,2,3,4,"R language","Ajay")
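A quick check of what c actually produced can be sketched as follows. The object truelist is a hypothetical addition, included only to contrast c with the list function:

```r
numlist = c(1, 2, 3, 4, 5, 78)
mixlist = c(1, 2, 3, 4, "R language", "Ajay")

class(numlist)   # "numeric"
class(mixlist)   # "character": the numbers were coerced to strings
mixlist[1]       # "1"

# list() keeps each element's own type instead of coercing.
truelist = list(1, 2, "R language")
class(truelist[[1]])   # "numeric"
class(truelist[[3]])   # "character"
```

This matters for data quality: a single stray text value in a numeric column is enough to turn the whole column into character data.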

 

  • Matrices

A matrix is a two-dimensional collection of data in rows and columns, unlike a list, which is basically one-dimensional. We can create a matrix using the matrix command, specifying the number of rows with the nrow parameter and the number of columns with the ncol parameter.

In the following code, we create a matrix named ajay. The data is input in 3 rows as specified, but it is entered into the first column, then the second column, and so on.

ajay=matrix(c(1,2,3,4,5,6,12,18,24),nrow=3)

ajay

     [,1] [,2] [,3]
[1,]    1    4   12
[2,]    2    5   18
[3,]    3    6   24

 

However, please note the effect of using the byrow=T (TRUE) option. In the following code we create a matrix named ajay, and the data is input in 3 rows as specified, but it is entered into the first row, then the second row, and so on.

 

>ajay=matrix(c(1,2,3,4,5,6,12,18,24),nrow=3,byrow=T)

>ajay

     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]   12   18   24

  • Data Frames

A data frame is a list of variables of the same number of rows with unique row names. The column names are the names of the variables.
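A small data frame can be sketched like this. The object players and its values are made up for illustration:

```r
# Hypothetical example data: two variables with the same number of rows.
players = data.frame(
  name  = c("John", "Peter", "Simon"),
  score = c(12, 45, 30),
  stringsAsFactors = FALSE
)

names(players)      # "name" "score"  (the variables)
dim(players)        # 3 2
players$score       # 12 45 30
rownames(players)   # "1" "2" "3"     (unique row names)
```

Unlike a matrix, each column keeps its own type, so a character column and a numeric column can sit side by side.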

 

Data Munging using #rstats Part 1 -Understanding Data Quality

This is a series of posts on Data Munging using R.

We will examine the various ways to input data and examine errors in the data input stage. We will accordingly study ways to detect errors and rectify them using the R language. People estimate that almost 60-70% of a project’s time goes into the data input, data quality and data validation stage. By the principle of Garbage In, Garbage Out, we believe that an analysis is only as good as the quality of its input data. Thus data quality is both an integral part of a project and one of its first stages, before we move to comprehensive statistical analysis.

Data Quality is an important part of studying data manipulation. How do we define Data Quality?

In this chapter, Data quality is defined as manipulating data in the desired shape, size and format. We further elaborate that as follows-

Data that is useful for analysis without any errors is high quality data.

Data that is problematic for accurate analysis because of any errors is low quality data.

Data Quality errors are defined as deviations from actual data, due to systematic, computing or human mistakes.

Rectifying data quality errors involves the steps of error detection and missing value imputation. It also involves using the feedback from these steps to design better data input mechanisms.

The major types of Data Quality errors are-

Missing Data- This is when data is simply missing. It may be represented by a “.”, a blank space, or a special notation like NA (not available). In R, missing data is represented by NA. Missing data is the easiest error to detect but tough to rectify, since most of the time we deal with data that was collected in the past, and it is difficult and expensive to replace it with actual values. Some methods of replacing missing data are imputing or inferring what the missing values could be by looking at measures of central tendency like the median or mean, by checking correlation with other variables or with data points that are better populated, or by looking at historic data for a particular sub-set. Accordingly, missing values for a particular variable can be divided into sub-sets for imputation by various means (for example, by different geographic values or different time values).

Invalid Data (too high or too low numeric (and date-time) data, character data in invalid format).

Incorrect Data (due to input errors including invalid or obsolete business rules, human input, low quality OCR scans)
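Detecting and imputing missing values, as described under Missing Data above, can be sketched as follows. The data frame sales, its columns and the helper impute_median are hypothetical, and median imputation is only one of the options mentioned:

```r
# Hypothetical data: sales amounts with missing values in two regions.
sales = data.frame(
  region = c("East", "East", "East", "West", "West", "West"),
  amount = c(10, NA, 30, 100, 300, NA)
)

is.na(sales$amount)        # FALSE TRUE FALSE FALSE FALSE TRUE
sum(is.na(sales$amount))   # 2

# Option 1: impute with the overall median of the observed values.
overall = sales$amount
overall[is.na(overall)] = median(overall, na.rm = TRUE)
overall                    # 10 65 30 100 300 65

# Option 2: impute within sub-sets, here each region's own median.
impute_median = function(v) {
  v[is.na(v)] = median(v, na.rm = TRUE)
  v
}
sales$amount_imputed = ave(sales$amount, sales$region, FUN = impute_median)
sales$amount_imputed       # 10 20 30 100 300 200
```

The two options give very different values for the same gaps (65 versus 20 and 200), which is why splitting by geography or time before imputing can matter.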

The major causes of Data Quality Errors are-

Human Error (due to input, typing )

Machine Error (due to input that cannot be read reliably, e.g. from a low-resolution scanning device)

Syntax Error ( due to invalid logic or assumptions)

Data Format Error (due to a format that is not readable by software reading in data)

Steps for Diagnosis-

Missing Value Detection (using functions related to is.na) and Missing Value Imputation

Distribution Analysis (using functions like summary, or describe from add-on packages such as Hmisc or psych, and visualizations like boxplot, histogram)

Outliers (Bonferroni) Detection and Outlier Capping ( Minimum- Maximum)

Correlation with other variables ( using correlation statistics)
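The diagnosis steps above can be sketched in base R using the built-in iris data. Two assumptions to note: outliers are flagged here with the simple boxplot whisker rule rather than a Bonferroni test, and the 1st and 99th percentiles used for capping are an arbitrary choice:

```r
data(iris)
x = iris$Sepal.Width

# Missing value detection
sum(is.na(x))

# Distribution analysis
summary(x)

# Outlier detection: points beyond the boxplot whiskers
out = boxplot.stats(x)$out

# Outlier capping between a minimum and a maximum
limits = unname(quantile(x, c(0.01, 0.99)))
x_capped = pmin(pmax(x, limits[1]), limits[2])
range(x_capped)

# Correlation with another variable
cor(iris$Petal.Length, iris$Petal.Width)
```

Calling boxplot(x) or hist(x) would show the same distribution visually.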

Diagnosis of Data Quality

 

The following functions in R will help us evaluate the quality of data in our data object.

str- gives the structure of an object; for a data frame this includes class, dimensions, variable names, variable types and the first few observations of each variable.

names- gives variable names.

dim- dimensions of object.

length- gives length of data object.

nrow- gives number of rows of data object.

ncol – gives number of columns of data object.

class- gives data class of object. This can be list, matrix or data.frame or other classes.

We use the famous iris dataset and load it in our R session using the command data(iris). We then try out each of the functions given above.

> data(iris)

> str(iris)

'data.frame': 150 obs. of 5 variables:

$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...

$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...

$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...

$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...

$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

> names(iris)

[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"

> dim(iris)

[1] 150 5

> length(iris)

[1] 5

> nrow(iris)

[1] 150

> ncol(iris)

[1] 5

> class(iris)

[1] "data.frame"

It is quite clear that the str function by itself is enough for a first check of data quality, as its output covers what the other functions report.

We now try to print out a part of the object to check what is stored there. By default we can print the entire object by just writing its name. However, this may be inconvenient when there are a large number of rows.

Accordingly we use the head and tail functions to look at the beginning and last rows in a data object.

head – gives first few observations in a data object as specified by parameter in head (objectname, number of rows)

tail -gives last few observations in a data object as specified by parameter in tail (objectname, number of rows)

Here we take the first 7 rows and the last 3 rows of dataset iris. Note that the first column in the output below is the row number.

> head(iris,7)

Sepal.Length Sepal.Width Petal.Length Petal.Width Species

1 5.1 3.5 1.4 0.2 setosa

2 4.9 3.0 1.4 0.2 setosa

3 4.7 3.2 1.3 0.2 setosa

4 4.6 3.1 1.5 0.2 setosa

5 5.0 3.6 1.4 0.2 setosa

6 5.4 3.9 1.7 0.4 setosa

7 4.6 3.4 1.4 0.3 setosa

> tail(iris,3)

Sepal.Length Sepal.Width Petal.Length Petal.Width Species

148 6.5 3.0 5.2 2.0 virginica

149 6.2 3.4 5.4 2.3 virginica

150 5.9 3.0 5.1 1.8 virginica

We can also pass negative numbers as parameters to head and tail. A negative number tells head to return all rows except that many at the end, and tells tail to return all rows except that many at the beginning. Since the object iris has 150 rows, -143 leaves 150 - 143 = 7 rows, so here we get the first 7 and the last 7 rows.

> head(iris,-143)

Sepal.Length Sepal.Width Petal.Length Petal.Width Species

1 5.1 3.5 1.4 0.2 setosa

2 4.9 3.0 1.4 0.2 setosa

3 4.7 3.2 1.3 0.2 setosa

4 4.6 3.1 1.5 0.2 setosa

5 5.0 3.6 1.4 0.2 setosa

6 5.4 3.9 1.7 0.4 setosa

7 4.6 3.4 1.4 0.3 setosa

> tail(iris,-143)

Sepal.Length Sepal.Width Petal.Length Petal.Width Species

144 6.8 3.2 5.9 2.3 virginica

145 6.7 3.3 5.7 2.5 virginica

146 6.7 3.0 5.2 2.3 virginica

147 6.3 2.5 5.0 1.9 virginica

148 6.5 3.0 5.2 2.0 virginica

149 6.2 3.4 5.4 2.3 virginica

150 5.9 3.0 5.1 1.8 virginica
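The equivalence between a negative argument and the remaining row count can be checked directly. This is a small sketch; the variable n is introduced only for illustration:

```r
data(iris)
n = nrow(iris)   # 150

# Dropping 143 rows from one end is the same as keeping 150 - 143 = 7.
identical(head(iris, -143), head(iris, n - 143))   # TRUE
identical(tail(iris, -143), tail(iris, n - 143))   # TRUE

# Another case: everything except the last 10 rows.
nrow(head(iris, -10))                              # 140
```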

1.2 Strings

One of the most common errors in data analytics is a mismatch in string variables. String variables, also known as character variables, are non-numeric text, and even a single misplaced white space or a difference in upper and lower case can cause discrepancies in the data. This error is most critical for address data and name data.

From the perspective of R, the data “virginica” is a different value (or factor level) from “ virginica” and from “Virginica”. “1600 Penn Avenue” is a different address from “1600 Pennsylvania Avenue” and from “1600 PA”. This can lead to an escalation of costs, especially since users of business analytics try to create unique and accurate contact details (names and addresses). It is even more important for credit checks and financial data, since a data mismatch can assign a wrong credit score to a person, creating liability for the credit provider.
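The “virginica” example can be sketched as below. The base R function trimws is not mentioned in the text and is brought in here as one possible way to strip leading and trailing white space; the object names are illustrative:

```r
species = c("virginica", " virginica", "Virginica")

# R sees three distinct values.
length(unique(species))   # 3

# Trimming white space and forcing one case collapses them into one.
clean = tolower(trimws(species))
clean                     # "virginica" "virginica" "virginica"
length(unique(clean))     # 1
```

Abbreviations such as “Penn” versus “Pennsylvania” are not fixed by this kind of normalization and need a lookup table or manual rules.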

For changing case we use the functions toupper and tolower

> a=c("ajay","vijay","ravi","rahul","bharat")

> toupper(a)

[1] "AJAY" "VIJAY" "RAVI" "RAHUL" "BHARAT"

> b=c("Jane","JILL","AMY","NaNCY")

> tolower(b)

[1] "jane" "jill" "amy" "nancy"

sub, gsub, grepl

 

grepl can be used to find a part of a string. For example, in cricket we denote a not out score of 250 runs by a star, i.e. 250*, but denote a score of 250 out as 250. This can create a problem when we read in the data. R will either treat the column as character data or, if we coerce it to numeric values, turn the not out scores into missing values.

We want to find all instances of “*” in the HS field and flag those scores as not out. grepl returns a logical vector (match or not for each element of x). We will further expand on this example in our Case Study for Cricket Analytics

table2$HSNotOut=grepl("\\*",table2$HS)
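A self-contained sketch of this step is shown below. The contents of table2 here are a made-up stand-in, and the column HSRuns is a hypothetical addition that stores the score as a number:

```r
# Hypothetical stand-in for table2: HS holds scores as text.
table2 = data.frame(
  HS = c("250*", "250", "99*", "13"),
  stringsAsFactors = FALSE
)

# Flag the not out scores; \\* matches a literal star.
table2$HSNotOut = grepl("\\*", table2$HS)
table2$HSNotOut   # TRUE FALSE TRUE FALSE

# Strip the star so the score itself can be stored as a number.
table2$HSRuns = as.numeric(gsub("\\*", "", table2$HS))
table2$HSRuns     # 250 250 99 13
```

Keeping the flag in its own column means no information is lost when the star is removed.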


We use sub and gsub to substitute parts of a string. While the sub function replaces only the first occurrence, the gsub function replaces all occurrences of the matching pattern with the parameter supplied.

Here we are trying to remove the leading white space in a sentence. Notice that sub works better than gsub in this case: sub removes only the first space, while gsub also removes the spaces between words.

> newstring=" Hello World We are Experts in Learning R"

> sub(" ","",newstring)

[1] "Hello World We are Experts in Learning R"

> gsub(" ","",newstring)

[1] "HelloWorldWeareExpertsinLearningR"

Let us try to convert currency data into numeric data. For the sake of learning we are using a small data object called “money” with three different inputs.

> money=c("$10,000","20000","32,000")

> money

[1] "$10,000" "20000" "32,000"

We replace a comma (used mainly for thousands in currency data) using gsub as shown before.

> money2=gsub(",","",money)

> money2

[1] "$10000" "20000" "32000"

In regular expressions, $ indicates the end of a line, so a literal dollar sign is written \$. Inside an R string the backslash must itself be escaped, so we use \\$ as the pattern in the gsub expression.

> money3=gsub("\\$","",money2)

> money3

[1] "10000" "20000" "32000"

At this point we may be satisfied that we have got the format we wanted. However this is an error, as these are still strings- as we find out by running the mean function

> mean(money3)

[1] NA

Warning message:

In mean.default(money3) : argument is not numeric or logical: returning NA

We then use the as operator to convert one data type (character) into another (numeric). The as operator is generally written as as.outputdataobject.class. Accordingly we will use as.numeric for the conversion.

 

> money4=as.numeric(money3)

> money4

[1] 10000 20000 32000

> mean(money4)

[1] 20666.67
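The same cleaning can be sketched in a single step. The pattern "[$,]" is a character class matching either a dollar sign or a comma, and the name money_num is illustrative:

```r
money = c("$10,000", "20000", "32,000")

# Inside square brackets $ is a literal character, so no escaping is needed.
money_num = as.numeric(gsub("[$,]", "", money))

money_num         # 10000 20000 32000
mean(money_num)   # 20666.67
```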


Please note, we used many intermediate steps to do the data manipulation and used the = sign to assign each result to a new object. We can combine several steps into one by nesting the function calls within successive brackets. This is illustrated below, where we convert character data containing percentage signs (%) into numeric data.

> mean(as.numeric(gsub("%","",percentages)))

[1] 35

> percentages

[1] "%20" "%30" "%40" "50"

Note we have found the mean but the original object is not changed.

 

Apply gsub to only one variable at a time

A slight problem: suppose there is data like 1,504. It will be converted to NA instead of 1504. The way to solve this is to use the gsub function ONLY on that variable. Since the comma is also the most commonly used delimiter, you don't want to replace all the commas, only the ones in that variable.

dataset$Variable2=as.numeric(paste(gsub(",","",dataset$Variable)))
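A small sketch of why the replacement is limited to one variable follows. The contents of dataset and the column Label are hypothetical, and the paste call is left out because the column is already character data:

```r
# Hypothetical data frame: Label legitimately contains commas,
# Variable holds numbers formatted with a thousands separator.
dataset = data.frame(
  Label    = c("a,b", "c,d"),
  Variable = c("1,504", "2,000"),
  stringsAsFactors = FALSE
)

# Direct conversion fails: both values become NA (R also warns).
suppressWarnings(as.numeric(dataset$Variable))   # NA NA

# Remove commas in that one variable only, then convert.
dataset$Variable2 = as.numeric(gsub(",", "", dataset$Variable))
dataset$Variable2   # 1504 2000
dataset$Label       # "a,b" "c,d"  (left untouched)
```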

 

 

Additional- The function setAs creates methods for the as function to use. This is an advanced usage.
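One way this is sometimes used, sketched under assumptions, is to register a conversion so that read.csv cleans a column while reading it. The class name num.with.commas and the inline CSV text are made up for this example:

```r
library(methods)

# "num.with.commas" is a made-up class name, used only as a conversion target.
setClass("num.with.commas")
setAs("character", "num.with.commas",
      function(from) as.numeric(gsub(",", "", from)))

# read.csv applies the conversion to the column named in colClasses.
csv_text = 'id,amount\n1,"1,504"\n2,"2,000"'
d = read.csv(text = csv_text,
             colClasses = c("integer", "num.with.commas"))

d$amount   # 1504 2000
```

The commas inside the quoted amounts are removed only in that column, while the delimiter commas are handled by read.csv as usual.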

 

 

 

 

Big Data : Building Big Brother’s Evil Empire

Background- The Russell–Einstein Manifesto was issued in London on July 9, 1955 by Bertrand Russell in the midst of the Cold War. It highlighted the dangers posed by nuclear weapons and called for world leaders to seek peaceful resolutions to international conflict. The signatories included eleven pre-eminent intellectuals and scientists, including Albert Einstein, who signed it just days before his death on April 18, 1955.

It is time Data Scientists, most of whom have tacitly accepted that open source and open societies are movements for good, come together and stop enabling Big Brothers in China, India and the United States from accumulating lifetimes of information on the citizens of the Internet. Does the 21st Century need a better equipped and enabled United Nations to deal with climate change, weather weapons and cyber warfare? While the myth of an Anglo-Saxon unipolar world is dead, renegade elements from the military industrial complex of both the West and the East threaten the peace with illegal and unethical spying on citizens and civilians.

As data scientists, we enable them, these Governments, with the new cyber nukes to collect, sort, aggregate and visualize huge amounts of text data. I bet the NSA and China use Hadoop and R for purposes that are both repulsive and illegal. Now you can join the gravy train, and earn big money working for one, or the other, or even both.

Or we can come up with a newer version of the Russell-Einstein manifesto for Big Data and the Internet. The military industrial complex of the East and West has screwed the real planet enough. Why screw the cyber-world of the Internet with these weapons of mass collection, whether using Python, R, SAS or JMP, or using regression, k-means or map reduce? Data scientists should have a conscience before they fire their code.


Big Brother is watching you- George Orwell

Big Data is watching you- George Bush

How to help your government keep the world safe using statistics #rstats #python #sas

Big Data for Big Brother. Now playing. At a computer near you. How to help water the tree of liberty using statistics?

Use R

[Screenshot]

or

Use Python

[Screenshot]

or use SAS software

SAS/CIA from the last paragraph of

http://www.sas.com/offices/asiapacific/india/sasnews/pressclippings/ET_CD_Mumbai_Jul12.pdf

[Screenshot]

 

Thoughts on Guardian’s expose of US Govt Data Mining

I am writing this sitting in Canada, in a language given to me by my education and colonial history. Some of my best friends and mentors have been Americans, Europeans and Asians. I will try and state this as objectively as possible.

  1. If China’s government sees your data or even data of dissidents, it is considered bad, but if the US Government does it, it is considered okay. Is that really okay? I would trust the SCOTUS, the IRS, the Congress, but I am not sure I should trust the Pentagon, NSA and White House with zero restraints.
  2. In reality no human eye can see so much data. They have algorithms running for text mining, automated programs that enable human analysts to zoom in if required. The US government is presumably not interested in your dating life, but the data is there.
  3. Economic espionage has been a traditional tradecraft of Western policy, from the time they borrowed gunpowder and silk from China to Operation Paperclip, which gave Nazis pardons for US space programs.
  4. USA has a long tradition and policy of government and defense working with the private sector to give them economic advantages. Internet was released by Al Gore and DARPA.
  5. With the new challenges of climate change, economic rivalry and diminishing energy resources, should the US government be trusted with almost 80% of the data flowing through the English speaking Internet?
  6. The collection of data from non-American citizens effectively makes this an undeclared cyber-war that the Obama government is waging against the world.
  7. If Albert Einstein could protest nuclear weapons 60 years ago, as data scientists it is our community’s duty to clarify the rules of engagement of data collection and data mining, before we get into one more cyber cold war.


 

I am disappointed with Microsoft, Yahoo, Google, Facebook, PalTalk, AOL, Skype, YouTube, Apple for not even once protesting this move.

I would also like to know what the expense of this monitoring has been over the past 7 years, and how many threats were neutralized (and why the Boston Bombers and others could not be).

I would like to know if data belonging to members of the US Congress was collected or purged from the NSA's records, whether there are any exclusion criteria for people, or whether data was collected for everybody.

Read more here- http://www.guardian.co.uk/world/2013/jun/06/us-tech-giants-nsa-data

To my intense horror, it seems Julian Assange was right about Eric Schmidt.

There are NO ways of making money that are NOT evil.

Writing on APIs for Programmable Web

I have been writing freelance on APIs for Programmable Web. Here is an updated list of the articles; many of these would be of interest to analytics users. Note- some of these are interviews and they are in bold.

PW Interview: Francisco J Martin, CEO BigML.com 2013/05/30

PW Interview: Tal Rotbart Founder- CTO, SpringSense 2013/05/28

PW Interview: Jeh Daruwala CEO Yactraq API, Behavorial Targeting for videos 2013/05/13

PW Interview: Michael Schonfeld of Dwolla API on Innovation Meeting the Payment Web  2013/05/02

PW Interview: Stephen Balaban of Lamda Labs on the Face Recognition API  2013/04/29

PW Interview: Amber Feng, Stripe API, The Payment Web 2013/04/24

PW Interview: Greg Lamp and Austin Ogilvie of Yhat on Shipping Predictive Models via API   2013/04/22

Google Mirror API documentation is open for developers   2013/04/18

PW Interview: Ricky Robinett, Ordr.in API, Ordering Food meets API    2013/04/16

PW Interview: Jacob Perkins, Text Processing API, NLP meets API   2013/04/10

Amazon EC2 On Demand Windows Instances -Prices reduced by 20%  2013/04/08

Amazon S3 API Requests prices slashed by half  2013/04/02

PW Interview: Stuart Battersby, Chatterbox API, Machine Learning meets Social 2013/04/02

PW Interview: Karthik Ram, rOpenSci, Wrapping all science API 2013/03/20

Viralheat Human Intent API- To buy or not to buy 2013/03/13

Interview Tammer Kamel CEO and Founder Quandl 2013/03/07

YHatHQ API: Calling Hosted Statistical Models 2013/03/04

Quandl API: A Wikipedia for Numerical Data 2013/02/25

Amazon Redshift API is out of limited preview and available! 2013/02/18

Windows Azure Media Services REST API 2013/02/14

Data Science Toolkit Wraps Many Data Services in One API 2013/02/11

Diving into Codeacademy’s API Lessons 2013/01/31

Google APIs finetuning Cloud Storage JSON API 2013/01/29

2012
Ergast API Puts Car Racing Fans in the Driver’s Seat 2012/12/05
Springer APIs- Fostering Innovation via API Contests 2012/11/20
Statistically programming the web – Shiny,HttR and RevoDeploy API 2012/11/19
Google Cloud SQL API- Bigger ,Faster and now Free 2012/11/12
A Look at the Web’s Most Popular API -Google Maps API 2012/10/09
Cloud Storage APIs for the next generation Enterprise 2012/09/26
Last.fm API: Sultan of Musical APIs 2012/09/12
Socrata Data API: Keeping Government Open 2012/08/29
BigML API Gets Bigger 2012/08/22
Bing APIs: the Empire Strikes Back 2012/08/15
Google Cloud SQL: Relational Database on the Cloud 2012/08/13
Google BigQuery API Makes Big Data Analytics Easy 2012/08/05
Your Store in The Cloud -Google Cloud Storage API 2012/08/01
Predict the future with Google Prediction API 2012/07/30
The Romney vs Obama API 2012/07/27

Tatvic bets on R

Tatvic, an up-and-coming startup founded by an ex-Trilogy colleague, has helped with the R for Google Analytics package. While Tatvic is into heavy-duty web analytics, they are betting big on R and using it for web analytics. David Smith, most excellent blogger-in-chief of the R universe, has blogged on them before here http://blog.revolutionanalytics.com/2013/02/analyze-web-traffic-data-with-google-analytics-and-r.html

Here is an upcoming seminar on R in Web Analytics.


From this webinar, you will get to know:

  • What is R and why should you use this tool? How to extract your Web Analytics data into R?
  • How to build a predictive model using web analytics data with the help of R?
  • How predictive modelling can take your analysis to the next level?
  • How to carry out insightful analysis through visualization?

Who should attend: Every web analyst who wants to take his analysis to the next level.

ps- Hat tip to Caroline  A
