Data Cleaning using R.

This tutorial is a part of the Data Science Tutorial series. In this tutorial, we will discuss " Cleaning The Data" which is the First step. So let's begin the show…

library(stringr) 
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyverse)

## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr

## Conflicts with tidy packages ----------------------------------------------

## filter(): dplyr, stats
## lag():    dplyr, stats

Now , lets the Data set on which we are going to work.You’ll be importing and cleaning four real datasets that are a little messier.Your first dataset describes online ticket sales for various events across the country. It’s stored as a Comma-Separated Value (CSV) file called sales.csv.

PART I: IMPORTING AND VISULIZATION OF DATA

sales<-read.csv("sales.csv",stringsAsFactors=FALSE)

Default strecture of read.csv() function is as : read.csv(file, header = TRUE, sep = “,”, quote = “"”, dec = “.”, fill = TRUE, comment.char = “”, …)

We imported the data in working directry(check your working directery by getwd() function and set the working directory by setwd() function). Now, Lets examine the Data. Let’s refresh your memory on some useful functions that can do that: dim() returns the dimensions of an object head() displays the first part of an object *names() returns the names associated with an object

# View dimensions of sales

dim(sales)

## [1] 5000   46

# Inspect first 6 rows of sales
head(sales)

	X	event_id	primary_act_id	secondary_act_id
1	1	abcaf1adb99a935fc661	43f0436b905bfa7c2eec	b85143bf51323b72e53c
2	2	6c56d7f08c95f2aa453c	1a3e9aecd0617706a794	f53529c5679ea6ca5a48
3	3	c7ab4524a121f9d687d2	4b677c3f5bec71eec8d1	b85143bf51323b72e53c
4	4	394cb493f893be9b9ed1	b1ccea01ad6ef8522796	b85143bf51323b72e53c
5	5	55b5f67e618557929f48	91c03a34b562436efa3c	b85143bf51323b72e53c
6	6	4f10fd8b9f550352bd56	ac4b847b3fde66f2117e	63814f3d63317f1b56c4

6 rows | 1-5 of 47 columns

# View column names of sales
names(sales)

##  [1] "X"                      "event_id"              
##  [3] "primary_act_id"         "secondary_act_id"      
##  [5] "purch_party_lkup_id"    "event_name"            
##  [7] "primary_act_name"       "secondary_act_name"    
##  [9] "major_cat_name"         "minor_cat_name"        
## [11] "la_event_type_cat"      "event_disp_name"       
## [13] "ticket_text"            "tickets_purchased_qty" 
## [15] "trans_face_val_amt"     "delivery_type_cd"      
## [17] "event_date_time"        "event_dt"              
## [19] "presale_dt"             "onsale_dt"             
## [21] "sales_ord_create_dttm"  "sales_ord_tran_dt"     
## [23] "print_dt"               "timezn_nm"             
## [25] "venue_city"             "venue_state"           
## [27] "venue_postal_cd_sgmt_1" "sales_platform_cd"     
## [29] "print_flg"              "la_valid_tkt_event_flg"
## [31] "fin_mkt_nm"             "web_session_cookie_val"
## [33] "gndr_cd"                "age_yr"                
## [35] "income_amt"             "edu_val"               
## [37] "edu_1st_indv_val"       "edu_2nd_indv_val"      
## [39] "adults_in_hh_num"       "married_ind"           
## [41] "child_present_ind"      "home_owner_ind"        
## [43] "occpn_val"              "occpn_1st_val"         
## [45] "occpn_2nd_val"          "dist_to_ven"

Beside these three functions str() and summary() are be very useful. So , Lets use it and check how it works.

#Look at strecture of the sales
str(sales)

## 'data.frame':    5000 obs. of  46 variables:
##  $ X                     : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ event_id              : chr  "abcaf1adb99a935fc661" "6c56d7f08c95f2aa453c" "c7ab4524a121f9d687d2" "394cb493f893be9b9ed1" ...
##  $ primary_act_id        : chr  "43f0436b905bfa7c2eec" "1a3e9aecd0617706a794" "4b677c3f5bec71eec8d1" "b1ccea01ad6ef8522796" ...
##  $ secondary_act_id      : chr  "b85143bf51323b72e53c" "f53529c5679ea6ca5a48" "b85143bf51323b72e53c" "b85143bf51323b72e53c" ...
##  $ purch_party_lkup_id   : chr  "7dfa56dd7d5956b17587" "4f9e6fc637eaf7b736c2" "6c2545703bd527a7144d" "527d6b1eaffc69ddd882" ...
##  $ event_name            : chr  "Xfinity Center Mansfield Premier Parking: Florida Georgia Line" "Gorge Camping - dave matthews band - sept 3-7" "Dodge Theatre Adams Street Parking - benise" "Gexa Energy Pavilion Vip Parking : kid rock with sheryl crow" ...
##  $ primary_act_name      : chr  "XFINITY Center Mansfield Premier Parking" "Gorge Camping" "Parking Event" "Gexa Energy Pavilion VIP Parking" ...
##  $ secondary_act_name    : chr  "NULL" "Dave Matthews Band" "NULL" "NULL" ...
##  $ major_cat_name        : chr  "MISC" "MISC" "MISC" "MISC" ...
##  $ minor_cat_name        : chr  "PARKING" "CAMPING" "PARKING" "PARKING" ...
##  $ la_event_type_cat     : chr  "PARKING" "INVALID" "PARKING" "PARKING" ...
##  $ event_disp_name       : chr  "Xfinity Center Mansfield Premier Parking: Florida Georgia Line" "Gorge Camping - dave matthews band - sept 3-7" "Dodge Theatre Adams Street Parking - benise" "Gexa Energy Pavilion Vip Parking : kid rock with sheryl crow" ...
##  $ ticket_text           : chr  "   THIS TICKET IS VALID        FOR PARKING ONLY         GOOD THIS DAY ONLY       PREMIER PARKING PASS    XFINITY CENTER,LOTS 4 "| __truncated__ "%OVERNIGHT C A M P I N G%* * * * * *%GORGE CAMPGROUND%* GOOD THIS DATE ONLY *%SEP 3 - 6, 2009" "ADAMS STREET GARAGE%PARKING FOR 4/21/06 ONLY%DODGE THEATRE PARKING PASS%ENTRANCE ON ADAMS STREET%BENISE%GARAGE OPENS AT 6:00PM" "   THIS TICKET IS VALID        FOR PARKING ONLY      GOOD FOR THIS DATE ONLY       VIP PARKING PASS        GEXA ENERGY PAVILION"| __truncated__ ...
##  $ tickets_purchased_qty : int  1 1 1 1 1 2 1 1 1 1 ...
##  $ trans_face_val_amt    : num  45 75 5 20 20 10 30 28 20 25 ...
##  $ delivery_type_cd      : chr  "eTicket" "TicketFast" "TicketFast" "Mail" ...
##  $ event_date_time       : chr  "2015-09-12 23:30:00" "2009-09-05 01:00:00" "2006-04-22 01:30:00" "2011-09-03 00:00:00" ...
##  $ event_dt              : chr  "2015-09-12" "2009-09-04" "2006-04-21" "2011-09-02" ...
##  $ presale_dt            : chr  "NULL" "NULL" "NULL" "NULL" ...
##  $ onsale_dt             : chr  "2015-05-15" "2009-03-13" "2006-02-25" "2011-04-22" ...
##  $ sales_ord_create_dttm : chr  "2015-09-11 18:17:45" "2009-07-06 00:00:00" "2006-04-05 00:00:00" "2011-07-01 17:38:50" ...
##  $ sales_ord_tran_dt     : chr  "2015-09-11" "2009-07-05" "2006-04-05" "2011-07-01" ...
##  $ print_dt              : chr  "2015-09-12" "2009-09-01" "2006-04-05" "2011-07-06" ...
##  $ timezn_nm             : chr  "EST" "PST" "MST" "CST" ...
##  $ venue_city            : chr  "MANSFIELD" "QUINCY" "PHOENIX" "DALLAS" ...
##  $ venue_state           : chr  "MASSACHUSETTS" "WASHINGTON" "ARIZONA" "TEXAS" ...
##  $ venue_postal_cd_sgmt_1: chr  "02048" "98848" "85003" "75210" ...
##  $ sales_platform_cd     : chr  "www.concerts.livenation.com" "NULL" "NULL" "NULL" ...
##  $ print_flg             : chr  "T " "T " "T " "T " ...
##  $ la_valid_tkt_event_flg: chr  "N " "N " "N " "N " ...
##  $ fin_mkt_nm            : chr  "Boston" "Seattle" "Arizona" "Dallas" ...
##  $ web_session_cookie_val: chr  "7dfa56dd7d5956b17587" "4f9e6fc637eaf7b736c2" "6c2545703bd527a7144d" "527d6b1eaffc69ddd882" ...
##  $ gndr_cd               : chr  NA NA NA NA ...
##  $ age_yr                : chr  NA NA NA NA ...
##  $ income_amt            : chr  NA NA NA NA ...
##  $ edu_val               : chr  NA NA NA NA ...
##  $ edu_1st_indv_val      : chr  NA NA NA NA ...
##  $ edu_2nd_indv_val      : chr  NA NA NA NA ...
##  $ adults_in_hh_num      : chr  NA NA NA NA ...
##  $ married_ind           : chr  NA NA NA NA ...
##  $ child_present_ind     : chr  NA NA NA NA ...
##  $ home_owner_ind        : chr  NA NA NA NA ...
##  $ occpn_val             : chr  NA NA NA NA ...
##  $ occpn_1st_val         : chr  NA NA NA NA ...
##  $ occpn_2nd_val         : chr  NA NA NA NA ...
##  $ dist_to_ven           : int  NA 59 NA NA NA NA NA NA NA NA ...

# View a summary of sales
summary(sales)

##        X          event_id         primary_act_id     secondary_act_id  
##  Min.   :   1   Length:5000        Length:5000        Length:5000       
##  1st Qu.:1251   Class :character   Class :character   Class :character  
##  Median :2500   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :2500                                                           
##  3rd Qu.:3750                                                           
##  Max.   :5000                                                           
##                                                                         
##  purch_party_lkup_id  event_name        primary_act_name  
##  Length:5000         Length:5000        Length:5000       
##  Class :character    Class :character   Class :character  
##  Mode  :character    Mode  :character   Mode  :character  
##                                                           
##                                                           
##                                                           
##                                                           
##  secondary_act_name major_cat_name     minor_cat_name    
##  Length:5000        Length:5000        Length:5000       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##  la_event_type_cat  event_disp_name    ticket_text       
##  Length:5000        Length:5000        Length:5000       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##  tickets_purchased_qty trans_face_val_amt delivery_type_cd  
##  Min.   :1.000         Min.   :   1.00    Length:5000       
##  1st Qu.:1.000         1st Qu.:  20.00    Class :character  
##  Median :1.000         Median :  30.00    Mode  :character  
##  Mean   :1.639         Mean   :  77.08                      
##  3rd Qu.:2.000         3rd Qu.:  85.00                      
##  Max.   :8.000         Max.   :1520.88                      
##                                                             
##  event_date_time      event_dt          presale_dt       
##  Length:5000        Length:5000        Length:5000       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##   onsale_dt         sales_ord_create_dttm sales_ord_tran_dt 
##  Length:5000        Length:5000           Length:5000       
##  Class :character   Class :character      Class :character  
##  Mode  :character   Mode  :character      Mode  :character  
##                                                             
##                                                             
##                                                             
##                                                             
##    print_dt          timezn_nm          venue_city       
##  Length:5000        Length:5000        Length:5000       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##  venue_state        venue_postal_cd_sgmt_1 sales_platform_cd 
##  Length:5000        Length:5000            Length:5000       
##  Class :character   Class :character       Class :character  
##  Mode  :character   Mode  :character       Mode  :character  
##                                                              
##                                                              
##                                                              
##                                                              
##   print_flg         la_valid_tkt_event_flg  fin_mkt_nm       
##  Length:5000        Length:5000            Length:5000       
##  Class :character   Class :character       Class :character  
##  Mode  :character   Mode  :character       Mode  :character  
##                                                              
##                                                              
##                                                              
##                                                              
##  web_session_cookie_val   gndr_cd             age_yr         
##  Length:5000            Length:5000        Length:5000       
##  Class :character       Class :character   Class :character  
##  Mode  :character       Mode  :character   Mode  :character  
##                                                              
##                                                              
##                                                              
##                                                              
##   income_amt          edu_val          edu_1st_indv_val  
##  Length:5000        Length:5000        Length:5000       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##  edu_2nd_indv_val   adults_in_hh_num   married_ind       
##  Length:5000        Length:5000        Length:5000       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##  child_present_ind  home_owner_ind      occpn_val        
##  Length:5000        Length:5000        Length:5000       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##  occpn_1st_val      occpn_2nd_val       dist_to_ven    
##  Length:5000        Length:5000        Min.   :   0.0  
##  Class :character   Class :character   1st Qu.:  12.0  
##  Mode  :character   Mode  :character   Median :  26.0  
##                                        Mean   : 158.2  
##                                        3rd Qu.:  77.5  
##                                        Max.   :2548.0  
##                                        NA's   :4677

If you noticced the result of str() , you can see that some of columns have the velue NA. This can be solve by replacing the approriate value. Note: you should not remove the NA colume because it have meaning in the data.(Removing colume which contain NA in some case which we talk late in this tutorial)

glimpse() function is the similar the str() function which you can use . This function is in dplyr package. we already loaded this package in the starting of the tutorial, but you must load this package before using glimpse() function.

#Get a glimpse of the sales
glimpse(sales)

## Observations: 5,000
## Variables: 46
## $ X                      <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, ...
## $ event_id               <chr> "abcaf1adb99a935fc661", "6c56d7f08c95f2...
## $ primary_act_id         <chr> "43f0436b905bfa7c2eec", "1a3e9aecd06177...
## $ secondary_act_id       <chr> "b85143bf51323b72e53c", "f53529c5679ea6...
## $ purch_party_lkup_id    <chr> "7dfa56dd7d5956b17587", "4f9e6fc637eaf7...
## $ event_name             <chr> "Xfinity Center Mansfield Premier Parki...
## $ primary_act_name       <chr> "XFINITY Center Mansfield Premier Parki...
## $ secondary_act_name     <chr> "NULL", "Dave Matthews Band", "NULL", "...
## $ major_cat_name         <chr> "MISC", "MISC", "MISC", "MISC", "MISC",...
## $ minor_cat_name         <chr> "PARKING", "CAMPING", "PARKING", "PARKI...
## $ la_event_type_cat      <chr> "PARKING", "INVALID", "PARKING", "PARKI...
## $ event_disp_name        <chr> "Xfinity Center Mansfield Premier Parki...
## $ ticket_text            <chr> "   THIS TICKET IS VALID        FOR PAR...
## $ tickets_purchased_qty  <int> 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 4, ...
## $ trans_face_val_amt     <dbl> 45, 75, 5, 20, 20, 10, 30, 28, 20, 25, ...
## $ delivery_type_cd       <chr> "eTicket", "TicketFast", "TicketFast", ...
## $ event_date_time        <chr> "2015-09-12 23:30:00", "2009-09-05 01:0...
## $ event_dt               <chr> "2015-09-12", "2009-09-04", "2006-04-21...
## $ presale_dt             <chr> "NULL", "NULL", "NULL", "NULL", "2005-0...
## $ onsale_dt              <chr> "2015-05-15", "2009-03-13", "2006-02-25...
## $ sales_ord_create_dttm  <chr> "2015-09-11 18:17:45", "2009-07-06 00:0...
## $ sales_ord_tran_dt      <chr> "2015-09-11", "2009-07-05", "2006-04-05...
## $ print_dt               <chr> "2015-09-12", "2009-09-01", "2006-04-05...
## $ timezn_nm              <chr> "EST", "PST", "MST", "CST", "PST", "PST...
## $ venue_city             <chr> "MANSFIELD", "QUINCY", "PHOENIX", "DALL...
## $ venue_state            <chr> "MASSACHUSETTS", "WASHINGTON", "ARIZONA...
## $ venue_postal_cd_sgmt_1 <chr> "02048", "98848", "85003", "75210", "98...
## $ sales_platform_cd      <chr> "www.concerts.livenation.com", "NULL", ...
## $ print_flg              <chr> "T ", "T ", "T ", "T ", "T ", "T ", "T ...
## $ la_valid_tkt_event_flg <chr> "N ", "N ", "N ", "N ", "N ", "N ", "N ...
## $ fin_mkt_nm             <chr> "Boston", "Seattle", "Arizona", "Dallas...
## $ web_session_cookie_val <chr> "7dfa56dd7d5956b17587", "4f9e6fc637eaf7...
## $ gndr_cd                <chr> NA, NA, NA, NA, NA, NA, "M", NA, NA, NA...
## $ age_yr                 <chr> NA, NA, NA, NA, NA, NA, "28", NA, NA, N...
## $ income_amt             <chr> NA, NA, NA, NA, NA, NA, "112500", NA, N...
## $ edu_val                <chr> NA, NA, NA, NA, NA, NA, "High School", ...
## $ edu_1st_indv_val       <chr> NA, NA, NA, NA, NA, NA, "High School", ...
## $ edu_2nd_indv_val       <chr> NA, NA, NA, NA, NA, NA, "NULL", NA, NA,...
## $ adults_in_hh_num       <chr> NA, NA, NA, NA, NA, NA, "4", NA, NA, NA...
## $ married_ind            <chr> NA, NA, NA, NA, NA, NA, "0", NA, NA, NA...
## $ child_present_ind      <chr> NA, NA, NA, NA, NA, NA, "1", NA, NA, NA...
## $ home_owner_ind         <chr> NA, NA, NA, NA, NA, NA, "0", NA, NA, NA...
## $ occpn_val              <chr> NA, NA, NA, NA, NA, NA, "NULL", NA, NA,...
## $ occpn_1st_val          <chr> NA, NA, NA, NA, NA, NA, "Craftsman Blue...
## $ occpn_2nd_val          <chr> NA, NA, NA, NA, NA, NA, "NULL", NA, NA,...
## $ dist_to_ven            <int> NA, 59, NA, NA, NA, NA, NA, NA, NA, NA,...

There are function nrow() and ncol() which return the number of rows and columns in a data frame, respectively.

#Number of rows in sales
nrow(sales)

## [1] 5000

#number of colmes in sales
ncol(sales)

## [1] 46

#View the first 6 row
head(sales)

	X	event_id	primary_act_id	secondary_act_id
1	1	abcaf1adb99a935fc661	43f0436b905bfa7c2eec	b85143bf51323b72e53c
2	2	6c56d7f08c95f2aa453c	1a3e9aecd0617706a794	f53529c5679ea6ca5a48
3	3	c7ab4524a121f9d687d2	4b677c3f5bec71eec8d1	b85143bf51323b72e53c
4	4	394cb493f893be9b9ed1	b1ccea01ad6ef8522796	b85143bf51323b72e53c
5	5	55b5f67e618557929f48	91c03a34b562436efa3c	b85143bf51323b72e53c
6	6	4f10fd8b9f550352bd56	ac4b847b3fde66f2117e	63814f3d63317f1b56c4

6 rows | 1-5 of 47 columns

#view the firt 10 rows
head(sales,n=10)

	X	event_id	primary_act_id	secondary_act_id
1	1	abcaf1adb99a935fc661	43f0436b905bfa7c2eec	b85143bf51323b72e53c
2	2	6c56d7f08c95f2aa453c	1a3e9aecd0617706a794	f53529c5679ea6ca5a48
3	3	c7ab4524a121f9d687d2	4b677c3f5bec71eec8d1	b85143bf51323b72e53c
4	4	394cb493f893be9b9ed1	b1ccea01ad6ef8522796	b85143bf51323b72e53c
5	5	55b5f67e618557929f48	91c03a34b562436efa3c	b85143bf51323b72e53c
6	6	4f10fd8b9f550352bd56	ac4b847b3fde66f2117e	63814f3d63317f1b56c4
7	7	105ea9ee10065b54b0df	a14232befff04be1e2f3	b85143bf51323b72e53c
8	8	593b910d17913d7101cb	91c03a34b562436efa3c	b85143bf51323b72e53c
9	9	0fb4e7152e09a40ffedd	0efaba7ce3f0d7466b42	f66a1cf484a8cd868fb4
10	10	93fce8094d89e0a94aac	f6425a3223e73ea6de5a	d8c54467eb14c7dd1476

1-10 of 10 rows | 1-5 of 47 columns

# View the last 6 rows
tail(sales)

	X	event_id	primary_act_id	secondary_act_id
4995	4995	997c44d1def48153316d	716df6554f0887b73f1a	b85143bf51323b72e53c
4996	4996	79372bc1e7e14dfd3f82	a14232befff04be1e2f3	b85143bf51323b72e53c
4997	4997	f3b0de6a19e2a16e4fb1	83e5715f0ed3e13e6398	b2b558b1fe3efad19edb
4998	4998	7c9e2f6847a2dbe4ebff	9ea8ab40f58d18305ac8	b85143bf51323b72e53c
4999	4999	7af1ccd9a3ec3b6cef1c	e274b5c11a2a63e00141	b85143bf51323b72e53c
5000	5000	802f4bf8026e38c2acf2	24c326bd80bd2c8c9e17	b22d584a2132b33e23e0

6 rows | 1-5 of 47 columns

#view the last 11 rows
tail(sales,n=11)

	X	event_id	primary_act_id	secondary_act_id
4990	4990	893e8bc79ada0d860ac9	ebe30b90662305cd8f18	3f010d7b8687dbfc8409
4991	4991	6789e447a44036d5628c	4b677c3f5bec71eec8d1	b85143bf51323b72e53c
4992	4992	09984e0d6ca9b1de13ed	da9ab9b6f5d99b8c3195	b85143bf51323b72e53c
4993	4993	720f557e40ffdb2cfa16	91c03a34b562436efa3c	a75e1e1b72ac7f7ab1f6
4994	4994	dfbc00d338c1e2e99d4b	a0ed82451f57a8914eba	b85143bf51323b72e53c
4995	4995	997c44d1def48153316d	716df6554f0887b73f1a	b85143bf51323b72e53c
4996	4996	79372bc1e7e14dfd3f82	a14232befff04be1e2f3	b85143bf51323b72e53c
4997	4997	f3b0de6a19e2a16e4fb1	83e5715f0ed3e13e6398	b2b558b1fe3efad19edb
4998	4998	7c9e2f6847a2dbe4ebff	9ea8ab40f58d18305ac8	b85143bf51323b72e53c
4999	4999	7af1ccd9a3ec3b6cef1c	e274b5c11a2a63e00141	b85143bf51323b72e53c

You can use square brackets to subset a data frame as follows: my_sales[1:6, ] # First 6 rows of my_sales my_sales[, 4] # Fourth column of my_sales

Alternatively, you can remove rows and columns using negative indices. For example: my_sales[-(1:6), ] # Omit first 6 rows of my_df my_sales[, -4] # Omit fourth column of my_df

Lets try on data seles .

# Remove the first column of sales: sales2
sales2<-sales[,-1]

PART II : DATA CLEANING(OPERATIONS ON DATA)

In this part of Tutorial , we are going to use functions of tidyr and stringr package .

tidyr package :- There are two fundamental verbs of data tidying:

gather() takes multiple columns, and gathers them into key-value pairs: it makes â€œwideâ€ data longer.

spread(). takes two columns (key & value) and spreads in to multiple columns, it makes â€œlongâ€ data wider. tidyr also provides separate() and extract() functions which makes it easier to pull apart a column that represents multiple variables. The complement to separate() is unite().

So Lets start with gather function.

#load the Iris Dataset
iris<-read.csv("https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/datasets/iris.csv")
head(iris)

	X	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
1	1	5.1	3.5	1.4	0.2	setosa
2	2	4.9	3.0	1.4	0.2	setosa
3	3	4.7	3.2	1.3	0.2	setosa
4	4	4.6	3.1	1.5	0.2	setosa
5	5	5.0	3.6	1.4	0.2	setosa
6	6	5.4	3.9	1.7	0.4	setosa

6 rows

# get first observation for each Species in iris data -- base R
mini_iris <- iris[c(1, 51, 101), ]
# gather Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
gather(mini_iris, key = flower_att, value = measurement,
       Sepal.Length, Sepal.Width, Petal.Length, Petal.Width)

X	Species	flower_att	measurement
1	setosa	Sepal.Length	5.1
51	versicolor	Sepal.Length	7.0
101	virginica	Sepal.Length	6.3
1	setosa	Sepal.Width	3.5
51	versicolor	Sepal.Width	3.2
101	virginica	Sepal.Width	3.3
1	setosa	Petal.Length	1.4
51	versicolor	Petal.Length	4.7
101	virginica	Petal.Length	6.0
1	setosa	Petal.Width	0.2

# same result but less verbose
mini_iris_1<-gather(mini_iris, key = flower_att, value = measurement, -Species)
print(mini_iris_1)

##       Species   flower_att measurement
## 1      setosa            X         1.0
## 2  versicolor            X        51.0
## 3   virginica            X       101.0
## 4      setosa Sepal.Length         5.1
## 5  versicolor Sepal.Length         7.0
## 6   virginica Sepal.Length         6.3
## 7      setosa  Sepal.Width         3.5
## 8  versicolor  Sepal.Width         3.2
## 9   virginica  Sepal.Width         3.3
## 10     setosa Petal.Length         1.4
## 11 versicolor Petal.Length         4.7
## 12  virginica Petal.Length         6.0
## 13     setosa  Petal.Width         0.2
## 14 versicolor  Petal.Width         1.4
## 15  virginica  Petal.Width         2.5

spread(mini_iris_1,key=flower_att,value=measurement)

	Species	Petal.Length	Petal.Width	Sepal.Length	Sepal.Width	X
1	setosa	1.4	0.2	5.1	3.5	1
2	versicolor	4.7	1.4	7.0	3.2	51
3	virginica	6.0	2.5	6.3	3.3	101

3 rows

Let’s move on the another function seperate().Given either regular expression or a vector of character positions, separate() turns a single character column into multiple columns.

By Default: separate(data, col, into, sep = “[^[:alnum:]]+”, remove = TRUE, convert = FALSE, extra = “warn”, fill = “warn”, …) It used where data combined in one column ie. id trt key time 1 1 treatment work.T1 0.08514 2 2 control work.T1 0.22544 3 3 treatment work.T1 0.27453 4 4 control work.T1 0.27231 5 1 treatment home.T1 0.61583 6 2 control home.T1 0.42967 7 3 treatment home.T1 0.65166 8 4 control home.T1 0.56774

look at the key column where two column are combined work.T1 =work and T1.

df <- data.frame(x = c(NA, "a.b", "a.d", "b.c"))
df %>% separate(x, c("A", "B"))

df <- data.frame(x = c(11, 21, 13, 41))
df %>% separate(x, c("A", "B"),sep=1)

df1<-separate(df,x, c("A", "B"),sep=1)

Alternatively, Unite () combined multiple columns in one.C

Default ecpression :

unite(data, col, …, sep = “_“, remove = TRUE)

df1%>%unite(X,A,B,sep="")

%>% Operator Although not required, the tidyr and dplyr packages make use of the pipe operator %>% developed by Stefan Milton Bache in the R package magrittr. Although all the functions in tidyr and dplyr can be used without the pipe operator, one of the great conveniences these packages provide is the ability to string multiple functions together by incorporating %>%.

This operator will forward a value, or the result of an expression, into the next function call/expression. For instance a function to filter data can be written as:

filter(data, variable == numeric_value) or data %>% filter(variable == numeric_value)

Both functions complete the same task and the benefit of using %>% is not evident; however, when you desire to perform multiple functions its advantage becomes obvious. For instance, if we want to filter some data, summarize it, and then order the summarized results we would write it out as: ested Option:

arrange(
        summarize(
            filter(data, variable == numeric_value),
            Total = sum(variable)
        ),
    desc(Total)
)

Multiple Object Option:

 a <- filter(data, variable == numeric_value)
 b <- summarise(a, Total = sum(variable))
 c <- arrange(b, desc(Total))

%>% Option:

 data %>%
        filter(variable == â€œvalueâ€) %>%
        summarise(Total = sum(variable)) %>%
        arrange(desc(Total))

Our next function is extract(),wgich turns each group into a new column. If the groups don’t match, or the input is NA, the output will be NA.

Default expression: extract(data, col, into, regex = “([[:alnum:]]+)”, remove = TRUE, convert = FALSE, …)

df <- data.frame(x = c(NA, "a-b", "a-d", "b-c", "d-e"))
print(df)

##      x
## 1 <NA>
## 2  a-b
## 3  a-d
## 4  b-c
## 5  d-e

df %>% extract(x, "A")

5 rows

dplyr Operations There are seven fundamental functions of data transformation:

1.select() selecting variables 2.filter() provides basic filtering capabilities 3.group_by() groups data by categorical levels 4.summarise() summarise data by functions of choice 5.arrange() ordering data 6.join() joining separate dataframes 7.mutate() create new variables

Data Cleaning using R.

Did you find this article valuable?