Data Cleaning using R.

This tutorial is a part of the Data Science Tutorial series. In this tutorial, we will discuss " Cleaning The Data" which is the First step. So let's begin the show…

library(stringr) 
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag():    dplyr, stats

Now , lets the Data set on which we are going to work.You’ll be importing and cleaning four real datasets that are a little messier.Your first dataset describes online ticket sales for various events across the country. It’s stored as a Comma-Separated Value (CSV) file called sales.csv.

PART I: IMPORTING AND VISULIZATION OF DATA

sales<-read.csv("sales.csv",stringsAsFactors=FALSE)

Default strecture of read.csv() function is as : read.csv(file, header = TRUE, sep = “,”, quote = “"”, dec = “.”, fill = TRUE, comment.char = “”, …)

We imported the data in working directry(check your working directery by getwd() function and set the working directory by setwd() function). Now, Lets examine the Data. Let’s refresh your memory on some useful functions that can do that: dim() returns the dimensions of an object head() displays the first part of an object *names() returns the names associated with an object

# View dimensions of sales

dim(sales)
## [1] 5000   46
# Inspect first 6 rows of sales
head(sales)
Xevent_idprimary_act_idsecondary_act_id
11abcaf1adb99a935fc66143f0436b905bfa7c2eecb85143bf51323b72e53c
226c56d7f08c95f2aa453c1a3e9aecd0617706a794f53529c5679ea6ca5a48
33c7ab4524a121f9d687d24b677c3f5bec71eec8d1b85143bf51323b72e53c
44394cb493f893be9b9ed1b1ccea01ad6ef8522796b85143bf51323b72e53c
5555b5f67e618557929f4891c03a34b562436efa3cb85143bf51323b72e53c
664f10fd8b9f550352bd56ac4b847b3fde66f2117e63814f3d63317f1b56c4

6 rows | 1-5 of 47 columns

# View column names of sales
names(sales)
##  [1] "X"                      "event_id"              
##  [3] "primary_act_id"         "secondary_act_id"      
##  [5] "purch_party_lkup_id"    "event_name"            
##  [7] "primary_act_name"       "secondary_act_name"    
##  [9] "major_cat_name"         "minor_cat_name"        
## [11] "la_event_type_cat"      "event_disp_name"       
## [13] "ticket_text"            "tickets_purchased_qty" 
## [15] "trans_face_val_amt"     "delivery_type_cd"      
## [17] "event_date_time"        "event_dt"              
## [19] "presale_dt"             "onsale_dt"             
## [21] "sales_ord_create_dttm"  "sales_ord_tran_dt"     
## [23] "print_dt"               "timezn_nm"             
## [25] "venue_city"             "venue_state"           
## [27] "venue_postal_cd_sgmt_1" "sales_platform_cd"     
## [29] "print_flg"              "la_valid_tkt_event_flg"
## [31] "fin_mkt_nm"             "web_session_cookie_val"
## [33] "gndr_cd"                "age_yr"                
## [35] "income_amt"             "edu_val"               
## [37] "edu_1st_indv_val"       "edu_2nd_indv_val"      
## [39] "adults_in_hh_num"       "married_ind"           
## [41] "child_present_ind"      "home_owner_ind"        
## [43] "occpn_val"              "occpn_1st_val"         
## [45] "occpn_2nd_val"          "dist_to_ven"

Beside these three functions str() and summary() are be very useful. So , Lets use it and check how it works.

#Look at strecture of the sales
str(sales)
## 'data.frame':    5000 obs. of  46 variables:
##  $ X                     : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ event_id              : chr  "abcaf1adb99a935fc661" "6c56d7f08c95f2aa453c" "c7ab4524a121f9d687d2" "394cb493f893be9b9ed1" ...
##  $ primary_act_id        : chr  "43f0436b905bfa7c2eec" "1a3e9aecd0617706a794" "4b677c3f5bec71eec8d1" "b1ccea01ad6ef8522796" ...
##  $ secondary_act_id      : chr  "b85143bf51323b72e53c" "f53529c5679ea6ca5a48" "b85143bf51323b72e53c" "b85143bf51323b72e53c" ...
##  $ purch_party_lkup_id   : chr  "7dfa56dd7d5956b17587" "4f9e6fc637eaf7b736c2" "6c2545703bd527a7144d" "527d6b1eaffc69ddd882" ...
##  $ event_name            : chr  "Xfinity Center Mansfield Premier Parking: Florida Georgia Line" "Gorge Camping - dave matthews band - sept 3-7" "Dodge Theatre Adams Street Parking - benise" "Gexa Energy Pavilion Vip Parking : kid rock with sheryl crow" ...
##  $ primary_act_name      : chr  "XFINITY Center Mansfield Premier Parking" "Gorge Camping" "Parking Event" "Gexa Energy Pavilion VIP Parking" ...
##  $ secondary_act_name    : chr  "NULL" "Dave Matthews Band" "NULL" "NULL" ...
##  $ major_cat_name        : chr  "MISC" "MISC" "MISC" "MISC" ...
##  $ minor_cat_name        : chr  "PARKING" "CAMPING" "PARKING" "PARKING" ...
##  $ la_event_type_cat     : chr  "PARKING" "INVALID" "PARKING" "PARKING" ...
##  $ event_disp_name       : chr  "Xfinity Center Mansfield Premier Parking: Florida Georgia Line" "Gorge Camping - dave matthews band - sept 3-7" "Dodge Theatre Adams Street Parking - benise" "Gexa Energy Pavilion Vip Parking : kid rock with sheryl crow" ...
##  $ ticket_text           : chr  "   THIS TICKET IS VALID        FOR PARKING ONLY         GOOD THIS DAY ONLY       PREMIER PARKING PASS    XFINITY CENTER,LOTS 4 "| __truncated__ "%OVERNIGHT C A M P I N G%* * * * * *%GORGE CAMPGROUND%* GOOD THIS DATE ONLY *%SEP 3 - 6, 2009" "ADAMS STREET GARAGE%PARKING FOR 4/21/06 ONLY%DODGE THEATRE PARKING PASS%ENTRANCE ON ADAMS STREET%BENISE%GARAGE OPENS AT 6:00PM" "   THIS TICKET IS VALID        FOR PARKING ONLY      GOOD FOR THIS DATE ONLY       VIP PARKING PASS        GEXA ENERGY PAVILION"| __truncated__ ...
##  $ tickets_purchased_qty : int  1 1 1 1 1 2 1 1 1 1 ...
##  $ trans_face_val_amt    : num  45 75 5 20 20 10 30 28 20 25 ...
##  $ delivery_type_cd      : chr  "eTicket" "TicketFast" "TicketFast" "Mail" ...
##  $ event_date_time       : chr  "2015-09-12 23:30:00" "2009-09-05 01:00:00" "2006-04-22 01:30:00" "2011-09-03 00:00:00" ...
##  $ event_dt              : chr  "2015-09-12" "2009-09-04" "2006-04-21" "2011-09-02" ...
##  $ presale_dt            : chr  "NULL" "NULL" "NULL" "NULL" ...
##  $ onsale_dt             : chr  "2015-05-15" "2009-03-13" "2006-02-25" "2011-04-22" ...
##  $ sales_ord_create_dttm : chr  "2015-09-11 18:17:45" "2009-07-06 00:00:00" "2006-04-05 00:00:00" "2011-07-01 17:38:50" ...
##  $ sales_ord_tran_dt     : chr  "2015-09-11" "2009-07-05" "2006-04-05" "2011-07-01" ...
##  $ print_dt              : chr  "2015-09-12" "2009-09-01" "2006-04-05" "2011-07-06" ...
##  $ timezn_nm             : chr  "EST" "PST" "MST" "CST" ...
##  $ venue_city            : chr  "MANSFIELD" "QUINCY" "PHOENIX" "DALLAS" ...
##  $ venue_state           : chr  "MASSACHUSETTS" "WASHINGTON" "ARIZONA" "TEXAS" ...
##  $ venue_postal_cd_sgmt_1: chr  "02048" "98848" "85003" "75210" ...
##  $ sales_platform_cd     : chr  "www.concerts.livenation.com" "NULL" "NULL" "NULL" ...
##  $ print_flg             : chr  "T " "T " "T " "T " ...
##  $ la_valid_tkt_event_flg: chr  "N " "N " "N " "N " ...
##  $ fin_mkt_nm            : chr  "Boston" "Seattle" "Arizona" "Dallas" ...
##  $ web_session_cookie_val: chr  "7dfa56dd7d5956b17587" "4f9e6fc637eaf7b736c2" "6c2545703bd527a7144d" "527d6b1eaffc69ddd882" ...
##  $ gndr_cd               : chr  NA NA NA NA ...
##  $ age_yr                : chr  NA NA NA NA ...
##  $ income_amt            : chr  NA NA NA NA ...
##  $ edu_val               : chr  NA NA NA NA ...
##  $ edu_1st_indv_val      : chr  NA NA NA NA ...
##  $ edu_2nd_indv_val      : chr  NA NA NA NA ...
##  $ adults_in_hh_num      : chr  NA NA NA NA ...
##  $ married_ind           : chr  NA NA NA NA ...
##  $ child_present_ind     : chr  NA NA NA NA ...
##  $ home_owner_ind        : chr  NA NA NA NA ...
##  $ occpn_val             : chr  NA NA NA NA ...
##  $ occpn_1st_val         : chr  NA NA NA NA ...
##  $ occpn_2nd_val         : chr  NA NA NA NA ...
##  $ dist_to_ven           : int  NA 59 NA NA NA NA NA NA NA NA ...
# View a summary of sales
summary(sales)
##        X          event_id         primary_act_id     secondary_act_id  
##  Min.   :   1   Length:5000        Length:5000        Length:5000       
##  1st Qu.:1251   Class :character   Class :character   Class :character  
##  Median :2500   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :2500                                                           
##  3rd Qu.:3750                                                           
##  Max.   :5000                                                           
##                                                                         
##  purch_party_lkup_id  event_name        primary_act_name  
##  Length:5000         Length:5000        Length:5000       
##  Class :character    Class :character   Class :character  
##  Mode  :character    Mode  :character   Mode  :character  
##                                                           
##                                                           
##                                                           
##                                                           
##  secondary_act_name major_cat_name     minor_cat_name    
##  Length:5000        Length:5000        Length:5000       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##  la_event_type_cat  event_disp_name    ticket_text       
##  Length:5000        Length:5000        Length:5000       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##  tickets_purchased_qty trans_face_val_amt delivery_type_cd  
##  Min.   :1.000         Min.   :   1.00    Length:5000       
##  1st Qu.:1.000         1st Qu.:  20.00    Class :character  
##  Median :1.000         Median :  30.00    Mode  :character  
##  Mean   :1.639         Mean   :  77.08                      
##  3rd Qu.:2.000         3rd Qu.:  85.00                      
##  Max.   :8.000         Max.   :1520.88                      
##                                                             
##  event_date_time      event_dt          presale_dt       
##  Length:5000        Length:5000        Length:5000       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##   onsale_dt         sales_ord_create_dttm sales_ord_tran_dt 
##  Length:5000        Length:5000           Length:5000       
##  Class :character   Class :character      Class :character  
##  Mode  :character   Mode  :character      Mode  :character  
##                                                             
##                                                             
##                                                             
##                                                             
##    print_dt          timezn_nm          venue_city       
##  Length:5000        Length:5000        Length:5000       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##  venue_state        venue_postal_cd_sgmt_1 sales_platform_cd 
##  Length:5000        Length:5000            Length:5000       
##  Class :character   Class :character       Class :character  
##  Mode  :character   Mode  :character       Mode  :character  
##                                                              
##                                                              
##                                                              
##                                                              
##   print_flg         la_valid_tkt_event_flg  fin_mkt_nm       
##  Length:5000        Length:5000            Length:5000       
##  Class :character   Class :character       Class :character  
##  Mode  :character   Mode  :character       Mode  :character  
##                                                              
##                                                              
##                                                              
##                                                              
##  web_session_cookie_val   gndr_cd             age_yr         
##  Length:5000            Length:5000        Length:5000       
##  Class :character       Class :character   Class :character  
##  Mode  :character       Mode  :character   Mode  :character  
##                                                              
##                                                              
##                                                              
##                                                              
##   income_amt          edu_val          edu_1st_indv_val  
##  Length:5000        Length:5000        Length:5000       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##  edu_2nd_indv_val   adults_in_hh_num   married_ind       
##  Length:5000        Length:5000        Length:5000       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##  child_present_ind  home_owner_ind      occpn_val        
##  Length:5000        Length:5000        Length:5000       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##  occpn_1st_val      occpn_2nd_val       dist_to_ven    
##  Length:5000        Length:5000        Min.   :   0.0  
##  Class :character   Class :character   1st Qu.:  12.0  
##  Mode  :character   Mode  :character   Median :  26.0  
##                                        Mean   : 158.2  
##                                        3rd Qu.:  77.5  
##                                        Max.   :2548.0  
##                                        NA's   :4677

If you noticced the result of str() , you can see that some of columns have the velue NA. This can be solve by replacing the approriate value. Note: you should not remove the NA colume because it have meaning in the data.(Removing colume which contain NA in some case which we talk late in this tutorial)

glimpse() function is the similar the str() function which you can use . This function is in dplyr package. we already loaded this package in the starting of the tutorial, but you must load this package before using glimpse() function.

#Get a glimpse of the sales
glimpse(sales)
## Observations: 5,000
## Variables: 46
## $ X                      <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, ...
## $ event_id               <chr> "abcaf1adb99a935fc661", "6c56d7f08c95f2...
## $ primary_act_id         <chr> "43f0436b905bfa7c2eec", "1a3e9aecd06177...
## $ secondary_act_id       <chr> "b85143bf51323b72e53c", "f53529c5679ea6...
## $ purch_party_lkup_id    <chr> "7dfa56dd7d5956b17587", "4f9e6fc637eaf7...
## $ event_name             <chr> "Xfinity Center Mansfield Premier Parki...
## $ primary_act_name       <chr> "XFINITY Center Mansfield Premier Parki...
## $ secondary_act_name     <chr> "NULL", "Dave Matthews Band", "NULL", "...
## $ major_cat_name         <chr> "MISC", "MISC", "MISC", "MISC", "MISC",...
## $ minor_cat_name         <chr> "PARKING", "CAMPING", "PARKING", "PARKI...
## $ la_event_type_cat      <chr> "PARKING", "INVALID", "PARKING", "PARKI...
## $ event_disp_name        <chr> "Xfinity Center Mansfield Premier Parki...
## $ ticket_text            <chr> "   THIS TICKET IS VALID        FOR PAR...
## $ tickets_purchased_qty  <int> 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 4, ...
## $ trans_face_val_amt     <dbl> 45, 75, 5, 20, 20, 10, 30, 28, 20, 25, ...
## $ delivery_type_cd       <chr> "eTicket", "TicketFast", "TicketFast", ...
## $ event_date_time        <chr> "2015-09-12 23:30:00", "2009-09-05 01:0...
## $ event_dt               <chr> "2015-09-12", "2009-09-04", "2006-04-21...
## $ presale_dt             <chr> "NULL", "NULL", "NULL", "NULL", "2005-0...
## $ onsale_dt              <chr> "2015-05-15", "2009-03-13", "2006-02-25...
## $ sales_ord_create_dttm  <chr> "2015-09-11 18:17:45", "2009-07-06 00:0...
## $ sales_ord_tran_dt      <chr> "2015-09-11", "2009-07-05", "2006-04-05...
## $ print_dt               <chr> "2015-09-12", "2009-09-01", "2006-04-05...
## $ timezn_nm              <chr> "EST", "PST", "MST", "CST", "PST", "PST...
## $ venue_city             <chr> "MANSFIELD", "QUINCY", "PHOENIX", "DALL...
## $ venue_state            <chr> "MASSACHUSETTS", "WASHINGTON", "ARIZONA...
## $ venue_postal_cd_sgmt_1 <chr> "02048", "98848", "85003", "75210", "98...
## $ sales_platform_cd      <chr> "www.concerts.livenation.com", "NULL", ...
## $ print_flg              <chr> "T ", "T ", "T ", "T ", "T ", "T ", "T ...
## $ la_valid_tkt_event_flg <chr> "N ", "N ", "N ", "N ", "N ", "N ", "N ...
## $ fin_mkt_nm             <chr> "Boston", "Seattle", "Arizona", "Dallas...
## $ web_session_cookie_val <chr> "7dfa56dd7d5956b17587", "4f9e6fc637eaf7...
## $ gndr_cd                <chr> NA, NA, NA, NA, NA, NA, "M", NA, NA, NA...
## $ age_yr                 <chr> NA, NA, NA, NA, NA, NA, "28", NA, NA, N...
## $ income_amt             <chr> NA, NA, NA, NA, NA, NA, "112500", NA, N...
## $ edu_val                <chr> NA, NA, NA, NA, NA, NA, "High School", ...
## $ edu_1st_indv_val       <chr> NA, NA, NA, NA, NA, NA, "High School", ...
## $ edu_2nd_indv_val       <chr> NA, NA, NA, NA, NA, NA, "NULL", NA, NA,...
## $ adults_in_hh_num       <chr> NA, NA, NA, NA, NA, NA, "4", NA, NA, NA...
## $ married_ind            <chr> NA, NA, NA, NA, NA, NA, "0", NA, NA, NA...
## $ child_present_ind      <chr> NA, NA, NA, NA, NA, NA, "1", NA, NA, NA...
## $ home_owner_ind         <chr> NA, NA, NA, NA, NA, NA, "0", NA, NA, NA...
## $ occpn_val              <chr> NA, NA, NA, NA, NA, NA, "NULL", NA, NA,...
## $ occpn_1st_val          <chr> NA, NA, NA, NA, NA, NA, "Craftsman Blue...
## $ occpn_2nd_val          <chr> NA, NA, NA, NA, NA, NA, "NULL", NA, NA,...
## $ dist_to_ven            <int> NA, 59, NA, NA, NA, NA, NA, NA, NA, NA,...

There are function nrow() and ncol() which return the number of rows and columns in a data frame, respectively.

#Number of rows in sales
nrow(sales)
## [1] 5000
#number of colmes in sales
ncol(sales)
## [1] 46
#View the first 6 row
head(sales)
Xevent_idprimary_act_idsecondary_act_id
11abcaf1adb99a935fc66143f0436b905bfa7c2eecb85143bf51323b72e53c
226c56d7f08c95f2aa453c1a3e9aecd0617706a794f53529c5679ea6ca5a48
33c7ab4524a121f9d687d24b677c3f5bec71eec8d1b85143bf51323b72e53c
44394cb493f893be9b9ed1b1ccea01ad6ef8522796b85143bf51323b72e53c
5555b5f67e618557929f4891c03a34b562436efa3cb85143bf51323b72e53c
664f10fd8b9f550352bd56ac4b847b3fde66f2117e63814f3d63317f1b56c4

6 rows | 1-5 of 47 columns

#view the firt 10 rows
head(sales,n=10)
Xevent_idprimary_act_idsecondary_act_id
11abcaf1adb99a935fc66143f0436b905bfa7c2eecb85143bf51323b72e53c
226c56d7f08c95f2aa453c1a3e9aecd0617706a794f53529c5679ea6ca5a48
33c7ab4524a121f9d687d24b677c3f5bec71eec8d1b85143bf51323b72e53c
44394cb493f893be9b9ed1b1ccea01ad6ef8522796b85143bf51323b72e53c
5555b5f67e618557929f4891c03a34b562436efa3cb85143bf51323b72e53c
664f10fd8b9f550352bd56ac4b847b3fde66f2117e63814f3d63317f1b56c4
77105ea9ee10065b54b0dfa14232befff04be1e2f3b85143bf51323b72e53c
88593b910d17913d7101cb91c03a34b562436efa3cb85143bf51323b72e53c
990fb4e7152e09a40ffedd0efaba7ce3f0d7466b42f66a1cf484a8cd868fb4
101093fce8094d89e0a94aacf6425a3223e73ea6de5ad8c54467eb14c7dd1476

1-10 of 10 rows | 1-5 of 47 columns

# View the last 6 rows
tail(sales)
Xevent_idprimary_act_idsecondary_act_id
49954995997c44d1def48153316d716df6554f0887b73f1ab85143bf51323b72e53c
4996499679372bc1e7e14dfd3f82a14232befff04be1e2f3b85143bf51323b72e53c
49974997f3b0de6a19e2a16e4fb183e5715f0ed3e13e6398b2b558b1fe3efad19edb
499849987c9e2f6847a2dbe4ebff9ea8ab40f58d18305ac8b85143bf51323b72e53c
499949997af1ccd9a3ec3b6cef1ce274b5c11a2a63e00141b85143bf51323b72e53c
50005000802f4bf8026e38c2acf224c326bd80bd2c8c9e17b22d584a2132b33e23e0

6 rows | 1-5 of 47 columns

#view the last 11 rows
tail(sales,n=11)
Xevent_idprimary_act_idsecondary_act_id
49904990893e8bc79ada0d860ac9ebe30b90662305cd8f183f010d7b8687dbfc8409
499149916789e447a44036d5628c4b677c3f5bec71eec8d1b85143bf51323b72e53c
4992499209984e0d6ca9b1de13edda9ab9b6f5d99b8c3195b85143bf51323b72e53c
49934993720f557e40ffdb2cfa1691c03a34b562436efa3ca75e1e1b72ac7f7ab1f6
49944994dfbc00d338c1e2e99d4ba0ed82451f57a8914ebab85143bf51323b72e53c
49954995997c44d1def48153316d716df6554f0887b73f1ab85143bf51323b72e53c
4996499679372bc1e7e14dfd3f82a14232befff04be1e2f3b85143bf51323b72e53c
49974997f3b0de6a19e2a16e4fb183e5715f0ed3e13e6398b2b558b1fe3efad19edb
499849987c9e2f6847a2dbe4ebff9ea8ab40f58d18305ac8b85143bf51323b72e53c
499949997af1ccd9a3ec3b6cef1ce274b5c11a2a63e00141b85143bf51323b72e53c

You can use square brackets to subset a data frame as follows: my_sales[1:6, ] # First 6 rows of my_sales my_sales[, 4] # Fourth column of my_sales

Alternatively, you can remove rows and columns using negative indices. For example: my_sales[-(1:6), ] # Omit first 6 rows of my_df my_sales[, -4] # Omit fourth column of my_df

Lets try on data seles .

# Remove the first column of sales: sales2
sales2<-sales[,-1]

PART II : DATA CLEANING(OPERATIONS ON DATA)

In this part of Tutorial , we are going to use functions of tidyr and stringr package .

  • tidyr package :- There are two fundamental verbs of data tidying:

gather() takes multiple columns, and gathers them into key-value pairs: it makes “wide” data longer.

spread(). takes two columns (key & value) and spreads in to multiple columns, it makes “long” data wider. tidyr also provides separate() and extract() functions which makes it easier to pull apart a column that represents multiple variables. The complement to separate() is unite().

So Lets start with gather function.

#load the Iris Dataset
iris<-read.csv("https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/datasets/iris.csv")
head(iris)
XSepal.LengthSepal.WidthPetal.LengthPetal.WidthSpecies
115.13.51.40.2setosa
224.93.01.40.2setosa
334.73.21.30.2setosa
444.63.11.50.2setosa
555.03.61.40.2setosa
665.43.91.70.4setosa

6 rows

# get first observation for each Species in iris data -- base R
mini_iris <- iris[c(1, 51, 101), ]
# gather Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
gather(mini_iris, key = flower_att, value = measurement,
       Sepal.Length, Sepal.Width, Petal.Length, Petal.Width)
XSpeciesflower_attmeasurement
1setosaSepal.Length5.1
51versicolorSepal.Length7.0
101virginicaSepal.Length6.3
1setosaSepal.Width3.5
51versicolorSepal.Width3.2
101virginicaSepal.Width3.3
1setosaPetal.Length1.4
51versicolorPetal.Length4.7
101virginicaPetal.Length6.0
1setosaPetal.Width0.2
# same result but less verbose
mini_iris_1<-gather(mini_iris, key = flower_att, value = measurement, -Species)
print(mini_iris_1)
##       Species   flower_att measurement
## 1      setosa            X         1.0
## 2  versicolor            X        51.0
## 3   virginica            X       101.0
## 4      setosa Sepal.Length         5.1
## 5  versicolor Sepal.Length         7.0
## 6   virginica Sepal.Length         6.3
## 7      setosa  Sepal.Width         3.5
## 8  versicolor  Sepal.Width         3.2
## 9   virginica  Sepal.Width         3.3
## 10     setosa Petal.Length         1.4
## 11 versicolor Petal.Length         4.7
## 12  virginica Petal.Length         6.0
## 13     setosa  Petal.Width         0.2
## 14 versicolor  Petal.Width         1.4
## 15  virginica  Petal.Width         2.5
spread(mini_iris_1,key=flower_att,value=measurement)
SpeciesPetal.LengthPetal.WidthSepal.LengthSepal.WidthX
1setosa1.40.25.13.51
2versicolor4.71.47.03.251
3virginica6.02.56.33.3101

3 rows

Let’s move on the another function seperate().Given either regular expression or a vector of character positions, separate() turns a single character column into multiple columns.

By Default: separate(data, col, into, sep = “[^[:alnum:]]+”, remove = TRUE, convert = FALSE, extra = “warn”, fill = “warn”, …) It used where data combined in one column ie. id trt key time 1 1 treatment work.T1 0.08514 2 2 control work.T1 0.22544 3 3 treatment work.T1 0.27453 4 4 control work.T1 0.27231 5 1 treatment home.T1 0.61583 6 2 control home.T1 0.42967 7 3 treatment home.T1 0.65166 8 4 control home.T1 0.56774

look at the key column where two column are combined work.T1 =work and T1.

df <- data.frame(x = c(NA, "a.b", "a.d", "b.c"))
df %>% separate(x, c("A", "B"))
df <- data.frame(x = c(11, 21, 13, 41))
df %>% separate(x, c("A", "B"),sep=1)
df1<-separate(df,x, c("A", "B"),sep=1)

Alternatively, Unite () combined multiple columns in one.C

Default ecpression :

unite(data, col, …, sep = “_“, remove = TRUE)

df1%>%unite(X,A,B,sep="")

%>% Operator Although not required, the tidyr and dplyr packages make use of the pipe operator %>% developed by Stefan Milton Bache in the R package magrittr. Although all the functions in tidyr and dplyr can be used without the pipe operator, one of the great conveniences these packages provide is the ability to string multiple functions together by incorporating %>%.

This operator will forward a value, or the result of an expression, into the next function call/expression. For instance a function to filter data can be written as:

filter(data, variable == numeric_value) or data %>% filter(variable == numeric_value)

Both functions complete the same task and the benefit of using %>% is not evident; however, when you desire to perform multiple functions its advantage becomes obvious. For instance, if we want to filter some data, summarize it, and then order the summarized results we would write it out as: ested Option:

arrange(
        summarize(
            filter(data, variable == numeric_value),
            Total = sum(variable)
        ),
    desc(Total)
)

Multiple Object Option:

 a <- filter(data, variable == numeric_value)
 b <- summarise(a, Total = sum(variable))
 c <- arrange(b, desc(Total))

%>% Option:

 data %>%
        filter(variable == “value”) %>%
        summarise(Total = sum(variable)) %>%
        arrange(desc(Total))

Our next function is extract(),wgich turns each group into a new column. If the groups don’t match, or the input is NA, the output will be NA.

Default expression: extract(data, col, into, regex = “([[:alnum:]]+)”, remove = TRUE, convert = FALSE, …)

df <- data.frame(x = c(NA, "a-b", "a-d", "b-c", "d-e"))
print(df)
##      x
## 1 <NA>
## 2  a-b
## 3  a-d
## 4  b-c
## 5  d-e
df %>% extract(x, "A")

5 rows

dplyr Operations There are seven fundamental functions of data transformation:

1.select() selecting variables 2.filter() provides basic filtering capabilities 3.group_by() groups data by categorical levels 4.summarise() summarise data by functions of choice 5.arrange() ordering data 6.join() joining separate dataframes 7.mutate() create new variables

Did you find this article valuable?

Support Dr. Himanshu Rai by becoming a sponsor. Any amount is appreciated!