Data Cleaning using R.
This tutorial is a part of the Data Science Tutorial series. In this tutorial, we will discuss " Cleaning The Data" which is the First step. So let's begin the show…
library(stringr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag(): dplyr, stats
Now , lets the Data set on which we are going to work.You’ll be importing and cleaning four real datasets that are a little messier.Your first dataset describes online ticket sales for various events across the country. It’s stored as a Comma-Separated Value (CSV) file called sales.csv.
PART I: IMPORTING AND VISULIZATION OF DATA
sales<-read.csv("sales.csv",stringsAsFactors=FALSE)
Default strecture of read.csv() function is as : read.csv(file, header = TRUE, sep = “,”, quote = “"”, dec = “.”, fill = TRUE, comment.char = “”, …)
We imported the data in working directry(check your working directery by getwd() function and set the working directory by setwd() function). Now, Lets examine the Data. Let’s refresh your memory on some useful functions that can do that: dim() returns the dimensions of an object head() displays the first part of an object *names() returns the names associated with an object
# View dimensions of sales
dim(sales)
## [1] 5000 46
# Inspect first 6 rows of sales
head(sales)
X | event_id | primary_act_id | secondary_act_id | ||
1 | 1 | abcaf1adb99a935fc661 | 43f0436b905bfa7c2eec | b85143bf51323b72e53c | |
2 | 2 | 6c56d7f08c95f2aa453c | 1a3e9aecd0617706a794 | f53529c5679ea6ca5a48 | |
3 | 3 | c7ab4524a121f9d687d2 | 4b677c3f5bec71eec8d1 | b85143bf51323b72e53c | |
4 | 4 | 394cb493f893be9b9ed1 | b1ccea01ad6ef8522796 | b85143bf51323b72e53c | |
5 | 5 | 55b5f67e618557929f48 | 91c03a34b562436efa3c | b85143bf51323b72e53c | |
6 | 6 | 4f10fd8b9f550352bd56 | ac4b847b3fde66f2117e | 63814f3d63317f1b56c4 |
6 rows | 1-5 of 47 columns
# View column names of sales
names(sales)
## [1] "X" "event_id"
## [3] "primary_act_id" "secondary_act_id"
## [5] "purch_party_lkup_id" "event_name"
## [7] "primary_act_name" "secondary_act_name"
## [9] "major_cat_name" "minor_cat_name"
## [11] "la_event_type_cat" "event_disp_name"
## [13] "ticket_text" "tickets_purchased_qty"
## [15] "trans_face_val_amt" "delivery_type_cd"
## [17] "event_date_time" "event_dt"
## [19] "presale_dt" "onsale_dt"
## [21] "sales_ord_create_dttm" "sales_ord_tran_dt"
## [23] "print_dt" "timezn_nm"
## [25] "venue_city" "venue_state"
## [27] "venue_postal_cd_sgmt_1" "sales_platform_cd"
## [29] "print_flg" "la_valid_tkt_event_flg"
## [31] "fin_mkt_nm" "web_session_cookie_val"
## [33] "gndr_cd" "age_yr"
## [35] "income_amt" "edu_val"
## [37] "edu_1st_indv_val" "edu_2nd_indv_val"
## [39] "adults_in_hh_num" "married_ind"
## [41] "child_present_ind" "home_owner_ind"
## [43] "occpn_val" "occpn_1st_val"
## [45] "occpn_2nd_val" "dist_to_ven"
Beside these three functions str() and summary() are be very useful. So , Lets use it and check how it works.
#Look at strecture of the sales
str(sales)
## 'data.frame': 5000 obs. of 46 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ event_id : chr "abcaf1adb99a935fc661" "6c56d7f08c95f2aa453c" "c7ab4524a121f9d687d2" "394cb493f893be9b9ed1" ...
## $ primary_act_id : chr "43f0436b905bfa7c2eec" "1a3e9aecd0617706a794" "4b677c3f5bec71eec8d1" "b1ccea01ad6ef8522796" ...
## $ secondary_act_id : chr "b85143bf51323b72e53c" "f53529c5679ea6ca5a48" "b85143bf51323b72e53c" "b85143bf51323b72e53c" ...
## $ purch_party_lkup_id : chr "7dfa56dd7d5956b17587" "4f9e6fc637eaf7b736c2" "6c2545703bd527a7144d" "527d6b1eaffc69ddd882" ...
## $ event_name : chr "Xfinity Center Mansfield Premier Parking: Florida Georgia Line" "Gorge Camping - dave matthews band - sept 3-7" "Dodge Theatre Adams Street Parking - benise" "Gexa Energy Pavilion Vip Parking : kid rock with sheryl crow" ...
## $ primary_act_name : chr "XFINITY Center Mansfield Premier Parking" "Gorge Camping" "Parking Event" "Gexa Energy Pavilion VIP Parking" ...
## $ secondary_act_name : chr "NULL" "Dave Matthews Band" "NULL" "NULL" ...
## $ major_cat_name : chr "MISC" "MISC" "MISC" "MISC" ...
## $ minor_cat_name : chr "PARKING" "CAMPING" "PARKING" "PARKING" ...
## $ la_event_type_cat : chr "PARKING" "INVALID" "PARKING" "PARKING" ...
## $ event_disp_name : chr "Xfinity Center Mansfield Premier Parking: Florida Georgia Line" "Gorge Camping - dave matthews band - sept 3-7" "Dodge Theatre Adams Street Parking - benise" "Gexa Energy Pavilion Vip Parking : kid rock with sheryl crow" ...
## $ ticket_text : chr " THIS TICKET IS VALID FOR PARKING ONLY GOOD THIS DAY ONLY PREMIER PARKING PASS XFINITY CENTER,LOTS 4 "| __truncated__ "%OVERNIGHT C A M P I N G%* * * * * *%GORGE CAMPGROUND%* GOOD THIS DATE ONLY *%SEP 3 - 6, 2009" "ADAMS STREET GARAGE%PARKING FOR 4/21/06 ONLY%DODGE THEATRE PARKING PASS%ENTRANCE ON ADAMS STREET%BENISE%GARAGE OPENS AT 6:00PM" " THIS TICKET IS VALID FOR PARKING ONLY GOOD FOR THIS DATE ONLY VIP PARKING PASS GEXA ENERGY PAVILION"| __truncated__ ...
## $ tickets_purchased_qty : int 1 1 1 1 1 2 1 1 1 1 ...
## $ trans_face_val_amt : num 45 75 5 20 20 10 30 28 20 25 ...
## $ delivery_type_cd : chr "eTicket" "TicketFast" "TicketFast" "Mail" ...
## $ event_date_time : chr "2015-09-12 23:30:00" "2009-09-05 01:00:00" "2006-04-22 01:30:00" "2011-09-03 00:00:00" ...
## $ event_dt : chr "2015-09-12" "2009-09-04" "2006-04-21" "2011-09-02" ...
## $ presale_dt : chr "NULL" "NULL" "NULL" "NULL" ...
## $ onsale_dt : chr "2015-05-15" "2009-03-13" "2006-02-25" "2011-04-22" ...
## $ sales_ord_create_dttm : chr "2015-09-11 18:17:45" "2009-07-06 00:00:00" "2006-04-05 00:00:00" "2011-07-01 17:38:50" ...
## $ sales_ord_tran_dt : chr "2015-09-11" "2009-07-05" "2006-04-05" "2011-07-01" ...
## $ print_dt : chr "2015-09-12" "2009-09-01" "2006-04-05" "2011-07-06" ...
## $ timezn_nm : chr "EST" "PST" "MST" "CST" ...
## $ venue_city : chr "MANSFIELD" "QUINCY" "PHOENIX" "DALLAS" ...
## $ venue_state : chr "MASSACHUSETTS" "WASHINGTON" "ARIZONA" "TEXAS" ...
## $ venue_postal_cd_sgmt_1: chr "02048" "98848" "85003" "75210" ...
## $ sales_platform_cd : chr "www.concerts.livenation.com" "NULL" "NULL" "NULL" ...
## $ print_flg : chr "T " "T " "T " "T " ...
## $ la_valid_tkt_event_flg: chr "N " "N " "N " "N " ...
## $ fin_mkt_nm : chr "Boston" "Seattle" "Arizona" "Dallas" ...
## $ web_session_cookie_val: chr "7dfa56dd7d5956b17587" "4f9e6fc637eaf7b736c2" "6c2545703bd527a7144d" "527d6b1eaffc69ddd882" ...
## $ gndr_cd : chr NA NA NA NA ...
## $ age_yr : chr NA NA NA NA ...
## $ income_amt : chr NA NA NA NA ...
## $ edu_val : chr NA NA NA NA ...
## $ edu_1st_indv_val : chr NA NA NA NA ...
## $ edu_2nd_indv_val : chr NA NA NA NA ...
## $ adults_in_hh_num : chr NA NA NA NA ...
## $ married_ind : chr NA NA NA NA ...
## $ child_present_ind : chr NA NA NA NA ...
## $ home_owner_ind : chr NA NA NA NA ...
## $ occpn_val : chr NA NA NA NA ...
## $ occpn_1st_val : chr NA NA NA NA ...
## $ occpn_2nd_val : chr NA NA NA NA ...
## $ dist_to_ven : int NA 59 NA NA NA NA NA NA NA NA ...
# View a summary of sales
summary(sales)
## X event_id primary_act_id secondary_act_id
## Min. : 1 Length:5000 Length:5000 Length:5000
## 1st Qu.:1251 Class :character Class :character Class :character
## Median :2500 Mode :character Mode :character Mode :character
## Mean :2500
## 3rd Qu.:3750
## Max. :5000
##
## purch_party_lkup_id event_name primary_act_name
## Length:5000 Length:5000 Length:5000
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## secondary_act_name major_cat_name minor_cat_name
## Length:5000 Length:5000 Length:5000
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## la_event_type_cat event_disp_name ticket_text
## Length:5000 Length:5000 Length:5000
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## tickets_purchased_qty trans_face_val_amt delivery_type_cd
## Min. :1.000 Min. : 1.00 Length:5000
## 1st Qu.:1.000 1st Qu.: 20.00 Class :character
## Median :1.000 Median : 30.00 Mode :character
## Mean :1.639 Mean : 77.08
## 3rd Qu.:2.000 3rd Qu.: 85.00
## Max. :8.000 Max. :1520.88
##
## event_date_time event_dt presale_dt
## Length:5000 Length:5000 Length:5000
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## onsale_dt sales_ord_create_dttm sales_ord_tran_dt
## Length:5000 Length:5000 Length:5000
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## print_dt timezn_nm venue_city
## Length:5000 Length:5000 Length:5000
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## venue_state venue_postal_cd_sgmt_1 sales_platform_cd
## Length:5000 Length:5000 Length:5000
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## print_flg la_valid_tkt_event_flg fin_mkt_nm
## Length:5000 Length:5000 Length:5000
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## web_session_cookie_val gndr_cd age_yr
## Length:5000 Length:5000 Length:5000
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## income_amt edu_val edu_1st_indv_val
## Length:5000 Length:5000 Length:5000
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## edu_2nd_indv_val adults_in_hh_num married_ind
## Length:5000 Length:5000 Length:5000
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## child_present_ind home_owner_ind occpn_val
## Length:5000 Length:5000 Length:5000
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## occpn_1st_val occpn_2nd_val dist_to_ven
## Length:5000 Length:5000 Min. : 0.0
## Class :character Class :character 1st Qu.: 12.0
## Mode :character Mode :character Median : 26.0
## Mean : 158.2
## 3rd Qu.: 77.5
## Max. :2548.0
## NA's :4677
If you noticced the result of str() , you can see that some of columns have the velue NA. This can be solve by replacing the approriate value. Note: you should not remove the NA colume because it have meaning in the data.(Removing colume which contain NA in some case which we talk late in this tutorial)
glimpse() function is the similar the str() function which you can use . This function is in dplyr package. we already loaded this package in the starting of the tutorial, but you must load this package before using glimpse() function.
#Get a glimpse of the sales
glimpse(sales)
## Observations: 5,000
## Variables: 46
## $ X <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, ...
## $ event_id <chr> "abcaf1adb99a935fc661", "6c56d7f08c95f2...
## $ primary_act_id <chr> "43f0436b905bfa7c2eec", "1a3e9aecd06177...
## $ secondary_act_id <chr> "b85143bf51323b72e53c", "f53529c5679ea6...
## $ purch_party_lkup_id <chr> "7dfa56dd7d5956b17587", "4f9e6fc637eaf7...
## $ event_name <chr> "Xfinity Center Mansfield Premier Parki...
## $ primary_act_name <chr> "XFINITY Center Mansfield Premier Parki...
## $ secondary_act_name <chr> "NULL", "Dave Matthews Band", "NULL", "...
## $ major_cat_name <chr> "MISC", "MISC", "MISC", "MISC", "MISC",...
## $ minor_cat_name <chr> "PARKING", "CAMPING", "PARKING", "PARKI...
## $ la_event_type_cat <chr> "PARKING", "INVALID", "PARKING", "PARKI...
## $ event_disp_name <chr> "Xfinity Center Mansfield Premier Parki...
## $ ticket_text <chr> " THIS TICKET IS VALID FOR PAR...
## $ tickets_purchased_qty <int> 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 4, ...
## $ trans_face_val_amt <dbl> 45, 75, 5, 20, 20, 10, 30, 28, 20, 25, ...
## $ delivery_type_cd <chr> "eTicket", "TicketFast", "TicketFast", ...
## $ event_date_time <chr> "2015-09-12 23:30:00", "2009-09-05 01:0...
## $ event_dt <chr> "2015-09-12", "2009-09-04", "2006-04-21...
## $ presale_dt <chr> "NULL", "NULL", "NULL", "NULL", "2005-0...
## $ onsale_dt <chr> "2015-05-15", "2009-03-13", "2006-02-25...
## $ sales_ord_create_dttm <chr> "2015-09-11 18:17:45", "2009-07-06 00:0...
## $ sales_ord_tran_dt <chr> "2015-09-11", "2009-07-05", "2006-04-05...
## $ print_dt <chr> "2015-09-12", "2009-09-01", "2006-04-05...
## $ timezn_nm <chr> "EST", "PST", "MST", "CST", "PST", "PST...
## $ venue_city <chr> "MANSFIELD", "QUINCY", "PHOENIX", "DALL...
## $ venue_state <chr> "MASSACHUSETTS", "WASHINGTON", "ARIZONA...
## $ venue_postal_cd_sgmt_1 <chr> "02048", "98848", "85003", "75210", "98...
## $ sales_platform_cd <chr> "www.concerts.livenation.com", "NULL", ...
## $ print_flg <chr> "T ", "T ", "T ", "T ", "T ", "T ", "T ...
## $ la_valid_tkt_event_flg <chr> "N ", "N ", "N ", "N ", "N ", "N ", "N ...
## $ fin_mkt_nm <chr> "Boston", "Seattle", "Arizona", "Dallas...
## $ web_session_cookie_val <chr> "7dfa56dd7d5956b17587", "4f9e6fc637eaf7...
## $ gndr_cd <chr> NA, NA, NA, NA, NA, NA, "M", NA, NA, NA...
## $ age_yr <chr> NA, NA, NA, NA, NA, NA, "28", NA, NA, N...
## $ income_amt <chr> NA, NA, NA, NA, NA, NA, "112500", NA, N...
## $ edu_val <chr> NA, NA, NA, NA, NA, NA, "High School", ...
## $ edu_1st_indv_val <chr> NA, NA, NA, NA, NA, NA, "High School", ...
## $ edu_2nd_indv_val <chr> NA, NA, NA, NA, NA, NA, "NULL", NA, NA,...
## $ adults_in_hh_num <chr> NA, NA, NA, NA, NA, NA, "4", NA, NA, NA...
## $ married_ind <chr> NA, NA, NA, NA, NA, NA, "0", NA, NA, NA...
## $ child_present_ind <chr> NA, NA, NA, NA, NA, NA, "1", NA, NA, NA...
## $ home_owner_ind <chr> NA, NA, NA, NA, NA, NA, "0", NA, NA, NA...
## $ occpn_val <chr> NA, NA, NA, NA, NA, NA, "NULL", NA, NA,...
## $ occpn_1st_val <chr> NA, NA, NA, NA, NA, NA, "Craftsman Blue...
## $ occpn_2nd_val <chr> NA, NA, NA, NA, NA, NA, "NULL", NA, NA,...
## $ dist_to_ven <int> NA, 59, NA, NA, NA, NA, NA, NA, NA, NA,...
There are function nrow() and ncol() which return the number of rows and columns in a data frame, respectively.
#Number of rows in sales
nrow(sales)
## [1] 5000
#number of colmes in sales
ncol(sales)
## [1] 46
#View the first 6 row
head(sales)
X | event_id | primary_act_id | secondary_act_id | ||
1 | 1 | abcaf1adb99a935fc661 | 43f0436b905bfa7c2eec | b85143bf51323b72e53c | |
2 | 2 | 6c56d7f08c95f2aa453c | 1a3e9aecd0617706a794 | f53529c5679ea6ca5a48 | |
3 | 3 | c7ab4524a121f9d687d2 | 4b677c3f5bec71eec8d1 | b85143bf51323b72e53c | |
4 | 4 | 394cb493f893be9b9ed1 | b1ccea01ad6ef8522796 | b85143bf51323b72e53c | |
5 | 5 | 55b5f67e618557929f48 | 91c03a34b562436efa3c | b85143bf51323b72e53c | |
6 | 6 | 4f10fd8b9f550352bd56 | ac4b847b3fde66f2117e | 63814f3d63317f1b56c4 |
6 rows | 1-5 of 47 columns
#view the firt 10 rows
head(sales,n=10)
X | event_id | primary_act_id | secondary_act_id | ||
1 | 1 | abcaf1adb99a935fc661 | 43f0436b905bfa7c2eec | b85143bf51323b72e53c | |
2 | 2 | 6c56d7f08c95f2aa453c | 1a3e9aecd0617706a794 | f53529c5679ea6ca5a48 | |
3 | 3 | c7ab4524a121f9d687d2 | 4b677c3f5bec71eec8d1 | b85143bf51323b72e53c | |
4 | 4 | 394cb493f893be9b9ed1 | b1ccea01ad6ef8522796 | b85143bf51323b72e53c | |
5 | 5 | 55b5f67e618557929f48 | 91c03a34b562436efa3c | b85143bf51323b72e53c | |
6 | 6 | 4f10fd8b9f550352bd56 | ac4b847b3fde66f2117e | 63814f3d63317f1b56c4 | |
7 | 7 | 105ea9ee10065b54b0df | a14232befff04be1e2f3 | b85143bf51323b72e53c | |
8 | 8 | 593b910d17913d7101cb | 91c03a34b562436efa3c | b85143bf51323b72e53c | |
9 | 9 | 0fb4e7152e09a40ffedd | 0efaba7ce3f0d7466b42 | f66a1cf484a8cd868fb4 | |
10 | 10 | 93fce8094d89e0a94aac | f6425a3223e73ea6de5a | d8c54467eb14c7dd1476 |
1-10 of 10 rows | 1-5 of 47 columns
# View the last 6 rows
tail(sales)
X | event_id | primary_act_id | secondary_act_id | ||
4995 | 4995 | 997c44d1def48153316d | 716df6554f0887b73f1a | b85143bf51323b72e53c | |
4996 | 4996 | 79372bc1e7e14dfd3f82 | a14232befff04be1e2f3 | b85143bf51323b72e53c | |
4997 | 4997 | f3b0de6a19e2a16e4fb1 | 83e5715f0ed3e13e6398 | b2b558b1fe3efad19edb | |
4998 | 4998 | 7c9e2f6847a2dbe4ebff | 9ea8ab40f58d18305ac8 | b85143bf51323b72e53c | |
4999 | 4999 | 7af1ccd9a3ec3b6cef1c | e274b5c11a2a63e00141 | b85143bf51323b72e53c | |
5000 | 5000 | 802f4bf8026e38c2acf2 | 24c326bd80bd2c8c9e17 | b22d584a2132b33e23e0 |
6 rows | 1-5 of 47 columns
#view the last 11 rows
tail(sales,n=11)
X | event_id | primary_act_id | secondary_act_id | ||
4990 | 4990 | 893e8bc79ada0d860ac9 | ebe30b90662305cd8f18 | 3f010d7b8687dbfc8409 | |
4991 | 4991 | 6789e447a44036d5628c | 4b677c3f5bec71eec8d1 | b85143bf51323b72e53c | |
4992 | 4992 | 09984e0d6ca9b1de13ed | da9ab9b6f5d99b8c3195 | b85143bf51323b72e53c | |
4993 | 4993 | 720f557e40ffdb2cfa16 | 91c03a34b562436efa3c | a75e1e1b72ac7f7ab1f6 | |
4994 | 4994 | dfbc00d338c1e2e99d4b | a0ed82451f57a8914eba | b85143bf51323b72e53c | |
4995 | 4995 | 997c44d1def48153316d | 716df6554f0887b73f1a | b85143bf51323b72e53c | |
4996 | 4996 | 79372bc1e7e14dfd3f82 | a14232befff04be1e2f3 | b85143bf51323b72e53c | |
4997 | 4997 | f3b0de6a19e2a16e4fb1 | 83e5715f0ed3e13e6398 | b2b558b1fe3efad19edb | |
4998 | 4998 | 7c9e2f6847a2dbe4ebff | 9ea8ab40f58d18305ac8 | b85143bf51323b72e53c | |
4999 | 4999 | 7af1ccd9a3ec3b6cef1c | e274b5c11a2a63e00141 | b85143bf51323b72e53c |
You can use square brackets to subset a data frame as follows: my_sales[1:6, ] # First 6 rows of my_sales my_sales[, 4] # Fourth column of my_sales
Alternatively, you can remove rows and columns using negative indices. For example: my_sales[-(1:6), ] # Omit first 6 rows of my_df my_sales[, -4] # Omit fourth column of my_df
Lets try on data seles .
# Remove the first column of sales: sales2
sales2<-sales[,-1]
PART II : DATA CLEANING(OPERATIONS ON DATA)
In this part of Tutorial , we are going to use functions of tidyr and stringr package .
- tidyr package :- There are two fundamental verbs of data tidying:
gather() takes multiple columns, and gathers them into key-value pairs: it makes “wide†data longer.
spread(). takes two columns (key & value) and spreads in to multiple columns, it makes “long†data wider. tidyr also provides separate() and extract() functions which makes it easier to pull apart a column that represents multiple variables. The complement to separate() is unite().
So Lets start with gather function.
#load the Iris Dataset
iris<-read.csv("https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/datasets/iris.csv")
head(iris)
X | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | |
1 | 1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
2 | 2 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
3 | 3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
4 | 4 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
5 | 5 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
6 | 6 | 5.4 | 3.9 | 1.7 | 0.4 | setosa |
6 rows
# get first observation for each Species in iris data -- base R
mini_iris <- iris[c(1, 51, 101), ]
# gather Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
gather(mini_iris, key = flower_att, value = measurement,
Sepal.Length, Sepal.Width, Petal.Length, Petal.Width)
X | Species | flower_att | measurement |
1 | setosa | Sepal.Length | 5.1 |
51 | versicolor | Sepal.Length | 7.0 |
101 | virginica | Sepal.Length | 6.3 |
1 | setosa | Sepal.Width | 3.5 |
51 | versicolor | Sepal.Width | 3.2 |
101 | virginica | Sepal.Width | 3.3 |
1 | setosa | Petal.Length | 1.4 |
51 | versicolor | Petal.Length | 4.7 |
101 | virginica | Petal.Length | 6.0 |
1 | setosa | Petal.Width | 0.2 |
# same result but less verbose
mini_iris_1<-gather(mini_iris, key = flower_att, value = measurement, -Species)
print(mini_iris_1)
## Species flower_att measurement
## 1 setosa X 1.0
## 2 versicolor X 51.0
## 3 virginica X 101.0
## 4 setosa Sepal.Length 5.1
## 5 versicolor Sepal.Length 7.0
## 6 virginica Sepal.Length 6.3
## 7 setosa Sepal.Width 3.5
## 8 versicolor Sepal.Width 3.2
## 9 virginica Sepal.Width 3.3
## 10 setosa Petal.Length 1.4
## 11 versicolor Petal.Length 4.7
## 12 virginica Petal.Length 6.0
## 13 setosa Petal.Width 0.2
## 14 versicolor Petal.Width 1.4
## 15 virginica Petal.Width 2.5
spread(mini_iris_1,key=flower_att,value=measurement)
Species | Petal.Length | Petal.Width | Sepal.Length | Sepal.Width | X | |
1 | setosa | 1.4 | 0.2 | 5.1 | 3.5 | 1 |
2 | versicolor | 4.7 | 1.4 | 7.0 | 3.2 | 51 |
3 | virginica | 6.0 | 2.5 | 6.3 | 3.3 | 101 |
3 rows
Let’s move on the another function seperate().Given either regular expression or a vector of character positions, separate() turns a single character column into multiple columns.
By Default: separate(data, col, into, sep = “[^[:alnum:]]+”, remove = TRUE, convert = FALSE, extra = “warn”, fill = “warn”, …) It used where data combined in one column ie. id trt key time 1 1 treatment work.T1 0.08514 2 2 control work.T1 0.22544 3 3 treatment work.T1 0.27453 4 4 control work.T1 0.27231 5 1 treatment home.T1 0.61583 6 2 control home.T1 0.42967 7 3 treatment home.T1 0.65166 8 4 control home.T1 0.56774
look at the key column where two column are combined work.T1 =work and T1.
df <- data.frame(x = c(NA, "a.b", "a.d", "b.c"))
df %>% separate(x, c("A", "B"))
df <- data.frame(x = c(11, 21, 13, 41))
df %>% separate(x, c("A", "B"),sep=1)
df1<-separate(df,x, c("A", "B"),sep=1)
Alternatively, Unite () combined multiple columns in one.C
Default ecpression :
unite(data, col, …, sep = “_“, remove = TRUE)
df1%>%unite(X,A,B,sep="")
%>% Operator Although not required, the tidyr and dplyr packages make use of the pipe operator %>% developed by Stefan Milton Bache in the R package magrittr. Although all the functions in tidyr and dplyr can be used without the pipe operator, one of the great conveniences these packages provide is the ability to string multiple functions together by incorporating %>%.
This operator will forward a value, or the result of an expression, into the next function call/expression. For instance a function to filter data can be written as:
filter(data, variable == numeric_value) or data %>% filter(variable == numeric_value)
Both functions complete the same task and the benefit of using %>% is not evident; however, when you desire to perform multiple functions its advantage becomes obvious. For instance, if we want to filter some data, summarize it, and then order the summarized results we would write it out as: ested Option:
arrange(
summarize(
filter(data, variable == numeric_value),
Total = sum(variable)
),
desc(Total)
)
Multiple Object Option:
a <- filter(data, variable == numeric_value)
b <- summarise(a, Total = sum(variable))
c <- arrange(b, desc(Total))
%>% Option:
data %>%
filter(variable == “valueâ€) %>%
summarise(Total = sum(variable)) %>%
arrange(desc(Total))
Our next function is extract(),wgich turns each group into a new column. If the groups don’t match, or the input is NA, the output will be NA.
Default expression: extract(data, col, into, regex = “([[:alnum:]]+)”, remove = TRUE, convert = FALSE, …)
df <- data.frame(x = c(NA, "a-b", "a-d", "b-c", "d-e"))
print(df)
## x
## 1 <NA>
## 2 a-b
## 3 a-d
## 4 b-c
## 5 d-e
df %>% extract(x, "A")
5 rows
dplyr Operations There are seven fundamental functions of data transformation:
1.select() selecting variables 2.filter() provides basic filtering capabilities 3.group_by() groups data by categorical levels 4.summarise() summarise data by functions of choice 5.arrange() ordering data 6.join() joining separate dataframes 7.mutate() create new variables