This is an R Markdown Notebook. When you execute code within the notebook, the results appear beneath the code.
R code goes in code chunks, denoted by three backticks. Execute a code chunk by clicking the Run button within the chunk or by placing your cursor inside it and pressing Ctrl+Shift+Enter (Windows) or Cmd+Shift+Enter (Mac).
Throughout the workshop, I will try to share as many best practices as possible. These will be denoted by an FYI box like this:
This is a best practice in R coding (or at least I think it is)
This is an exercise for you to complete
This is a hint for the exercise
The first chunk in an R Notebook is usually titled “setup,” and by convention includes the R packages you want to load. Remember, in order to use an R package you have to run some library() code every session. Execute these lines of code to load the packages.
To install a package, you have to type install.packages("packagename").
I’ve actually set the packages up in a way that if a package is not installed yet, if (!require(packagename)) (!require() is TRUE when the package cannot be loaded) will automatically install it with install.packages("packagename") and then load it with library(packagename).
Here is an example: if (!require(knitr)) install.packages("knitr"); library(knitr)
if (!require(knitr)) install.packages("knitr"); library(knitr)
## Loading required package: knitr
## Warning: package 'knitr' was built under R version 4.0.5
knitr::opts_chunk$set(comment = "#", warning = FALSE, message = FALSE)
if (!require(kableExtra)) install.packages("kableExtra"); library(kableExtra)
if (!require(tidyverse)) install.packages("tidyverse") ; library(tidyverse)
if (!require(skimr)) install.packages("skimr") ; library(skimr)
First we have to know how to load our data. Without this step, there is no data to even begin manipulating!
With the readr, haven, and readxl packages, we can load various types of data just by typing read_*(), where * can be csv, tsv, sav, sas, dta, or excel.
Here’s an example:
# Set directories
root_dir <- ".." # .. denotes the parent directory of 01-Data-Wrangling.Rmd
# You can set your root_dir to the directory of where RWorkshop is
# e.g. /User/YOURNAME/RWorkshop
data_dir <- file.path(root_dir, "data")
# Set files
sample_csvfile <- "sample_data1.csv"
# Load data
df_sfx <- read_csv(file.path(data_dir, sample_csvfile))
Notice that I separated and commented each part of the process to make it clear for the reader (and me!) to follow. Also, I use file.path() to build paths to subfolders within the project directory (I call it root_dir). This type of naming keeps things clear in more complex projects where you have several data sets in several folders, especially in longitudinal studies!
It is good practice to create an RProject for each project you are working on, so R will automatically start in the directory where the RProject is located when you open your project. Some common directories you may find are: data, output, R, plots, and report.
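For instance, a minimal sketch of building those directory paths with file.path() (these folder names are just examples, not requirements):

```r
# Hypothetical project layout (folder names are examples, not required)
root_dir <- "."  # the project root, i.e. where the .Rproj file lives

data_dir   <- file.path(root_dir, "data")
output_dir <- file.path(root_dir, "output")
plots_dir  <- file.path(root_dir, "plots")

file.path(data_dir, "sample_data1.csv")  # "./data/sample_data1.csv"
```

Because file.path() just glues path pieces together with the right separator, you can build a full path to any file without ever calling setwd().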
Now let’s try loading these few files into separate data frames:
Load the SPSS file with read_spss() and name it df_spss. Then load the Excel data dictionary with read_excel() and name it spss_datadict.
Hint 1: First, we have to load the haven and readxl packages, as they are not part of the core tidyverse collection. Use the library() function to load these packages.
Hint 2: Use file.path(filepath, filename) to get the full path to a file rather than changing your directory! Do it like this:
# Type your code after this line!
library(haven)
library(readxl)
spss_file <- "sample_data3.sav"
spss_datadictfile <- "sample_data3_datadictionary.xlsx"
df_spss <- read_spss(file.path(data_dir, spss_file))
spss_datadict <- read_excel(file.path(data_dir, spss_datadictfile))
The first thing we want to do is to inspect the data. We have two data sets and one data dictionary. Let’s look at what each dataset contains.
df_sfx
Did you notice that under every column (or variable) name, the variable type is written? For example, Dx is a character, and Age is a double. This looks like a data frame, but it is in tibble format.
Tibbles are data frames, but they tweak some older behaviors to make life a little easier. I’ll use the term data frame and tibble interchangeably here since we are working with data frames overall.
The main difference between a tibble and a data.frame is how it is printed. Tibbles show only the first 10 rows, while a data.frame shows everything! Hence, tibbles are designed so that you don’t accidentally overwhelm your console when you print large data frames.
Try typing df_sfx
and as.data.frame(df_sfx)
in the console and see the difference.
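If you want to try this without loading our data, here is a quick sketch using the built-in mtcars data.frame (this assumes the tibble package, which is installed as part of the tidyverse):

```r
# Tibbles print only the first 10 rows; data.frames print everything.
# mtcars is a built-in data.frame with 32 rows.
library(tibble)

tb <- as_tibble(mtcars)
print(tb)                 # shows 10 rows, plus each column's type
print(as.data.frame(tb))  # prints all 32 rows
```

The tibble printout also lists each column's type under its name, which is the same header you saw above for df_sfx.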
By typing your variable name in the console or editor, you print out the data as you imported it. This step helps you check whether you have imported your data correctly.
The next thing you might want to do is examine some descriptives, such as the quartiles, min/max values, median, mean, and SD. We can do this by running summary() or skim().
summary(df_sfx) # Gets the summary of each variable (Min/max values, quartiles, median, mean)
# ID Dx Age Sex
# Min. : 1.00 Length:200 Min. :48.00 Length:200
# 1st Qu.: 50.75 Class :character 1st Qu.:59.00 Class :character
# Median :100.50 Mode :character Median :65.00 Mode :character
# Mean :100.50 Mean :64.97
# 3rd Qu.:150.25 3rd Qu.:70.00
# Max. :200.00 Max. :84.00
# Education MMSE CDR COG1
# Min. : 9.00 Min. : 5.00 Min. :0.0000 Min. : 0.229
# 1st Qu.:15.00 1st Qu.:18.00 1st Qu.:0.5000 1st Qu.: 3.648
# Median :17.00 Median :22.00 Median :0.5000 Median : 6.757
# Mean :17.08 Mean :21.09 Mean :0.5675 Mean : 7.748
# 3rd Qu.:19.00 3rd Qu.:25.00 3rd Qu.:1.0000 3rd Qu.:11.749
# Max. :25.00 Max. :30.00 Max. :1.0000 Max. :23.777
# COG2 COG3 RS1 RS2
# Min. : 0.9432 Min. :-7.565 Min. :-0.00291 Min. :-0.09319
# 1st Qu.: 7.2045 1st Qu.: 3.774 1st Qu.: 0.07026 1st Qu.: 0.07501
# Median : 8.7569 Median : 7.634 Median : 0.10032 Median : 0.11069
# Mean : 8.9015 Mean : 7.796 Mean : 0.10243 Mean : 0.11643
# 3rd Qu.:10.6878 3rd Qu.:11.546 3rd Qu.: 0.13020 3rd Qu.: 0.15951
# Max. :14.7452 Max. :21.738 Max. : 0.23003 Max. : 0.32720
# RS3 RS4 q1 q2
# Min. :-0.06079 Min. :-0.01458 Min. :1.00 Min. :1.000
# 1st Qu.: 0.03723 1st Qu.: 0.04233 1st Qu.:2.00 1st Qu.:2.000
# Median : 0.07329 Median : 0.06891 Median :3.00 Median :3.000
# Mean : 0.07456 Mean : 0.06749 Mean :3.03 Mean :2.935
# 3rd Qu.: 0.11244 3rd Qu.: 0.09359 3rd Qu.:4.00 3rd Qu.:4.000
# Max. : 0.22846 Max. : 0.13337 Max. :5.00 Max. :5.000
# q3 q4 q5 q6
# Min. :1.00 Min. :1.000 Min. :1.00 Min. :1.000
# 1st Qu.:2.00 1st Qu.:2.000 1st Qu.:2.00 1st Qu.:2.000
# Median :3.00 Median :3.000 Median :3.00 Median :3.000
# Mean :3.04 Mean :2.855 Mean :2.97 Mean :2.985
# 3rd Qu.:4.00 3rd Qu.:4.000 3rd Qu.:4.00 3rd Qu.:4.000
# Max. :5.00 Max. :5.000 Max. :5.00 Max. :5.000
skim(df_sfx) # tells you similar things to summary(), but also gives you more!
Name | df_sfx |
Number of rows | 200 |
Number of columns | 20 |
_______________________ | |
Column type frequency: | |
character | 2 |
numeric | 18 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
Dx | 0 | 1 | 2 | 7 | 0 | 6 | 0 |
Sex | 0 | 1 | 4 | 6 | 0 | 2 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
ID | 0 | 1 | 100.50 | 57.88 | 1.00 | 50.75 | 100.50 | 150.25 | 200.00 | ▇▇▇▇▇ |
Age | 0 | 1 | 64.97 | 7.71 | 48.00 | 59.00 | 65.00 | 70.00 | 84.00 | ▃▅▇▅▂ |
Education | 0 | 1 | 17.08 | 2.97 | 9.00 | 15.00 | 17.00 | 19.00 | 25.00 | ▂▆▇▆▂ |
MMSE | 0 | 1 | 21.09 | 5.58 | 5.00 | 18.00 | 22.00 | 25.00 | 30.00 | ▂▂▆▇▅ |
CDR | 0 | 1 | 0.57 | 0.37 | 0.00 | 0.50 | 0.50 | 1.00 | 1.00 | ▃▁▇▁▆ |
COG1 | 0 | 1 | 7.75 | 5.14 | 0.23 | 3.65 | 6.76 | 11.75 | 23.78 | ▇▇▅▂▁ |
COG2 | 0 | 1 | 8.90 | 2.63 | 0.94 | 7.20 | 8.76 | 10.69 | 14.75 | ▁▂▇▇▂ |
COG3 | 0 | 1 | 7.80 | 4.84 | -7.57 | 3.77 | 7.63 | 11.55 | 21.74 | ▁▅▇▆▁ |
RS1 | 0 | 1 | 0.10 | 0.04 | 0.00 | 0.07 | 0.10 | 0.13 | 0.23 | ▂▇▇▃▁ |
RS2 | 0 | 1 | 0.12 | 0.06 | -0.09 | 0.08 | 0.11 | 0.16 | 0.33 | ▁▃▇▃▁ |
RS3 | 0 | 1 | 0.07 | 0.05 | -0.06 | 0.04 | 0.07 | 0.11 | 0.23 | ▂▅▇▅▁ |
RS4 | 0 | 1 | 0.07 | 0.03 | -0.01 | 0.04 | 0.07 | 0.09 | 0.13 | ▂▆▆▇▃ |
q1 | 0 | 1 | 3.03 | 1.37 | 1.00 | 2.00 | 3.00 | 4.00 | 5.00 | ▆▇▇▇▆ |
q2 | 0 | 1 | 2.94 | 1.46 | 1.00 | 2.00 | 3.00 | 4.00 | 5.00 | ▇▇▇▆▇ |
q3 | 0 | 1 | 3.04 | 1.35 | 1.00 | 2.00 | 3.00 | 4.00 | 5.00 | ▆▅▇▆▅ |
q4 | 0 | 1 | 2.86 | 1.35 | 1.00 | 2.00 | 3.00 | 4.00 | 5.00 | ▆▇▅▆▅ |
q5 | 0 | 1 | 2.97 | 1.42 | 1.00 | 2.00 | 3.00 | 4.00 | 5.00 | ▇▆▇▇▇ |
q6 | 0 | 1 | 2.98 | 1.47 | 1.00 | 2.00 | 3.00 | 4.00 | 5.00 | ▇▇▆▆▇ |
Did you see the difference between summary() and skim()? skim() also provides a small histogram in the last column, which is helpful if you want to know the distribution of a variable. Furthermore, it highlights negative values in red!
skim() is a more powerful tool than summary() for basic data descriptives. I would recommend using it whenever you load a dataset, to check whether your data has been loaded correctly.
First see what df_spss looks like, and then skim df_spss. What do you notice?
# Type your code after this line!
df_spss
skim(df_spss)
Name | df_spss |
Number of rows | 325 |
Number of columns | 7 |
_______________________ | |
Column type frequency: | |
character | 5 |
numeric | 2 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
agegp2 | 2 | 0.99 | 1 | 1 | 0 | 2 | 0 |
educ | 2 | 0.99 | 1 | 1 | 0 | 5 | 0 |
EDUCGP2 | 2 | 0.99 | 1 | 1 | 0 | 2 | 0 |
DASSdepgp2 | 2 | 0.99 | 1 | 1 | 0 | 2 | 0 |
EPDSgp2 | 5 | 0.98 | 1 | 1 | 0 | 2 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
ID | 0 | 1.00 | 163.00 | 93.96 | 1 | 82 | 163 | 244 | 325 | ▇▇▇▇▇ |
age | 2 | 0.99 | 32.14 | 4.60 | 18 | 29 | 32 | 35 | 44 | ▁▃▇▇▁ |
Did you notice that there are some missing values (NA)? Also, most of our variables are tagged <S3: haven_labelled>. This means that we also imported the labels from SPSS; these are in fact categorical variables rather than numerical variables.
This happens when you import SPSS files, so check them before you run further analyses! SPSS stores categories as numbers, so this is the type of data SPSS gives you. Hence, we use skim() to check our data first.
From skimming, we see 5 character columns and 2 numeric columns. We also know that agegp2, educ, EDUCGP2, DASSdepgp2, and EPDSgp2 are characters, and ID and age are numeric.
So, how do we know what each number represents? We can see this by typing df_spss$agegp2. We use $ after the data frame name to extract a column of the data frame.
df_spss$agegp2
# <labelled<double>[325]>: Under & over 35s
# [1] 1 1 1 1 1 1 2 2 1 1 1 2 1 NA 2 2 1 2 2 1 1 1 1 1 2
# [26] 1 1 1 1 1 1 1 1 2 1 1 1 2 2 2 1 1 1 2 1 1 2 1 1 2
# [51] 1 1 1 2 1 1 1 1 2 1 1 1 1 2 2 1 2 2 1 1 1 1 1 1 2
# [76] 1 1 2 1 1 1 2 2 1 2 2 1 1 1 2 1 1 1 1 2 1 1 1 1 1
# [101] 2 1 1 1 2 1 2 2 1 2 2 1 1 1 2 1 1 1 1 2 1 2 1 1 1
# [126] 2 1 2 1 2 2 2 1 1 2 1 1 2 1 1 1 2 2 2 2 1 1 1 1 2
# [151] 2 1 1 2 1 2 2 2 1 1 1 1 1 2 1 1 1 1 2 1 2 1 1 1 2
# [176] 2 1 1 1 1 2 1 2 2 1 1 2 1 1 2 1 2 2 2 1 1 1 2 1 1
# [201] 1 1 2 1 1 1 2 1 1 1 2 2 1 2 1 2 1 1 1 1 1 2 2 1 1
# [226] 1 1 2 1 2 2 2 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1
# [251] 2 1 2 2 1 2 NA 1 2 2 2 1 2 1 2 1 1 1 1 1 1 2 1 1 1
# [276] 1 2 1 2 1 1 1 1 1 1 1 2 1 1 1 1 1 2 1 2 1 2 1 1 1
# [301] 2 1 1 1 1 2 1 1 1 1 1 2 2 2 2 1 1 1 1 2 2 1 2 1 2
#
# Labels:
# value label
# 1 Under 35s
# 2 35 and over
Look at the other variables: educ, EDUCGP2, DASSdepgp2, and EPDSgp2
# Type your code after this line!
df_spss$educ
# <labelled<double>[325]>
# [1] 6 NA 6 5 5 6 5 6 4 4 4 3 5 5 6 5 2 5 5 6 6 5 6 6 6
# [26] 5 5 3 5 6 6 4 3 6 5 6 5 6 6 6 6 5 6 2 5 3 6 5 3 3
# [51] 5 6 4 3 6 6 6 6 6 2 3 4 6 6 6 6 6 6 3 4 3 6 5 5 4
# [76] 3 5 6 5 6 5 3 6 5 5 4 4 5 4 4 3 6 6 6 6 4 6 4 6 5
# [101] 4 4 2 6 6 4 6 6 4 5 6 4 3 6 4 6 5 4 6 4 5 3 5 6 6
# [126] 6 4 5 5 3 6 6 5 5 6 5 5 5 6 3 4 5 3 3 5 6 6 6 5 6
# [151] 6 2 6 3 5 4 5 4 4 4 6 4 5 4 5 4 6 5 6 6 4 5 6 5 3
# [176] 2 5 3 6 2 6 4 4 6 3 4 4 4 5 3 4 6 6 6 6 6 5 6 5 4
# [201] 6 6 6 5 5 3 4 5 5 5 5 5 4 6 6 5 5 2 5 5 5 5 6 6 6
# [226] 4 4 4 3 5 3 5 4 5 5 3 3 5 4 5 5 3 2 5 6 5 6 2 5 5
# [251] 6 5 4 4 6 5 NA 5 6 5 6 5 6 5 5 6 5 6 6 6 4 5 6 3 5
# [276] 5 5 6 6 4 5 4 4 3 5 5 5 5 5 5 4 4 6 4 6 6 5 3 5 5
# [301] 6 3 6 5 3 4 6 5 6 5 5 5 6 6 6 4 5 5 5 4 6 6 5 6 6
#
# Labels:
# value label
# 1 primary school
# 2 some secondary
# 3 completed secondary
# 4 additional training
# 5 udergrad uni
# 6 postgrad uni
df_spss$EDUCGP2
# <labelled<double>[325]>: Education groups (2) Non-tertiary & tertiary
# [1] 2 NA 2 2 2 2 2 2 1 1 1 1 2 2 2 2 1 2 2 2 2 2 2 2 2
# [26] 2 2 1 2 2 2 1 1 2 2 2 2 2 2 2 2 2 2 1 2 1 2 2 1 1
# [51] 2 2 1 1 2 2 2 2 2 1 1 1 2 2 2 2 2 2 1 1 1 2 2 2 1
# [76] 1 2 2 2 2 2 1 2 2 2 1 1 2 1 1 1 2 2 2 2 1 2 1 2 2
# [101] 1 1 1 2 2 1 2 2 1 2 2 1 1 2 1 2 2 1 2 1 2 1 2 2 2
# [126] 2 1 2 2 1 2 2 2 2 2 2 2 2 2 1 1 2 1 1 2 2 2 2 2 2
# [151] 2 1 2 1 2 1 2 1 1 1 2 1 2 1 2 1 2 2 2 2 1 2 2 2 1
# [176] 1 2 1 2 1 2 1 1 2 1 1 1 1 2 1 1 2 2 2 2 2 2 2 2 1
# [201] 2 2 2 2 2 1 1 2 2 2 2 2 1 2 2 2 2 1 2 2 2 2 2 2 2
# [226] 1 1 1 1 2 1 2 1 2 2 1 1 2 1 2 2 1 1 2 2 2 2 1 2 2
# [251] 2 2 1 1 2 2 NA 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 1 2
# [276] 2 2 2 2 1 2 1 1 1 2 2 2 2 2 2 1 1 2 1 2 2 2 1 2 2
# [301] 2 1 2 2 1 1 2 2 2 2 2 2 2 2 2 1 2 2 2 1 2 2 2 2 2
#
# Labels:
# value label
# 1 Non-tertiary
# 2 Tertiary
df_spss$DASSdepgp2
# <labelled<double>[325]>: DASS depress gps
# [1] 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
# [26] 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 1 0 0 1 0
# [51] 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# [76] 1 0 1 1 0 1 1 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0
# [101] 0 0 1 1 0 0 1 0 0 0 0 0 1 0 NA 0 0 1 0 1 0 0 0 0 0
# [126] 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
# [151] 0 0 1 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0
# [176] 0 0 1 0 0 1 0 0 0 0 0 1 1 0 1 0 0 0 0 0 0 0 1 0 0
# [201] 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0
# [226] 0 1 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
# [251] 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 NA 0 0
# [276] 1 0 1 0 0 0 1 1 0 1 0 1 1 0 0 0 0 0 0 0 0 1 0 0 0
# [301] 0 0 0 0 0 1 1 0 0 0 1 1 0 1 0 1 0 0 0 1 0 0 0 0 0
#
# Labels:
# value label
# 0 not depressed
# 1 mild to severe depression
df_spss$EPDSgp2
# <labelled<double>[325]>: EPDS gps
# [1] 0 1 0 0 0 0 0 0 1 1 0 0 0 0 1 1 0 0 0 1 1 0 0 0 0
# [26] 0 0 0 0 NA 0 0 0 1 0 0 1 0 0 0 0 0 0 0 1 1 0 1 1 0
# [51] 0 0 0 0 0 1 0 0 0 1 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0
# [76] 1 0 0 1 0 1 1 0 1 0 0 1 1 0 0 1 0 1 0 0 0 0 0 0 0
# [101] 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 0
# [126] 0 0 NA 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
# [151] 0 0 0 1 0 0 0 1 0 0 0 0 0 1 0 1 0 0 0 1 0 0 0 1 0
# [176] 0 0 1 0 0 1 1 0 0 0 0 1 1 0 1 0 0 0 0 0 0 0 1 1 0
# [201] 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 1 0 0 0 NA 0 0 0 1 0
# [226] 0 1 0 0 1 0 0 0 1 0 1 0 0 1 1 0 0 1 0 0 0 0 0 1 0
# [251] 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 NA 0 0
# [276] 0 0 1 0 0 0 1 NA 0 1 0 1 1 0 1 1 0 0 0 1 0 1 0 1 0
# [301] 0 0 0 0 0 0 0 0 1 0 1 1 0 0 0 1 0 0 0 1 0 1 0 0 1
#
# Labels:
# value label
# 0 not depressed
# 1 likely depressed
We can extract the labels from the dataset, but this takes a long time if we have a lot of variables. Hence, it is important to keep a data dictionary for our project. We have already imported one as spss_datadict.
We can apply as_factor() to the entire SPSS dataset to convert all the labelled numerics into factors. This will make the future steps of the analysis easier.
df_spss_converted <- as_factor(df_spss)
df_spss
df_spss_converted
There are five key functions in the tidyverse that allow you to solve the vast majority of your data-manipulation challenges. These functions come from the dplyr package: filter(), select(), arrange(), mutate(), and summarize().
Sometimes we want to select certain variables to work with (your data may be huge). In that case, we can use select() to pick those variables out.
df_sfx %>% select(ID)
We use %>% to “pipe” a value. The %>% comes from the magrittr package in the tidyverse. This operator forwards a value, or the result of an expression, into the next function call/expression. In the above example, we take the df_sfx object and forward it to the next step, which is to select the ID variable using select(ID).
Running analysis without %>%
is like going to New York for a foodie hunt without trying New York’s Pizza.
Use CTRL + SHIFT + M (WINDOWS) or CMD + SHIFT + M (MAC) to pipe!
Use the %>%
function and select()
to select these variables from df_sfx
: Dx, Age, Sex
Hint: Separate the variables in select()
with a comma!
# Type your code after this line!
df_sfx %>% select(Dx, Age, Sex)
Sometimes, you don’t want certain variables in your new data. For example, we do not want q1 to q6 in our new dataset. Hence we can use select(-variable) to drop those variables from the data frame.
df_sfx %>% select(-q1, -q2, -q3, -q4, -q5, -q6)
Okay, let’s look above: it is too long to type all the variables into the function to drop them. In this case, we can put these variable names in a vector and save them as an object for later use.
q_vars <- c("q1", "q2", "q3", "q4", "q5", "q6")
df_sfx %>% select(-all_of(q_vars)) # all_of() tells select() that q_vars is an external vector of column names
See that df_sfx %>% select(-all_of(q_vars)) gives you the same outcome as df_sfx %>% select(-q1, -q2, -q3, -q4, -q5, -q6).
We want to remove the RS1, RS2, RS3, RS4 variables this time.
Hint: Parse the variables you want to remove into an object called RS_networks
.
# Type your code after this line!
RS_networks <- c("RS1", "RS2", "RS3", "RS4")
df_sfx %>% select(-all_of(RS_networks))
We use filter()
to remove or select rows depending on their values.
df_sfx %>% filter(Sex == "Female")
Comparisons: we can use >, >=, <, <=, != (not equal), and == (equal) to filter our values.
Logical operators: we can also use & (and), | (or), and ! (not) to combine conditions. (Note that && and || operate only on single values, so use & and | inside filter().)
Select and filter:
Use SHIFT + ALT + UP/DOWN (windows) or COMMAND + OPTION + UP/DOWN (Mac) to copy a line above/below
Hint: For no. 2, you need to combine two conditions: > and &.
Hint: For no. 3, you need to combine three conditions: >, &, and |.
Hint: For characters such as CONTROL, you will need to use "CONTROL" (in quotes) instead of CONTROL.
# Type your code after this line!
df_sfx %>% filter(MMSE > 23)
df_sfx %>% filter(MMSE > 16 & CDR == 0)
df_sfx %>% filter(MMSE > 16 & CDR == 0 & (Dx == "CONTROL" | Dx == "bvFTD" | Dx == "nfvPPA"))
Typing Dx == "CONTROL" | Dx == "bvFTD" | Dx == "nfvPPA" takes a very long time. What if your variable has 10 categories (Dx has 6 here), and you want 7 of them? This method is not best practice.
What we can do is use %in% to retrieve the categories that we want.
For example, if we want to extract these patient groups: CONTROL, PSP, svPPA, we can do this:
myDx <- c("CONTROL", "PSP", "svPPA")
df_sfx$Dx %in% myDx
# [1] FALSE FALSE TRUE TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE
# [13] FALSE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
# [25] TRUE FALSE FALSE TRUE FALSE TRUE TRUE FALSE FALSE FALSE TRUE FALSE
# [37] TRUE TRUE FALSE TRUE TRUE TRUE FALSE TRUE TRUE FALSE TRUE TRUE
# [49] FALSE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE FALSE TRUE
# [61] FALSE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
# [73] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
# [85] TRUE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE FALSE FALSE
# [97] FALSE TRUE FALSE TRUE TRUE FALSE FALSE FALSE FALSE TRUE TRUE TRUE
# [109] FALSE TRUE FALSE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE
# [121] FALSE FALSE FALSE TRUE FALSE FALSE TRUE TRUE FALSE FALSE TRUE TRUE
# [133] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
# [145] FALSE FALSE FALSE TRUE TRUE TRUE FALSE FALSE FALSE TRUE TRUE TRUE
# [157] TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
# [169] TRUE FALSE FALSE FALSE TRUE TRUE TRUE FALSE TRUE FALSE TRUE FALSE
# [181] TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE
# [193] FALSE TRUE FALSE TRUE TRUE TRUE TRUE FALSE
This returns a vector of TRUE and FALSE indicating which rows contain either CONTROL, PSP, or svPPA.
Hence we can use this to filter the rows!
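Under the hood, filter(Dx %in% myDx) is just this kind of logical subsetting; a base-R sketch with toy values:

```r
# filter(Dx %in% myDx) boils down to logical subsetting; toy example
dx <- c("CONTROL", "PSP", "bvFTD", "svPPA")
keep <- dx %in% c("CONTROL", "PSP", "svPPA")

keep      # TRUE TRUE FALSE TRUE
dx[keep]  # "CONTROL" "PSP" "svPPA"
```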
demographics <- c("Dx", "Age", "Sex", "Education")
df_sfx %>% select(all_of(demographics), MMSE) %>% filter(MMSE > 23 & Dx %in% myDx)
Use the %in%
method to get:
# Type your code after this line!
demographics <- c("ID", "Age", "Sex", "Education", "Dx", "MMSE", "CDR", "COG1")
myDx <- c("CONTROL", "bvFTD", "nfvPPA")
df_sfx %>% select(all_of(demographics), MMSE) %>%
filter(MMSE > 16 & Dx %in% myDx)
arrange()
works similarly to filter()
except that instead of selecting rows, it changes their order. It takes a data frame and a set of column names (or more complicated expressions) to order by.
df_sfx %>% arrange(Sex, MMSE)
Arrange df_sfx by diagnosis, followed by their MMSE
# Type your code after this line!
df_sfx %>% arrange(Dx, MMSE)
We can also use desc() to reorder by a column in descending order:
df_sfx %>% arrange(desc(Dx), MMSE)
Use desc()
to arrange the COG1 in descending order.
# Type your code after this line!
df_sfx %>% arrange(desc(COG1))
Sometimes we have questionnaire items and we may want to compute the total score. It is useful to add new columns that are functions of existing columns. That’s the job of mutate().
mutate() always adds new columns at the end of your dataset, so we’ll start by creating a narrower dataset so we can see the new variables.
df_sfx %>% select(all_of(demographics), q1:q6) %>% mutate(q_total = q1 + q2 + q3 + q4 + q5 + q6)
What if you have 20 items in your questionnaire that you have to sum? There is a more efficient way to do this: we can use a combination of rowSums() and select().
q_items <- c("q1", "q2", "q3", "q4", "q5", "q6")
df_sfx %>% select(all_of(demographics), q1:q6) %>% mutate(q_total = rowSums(.[q_items]))
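The rowSums() part on its own can be checked with a toy data frame (made-up item scores):

```r
# rowSums() adds across columns, producing one total per row
toy <- data.frame(q1 = c(1, 2), q2 = c(3, 4), q3 = c(5, 6))
q_cols <- c("q1", "q2", "q3")

rowSums(toy[q_cols])  # row totals: 9 and 12
```

Inside mutate(), the . stands for the data frame piped in, so .[q_items] selects exactly the item columns before summing.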
For our questionnaire items, we want to reverse-score q2, q4, and q6. Here is some description: minimum score: 1; maximum score: 5. Save the result as df_sfx_2.
Hint: Use MaxScore + MinScore - score to reverse.
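To see why the formula works, try it on a plain vector first (toy scores):

```r
# Reverse-scoring a 1-5 item with MaxScore + MinScore - score
scores <- c(1, 2, 3, 4, 5)
5 + 1 - scores  # 5 4 3 2 1
```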
# Type your code after this line!
df_sfx_2 <- df_sfx %>% mutate(q2r = 5+1-q2,
q4r = 5+1-q4,
q6r = 5+1-q6)
df_sfx_2
Using mutate() multiple times can waste time if you have many items to reverse. There is another method you can use to mutate over multiple variables. Here is an example:
# First create a function to reverse score
reverseScore <- function(item, MaxScore, MinScore) {
# if max is 5 and min is 1, reversed scores should be 5, 4, 3, 2, 1
# 5+1 - 1 = 5, 5+1 - 2 = 4, 5+1 - 3 = 3, 5+1 - 4 = 2, 5+1 - 5 = 1
MaxScore + MinScore - item
}
# Then we create the new variables in the data frame
MaxScore <- 5; MinScore <- 1
df_sfx_2 <- df_sfx %>% select(all_of(demographics), q1:q6) %>%
mutate_at(vars("q2", "q4", "q6"), list(r = ~reverseScore(., MaxScore, MinScore))) # appends _r to the end of each variable name
# Compute the total score
new_q_items <- c("q1", "q2_r", "q3", "q4_r", "q5", "q6_r")
df_sfx_2 %>% mutate(q_total = rowSums(.[new_q_items]))
Here’s another method:
toReverse <- c("q2", "q4", "q6")
Reverse_newName <- c("q2_r", "q4_r", "q6_r")
# Simple loop to create new variables
df_sfx_2 <- df_sfx
MaxScore <- 5; MinScore <- 1
for (i in seq_along(toReverse)) {
df_sfx_2[Reverse_newName[i]] <- reverseScore(df_sfx[, toReverse[i]], MaxScore, MinScore)
}
df_sfx_2 %>% mutate(q_total = rowSums(.[new_q_items]))
q1 to q6 don’t really mean anything to us at this point, as they are random variables I created just for this workshop.
Let’s assume q1 to q6 represent some well-being scale, with higher scores indicating better well-being. The minimum score patients can achieve on this scale is 6, and the maximum score they can achieve is 30.
So let’s rename q1 to q6 into WB_1 to WB_6 using rename()
so we know what questionnaire it is.
I will show you an example on one item:
df_sfx_2 %>% rename(WB_1 = q1,
WB_2 = q2)
Rename q1 to q6 into WB_1 to WB_6, including the reversed items (q2_r, q4_r, q6_r). Remember, we have not created a variable for the total score yet.
# Type your code after this line!
df_sfx_2 %>% rename(WB_1 = q1,
WB_2 = q2,
WB_2_r = q2_r,
WB_3 = q3,
WB_4 = q4,
WB_4_r = q4_r,
WB_5 = q5,
WB_6 = q6,
WB_6_r = q6_r)
We can also use %in% to rename these variables:
# Rename by matching old names
# (note: oldnames must list the names in the same order as the columns appear in df_sfx_2)
oldnames <- c("q1", "q2", "q3", "q4", "q5", "q6", "q2_r", "q4_r", "q6_r")
newnames <- c("WB_1", "WB_2", "WB_3", "WB_4", "WB_5", "WB_6", "WB_2_r", "WB_4_r", "WB_6_r")
names(df_sfx_2)[names(df_sfx_2) %in% oldnames] <- newnames
Sometimes you may want to split a continuous variable into categories (e.g. high, medium, low) to run some interaction effects. In this case, we can use cut() along with mutate() to create a new variable, converting it from continuous to categorical.
df_sfx_2 %>% mutate(category = cut(MMSE,
breaks = c(0, 15, 23, 30),
labels = c("Severe","Moderate","Functioning")))
The above code creates 3 categories: Severe (MMSE 0-15), Moderate (MMSE 16-23), and Functioning (MMSE 24-30). For n categories, you will need n + 1 breaks, as the scores between consecutive breaks form the categories.
Hence, if you want to create High vs Low MMSE, you will need 3 break values:
df_sfx_2 %>% mutate(category = cut(MMSE,
breaks = c(0, 15, 30),
labels = c("Low","High")))
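To see exactly how scores map to categories, you can run cut() on a plain vector (toy MMSE values):

```r
# cut() maps each score to the interval it falls in; intervals are
# right-closed by default, so 15 lands in (0,15] and 16 in (15,23]
mmse <- c(10, 15, 16, 23, 24, 30)
mmse_cat <- cut(mmse,
                breaks = c(0, 15, 23, 30),
                labels = c("Severe", "Moderate", "Functioning"))
mmse_cat
```

If you need the lowest break itself (here 0) to be included, add include.lowest = TRUE.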
Create a new variable called education_cat that contains education level in categories.
# Type your code after this line!
df_sfx_2 %>% mutate(education_cat = cut(Education,
breaks = c(0, 11, 15, 19, 99),
labels = c("less than high school",
"high school",
"college graduation",
"advanced graduate degree")))
These are the steps you will need to perform to get to the final data.
Create an object called df_sfx_3 with ages above 65, MMSE > 16, CDR < 1, and education = 17 (use filter()).
Then, create the total score for WB (use mutate()).
Select these variables: RS1, RS3, RS4, Age, MMSE, Dx, CDR, COG2, and the WB total score (use select()).
Arrange the data by CDR, in ascending order (use arrange()).
Print or skim the data (type df_sfx_3).
Hint: Use the magrittr pipe %>%
to continue your code from the previous step
# Type your code after this line!
new_wb_items <- c("WB_1", "WB_2_r", "WB_3", "WB_4_r", "WB_5", "WB_6_r")
df_sfx_3 <- df_sfx_2 %>% filter(Age>65,
MMSE>16,
CDR<1,
Education == 17) %>%
mutate(WB_total = rowSums(.[new_wb_items])) %>%
select(RS1, RS3, RS4, Age, MMSE, Dx, CDR, COG2, WB_total) %>%
arrange(CDR)
df_sfx_3
Two functions that are usually used together are group_by()
and summarize()
. You don’t really use group_by()
alone, as there is no reason to group data unless you want to create a summary such as mean, sd, or median.
df_sfx_2 %>% group_by(Dx) %>% summarize(CDR_mean = mean(CDR),
CDR_SD = sd(CDR))
What do you notice about the CDR for the CONTROLS? Which group has the highest CDR?
Use group_by()
, summarise()
to summarize the mean, sd, and median of MMSE in each diagnostic group. Use df_sfx_2 for this exercise.
Hint: mean: mean()
, sd: sd()
, median: median()
.
# Type your code after this line!
df_sfx_2 %>% group_by(Dx) %>%
summarize(MEAN=mean(MMSE),
SD=sd(MMSE),
MEDIAN=median(MMSE))
Use group_by()
, summarise()
, and arrange()
to display each group’s COG1, MMSE, and CDR mean score in descending order (Highest COG1 on top). Call this data frame df_sfx_groupsummary
. Use df_sfx_2 for this exercise.
# Type your code after this line!
df_sfx_groupsummary <- df_sfx_2 %>% group_by(Dx) %>%
summarize(COG1MEAN=mean(COG1),
MMSEMEAN=mean(MMSE),
CDRMEAN=mean(CDR)) %>%
arrange(desc(COG1MEAN))
df_sfx_groupsummary
Frequently, data will be in formats that are unsuitable for analysis. One of the most common examples is data where things that should be in the rows are in the columns, or vice versa. In these cases, we need to rearrange the data we’ve received into a form that is easier to deal with.
When we want to turn wide-format data into long format, we can use gather().
data %>% gather(key = "NewVariableName", value = "NewResult")
is the typical structure of this code, but it can get complicated with keys and values.
In short:
key: the name of the new column that will contain the names of the variables you gathered.
value: the name of the new column that will contain their values.
Here’s an example:
df_sfx_2 %>% select(ID, all_of(demographics), MMSE, CDR, COG1) %>% # Select only relevant variables to keep columns short
gather(key = "CogTest", value = "Score", MMSE, CDR, COG1) %>%
arrange(ID)
Did you see that the variables MMSE, CDR, and COG1 have now disappeared? They are now contained in a new variable called CogTest. Furthermore, we now have 600 rows instead of 200! This is because we have 200 patients and 3 tests that were rearranged into long format.
The opposite of gather() is spread(). As you might have guessed, it turns long-format data back into wide format.
Here’s an example:
# Set an example long data from previous line
df_sfx_long <- df_sfx_2 %>%
select(ID, all_of(demographics), MMSE, CDR, COG1) %>%
gather(key = "CogTest", value = "Score", MMSE, CDR, COG1)
# Now turn the long data back to wide data
df_sfx_long %>% spread(key = "CogTest", value = "Score") # Note that it spreads by alphabetical order unless you specify otherwise.
The format is similar to gather():
key: the column whose values will become the new column names (should be categorical).
value: the column used to populate the new columns (the value column we created in the gather step).
Using the df_sfx_groupsummary we have just created, gather the MMSEMEAN, CDRMEAN, and COG1MEAN columns into one column called CogTest.
# Type your code after this line!
df_sfx_groupsummary %>% gather(key = "CogTest", value = "Score", MMSEMEAN, CDRMEAN, COG1MEAN)
The functions above are preparation steps to get your data ready to run statistical analyses. Once these are completed, you can begin to run your data analysis using the tidyverse format!
For example, you might be interested in:
diagnostic groups: CONTROLS, bvFTD, svPPA
MMSE > 16
CDR < 1 (early stage of disease)
The variables you might want to analyze are COG2, RS1, RS2, and WB (questionnaire).
Your analysis plan might look like this:
# Set interested variables
demo <- c("ID", "Age", "Sex", "Dx", "Education", "CDR", "MMSE")
myDx <- c("CONTROLS", "bvFTD", "svPPA")
RSs <- c("RS1", "RS2")
WBs_old <- c("q1", "q2", "q3", "q4", "q5", "q6")
WBs_new <- c("q1", "q2_r", "q3", "q4_r", "q5", "q6_r")
COG2 <- c("COG2")
# Reverse scoring function and prep
reverseScore <- function(item, MaxScore, MinScore) {
# if max is 5 and min is 1, reversed scores should be 5, 4, 3, 2, 1
# 5+1 - 1 = 5, 5+1 - 2 = 4, 5+1 - 3 = 3, 5+1 - 4 = 2, 5+1 - 5 = 1
MaxScore + MinScore - item
}
MaxScore <- 5; MinScore <- 1
# Subset data
df <- df_sfx %>% # Take the data
select(all_of(demo),
all_of(RSs),
all_of(WBs_old),
all_of(COG2)) %>% # Select interested vars
filter(Dx %in% myDx) %>% # Filter only myDx
mutate_at(vars("q2", "q4", "q6"),
list(r = ~reverseScore(., MaxScore, MinScore))) %>% # append _r at the end of the variable
mutate(WB_Score = rowSums(.[WBs_new])) %>% # Create WB total score
select(all_of(demo),
all_of(RSs),
WB_Score,
all_of(COG2))
skim(df)
| | |
|---|---|
| Name | df |
| Number of rows | 52 |
| Number of columns | 11 |
| Column type frequency: | |
| character | 2 |
| numeric | 9 |
| Group variables | None |
Variable type: character

| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Sex | 0 | 1 | 4 | 6 | 0 | 2 | 0 |
| Dx | 0 | 1 | 5 | 5 | 0 | 2 | 0 |
Variable type: numeric

| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| ID | 0 | 1 | 97.40 | 63.03 | 2.00 | 38.75 | 89.50 | 148.25 | 198.00 | ▇▅▅▅▆ |
| Age | 0 | 1 | 65.85 | 8.54 | 48.00 | 59.00 | 65.00 | 70.75 | 83.00 | ▂▇▇▆▃ |
| Education | 0 | 1 | 17.48 | 2.84 | 12.00 | 15.00 | 18.00 | 19.25 | 23.00 | ▃▆▅▇▃ |
| CDR | 0 | 1 | 0.75 | 0.25 | 0.50 | 0.50 | 0.75 | 1.00 | 1.00 | ▇▁▁▁▇ |
| MMSE | 0 | 1 | 19.90 | 5.09 | 6.00 | 18.00 | 21.00 | 23.25 | 27.00 | ▁▂▅▆▇ |
| RS1 | 0 | 1 | 0.10 | 0.05 | 0.00 | 0.07 | 0.09 | 0.12 | 0.20 | ▂▇▇▆▂ |
| RS2 | 0 | 1 | 0.12 | 0.07 | -0.01 | 0.08 | 0.11 | 0.16 | 0.33 | ▃▇▆▂▁ |
| WB_Score | 0 | 1 | 17.83 | 3.58 | 11.00 | 15.00 | 18.00 | 20.25 | 27.00 | ▆▇▇▆▁ |
| COG2 | 0 | 1 | 8.52 | 2.68 | 0.94 | 7.04 | 8.21 | 10.29 | 13.89 | ▁▂▇▅▃ |
Let’s say you want to examine:

- the differences in MMSE between the patient groups (ANOVA)
- the relationship between RS1 and COG2, controlling for MMSE, Education, Age, and Sex (linear regression)
mod1 <- aov(MMSE ~ Dx, data = df)
mod1 %>% broom::tidy() # or mod1 %>% summary()
summary(mod1)
# Df Sum Sq Mean Sq F value Pr(>F)
# Dx 1 128.1 128.13 5.364 0.0247 *
# Residuals 50 1194.4 23.89
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
reg_cov <- lm(COG2 ~ MMSE + Education + Age + Sex, data = df)
reg_cov %>% summary()
#
# Call:
# lm(formula = COG2 ~ MMSE + Education + Age + Sex, data = df)
#
# Residuals:
# Min 1Q Median 3Q Max
# -7.047 -1.623 -0.409 1.508 4.971
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 2.66742 3.83268 0.696 0.490
# MMSE -0.04737 0.07518 -0.630 0.532
# Education 0.16296 0.13371 1.219 0.229
# Age 0.05788 0.04444 1.302 0.199
# SexMale 0.24386 0.76357 0.319 0.751
#
# Residual standard error: 2.684 on 47 degrees of freedom
# Multiple R-squared: 0.07631, Adjusted R-squared: -0.002302
# F-statistic: 0.9707 on 4 and 47 DF, p-value: 0.4325
reg_mod <- lm(COG2 ~ MMSE + Education + Age + Sex + RS1, data = df)
reg_mod %>% summary()
#
# Call:
# lm(formula = COG2 ~ MMSE + Education + Age + Sex + RS1, data = df)
#
# Residuals:
# Min 1Q Median 3Q Max
# -5.3503 -1.4753 -0.0593 1.4560 5.7189
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 7.90817 4.05959 1.948 0.05753 .
# MMSE -0.06287 0.07061 -0.890 0.37786
# Education 0.17571 0.12527 1.403 0.16744
# Age 0.01391 0.04455 0.312 0.75619
# SexMale 0.30799 0.71526 0.431 0.66877
# RS1 -23.14161 8.38336 -2.760 0.00826 **
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Residual standard error: 2.513 on 46 degrees of freedom
# Multiple R-squared: 0.2076, Adjusted R-squared: 0.1214
# F-statistic: 2.41 on 5 and 46 DF, p-value: 0.05059
stats::anova(reg_cov, reg_mod)
stats::anova(reg_cov, reg_mod) %>% broom::tidy()
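Since the output of the nested-model comparison is not shown above, here is a self-contained sketch on simulated data (the variable names echo the workshop data, but the numbers are made up and will not match `df_sfx`): `anova()` on two nested `lm` fits runs an F test of whether the extra predictor improves fit, and `broom::tidy()` turns that table into a tidy data frame.

```r
library(dplyr)
library(broom)

# Simulated stand-in for the workshop data
set.seed(1)
n <- 52
sim <- data.frame(
  MMSE = rnorm(n, 20, 5),
  RS1  = rnorm(n, 0.10, 0.05)
)
sim$COG2 <- 8 - 20 * sim$RS1 + rnorm(n, 0, 2.5)

cov_mod  <- lm(COG2 ~ MMSE, data = sim)        # covariate only
full_mod <- lm(COG2 ~ MMSE + RS1, data = sim)  # covariate + predictor

# Does adding RS1 significantly improve fit? (F test on nested models)
anova(cov_mod, full_mod) %>% tidy()
```

The tidied table has one row per model, with the F statistic and p-value on the row of the larger model.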
Use `tidy()` from the `broom` package to summarize information about the components of a model.

- Extract variables with `select()`
- Extract cases with `filter()`
- Arrange cases with `arrange()`
- Connect operations with `%>%`
- Make tables of summaries with `summarise()`
- Make new variables with `mutate()`
- Do groupwise operations with `group_by()`
- Turn data into long format using `gather()`, and into wide format using `spread()`
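The verbs above chain together into a single pipeline. A compact sketch on a made-up tibble (all names and values here are illustrative only):

```r
library(dplyr)

# Toy data: two diagnostic groups, three participants each
toy <- tibble(
  ID   = 1:6,
  Dx   = rep(c("CONTROLS", "bvFTD"), each = 3),
  MMSE = c(29, 28, 27, 20, 18, 22),
  CDR  = c(0, 0.5, 0, 1, 1, 0.5)
)

toy %>%
  select(ID, Dx, MMSE, CDR) %>%            # extract variables
  filter(MMSE > 16) %>%                    # extract cases
  mutate(MMSE_c = MMSE - mean(MMSE)) %>%   # make a new variable
  group_by(Dx) %>%                         # groupwise operations
  summarise(MMSE_mean = mean(MMSE)) %>%    # table of summaries
  arrange(MMSE_mean)                       # arrange cases
```

Each verb takes a data frame in and returns a data frame out, which is what lets `%>%` string them together in reading order.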