R notebooks

This is an R Markdown Notebook. When you execute code within the notebook, the results appear beneath the code.

R code goes in code chunks, denoted by three backticks. Execute a chunk by clicking the Run button within the chunk, or by placing your cursor inside it and pressing Ctrl+Shift+Enter (Windows) or Cmd+Shift+Enter (Mac).

Throughout the workshop, I will try to share as many best practices as possible. These will be denoted by an FYI box like this:

This is a best practice in R coding (or at least I think it is)

This is an exercise for you to complete

This is a hint for the exercise

Setup

The first chunk in an R Notebook is usually titled “setup,” and by convention includes the R packages you want to load. Remember, in order to use an R package you have to run some library() code every session. Execute these lines of code to load the packages.

To install a package, you have to type install.packages("packagename").

I’ve actually set the packages up so that if a package is not installed yet — if (!require(packagename)), where !require() returns TRUE when the package cannot be loaded — it will automatically be installed with install.packages("packagename") and then loaded with library(packagename).

Here is an example: if (!require(knitr)) install.packages("knitr"); library(knitr).

if (!require(knitr)) install.packages("knitr"); library(knitr)
## Loading required package: knitr
## Warning: package 'knitr' was built under R version 4.0.5
knitr::opts_chunk$set(comment = "#", warning = FALSE, message = FALSE)
if (!require(kableExtra)) install.packages("kableExtra"); library(kableExtra)
if (!require(tidyverse)) install.packages("tidyverse") ; library(tidyverse)
if (!require(skimr)) install.packages("skimr") ; library(skimr)

Loading data

First we have to know how to load our data. Without this step, there is no data to even begin manipulating!

With the readr, haven, and readxl packages, we can load various types of data just by typing read_*(), where * could be csv, txt, sav, sas, dta, or excel.

Here’s an example:

# Set directories 
root_dir <- ".." # .. denotes the parent directory of 01-Data-Wrangling.Rmd
# You can set your root_dir to the directory of where RWorkshop is
# e.g. /User/YOURNAME/RWorkshop
data_dir <- file.path(root_dir, "data")

# Set files
sample_csvfile <- "sample_data1.csv"

# Load data
df_sfx <- read_csv(file.path(data_dir, sample_csvfile))

Notice that I separated and commented each part of the process to make it clear for readers (and myself) to follow. Also, I use file.path() to build paths to subfolders within the project directory (which I call root_dir). This kind of naming keeps things clear in more complex projects where you have several data sets across several folders, especially for longitudinal studies!

It is good to always create an RProject for each project you are working on, so R will automatically start in the directory where the .Rproj file is located when you open your project. Some common subdirectories you may find are: data, output, R, plots, and report.
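
As a sketch, the folder structure above can be created from the project root with dir.create(). The folder names here are just the common conventions mentioned, so adjust them to your project:

```r
# Create the common project subfolders if they do not exist yet.
# Run this from the project root (where the .Rproj file lives).
project_dirs <- c("data", "output", "R", "plots", "report")
for (d in project_dirs) {
  if (!dir.exists(d)) dir.create(d)
}
```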

Now, you try!

Now let’s try loading these few files into separate data frames:

  1. load the sample_data.sav file with read_spss() and name it df_spss
  2. load the spss_datadictionary.xlsx file with read_excel() and name it spss_datadict

Hint 1: First, we have to load the haven and readxl packages as they are not loaded in the core tidyverse collection. Use the library() function to load these packages

Hint 2: use file.path(filepath, filename) to get the full path to a file rather than changing your directory! Do it like this:

# Type your code after this line! 
library(haven)
library(readxl)
spss_file <- "sample_data3.sav"
spss_datadictfile <- "sample_data3_datadictionary.xlsx"

df_spss <- read_spss(file.path(data_dir, spss_file))
spss_datadict <- read_excel(file.path(data_dir, spss_datadictfile))

Inspecting your data

The first thing we want to do is to inspect the data. We have two data sets and one data dictionary. Let’s look at what each dataset contains.

df_sfx

Did you notice how, under every column (or variable) name, the variable type is written? For example, Dx is a character, and Age is a double. This looks like a data frame, but it is actually a tibble.

Tibbles are data frames, but they tweak some older behaviors to make life a little easier. I’ll use the terms data frame and tibble interchangeably here since we are working with data frames overall.

The main difference between a tibble and a data.frame is how it is printed: tibbles show only the first 10 rows, while a data.frame prints everything! Tibbles are designed so that you don’t accidentally overwhelm your console when you print large data frames.

Try typing df_sfx and as.data.frame(df_sfx) in the console and see the difference.
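
For instance, using the built-in mtcars data set (which ships with R as a plain data.frame), you can see the printing difference without loading any files:

```r
library(tibble)

# mtcars is a classic data.frame: printing it dumps all 32 rows
mtcars

# as_tibble() converts it; printing now shows only the first 10 rows,
# along with each column's type
as_tibble(mtcars)
```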

Looking at summary statistics of the data.

By typing your variable name in the console or editor, you print out the data as you imported it. This helps you check whether you have imported your data correctly.

The next thing you might want to do is to examine some descriptives, such as the quartiles, min/max values, median, mean, and sd. We can do this by running summary() or skim().

summary(df_sfx) # Gets the summary of each variable (Min/max values, quartiles, median, mean)
#        ID              Dx                 Age            Sex           
#  Min.   :  1.00   Length:200         Min.   :48.00   Length:200        
#  1st Qu.: 50.75   Class :character   1st Qu.:59.00   Class :character  
#  Median :100.50   Mode  :character   Median :65.00   Mode  :character  
#  Mean   :100.50                      Mean   :64.97                     
#  3rd Qu.:150.25                      3rd Qu.:70.00                     
#  Max.   :200.00                      Max.   :84.00                     
#    Education          MMSE            CDR              COG1       
#  Min.   : 9.00   Min.   : 5.00   Min.   :0.0000   Min.   : 0.229  
#  1st Qu.:15.00   1st Qu.:18.00   1st Qu.:0.5000   1st Qu.: 3.648  
#  Median :17.00   Median :22.00   Median :0.5000   Median : 6.757  
#  Mean   :17.08   Mean   :21.09   Mean   :0.5675   Mean   : 7.748  
#  3rd Qu.:19.00   3rd Qu.:25.00   3rd Qu.:1.0000   3rd Qu.:11.749  
#  Max.   :25.00   Max.   :30.00   Max.   :1.0000   Max.   :23.777  
#       COG2              COG3             RS1                RS2          
#  Min.   : 0.9432   Min.   :-7.565   Min.   :-0.00291   Min.   :-0.09319  
#  1st Qu.: 7.2045   1st Qu.: 3.774   1st Qu.: 0.07026   1st Qu.: 0.07501  
#  Median : 8.7569   Median : 7.634   Median : 0.10032   Median : 0.11069  
#  Mean   : 8.9015   Mean   : 7.796   Mean   : 0.10243   Mean   : 0.11643  
#  3rd Qu.:10.6878   3rd Qu.:11.546   3rd Qu.: 0.13020   3rd Qu.: 0.15951  
#  Max.   :14.7452   Max.   :21.738   Max.   : 0.23003   Max.   : 0.32720  
#       RS3                RS4                 q1             q2       
#  Min.   :-0.06079   Min.   :-0.01458   Min.   :1.00   Min.   :1.000  
#  1st Qu.: 0.03723   1st Qu.: 0.04233   1st Qu.:2.00   1st Qu.:2.000  
#  Median : 0.07329   Median : 0.06891   Median :3.00   Median :3.000  
#  Mean   : 0.07456   Mean   : 0.06749   Mean   :3.03   Mean   :2.935  
#  3rd Qu.: 0.11244   3rd Qu.: 0.09359   3rd Qu.:4.00   3rd Qu.:4.000  
#  Max.   : 0.22846   Max.   : 0.13337   Max.   :5.00   Max.   :5.000  
#        q3             q4              q5             q6       
#  Min.   :1.00   Min.   :1.000   Min.   :1.00   Min.   :1.000  
#  1st Qu.:2.00   1st Qu.:2.000   1st Qu.:2.00   1st Qu.:2.000  
#  Median :3.00   Median :3.000   Median :3.00   Median :3.000  
#  Mean   :3.04   Mean   :2.855   Mean   :2.97   Mean   :2.985  
#  3rd Qu.:4.00   3rd Qu.:4.000   3rd Qu.:4.00   3rd Qu.:4.000  
#  Max.   :5.00   Max.   :5.000   Max.   :5.00   Max.   :5.000
skim(df_sfx) # tells you similar things to summary(), but also gives you more! 
Data summary
Name df_sfx
Number of rows 200
Number of columns 20
_______________________
Column type frequency:
character 2
numeric 18
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Dx 0 1 2 7 0 6 0
Sex 0 1 4 6 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
ID 0 1 100.50 57.88 1.00 50.75 100.50 150.25 200.00 ▇▇▇▇▇
Age 0 1 64.97 7.71 48.00 59.00 65.00 70.00 84.00 ▃▅▇▅▂
Education 0 1 17.08 2.97 9.00 15.00 17.00 19.00 25.00 ▂▆▇▆▂
MMSE 0 1 21.09 5.58 5.00 18.00 22.00 25.00 30.00 ▂▂▆▇▅
CDR 0 1 0.57 0.37 0.00 0.50 0.50 1.00 1.00 ▃▁▇▁▆
COG1 0 1 7.75 5.14 0.23 3.65 6.76 11.75 23.78 ▇▇▅▂▁
COG2 0 1 8.90 2.63 0.94 7.20 8.76 10.69 14.75 ▁▂▇▇▂
COG3 0 1 7.80 4.84 -7.57 3.77 7.63 11.55 21.74 ▁▅▇▆▁
RS1 0 1 0.10 0.04 0.00 0.07 0.10 0.13 0.23 ▂▇▇▃▁
RS2 0 1 0.12 0.06 -0.09 0.08 0.11 0.16 0.33 ▁▃▇▃▁
RS3 0 1 0.07 0.05 -0.06 0.04 0.07 0.11 0.23 ▂▅▇▅▁
RS4 0 1 0.07 0.03 -0.01 0.04 0.07 0.09 0.13 ▂▆▆▇▃
q1 0 1 3.03 1.37 1.00 2.00 3.00 4.00 5.00 ▆▇▇▇▆
q2 0 1 2.94 1.46 1.00 2.00 3.00 4.00 5.00 ▇▇▇▆▇
q3 0 1 3.04 1.35 1.00 2.00 3.00 4.00 5.00 ▆▅▇▆▅
q4 0 1 2.86 1.35 1.00 2.00 3.00 4.00 5.00 ▆▇▅▆▅
q5 0 1 2.97 1.42 1.00 2.00 3.00 4.00 5.00 ▇▆▇▇▇
q6 0 1 2.98 1.47 1.00 2.00 3.00 4.00 5.00 ▇▇▆▆▇


Did you see the difference between summary() and skim()? skim() also provides a small histogram in the last column, which is helpful if you want to know the distribution of a variable. Furthermore, it highlights negative values in red!

skim() is more powerful than summary() for basic data descriptives. I recommend using it whenever you load a dataset to check whether your data has been loaded correctly.

Now, you try!

First see what df_spss looks like, and then skim df_spss. What do you notice?

# Type your code after this line! 
df_spss
skim(df_spss)
Data summary
Name df_spss
Number of rows 325
Number of columns 7
_______________________
Column type frequency:
character 5
numeric 2
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
agegp2 2 0.99 1 1 0 2 0
educ 2 0.99 1 1 0 5 0
EDUCGP2 2 0.99 1 1 0 2 0
DASSdepgp2 2 0.99 1 1 0 2 0
EPDSgp2 5 0.98 1 1 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
ID 0 1.00 163.00 93.96 1 82 163 244 325 ▇▇▇▇▇
age 2 0.99 32.14 4.60 18 29 32 35 44 ▁▃▇▇▁

Did you notice that there are some missing values (NA)? Also, most of our variables carry the tag <S3: haven_labelled>. This means that we also imported the value labels from SPSS; these are in fact categorical variables rather than numerical variables.

This happens when you import SPSS files, because SPSS stores categorical variables as numeric codes. Check them before you run further analyses! This is another reason we use skim() to inspect our data first.

From skimming, we see 5 character columns and 2 numeric columns. We also know that agegp2, educ, EDUCGP2, DASSdepgp2, and EPDSgp2 are characters, and ID and age are numerics.

So, how do we know what each number represents? We can see this by typing df_spss$agegp2. The $ after the data frame name extracts a single column from the data frame.

df_spss$agegp2
# <labelled<double>[325]>: Under & over 35s
#   [1]  1  1  1  1  1  1  2  2  1  1  1  2  1 NA  2  2  1  2  2  1  1  1  1  1  2
#  [26]  1  1  1  1  1  1  1  1  2  1  1  1  2  2  2  1  1  1  2  1  1  2  1  1  2
#  [51]  1  1  1  2  1  1  1  1  2  1  1  1  1  2  2  1  2  2  1  1  1  1  1  1  2
#  [76]  1  1  2  1  1  1  2  2  1  2  2  1  1  1  2  1  1  1  1  2  1  1  1  1  1
# [101]  2  1  1  1  2  1  2  2  1  2  2  1  1  1  2  1  1  1  1  2  1  2  1  1  1
# [126]  2  1  2  1  2  2  2  1  1  2  1  1  2  1  1  1  2  2  2  2  1  1  1  1  2
# [151]  2  1  1  2  1  2  2  2  1  1  1  1  1  2  1  1  1  1  2  1  2  1  1  1  2
# [176]  2  1  1  1  1  2  1  2  2  1  1  2  1  1  2  1  2  2  2  1  1  1  2  1  1
# [201]  1  1  2  1  1  1  2  1  1  1  2  2  1  2  1  2  1  1  1  1  1  2  2  1  1
# [226]  1  1  2  1  2  2  2  1  1  1  1  1  1  2  1  1  1  1  1  1  1  1  1  1  1
# [251]  2  1  2  2  1  2 NA  1  2  2  2  1  2  1  2  1  1  1  1  1  1  2  1  1  1
# [276]  1  2  1  2  1  1  1  1  1  1  1  2  1  1  1  1  1  2  1  2  1  2  1  1  1
# [301]  2  1  1  1  1  2  1  1  1  1  1  2  2  2  2  1  1  1  1  2  2  1  2  1  2
# 
# Labels:
#  value       label
#      1   Under 35s
#      2 35 and over
Now, you try!

Look at the other variables: educ, EDUCGP2, DASSdepgp2, and EPDSgp2

# Type your code after this line! 
df_spss$educ
# <labelled<double>[325]>
#   [1]  6 NA  6  5  5  6  5  6  4  4  4  3  5  5  6  5  2  5  5  6  6  5  6  6  6
#  [26]  5  5  3  5  6  6  4  3  6  5  6  5  6  6  6  6  5  6  2  5  3  6  5  3  3
#  [51]  5  6  4  3  6  6  6  6  6  2  3  4  6  6  6  6  6  6  3  4  3  6  5  5  4
#  [76]  3  5  6  5  6  5  3  6  5  5  4  4  5  4  4  3  6  6  6  6  4  6  4  6  5
# [101]  4  4  2  6  6  4  6  6  4  5  6  4  3  6  4  6  5  4  6  4  5  3  5  6  6
# [126]  6  4  5  5  3  6  6  5  5  6  5  5  5  6  3  4  5  3  3  5  6  6  6  5  6
# [151]  6  2  6  3  5  4  5  4  4  4  6  4  5  4  5  4  6  5  6  6  4  5  6  5  3
# [176]  2  5  3  6  2  6  4  4  6  3  4  4  4  5  3  4  6  6  6  6  6  5  6  5  4
# [201]  6  6  6  5  5  3  4  5  5  5  5  5  4  6  6  5  5  2  5  5  5  5  6  6  6
# [226]  4  4  4  3  5  3  5  4  5  5  3  3  5  4  5  5  3  2  5  6  5  6  2  5  5
# [251]  6  5  4  4  6  5 NA  5  6  5  6  5  6  5  5  6  5  6  6  6  4  5  6  3  5
# [276]  5  5  6  6  4  5  4  4  3  5  5  5  5  5  5  4  4  6  4  6  6  5  3  5  5
# [301]  6  3  6  5  3  4  6  5  6  5  5  5  6  6  6  4  5  5  5  4  6  6  5  6  6
# 
# Labels:
#  value               label
#      1      primary school
#      2      some secondary
#      3 completed secondary
#      4 additional training
#      5        udergrad uni
#      6        postgrad uni
df_spss$EDUCGP2
# <labelled<double>[325]>: Education groups (2) Non-tertiary & tertiary
#   [1]  2 NA  2  2  2  2  2  2  1  1  1  1  2  2  2  2  1  2  2  2  2  2  2  2  2
#  [26]  2  2  1  2  2  2  1  1  2  2  2  2  2  2  2  2  2  2  1  2  1  2  2  1  1
#  [51]  2  2  1  1  2  2  2  2  2  1  1  1  2  2  2  2  2  2  1  1  1  2  2  2  1
#  [76]  1  2  2  2  2  2  1  2  2  2  1  1  2  1  1  1  2  2  2  2  1  2  1  2  2
# [101]  1  1  1  2  2  1  2  2  1  2  2  1  1  2  1  2  2  1  2  1  2  1  2  2  2
# [126]  2  1  2  2  1  2  2  2  2  2  2  2  2  2  1  1  2  1  1  2  2  2  2  2  2
# [151]  2  1  2  1  2  1  2  1  1  1  2  1  2  1  2  1  2  2  2  2  1  2  2  2  1
# [176]  1  2  1  2  1  2  1  1  2  1  1  1  1  2  1  1  2  2  2  2  2  2  2  2  1
# [201]  2  2  2  2  2  1  1  2  2  2  2  2  1  2  2  2  2  1  2  2  2  2  2  2  2
# [226]  1  1  1  1  2  1  2  1  2  2  1  1  2  1  2  2  1  1  2  2  2  2  1  2  2
# [251]  2  2  1  1  2  2 NA  2  2  2  2  2  2  2  2  2  2  2  2  2  1  2  2  1  2
# [276]  2  2  2  2  1  2  1  1  1  2  2  2  2  2  2  1  1  2  1  2  2  2  1  2  2
# [301]  2  1  2  2  1  1  2  2  2  2  2  2  2  2  2  1  2  2  2  1  2  2  2  2  2
# 
# Labels:
#  value        label
#      1 Non-tertiary
#      2     Tertiary
df_spss$DASSdepgp2
# <labelled<double>[325]>: DASS depress gps
#   [1]  0  1  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0
#  [26]  0  0  1  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  1  0  1  0  0  1  0
#  [51]  0  0  0  0  1  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
#  [76]  1  0  1  1  0  1  1  0  0  0  0  1  0  0  0  1  0  0  0  1  0  0  0  0  0
# [101]  0  0  1  1  0  0  1  0  0  0  0  0  1  0 NA  0  0  1  0  1  0  0  0  0  0
# [126]  0  0  0  1  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0
# [151]  0  0  1  1  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0  1  0  0  0  0  0
# [176]  0  0  1  0  0  1  0  0  0  0  0  1  1  0  1  0  0  0  0  0  0  0  1  0  0
# [201]  1  0  0  0  0  0  1  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  1  0
# [226]  0  1  0  0  1  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0
# [251]  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1 NA  0  0
# [276]  1  0  1  0  0  0  1  1  0  1  0  1  1  0  0  0  0  0  0  0  0  1  0  0  0
# [301]  0  0  0  0  0  1  1  0  0  0  1  1  0  1  0  1  0  0  0  1  0  0  0  0  0
# 
# Labels:
#  value                     label
#      0             not depressed
#      1 mild to severe depression
df_spss$EPDSgp2
# <labelled<double>[325]>: EPDS gps
#   [1]  0  1  0  0  0  0  0  0  1  1  0  0  0  0  1  1  0  0  0  1  1  0  0  0  0
#  [26]  0  0  0  0 NA  0  0  0  1  0  0  1  0  0  0  0  0  0  0  1  1  0  1  1  0
#  [51]  0  0  0  0  0  1  0  0  0  1  0  0  1  1  0  0  0  1  0  0  0  0  0  0  0
#  [76]  1  0  0  1  0  1  1  0  1  0  0  1  1  0  0  1  0  1  0  0  0  0  0  0  0
# [101]  0  0  0  1  0  0  0  0  0  0  0  0  1  0  0  0  0  1  0  1  0  0  0  0  0
# [126]  0  0 NA  1  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0
# [151]  0  0  0  1  0  0  0  1  0  0  0  0  0  1  0  1  0  0  0  1  0  0  0  1  0
# [176]  0  0  1  0  0  1  1  0  0  0  0  1  1  0  1  0  0  0  0  0  0  0  1  1  0
# [201]  0  0  0  0  0  0  1  0  0  0  0  0  1  0  0  1  0  0  0 NA  0  0  0  1  0
# [226]  0  1  0  0  1  0  0  0  1  0  1  0  0  1  1  0  0  1  0  0  0  0  0  1  0
# [251]  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  1  0  0  0  0  1 NA  0  0
# [276]  0  0  1  0  0  0  1 NA  0  1  0  1  1  0  1  1  0  0  0  1  0  1  0  1  0
# [301]  0  0  0  0  0  0  0  0  1  0  1  1  0  0  0  1  0  0  0  1  0  1  0  0  1
# 
# Labels:
#  value            label
#      0    not depressed
#      1 likely depressed

How do we convert these numerics into characters?

We can extract the labels from the dataset, but this takes a long time if we have a lot of variables. Hence, it is important to keep a data dictionary for our project (we imported one earlier as spss_datadict).

Here’s the haven method.

We can apply as_factor() to the entire SPSS dataset to convert all the labelled numeric columns into factors. This will make future steps of the analysis easier.

df_spss_converted <- as_factor(df_spss)
df_spss
df_spss_converted

Five commonly used functions for data manipulation

There are five key functions in tidyverse that allow you to solve the vast majority of your data-manipulation challenges. These functions specifically come from the dplyr package.

  • Pick observations by their values: filter().
  • Pick variables by their names: select().
  • Reorder the rows: arrange().
  • Create new variables as functions of existing variables: mutate().
  • Collapse many values down to a single summary: summarize().
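
As a quick preview of all five verbs on a toy tibble (made up for illustration):

```r
library(dplyr)

toy <- tibble(id = 1:4,
              group = c("a", "a", "b", "b"),
              score = c(10, 20, 30, 40))

toy %>%
  filter(score > 10) %>%                  # pick rows by value
  select(group, score) %>%                # pick columns by name
  arrange(desc(score)) %>%                # reorder the rows
  mutate(score2 = score * 2) %>%          # add a new column
  group_by(group) %>%
  summarize(mean_score2 = mean(score2))   # collapse to one row per group
```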

Select

Sometimes we want to select certain variables to work with (your data may be huge). In that case, we can use select() to pick out just those variables.

df_sfx %>% select(ID)

We use %>% to “pipe” a value. The %>% comes from the magrittr package in tidyverse. This operator will forward a value, or the result of an expression, into the next function call/expression. In the above example, we take the df_sfx object, and forward it to the next step, which is to select the ID variable using select(ID).

Running analysis without %>% is like going to New York for a foodie hunt without trying New York’s Pizza.

Use CTRL + SHIFT + M (WINDOWS) or CMD + SHIFT + M (MAC) to pipe!
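
The pipe is just a different way of writing nested function calls; these two lines compute exactly the same number:

```r
library(magrittr)

# Nested calls read inside-out
sum(sqrt(1:4))

# The pipe reads left to right: take 1:4, square-root it, then sum it
1:4 %>% sqrt() %>% sum()
```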

Now, you try!

Use the %>% operator and select() to select these variables from df_sfx: Dx, Age, Sex

Hint: Separate the variables in select() with a comma!

# Type your code after this line! 
df_sfx %>% select(Dx, Age, Sex)

Sometimes, you don’t want some variables in your new data. For example, we do not want q1 to q6 in our new dataset. Hence we can use select(-variable) to subtract these variables from the data frame.

df_sfx %>% select(-q1, -q2, -q3, -q4, -q5, -q6)

Okay, let’s look above: it is too long to type all the variables into the function one by one. In this case, we can put these variable names in a character vector and save it as an object for later use.

q_vars <- c("q1", "q2", "q3", "q4", "q5", "q6")
df_sfx %>% select(-all_of(q_vars)) # all_of() marks q_vars as an external vector of column names

See that df_sfx %>% select(-all_of(q_vars)) gives you the same outcome as df_sfx %>% select(-q1, -q2, -q3, -q4, -q5, -q6).

Now, you try!

We want to remove the RS1, RS2, RS3, RS4 variables this time.

Hint: Parse the variables you want to remove into an object called RS_networks.

# Type your code after this line! 
RS_networks <- c("RS1", "RS2", "RS3", "RS4")
df_sfx %>% select(-all_of(RS_networks))

Filter

We use filter() to remove or select rows depending on their values.

df_sfx %>% filter(Sex == "Female")

Comparisons and Logical Operators

Comparisons: we can use >, >=, <, <=, != (not equal), and == (equal) to filter our values. Logical operators: we can combine conditions with & (and), | (or), and ! (not). Note that the doubled forms && and || only work on single values, so use the single forms inside filter().
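
Here is how these operators combine inside filter(), on a toy tibble (made up for illustration):

```r
library(dplyr)

toy <- tibble(age = c(55, 62, 70, 81),
              dx  = c("CONTROL", "bvFTD", "CONTROL", "PSP"))

toy %>% filter(age >= 62 & dx == "CONTROL")  # both conditions must hold
toy %>% filter(age < 60 | dx == "PSP")       # at least one condition must hold
toy %>% filter(!(dx == "CONTROL"))           # negate a condition
```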

Now, you try!

Select and filter:

  • All IDs, Age, Sex, Dx, MMSE, CDR, COG1, and education of patients with MMSE > 23
  • All IDs, Age, Sex, Dx, MMSE, CDR, COG1, and education of patients with MMSE > 16 and CDR = 0
  • All IDs, Age, Sex, Dx, MMSE, CDR, COG1, and education of patients with MMSE > 16 & are either CONTROL, bvFTD, or nfvPPA

Use SHIFT + ALT + UP/DOWN (windows) or COMMAND + OPTION + UP/DOWN (Mac) to copy a line above/below

Hint: For no. 2, you need two filter components: > and &.
Hint: For no. 3, you need three filter components: >, &, and |.
Hint: For characters such as CONTROL, you will need to write "CONTROL" (in quotes) instead of CONTROL.

# Type your code after this line! 
df_sfx %>% filter(MMSE > 23)
df_sfx %>% filter(MMSE > 16 & CDR == 0)
df_sfx %>% filter(MMSE > 16 & CDR == 0 & (Dx == "CONTROL" | Dx == "bvFTD" | Dx == "nfvPPA"))

Typing Dx == "CONTROL" | Dx == "bvFTD" | Dx == "nfvPPA" takes a very long time. What if your grouping variable has 10 categories (we have 6 here) and you want 7 of them? This method does not scale well.

What we can do is use %in% to retrieve the categories that we want.

For example, if we want to extract these patient groups: CONTROL, PSP, svPPA, we can do this:

myDx <- c("CONTROL", "PSP", "svPPA")
df_sfx$Dx %in% myDx
#   [1] FALSE FALSE  TRUE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE
#  [13] FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
#  [25]  TRUE FALSE FALSE  TRUE FALSE  TRUE  TRUE FALSE FALSE FALSE  TRUE FALSE
#  [37]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE
#  [49] FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE
#  [61] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
#  [73] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
#  [85]  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE
#  [97] FALSE  TRUE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE
# [109] FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE
# [121] FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE FALSE FALSE  TRUE  TRUE
# [133] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
# [145] FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE
# [157]  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
# [169]  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE FALSE
# [181]  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE
# [193] FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE

This returns a vector of TRUE and FALSE indicating which rows contain CONTROL, PSP, or svPPA.
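
On a small toy vector, %in% reads as “is each element on the left found anywhere on the right?”:

```r
c("bvFTD", "CONTROL", "svPPA") %in% c("CONTROL", "PSP", "svPPA")
# → FALSE TRUE TRUE (the result follows the order of the left-hand vector)
```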

Hence we can use this to filter the rows!

demographics <- c("Dx", "Age", "Sex", "Education")
df_sfx %>% select(all_of(demographics), MMSE) %>% filter(MMSE > 23 & Dx %in% myDx)
Now, you try!

Use the %in% method to get:

  • All IDs, Age, Sex, Dx, MMSE, CDR, COG1, and education of patients with MMSE > 16 & are either CONTROL, bvFTD, or nfvPPA
# Type your code after this line! 
demographics <- c("ID", "Age", "Sex", "Education", "Dx", "MMSE", "CDR", "COG1")
myDx <- c("CONTROL", "bvFTD", "nfvPPA")
df_sfx %>% select(all_of(demographics)) %>% 
  filter(MMSE > 16 & Dx %in% myDx)

Arrange

arrange() works similarly to filter() except that instead of selecting rows, it changes their order. It takes a data frame and a set of column names (or more complicated expressions) to order by.

df_sfx %>% arrange(Sex, MMSE)
Now, you try!

Arrange df_sfx by diagnosis, followed by their MMSE

# Type your code after this line! 
df_sfx %>% arrange(Dx, MMSE)

desc

We can also use desc() to reorder by a column in descending order:

df_sfx %>% arrange(desc(Dx), MMSE)
Now, you try!

Use desc() to arrange the COG1 in descending order.

# Type your code after this line! 
df_sfx %>% arrange(desc(COG1))

Mutate

Sometimes we have questionnaire items and we may want to compute the total score. It is useful to add new columns that are functions of existing columns. That’s the job of mutate().

mutate() always adds new columns at the end of your dataset so we’ll start by creating a narrower dataset so we can see the new variables.

df_sfx %>% select(all_of(demographics), q1:q6) %>% mutate(q_total = q1 + q2 + q3 + q4 + q5 + q6)

What if you have 20 items in your questionnaire that you have to sum? Yes, there is a more efficient way: combine rowSums() with a vector of item names.

q_items <- c("q1", "q2", "q3", "q4", "q5", "q6")
df_sfx %>% select(all_of(demographics), q1:q6) %>% mutate(q_total = rowSums(.[q_items]))
Now, you try!

For our questionnaire items, we want to reverse-score q2, q4, and q6. Some details: the minimum score is 1 and the maximum score is 5.

  1. Create reversed scores for q2, q4, and q6 in a new data frame called df_sfx_2
  2. Sum the new questionnaire items in the new data frame

Hint: Use MaxScore + MinScore - score to reverse

# Type your code after this line! 
df_sfx_2 <- df_sfx %>% mutate(q2r = 5+1-q2,
                              q4r = 5+1-q4,
                              q6r = 5+1-q6)
df_sfx_2

Calling mutate() item by item can be tedious if you have many items to reverse. There is another method to mutate over multiple variables at once. Here is an example:

# First create a function to reverse score
reverseScore <- function(item, MaxScore, MinScore) {
  # if the max is 5 and the min is 1, scores 1, 2, 3, 4, 5
  # should reverse to 5, 4, 3, 2, 1: e.g. 5+1 - 1 = 5, 5+1 - 2 = 4, 5+1 - 5 = 1
  MaxScore + MinScore - item
}


# Then we create the new variables in the data frame
MaxScore <- 5; MinScore <- 1
df_sfx_2 <- df_sfx %>% select(all_of(demographics), q1:q6) %>% 
  mutate_at(vars("q2", "q4", "q6"), list(r = ~reverseScore(., MaxScore, MinScore))) # appends _r to each variable name

# Compute the total score
new_q_items <- c("q1", "q2_r", "q3", "q4_r", "q5", "q6_r")
df_sfx_2 %>% mutate(q_total = rowSums(.[new_q_items]))

Here’s another method:

toReverse <- c("q2", "q4", "q6")
Reverse_newName <- c("q2_r", "q4_r", "q6_r")

# Simple loop to create new variables
df_sfx_2 <- df_sfx
MaxScore <- 5; MinScore <- 1
for (i in seq_along(toReverse)) {
  df_sfx_2[[Reverse_newName[i]]] <- reverseScore(df_sfx[[toReverse[i]]], MaxScore, MinScore)
}

df_sfx_2 %>% mutate(q_total = rowSums(.[new_q_items]))

Renaming Variables

q1 to q6 don’t really mean anything to us at this point, as they are random variables I created just for this workshop.

Let’s assume q1 to q6 represents some well-being scale, with higher scores indicating better well-being. The minimum score patients can achieve on this scale is 6, and the maximum score they can achieve is 30.

So let’s rename q1 to q6 into WB_1 to WB_6 using rename() so we know what questionnaire it is.

I will show you an example on the first two items:

df_sfx_2 %>% rename(WB_1 = q1, 
                    WB_2 = q2)

Now, you try!

Rename q1 to q6 into WB_1 to WB_6, including the reversed items (q2_r, q4_r, q6_r). Remember, we have not created a variable for the total score yet.

# Type your code after this line! 
df_sfx_2 %>% rename(WB_1 = q1, 
                    WB_2 = q2,
                    WB_2_r = q2_r,
                    WB_3 = q3,
                    WB_4 = q4,
                    WB_4_r = q4_r,
                    WB_5 = q5,
                    WB_6 = q6,
                    WB_6_r = q6_r)

We can also use %in% to rename these variables. Note that this approach relies on oldnames listing the columns in the same order as they appear in the data frame.

# Rename by matching the old names against the data frame's column names
oldnames <- c("q1", "q2", "q3", "q4", "q5", "q6", "q2_r", "q4_r","q6_r")
newnames <- c("WB_1", "WB_2", "WB_3", "WB_4", "WB_5", "WB_6", "WB_2_r", "WB_4_r","WB_6_r")
names(df_sfx_2)[names(df_sfx_2) %in% oldnames] <- newnames

Splitting continuous variables into categorical variables

Sometimes you may want to split a continuous variable into categories (e.g. high, medium, low) to examine some interaction effects. In this case, we can use cut() along with mutate() to create a new variable, converting it from continuous to categorical.

df_sfx_2 %>% mutate(category = cut(MMSE, 
                                   breaks = c(0, 15, 23, 30),
                                   labels = c("Severe","Moderate","Functioning")))

The above code creates 3 categories: Severe MMSE (0-15), Moderate MMSE (16-23), and Functioning MMSE (24-30). For n categories, you will need n + 1 breaks as the scores in between the breaks are used to form the categories.
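
One detail worth knowing: by default, cut() builds intervals that are open on the left and closed on the right, so with these breaks a score of exactly 15 lands in the first category and 23 in the second. A toy check:

```r
cut(c(15, 16, 23, 24),
    breaks = c(0, 15, 23, 30),
    labels = c("Severe", "Moderate", "Functioning"))
# → Severe Moderate Moderate Functioning
```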

Hence if you want to create High vs Low MMSE, then you will need 3 break values:

df_sfx_2 %>% mutate(category = cut(MMSE, 
                                   breaks = c(0, 15, 30),
                                   labels = c("Low","High")))
Now, you try!

Create a new variable called education_cat that contains education level in categories.

  • less than high school (< 12 years)
  • high school (12 - 15 years)
  • college graduation (16 - 19 years)
  • advanced graduate degree (>= 20 years)
# Type your code after this line! 
df_sfx_2 %>% mutate(education_cat = cut(Education, 
                                   breaks = c(0, 11, 15, 19, 99),
                                   labels = c("less than high school",
                                              "high school", 
                                              "college graduation", 
                                              "advanced graduate degree")))

Let’s try some exercises from scratch!

These are the steps you will need to perform to get to the final data.

  1. Create an object called df_sfx_3 with ages above 65, MMSE > 16, CDR < 1, education = 17 (use filter())

  2. Then, create the total score for WB (use mutate())

  3. Select these variables: RS1, RS3, RS4, Age, MMSE, Dx, CDR, COG2, WB total score (use select())

  4. Arrange the data by CDR, in ascending order (use arrange()).

Print or skim the data (type df_sfx_3).

Hint: Use the magrittr pipe %>% to continue your code from the previous step

# Type your code after this line! 
new_wb_items <- c("WB_1", "WB_2_r", "WB_3", "WB_4_r", "WB_5", "WB_6_r")
df_sfx_3 <-  df_sfx_2 %>% filter(Age>65,
                     MMSE>16,
                     CDR<1,
                     Education == 17) %>% 
  mutate(WB_total = rowSums(.[new_wb_items])) %>% 
  select(RS1, RS3, RS4, Age, MMSE, Dx, CDR, COG2, WB_total) %>% 
  arrange(CDR)
df_sfx_3

Grouping variables and summarizing data

Two functions that are usually used together are group_by() and summarize(). You rarely use group_by() alone, as there is no reason to group data unless you want to compute a summary such as a mean, sd, or median.

df_sfx_2 %>% group_by(Dx) %>% summarize(CDR_mean = mean(CDR),
                                        CDR_SD = sd(CDR))

What do you notice about the CDR for the CONTROLS? Which group has the highest CDR?

Now, you try!

Use group_by() and summarise() to compute the mean, sd, and median of MMSE in each diagnostic group. Use df_sfx_2 for this exercise.

Hint: mean: mean(), sd: sd(), median: median().

# Type your code after this line! 
df_sfx_2 %>% group_by(Dx) %>% 
  summarize(MEAN=mean(MMSE),
            SD=sd(MMSE),
            MEDIAN=median(MMSE))

Now, let’s put everything together!

Use group_by(), summarise(), and arrange() to display each group's mean COG1, MMSE, and CDR scores, arranged in descending order of COG1 (highest COG1 on top). Call this data frame df_sfx_groupsummary. Use df_sfx_2 for this exercise.

# Type your code after this line! 
df_sfx_groupsummary <- df_sfx_2 %>% group_by(Dx) %>% 
  summarize(COG1MEAN = mean(COG1),
            MMSEMEAN = mean(MMSE),
            CDRMEAN = mean(CDR)) %>% 
  arrange(desc(COG1MEAN))
df_sfx_groupsummary

Turning wide to long data format and vice versa

Frequently, data arrive in formats that are unsuitable for analysis. One of the most common problems is data where things that should be in rows are in columns, or vice versa. In these cases, we need to rearrange the data into a form that is easier to work with.

When we want to turn wide data format into long data format, we can use gather().

data %>% gather(key = "NewVariableName", value = "NewResult") is the typical structure of this call, but the key and value arguments can be confusing at first.

In short:

  • key: the name of the new column that will hold the names of the columns you gathered
  • value: the name of the new column that will hold their values.

Here’s an example:

df_sfx_2 %>% select(ID, demographics, MMSE, CDR, COG1) %>% # Select only relevant variables to keep columns short
  gather(key = "CogTest", value = "Score", MMSE, CDR, COG1) %>% 
  arrange(ID)

Did you see that the variables MMSE, CDR, and COG1 have now disappeared? They are now contained in a new variable called CogTest. Furthermore, we now have 600 rows instead of 200! That is because we have 200 patients and 3 tests, each rearranged into long format.

The opposite of gather() is spread(). As you might have guessed, it turns long data into wide data format.

Here’s an example:

# Set an example long data from previous line
df_sfx_long <- df_sfx_2 %>% 
  select(ID, demographics, MMSE, CDR, COG1) %>% 
  gather(key = "CogTest", value = "Score", MMSE, CDR, COG1)

# Now turn the long data back to wide data
df_sfx_long %>% spread(key = "CogTest", value = "Score") # Note that it spreads by alphabetical order unless you specify otherwise.

The format is similar to gather():

  • key: The column whose values become the new column names (should be categorical)
  • value: The column whose values will populate the new columns (the Score column we created in the gather step)
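To make the key/value roles concrete, here is a minimal round trip on a toy tibble (the data are made up for illustration):

```r
library(tidyr)
library(tibble)

# Two patients, two tests, in wide format
toy_wide <- tibble(ID = c(1, 2), TestA = c(10, 20), TestB = c(30, 40))

# Wide -> long: column names go into "Test", cell values into "Score" (4 rows)
toy_long <- toy_wide %>% gather(key = "Test", value = "Score", TestA, TestB)

# Long -> wide: splitting "Test" back into columns restores the original shape
toy_wide_again <- toy_long %>% spread(key = "Test", value = "Score")
```

Note that newer versions of tidyr supersede gather() and spread() with pivot_longer() and pivot_wider(), but the older functions still work as shown here.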
Now, you try!

Using the df_sfx_groupsummary we have just created, gather the MMSE, CDR, and COG1 into one column called CogTest.

# Type your code after this line!
df_sfx_groupsummary %>% gather(key = "CogTest", value = "Score", MMSEMEAN, CDRMEAN, COG1MEAN)

Example of running statistical analysis using Tidyverse

The functions above are preparation steps to get your data ready to run statistical analyses. Once these are completed, you can begin to run your data analysis using the tidyverse format!

For example, you might be interested in:

  1. diagnostic groups: CONTROLS, bvFTD, svPPA

  2. MMSE > 16

  3. CDR < 1 (early stage of disease)

The variables you might want to analyze are COG2, RS1, RS2, and WB (questionnaire).

Your analysis plan might look like this:

Getting the data ready

# Set interested variables
demo <- c("ID", "Age", "Sex", "Dx", "Education", "CDR", "MMSE")
myDx <- c("CONTROLS", "bvFTD", "svPPA")
RSs <- c("RS1", "RS2")
WBs_old <- c("q1", "q2", "q3", "q4", "q5", "q6")
WBs_new <- c("q1", "q2_r", "q3", "q4_r", "q5", "q6_r")
COG2 <- c("COG2")

# Reverse scoring function and prep
reverseScore <- function(item, MaxScore, MinScore) {
  # if max is 5 and min is 1, reversed scores for 1, 2, 3, 4, 5 should be 5, 4, 3, 2, 1
  # 5+1 - 1 = 5, 5+1 - 2 = 4, 5+1 - 3 = 3, 5+1 - 4 = 2, 5+1 - 5 = 1
  MaxScore + MinScore - item
}
MaxScore <- 5; MinScore <- 1

# Subset data
df <- df_sfx %>% # Take the data
  select(all_of(demo), 
         all_of(RSs), 
         all_of(WBs_old), 
         all_of(COG2)) %>% # Select interested vars
  filter(Dx %in% myDx) %>% # Filter only myDx
  mutate_at(vars("q2", "q4", "q6"), 
            list(r = ~reverseScore(., MaxScore, MinScore))) %>% # append _r at the end of the variable 
  mutate(WB_Score = rowSums(.[WBs_new])) %>%  # Create WB total score
  select(all_of(demo), 
         all_of(RSs), 
         WB_Score,
         all_of(COG2))

skim(df)
Data summary
Name df
Number of rows 52
Number of columns 11
_______________________
Column type frequency:
character 2
numeric 9
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Sex 0 1 4 6 0 2 0
Dx 0 1 5 5 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
ID 0 1 97.40 63.03 2.00 38.75 89.50 148.25 198.00 ▇▅▅▅▆
Age 0 1 65.85 8.54 48.00 59.00 65.00 70.75 83.00 ▂▇▇▆▃
Education 0 1 17.48 2.84 12.00 15.00 18.00 19.25 23.00 ▃▆▅▇▃
CDR 0 1 0.75 0.25 0.50 0.50 0.75 1.00 1.00 ▇▁▁▁▇
MMSE 0 1 19.90 5.09 6.00 18.00 21.00 23.25 27.00 ▁▂▅▆▇
RS1 0 1 0.10 0.05 0.00 0.07 0.09 0.12 0.20 ▂▇▇▆▂
RS2 0 1 0.12 0.07 -0.01 0.08 0.11 0.16 0.33 ▃▇▆▂▁
WB_Score 0 1 17.83 3.58 11.00 15.00 18.00 20.25 27.00 ▆▇▇▆▁
COG2 0 1 8.52 2.68 0.94 7.04 8.21 10.29 13.89 ▁▂▇▅▃
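Before trusting a scoring pipeline like the one above, it is worth sanity-checking the helper on a toy vector. A self-contained sketch using the same reverse-scoring logic:

```r
# Same reverse-scoring logic as reverseScore() above
reverseScore <- function(item, MaxScore, MinScore) {
  MaxScore + MinScore - item
}

# On a 1-5 scale, scores 1, 2, 3, 4, 5 should map to 5, 4, 3, 2, 1
reverseScore(1:5, MaxScore = 5, MinScore = 1)
# → 5 4 3 2 1
```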

Run some statistical analyses

Let’s say you want to examine:

  1. the differences in MMSE between the patient groups (ANOVA)

  2. the relationship between RS1 and COG2, controlling for MMSE, Education, Age, and Sex (linear regression)

1. ANOVA: Differences in MMSE between the patient groups

mod1 <- aov(MMSE ~ Dx, data = df)
mod1 %>% broom::tidy() # or mod1 %>% summary()
summary(mod1)
#             Df Sum Sq Mean Sq F value Pr(>F)  
# Dx           1  128.1  128.13   5.364 0.0247 *
# Residuals   50 1194.4   23.89                 
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

2. Regression: Relationship between RS1 and COG2, controlling for MMSE, Education, Age, and Sex

Covariates model

reg_cov <- lm(COG2 ~ MMSE + Education + Age + Sex, data = df)
reg_cov %>% summary()
# 
# Call:
# lm(formula = COG2 ~ MMSE + Education + Age + Sex, data = df)
# 
# Residuals:
#    Min     1Q Median     3Q    Max 
# -7.047 -1.623 -0.409  1.508  4.971 
# 
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)
# (Intercept)  2.66742    3.83268   0.696    0.490
# MMSE        -0.04737    0.07518  -0.630    0.532
# Education    0.16296    0.13371   1.219    0.229
# Age          0.05788    0.04444   1.302    0.199
# SexMale      0.24386    0.76357   0.319    0.751
# 
# Residual standard error: 2.684 on 47 degrees of freedom
# Multiple R-squared:  0.07631, Adjusted R-squared:  -0.002302 
# F-statistic: 0.9707 on 4 and 47 DF,  p-value: 0.4325

Full model

reg_mod <- lm(COG2 ~ MMSE + Education + Age + Sex + RS1, data = df)
reg_mod %>% summary()
# 
# Call:
# lm(formula = COG2 ~ MMSE + Education + Age + Sex + RS1, data = df)
# 
# Residuals:
#     Min      1Q  Median      3Q     Max 
# -5.3503 -1.4753 -0.0593  1.4560  5.7189 
# 
# Coefficients:
#              Estimate Std. Error t value Pr(>|t|)   
# (Intercept)   7.90817    4.05959   1.948  0.05753 . 
# MMSE         -0.06287    0.07061  -0.890  0.37786   
# Education     0.17571    0.12527   1.403  0.16744   
# Age           0.01391    0.04455   0.312  0.75619   
# SexMale       0.30799    0.71526   0.431  0.66877   
# RS1         -23.14161    8.38336  -2.760  0.00826 **
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# 
# Residual standard error: 2.513 on 46 degrees of freedom
# Multiple R-squared:  0.2076,  Adjusted R-squared:  0.1214 
# F-statistic:  2.41 on 5 and 46 DF,  p-value: 0.05059

Model change comparison

stats::anova(reg_cov, reg_mod)
stats::anova(reg_cov, reg_mod) %>% broom::tidy()

Use tidy() from the broom package to summarize information about the components of a model
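As a minimal illustration on a built-in dataset (mtcars, not part of the workshop data):

```r
library(broom)

# tidy() returns a tibble with one row per model term
fit <- lm(mpg ~ wt, data = mtcars)
tidy(fit)    # columns: term, estimate, std.error, statistic, p.value

# glance() gives a one-row, model-level summary (R-squared, AIC, ...)
glance(fit)
```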

Takeaways

  • Extract variables with select()

  • Extract cases with filter()

  • Arrange cases, with arrange()

  • Connect operations with %>%

  • Make tables of summaries with summarise()

  • Make new variables with mutate()

  • Do groupwise operations with group_by()

  • Turn data into long format with gather(), and into wide format with spread()