# load libraries
library(broom) # extract results from glm()
library(datawizard) #summarize variables
library(emmeans) # for simple slopes
library(interactions) # For plotting conditional effects
library(ggfortify) # for nice residual plots
library(kableExtra) # style tables
library(patchwork) # collate plots
library(tidyverse) # loads ggplot2, dplyr, Hmisc, and other data wrangling packages
# load data
covid_tib <- here::here("data/long_covid.csv") |>
readr::read_csv() |>
mutate(
asthma = forcats::as_factor(asthma) |> forcats::fct_relevel("No Asthma"), # convert to factor and set reference category
birth_sex = as_factor(birth_sex) |> fct_relevel("Female") #convert to factor and set reference category
)Model answer for TAP and Report
Notes
This document aims to give you an idea of how to write up your coursework on this module. At various points I make annotated notes, which look like this:
- For this example, I have deliberately chosen a research question not related to psychology
- I have also fitted a statistical model that you are not taught, and used lots of code and techniques not taught on this module
- This might seem like evidence of my inherent evil. However, the problem with using an example closely related to the actual assessments is that it will be very tempting not to deviate from my model answer and to adapt my code. It will give you code and text that you might be tempted to re-purpose with minimal thought, it potentially stifles your creativity and prevents you from being your incredible and unique self.
Because this model answer uses statistical models and techniques that we don’t teach you, please internalise the following very important points:
- It doesn’t matter if you don’t understand this model answer. In fact, I wouldn’t expect you to. Instead, aim to take from this model answer a sense of the general structure, level of detail, and type of material I am expecting you to submit.
- I have written this sample as someone with many years experience of executing and writing about statistical models. If anyone submits anything as good as this model answer then I need to give them my job! I encourage you to aspire to having my job, but rest assured my expectations are a lot lower than this model answer might imply.
Assignment
The research scenario
A team of researchers was interested in predictors of long Covid (defined as Covid-19 symptoms persisting more than 10 days). They collected information recently-infected people who provided information about their symptoms over 5 months. At 5 months, 1,398 had reported symptoms for more than 10 days and the current dataset contains these people. (These data are fabricated, but the results loosely match the results of Sudre et al. 2021)
The data file (long_covid.csv) contains the following variables:
id: participant idduration: the number of days that each person experienced symptoms of Covid-19birth_sex: the person’s sex assigned at birthage: the person’s age (years)asthma: whether the person suffered from asthma (or not)symptoms: the number of Covid-19 symptoms experiences in the first 7 days of infection
The research team wishes to test the following hypothesis:
- H1: The
durationof Covid symptoms will be predicted by the person’sage,birth_sex,asthmastatus and the number of Covid-19symptomsthey experienced during the first 7 days of infection. - H2: The effect
asthmawill diminish with increasing initial symptoms. - H3: Differences between those assigned as male at birth and those assigned as female will diminish with age.
The outcome variable is expected to display extreme positive skew.
For this assignment, you will need to use all of the variable in the data set except for id.
Tasks
The research team do not know about statistics, and want a report that they can take to policy makers (i.e. people who also don’t understand statistics). The report should include a model that tests their hypothesis and explains the results not just in terms of statistical significance but also the real world meaning (e.g., are the effects large enough to be meaningful in the real world?).
You must produce this report! It is suggested that you follow this structure:
- Analysis plan: Write an analysis plan. This plan should be a logical account of how you will test the hypothesis. You must include the equation describing the model and any assumptions that the model makes. The analysis plan should answer the following questions:
- What summary statistics will you use and why?
- How will you visualise the data?
- What model will you fit and why?
- How will you test for bias in the model and what will you do if you find it?
- What are the possible problems with fitting the model and what are your contingency plans under different scenarios?
- Think of this task like pre-registering your analysis.
- Coding: Execute your analysis plan and create all of the objects you need to write the report. Create your table of summary statistics and fit your model here, but do not print the relevant tables to the document until the next section.
- Report: Write a succinct report that interprets the results of your model. Essentially, follow your analysis plan interpreting relevant tables and plots as you go along. You should print tables and plots of relevant statistical information (the code you used to create these should be in the previous section). Your report should be a little like a results section from a research paper but including more detail. I want you to interpret the size of the parameter estimates that test the hypotheses, and their real world substance. It is not enough to focus only on p-values. You should also evaluate whether the model is good (is it biased, for example?).
Model answer
Code
# -----------
# IMPORTANT
# Below are custom functions I've written to help me to report results directly from the r objects I create. I do NOT expect you to write custom functions, or use these functions. That's why I have folded/hidden this code.
# custom function to print rounded values
print_x <- function(x, dp = 2){
fmt <- paste0("%.", dp, "f")
sprintf(fmt, x)
}
# custom function to print rounded p-values
print_p <- function(p, dp = 3, include_p = TRUE){
ps <- tibble::tibble(
p = p
)
if(include_p){
ps <- ps |>
mutate(
p = ifelse(p < 0.001, "*p* < 0.001", paste0("*p* = ", print_x(x = p, dp = dp)))
)
} else {
ps <- ps |>
mutate(
p = ifelse(p < 0.001, "< 0.001", print_x(x = p, dp = dp))
)
}
ps$p
}
# custom function to report parameter estimates from a model
report_pars <- function(tidyobject, row = 2, digits = 2, p_digits = 3, report = "all"){
n <- nrow(tidyobject)
if(is.null(attr(tidyobject, "estName"))){
stat_text = ", *z* = "
} else {
tidyobject <- tidyobject |>
rename(
estimate = contains("trend")
)
stat_text = ", *t* = "
}
row <- tidyobject |>
dplyr::mutate(
row_number = 1:n,
p.value = print_p(p.value, dp = p_digits),
ci = paste0("[", print_x(conf.low, dp = digits), ", ", print_x(conf.high, dp = digits), "]"),
across(.cols = where(is.double), \(x) print_x(x, dp = digits))
) |>
dplyr::filter(row_number == row)
case_when(
report == "all" ~ paste0(row$estimate, " ", row$ci, stat_text, row$statistic, ", ", row$p.value),
report == "est" ~ row$estimate,
report == "cil" ~ row$conf.low,
report == "cih" ~ row$conf.high
)
}Analysis plan
Given the expectation that symptom duration will be extremely positively skewed, a generalized linear model will be fit based on symptom duration being modelled by an exponential distribution. The exponential distribution allows for occasional large values of the response variable and so should be a useful model of extreme positive skew. In addition, symptom duration can be thought of as the time between events (the beginning and ending of symptoms) and duration is a non-negative continuous variable. Therefore, the link function that would be most appropriate is the log link because it matches the possible values of duration (unlike the negative reciprocal link function, which takes only negative values). The hypotheses are:
- H1: The
durationof Covid symptoms will be predicted by the person’sage,birth_sex,asthmastatus and the number of Covid-19symptomsthey experienced during the first 7 days of infection.- To test this hypotheses these variables will be entered into the model as predictors.
- H2: The effect
asthmawill diminish with increasing initial symptoms.- To test this hypothesis the symptom × asthma interaction will be included.
- H3: Differences between those assigned as male at birth and those assigned as female will diminish with age.
- To test this hypothesis the birth sex × age interaction will be included.
The model to be fitted is, therefore,
\[ \begin{aligned} \log\big(E(\text{duration}_i)\big) &= \eta_i \\ \eta_i &= \beta_0 + \beta_1\text{age}_i + \beta_2\text{birth sex}_i + \beta_3\text{asthma}_{i} + \\ &\ \quad \beta_4\text{symptoms}_i + \beta_5(\text{age} \times \text{birth sex})_i + \\ &\ \quad \beta_{6}(\text{symptoms} \times \text{asthma})_i \end{aligned} \]
Data will be summarized and visualized as follows:
- A density plot of
durationto verify the appropriate distribution for the GLM. - A density plot of
ageandsymptomsto assess their distributions - Plots of the log mean duration against both
ageandsymptomsto verify the linearity of the logit assumption. - Descriptive statistics (mean, standard deviation, inter-quartile range, skewness and kurtosis) of duration split by sex assigned at birth and asthma status to look for group differences in duration, and to further explore the distributional properties of symptom duration.
The generalized linear model will be fit. Any significant interactions will be followed up using a simple slopes analysis. Both interaction terms consist of one continuous and one categorical variable. The simple slopes will look at the the effect of the continuous predictor within each level of the categorical variables. For each interaction, the predicted value will be plotted as a function of the continuous predictor, with separate lines for levels of the categorical predictor. For significance tests \(\alpha\) = 0.05 will be used as a criterion.
The assumptions of the generalized linear model will be checked as follows:
- Distribution of the response variable: This assumption will have partially been checked by the density plot of duration. If this assumption is reasonable, then the standardised deviance residuals should be distributed approximately as \(N(0, 1)\). This will be checked with a q-q plot of the standardised deviance residuals. We expect the theoretical quantiles to map onto the observed quantiles (i.e. the dots will sit along the diagonal line).
- Linearity: This assumption will have been partially checked by the plots of the log mean duration against both
ageandsymptoms(above). However, a further check is using a plot of a transformation of the predicted values (\(2\log(\hat{\mu})\)) against the standardised deviance residuals (McCullagh and Nelder 1989). A random scatter of dots indicates that the assumption is reasonable. - Independence: This assumption can be pseudo-checked with plots of the standardised deviance residuals against the case number (i.e. an index of the location of the data within the data frame). We expect to see a random pattern.
If the assumption about the distribution of the response variable is not met (i.e. the exponential distribution is found to be a poor approximation of the distribution of duration) a different generalized linear model will be fit. Some examples of alternative GLMs based on potential distributional approximations of symptom duration: a Gaussian model with identity link function, a Poisson model with log link function. If linearity cannot be assumed or there is evidence of non-independence the analysis will be abandoned.
This analysis plan answers all of the required questions: What summary statistics will you use and why? How will you visualise the data? What model will you fit and why? How will you test for bias in the model and what will you do if you find it? What are the possible problems with fitting the model and what are your contingency plans under different scenarios?
Note that
- I have not included trivial details or details to do with coding such as which functions or packages I will use, how I will name things, how I will prepare variables or data wrangle.
- I have included an equation describing the model I will fit.
- I have said why I will do things (e.g. “A density plot of
durationto verify the appropriate distribution for the GLM”) - I have said what I’m looking for (e.g. “A random scatter of dots indicates that the assumption is reasonable”)
- I have indicated contingency plans (e.g., “If the assumption about the distribution of the response variable is not met (i.e. the exponential distribution is found to be a poor approximation of the distribution of
duration) a different generalized linear model will be fit …”)
Coding
Summary statistics and data visualization
# Initial plots
# Density plot of duration
duration_dty <- ggplot(covid_tib, aes(x = duration)) +
geom_density(colour = "#01779c") +
coord_cartesian(xlim = c(0, 130)) +
scale_x_continuous(breaks = seq(0, 130, 10)) +
labs(y = "Density", x = "Symptom duration (days)") +
theme_minimal()
# Bar chart of symptoms
symptom_bar <- ggplot(covid_tib, aes(x = symptoms)) +
geom_bar(fill = "#01779c", alpha = 0.5, colour = "#01779c") +
scale_x_continuous(breaks =0:10) +
scale_y_continuous(breaks = seq(0, 350, 50)) +
labs(y = "Frequency", x = "Number of symptoms in first 7 days") +
theme_minimal()
# Density plot of age
age_dty <- ggplot(covid_tib, aes(x = age)) +
geom_density(colour = "#01779c") +
coord_cartesian(xlim = c(0, 90)) +
scale_x_continuous(breaks = seq(0, 90, 5)) +
labs(y = "Density", x = "Age (years)") +
theme_minimal()
# table of descriptives by birth sex and asthma status
desc_tbl <- covid_tib |>
group_by(birth_sex, asthma) |>
describe_distribution(duration) |>
mutate(
Asthma = gl(2, 2, labels = c("No Asthma", "Asthma")),
`Birth sex` = gl(2, 1, 4, labels = c("Female", "Male"))
) |>
select(Asthma, `Birth sex`, Mean:n)
# Duration as a function of symptoms
dur_symp_gg <- ggplot(covid_tib, aes(y = log(duration), x = symptoms)) +
stat_summary(fun = "mean", colour = "#136CB9", alpha = 0.5) +
geom_smooth(method = "lm", se = F, formula = y ~ x, colour = "#CA3E34") +
labs(y = "Log(Mean symptom duration (days))", x = "Number of initial symptoms") +
scale_y_continuous(breaks = seq(3, 3.7, 0.1)) +
scale_x_continuous(breaks = 0:10) +
coord_cartesian(xlim = c(0, 10), ylim = c(3, 3.7)) +
theme_minimal()
# Mean duration by age
dur_age_gg <- ggplot(covid_tib, aes(y = log(duration), x = age)) +
stat_summary(fun = "mean", colour = "#136CB9", alpha = 0.5) +
geom_smooth(method = "lm", se = F, formula = y ~ x, colour = "#CA3E34") +
labs(y = "Log(Mean symptom duration (days))", x = "Age (years)") +
scale_y_continuous(breaks = seq(2.5, 4, 0.1)) +
scale_x_continuous(breaks = seq(0, 80, 5)) +
coord_cartesian(ylim = c(2.5, 4), xlim = c(0, 80)) +
theme_minimal()Statistical models
# Fit the model
covid_glm <- glm(duration ~ birth_sex + age + asthma + symptoms + age:birth_sex + symptoms:asthma, data = covid_tib, family = "Gamma"(link = "log"))
#formal test of fit
chi_fit_p <- pchisq(covid_glm$deviance, df = covid_glm$df.residual, lower.tail = FALSE) |>
print_p(dp = 2)
# chi_fit_p
# fit statistics and parameter estimates
covid_fit <- broom::glance(covid_glm) #get fit statistics
covid_pe <- broom::tidy(covid_glm, conf.int = T) #get raw parameter estimates
covid_exp <- broom::tidy(covid_glm, conf.int = T, exponentiate = T) #get exponentiated parameter estimates
# fit values to print in results
mod_dev <- print_x(covid_fit$deviance)
mod_df <- print_x(covid_fit$df.residual, dp = 0)
# get simple slopes
symptoms_asthma_emm <- emmeans::emtrends(covid_glm, spec = "asthma", var = "symptoms")
age_sex_emm <- emmeans::emtrends(covid_glm, spec = "birth_sex", var = "age")
symptoms_asthma_tbl <- broom::tidy(symptoms_asthma_emm, conf.int = T)
age_sex_tbl <- broom::tidy(age_sex_emm, conf.int = T)
symptoms_asthma_gg <- interactions::interact_plot(
covid_glm,
pred = symptoms,
modx = asthma,
x.label = "Symptoms in first 7 days",
y.label = "Predicted sympton duration (days)",
legend.main = "Asthma status"
) +
coord_cartesian(xlim = c(0, 10), ylim = c(25, 60)) +
scale_x_continuous(breaks = 0:10) +
scale_y_continuous(breaks = seq(25, 60, 5)) +
theme_minimal()
age_sex_gg <- interactions::interact_plot(
covid_glm,
pred = age,
modx = birth_sex,
x.label = "Age (years)",
y.label = "Predicted sympton duration (days)",
legend.main = "Sex assigned at birth"
) +
coord_cartesian(xlim = c(10, 80), ylim = c(20, 43)) +
scale_x_continuous(breaks = seq(10, 80, 10)) +
scale_y_continuous(breaks = seq(20, 40, 5)) +
theme_minimal()Model assumptions
# This plot shows a half-normal Q-Q plot of the absolute value of the standardized deviance residuals is shown
glm_qq <- ggplot2::autoplot(covid_glm,
which = 2,
alpha = 0.3,
size = 1,
label.size = 0,
colour = "#01779c") +
labs(y = "Standardized deviance residual", x = "Theoretical quantile") +
theme_bw()
# use augment() to store predicted probabilities (.fitted) and raw and standardized deviance residuals. Use mutate to add a variable of the transformed estimated probabilities.
covid_glm_resids <- covid_tib |>
na.omit() |>
tibble::rowid_to_column(var = "rowid") |>
broom::augment(covid_glm, data = _, type.predict = "response") |>
mutate(
trans_p = 2*log(.fitted)
)
# plot transformed estimated probabilities against standardized deviance residuals and plots a summary line using a loess function
covid_glm_zresid_gg <- ggplot(covid_glm_resids, aes(x = trans_p, .std.resid)) +
geom_point(colour = "#01779c", alpha = 0.3, size = 1) +
geom_smooth(colour = "#CA3E34", method = "loess", se = F) +
labs(x = expression(2*log(hat(mu))), y = "Standardized deviance residual") +
theme_minimal()
# standardized deviance residuals against participant ID
covid_glm_dev_gg <- ggplot(covid_glm_resids, aes(x = rowid, y = .std.resid)) +
geom_point(colour = "#01779c", alpha = 0.3, size = 1) +
geom_hline(colour = "#CA3E34", yintercept = 0) +
scale_x_continuous(breaks = seq(0, 1400, 200)) +
scale_y_continuous(breaks = seq(-5, 5, 1)) +
coord_cartesian(xlim = c(0, 1400)) +
labs(x = "Participant ID", y = "Standardized deviance residual") +
theme_minimal()
covid_glm_sq_dev_gg <- ggplot(covid_glm_resids, aes(x = rowid, y = (.std.resid)^2, colour = .resid < 0)) +
geom_point(alpha = 0.5, size = 1) +
scale_x_continuous(breaks = seq(0, 1400, 200)) +
scale_y_continuous(breaks = seq(0, 25, 5)) +
coord_cartesian(xlim = c(0, 1400)) +
labs(x = "Participant ID", y = "Standardized deviance residual") +
discovr::scale_color_senjutsu() +
theme_minimal() +
theme(legend.position = "none")- Note that this section only contains code.
- I have not printed tables/plots/output in this section. I have, however, created everything I need to print these tables and plots in the next section.
- Everything I do here has been previously justified in my analysis plan.
- My code is visible!
Results
Summary statistics and data visualization
Figure 1 shows that the outcome variable is highly positively skewed, as predicted. There is a very long tail indicative of people who experienced symptoms for most of the duration of the study. The distribution most resembles a \(\text{Gamma}(k = 2, \theta = 2)\) distribution, but is also a reasonable approximation of an exponential distribution (\(M(\lambda)\)). The main difference is a dip in frequencies to the left of the peak which is not characteristic of the exponential distribution. However, modelling this outcome with an exponential distribution is not an unreasonable choice.
duration_dtyFigure 2 (left) shows that the number of symptoms in the first 7 days is right-skewed; that is, as the number of symptoms increases so does the corresponding frequency. Figure 2 (right) shows that age is very approximately normal around a mean of 45. However, the distribution appears bimodal because the density drops in the range 30-40 years suggesting a lack of participants recruited within this age range.
symptom_bar + age_dtyThe generalized linear model using the logit link function assumes linearity of the logit. Figure 3 (left) shows the log mean symptom duration for each initial symptom count. The relationship between these variables is linear: symptom duration increases linearly as a function of initial symptoms (blue line) and this relationship is not better described by a quadratic trend (red line). Figure 3 (right) shows the log mean symptom duration at each age. The spread of log mean symptom duration varies as a function of age with there being more variability in symptom duration under the age of 30 compared to over 30. This pattern is indicative of potential heteroscedasticity, but this is not an assumption of a generalized linear model (because it uses MLE). Broadly speaking age has a positive linear relationship to the log mean symptom duration. In short, for both continuous predictors linearity of the logit can be assumed.
dur_symp_gg + dur_age_ggTable 1 shows that there were around double the numbers of males assigned at birth (n = 966) in the sample than females (n = 432). There were almost exactly equal numbers of participants with asthma (n = 698) as without (n = 700). From the table it looks as though symptom duration was lowest for males without asthma and highest for females with asthma. The distribution of symptom duration showed extreme positive skew and kurtosis for females with asthma.
knitr::kable(desc_tbl, digits = 2) |>
kableExtra::kable_styling(bootstrap_options = "striped")| Asthma | Birth sex | Mean | SD | IQR | Min | Max | Skewness | Kurtosis | n |
|---|---|---|---|---|---|---|---|---|---|
| No Asthma | Female | 35.97 | 8.44 | 11 | 20 | 64 | 0.58 | 0.24 | 210 |
| No Asthma | Male | 28.13 | 7.73 | 9 | 14 | 75 | 1.25 | 3.77 | 490 |
| Asthma | Female | 40.91 | 12.67 | 14 | 23 | 125 | 2.58 | 11.57 | 222 |
| Asthma | Male | 34.14 | 10.74 | 13 | 11 | 92 | 1.50 | 3.61 | 476 |
Statistical models
In terms of model fit (Table 2), the residual deviance was \(D\) = 85.65 with \(r\) = 1363. The ‘rule of thumb’ is that the model is a good fit if \(D \le r\). For this fitted model \(D\) is many times smaller than \(r\) suggesting that the model is likely to be a good fit. A formal test of the null hypothesis that the model is a good fit was extremely non-significant suggesting that the hypothesis that the model is a good fit is plausible, \(\chi^2\)(1363) = 85.65, p = 1.00. Also because \(\frac{D}{r}\) = 85.65/1363 = 0.06 < 2 overdispersion is not a problem.
knitr::kable(covid_fit, digits = 3) |>
kableExtra::kable_styling(bootstrap_options = "striped")| null.deviance | df.null | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|
| 125.301 | 1369 | -4798.457 | 9612.914 | 9654.694 | 85.649 | 1363 | 1370 |
Table 3 shows the model parameter estimates and Table 4 exponentiates these values (i.e applies the inverse link function) to express them in the raw units. These tables express the same information but in different units, so I will use only Table 4 to interpret the effects.
Symptom duration was significantly predicted by the birth_sex of the runner, \(\hat{\beta}\) = 0.73 [0.64, 0.82], z = -4.91, p < 0.001. Specifically, when all other explanatory variables are fixed, the mean symptom duration for males was 0.73 that for females (see note). Symptoms persisted for significantly longer in females. If this sample is one of the 95% for which the confidence interval contains the population value then symptom duration in men might be as small as 0.64 that of women or as large as 0.82 that of women. In real terms, men’s symptoms lasting around 25% shorter than females is a dramatic effect.
Symptom duration was significantly predicted by age, \(\hat{\beta}\) = 1.01 [1.00, 1.01], z = 3.98, p < 0.001. Specifically, when all other explanatory variables are fixed for every year older the symptom duration is multiplied by 1.005. If this sample is one of the 95% for which the confidence interval contains the population value then this multiplier might be as small as 1.00 or as large as 1.01. In real terms it is a small effect.
The adjusted mean symptom duration was significantly different across the asthma groups, \(\hat{\beta}\) = 1.74 [1.61, 1.89], z = 13.57, p < 0.001. Specifically, when all other explanatory variables are fixed, the mean symptom duration for those with asthma was 1.74 that of those without asthma. Symptoms persisted for significantly longer in asthma sufferers. If this sample is one of the 95% for which the confidence interval contains the population value then symptom duration in those with asthma might be as small as 1.61 that of people without asthma or as large as 1.89 that people without asthma. In real terms, if a non-asthma sufferer experienced symptoms for 30 days, then had they suffered from asthma their expected symptom duration would be 1.74 × 30 = 52 days, which is a dramatic difference.
The symptom duration was significantly predicted by the number of symptoms during the first 7 days of infection, \(\hat{\beta}\) = 1.03 [1.02, 1.04], z = 8.11, p < 0.001. Specifically, when all other explanatory variables are fixed for every additional symptom experienced in the first 7 days the overall symptom duration is multiplied by 1.031. If this sample is one of the 95% for which the confidence interval contains the population value then this multiplier might be as small as 1.02 or as large as 1.04. In real terms it is a small effect.
knitr::kable(covid_pe, digits = 3) |>
kableExtra::kable_styling(bootstrap_options = "striped")| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 3.117 | 0.060 | 51.834 | 0.00 | 3.003 | 3.231 |
| birth_sexMale | -0.320 | 0.065 | -4.910 | 0.00 | -0.445 | -0.196 |
| age | 0.005 | 0.001 | 3.980 | 0.00 | 0.003 | 0.008 |
| asthmaAsthma | 0.556 | 0.041 | 13.572 | 0.00 | 0.475 | 0.636 |
| symptoms | 0.031 | 0.004 | 8.110 | 0.00 | 0.023 | 0.038 |
| birth_sexMale:age | 0.002 | 0.001 | 1.311 | 0.19 | -0.001 | 0.005 |
| asthmaAsthma:symptoms | -0.052 | 0.005 | -9.871 | 0.00 | -0.063 | -0.042 |
knitr::kable(covid_exp, digits = 3) |>
kableExtra::kable_styling(bootstrap_options = "striped")| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 22.570 | 0.060 | 51.834 | 0.00 | 20.139 | 25.315 |
| birth_sexMale | 0.726 | 0.065 | -4.910 | 0.00 | 0.641 | 0.822 |
| age | 1.005 | 0.001 | 3.980 | 0.00 | 1.003 | 1.008 |
| asthmaAsthma | 1.743 | 0.041 | 13.572 | 0.00 | 1.609 | 1.889 |
| symptoms | 1.031 | 0.004 | 8.110 | 0.00 | 1.024 | 1.039 |
| birth_sexMale:age | 1.002 | 0.001 | 1.311 | 0.19 | 0.999 | 1.005 |
| asthmaAsthma:symptoms | 0.949 | 0.005 | -9.871 | 0.00 | 0.939 | 0.959 |
You wouldn’t put this in your own report, but to help you understand this model answer, when the model parameters in a generalized linear model are converted back to raw units this has implications for the interpretation of parameters. Our current model uses a log link function, so we’re predicted log(symptom duration). For the predictor birth_sex, let \(M\) be the mean symptom duration for males assigned at birth and \(F\) the mean symptom duration for females. The parameter estimate from the model is -0.32 and this value is the difference between the log(M) and log(F):
\[ \log(\text{M}) - \log(\text{F}) = -0.32. \]
The interpretation of this value is, therefore, that the log of the mean male symptom duration was 0.32 less than that for females. In other words, the parameter estimate represents a difference between log means.
No-one understands logs, so we can exponentiate the parameter estimate to express the effect in raw units; that is, symptom duration expressed as days rather than log(days). If we exponentiate the parameter estimate we need to also exponentiate the other (left hand) side of the equation, and to do this we also have to use some rules about logs to get the left-hand side in a form where we can exponentiate it:
\[ \begin{aligned} \log(\text{M}) - \log(\text{F}) &= -0.32 \\ \log\Bigg(\frac{\text{M}}{\text{F}}\Bigg) &= -0.32 \\ \frac{\text{M}}{\text{F}} &= \exp(-0.32) \\ \frac{\text{M}}{\text{F}} &= 0.726 \\ \end{aligned} \]
Note this value matches the one in Table 4. Therefore, the exponentiated parameter estimate does not represents the difference between mean symptom durations, but instead is a multiplier; that is the mean symptom duration for males was 0.73 times the mean symptom duration for females.
The rate of change of the symptom duration with respect to the initial symptoms significantly differed across those with and without asthma, \(\hat{\beta}\) = 0.95 [0.94, 0.96], z = -9.87, p < 0.001. Table 5 shows the results of simple slopes analysis, which are also shown in Figure 4. The simple slopes analysis showed that (1) for people without asthma, symptom duration was significantly predicted by the number of symptoms during the first 7 days of infection, \(\hat{\beta}\) = 0.03 [0.02, 0.04], t = 8.11, p < 0.001; (2) for people with asthma, symptom duration was also significantly predicted by the number of symptoms during the first 7 days of infection, \(\hat{\beta}\) = -0.02 [-0.03, -0.01], t = -5.55, p < 0.001. As Figure 4 shows, and noting that the simple slopes in Table 5 are expressed in log units, as for those without asthma, as the number of initial symptoms increased so did the log duration of overall symptoms; conversely, for those with asthma as the number of initial symptoms increased the log duration of overall symptoms decreased. Note though, that regardless of initial symptoms, symptom duration is always longer for those with asthma.
knitr::kable(symptoms_asthma_tbl, digits = 3) |>
kableExtra::kable_styling(bootstrap_options = "striped")| asthma | symptoms.trend | std.error | df | conf.low | conf.high | statistic | p.value |
|---|---|---|---|---|---|---|---|
| No Asthma | 0.031 | 0.004 | 1363 | 0.023 | 0.038 | 8.110 | 0 |
| Asthma | -0.021 | 0.004 | 1363 | -0.029 | -0.014 | -5.547 | 0 |
symptoms_asthma_ggThe rate of change of the symptom duration with respect to age did not significantly differ across those assigned at birth as male or female, \(\hat{\beta}\) = 1.00 [1.00, 1.00], z = 1.31, p = 0.190. This is evident in Figure 5 in which the slopes for the effect of age on symptom duration are very similar for males and females.
age_sex_ggModel assumptions
The Q-Q plot (fig-glm_qq) shows extreme deviation of deviance residuals from normal (the dots deviate from the diagonal line, especially for quantiles greater than 2). It is not reasonable to assume that the exponential distribution is a good model of the response variable.
glm_qqFigure 7 plot a transformation of the predicted values (\(2\log(\hat{\mu})\)) against the standardised deviance residuals. The dots appear in a fairly random pattern and the smooth summary line is fairly flat at around zero suggesting that linearity is a plausible assumption. Earlier, we noted that Figure 3 suggested that linearity of the logit could be assumed for continuous predictors.
covid_glm_zresid_ggFigure 8 plots the standardised deviance residuals against the case number (i.e. an index of the location of the data within the data frame). This acts as a pseudo-check of the assumption of independence. The two plots show no obvious pattern: ignoring outliers the vertical spread of dots is consistent across the range of index values. There are a lot of outliers but nevertheless these plots are reassuring.
covid_glm_dev_gg + covid_glm_sq_dev_ggSummary
To sum up, hypothesis 1 was supported: symptom severity was significantly predicted by the age of participants, their sex assigned at birth, the number of initial symptoms and their asthma status. Asthmatic patients and female patients experienced significantly longer symptom durations than male and non-asthmatic patients respectively. In practical terms these differences were quite substantial. Although symptom severity was also significantly predicted by the number of initial symptoms and age, these effects were, in practical terms, fairly weak.
The model did support hypothesis 2, that the effect asthma will diminish with increasing initial symptoms. The interaction between asthma status and initial symptoms was significant and Figure 4 shows that with low numbers of initial symptoms symptom duration is much higher for people with asthma than those without but this difference has all but gone with 10 initial symptoms.
The model did not support hypothesis 3, that differences between those assigned as male at birth and those assigned as female will diminish with age. The interaction between sex assignment and age was not significant and Figure 5 shows clearly that the trajectory of symptom duration as a function of age is very similar in those assigned at birth as male or female.
Overall, the model was a good fit and the assumptions of linearity and independence are reasonable. However, deviance residuals did not appear to be normally distributed, which may indicate that it is not reasonable to model the response variable with an exponential distribution. There were also a large number of extreme deviance residuals. This being so, it could be argued we have fitted an incorrect model and should disregard the results. In real life it would be reasonable to try re-fitting the models using a \(\text{Gamma}(k = 2, \theta = 2)\) distribution, but that is beyond the scope of this module.
Note that there is hardly any code in this section. The only code i have used is to print the tables/plots that I created in the previous section.
- I have used
kable(my_table, digits = x)to print tables created and stored in the coding section and round the values within them. - There’s no duplication of code here from the coding section, so I don’t suddenly produce a plot from scratch, I only ever use objects I created in the previous section.
- My code is visible!
- Each table and plot uses a unique code chunk so that I can use things like
#| fig-cap: my captionand#| label: fig-duration_dtyto create captions and cross reference the tables in the text. - I have followed through my analysis plan interpreting everything in order as I go.
- I haven’t simply commented on significance, I have interpreted parameter estimates, confidence intervals and reflecting on what their size means in the context of the example.
- I have written a succinct summary that relates everything back to the original hypotheses.