Smart Alex solutions Chapter 7
This document contains abridged sections from Discovering Statistics Using R and RStudio by Andy Field so there are some copyright considerations. You can use this material for teaching and non-profit activities but please do not meddle with it or claim it as your own work. See the full license terms at the bottom of the page.
Task 7.1
A student was interested in whether there was a positive relationship between the time spent doing an essay and the mark received. He got 45 of his friends and timed how long they spent writing an essay (hours) and the percentage they got in the essay (essay). He also translated these grades into their degree classifications (grade): in the UK, a student can get a first-class mark (the best), an upper-second-class mark, a lower second, a third, a pass or a fail (the worst). Using the data in the file essay_marks.csv find out what the relationship was between the time spent doing an essay and the eventual mark in terms of percentage and degree class (draw a scatterplot too).
Load the data
essay_tib <- readr::read_csv("../data/essay_marks.csv") %>%
dplyr::mutate(
grade = forcats::as_factor(grade)
)
Alternative, load the data directly from the discovr
package:
essay_tib <- discovr::essay_marks
Visualise the data
We’re interested in looking at the relationship between hours spent on an essay and the grade obtained. We could create a scatterplot of hours spent on the essay (x-axis) and essay mark (y-axis). I’ve chosen to highlight the degree classification grades using different colours.
ggplot2::ggplot(essay_tib, aes(hours, essay, colour = grade)) +
geom_point() +
geom_smooth(method = "lm", alpha = 0.1) +
coord_cartesian(xlim = c(0, 15), ylim = c(20, 80)) +
scale_x_continuous(breaks = seq(0, 15, 1)) +
scale_y_continuous(breaks = seq(20, 80, 5)) +
labs(x = "Time spent on essay (hours)", y = "Essay mark (%)", colour = "Grade", fill = "Grade") +
theme_minimal()
Correlations
essay_tib %>%
dplyr::select(hours, essay) %>%
correlation::correlation() %>%
knitr::kable(digits = 3)
Parameter1 | Parameter2 | r | CI_low | CI_high | t | df | p | Method | n_Obs |
---|---|---|---|---|---|---|---|---|---|
hours | essay | 0.267 | -0.029 | 0.52 | 1.814 | 43 | 0.077 | Pearson | 45 |
The results in the table above indicate that the relationship between time spent writing an essay and grade awarded was not significant, Pearson’s r = 0.27, p = 0.077.
The second part of the question asks us to do the same analysis but when the percentages are recoded into degree classifications. The degree classifications are ordinal data (not interval): they are ordered categories. So we shouldn’t use Pearson’s test statistic, but Spearman’s and Kendall’s ones instead. Grade is a factor:
essay_tib$grade
## [1] Upper second class First class Third class First class
## [5] Lower second class Upper second class Upper second class Upper second class
## [9] Upper second class Third class Upper second class First class
## [13] First class Lower second class Lower second class Upper second class
## [17] Upper second class Upper second class Upper second class Lower second class
## [21] Lower second class First class Upper second class Upper second class
## [25] First class Upper second class Lower second class Lower second class
## [29] Lower second class First class Upper second class Upper second class
## [33] Upper second class First class First class Upper second class
## [37] Upper second class Upper second class Upper second class Upper second class
## [41] Upper second class Upper second class Lower second class Lower second class
## [45] First class
## Levels: First class Upper second class Lower second class Third class
We need to convert it into numbers using as.numeric
:
essay_tib <- essay_tib %>%
dplyr::mutate(
grade_num = as.numeric(grade)
)
Let’s check the numeric grade matches the description (i.e. first = 1, upper second = 2 etc.):
id | essay | hours | grade | grade_num |
---|---|---|---|---|
qxbcjg | 61.68 | 10.63 | Upper second class | 2 |
csykgu | 69.55 | 7.29 | First class | 1 |
npxgpl | 48.23 | 5.05 | Third class | 4 |
gomiwl | 70.68 | 2.89 | First class | 1 |
qdywaa | 59.90 | 9.55 | Lower second class | 3 |
cmaprb | 61.16 | 11.31 | Upper second class | 2 |
nrvhix | 67.62 | 7.47 | Upper second class | 2 |
ybytlf | 64.78 | 8.47 | Upper second class | 2 |
talwvg | 63.21 | 8.72 | Upper second class | 2 |
ceylea | 49.69 | 6.20 | Third class | 4 |
berjky | 63.72 | 4.77 | Upper second class | 2 |
cjxgpl | 72.24 | 10.98 | First class | 1 |
gufenh | 70.59 | 5.22 | First class | 1 |
uaxxdm | 58.64 | 6.86 | Lower second class | 3 |
unnhpl | 58.72 | 9.80 | Lower second class | 3 |
Now the correlation:
essay_tib %>%
dplyr::select(hours, grade_num) %>%
correlation::correlation(method = "spearman") %>%
knitr::kable(digits = 3)
Parameter1 | Parameter2 | rho | CI_low | CI_high | S | p | Method | n_Obs |
---|---|---|---|---|---|---|---|---|
hours | grade_num | -0.193 | -0.461 | 0.106 | 18110.93 | 0.204 | Spearman | 45 |
There was no significant relationship between degree grade classification for an essay and the time spent doing it, $ \rho_s = -0.19 $, p = 0.204. Note that the direction of the relationship has reversed. This has happened because the essay marks were recoded as 1 (first), 2 (upper second), 3 (lower second), and 4 (third), so high grades were represented by low numbers.
Task 7.2
Using the notebook.csv data from Chapter 3, find out if the size of relationship between the participant’s sex and arousal.
Load the data
library(tidyverse)
notebook_tib <- readr::read_csv("../data/essay_marks.csv") %>%
dplyr::mutate(
sex = forcats::as_factor(sex),
film = forcats::as_factor(film)
)
Alternative, load the data directly from the discovr
package:
notebook_tib <- discovr::notebook
Estimate the correlation
Sex is a categorical variable with two categories, therefore, we need to quantify this relationship using a point-biserial correlation.
notebook_tib %>%
dplyr::mutate(
sex_bin = ifelse(sex == "Male", 0, 1),
) %>%
dplyr::select(arousal, sex_bin) %>%
correlation::correlation() %>%
knitr::kable(digits = 3)
Parameter1 | Parameter2 | r | CI_low | CI_high | t | df | p | Method | n_Obs |
---|---|---|---|---|---|---|---|---|---|
arousal | sex_bin | -0.196 | -0.478 | 0.123 | -1.232 | 38 | 0.226 | Pearson | 40 |
I used a two-tailed test because one-tailed tests should never really be used. There was no significant relationship between biological sex and arousal because the p-value is larger than 0.05, \(r_\text{pb}\)
= –0.20, p = 0.226.
Task 7.3
Using the notebook data again, quantify the relationship between the film watched and arousal.
notebook_tib %>%
dplyr::mutate(
film_bin = ifelse(film == "The notebook", 0, 1),
) %>%
dplyr::select(arousal, film_bin) %>%
correlation::correlation() %>%
knitr::kable(digits = 3)
Parameter1 | Parameter2 | r | CI_low | CI_high | t | df | p | Method | n_Obs |
---|---|---|---|---|---|---|---|---|---|
arousal | film_bin | -0.865 | -0.927 | -0.757 | -10.606 | 38 | 0 | Pearson | 40 |
There was a significant relationship between the film watched and arousal, \(r_\text{pb}\)
= –0.86, p < 0.001. Looking at how the groups were coded, you should see that The Notebook had a code of 0, and the documentary about notebooks had a code of 1, therefore the negative coefficient reflects the fact that as film goes up (changes from 0 to 1) arousal goes down. Put another way, as the film changes from The Notebook to a documentary about notebooks, arousal decreases. So The Notebook gave rise to the greater arousal levels.
Task 7.4
As a statistics lecturer I am interested in the factors that determine whether a student will do well on a statistics course. Imagine I took 25 students and looked at their grades for my statistics course at the end of their first year at university: first, upper second, lower second and third class (see Task 1). I also asked these students what grade they got in their high school maths exams. In the UK GCSEs are school exams taken at age 16 that are graded A, B, C, D, E or F (an A grade is the best). The data for this study are in the file grades.csv. To what degree does GCSE maths grade correlate with first-year statistics grade?
Load the data
The categories in the data file have a meaningful order, so after importing the data we transform stats and gcse to factors but also order the factor levels in the correct order (this is important!).
grade_tib <- readr::read_csv("../data/grades.csv") %>%
dplyr::mutate(
stats = forcats::as_factor(stats) %>% forcats::fct_relevel("First class", "Upper second class", "Lower second class", "Third class", "Pass", "Fail"),
gcse = forcats::as_factor(gcse) %>% forcats::fct_relevel("A", "B", "C", "D", "E", "F")
)
Alternative, load the data directly from the discovr
package:
grade_tib <- discovr::grades
Estimate the correlation
Let’s look at these variables. In the UK, GCSEs are school exams taken at age 16 that are graded A, B, C, D, E or F. These grades are categories that have an order of importance (an A grade is better than all of the lower grades). In the UK, a university student can get a first-class mark, an upper second, a lower second, a third, a pass or a fail. These grades are categories, but they have an order to them (an upper second is better than a lower second). When you have categories like these that can be ordered in a meaningful way, the data are said to be ordinal. The data are not interval, because a first-class degree encompasses a 30% range (70–100%), whereas an upper second only covers a 10% range (60–70%). When data have been measured at only the ordinal level they are said to be non-parametric and Pearson’s correlation is not appropriate. Therefore, the Spearman correlation coefficient is used.
In the file, the scores are in two columns: one labelled stats and one labelled gcse.
Table: Table 1: First 10 rows of grade_tib
id | stats | gcse |
---|---|---|
wbmew | First class | A |
ypusy | First class | A |
sbumv | First class | C |
wqicp | Upper second class | C |
xnobi | Upper second class | C |
glxdw | Upper second class | D |
eomcn | Upper second class | E |
bjhcl | Lower second class | A |
wcype | Lower second class | B |
grkvu | Lower second class | C |
The first thing we need to do is to convert stats and gcse to numbers using as.numeric
, which will return the numeric value attached to each factor level:
grade_tib <- grade_tib %>%
dplyr::mutate(
stat_num = as.numeric(stats),
gcse_num = as.numeric(gcse)
)
Let’s check the numeric grade matches the description (i.e. first = 1, upper second = 2 etc.):
Table: Table 2: First 10 rows of grade_tib
id | stats | gcse | stat_num | gcse_num |
---|---|---|---|---|
wbmew | First class | A | 1 | 1 |
ypusy | First class | A | 1 | 1 |
sbumv | First class | C | 1 | 3 |
wqicp | Upper second class | C | 2 | 3 |
xnobi | Upper second class | C | 2 | 3 |
glxdw | Upper second class | D | 2 | 4 |
eomcn | Upper second class | E | 2 | 5 |
bjhcl | Lower second class | A | 3 | 1 |
wcype | Lower second class | B | 3 | 2 |
grkvu | Lower second class | C | 3 | 3 |
Now the correlation:
grade_tib %>%
dplyr::select(stat_num, gcse_num) %>%
correlation::correlation(method = "spearman") %>%
knitr::kable(digits = 3)
Parameter1 | Parameter2 | rho | CI_low | CI_high | S | p | Method | n_Obs |
---|---|---|---|---|---|---|---|---|
stat_num | gcse_num | 0.455 | 0.072 | 0.72 | 1418.035 | 0.022 | Spearman | 25 |
In the question I predicted that better grades in GCSE maths would correlate with better degree grades for my statistics course. This hypothesis is directional and so a one-tailed test could be selected; however, in the chapter I advised against one-tailed tests so I have done two-tailed.
We could report these results as follows:
- There was a significant positive relationship between a person’s statistics grade and their GCSE maths grade, $ r_\text{s} $ = 0.45, *p* = 0.022.
Task 7.5
In Figure 3.4 we saw some data relating to people’s ratings of dishonest acts and the likeableness of the perpetrator (for a full description see Jane Superbrain Box 3.1). Compute the Spearman correlation between ratings of dishonesty and likeableness of the perpetrator. The data are in honesty_lab.csv.
Load the data
honesty_tib <- readr::read_csv("../data/honesty_lab.csv") %>%
dplyr::mutate(
sex = forcats::as_factor(sex),
film = forcats::as_factor(film)
)
Alternative, load the data directly from the discovr
package:
honesty_tib <- discovr::honesty_lab
Estimate the correlation
honesty_tib %>%
dplyr::select(deed, likeableness) %>%
correlation::correlation(method = "spearman") %>%
knitr::kable(digits = 3)
Parameter1 | Parameter2 | rho | CI_low | CI_high | S | p | Method | n_Obs |
---|---|---|---|---|---|---|---|---|
deed | likeableness | 0.844 | 0.776 | 0.892 | 26054.24 | 0 | Spearman | 100 |
The relationship between ratings of dishonesty and likeableness of the perpetrator was significant because the p-value is less than 0.05 (p = 0.000). The value of Spearman’s correlation coefficient is quite large and positive (0.84), indicating a large positive effect: the more likeable the perpetrator was, the more positively their dishonest acts were viewed.
We could report the results as follows:
- There was a large and significant positive relationship between the likeableness of a perpetrator and how positively their dishonest acts were viewed,
\(r_\text{s}\)
= 0.84, p < 0.001.
Task 7.6
In Chapter 1 (Task 6) we looked at data from people who had been forced to marry goats and dogs and measured their life satisfaction and, also, how much they like animals (goat_or_dog.csv). Is there a significant correlation between life satisfaction and the type of animal to which a person was married
Load the data
goat_tib <- readr::read_csv("../data/goat_or_dog.csv") %>%
dplyr::mutate(
wife = forcats::as_factor(wife)
)
Alternative, load the data directly from the discovr
package:
goat_tib <- discovr::animal_bride
Estimate the correlation
Wife is a categorical variable with two categories (goat or dog). Therefore, we need to look at this relationship using a point-biserial correlation.
goat_tib <- goat_tib %>%
dplyr::mutate(
wife_bin = ifelse(wife == "Goat", 0, 1),
)
goat_tib %>%
dplyr::select(life_satisfaction, wife_bin) %>%
correlation::correlation() %>%
knitr::kable(digits = 3)
Parameter1 | Parameter2 | r | CI_low | CI_high | t | df | p | Method | n_Obs |
---|---|---|---|---|---|---|---|---|---|
life_satisfaction | wife_bin | 0.63 | 0.261 | 0.839 | 3.446 | 18 | 0.003 | Pearson | 20 |
I used a two-tailed test because one-tailed tests should never really be used (see book chapter for more explanation). As you can see there, was a significant relationship between type of animal wife and life satisfaction because our p-value is less than 0.05, \(r_\text{pb}\)
= 0.63, p = 0.003. Looking at how the groups were coded, you should see that goat had a code of 0 and dog had a code of 1, therefore this result reflects the fact that as wife goes up (changes from 0 to 1) life satisfaction goes up. Put another way, as wife changes from goat to dog, life satisfaction increases. So, being married to a dog was associated with greater life satisfaction.
Task 7.7
Repeat the analysis above taking account of animal liking when computing the correlation between life satisfaction and the animal to which a person was married.
We can conduct a partial correlation between life satisfaction and the animal to which a person was married while ‘adjusting’ for the effect of liking animals. Remember that we added the variable wife_bin
to goat_tib
in the previous solution.
goat_tib %>%
dplyr::select(-wife) %>%
correlation::correlation(partial = TRUE) %>%
knitr::kable(digits = 3)
Parameter1 | Parameter2 | r | CI_low | CI_high | t | df | p | Method | n_Obs |
---|---|---|---|---|---|---|---|---|---|
animal | life_satisfaction | 0.615 | 0.236 | 0.831 | 3.306 | 18 | 0.008 | Pearson | 20 |
animal | wife_bin | -0.399 | -0.715 | 0.053 | -1.846 | 18 | 0.081 | Pearson | 20 |
life_satisfaction | wife_bin | 0.701 | 0.375 | 0.873 | 4.173 | 18 | 0.002 | Pearson | 20 |
Notice that the partial correlation between wife and life satisfaction is 0.70, which is greater than the correlation when the effect of animal liking is not adjusted for (r = 0.630 from the previous task). The correlation has become more statistically significant (its p-value has decreased from 0.003 to 0.002..)
Running this analysis has shown us that type of wife alone explains a large portion of the variation in life satisfaction. In other words, the relationship between wife and life satisfaction is not due to animal liking.
Task 7.8
In Chapter 1 (Task 7) we looked at data based on findings that the number of cups of tea drunk was related to cognitive functioning (Feng et al., 2010). The data are in the file tea_makes_you_brainy_15.csv. What is the correlation between tea drinking and cognitive functioning? Is there a significant effect?
Load the data
tea15_tib <- readr::read_csv("../data/tea_makes_you_brainy_15.csv")
Alternative, load the data directly from the discovr
package:
tea15_tib <- discovr::tea_15
Estimate the correlation
Because the numbers of cups of tea and cognitive function are both interval variables, we can conduct a Pearson’s correlation coefficient.
tea15_tib %>%
dplyr::select(tea, cog_fun) %>%
correlation::correlation() %>%
knitr::kable(digits = 3)
Parameter1 | Parameter2 | r | CI_low | CI_high | t | df | p | Method | n_Obs |
---|---|---|---|---|---|---|---|---|---|
tea | cog_fun | 0.078 | -0.453 | 0.567 | 0.281 | 13 | 0.783 | Pearson | 15 |
I chose a two-tailed test because it is never really appropriate to conduct a one-tailed test (see the book chapter). The results in the table above indicate that the relationship between number of cups of tea drunk per day and cognitive function was not significant. We can tell this because our p-value is greater than 0.05. Pearson’s r = 0.08, p = 0.783.
Task 7.9
The research in the previous task was replicated but in a larger sample (N = 716), which is the same as the sample size in Feng et al.’s research (tea_makes_you_brainy_716.csv). Conduct a correlation between tea drinking and cognitive functioning. Compare the correlation coefficient and significance in this large sample, with the previous task. What statistical point do the results illustrate?
Load the data
tea716_tib <- readr::read_csv("../data/tea_makes_you_brainy_716.csv")
Alternative, load the data directly from the discovr
package:
tea716_tib <- discovr::tea_716
Estimate the correlation
tea716_tib %>%
dplyr::select(tea, cog_fun) %>%
correlation::correlation() %>%
knitr::kable(digits = 3)
Parameter1 | Parameter2 | r | CI_low | CI_high | t | df | p | Method | n_Obs |
---|---|---|---|---|---|---|---|---|---|
tea | cog_fun | 0.078 | 0.004 | 0.15 | 2.081 | 714 | 0.038 | Pearson | 716 |
We can see that although the value of Pearson’s r has not changed, it is still very small (0.08), the relationship between the number of cups of tea drunk per day and cognitive function is now just significant (p = 0.038). This example indicates one of the downfalls of significance testing; you can get significant results when you have large sample sizes even if the effect is very small. Basically, whether you get a significant result or not is entirely subject to the sample size.
Task 7.10
In Chapter 6 we looked at hygiene scores over three days of a rock music festival (download_festival.csv). Using Spearman’s correlation, were hygiene scores on day 1 of the festival significantly correlated with those on day 3?
Load the data
download_tib <- readr::read_csv("../data/download_festival.csv") %>%
dplyr::mutate(
gender = forcats::as_factor(gender)
)
Alternative, load the data directly from the discovr
package:
download_tib <- discovr::download
Estimate the correlation
First, we need to make the data messy so that scores for different days are in different columns (rather than different rows):
download_tib %>%
dplyr::select(day_1, day_3) %>%
correlation::correlation(method = "spearman") %>%
knitr::kable(digits = 3)
Parameter1 | Parameter2 | rho | CI_low | CI_high | S | p | Method | n_Obs |
---|---|---|---|---|---|---|---|---|
day_1 | day_3 | 0.344 | 0.178 | 0.491 | 203307.3 | 0 | Spearman | 123 |
The hygiene scores on day 1 of the festival correlated significantly with hygiene scores on day 3. The value of Spearman’s correlation coefficient is 0.344, which is a positive value suggesting that the smellier you are on day 1, the smellier you will be on day 3, \(r_\text{s}\)
= 0.34, p < 0.001.
Task 7.11
Using the data in shopping_exercise.csv (Chapter 1, Task 5), find out if there is a significant relationship between the time spent shopping and the distance covered.
Load the data
shopping_tib <- readr::read_csv("../data/shopping_exercise.csv") %>%
dplyr::mutate(
sex = forcats::as_factor(sex)
)
Alternative, load the data directly from the discovr
package:
shopping_tib <- discovr::shopping
Estimate the correlation
The variables Time and Distance are both interval. Therefore, we can conduct a Pearson’s correlation. I chose a two-tailed test because it is never really appropriate to conduct a one-tailed test (see the book chapter).
shopping_tib %>%
dplyr::select(distance, time) %>%
correlation::correlation(method = "spearman") %>%
knitr::kable(digits = 3)
Parameter1 | Parameter2 | rho | CI_low | CI_high | S | p | Method | n_Obs |
---|---|---|---|---|---|---|---|---|
distance | time | 0.83 | 0.421 | 0.959 | 28 | 0.003 | Spearman | 10 |
The output indicates that there was a significant positive relationship between time spent shopping and distance covered. We can tell that the relationship was significant because the p-value is smaller than 0.05. Also, our value for Pearson’s r is very large (0.83) indicating a large effect. Pearson’s r = 0.83, p = 0.003.
Task 7.12
What effect does accounting for the participant’s biological sex have on the relationship between the time spent shopping and the distance covered?
To answer this question, we need to conduct a partial correlation between the time spent shopping (interval variable) and the distance covered (interval variable) while ‘adjusting’ for the effect of sex (dicotomous variable).
First, let’s turn sex into a numeric variable:
shopping_tib <- shopping_tib %>%
dplyr::mutate(
sex_bin = ifelse(sex == "Male", 0, 1),
)
The data look like this:
Table: Table 3: First 10 rows of shopping_tib
sex | distance | time | sex_bin |
---|---|---|---|
Male | 0.16 | 15 | 0 |
Male | 0.40 | 30 | 0 |
Male | 1.36 | 37 | 0 |
Male | 1.99 | 65 | 0 |
Male | 3.61 | 103 | 0 |
Female | 1.40 | 22 | 1 |
Female | 1.81 | 140 | 1 |
Female | 1.96 | 160 | 1 |
Female | 3.02 | 183 | 1 |
Female | 4.82 | 245 | 1 |
Now lets run the partial correlation:
shopping_tib %>%
dplyr::select(-sex) %>%
correlation::correlation(partial = TRUE) %>%
knitr::kable(digits = 3)
Parameter1 | Parameter2 | r | CI_low | CI_high | t | df | p | Method | n_Obs |
---|---|---|---|---|---|---|---|---|---|
distance | time | 0.820 | 0.395 | 0.956 | 4.06 | 8 | 0.011 | Pearson | 10 |
distance | sex_bin | -0.351 | -0.803 | 0.358 | -1.06 | 8 | 0.320 | Pearson | 10 |
time | sex_bin | 0.644 | 0.024 | 0.906 | 2.38 | 8 | 0.089 | Pearson | 10 |
The partial correlation between time and distance is 0.82, which is slightly smaller than the correlation when the effect of sex is not adjusted for (r = 0.830). The correlation has become slightly less statistically significant (its p-value has increased from 0.003 to 0.011). However, not a lot has changed, so even after adjusting for biological sex, the relationship between distance covered and time spent shopping is very strong.