Smart Alex solutions Chapter 7

Smart Alex charatcer from Discovering Statistics using R and RStudio

This document contains abridged sections from Discovering Statistics Using R and RStudio by Andy Field so there are some copyright considerations. You can use this material for teaching and non-profit activities but please do not meddle with it or claim it as your own work. See the full license terms at the bottom of the page.

Task 7.1

A student was interested in whether there was a positive relationship between the time spent doing an essay and the mark received. He got 45 of his friends and timed how long they spent writing an essay (hours) and the percentage they got in the essay (essay). He also translated these grades into their degree classifications (grade): in the UK, a student can get a first-class mark (the best), an upper-second-class mark, a lower second, a third, a pass or a fail (the worst). Using the data in the file essay_marks.csv find out what the relationship was between the time spent doing an essay and the eventual mark in terms of percentage and degree class (draw a scatterplot too).

Load the data

essay_tib <- readr::read_csv("../data/essay_marks.csv") %>% 
  dplyr::mutate(
    grade = forcats::as_factor(grade)
  )

Alternative, load the data directly from the discovr package:

essay_tib <- discovr::essay_marks

Visualise the data

We’re interested in looking at the relationship between hours spent on an essay and the grade obtained. We could create a scatterplot of hours spent on the essay (x-axis) and essay mark (y-axis). I’ve chosen to highlight the degree classification grades using different colours.

ggplot2::ggplot(essay_tib, aes(hours, essay, colour = grade)) +
  geom_point() +
  geom_smooth(method = "lm", alpha = 0.1) +
  coord_cartesian(xlim = c(0, 15), ylim = c(20, 80)) +
  scale_x_continuous(breaks = seq(0, 15, 1)) +
  scale_y_continuous(breaks = seq(20, 80, 5)) +
  labs(x = "Time spent on essay (hours)", y = "Essay mark (%)", colour = "Grade", fill = "Grade") +
  theme_minimal()

Correlations

essay_tib %>% 
  dplyr::select(hours, essay) %>%
  correlation::correlation() %>% 
  knitr::kable(digits = 3)

Parameter1	Parameter2	r	CI_low	CI_high	t	df	p	Method	n_Obs
hours	essay	0.267	-0.029	0.52	1.814	43	0.077	Pearson	45

The results in the table above indicate that the relationship between time spent writing an essay and grade awarded was not significant, Pearson’s r = 0.27, p = 0.077.

The second part of the question asks us to do the same analysis but when the percentages are recoded into degree classifications. The degree classifications are ordinal data (not interval): they are ordered categories. So we shouldn’t use Pearson’s test statistic, but Spearman’s and Kendall’s ones instead. Grade is a factor:

essay_tib$grade

##  [1] Upper second class First class        Third class        First class       
##  [5] Lower second class Upper second class Upper second class Upper second class
##  [9] Upper second class Third class        Upper second class First class       
## [13] First class        Lower second class Lower second class Upper second class
## [17] Upper second class Upper second class Upper second class Lower second class
## [21] Lower second class First class        Upper second class Upper second class
## [25] First class        Upper second class Lower second class Lower second class
## [29] Lower second class First class        Upper second class Upper second class
## [33] Upper second class First class        First class        Upper second class
## [37] Upper second class Upper second class Upper second class Upper second class
## [41] Upper second class Upper second class Lower second class Lower second class
## [45] First class       
## Levels: First class Upper second class Lower second class Third class

We need to convert it into numbers using as.numeric:

essay_tib <- essay_tib %>% 
  dplyr::mutate(
    grade_num = as.numeric(grade)
    )

Let’s check the numeric grade matches the description (i.e. first = 1, upper second = 2 etc.):

id	essay	hours	grade	grade_num
qxbcjg	61.68	10.63	Upper second class	2
csykgu	69.55	7.29	First class	1
npxgpl	48.23	5.05	Third class	4
gomiwl	70.68	2.89	First class	1
qdywaa	59.90	9.55	Lower second class	3
cmaprb	61.16	11.31	Upper second class	2
nrvhix	67.62	7.47	Upper second class	2
ybytlf	64.78	8.47	Upper second class	2
talwvg	63.21	8.72	Upper second class	2
ceylea	49.69	6.20	Third class	4
berjky	63.72	4.77	Upper second class	2
cjxgpl	72.24	10.98	First class	1
gufenh	70.59	5.22	First class	1
uaxxdm	58.64	6.86	Lower second class	3
unnhpl	58.72	9.80	Lower second class	3

Now the correlation:

essay_tib %>% 
  dplyr::select(hours, grade_num) %>%
  correlation::correlation(method = "spearman") %>% 
  knitr::kable(digits = 3)

Parameter1	Parameter2	rho	CI_low	CI_high	S	p	Method	n_Obs
hours	grade_num	-0.193	-0.461	0.106	18110.93	0.204	Spearman	45

There was no significant relationship between degree grade classification for an essay and the time spent doing it, $ \rho_s = -0.19 $, p = 0.204. Note that the direction of the relationship has reversed. This has happened because the essay marks were recoded as 1 (first), 2 (upper second), 3 (lower second), and 4 (third), so high grades were represented by low numbers.

Task 7.2

Using the notebook.csv data from Chapter 3, find out if the size of relationship between the participant’s sex and arousal.

Load the data

library(tidyverse)
notebook_tib <- readr::read_csv("../data/essay_marks.csv") %>% 
  dplyr::mutate(
    sex = forcats::as_factor(sex),
    film = forcats::as_factor(film)
  )

Alternative, load the data directly from the discovr package:

notebook_tib <- discovr::notebook

Estimate the correlation

Sex is a categorical variable with two categories, therefore, we need to quantify this relationship using a point-biserial correlation.

notebook_tib %>% 
  dplyr::mutate(
    sex_bin = ifelse(sex == "Male", 0, 1),
  ) %>% 
  dplyr::select(arousal, sex_bin) %>%
  correlation::correlation() %>% 
  knitr::kable(digits = 3)

Parameter1	Parameter2	r	CI_low	CI_high	t	df	p	Method	n_Obs
arousal	sex_bin	-0.196	-0.478	0.123	-1.232	38	0.226	Pearson	40

I used a two-tailed test because one-tailed tests should never really be used. There was no significant relationship between biological sex and arousal because the p-value is larger than 0.05, $r_\text{pb}$ = –0.20, p = 0.226.

Task 7.3

Using the notebook data again, quantify the relationship between the film watched and arousal.

notebook_tib %>% 
  dplyr::mutate(
    film_bin = ifelse(film == "The notebook", 0, 1),
  ) %>% 
  dplyr::select(arousal, film_bin) %>%
  correlation::correlation() %>%
  knitr::kable(digits = 3)

Parameter1	Parameter2	r	CI_low	CI_high	t	df	p	Method	n_Obs
arousal	film_bin	-0.865	-0.927	-0.757	-10.606	38	0	Pearson	40

There was a significant relationship between the film watched and arousal, $r_\text{pb}$ = –0.86, p < 0.001. Looking at how the groups were coded, you should see that The Notebook had a code of 0, and the documentary about notebooks had a code of 1, therefore the negative coefficient reflects the fact that as film goes up (changes from 0 to 1) arousal goes down. Put another way, as the film changes from The Notebook to a documentary about notebooks, arousal decreases. So The Notebook gave rise to the greater arousal levels.

Task 7.4

As a statistics lecturer I am interested in the factors that determine whether a student will do well on a statistics course. Imagine I took 25 students and looked at their grades for my statistics course at the end of their first year at university: first, upper second, lower second and third class (see Task 1). I also asked these students what grade they got in their high school maths exams. In the UK GCSEs are school exams taken at age 16 that are graded A, B, C, D, E or F (an A grade is the best). The data for this study are in the file grades.csv. To what degree does GCSE maths grade correlate with first-year statistics grade?

Load the data

The categories in the data file have a meaningful order, so after importing the data we transform stats and gcse to factors but also order the factor levels in the correct order (this is important!).

grade_tib <- readr::read_csv("../data/grades.csv") %>% 
  dplyr::mutate(
    stats = forcats::as_factor(stats) %>% forcats::fct_relevel("First class", "Upper second class", "Lower second class", "Third class", "Pass", "Fail"),
    gcse = forcats::as_factor(gcse) %>% forcats::fct_relevel("A", "B", "C", "D", "E", "F")
  )

Alternative, load the data directly from the discovr package:

grade_tib <- discovr::grades

Estimate the correlation

Let’s look at these variables. In the UK, GCSEs are school exams taken at age 16 that are graded A, B, C, D, E or F. These grades are categories that have an order of importance (an A grade is better than all of the lower grades). In the UK, a university student can get a first-class mark, an upper second, a lower second, a third, a pass or a fail. These grades are categories, but they have an order to them (an upper second is better than a lower second). When you have categories like these that can be ordered in a meaningful way, the data are said to be ordinal. The data are not interval, because a first-class degree encompasses a 30% range (70–100%), whereas an upper second only covers a 10% range (60–70%). When data have been measured at only the ordinal level they are said to be non-parametric and Pearson’s correlation is not appropriate. Therefore, the Spearman correlation coefficient is used.

In the file, the scores are in two columns: one labelled stats and one labelled gcse.

Table: Table 1: First 10 rows of grade_tib

id	stats	gcse
wbmew	First class	A
ypusy	First class	A
sbumv	First class	C
wqicp	Upper second class	C
xnobi	Upper second class	C
glxdw	Upper second class	D
eomcn	Upper second class	E
bjhcl	Lower second class	A
wcype	Lower second class	B
grkvu	Lower second class	C

The first thing we need to do is to convert stats and gcse to numbers using as.numeric, which will return the numeric value attached to each factor level:

grade_tib <- grade_tib %>% 
  dplyr::mutate(
    stat_num = as.numeric(stats),
    gcse_num = as.numeric(gcse)
    )

Let’s check the numeric grade matches the description (i.e. first = 1, upper second = 2 etc.):

Table: Table 2: First 10 rows of grade_tib

id	stats	gcse	stat_num	gcse_num
wbmew	First class	A	1	1
ypusy	First class	A	1	1
sbumv	First class	C	1	3
wqicp	Upper second class	C	2	3
xnobi	Upper second class	C	2	3
glxdw	Upper second class	D	2	4
eomcn	Upper second class	E	2	5
bjhcl	Lower second class	A	3	1
wcype	Lower second class	B	3	2
grkvu	Lower second class	C	3	3

Now the correlation:

grade_tib %>% 
  dplyr::select(stat_num, gcse_num) %>%
  correlation::correlation(method = "spearman") %>% 
  knitr::kable(digits = 3)

Parameter1	Parameter2	rho	CI_low	CI_high	S	p	Method	n_Obs
stat_num	gcse_num	0.455	0.072	0.72	1418.035	0.022	Spearman	25

In the question I predicted that better grades in GCSE maths would correlate with better degree grades for my statistics course. This hypothesis is directional and so a one-tailed test could be selected; however, in the chapter I advised against one-tailed tests so I have done two-tailed.

We could report these results as follows:

There was a significant positive relationship between a person’s statistics grade and their GCSE maths grade, $ r_\text{s} $ = 0.45, *p* = 0.022.

Task 7.5

In Figure 3.4 we saw some data relating to people’s ratings of dishonest acts and the likeableness of the perpetrator (for a full description see Jane Superbrain Box 3.1). Compute the Spearman correlation between ratings of dishonesty and likeableness of the perpetrator. The data are in honesty_lab.csv.

Load the data

honesty_tib <- readr::read_csv("../data/honesty_lab.csv") %>% 
  dplyr::mutate(
    sex = forcats::as_factor(sex),
    film = forcats::as_factor(film)
  )

Alternative, load the data directly from the discovr package:

honesty_tib <- discovr::honesty_lab

Estimate the correlation

honesty_tib %>% 
  dplyr::select(deed, likeableness) %>%
  correlation::correlation(method = "spearman") %>% 
  knitr::kable(digits = 3)

Parameter1	Parameter2	rho	CI_low	CI_high	S	p	Method	n_Obs
deed	likeableness	0.844	0.776	0.892	26054.24	0	Spearman	100

The relationship between ratings of dishonesty and likeableness of the perpetrator was significant because the p-value is less than 0.05 (p = 0.000). The value of Spearman’s correlation coefficient is quite large and positive (0.84), indicating a large positive effect: the more likeable the perpetrator was, the more positively their dishonest acts were viewed.

We could report the results as follows:

There was a large and significant positive relationship between the likeableness of a perpetrator and how positively their dishonest acts were viewed, $r_\text{s}$ = 0.84, p < 0.001.

Task 7.6

In Chapter 1 (Task 6) we looked at data from people who had been forced to marry goats and dogs and measured their life satisfaction and, also, how much they like animals (goat_or_dog.csv). Is there a significant correlation between life satisfaction and the type of animal to which a person was married

Load the data

goat_tib <- readr::read_csv("../data/goat_or_dog.csv") %>% 
  dplyr::mutate(
    wife = forcats::as_factor(wife)
  )

Alternative, load the data directly from the discovr package:

goat_tib <- discovr::animal_bride

Estimate the correlation

Wife is a categorical variable with two categories (goat or dog). Therefore, we need to look at this relationship using a point-biserial correlation.

goat_tib <- goat_tib %>% 
  dplyr::mutate(
    wife_bin = ifelse(wife == "Goat", 0, 1),
  )

goat_tib %>%
  dplyr::select(life_satisfaction, wife_bin) %>%
  correlation::correlation() %>% 
  knitr::kable(digits = 3)

Parameter1	Parameter2	r	CI_low	CI_high	t	df	p	Method	n_Obs
life_satisfaction	wife_bin	0.63	0.261	0.839	3.446	18	0.003	Pearson	20

I used a two-tailed test because one-tailed tests should never really be used (see book chapter for more explanation). As you can see there, was a significant relationship between type of animal wife and life satisfaction because our p-value is less than 0.05, $r_\text{pb}$ = 0.63, p = 0.003. Looking at how the groups were coded, you should see that goat had a code of 0 and dog had a code of 1, therefore this result reflects the fact that as wife goes up (changes from 0 to 1) life satisfaction goes up. Put another way, as wife changes from goat to dog, life satisfaction increases. So, being married to a dog was associated with greater life satisfaction.

Task 7.7

Repeat the analysis above taking account of animal liking when computing the correlation between life satisfaction and the animal to which a person was married.

We can conduct a partial correlation between life satisfaction and the animal to which a person was married while ‘adjusting’ for the effect of liking animals. Remember that we added the variable wife_bin to goat_tib in the previous solution.

goat_tib %>% 
  dplyr::select(-wife) %>% 
  correlation::correlation(partial = TRUE) %>% 
  knitr::kable(digits = 3)

Parameter1	Parameter2	r	CI_low	CI_high	t	df	p	Method	n_Obs
animal	life_satisfaction	0.615	0.236	0.831	3.306	18	0.008	Pearson	20
animal	wife_bin	-0.399	-0.715	0.053	-1.846	18	0.081	Pearson	20
life_satisfaction	wife_bin	0.701	0.375	0.873	4.173	18	0.002	Pearson	20

Notice that the partial correlation between wife and life satisfaction is 0.70, which is greater than the correlation when the effect of animal liking is not adjusted for (r = 0.630 from the previous task). The correlation has become more statistically significant (its p-value has decreased from 0.003 to 0.002..)

Running this analysis has shown us that type of wife alone explains a large portion of the variation in life satisfaction. In other words, the relationship between wife and life satisfaction is not due to animal liking.

Task 7.8

In Chapter 1 (Task 7) we looked at data based on findings that the number of cups of tea drunk was related to cognitive functioning (Feng et al., 2010). The data are in the file tea_makes_you_brainy_15.csv. What is the correlation between tea drinking and cognitive functioning? Is there a significant effect?

Load the data

tea15_tib <- readr::read_csv("../data/tea_makes_you_brainy_15.csv")

Alternative, load the data directly from the discovr package:

tea15_tib <- discovr::tea_15

Estimate the correlation

Because the numbers of cups of tea and cognitive function are both interval variables, we can conduct a Pearson’s correlation coefficient.

tea15_tib %>% 
  dplyr::select(tea, cog_fun) %>%
  correlation::correlation() %>% 
  knitr::kable(digits = 3)

Parameter1	Parameter2	r	CI_low	CI_high	t	df	p	Method	n_Obs
tea	cog_fun	0.078	-0.453	0.567	0.281	13	0.783	Pearson	15

I chose a two-tailed test because it is never really appropriate to conduct a one-tailed test (see the book chapter). The results in the table above indicate that the relationship between number of cups of tea drunk per day and cognitive function was not significant. We can tell this because our p-value is greater than 0.05. Pearson’s r = 0.08, p = 0.783.

Task 7.9

The research in the previous task was replicated but in a larger sample (N = 716), which is the same as the sample size in Feng et al.’s research (tea_makes_you_brainy_716.csv). Conduct a correlation between tea drinking and cognitive functioning. Compare the correlation coefficient and significance in this large sample, with the previous task. What statistical point do the results illustrate?

Load the data

tea716_tib <- readr::read_csv("../data/tea_makes_you_brainy_716.csv")

Alternative, load the data directly from the discovr package:

tea716_tib <- discovr::tea_716

Estimate the correlation

tea716_tib %>% 
  dplyr::select(tea, cog_fun) %>%
  correlation::correlation() %>% 
  knitr::kable(digits = 3)

Parameter1	Parameter2	r	CI_low	CI_high	t	df	p	Method	n_Obs
tea	cog_fun	0.078	0.004	0.15	2.081	714	0.038	Pearson	716

We can see that although the value of Pearson’s r has not changed, it is still very small (0.08), the relationship between the number of cups of tea drunk per day and cognitive function is now just significant (p = 0.038). This example indicates one of the downfalls of significance testing; you can get significant results when you have large sample sizes even if the effect is very small. Basically, whether you get a significant result or not is entirely subject to the sample size.

Task 7.10

In Chapter 6 we looked at hygiene scores over three days of a rock music festival (download_festival.csv). Using Spearman’s correlation, were hygiene scores on day 1 of the festival significantly correlated with those on day 3?

Load the data

download_tib <- readr::read_csv("../data/download_festival.csv") %>% 
  dplyr::mutate(
    gender = forcats::as_factor(gender)
  )

Alternative, load the data directly from the discovr package:

download_tib <- discovr::download

Estimate the correlation

First, we need to make the data messy so that scores for different days are in different columns (rather than different rows):

download_tib %>% 
  dplyr::select(day_1, day_3) %>%
  correlation::correlation(method = "spearman") %>% 
  knitr::kable(digits = 3)

Parameter1	Parameter2	rho	CI_low	CI_high	S	p	Method	n_Obs
day_1	day_3	0.344	0.178	0.491	203307.3	0	Spearman	123

The hygiene scores on day 1 of the festival correlated significantly with hygiene scores on day 3. The value of Spearman’s correlation coefficient is 0.344, which is a positive value suggesting that the smellier you are on day 1, the smellier you will be on day 3, $r_\text{s}$ = 0.34, p < 0.001.

Task 7.11

Using the data in shopping_exercise.csv (Chapter 1, Task 5), find out if there is a significant relationship between the time spent shopping and the distance covered.

Load the data

shopping_tib <- readr::read_csv("../data/shopping_exercise.csv") %>% 
  dplyr::mutate(
    sex = forcats::as_factor(sex)
  )

Alternative, load the data directly from the discovr package:

shopping_tib <- discovr::shopping

Estimate the correlation

The variables Time and Distance are both interval. Therefore, we can conduct a Pearson’s correlation. I chose a two-tailed test because it is never really appropriate to conduct a one-tailed test (see the book chapter).

shopping_tib %>% 
  dplyr::select(distance, time) %>%
  correlation::correlation(method = "spearman") %>% 
  knitr::kable(digits = 3)

Parameter1	Parameter2	rho	CI_low	CI_high	S	p	Method	n_Obs
distance	time	0.83	0.421	0.959	28	0.003	Spearman	10

The output indicates that there was a significant positive relationship between time spent shopping and distance covered. We can tell that the relationship was significant because the p-value is smaller than 0.05. Also, our value for Pearson’s r is very large (0.83) indicating a large effect. Pearson’s r = 0.83, p = 0.003.

Task 7.12

What effect does accounting for the participant’s biological sex have on the relationship between the time spent shopping and the distance covered?

To answer this question, we need to conduct a partial correlation between the time spent shopping (interval variable) and the distance covered (interval variable) while ‘adjusting’ for the effect of sex (dicotomous variable).

First, let’s turn sex into a numeric variable:

shopping_tib <- shopping_tib %>% 
  dplyr::mutate(
    sex_bin = ifelse(sex == "Male", 0, 1),
  )

The data look like this:

Table: Table 3: First 10 rows of shopping_tib

sex	distance	time	sex_bin
Male	0.16	15	0
Male	0.40	30	0
Male	1.36	37	0
Male	1.99	65	0
Male	3.61	103	0
Female	1.40	22	1
Female	1.81	140	1
Female	1.96	160	1
Female	3.02	183	1
Female	4.82	245	1

Now lets run the partial correlation:

shopping_tib %>% 
  dplyr::select(-sex) %>% 
  correlation::correlation(partial = TRUE) %>% 
  knitr::kable(digits = 3)

Parameter1	Parameter2	r	CI_low	CI_high	t	df	p	Method	n_Obs
distance	time	0.820	0.395	0.956	4.06	8	0.011	Pearson	10
distance	sex_bin	-0.351	-0.803	0.358	-1.06	8	0.320	Pearson	10
time	sex_bin	0.644	0.024	0.906	2.38	8	0.089	Pearson	10

The partial correlation between time and distance is 0.82, which is slightly smaller than the correlation when the effect of sex is not adjusted for (r = 0.830). The correlation has become slightly less statistically significant (its p-value has increased from 0.003 to 0.011). However, not a lot has changed, so even after adjusting for biological sex, the relationship between distance covered and time spent shopping is very strong.

Last updated on May 29, 2025