Code-a-long 3.1 - Statistical inference and two sample t-tests

Student learning outcomes

Students will understand basic concepts underlying inference
Students will be able to formulate statistical hypotheses using data
Students will be able to perform and interpret the results of two sample t-tests

Some review on t-tests and statistical hypotheses

A two sample t-test is a way of evaluating if the means of two populations are different, given our samples of those populations.

A t-test relies on calculating a t-score. This quantity depends on our sample means, our sample standard deviations, and the sizes of our samples.

The formula of the t-score for a two sample t-test:

\[t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_{2}}}}\]
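To make the formula concrete, here is a minimal sketch that computes the t-score by hand and compares it with what `t.test()` reports. The two samples `x1` and `x2` are made-up values for illustration only, not course data.

```r
# Two small made-up samples (illustrative values only, not course data)
x1 <- c(12, 15, 11, 14, 13, 16)
x2 <- c(9, 11, 10, 8, 12, 10)

# Numerator: difference between the two sample means
mean_diff <- mean(x1) - mean(x2)

# Denominator: standard error built from each sample's variance and size
std_err <- sqrt(var(x1) / length(x1) + var(x2) / length(x2))

# The t-score from the formula above
t_manual <- mean_diff / std_err

# t.test() defaults to the unequal-variance (Welch) test,
# which uses this same t-score
t_builtin <- unname(t.test(x1, x2)$statistic)

t_manual
t_builtin
```

The two values should agree, because the default `t.test()` does not assume equal variances and so uses the same standard error as the formula.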

The primary output of a t-test is a p-value. A p-value represents the probability of observing a difference between our sample means at least as large as the one we saw, if the population means were actually equal.

We can use this p-value to assess a statistical hypothesis.

Statistical hypotheses are formulated as a pair, a null hypothesis and an alternative hypothesis:

\(H_{0}\) (null hypothesis) - There is no difference in the means of the populations we sampled from

\(H_{a}\) (alternative hypothesis) - The means of the populations we sampled from are different

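Again, our p-value is the probability, expressed as a decimal, of seeing data at least as extreme as ours by chance alone if the null hypothesis were true. For instance, a p-value of 0.05 would mean there is a 5% probability of observing a difference this large or larger when the population means are in fact equal. It is not the probability that the null hypothesis is true.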
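As a rough sketch of where the p-value comes from, we can convert a t-score into a two-sided p-value ourselves using the t distribution. The samples `x1` and `x2` are made-up values for illustration, and the degrees of freedom are taken from the `t.test()` result rather than computed by hand.

```r
# Two small made-up samples (illustrative values only, not course data)
x1 <- c(12, 15, 11, 14, 13, 16)
x2 <- c(9, 11, 10, 8, 12, 10)

# Run the default two-sided, unequal-variance t-test
res <- t.test(x1, x2)

# Pull out the t-score and the degrees of freedom the test used
t_score <- unname(res$statistic)
deg_free <- unname(res$parameter)

# Two-sided p-value: probability of a t-score at least this far from zero
# (in either direction) if the null hypothesis were true
p_manual <- 2 * pt(-abs(t_score), df = deg_free)

p_manual
res$p.value
```

Both values should match. The p-value is the area in the two tails of the t distribution beyond our observed t-score.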

Sick fish example

Recall that there seemed to be a difference in the proportion of sick fish in a tank, when the tank was above or below a critical temperature value.

At the time, we just compared means and visualized the data, but that isn’t very statistically rigorous.

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

So, let's validate our results using a t-test.

Fishing and leopard seals

As you are well aware by now, our main source of food (fish) has been compromised. Rather than starve or leave, we decide to source our fish from the waters of Antarctica.

The problem is, the places we’d fish are also foraging grounds for leopard seals. To minimize the impact of our fishing on the seal population, we’d like to know where and when fish and seals are more abundant.

Luckily, we’ve been collecting relevant data for a while. We have radio tags on seals and counts of the number of seals at a given location. We also use net traps to count humped rockcod, our shared food source.

[Images: Seal with transmitter; Humped rockcod]

Data exploration

Let’s take a look at our new data. Because we have two different collection schemes, our data are separated into two data frames (and files).

Group Brainstorm

We’d like to know if seal and fish counts differ between the times of day observed.
Based on our goals, what quantities do we want to compare?
Create null and alternative hypotheses to evaluate our data

H0: there is no difference in the mean count of fish between the times of day

Ha: there is a difference in the mean count of fish between the times of day

Visualization and descriptive statistics

Generate descriptive statistics from our data, and visualize the data.
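One possible way to approach this step is sketched below. The data frame `fish`, its column names (`time_of_day`, `fish_count`), and its simulated values are all assumptions made for illustration. Swap in the real data frame and column names from the course files.

```r
library(tidyverse)

# Hypothetical stand-in data: the column names (time_of_day, fish_count)
# and the simulated values are assumptions, not the course data set
set.seed(42)
fish <- tibble(
  time_of_day = rep(c("day", "night"), each = 20),
  fish_count  = c(rpois(20, lambda = 12), rpois(20, lambda = 15))
)

# Descriptive statistics for each time of day
fish_summary <- fish %>%
  group_by(time_of_day) %>%
  summarize(
    n          = n(),
    mean_count = mean(fish_count),
    sd_count   = sd(fish_count)
  )

fish_summary

# Boxplot to compare the two distributions visually
fish_plot <- ggplot(fish, aes(x = time_of_day, y = fish_count)) +
  geom_boxplot() +
  labs(x = "Time of day", y = "Fish count")

fish_plot
```

The summary table gives us the three ingredients of the t-score (means, standard deviations, and sample sizes), and the boxplot lets us eyeball whether the groups look different before we test.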

Performing our tests

Perform a t-test to evaluate our hypothesis
Interpret the results using the p-value
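A minimal sketch of these two steps follows, again using a hypothetical data frame `fish` with assumed columns `time_of_day` and `fish_count` and simulated values. The significance threshold of 0.05 is also an assumption; use whichever threshold the class has agreed on.

```r
# Hypothetical stand-in data: column names and simulated values are
# assumptions for illustration, not the course data set
set.seed(42)
fish <- data.frame(
  time_of_day = rep(c("day", "night"), each = 20),
  fish_count  = c(rpois(20, lambda = 12), rpois(20, lambda = 15))
)

# Two sample t-test using the formula interface: outcome ~ grouping variable
fish_test <- t.test(fish_count ~ time_of_day, data = fish)

fish_test

# Compare the p-value with our chosen significance threshold (assumed 0.05)
alpha <- 0.05
if (fish_test$p.value < alpha) {
  message("p < alpha: reject H0; the mean fish counts appear to differ")
} else {
  message("p >= alpha: fail to reject H0; no evidence of a difference in means")
}
```

Note that failing to reject the null hypothesis is not the same as showing the means are equal. It only means our samples did not provide strong enough evidence of a difference.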