Lesson 3: Variability

1 Introduction

Welcome! In this lesson you will learn about measures of variability. You’ll learn how to find the standard deviation, variance, standard error, and confidence intervals. You’ll also practice subsetting data with the group_by() and filter() functions.

See the home page for more details on how to use this tutorial and for troubleshooting tips!

2 Variation Around a Central Tendency

You have already learned that one important way to summarize a set of data values is with a measure of central tendency, the mean or the median.

Another important summary statistic is the amount of variation around that central tendency.

2.1 Dataset

This lesson contains a dataset that you will work with, named data. You have already worked with these data, in lesson one and two.

Type the word data to recall what it contains.

The dataset contains physiological data for a sample of individuals.

You can determine how many entries are in the dataframe by using the command nrow(data). Try that now.

nrow(data)

How many different individuals are in the sample?

9

14

18

2.2 Variance

One common measure of variability is the variance. The variance is the square of the average difference between each data value and the mean.

You can compute the variance of a variable with the var() function.

Use var() to compute the variance of the RQ for the two conditions exercise (1) and resting (2), found in the Condition column.

Use group_by() and summarize(), and string the functions together with the pipe (|>).

Don’t forget to add na.rm=TRUE in the var() function.

data |> 
  group_by(_____) |> 
  summarize(varianceRQ = var(______, na.rm=TRUE))
data |> 
  group_by(Condition) |> 
  summarize(varianceRQ = var(RQ, na.rm=TRUE))

What is the variance of the respiratory quotient for the resting group (condition 2)?

1.80123

0.00176

0.001995

2.3 Standard Deviation

Another common measure of variance is the standard deviation, which is simply the square root of the variance. The function for standard deviation is sd().

Compute the standard deviation of RQ for each of the two conditions.

Use group_by() and summarize(), and string the functions together with the pipe (|>).

Don’t forget to add na.rm=TRUE in the sd() function.

data |> 
  group_by(_____) |> 
  summarize(stdDevRQ = sd(______, na.rm=TRUE))
data |> 
  group_by(Condition) |> 
  summarize(stdDevRQ = sd(RQ, na.rm=TRUE))

What is the variance of the respiratory quotient for the exercise group (condition 1)

0.0419

0.0447

0.0223

3 Standard error and confidence intervals

One reason to compute the variability in a data set is to get an idea of how accurately our sample has estimated the mean of a population. Two additional measures of variability, standard error and confidence intervals, do this more usefully than variance or standard deviation.

This is because both standard error and confidence intervals become smaller as the sample size (or number of data values) gets larger – which reflects our increased certainty that our sample mean represents the true population mean.

3.1 Standard Error

To compute the standard error of a set of data in R, a package called plotrix is required. In this tutorial, plotrix has already been installed for you.

The function for computing the standard error is std.error().

Compute the standard error of RQ for each of the two conditions.

Use the same syntax as you did for computing standard deviation and variance, just with a new formula. Don’t forget to use the pipe to string your functions together!

data |>
  group_by(______) |>
  summarize(______ = std.error(_____, na.rm=TRUE))
data |>
  group_by(Condition) |>
  summarize(stdErrorRQ = std.error(RQ, na.rm=TRUE))
data |>
  group_by(Condition) |>
  summarize(stdErrorRQ = sd(RQ, na.rm=TRUE)/sqrt(n()))

What is the standard error of RQ for the exercise condition?

0.0169

0.1238

0.0447

3.2 Confidence Intervals

Confidence intervals are closely related to standard error. The 95% confidence interval is a range of data values that enclose the sample mean. There is a 95% probability that the “true” mean lies within the 95% confidence interval of the sample mean.

The easiest way to compute a 95% confidence interval is to use the t.test() function.

The command for the t-test is t.test(data$variableName)

Let’s compute the 95% confidence interval of the variable heart_rate.

t.test(data$heart_rate)

In the output, you should see the name of your variable, followed by two lines of information that are not relevant if you are only interested in knowing the 95% confidence interval. The 95% confidence interval is on the following line, followed by a computation of the sample mean.

The 95% confidence interval for the heart rate variable is between 61.08295 and 75.91705. We read this as: There is a 95% probability that the “true” mean lies between 61.08 and 75.92.

3.3 Subsetting Data with the Filter() Function

What if we wanted to find a 95% confidence interval for each condition (exercise and resting)?

We will use the filter() function to extract data with a specific condition.

The syntax is: filter(data, columnName == "some value")

  • Arguments used:
    • data: Input your original data set

    • columnName == "some value" Choose a column you’d like to filter (columnName) and then replace some value with whatever condition you would like to specify.

    • You can also replace the == with other operators including:

      • > (greater than)
      • < (less than)
      • <= (less than or equal to)
      • >= (greater than or equal to)
      • == (equal to)
      • != (not equal to)

Let’s see how this works with an example:

If you run this example, you should see data, but it will only show rows where the condition is “1”.

Now, let’s save this data as a new variable. Recall you can use <- to assign data to a new variable. Try creating a new variable that includes all the rows for which the Condition is equal to “1”. Call your new variable exerciseData.

Use the same code in the example: filter(data, Condition == "1") but assign it to a new variable name exerciseData.

exerciseData <- filter(data, Condition == "1")

Now that you’ve created your variable, type exerciseData to see if it worked!

Finally, let’s find the 95% confidence interval for the RQ values measured under Condition 1.

Perform a t.test on your new dataset exerciseData, on the RQ column.

Try something like t.test(______$______)

t.test(exerciseData$RQ)

So for these data, the (rounded) sample mean is 0.876, and the 95% confidence interval ranges from a minimum value of 0.834 to a maximum of 0.917.

3.4 Congratulations!

You’ve completed lesson three, in which you learned how to compute several measures of variability: variance, standard deviation, standard error, and confidence intervals. You also learned ways to subset a data set using group_by() and filter().

In lesson four you will learn how to make plots. You’ll start with a box plot and histogram, which are a basic ways of graphically displaying the central tendency and the variability of any data set.