Missing Data Tutorial

Welcome

In this tutorial, you will learn how to work with missing data using the naniar package. The naniar package helps you summarize and visualize missing (‘NA’) values in a dataset, so you can see how many ‘NA’ values there are and where they occur.

Functions in Focus

Let’s start with the key functions we’ll use in the exercises.

na_if() (from the dplyr package): Replaces every occurrence of a given value with NA. It can be combined with the mutate() function to create a new column or overwrite an existing one.

Syntax: na_if(vector, value to replace)
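
As a minimal sketch of how na_if() behaves, the example below uses a small made-up data frame called toy (an assumption for illustration, not the tutorial's data) in which the placeholder "unknown" really means "missing".

```r
library(dplyr)

# Hypothetical toy data; "unknown" is a placeholder that really means "missing".
toy <- tibble(
  id      = 1:4,
  habitat = c("shore", "unknown", "cliff", "unknown")
)

# On a bare vector, na_if() turns every "unknown" into NA.
na_if(toy$habitat, "unknown")

# Inside mutate(), the cleaned vector overwrites the original column.
toy_clean <- toy |>
  mutate(habitat = na_if(habitat, "unknown"))

toy_clean
```

Values that do not match the second argument are left unchanged.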

replace_na() (from the tidyr package): Replaces NA values with a given value.

Syntax: replace_na(vector, value)
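
A short sketch of replace_na() on a made-up numeric vector (the name masses and its values are assumptions for illustration only):

```r
library(tidyr)

# Hypothetical measurements with two missing entries.
masses <- c(3750, NA, 3250, NA)

# Every NA is swapped for the replacement value; other entries are untouched.
masses_filled <- replace_na(masses, 0)

masses_filled
```

The replacement here is 0 only to keep the example easy to read; any single value of a compatible type can be used instead.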

miss_var_summary(): Creates a summary table of missing values by variable. Includes the number of missing values per variable (called n_miss) and the percentage of missing values per variable (called pct_miss).

Syntax when used with a dataframe: miss_var_summary(df)
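
To show the shape of the summary table, here is a minimal sketch on a made-up data frame (toy and its columns are assumptions, not the tutorial's data):

```r
library(naniar)

# Hypothetical toy data: mass has 2 missing values, sex has 1, id has none.
toy <- data.frame(
  id   = 1:4,
  mass = c(3750, NA, 3250, NA),
  sex  = c("male", "female", NA, "female")
)

# One row per variable, with the columns variable, n_miss and pct_miss.
summary_tbl <- miss_var_summary(toy)

summary_tbl
```

With four rows in toy, the two missing mass values correspond to a pct_miss of 50.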

gg_miss_var(): Creates a bar graph of missing values by variable.

Syntax: gg_miss_var(df)

You can also split the graph by a grouping variable, with a separate panel for each value of that variable, using the syntax gg_miss_var(df, facet=variable) where variable is your chosen variable.
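
The sketch below illustrates both forms of gg_miss_var() on a made-up data frame; toy, site, mass and colour are hypothetical names chosen for this example.

```r
library(naniar)

# Hypothetical toy data with a grouping column (site) and two columns with NAs.
toy <- data.frame(
  site   = c("A", "A", "B", "B"),
  mass   = c(3750, NA, NA, NA),
  colour = c("brown", NA, "black", "black")
)

# One bar per variable, counting missing values across the whole data frame.
p_all <- gg_miss_var(toy)

# One panel per site; facet takes the bare (unquoted) column name.
p_by_site <- gg_miss_var(toy, facet = site)

p_by_site
```

Both calls return ggplot objects, so they can be stored in a variable and printed later.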

vis_miss(): Creates a heatmap-style plot that shows, for each variable, which rows of the dataset contain missing values.

Syntax: vis_miss(df)
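
A minimal sketch of vis_miss() on a made-up data frame (toy and its columns are assumptions for illustration):

```r
library(naniar)

# Hypothetical toy data; rows 1 and 3 contain missing values.
toy <- data.frame(
  year = c(2009, 2007, 2008, 2007),
  mass = c(NA, 3750, NA, 3250),
  sex  = c(NA, "male", "female", "female")
)

# Columns of the plot are variables and rows are observations,
# so row order in the data frame controls where the NA cells appear.
p_miss <- vis_miss(toy)

p_miss
```

Because the plot follows the row order of the data, sorting the data frame before calling vis_miss() changes where the missing cells show up.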

Our Dataset

We will work with a dataset of measurements for 3 different species of penguins on islands in the Palmer Archipelago, Antarctica. We’ll be focusing on the variables for species, location, body mass (g), sex, and the year the data were collected. For this tutorial, we’ve also added a new column called eye_color. Here’s more information about the data set!

Here is a preview of the data you will be working with:

Notice that there are NA values as well as “X”s in our data set. Today we’ll be looking at those NA (or missing) values.

Exercises

Here are some practice exercises. You can check if your code works using the “check your work” button after each problem.

If you get stuck, you can also click for hints and for an example answer. Remember, there are many ways to solve each problem, so the code in the “answer” box isn’t the only way to solve it!

Exercise 1

First, let’s deal with the “X” values in our data set. We assume that these are missing values, so we want to replace the “X” with an “NA” so that we can use the naniar package functions on those values.

Use the na_if() function to replace all the X’s with NA values.

Overwrite your data frame penguin so that whenever we use penguin later, all the “X” values will be gone.

Check your work

To check your work, run the following code in the code chunk below.

penguin

If your code works, this should display your new data set with NA instead of X.

Hint

Use the pipe operator (|>) to pass the penguin data set to the mutate() function, then use mutate() to modify the eye_color column.

Hint

penguin |> mutate(____=na_if(_____, '____'))

Hint

Once you’ve mutated the eye_color column, you need to overwrite the penguin data frame with the result. Use the assignment operator (<-) to do this.

Answer

penguin <- penguin |> mutate(eye_color=na_if(eye_color, 'X'))

Exercise 2

Now, let’s find where we have NA values in our data. Use the command miss_var_summary() to find which variables have missing values and how many missing values each variable contains.

Hint

miss_var_summary(________)

Answer

miss_var_summary(penguin)

Exercise 3

You may have noticed that eye_color has the highest number of missing values, so we’ll focus on the missing values in eye_color for this exercise.

Use miss_var_summary() to display the number of missing eye_color values, separated into individual years. Don’t include any other variables.

Hint

Use the pipe operator (|>) to pass the data through each step and keep only the data you want to display.

Hint

Use the select() function to choose the variables you want to focus on, and use the group_by() function to choose how the summary table is split into groups.

Hint

penguin |> 
  select(___, ____) |>
  group_by(_____) |>
  miss_var_summary()

Answer

penguin |> 
  select(year, eye_color) |>
  group_by(year) |>
  miss_var_summary()

Exercise 4

Next, use gg_miss_var() to create bar graphs that show the number of missing values per variable, by island. Create 3 graphs in total, 1 graph for each island.

Hint

Recall the syntax gg_miss_var(df, facet=variable) where variable is your chosen variable.

Hint

gg_miss_var(penguin, facet=____)

Answer

gg_miss_var(penguin, facet=island)

Exercise 5

Next, let’s look at where the missing values are showing up within the observations in the dataset.

First, arrange the data by year, then input it into the vis_miss() function to see where the NA values fall when the data is sorted by year.

Hint

Use the pipe operator (|>) with the arrange() function to sort the data by year.

Hint

penguin |>
  arrange(____) |>
  _______(_____)

Answer

penguin |>
  arrange(year) |>
  vis_miss()

Exercise 6

Finally, let’s imagine that you don’t want any NA values in the body_mass column. Use replace_na() to convert all of the NA values in that column to the mean of the body_mass column.

Hint

Use the mean() function to find the mean of body_mass.

Use the pipe to specify the penguin data, the mutate() function to change the body_mass column, and the replace_na() function to convert the NA values to the mean.

Hint

penguin |>
  mutate(____ = replace_na(_____, mean(_____, na.rm=TRUE)))

Answer

penguin |> mutate(body_mass=replace_na(body_mass, mean(body_mass, na.rm=TRUE)))

Congratulations, you finished this tutorial!