Missing Data Tutorial
Welcome
In this tutorial, you will learn how to deal with missing data using the naniar package. The naniar package helps visualize where there are missing or ‘NA’ values in a dataset. It allows you to see how many ‘NA’ values there are and where they show up.
Functions in Focus
Let’s start by looking at some of our key functions which we’ll use in this exercise.
na_if(): Converts values to NA. Can be combined with the mutate() function to create or replace a column with NA values.
Syntax: na_if(vector, value to replace)
replace_na(): Changes an NA value to a given value.
Syntax: replace_na(vector, value)
miss_var_summary(): Creates a summary table of missing values by variable. Includes the number of missing values per variable (called n_miss) and the percentage of missing values per variable (called pct_miss).
Syntax when used with a dataframe: miss_var_summary(df)
gg_miss_var(): Creates a bar graph of missing values by variable.
Syntax: gg_miss_var(df)
- You can also choose one variable and create a separate graph for each value of that variable using the syntax gg_miss_var(df, facet=variable), where variable is your chosen variable.
vis_miss(): Creates a plot showing which values are missing in each variable, by row position in the dataset.
Syntax: vis_miss(df)
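As a rough sketch of how these calls fit together, the snippet below applies each function to a small made-up data frame. The data frame toy and its columns are invented for illustration and are not the tutorial's dataset. The snippet assumes the dplyr, tidyr, and naniar packages are installed (na_if() comes from dplyr and replace_na() from tidyr).

```r
library(dplyr)   # na_if()
library(tidyr)   # replace_na()
library(naniar)  # miss_var_summary(), gg_miss_var(), vis_miss()

# Hypothetical toy data, invented for this sketch
toy <- data.frame(
  group = c("a", "a", "b", "b"),
  score = c(1.5, NA, 3.0, NA),
  label = c("x", "y", NA, "z")
)

# Value conversion on plain vectors
to_na   <- na_if(c("red", "X", "blue"), "X")  # the "X" entry becomes NA
from_na <- replace_na(c(1, NA, 3), 0)         # the NA entry becomes 0

# Summary table: one row per variable, with n_miss and pct_miss columns
toy_summary <- miss_var_summary(toy)
toy_summary

# Plots of missingness; each call returns a ggplot object
bar_plot <- gg_miss_var(toy)
map_plot <- vis_miss(toy)
bar_plot
map_plot
```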
Our Dataset
We will work with a dataset with measurements for 3 different species of penguins on islands in the Palmer Archipelago, Antarctica. We’ll be focusing on the measurements for species, location, body mass (g), sex, and year the data was taken. For this tutorial, we’ve also added a new column called eye_color. Here’s more information about the data set!
Here is a preview of the data you will be working with:
Notice that there are NA values as well as “X”s in our data set. Today we’ll be looking at those NA (or missing) values.
Exercises
Here are some practice exercises. You can check if your code works using the “check your work” button after each problem.
If you get stuck, you can also click for hints and for an example answer. Remember, there are many ways to solve each problem, so the code in the “answer” box isn’t the only way to solve it!
Exercise 1
First, let’s deal with the “X” values in our data set. We assume that these are missing values, so we want to replace each “X” with NA so that the naniar package functions will count those values as missing.
Use the na_if() function to replace all the X’s with NA values.
Overwrite your data frame penguin with the result so that when we pull up penguin in the future, all the “X” values will be gone.
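One possible shape for this kind of replacement is sketched below on a made-up data frame. The name toy and its columns are hypothetical stand-ins, not the tutorial's data, and the second variant, which uses across(), is only one assumed way to cover several text columns at once.

```r
library(dplyr)

# Hypothetical toy data standing in for a data frame that uses "X" as a placeholder
toy <- data.frame(
  id        = 1:4,
  eye_color = c("black", "X", "brown", "X"),
  sex       = c("male", "female", "X", "female")
)

# Replace "X" in a single column, assigning back so the change persists
toy_one <- toy |>
  mutate(eye_color = na_if(eye_color, "X"))

# Alternative: replace "X" in every character column in one step
toy_all <- toy |>
  mutate(across(where(is.character), ~ na_if(.x, "X")))

toy_all
```

Without the assignment (`toy_one <- ...`), the modified data frame would only be printed, and the original would still contain the “X” values.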
Exercise 2
Now, let’s find where we have NA values in our data. Use the command miss_var_summary() to find which variables have missing values and how many missing values each variable contains.
Exercise 3
You may have noticed that eye_color has the highest number of missing values, so we’ll focus on the missing values in eye_color for this exercise.
Use miss_var_summary() to display the number of missing eye_color values, separated into individual years. Don’t include any other variables.
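A grouped summary of this kind might look like the sketch below. It assumes that miss_var_summary() respects dplyr groupings (as naniar documents for grouped data frames) and uses a hypothetical data frame toy rather than the tutorial's data.

```r
library(dplyr)
library(naniar)

# Hypothetical toy data, invented for this sketch
toy <- data.frame(
  year      = c(2007, 2007, 2008, 2008, 2009),
  eye_color = c(NA, "black", NA, NA, "brown"),
  body_mass = c(3700, NA, 3300, 3500, 3800)
)

# Keep only the grouping column and the column of interest,
# then summarise missingness separately within each year
by_year <- toy |>
  select(year, eye_color) |>
  group_by(year) |>
  miss_var_summary()

by_year
```

Selecting the columns before grouping keeps the other variables out of the summary, so the result has one row per year.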
Exercise 4
Next, use gg_miss_var() to create bar graphs that show the number of missing values per variable, by island. Create 3 graphs total, 1 graph for each island.
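The facet argument described earlier can be sketched on a made-up data frame as follows. The data frame toy and its island labels are hypothetical, chosen only so that the plot has three panels.

```r
library(naniar)

# Hypothetical toy data with three island labels, invented for this sketch
toy <- data.frame(
  island    = c("A", "A", "B", "B", "C", "C"),
  body_mass = c(3700, NA, 3300, 3500, NA, NA),
  sex       = c("male", NA, "female", "female", "male", NA)
)

# One panel per island; each panel shows missing counts per variable
island_plot <- gg_miss_var(toy, facet = island)
island_plot
```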
Exercise 5
Next, let’s look at where the missing values are showing up within the observations in the dataset.
First, arrange the data by year, then input it into the vis_miss() function to see where the NA values fall when the data is sorted by year.
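Sorting before plotting could be sketched as below, again on a hypothetical data frame toy rather than the tutorial's data. Because vis_miss() draws rows in the order they appear, arranging first groups each year's observations together in the plot.

```r
library(dplyr)
library(naniar)

# Hypothetical toy data with years deliberately out of order
toy <- data.frame(
  year      = c(2009, 2007, 2008, 2007, 2009, 2008),
  eye_color = c(NA, "black", NA, NA, "brown", NA),
  body_mass = c(3700, NA, 3300, 3500, 3800, 3600)
)

# Sort rows by year, then pass the sorted data straight into vis_miss()
sorted_toy  <- toy |> arrange(year)
sorted_plot <- sorted_toy |> vis_miss()
sorted_plot
```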
Exercise 6
Finally, let’s imagine that you don’t want any NA values in the body_mass column. Use replace_na() to convert all of the NA values in that column to the mean of the body_mass column.
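Mean replacement of this kind can be sketched on a made-up data frame as shown below. The data frame toy is hypothetical; the column name body_mass simply follows the wording of this exercise. Note that the mean must be computed with na.rm = TRUE, since the mean of a vector containing NA is itself NA.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data; body_mass is stored as double here
toy <- data.frame(
  id        = 1:5,
  body_mass = c(3700, NA, 3300, NA, 3800)
)

# Fill each NA with the mean of the observed values
# (if the column were stored as integer, converting it with as.numeric()
#  first may be needed, since the mean is usually not a whole number)
toy <- toy |>
  mutate(body_mass = replace_na(body_mass, mean(body_mass, na.rm = TRUE)))

toy$body_mass
```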
Congratulations, you finished this tutorial!