Missing Data Tutorial
Welcome
In this tutorial, you will learn how to deal with missing data using the naniar package. The naniar package helps visualize where there are missing or ‘NA’ values in a dataset. It allows you to see how many ‘NA’ values there are and where they show up.
Functions in Focus
Let’s start by looking at the key functions we’ll use in the exercises.
na_if(): Converts values to NA. Can be combined with the mutate() function to create or replace a column with NA values.
Syntax: na_if(vector, value to replace)
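To see na_if() together with mutate(), here is a minimal sketch. The toy data frame, its column names, and the "unknown" placeholder are made up for illustration and are not part of the penguin data. It assumes the dplyr package, which provides both functions.

```r
library(dplyr)

# Hypothetical toy data: "unknown" stands in for a missing value
toy <- tibble(
  id = 1:4,
  color = c("red", "unknown", "blue", "unknown")
)

# Replace every "unknown" in the color column with NA,
# overwriting the existing column
toy_clean <- toy |>
  mutate(color = na_if(color, "unknown"))

toy_clean
```

Only exact matches of the given value are converted; every other value in the column is left unchanged.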
replace_na(): Changes an NA value to a given value.
Syntax: replace_na(vector, value)
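As a quick illustration, the sketch below fills the missing entry of a made-up vector with 0. The vector name and values are hypothetical. It assumes the tidyr package, which provides replace_na().

```r
library(tidyr)

# Hypothetical vector with one missing value
scores <- c(10, NA, 30)

# Replace the NA with 0; non-missing values are left as they are
filled <- replace_na(scores, 0)

filled
```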
miss_var_summary(): Creates a summary table of missing values by variable. Includes the number of missing values per variable (called n_miss) and the percentage of missing values per variable (called pct_miss).
Syntax when used with a dataframe: miss_var_summary(df)
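Here is a small sketch of what the summary table looks like. The toy data frame is invented for this example and is not the penguin data.

```r
library(naniar)

# Hypothetical toy data: height has 2 NAs, weight has 1, name has none
toy <- data.frame(
  height = c(150, NA, 172, NA),
  weight = c(60, 71, NA, 80),
  name = c("a", "b", "c", "d")
)

# One row per variable, with the n_miss and pct_miss columns
summary_tbl <- miss_var_summary(toy)

summary_tbl
```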
gg_miss_var(): Creates a bar graph of missing values by variable.
Syntax: gg_miss_var(df)
- You can also split the graph by one variable, creating a separate graph for each value of that variable, using the syntax gg_miss_var(df, facet = variable), where variable is your chosen variable.
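The sketch below shows both forms on a made-up data frame. The column names, including group, are hypothetical and only serve to illustrate the facet argument.

```r
library(naniar)

# Hypothetical toy data with a grouping column
toy <- data.frame(
  group = c("a", "a", "b", "b"),
  height = c(150, NA, 172, NA),
  weight = c(60, 71, NA, 80)
)

# One graph for the whole data frame
p_all <- gg_miss_var(toy)

# One panel per value of the group column
p_by_group <- gg_miss_var(toy, facet = group)

p_by_group
```

Note that the facet variable is passed as a bare column name, without quotes.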
vis_miss(): Creates a visual of the missing values in each variable, shown by row position in the dataset.
Syntax: vis_miss(df)
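A minimal sketch on invented data (not the penguin data) looks like this. Each column of the plot is a variable and each row is an observation, in the order the rows appear in the data frame.

```r
library(naniar)

# Hypothetical toy data with a few missing values
toy <- data.frame(
  height = c(150, NA, 172, NA),
  weight = c(60, 71, NA, 80)
)

# Plot which cells are missing and which are present
p <- vis_miss(toy)

p
```

Because the plot follows the row order, sorting the data frame before calling vis_miss() changes where the missing values appear.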
Our Dataset
We will work with a dataset with measurements for 3 different species of penguins on islands in the Palmer Archipelago, Antarctica. We’ll be focusing on the measurements for species, location, body mass (g), sex, and the year the data were collected. For this tutorial, we’ve also added a new column called eye_color. Here’s more information about the data set!
Here is a preview of the data you will be working with:
Notice that there are NA values as well as “X”s in our data set. Today we’ll be looking at those NA (or missing) values.
Exercises
Here are some practice exercises. You can check if your code works using the “check your work” button after each problem.
If you get stuck, you can also click for hints and for an example answer. Remember, there are many ways to solve each problem, so the code in the “answer” box isn’t the only way to solve it!
Exercise 1
First, let’s deal with the “X” values in our data set. We assume that these are missing values, so we want to replace the “X” with an “NA” so that we can use the naniar package functions on those values.
Use the na_if() function to replace all the X’s with NA values.
Rewrite your data frame penguin so that when we pull up penguin in the future, all the “X” values will be gone.
Exercise 2
Now, let’s find where we have NA values in our data. Use the command miss_var_summary() to find which variables have missing values and how many missing values each variable contains.
Exercise 3
You may have noticed that eye_color has the highest number of missing values, so we’ll focus on the missing values in eye_color for this exercise.
Use miss_var_summary() to display the number of missing eye_color values for each year. Don’t include any other variables.
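One way to get a per-group summary is to group the data before calling miss_var_summary(). The sketch below shows the idea on invented data rather than on penguin. The column names are hypothetical, and it assumes dplyr's select() and group_by().

```r
library(dplyr)
library(naniar)

# Hypothetical toy data: height is missing once in group "a", twice in "b"
toy <- tibble(
  group = c("a", "a", "b", "b", "b"),
  height = c(150, NA, 172, NA, NA),
  weight = c(60, 71, NA, 80, 75)
)

# Keep only the grouping column and the variable of interest,
# then summarise the missing values within each group
by_group <- toy |>
  select(group, height) |>
  group_by(group) |>
  miss_var_summary()

by_group
```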
Exercise 4
Next, use gg_miss_var() to create bar graphs that show the number of missing values per variable, by island. Create 3 graphs total, 1 graph for each island.
Exercise 5
Next, let’s look at where the missing values are showing up within the observations in the dataset.
First, arrange the data by year, then input it into the vis_miss() function to see where the NA values fall when the data is sorted by year.
Exercise 6
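Finally, let’s imagine that you don’t want any NA values in the body_mass column. Use replace_na() to convert all of the NA values in that column to the mean of the body_mass column. Keep in mind that mean() returns NA when its input contains NA values, so the mean needs to be computed from the non-missing values only.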
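The general pattern of filling missing values with a column mean can be sketched on a made-up column like this. The data frame and column name are hypothetical, and this is only one of several ways to do it. It assumes dplyr and tidyr, and uses na.rm = TRUE so that mean() ignores the missing entries.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data with one missing weight
toy <- tibble(weight = c(60, NA, 80))

# Compute the mean of the non-missing values (na.rm = TRUE)
# and use it to fill in the NA
toy_filled <- toy |>
  mutate(weight = replace_na(weight, mean(weight, na.rm = TRUE)))

toy_filled
```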
Congratulations, you finished this tutorial!