In this tutorial, you will learn how to work with categorical data using the forcats package in the tidyverse. Categorical data is essential in data analysis, and forcats provides powerful tools for managing factors.
Functions in Focus
Before we dive into the exercises, let’s introduce the key functions we will be working with and provide some guidance on their syntax:
as.factor(): This function can convert a variable in another format into a factor variable. Syntax: as.factor(nonfactor_variable)
fct_expand(): This function is used to add new levels to a factor variable. Syntax: fct_expand(factor_variable, "new_levels")
fct_collapse(): Use this function to group levels in a factor variable into broader categories. Syntax: fct_collapse(factor_variable, new_level = c("old_level1", "old_level2"))
fct_recode(): This function allows you to recode levels in a factor variable. Syntax: fct_recode(factor_variable, new_level="old_level")
fct_lump_n(): Use fct_lump_n() to keep the n most frequent levels of a factor variable and lump all remaining levels into a single “Other” level. Syntax: fct_lump_n(factor_variable, n = X)
fct_infreq(): Use fct_infreq() to reorder the levels of the factor variable to follow the frequency of observations for different factor levels. Syntax: fct_infreq(factor_variable)
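To see the six functions side by side, here is a minimal sketch on a small made-up vector. The vector party_chr and its values are hypothetical and are not part of the tutorial dataset; the only package assumed is forcats.

```r
library(forcats)

# Hypothetical toy data, not the tutorial's dataset
party_chr <- c("Dem", "Rep", "Dem", "Ind", "Green", "Rep", "Dem")

# as.factor(): character vector -> factor (levels sorted alphabetically)
party <- as.factor(party_chr)

# fct_expand(): add a level that has no observations yet
party_expanded <- fct_expand(party, "Libertarian")

# fct_collapse(): merge several old levels into one new level
party_collapsed <- fct_collapse(party, Minor = c("Green", "Ind"))

# fct_recode(): rename levels, written as new_level = "old_level"
party_recoded <- fct_recode(party, Democrat = "Dem", Republican = "Rep")

# fct_lump_n(): keep the 2 most frequent levels, lump the rest into "Other"
party_lumped <- fct_lump_n(party, n = 2)

# fct_infreq(): reorder levels from most to least frequent
party_by_freq <- fct_infreq(party)

levels(party_expanded)
levels(party_collapsed)
levels(party_recoded)
levels(party_lumped)
levels(party_by_freq)
```

Each call returns a new factor and leaves the original untouched, so the result has to be assigned (or used inside mutate()) for the change to persist.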
Now, let’s proceed to the exercises!
Our Dataset
We will work with a dataset that includes variables typically found in public opinion polls, such as age, ideology, race, gender, political party, religious denomination, and whether the respondent voted.
Here is a preview of the data you will be working with:
Exercises
Here are some practice exercises. You can check if your code works using the “check your work” button after each problem.
If you get stuck, you can also click for hints and for an example answer. Remember, there are many ways to solve each problem, so the code in the “answer” box isn’t the only way to solve it!
Exercise 1
Create a new variable called age_group that groups the “age” variable into three categories: “Young” (under 40), “Middle-Aged” (40-59), and “Senior” (60 and above). Make this new variable a factor variable.
Check your work
To check your work, run the following code in the code chunk below.
str(opinion_data$age_group)
If your code is right, this will run without error and the output will describe age_group as a factor with three levels.
Hint
You might need to categorize the age data first before you can use it as a factor. Consider creating a new column with the age groups based on conditions.
opinion_data <- opinion_data |>
  mutate(
    age_group = case_when(
      age < 40 ~ "Young",
      age >= 40 & age <= 59 ~ "Middle-Aged",
      age >= 60 ~ "Senior"
    ),
    # this is converting the new column to a factor
    age_group = as.factor(age_group)
  )
Exercise 2
Use the fct_expand() function to add one new level, “Libertarian”, to the “ideology” factor variable.
Check your work
To check your work, run the following code in the code chunk below.
levels(opinion_data$ideology)
If your code worked, “Libertarian” should appear as the last level in the output of this line of code.
Hint
Use the fct_expand() function on the ideology column of opinion_data. Don’t forget to assign the new levels you want to add.
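As a rough illustration of what fct_expand() does, the sketch below uses a small made-up factor. The vector ideology and its values are hypothetical stand-ins, not the tutorial's ideology column.

```r
library(forcats)

# Hypothetical stand-in for an ideology column
ideology <- factor(c("Liberal", "Moderate", "Conservative", "Moderate"))

# Add a new level; no observation uses it yet
ideology_expanded <- fct_expand(ideology, "Libertarian")

levels(ideology_expanded)
table(ideology_expanded)  # the new level is listed with a count of 0
```

The new level is appended after the existing ones, and the data values themselves do not change. With the tutorial data, the same call would go inside mutate() so that the result is saved back to opinion_data.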
Exercise 3
Reclassify the “religion” variable into two categories: “Christian” and “Non-Christian” by creating a new factor variable.
Check your work
To check your work, run the following code in the code chunk below.
levels(opinion_data$religion)
Only two levels should appear if done correctly.
Hint
You might need to manually recode the ‘religion’ levels into ‘Christian’ and ‘Non-Christian’. Think about using a function that lets you list which old levels belong to each new level.
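One possible approach is fct_collapse(), sketched here on a made-up factor. The level names below (Catholic, Protestant, Jewish, Muslim, None) are assumptions for illustration only; the actual levels of the tutorial's religion variable may differ, so check them with levels() first.

```r
library(forcats)

# Hypothetical religion values, not the tutorial's dataset
religion <- factor(c("Catholic", "Protestant", "Jewish", "Muslim", "None", "Catholic"))

# List every old level under the new level it belongs to
religion_2cat <- fct_collapse(
  religion,
  Christian = c("Catholic", "Protestant"),
  `Non-Christian` = c("Jewish", "Muslim", "None")
)

levels(religion_2cat)
table(religion_2cat)
```

Because “Non-Christian” contains a hyphen, the name is wrapped in backticks so that R accepts it as an argument name.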
Exercise 4
Currently, a frequency table of the variable ‘party’ would look like this:
Reorder the levels of the “party” factor variable based on their frequencies and then create a frequency table to view the number of respondents associated with each party.
Check your work
To check your work, run the following code in the code chunk below.
count(opinion_data, party)
If your code worked, the rows of the table should now be ordered from the most frequent party to the least frequent.
Hint
Use the fct_infreq() function to reorder the levels of the ‘party’ variable in ‘opinion_data’ based on their frequencies.
Hint
# To reorder:
opinion_data <- opinion_data |>
  mutate(party = fct_infreq(_____))
# To create a frequency table:
count(opinion_data, ______)
Answer
opinion_data <- opinion_data |>
  mutate(party = fct_infreq(party))
# Create frequency table
count(opinion_data, party)