Introduction to forcats Package

Welcome

In this tutorial, you will learn how to work with categorical data using the forcats package in the tidyverse. Categorical data is essential in data analysis, and forcats provides powerful tools for managing factors.

Functions in Focus

Before we dive into the exercises, let’s introduce the key functions we will be working with and provide some guidance on their syntax:

  1. as.factor(): This function can convert a variable in another format into a factor variable. Syntax: as.factor(nonfactor_variable)

  2. fct_expand(): This function is used to add new levels to a factor variable. Syntax: fct_expand(factor_variable, "new_levels")

  3. fct_collapse(): Use this function to group levels in a factor variable into broader categories. Syntax: fct_collapse(factor_variable, new_levels=c("old_leve1", "old_level2"))

  4. fct_recode(): This function allows you to recode levels in a factor variable. Syntax: fct_recode(factor_variable, new_level="old_level")

  5. fct_lump_n(): Use fct_lump_n() to lump levels in a factor variable into broader categories. Syntax: fct_lump(factor_variable, n=X)

  6. fct_infreq(): Use fct_infreq() to reorder the levels of the factor variable to follow the frequency of observations for different factor levels. Syntax: fct_infreq(factor_variable)

Now, let’s proceed to the exercises!

Our Dataset

We will work with a dataset that includes variables typically found in public opinion polls, such as age, ideology, race, gender, political party, religious denomination, and whether they voted.

Here is a preview of the data you will be working with:

Exercises

Here are some practice exercises. You can check if your code works using the “check your work” button after each problem.

If you get stuck, you can also click for hints and for an example answer. Remember, there’s many ways to solve each problem, so the code in the “answer” box isn’t the only way to solve it!

Exercise 1

Create a new variable called age_group that groups the “age” variable into three categories: “Young” (under 40), “Middle-Aged” (40-59), and “Senior” (60 and above). Make this new variable a factor variable.

To check your work, run the following code in the code chunk below.

str(opinion_data$age_group)

This code will run successfully and indicate a factor variable was created if your code is right!

You might need to categorize the age data first before you can use it as a factor. Consider creating a new column with the age groups based on conditions.

opinion_data <- opinion_data |> 
  mutate(age_group = case_when(
    ____ ~ "______",
    ____ & ______ ~ "______",
    ____ ~ "______"), 
  age_group = _________(age_group)) 
opinion_data <- opinion_data |> 
  mutate(age_group = case_when(
    age < 40 ~ "Young",
    age >= 40 & age <= 59 ~ "Middle-Aged",
    age >= 60 ~ "Senior"), 
  age_group = as.factor(age_group)) # this is converting the new column to a factor

Exercise 2

Use the fct_expand() function to add one new level, “Libertarian”, to the “ideology” factor variable.

To check your work, run the following code in the code chunk below.

levels(opinion_data$ideology)

If your code worked, you should see the new levels with this line of code

Use the fct_expand() function on the ideology column of opinion_data. Don’t forget to assign the new levels you want to add.

opinion_data <- opinion_data |> 
  mutate(ideology = fct_expand(______, "______"))
opinion_data <- opinion_data |> 
  mutate(ideology = fct_expand(ideology, "Libertarian"))

Exercise 3

Recategorize the “race” factor variable, replacing “Hispanic” with “Latino” using the fct_recode() function.

To check your work, run the following code in the code chunk below.

levels(opinion_data$race)

If your code worked, the recoded levels should appear here.

Use the fct_recode() function with the ‘race’ column and specify that “Latino” should replace “Hispanic”.

opinion_data <- opinion_data |> 
  mutate(race = ________(race, "______" = "______"))```
opinion_data <- opinion_data |> 
  mutate(race = fct_recode(race, "Latino" = "Hispanic"))

Exercise 4

Reclassify the “religion” variable into two categories: “Christian” and “Non-Christian” by creating a new factor variable.

To check your work, run the following code in the code chunk below.

levels(opinion_data$religion)

Only two levels should appear if done correctly.

You might need to manually recode the ‘religion’ levels into ‘Christian’ and ‘NonChristian’. Think about using a function that allows specifying conditions for recoding.

opinion_data <- opinion_data |>
  mutate(religion = _______(_______, 
                                  Christian = c("Catholic", "Protestant"),
                                  NonChristian = c("Atheist", "Hindu", "Jewish", "Buddhist")))
opinion_data <- opinion_data |> 
  mutate(religion = fct_collapse(religion, 
                                  Christian = c("Catholic", "Protestant"),
                                  NonChristian = c("Atheist", "Hindu", "Jewish", "Buddhist")))

Exercise 5

Currently a frequency table of the variable ‘party’ would look like this:

Reorder the levels of the “party” factor variable based on their frequencies and then create a frequency table to view the number of respondents associated with each party.

To check your work, run the following code in the code chunk below.

count(opinion_data, party)

The table should appear in a different order if successfully coded.

Use the fct_infreq() function to reorder the levels of the ‘party’ variable in ‘opinion_data’ based on their frequencies.

# To reorder:
opinion_data <- opinion_data |> 
   mutate(party = fct_infreq(_____))

# To create a frequency table:
count(opinion_data, ______)
opinion_data <- opinion_data |> 
  mutate(party = fct_infreq(party))

# Create frequency table
count(opinion_data, party)

Congratulations, you finished this tutorial!