Introduction to stringr Package

Welcome

In this tutorial, you will delve into the manipulation of text data with the stringr package in the tidyverse. Text data is ubiquitous and manipulating strings efficiently is a fundamental skill in data analysis. stringr provides a cohesive set of functions designed to make string manipulation easy and consistent.

Functions in Focus

Let’s familiarize ourselves with the essential functions from stringr that we’ll explore in this tutorial, along with their syntax:

str_sub(): This function allows you to extract substrings in a character vector.

Syntax: str_sub(string, start_position, end_position)

str_to_lower(): Use this function to transform text to all lowercase.

Syntax: str_to_lower(string)

word(): This function will extract a specific word from a string (or character vector).

Syntax: word(string, start, end, sep=fixed("___"))

str_remove(): Use str_remove() to eliminate the first instance of a pattern in a string.

Syntax: str_remove(string, pattern)

str_length(): This function will give the number of characters in each element of a character vector.

Syntax: str_length(string)

str_detect(): This function will look at each element of a string and determine if it matches a given pattern. For each element, it will output FALSE or TRUE depending on whether it matches the pattern.

Syntax: str_detect(string, pattern)

With these tools at your disposal, you’ll be able to tackle a wide array of text processing tasks. Let’s proceed to the exercises and start putting these functions to work!

Our Dataset

We will work with a dataset with size measurements for 3 different species of penguins on islands in the Palmer Archipelago, Antarctica. We’ll be focusing on the measurements for bill length (mm), body mass (g), as well as the location and year the data was taken. Here’s more information about the data set!

Here is a preview of the data you will be working with:

Exercises

Here are some practice exercises. You can check if your code works using the “check your work” button after each problem.

If you get stuck, you can also click for hints and for an example answer. Remember, there’s many ways to solve each problem, so the code in the “answer” box isn’t the only way to solve it!

Exercise 1

Create a variable called ‘location_lower’ that converts location in the penguin_sample data frame to lowercase using str_to_lower().

Check your work

To check your work, run the following code in the code chunk below.

str(penguin_sample$location_lower)

This code will run successfully and show a few example values in lower case if the variable was properly created.

Hint

Use the str_to_lower function on the ‘location’ column of ‘penguin_sample’.

Hint

penguin_sample$location_lower <- str_to_lower(penguin_sample$_____)

Answer

penguin_sample$location_lower <- str_to_lower(penguin_sample$location)

Exercise 2

Sometimes we may decide the input in one observation is too long. Use str_length() to check how many characters there are in the first entry of the location column in our penguin_sample data set.

Hint

Use the str_length() function to count the number of characters in each entry of a vector. Then use bracket notation to indicate the first entry.

Hint

str_length(____$____)[___]

Answer

str_length(penguin_sample$location)[1]

Exercise 3

Split the Island variable in the penguin_sample data frame into separate elements using the word() based on the comma seperator. In particular, create a new column called island that extracts only the island name (without Palmer Archipelago, Antarctica).

Check your work

To check your work, run the following code in the code chunk below.

str(penguin_sample$island)

If your code worked, you should see your new island variable with this code.

Hint

Use the word() function and specify the comma as the separator. Extract the first word by specifying where the start.

Hint

penguin_sample$island <- word(______$______, start=___, sep="___")

Answer

penguin_sample$island <- word(penguin_sample$location, start=1, sep=",")

Exercise 4

Create a new variable called location_short that removes the phrase “Antarctica” from all locations in the penguin_sample data frame using str_remove().

Check your work

To check your work, run the following code in the code chunk below.

str(penguin_sample$location_short)

If your code worked, this code will show you the structure of your new variable.

Hint

The str_remove function needs a pattern to match what should be removed.

Hint

penguin_sample$location_short <- str_remove(__$___, "_____")

Answer

penguin_sample$location_short <- str_remove(penguin_sample$location, ", Antarctica")

Exercise 5

Let’s double check your work on Exercise 4. Use str_detect() to check all the values in your location_short column and ensure that “Antarctica” does not appear in any of the entries in that column (or string).

Hint

str_detect(____$_____, "____")

Answer

str_detect(penguin_sample$location_short, "Antarctica")

Exercise 6

Create a variable called year_short that extracts the last two characters of the year using str_sub().

Check your work

To check your work, run the following code in the code chunk below.

str(penguin_sample$year_short)

If your code worked, the year should only display the last two numbers.

Hint

Use str_sub(). -1 will stand in for the last character.

Hint

penguin_sample$year_short <- str_sub(____, ____, ____)

Answer

penguin_sample$year_short <- str_sub(penguin_sample$year, 3, -1)

Congratulations, you finished this tutorial!