Introduction to the stringr Package

Welcome

In this tutorial, you will delve into the manipulation of text data with the stringr package in the tidyverse. Text data is ubiquitous and manipulating strings efficiently is a fundamental skill in data analysis. stringr provides a cohesive set of functions designed to make string manipulation easy and consistent.

Functions in Focus

Let's familiarize ourselves with the essential functions from stringr that we'll explore in this tutorial, along with their syntax:

  1. str_sub(): This function extracts substrings from each element of a character vector.

Syntax: str_sub(string, start_position, end_position)

  2. str_to_lower(): Use this function to transform text to all lowercase.

Syntax: str_to_lower(string)

  3. word(): This function extracts a specific word from a string (or from each element of a character vector).

Syntax: word(string, start, end, sep=fixed("___"))

  4. str_remove(): Use str_remove() to eliminate the first instance of a pattern in a string.

Syntax: str_remove(string, pattern)

  5. str_length(): This function gives the number of characters in each element of a character vector.

Syntax: str_length(string)

  6. str_detect(): This function looks at each element of a character vector and determines whether it contains a match for a given pattern. For each element, it returns TRUE if the pattern is found and FALSE otherwise.

Syntax: str_detect(string, pattern)

With these tools at your disposal, you'll be able to tackle a wide array of text processing tasks. Let's proceed to the exercises and start putting these functions to work!
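As a quick illustration of the six functions listed above, here is a minimal sketch that applies each one to a small character vector. The values in x are made up for this example and are not taken from penguin_sample.

```r
library(stringr)

# Hypothetical example values; these are not the real penguin_sample data
x <- c("Biscoe, Antarctica", "Dream, Antarctica")

str_sub(x, 1, 3)                # first three characters of each element
str_to_lower(x)                 # every letter converted to lowercase
word(x, 1, sep = fixed(", "))   # first piece before the ", " separator
str_remove(x, ", Antarctica")   # drop the first match of the pattern
str_length(x)                   # number of characters in each element
str_detect(x, "Dream")          # TRUE where the pattern is found
```

Each call returns a vector with one result per element of x, which is why these functions fit naturally into new data frame columns.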

Our Dataset

We will work with a dataset of size measurements for 3 different species of penguins on islands in the Palmer Archipelago, Antarctica. We'll focus on the measurements for bill length (mm) and body mass (g), as well as the location and year in which the data were collected. Here's more information about the data set!

Here is a preview of the data you will be working with:

Exercises

Here are some practice exercises. You can check if your code works using the "check your work" button after each problem.

If you get stuck, you can also click for hints and for an example answer. Remember, there are many ways to solve each problem, so the code in the "answer" box isn't the only way to solve it!

Exercise 1

Create a variable called 'location_lower' that converts location in the penguin_sample data frame to lowercase using str_to_lower().

To check your work, run the following code in the code chunk below.

str(penguin_sample$location_lower)

This code will run successfully and show a few example values in lower case if the variable was properly created.

Use the str_to_lower() function on the 'location' column of 'penguin_sample'.

penguin_sample$location_lower <- str_to_lower(penguin_sample$_____)
penguin_sample$location_lower <- str_to_lower(penguin_sample$location)

Exercise 2

Sometimes we may decide the input in one observation is too long. Use str_length() to check how many characters there are in the first entry of the location column in our penguin_sample data set.

Use the str_length() function to count the number of characters in each entry of a vector. Then use bracket notation to indicate the first entry.

str_length(____$____)[___]
str_length(penguin_sample$location)[1]
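Because str_length() returns one count per element, the same result can be used to flag every entry above a length limit, not just the first. The sketch below assumes made-up location values and an arbitrary limit of 20 characters; neither comes from the tutorial's data.

```r
library(stringr)

# Hypothetical values, not the real penguin_sample$location column
loc <- c("Dream, Antarctica", "Torgersen, Palmer Archipelago, Antarctica")

n_chars <- str_length(loc)   # one character count per element
n_chars[1]                   # count for the first entry only
loc[n_chars > 20]            # entries longer than the (arbitrary) limit
```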

Exercise 3

Split the location variable in the penguin_sample data frame into separate elements using the word() function, with the comma as the separator. In particular, create a new column called island that extracts only the island name (without Palmer Archipelago, Antarctica).

To check your work, run the following code in the code chunk below.

str(penguin_sample$island)

If your code worked, you should see your new island variable with this code.

Use the word() function and specify the comma as the separator. Extract the first element by specifying where to start.

penguin_sample$island <- word(______$______, start=___, sep="___")
penguin_sample$island <- word(penguin_sample$location, start=1, sep=",")
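One detail worth knowing when splitting on a comma: word() keeps whatever sits between separators, so pieces after the first keep their leading space unless the separator includes it. The sketch below uses a made-up location string (an assumption, not a value from penguin_sample) to show this, along with a negative start that counts from the end.

```r
library(stringr)

# Hypothetical value, not taken from penguin_sample
loc <- "Biscoe, Palmer Archipelago, Antarctica"

word(loc, start = 1, sep = ",")            # "Biscoe"
word(loc, start = 2, sep = ",")            # " Palmer Archipelago" (leading space kept)
word(loc, start = 2, sep = fixed(", "))    # "Palmer Archipelago"
word(loc, start = -1, sep = fixed(", "))   # "Antarctica" (negative counts from the end)
```

For the first piece the choice of "," or ", " makes no difference, which is why the exercise answer works with a bare comma.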

Exercise 4

Create a new variable called location_short that removes the phrase "Antarctica" from all locations in the penguin_sample data frame using str_remove().

To check your work, run the following code in the code chunk below.

str(penguin_sample$location_short)

If your code worked, this code will show you the structure of your new variable.

The str_remove function needs a pattern to match what should be removed.

penguin_sample$location_short <- str_remove(__$___, "_____")
penguin_sample$location_short <- str_remove(penguin_sample$location, ", Antarctica")
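Two behaviors of str_remove() are easy to overlook: it removes only the first match in each string, and matching is case-sensitive. The following sketch uses made-up strings (assumptions for illustration) and contrasts it with str_remove_all(), which removes every match.

```r
library(stringr)

x <- "Antarctica, Antarctica"   # hypothetical value with a repeated word

str_remove(x, "Antarctica")       # ", Antarctica" (only the first match is removed)
str_remove_all(x, "Antarctica")   # ", " (every match is removed)

# Matching is case-sensitive, so a lowercase pattern leaves this string unchanged
str_remove("Dream, Antarctica", ", antarctica")   # "Dream, Antarctica"
```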

Exercise 5

Let's double check your work on Exercise 4. Use str_detect() to check every value in your location_short column and confirm that "Antarctica" does not appear in any entry. If the removal worked, every result should be FALSE.

str_detect(____$_____, "____")
str_detect(penguin_sample$location_short, "Antarctica")
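With many rows, scanning a long vector of TRUE and FALSE values by eye is error-prone. One possible approach is to summarize the result with base R's any() or sum(). The vector below is a hypothetical stand-in for penguin_sample$location_short.

```r
library(stringr)

# Hypothetical stand-in for penguin_sample$location_short
location_short <- c("Biscoe", "Dream", "Torgersen")

has_antarctica <- str_detect(location_short, "Antarctica")
has_antarctica        # FALSE FALSE FALSE
any(has_antarctica)   # FALSE: no entry still contains the word
sum(has_antarctica)   # 0: number of entries that still match
```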

Exercise 6

Create a variable called year_short that extracts the last two characters of the year using str_sub().

To check your work, run the following code in the code chunk below.

str(penguin_sample$year_short)

If your code worked, the year should display only its last two digits.

Use str_sub(). A position of -1 stands for the last character.

penguin_sample$year_short <- str_sub(____, ____, ____)
penguin_sample$year_short <- str_sub(penguin_sample$year, 3, -1)
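Negative positions in str_sub() count backward from the end of the string, so -2 is the second-to-last character. The sketch below assumes the years are stored as four-character text (the values are made up) and compares counting from the start with counting from the end.

```r
library(stringr)

# Hypothetical values, assumed to be stored as text
year <- c("2007", "2008", "2009")

str_sub(year, 3, -1)     # "07" "08" "09": third character through the last
str_sub(year, -2, -1)    # "07" "08" "09": the last two characters

# The two forms differ when a string is not four characters long
str_sub("987", 3, -1)    # "7"
str_sub("987", -2, -1)   # "87"
```

Both forms give the same answer for four-digit years; counting from the end is the more robust choice if the length of the values might vary.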

Congratulations, you finished this tutorial!