Introduction to the stringr Package
Welcome
In this tutorial, you will learn to manipulate text data with the stringr package in the tidyverse. Text data is ubiquitous, and manipulating strings efficiently is a fundamental skill in data analysis. stringr provides a cohesive set of functions designed to make string manipulation easy and consistent.
Functions in Focus
Let's familiarize ourselves with the essential functions from stringr that we'll explore in this tutorial, along with their syntax:
str_sub(): This function extracts substrings from each element of a character vector.
Syntax: str_sub(string, start_position, end_position)
str_to_lower(): Use this function to transform text to all lowercase.
Syntax: str_to_lower(string)
word(): This function extracts a specific word, or range of words, from each string in a character vector.
Syntax: word(string, start, end, sep=fixed("___"))
str_remove(): Use str_remove() to eliminate the first instance of a pattern in each string.
Syntax: str_remove(string, pattern)
str_length(): This function gives the number of characters in each element of a character vector.
Syntax: str_length(string)
str_detect(): This function checks each element of a character vector against a given pattern. For each element, it outputs TRUE if the pattern is found and FALSE otherwise.
Syntax: str_detect(string, pattern)
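To make the syntax above concrete, here is a minimal sketch that calls each of the six functions once. The vector x is a hypothetical example invented for illustration; it is not part of the tutorial's dataset.

```r
library(stringr)

# Hypothetical vector, used only to illustrate the six functions
x <- c("Adelie Penguin", "Gentoo Penguin", "Chinstrap Penguin")

first_three <- str_sub(x, 1, 3)           # "Ade" "Gen" "Chi"
lower       <- str_to_lower(x)            # "adelie penguin" "gentoo penguin" "chinstrap penguin"
first_word  <- word(x, 1)                 # default separator is a single space
no_suffix   <- str_remove(x, " Penguin")  # "Adelie" "Gentoo" "Chinstrap"
n_chars     <- str_length(x)              # 14 14 17
has_gentoo  <- str_detect(x, "Gentoo")    # FALSE TRUE FALSE
```

Each function is vectorized: it takes a character vector and returns one result per element.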
With these tools at your disposal, you'll be able to tackle a wide array of text processing tasks. Let's proceed to the exercises and start putting these functions to work!
Our Dataset
We will work with a dataset of size measurements for three species of penguins on islands in the Palmer Archipelago, Antarctica. We'll be focusing on the measurements for bill length (mm) and body mass (g), as well as the location and year in which the data were collected. Here's more information about the data set!
Here is a preview of the data you will be working with:
Exercises
Here are some practice exercises. You can check if your code works using the "check your work" button after each problem.
If you get stuck, you can also click for hints and for an example answer. Remember, there are many ways to solve each problem, so the code in the "answer" box isn't the only way to solve it!
Exercise 1
Create a variable called location_lower that converts location in the penguin_sample data frame to lowercase using str_to_lower().
Exercise 2
Sometimes we may decide the input in one observation is too long. Use str_length() to check how many characters there are in the first entry of the location column in our penguin_sample data set.
Exercise 3
Split the Island variable in the penguin_sample data frame into separate elements using word(), based on the comma separator. In particular, create a new column called island that extracts only the island name (without Palmer Archipelago, Antarctica).
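As a rough illustration of how word() behaves with a custom separator, consider the sketch below. The vector place is hypothetical, and the comma-plus-space format of its values is an assumption; the real column may be formatted differently.

```r
library(stringr)

# Hypothetical values; the real column may be formatted differently
place <- c("Torgersen, Palmer Archipelago, Antarctica",
           "Dream, Palmer Archipelago, Antarctica")

# sep is the text between pieces: here, a comma followed by a space
first_piece <- word(place, 1, sep = fixed(", "))   # "Torgersen" "Dream"

# Negative positions count from the end
last_piece <- word(place, -1, sep = fixed(", "))   # "Antarctica" "Antarctica"
```

Including the space in the separator matters: with sep = fixed(",") alone, every piece after the first would keep its leading space.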
Exercise 4
Create a new variable called location_short that removes the phrase "Antarctica" from all locations in the penguin_sample data frame using str_remove().
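Two details of str_remove() are worth seeing in a small sketch: it removes exactly the text that matches the pattern, and it removes only the first match in each string. The value of place is hypothetical and assumes a comma-separated location string.

```r
library(stringr)

# Hypothetical value, assuming a comma-separated location string
place <- "Biscoe, Palmer Archipelago, Antarctica"

word_only  <- str_remove(place, "Antarctica")    # "Biscoe, Palmer Archipelago, " (comma and space remain)
with_comma <- str_remove(place, ", Antarctica")  # "Biscoe, Palmer Archipelago"

# str_remove() drops only the first match; str_remove_all() drops every match
first_dash <- str_remove("a-b-c", "-")      # "ab-c"
all_dashes <- str_remove_all("a-b-c", "-")  # "abc"
```

Whether the leftover comma and space matter depends on how the result will be used.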
Exercise 5
Let's double check your work on Exercise 4. Use str_detect() to check all the values in your location_short column and ensure that "Antarctica" does not appear in any of the entries in that column.
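Because str_detect() returns one logical value per element, a long column is easier to check by summarizing the result. The sketch below shows one possible way, using base R's any() and sum(); the vector short is hypothetical.

```r
library(stringr)

# Hypothetical shortened locations
short <- c("Biscoe, Palmer Archipelago", "Dream, Palmer Archipelago")

hits <- str_detect(short, "Antarctica")  # FALSE FALSE

any_left <- any(hits)  # FALSE: no entry contains the word
n_left   <- sum(hits)  # 0: number of entries that still match
```

Note that the match is case-sensitive by default, so "antarctica" and "Antarctica" are treated as different patterns.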
Exercise 6
Create a variable called year_short that extracts the last two characters of the year using str_sub().
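Negative positions in str_sub() count backwards from the end of the string, which is convenient for taking the last few characters. The sketch below assumes the years are stored as numbers and converts them to text explicitly; the vector yr is hypothetical.

```r
library(stringr)

# Hypothetical years, assumed here to be stored as numbers
yr <- c(2007, 2008, 2009)

# -2 is the second-to-last character, -1 is the last
last_two <- str_sub(as.character(yr), -2, -1)  # "07" "08" "09"
```

The result is a character vector, not a number, so "07" keeps its leading zero.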
Congratulations, you finished this tutorial!