Introduction to stringr Package
Welcome
In this tutorial, you will delve into the manipulation of text data with the stringr package in the tidyverse. Text data is ubiquitous and manipulating strings efficiently is a fundamental skill in data analysis. stringr provides a cohesive set of functions designed to make string manipulation easy and consistent.
Functions in Focus
Let's familiarize ourselves with the essential functions from stringr that we'll explore in this tutorial, along with their syntax:
str_sub(): This function allows you to extract substrings in a character vector.
Syntax: str_sub(string, start_position, end_position)
str_to_lower(): Use this function to transform text to all lowercase.
Syntax: str_to_lower(string)
word(): This function will extract a specific word from a string (or character vector).
Syntax: word(string, start, end, sep=fixed("___"))
str_remove(): Use str_remove() to eliminate the first instance of a pattern in a string.
Syntax: str_remove(string, pattern)
str_length(): This function will give the number of characters in each element of a character vector.
Syntax: str_length(string)
str_detect(): This function looks at each element of a character vector and determines whether it matches a given pattern. For each element, it outputs TRUE or FALSE depending on whether that element matches the pattern.
Syntax: str_detect(string, pattern)
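To make the syntax above concrete, here is a minimal sketch that calls each function once. The vector places is a made-up example for illustration, not the tutorial's dataset, and the separator ", " is an assumption about how such strings might be delimited.

```r
library(stringr)

# Hypothetical example strings (an assumption, not the tutorial's data)
places <- c("Biscoe, Palmer Archipelago, Antarctica",
            "Dream, Palmer Archipelago, Antarctica")

str_sub(places, 1, 6)               # characters 1 through 6 of each string
str_to_lower(places)                # every letter converted to lowercase
word(places, 1, sep = fixed(", "))  # first piece when splitting on ", "
str_remove(places, ", Antarctica")  # drops the first match of the pattern
str_length(places)                  # number of characters in each element
str_detect(places, "Dream")         # TRUE where the pattern occurs
```

Every call returns a vector with one result per input element, which is why these functions fit naturally into data frame columns.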
With these tools at your disposal, you'll be able to tackle a wide array of text processing tasks. Let's proceed to the exercises and start putting these functions to work!
Our Dataset
We will work with a dataset of size measurements for 3 different species of penguins on islands in the Palmer Archipelago, Antarctica. We'll be focusing on the measurements for bill length (mm) and body mass (g), as well as the location and year in which the data were collected. Here's more information about the data set!
Here is a preview of the data you will be working with:
Exercises
Here are some practice exercises. You can check if your code works using the "check your work" button after each problem.
If you get stuck, you can also click for hints and for an example answer. Remember, there are many ways to solve each problem, so the code in the "answer" box isn't the only way to solve it!
Exercise 1
Create a variable called location_lower that converts location in the penguin_sample data frame to lowercase using str_to_lower().
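If adding a column to a data frame is unfamiliar, the general pattern can be sketched on a toy example. The data frame toy and its column names are hypothetical and chosen only for illustration; they are not part of penguin_sample.

```r
library(stringr)

# Hypothetical toy data frame; names are made up for this sketch
toy <- data.frame(place = c("Biscoe", "DREAM", "Torgersen"))

# Assigning to toy$place_lower creates the column if it does not exist yet
toy$place_lower <- str_to_lower(toy$place)

toy
```

The original column is left unchanged, so the lowercase version can be compared against it.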
Exercise 2
Sometimes we may decide the input in one observation is too long. Use str_length() to check how many characters there are in the first entry of the location column in our penguin_sample data set.
Exercise 3
Split the Island variable in the penguin_sample data frame into separate elements using word(), based on the comma separator. In particular, create a new column called island that extracts only the island name (without Palmer Archipelago, Antarctica).
Exercise 4
Create a new variable called location_short that removes the phrase "Antarctica" from all locations in the penguin_sample data frame using str_remove().
Exercise 5
Let's double-check your work on Exercise 4. Use str_detect() to check all the values in your location_short column and ensure that "Antarctica" does not appear in any of the entries in that column.
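Because str_detect() returns one TRUE or FALSE per element, a long column can be hard to inspect by eye. One possible approach, sketched here on a hypothetical vector named cleaned rather than on the tutorial's data, is to summarize the result with base R's any() or sum().

```r
library(stringr)

# Hypothetical vector standing in for an already cleaned column
cleaned <- c("Biscoe, Palmer Archipelago, ", "Dream, Palmer Archipelago, ")

hits <- str_detect(cleaned, "Antarctica")  # one TRUE/FALSE per element
any(hits)   # FALSE means no element contains the pattern
sum(hits)   # number of elements that still contain the pattern
```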
Exercise 6
Create a variable called year_short that extracts the last two characters of the year using str_sub().
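As a hint on counting from the end of a string, str_sub() accepts negative positions, where -1 refers to the last character. The sketch below uses a made-up vector of years and converts it to character explicitly; whether the year column in penguin_sample is stored as a number or as text is not stated here, so that conversion is an assumption.

```r
library(stringr)

# Hypothetical years, converted to character before extracting substrings
years <- as.character(c(2007, 2008, 2009))

# Start at the second-to-last character and end at the last one
str_sub(years, -2, -1)
```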
Congratulations, you finished this tutorial!