Continue gaining practice loading and using data sets in R
Create subsets of data based upon certain conditions (a.k.a. filtering)
Learn how to generate summary statistics using the group_by/summarize functions
Loading Data
Let’s start by reviewing some concepts from the last lesson:
# we need to make the tidyverse available with the library function:
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# load the dataset to a variable
teamAntarcticaData <- read_csv("teamAntarcticaData.csv")
Rows: 75 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): Timestamp, school, swim, animals, parkaColor, teamFlag, distance
dbl (5): fishing, cold, remote, bedsideManner, cooking
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
In the last lesson, we looked at descriptive statistics for columns of data across the entire data set. But what if we were interested in answering more specific questions? Here’s one:
How does the cold tolerance differ for students at Lewis & Clark vs students at the University of Arizona?
One strategy is to find the average cold tolerance response of the LC students and compare it to the average for the UA students. To do that, we’ll need to create two subsets of the original data set using the filter function. The filter function works like this:
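Here is a minimal sketch of the general pattern: filter takes a data set as its first argument (.data) and a condition as its second, and keeps only the rows where the condition is TRUE. The toy tibble and the names toy and penguinToy are made up for illustration; they are not part of the survey data.

```r
library(dplyr)

# Hypothetical toy data standing in for the survey (not the real responses)
toy <- tibble(
  teamFlag = c("Penguin", "Seal", "Penguin"),
  cold     = c(4, 2, 5)
)

# General form: filter(.data = <data set>, <condition>)
# Rows where the condition is TRUE are kept; all other rows are dropped
penguinToy <- filter(.data = toy, teamFlag == "Penguin")
penguinToy
```

The original data set is left unchanged; filter returns a new, smaller one, which is why we assign the result to its own variable.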
Below, create a subset of Lewis & Clark students, assign it to the variable lcStudents, and print it to the screen:
# create lcStudents below
lcStudents <- filter(.data = teamAntarcticaData, school == "Lewis & Clark College")
lcStudents
# A tibble: 47 × 12
Timestamp school fishing swim cold animals remote parkaColor teamFlag
<chr> <chr> <dbl> <chr> <dbl> <chr> <dbl> <chr> <chr>
1 8/31/2022 15:2… Lewis… 2 Yes 4 Yes 1 White Sea Spi…
2 8/31/2022 15:3… Lewis… 1 Yes 3 Maybe 4 Blue Penguin
3 8/31/2022 15:4… Lewis… 5 Yes 5 Yes 5 Green Bear
4 8/31/2022 16:0… Lewis… 2 Yes 4 Yes 4 Blue Seal
5 8/31/2022 16:3… Lewis… 3 Yes 4 Yes 5 Black Penguin
6 8/31/2022 17:0… Lewis… 1 Yes 3 Yes 2 Blue Penguin
7 8/31/2022 17:0… Lewis… 4 Yes 5 Yes 5 White Sea Spi…
8 8/31/2022 17:0… Lewis… 5 Yes 5 Yes 5 Blue Bear
9 8/31/2022 17:0… Lewis… 2 Yes 4 Yes 4 Green Penguin
10 8/31/2022 17:0… Lewis… 2 Yes 2 Yes 3 Blue Seal
# ℹ 37 more rows
# ℹ 3 more variables: distance <chr>, bedsideManner <dbl>, cooking <dbl>
Once a matching subset of University of Arizona students (uaStudents) has been created the same way, we have two smaller data sets: one with U of A students (uaStudents) and one with LC students (lcStudents). How might we calculate the average cold tolerance of each? Using strategies from last week, try doing so below:
In the filter example above, we used “==” as part of our condition argument. The double equals is an example of a relational operator: one or more characters that compare two values and give back either TRUE or FALSE. Practically speaking, the double equals asks “is this field equal to this value?”. If the answer is TRUE, the row is included in the filtered data set.
Here are some other relational operators:
> (greater than)
< (less than)
<= (less than or equal to)
>= (greater than or equal to)
!= (not equal to)
In the filter function, relational operators are used to define a condition.
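To see a couple of these operators in action, here is a small sketch on made-up data. The tibble toy and its values are hypothetical; only the column names remote and parkaColor are borrowed from the survey.

```r
library(dplyr)

# Hypothetical toy data (values invented for illustration)
toy <- tibble(
  student    = c("A", "B", "C", "D"),
  remote     = c(1, 4, 2, 5),
  parkaColor = c("Blue", "Green", "Blue", "White")
)

# < keeps rows whose remote score is below 3 (students A and C)
lowRemote <- filter(.data = toy, remote < 3)
lowRemote

# != keeps rows whose parka color is anything other than "Blue" (students B and D)
notBlue <- filter(.data = toy, parkaColor != "Blue")
notBlue
```

Numbers are compared as numbers, while text values need quotes and must match exactly, including capitalization.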
Let’s say we’re interested in creating a subset of data that includes students with a self-reported aptitude in fishing of 4 or 5 (the students we should recruit to catch our fish). Create a subset of data called goodFishing that contains these students, and print it to the screen:
Let’s say we also want to create a subset of data that includes students who are not particularly strong swimmers. Create a subset of data below called nonSwimmers that includes students who did not answer “Yes” on the swimming question:
# A tibble: 7 × 12
Timestamp school fishing swim cold animals remote parkaColor teamFlag
<chr> <chr> <dbl> <chr> <dbl> <chr> <dbl> <chr> <chr>
1 8/30/2022 16:10… Unive… 1 I ca… 3 Maybe 2 Black Penguin
2 8/30/2022 16:10… Unive… 2 I ca… 5 Yes 4 Blue Penguin
3 8/30/2022 16:39… Unive… 2 No 1 Yes 4 White <NA>
4 8/31/2022 17:20… Lewis… 1 I ca… 4 Yes 2 Blue Penguin
5 8/31/2022 19:23… Lewis… 1 I ca… 4 No 2 Black Penguin
6 9/1/2022 12:46:… Lewis… 3 I ca… 3 Yes 3 Orange Sea Spi…
7 9/1/2022 14:08:… Lewis… 2 I ca… 2 Yes 4 <NA> Seal
# ℹ 3 more variables: distance <chr>, bedsideManner <dbl>, cooking <dbl>
Logical Operators
There may be cases in which we want to filter our dataset based on more than one condition. In these cases, we would use logical operators. Maybe we want to find the best University of Arizona chefs, or the students who want blue or orange parkas. Here are the main logical operators:
& (and)
| (or)
In the filter function, logical operators are used to join conditions together.
Here’s an example of how to use a logical operator with the filter function:
uaChefs <- filter(.data = teamAntarcticaData, school == "University of Arizona" & cooking >= 4)
uaChefs
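For comparison, here is a sketch showing & and | side by side on made-up data. The tibble toy and its values are hypothetical and only meant to show how the two operators differ.

```r
library(dplyr)

# Hypothetical toy data (values invented for illustration)
toy <- tibble(
  school   = c("University of Arizona", "University of Arizona",
               "Lewis & Clark College", "Lewis & Clark College"),
  cooking  = c(5, 2, 4, 3),
  teamFlag = c("Penguin", "Seal", "Bear", "Penguin")
)

# & keeps a row only when BOTH conditions are TRUE (here: row 1 only)
bothTrue <- filter(.data = toy, school == "University of Arizona" & cooking >= 4)
bothTrue

# | keeps a row when AT LEAST ONE condition is TRUE (here: rows 2 and 3)
eitherTrue <- filter(.data = toy, teamFlag == "Seal" | teamFlag == "Bear")
eitherTrue
```

Note that with | the column name has to be repeated on each side; each side must be a complete condition on its own.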
Below, use the filter function to create a data subset of students who want blue or orange parkas. Assign it to the variable blueOrangeParkas, and print to the screen.
The Tidyverse popularized a convention in R called the “pipe” (it originally comes from the magrittr package, which the Tidyverse loads for us):
%>%
The purpose of the pipe is to string functions and data together. You can think of it as sort of the glue that joins pieces of an assembly line together. Another way to think of it is to read it as “AND THEN”.
Below we can rewrite a command using the filter function with the pipe. After the assignment symbol (<-) we start with the data set, followed by the pipe, followed by the filter function. It’s common practice to put the pipe at the end of one command, then indent the command it’s pointing to on the next line. What’s different about the arguments in the filter function in this case?
uaStudents2 <- teamAntarcticaData %>%
  filter(school == "University of Arizona")
uaStudents2
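One way to convince ourselves that the pipe is only a different way of writing the same command is to run both versions and compare the results. This is a sketch on made-up data; toy, withArgument, and withPipe are hypothetical names.

```r
library(dplyr)

# Hypothetical toy data (values invented for illustration)
toy <- tibble(
  student = c("A", "B", "C"),
  cold    = c(5, 2, 4)
)

# Version 1: the data set is passed explicitly as the .data argument
withArgument <- filter(.data = toy, cold >= 4)

# Version 2: the pipe hands the data set to filter as its first argument,
# so only the condition is written inside the parentheses
withPipe <- toy %>%
  filter(cold >= 4)

# Both versions should produce the same data set
identical(withArgument, withPipe)
```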
Try using the pipe in the code chunk below to create a data subset of students who answered “Maybe” in the animals question (call the variable maybeAnimals). Print it to the screen as well.
# create and print maybeAnimals:
maybeAnimals <- teamAntarcticaData %>%
  filter(animals == "Maybe")
maybeAnimals
Generating summary statistics with group_by / summarize
One reason for introducing the %>% now is that it is instrumental for chaining together two functions that generate summary statistics by group:
group_by: a function that takes a data set and groups it by a variable/column
summarize: uses the grouped data set from group_by, and lets you define summary statistics columns for that group
Let’s say we want to calculate the mean and standard deviation of self-reported tolerance for cold, comparing Lewis & Clark to University of Arizona students. We sort of did this earlier, but let’s try it again using group_by / summarize:
# A tibble: 2 × 3
school avgCold sdCold
<chr> <dbl> <dbl>
1 Lewis & Clark College 3.43 0.950
2 University of Arizona 3.29 1.08
Let’s break down what’s going on here:
First declare our variable (coldSummary)
Initially assign it to the teamAntarcticaData data set
“Pipe” that data to group_by, where we choose to group the data by the school column
Then “pipe” that to summarize, where we define two new columns:
avgCold, set equal to mean(cold)
sdCold, set equal to sd(cold)
When printing out coldSummary, we see it’s a new data set with just summary statistics for cold tolerance, grouped by the school.
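The steps above can be sketched as follows. This is a reconstruction from the description, run on a made-up tibble (toy, with invented values) rather than the real survey data, so the numbers will not match the output shown earlier.

```r
library(dplyr)

# Hypothetical toy data (values invented for illustration)
toy <- tibble(
  school = c("Lewis & Clark College", "Lewis & Clark College",
             "University of Arizona", "University of Arizona"),
  cold   = c(4, 2, 3, 5)
)

coldSummaryToy <- toy %>%            # start with the data set, AND THEN
  group_by(school) %>%               # group the rows by school, AND THEN
  summarize(avgCold = mean(cold),    # one mean per group
            sdCold  = sd(cold))      # one standard deviation per group
coldSummaryToy
```

The result has one row per school and one column per summary statistic; swapping toy for teamAntarcticaData should give the same shape of output as the table shown above.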
Try using the group_by / summarize technique by finding the mean and standard deviation of self-reported cooking skill, comparing Lewis & Clark to University of Arizona students. Print to the screen.
# A tibble: 2 × 3
school avgCooking sdCooking
<chr> <dbl> <dbl>
1 Lewis & Clark College 3.13 1.12
2 University of Arizona 3.56 1.09
We can also use this technique to calculate percentages. Let’s say we want to know what percentage of students gave each response to the swimming question. We can calculate this by first finding the total number of rows (using nrow, below), and then dividing the per-group count from n() by that total in summarize. Note that the result is expressed as a proportion between 0 and 1 (for example, 0.08 means 8%):
# first calculate total rows, to be used as denominator for percentage
totalRows <- nrow(teamAntarcticaData)
# n() generates the count of responses per group
swimmingPercentage <- teamAntarcticaData %>%
  group_by(swim) %>%
  summarise(percent = n() / totalRows)
swimmingPercentage
# A tibble: 3 × 2
swim percent
<chr> <dbl>
1 I can dog paddle 0.08
2 No 0.0133
3 Yes 0.907
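As a quick sanity check on this kind of calculation, the shares across all groups should add up to 1. The sketch below uses made-up data (toy and flagShare are hypothetical names) and also shows one way to turn the proportion into a percentage by multiplying by 100.

```r
library(dplyr)

# Hypothetical toy data (values invented for illustration)
toy <- tibble(
  teamFlag = c("Penguin", "Penguin", "Seal", "Bear")
)

# total number of rows, used as the denominator
totalRows <- nrow(toy)

flagShare <- toy %>%
  group_by(teamFlag) %>%
  summarize(proportion = n() / totalRows,        # share between 0 and 1
            percent    = 100 * n() / totalRows)  # same share on a 0 to 100 scale
flagShare

# the proportions across all groups should sum to 1
sum(flagShare$proportion)
```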
Below, calculate the percentage of responses for each of the different parka colors: