Analyzing external data set

Author

Ethan Davis, Keaton Wilson, Ellen Bledsoe

Learning Outcomes

  • Students will be able to use some functions of the tidyverse: select, filter, mutate, summarize and the pipe %>%.
  • Students will be able to explore and summarize complex, real-world data

Functions used in this lesson

  • read_csv()

  • select()

  • filter()

  • glimpse()

  • unique()

  • length()

  • group_by()

  • summarize()

  • mutate()

  • arrange()

More practice exploring and summarizing data

# load the tidyverse
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

To review the tidyverse syntax, we’re going to use a real data set on weather measurements in Antarctica. It contains weather measurements from many climate stations over the course of a year.

# read in the data file
# `read_csv()` is part of the `tidyverse` and gives us nice options when reading in data

# Let's take a look at the climate data.

Unlike the data we’ve seen so far, this data set is BIG. While we could reasonably look at the data in our team data set, it would be impossible for us to look at all 139,160 rows of this climate data. This is where R and the tidyverse can be really powerful!

The tidyverse offers a great function called glimpse() that lets us take a quick look at the data set. (Other similar functions are str() and head())

# explore the data set with glimpse


# str(data)


# head(data, n) gives us the first n rows

select() columns

Now let’s introduce a new function, select(). The utility of select isn’t entirely new to you. It operates in a similar way to the $, but with a bit more power. Select allows us to pick out specific columns from our data. You can use names or their position in the data frame. Unlike with $, with select, we can easily select multiple columns.

First, a quick reminder of how to use $

# column selection in base R (using $)

Now let’s see how the same can be accomplished with select(). The first argument in the function is the data frame. Any following arguments are the columns we want to select.

select(.data, column(s))

# first argument is the data frame, then the columns


# multiple columns
#select columns by name


# select adjacent columns


# ignore columns (everything but)

Notice that unlike using $, using select() returns a formatted data frame. With larger data sets, this becomes especially useful.

Let’s practice!

Write a line of code to select the following columns from the stationData: temp, wind_speed, humidity . Then, select everything but the wind_direction column.

#select


#select

filter() rows

To review, filter() allows you filter rows by certain conditions. Rows the meet the conditions are kept, while rows that do not are “filtered” out.

Let’s first review relational operators, which make filter() work. The ones that are relevant for today are == ,!= , > , and >=. You can think about these operators as true or false questions.

x == y - “is x equal to y?”

x != y - “is x unequal to y?”

x > y - “is x greater than y?”

x >= y - “is x greater than or equal to y?”

These expressions will evaluate to either TRUE or FALSE. Here are some examples:

# does 2 equal 2?
2==2
[1] TRUE
# is x unequal to y?
x<-3
y<-4
x!=y
[1] TRUE
# is x greater than y?
x>y
[1] FALSE
#is x equal to y?
x==y
[1] FALSE
# is "apple" equal to "orange"?
x<-"apple"
y<-"orange"
x==y
[1] FALSE
"a">"b"
[1] FALSE

This language of true and false will help us create conditions on which to filter our data.

With a sufficiently large data set though, it may be difficult to determine what to filter on.

The unique() function tells us all the unique values in an vector/column. When you have an enormous data set like this, you won’t be able to gather all the possible values of a variable manually. For instance, we might like to know: “What years are represented in our data?”, or “What are the IDs of the various stations?”

# get unique values of the year column


# get unique values of the month column
# get unique values of the station_id column

Another useful function we’ll introduce here is length(). Length simply counts the number of entries in a column or vector. We can use it in combination with unique() to find how many unique values exist in a column. So, we could then answer the question: “How many stations are there?”

# get unique values of the station_id column

Now, filter the data so that it only contains one station.

# filter for a single station


# worth noting here that we haven't saved any of this. We ought to write to a new object

Let’s practice using select() and filter()

Working with the climate data, construct a small set of code that does the following:

  1. Slims down the full data frame to one that contains the columns month, temp and station_id. Assign this to an object called slim.
  2. Filters the data for ag4q3h with an average temperature less than -22 degrees.
  3. Name this new data frame cold
# not piped

The pipe %>%

You can use the pipe operator to chain tidyverse functions together. You can think of the pipe as automatically sending the output from the first line into the next line as the input.

This is helpful for a lot of reasons, including:

  1. removing the clutter of creating a lot of intermediate objects in your work space, which reduces the chance of errors caused by using the wrong input object
  2. makes things more human-readable (in addition to computer-readable)

Now, let’s try the same as above, but using the pipe:

#with pipe

Let’s practice!

In small groups, use pipes to create a new data frame called singleStation that includes the following:

  • the columns day, temp, station_id
  • only rows with temperatures that are greater than -20 degrees

Creating new variables with mutate()

Sometimes our data aren’t in exactly the format we want. For example, we might want our temperature data in Fahrenheit instead of Celsius.

The tidyverse has a function called mutate() that lets us create a new column. Often, we want to apply a function to the entire column or perform some type of calculation, such as converting temp from F to C.

To help us out, here is the equation for converting: Fahrenheit = Celcius * (9/5) + 32

# create a new column for temps in Fahrenheit

Understanding data through summarize()

Like we have talked about in previous classes, one of the best ways for us to understand our data is through what we call summary statistics such as the mean, standard deviation, etc.

Summary statistics are particularly useful for large data sets, because we cannot navigate the data manually. Along with data visualization (covered soon!), summary statistics are essential to understanding large data sets.

# first attempt at mean and sd of average temperature

The above gives us an impression of the climate of the whole region contained by all of the stations. However, this wouldn’t really help us choose an ideal location for us to inhabit, or tell us which areas we might avoid.

Grouping data by variable values gives us a clearer picture of specific subsets of the data. For instance, we might like to know the average measurements for specific locations.

The group_by() function lets us do this. It is most often used in combination with summarize().

We can use this method to calculate the mean temperatures of each station_id instead of the overall mean of the entire data set.

Next, use arrange() function, which can rearrange our data in numerical order by a specific column. For instance, we could use it with the code below to find the stations with the highest and lowest average temperatures.

#group_by summarize arrange

One great thing about group_by() is that you can give it multiple columns. This allows us to have groups within groups. For instance, we could organize our data by station_id and then break up data for each station by month. That’s a mouthful, so let’s just take a look at it:

#group_by multiple columns

What are good/bad columns to group by? Try summarizing the mean temperature, grouping by different combinations of columns. Talk with your neighbors about which columns might or might not work well.