Practice Problems 1.3 Key

First we need to load the penguin data set, just like we have before. The data set will be called penguins. These data were collected by real scientists! They were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network.

library(palmerpenguins)
penguins<-palmerpenguins::penguins
penguins

# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>
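
Row 4 above is all NA, so it can help to check where the missing values are before plotting. The following is a minimal sketch using base R; the name naCounts is only for illustration.

```r
library(palmerpenguins)

# Count the missing values (NA) in each column of the penguins data
naCounts <- colSums(is.na(penguins))
naCounts
```

The columns with missing values are the ones that need na.rm = TRUE when calculating averages later on.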

library(tidyverse) # to make tidyverse commands available

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

There are three different species of penguins in this data set. We can see from the photo below that they may have different body dimensions. We will be using data visualizations to explore some of these differences.

Create a histogram of body mass for all penguin species. Using comments, write a description of what this histogram shows.

ggplot(data = penguins, mapping=aes(x=body_mass_g)) +
  geom_histogram() +
  labs(title="Distribution of body mass for all species", 
       x = "body mass (g)", 
       y = "frequency")

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_bin()`).

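# This histogram shows the distribution of body_mass_g for all penguins combined. Masses range from about 2,700 g to 6,300 g, and the distribution is right-skewed: most penguins fall toward the lower end, with a tail of heavier birds.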
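
The stat_bin() message above suggests choosing a binwidth instead of relying on the default 30 bins. Here is one possible sketch. The value 200 (grams per bar) is an assumed choice, not the only reasonable one, and massHist is an illustrative name. Setting na.rm = TRUE in the geom drops the missing values without a warning.

```r
library(palmerpenguins)
library(ggplot2)

# binwidth = 200 is an assumed choice (200 g per bar); try other values too
massHist <- ggplot(data = penguins, mapping = aes(x = body_mass_g)) +
  geom_histogram(binwidth = 200, na.rm = TRUE) +
  labs(title = "Distribution of body mass for all species",
       x = "body mass (g)",
       y = "frequency")
massHist
```

Smaller binwidths show more detail but look noisier, and larger ones smooth the shape, so it is worth trying a few.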

Create a histogram of body mass, with each species in a different color. What does this show us about the different species? Which species do you think has the greatest body mass?

ggplot(data = penguins, mapping=aes(x=body_mass_g, fill=species)) +
  geom_histogram() +
  labs(title="Distribution of body mass by species", 
       x = "body mass (g)", 
       y = "frequency")

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_bin()`).

# Gentoo tend to have a larger body mass than Adelie and Chinstrap penguins.
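
By default, geom_histogram() stacks the bars for each fill group on top of each other, which can make it hard to see the shape of each species on its own. One alternative, sketched below, is to overlap the histograms with some transparency. The alpha and binwidth values are assumptions chosen for illustration.

```r
library(palmerpenguins)
library(ggplot2)

# position = "identity" overlaps the three histograms instead of stacking them
# alpha = 0.5 makes the bars semi-transparent so all species stay visible
speciesHist <- ggplot(data = penguins,
                      mapping = aes(x = body_mass_g, fill = species)) +
  geom_histogram(binwidth = 200, position = "identity",
                 alpha = 0.5, na.rm = TRUE) +
  labs(title = "Distribution of body mass by species",
       x = "body mass (g)",
       y = "frequency")
speciesHist
```

Adding facet_wrap(~ species) instead would give each species its own panel, which is another reasonable way to compare them.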

Now let’s find out! Create a bar graph with the average body mass for each penguin species. (Don’t forget about the NAs in the data set.) Which one has the greatest average body mass? How does that compare with what you thought looking at the histogram?

penguinMass <- penguins %>% 
  group_by(species) %>% 
  summarize(avgMass = mean(body_mass_g, na.rm=TRUE))

ggplot(data=penguinMass, mapping=aes(x=species, y=avgMass, fill = species)) +
  geom_bar(stat="identity")+
  labs(title="Mass by Species", 
       y = "Average body mass (g)", 
       x = "species")
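
To answer "which one has the greatest average body mass?" directly, the summary table can be sorted before plotting. The sketch below also uses geom_col(), which is ggplot2's shorthand for geom_bar(stat = "identity"). The name massBars is only for illustration.

```r
library(palmerpenguins)
library(dplyr)
library(ggplot2)

# Average body mass per species, ignoring missing values
penguinMass <- penguins %>%
  group_by(species) %>%
  summarize(avgMass = mean(body_mass_g, na.rm = TRUE))

# Sort from heaviest to lightest so the top row answers the question
penguinMass %>% arrange(desc(avgMass))

# geom_col() plots the values as given, like geom_bar(stat = "identity")
massBars <- ggplot(data = penguinMass,
                   mapping = aes(x = species, y = avgMass, fill = species)) +
  geom_col() +
  labs(title = "Mass by Species",
       y = "Average body mass (g)",
       x = "species")
massBars
```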

Create a visualization that will help answer the question: Do heavier penguins have longer flippers? Think about how many variables you have and the best way to present this data. Color by species.

ggplot(data = penguins, mapping = aes(x=body_mass_g, y = flipper_length_mm, color = species)) +
  geom_point() +
  labs(title = "Heavier penguins tend to have longer flippers", 
       x = "body mass (g)", 
       y = "flipper length (mm)")

Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
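
The scatter plot suggests a positive relationship. One way to put a number on it is the correlation coefficient, sketched here with base R. This is only a rough check, since correlation measures linear association, and massFlipperCor is an illustrative name.

```r
library(palmerpenguins)

# use = "complete.obs" skips rows where either variable is NA
massFlipperCor <- cor(penguins$body_mass_g,
                      penguins$flipper_length_mm,
                      use = "complete.obs")
massFlipperCor
```

A value close to 1 means heavier penguins tend to have longer flippers, which matches the upward trend in the plot.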

Create a data visualization to explore the question: Do penguins with longer bills tend to have longer flippers as well? Make sure to give the points either different colors or shapes based on the species.

ggplot(data = penguins, 
       mapping = aes(x=bill_length_mm, 
                     y = flipper_length_mm, 
                     shape = species, 
                     color=species)) +
  geom_point() +
  labs(title = "Penguins with longer bills tend to have longer flippers", 
       x = "bill length (mm)", 
       y = "flipper length (mm)")

Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
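
Because the species form separate clusters in this plot, it can be useful to check whether the trend also holds within each species. A possible sketch adds one straight trend line per species with geom_smooth(). The linear model (method = "lm") is an assumption made for simplicity, and billFlipper is an illustrative name.

```r
library(palmerpenguins)
library(ggplot2)

billFlipper <- ggplot(data = penguins,
                      mapping = aes(x = bill_length_mm,
                                    y = flipper_length_mm,
                                    shape = species,
                                    color = species)) +
  geom_point(na.rm = TRUE) +
  # one straight line per species; se = FALSE hides the confidence band
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, na.rm = TRUE) +
  labs(title = "Bill length vs. flipper length, with a trend line per species",
       x = "bill length (mm)",
       y = "flipper length (mm)")
billFlipper
```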

Create a bar graph that shows the average body mass by sex.

penguinMassBySex <- penguins %>% 
  group_by(sex) %>% 
  summarize(avgMass = mean(body_mass_g, na.rm=TRUE))

ggplot(data=penguinMassBySex, mapping=aes(x=sex, y=avgMass, fill = sex)) +
  geom_bar(stat="identity")+
  labs(title="Mass by Sex", 
       y = "Average body mass (g)", x = "sex")
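
Some penguins have no recorded sex, so the summary above includes a third group (NA) and the bar graph shows an extra NA bar. If that bar is not wanted, one option is to filter those rows out first, as in this sketch. The name sexBars is only for illustration.

```r
library(palmerpenguins)
library(dplyr)
library(ggplot2)

penguinMassBySex <- penguins %>%
  filter(!is.na(sex)) %>%   # keep only penguins with a recorded sex
  group_by(sex) %>%
  summarize(avgMass = mean(body_mass_g, na.rm = TRUE))

sexBars <- ggplot(data = penguinMassBySex,
                  mapping = aes(x = sex, y = avgMass, fill = sex)) +
  geom_col() +
  labs(title = "Mass by Sex",
       y = "Average body mass (g)", x = "sex")
sexBars
```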

Create one figure that has three bar graphs: comparing average body mass by sex AND species. If you need a hint, please ask!

penguinMassBySexAndSp <- penguins %>% 
  group_by(sex, species) %>% 
  summarize(avgMass = mean(body_mass_g, na.rm=TRUE)) %>% 
  na.omit()

`summarise()` has grouped output by 'sex'. You can override using the `.groups`
argument.

ggplot(data=penguinMassBySexAndSp, mapping=aes(x=sex, y=avgMass, fill = sex)) +
  geom_bar(stat="identity") + 
  labs(title="Mass by Sex and Species", y = "Average body mass (g)", x = "sex") +
  facet_wrap(~ species)
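
The summarise() message above appears because the result is still grouped by sex after summarizing. As a sketch of one way to avoid it, the .groups = "drop" argument returns an ungrouped table. Here the rows with missing sex are filtered out before summarizing, as an alternative to calling na.omit() afterwards.

```r
library(palmerpenguins)
library(dplyr)
library(ggplot2)

penguinMassBySexAndSp <- penguins %>%
  filter(!is.na(sex)) %>%
  group_by(sex, species) %>%
  # .groups = "drop" removes the grouping and silences the message
  summarize(avgMass = mean(body_mass_g, na.rm = TRUE), .groups = "drop")
penguinMassBySexAndSp

# One panel per species, two bars (female, male) in each
facetBars <- ggplot(data = penguinMassBySexAndSp,
                    mapping = aes(x = sex, y = avgMass, fill = sex)) +
  geom_col() +
  labs(title = "Mass by Sex and Species",
       y = "Average body mass (g)", x = "sex") +
  facet_wrap(~ species)
facetBars
```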

There are so many options in ggplot for customizing your plots and so many interesting relationships between variables to explore! Create any data visualization you like of the palmer penguin data set, customizing it in at least one way. Here are some online resources for customization: https://ggplot2-book.org/polishing.html and http://www.sthda.com/english/wiki/be-awesome-in-ggplot2-a-practical-guide-to-be-highly-effective-r-software-and-data-visualization (scroll to the bottom to see their style guide).

Think about how many variables you are graphing (one or two), what kind of variables they are (categorical or numerical), and what question your viz will explore!
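
There is no single right answer here. As one possible example (not the only acceptable one), the sketch below explores the question "Does flipper length differ between species?" using one categorical and one numerical variable. The customizations are a built-in theme, a ColorBrewer palette, and a hidden legend. The palette and theme are arbitrary choices.

```r
library(palmerpenguins)
library(ggplot2)

flipperBox <- ggplot(data = penguins,
                     mapping = aes(x = species, y = flipper_length_mm,
                                   fill = species)) +
  geom_boxplot(na.rm = TRUE) +
  scale_fill_brewer(palette = "Dark2") +   # customization: color palette
  theme_minimal() +                        # customization: cleaner theme
  theme(legend.position = "none") +        # legend repeats the x axis, so hide it
  labs(title = "Flipper length by species",
       x = "species",
       y = "flipper length (mm)")
flipperBox
```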

Create a pie chart showing the percentage of the data set each penguin species comprises. (You will definitely need to use Google.) In data science, are pie charts a good idea? Take a look here, and explain your answer.

# first we need to calculate the number of penguins of each species in the dataset
penguinCounts <- penguins %>% group_by(species) %>% summarise(number = n())
penguinCounts

# A tibble: 3 × 2
  species   number
  <fct>      <int>
1 Adelie       152
2 Chinstrap     68
3 Gentoo       124

# now we need to divide each of those by the total to get each species' share (stored as a proportion between 0 and 1)
penguinCounts$perc <- penguinCounts$number / nrow(penguins)
penguinCounts

# A tibble: 3 × 3
  species   number  perc
  <fct>      <int> <dbl>
1 Adelie       152 0.442
2 Chinstrap     68 0.198
3 Gentoo       124 0.360

# next, make a single stacked bar plot with the data
penguinBP <- ggplot(penguinCounts, aes(x="", y=perc, fill = species)) +
  geom_bar(width=1, stat = "identity") +
  labs(title="Species of Penguin in this Sample", x="", y="proportion of sample")

penguinBP

# and finally change that to a pie chart
penguinPieChart <- penguinBP + coord_polar("y")
penguinPieChart

# Generally speaking, pie charts aren't a great choice for data visualizations, because humans aren't good at comparing angles. Differences in bar charts are much easier for humans to interpret.
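
To illustrate the point about bar charts, the same shares can be drawn as side-by-side bars. This is a sketch, assuming the scales package (installed along with ggplot2) is available for the percent labels. The names speciesShare and shareBars are only for illustration.

```r
library(palmerpenguins)
library(dplyr)
library(ggplot2)

# Count each species and convert the counts to proportions of the total
speciesShare <- penguins %>%
  count(species) %>%
  mutate(perc = n / sum(n))

shareBars <- ggplot(data = speciesShare,
                    mapping = aes(x = species, y = perc, fill = species)) +
  geom_col() +
  # show the y axis as percentages instead of proportions
  scale_y_continuous(labels = scales::percent) +
  labs(title = "Species of Penguin in this Sample",
       x = "species",
       y = "percentage of sample")
shareBars
```

With bars on a common baseline, it is easy to compare the species' shares directly, a comparison that is harder to make from pie slices.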