Introduction to GGPlot

In this tutorial, we’ll be looking at how to create plots using the ggplot2 package, whose main function is ggplot().

We’ll start by learning about how ggplot works.

Syntax

First, let’s consider the syntax when we write code with ggplot. Every plot starts with a call to ggplot(), which sets the data and the aesthetics. Further pieces, such as geometries, facets, and labels, are then added to that call. They are all connected using a +.

General syntax for ggplot():

  • ggplot(data, aes(x=variable1, y=variable2)) +

    geom_x()

    OR

  • dataframe |>

    ggplot(aes(x=variable1, y=variable2)) +

    geom_x()

Specifying variables: X is the explanatory variable and Y is the outcome variable
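To see that the two forms above really do the same thing, here is a small sketch. The data frame toy and its columns hours and score are made up for this example and are not part of the tutorial's dataset. layer_data() is a ggplot2 helper that returns the values a plot layer will draw.

```r
library(ggplot2)

# Hypothetical toy data; the column names are invented for this sketch
toy <- data.frame(
  hours = c(1, 2, 3, 4, 5),
  score = c(52, 60, 71, 75, 88)
)

# Form 1: the data frame is the first argument of ggplot()
p1 <- ggplot(toy, aes(x = hours, y = score)) +
  geom_point()

# Form 2: the data frame is piped in with the native pipe (R 4.1 or later)
p2 <- toy |>
  ggplot(aes(x = hours, y = score)) +
  geom_point()

# Compare what each plot will actually draw
identical(layer_data(p1), layer_data(p2))
```

Which form you use is a matter of taste. The piped form is handy when you want to filter or summarize the data first.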

Aesthetics

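Within aes(), you map variables in your data to visual properties such as the x and y position, color, fill, and size. Fixed settings that do not depend on the data, such as one point size, one color for every point, or the number of bins in a histogram, go inside the geometry function instead.

For example, the following code maps the x and y variables in aes(), then sets a fixed point size and color in geom_point(). Note that the color name must be in quotes:

ggplot(dataframe, aes(x=month, y=temperature)) + geom_point(size=5, color="red")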
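The difference between setting a color and mapping a color to a variable can be checked with a small sketch. The data frame toy and its site column are assumptions made up for this example; month and temperature follow the line above.

```r
library(ggplot2)

# Hypothetical toy data; `site` is an invented grouping column
toy <- data.frame(
  month = c(1, 2, 3, 4),
  temperature = c(30, 35, 48, 60),
  site = c("a", "b", "a", "b")
)

# Setting: every point gets the same fixed color and size
p_set <- ggplot(toy, aes(x = month, y = temperature)) +
  geom_point(size = 5, color = "red")

# Mapping: the color now depends on the values in the `site` column
p_map <- ggplot(toy, aes(x = month, y = temperature, color = site)) +
  geom_point(size = 5)

# Inspect the colors ggplot2 computed for the points in each plot
unique(layer_data(p_set)$colour)
unique(layer_data(p_map)$colour)
```

The first plot uses a single color, while the second uses one color per site and adds a legend for you.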

Geometry

You can then add one or more geometries to your plot, including:

  • geom_line(): Connects all the points in a plot

  • geom_smooth(method="lm", se=FALSE): Creates a line of best fit

  • geom_point(): Creates points (scatterplot) for each data point

  • geom_bar() : Creates a bar graph, aggregates data for you

    • Syntax: geom_bar(position="___") Can choose “stack” (bars on top of each other), “dodge” (bars side by side) or “fill” (bars stacked, scaled to 100%).
  • geom_col(): Creates a bar graph with pre-aggregated data

  • geom_histogram(): Creates a histogram

    • Syntax: geom_histogram(bins=X) Specify number of bins
  • geom_density(): Creates a density graph

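    • Syntax: geom_density(adjust=X) X multiplies the default bandwidth, so values above 1 give a smoother curve and values below 1 give a bumpier one
  • labs(): Adds a title and axis labels to your graph (not a geometry, but added with + in the same way)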

  • And more!

Other functions to use with GGplot

You can use the pipe to incorporate other tidyverse functions to organize your data before plotting.
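Here is a minimal sketch of that idea, assuming the dplyr package is installed. The data frame toy_penguins and its values are made up for this example; only the column names follow the ones used later in this tutorial.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data with invented measurements
toy_penguins <- data.frame(
  species = c("Adelie", "Adelie", "Gentoo", "Gentoo", "Chinstrap"),
  flipper_length = c(181, 190, 217, 221, 195),
  body_mass = c(3750, 3800, 5000, 5200, 3700)
)

# Keep only the heavier birds, then hand the filtered rows straight to ggplot()
heavy_plot <- toy_penguins |>
  filter(body_mass > 3750) |>
  ggplot(aes(x = flipper_length, y = body_mass)) +
  geom_point()

# The plot only contains the rows that passed the filter
nrow(layer_data(heavy_plot))
```

Note the change of connector: the data steps are joined with the pipe, and the plot pieces after ggplot() are joined with +.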

Our Dataset

We will work with a dataset of measurements for 3 different species of penguins on islands in the Palmer Archipelago, Antarctica. We’ll be focusing on species, location (island), bill length (mm), bill depth (mm), flipper length (mm), body mass (g), sex, and the year the data were collected. Here’s more information about the data set!

Here is a preview of the data you will be working with:

Exercises

Here are some practice exercises. You can check if your code works using the “check your work” button after each problem.

If you get stuck, you can also click for hints and for an example answer. Remember, there are many ways to solve each problem, so the code in the “answer” box isn’t the only way to solve it!

Exercise 1

In this exercise, you will create 3 scatter plots, one for each species of penguin, displaying penguins’ body mass against flipper length. You will also add a line of best fit to each plot.

The different parts of this exercise will walk you through the process of tidying the data and creating the scatter plot.

Part A

First, let’s clean the data and remove any rows with missing values. Create a new dataset called penguin_clean which doesn’t contain NA values.

To check your work, run the following code in the code chunk below.

n_miss(penguin_clean)

If your code works, this should print 0. (n_miss() comes from the naniar package and counts the missing values in a dataset.)

penguin_clean<- na.omit(_____)
penguin_clean<- na.omit(penguin)
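
If you want to see what na.omit() does without any extra packages, here is a small sketch on made-up data. The data frame toy_penguins is an assumption for this example. The check uses base R's is.na() in place of n_miss().

```r
# Hypothetical toy data with two incomplete rows
toy_penguins <- data.frame(
  species = c("Adelie", "Gentoo", "Chinstrap", "Adelie"),
  flipper_length = c(181, NA, 195, 190),
  body_mass = c(3750, 5000, NA, 3800)
)

# na.omit() drops every row that has an NA in any column
toy_clean <- na.omit(toy_penguins)

# Base R check: how many missing values are left?
sum(is.na(toy_clean))

# How many rows were dropped?
nrow(toy_penguins) - nrow(toy_clean)
```

Keep in mind that na.omit() removes a whole row even if only one measurement in it is missing.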

Part B

Now use your clean data to construct a scatter plot with flipper_length as the explanatory variable and body_mass as the outcome variable.

You’ll need to use ggplot, specify the data, the aesthetics (which includes the x and y variables) and then add a geometry to your plot to create points.

ggplot(_______, aes(x=____, y=____))+
  geom_ ______()
ggplot(penguin_clean, aes(x=flipper_length, y=body_mass))+
  geom_point()

#OR you could use the pipe to start with penguin_clean and then create your plot: 
#penguin_clean |> ggplot(aes(x=flipper_length, y=body_mass))+ geom_point()

Part C

Now, add a line of best fit to your plot.

Add one more geometry to your code from Part B that draws a line of best fit.

ggplot(penguin_clean, aes(x=flipper_length, y=body_mass))+
  geom_point()+ 
  geom_smooth(method=_____, se=_____)
ggplot(penguin_clean, aes(x=flipper_length, y=body_mass))+
  geom_point()+ 
  geom_smooth(method="lm", se=FALSE)

#OR you could use the pipe!

Part D

Finally, let’s separate the plot into 3 different plots, one for each species.

Use the function facet_wrap() and be sure to specify the variable to facet by. facet_wrap() is not a geometry, but you add it to your code with + in the same way.

#Here's the syntax: 

facet_wrap(~_____)
ggplot(penguin_clean, aes(x=flipper_length, y=body_mass))+
  geom_point()+ 
  geom_smooth(method="lm", se=FALSE)+ 
  facet_wrap(~species)

#OR you could use the pipe!
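
To check that facet_wrap() really splits the points into one panel per species, here is a small sketch. The data frame toy_penguins and its values are made up for this example. layer_data() returns the plotted points along with a PANEL column that says which panel each point lands in.

```r
library(ggplot2)

# Hypothetical toy data: three invented penguins per species
toy_penguins <- data.frame(
  species = rep(c("Adelie", "Chinstrap", "Gentoo"), each = 3),
  flipper_length = c(181, 186, 195, 192, 196, 201, 210, 217, 222),
  body_mass = c(3400, 3700, 3900, 3500, 3800, 4000, 4600, 5100, 5400)
)

faceted <- ggplot(toy_penguins, aes(x = flipper_length, y = body_mass)) +
  geom_point() +
  facet_wrap(~species)

# Count how many points fall in each panel
table(layer_data(faceted)$PANEL)
```

Any geometry you add after this, such as geom_smooth(), is also drawn separately within each panel.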

Exercise 2

Create a bar graph which displays the number of penguins in each species. Also show which islands each penguin species is on by adjusting the fill and position of the bar graphs.

Part A

First, create a bar graph that shows the number of penguins in each species.

geom_bar() is used when the data still needs to be aggregated.

geom_col() is used when the data is already aggregated.
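
The difference between the two can be seen in a short sketch on made-up data, assuming dplyr is installed. The data frame toy_penguins is an assumption for this example. count() is a dplyr function that tallies the rows in each group into a column called n.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data: one row per invented penguin
toy_penguins <- data.frame(
  species = c("Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo", "Chinstrap")
)

# geom_bar() counts the rows in each species for you
bar_plot <- ggplot(toy_penguins, aes(x = species)) +
  geom_bar()

# geom_col() expects the counts to exist already, so compute them first
species_counts <- toy_penguins |> count(species)
col_plot <- ggplot(species_counts, aes(x = species, y = n)) +
  geom_col()

# Compare the bar heights of the two plots
layer_data(bar_plot)$y
layer_data(col_plot)$y
```

Both approaches give the same picture. geom_bar() is shorter when you only need counts, while geom_col() works for any value you have already computed, such as a mean.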

# Try this: 
penguin_clean|>
  ggplot(aes(x=_____))+
    ______
penguin_clean|>
  ggplot(aes(x=species))+
    geom_bar()

Part B

Next, display which island each species is on by adjusting the fill of the graph.

This adjustment will go in the aesthetics section of your ggplot. Choose how you would like to fill the bars by using fill = ____.

penguin_clean|>
  ggplot(aes(x=species, fill=______))+
  geom_bar()
penguin_clean|>
  ggplot(aes(x=species, fill=island))+
  geom_bar()

Part C

Now that the bars are filled in by island, let’s adjust them again so that they don’t stack on top of each other. Instead, place all the bars next to each other.

Adjust the position of bars in the geom_bar() geometry.

penguin_clean|>
  ggplot(aes(x=species, fill=island))+
  geom_bar(position="______")
penguin_clean|>
  ggplot(aes(x=species, fill=island))+
  geom_bar(position="dodge")
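
If you are curious how the three position options differ, this sketch compares them on made-up data. The data frame toy_penguins is an assumption for this example. The ymax column from layer_data() is the top edge of each bar segment.

```r
library(ggplot2)

# Hypothetical toy data: 3 Adelie (1 Biscoe, 2 Dream) and 2 Gentoo (both Biscoe)
toy_penguins <- data.frame(
  species = c("Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo"),
  island  = c("Biscoe", "Dream", "Dream", "Biscoe", "Biscoe")
)

base_plot <- ggplot(toy_penguins, aes(x = species, fill = island))

p_stack <- base_plot + geom_bar(position = "stack")  # the default
p_dodge <- base_plot + geom_bar(position = "dodge")
p_fill  <- base_plot + geom_bar(position = "fill")

# Highest point reached by any bar under each position
max(layer_data(p_stack)$ymax)  # stacked segments add up to the species total
max(layer_data(p_dodge)$ymax)  # side-by-side bars show each island's own count
max(layer_data(p_fill)$ymax)   # filled bars are rescaled to proportions
```

Use "dodge" when you want to compare counts between islands, and "fill" when you care about the share of each island within a species.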

Exercise 3

Create a bar graph displaying the mean bill length for each species of penguin.

Part A

First, we need to make our data tiny so that we can then input it into a plot. So, start by using the summarize() function to display the mean bill length of penguins by species.

In addition to the summarize() function, you’ll need to also use the group_by() function to group by species. Use the pipe to string both of these functions together.
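
Here is how group_by() and summarize() work together on a tiny made-up dataset, assuming dplyr is installed. The data frame toy_penguins and its values are assumptions for this example.

```r
library(dplyr)

# Hypothetical toy data with invented bill lengths
toy_penguins <- data.frame(
  species = c("Adelie", "Adelie", "Gentoo", "Gentoo"),
  bill_length = c(38, 40, 46, 48)
)

# group_by() splits the rows by species;
# summarize() then collapses each group to one row
toy_summary <- toy_penguins |>
  group_by(species) |>
  summarize(mean_bill = mean(bill_length))

toy_summary

# mean() returns NA if any value is missing,
# which is one reason we removed NA rows earlier
mean(c(38, NA))
```

The result has one row per species, which is exactly the shape of data that geom_col() expects.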

penguin_clean |> 
  group_by(_____)|> 
  summarize(mean_bill=__________)
penguin_clean |> 
  group_by(species)|> 
  summarize(mean_bill=mean(bill_length))

Part B

Now that the data is tidy and tiny, we can plot it. Add to your code from part A and create a bar graph.

Recall: geom_bar() is used when the data still needs to be aggregated.

geom_col() is used when the data is already aggregated.

penguin_clean |> 
  group_by(species)|> 
  summarize(mean_bill=mean(bill_length)) |>
  ggplot(aes(x=____, y=_____)) +
  geom_col()
penguin_clean |> 
  group_by(species)|> 
  summarize(mean_bill=mean(bill_length)) |> 
  ggplot(aes(x=species, y=mean_bill)) +
  geom_col()

Part C

Finally, we want to order the columns in the bar graph from smallest to largest. To do this, we’ll use fct_reorder(). Find where it goes in your current code. Hint: You will not have to write a new line.

Feel free to also add other features to your graph, like a title, colored bars, etc!

What are you trying to reorder on the graph? Are you trying to reorder the x variable or y variable?

penguin_clean |> 
  group_by(species)|> 
  summarize(mean_bill=mean(bill_length)) |> 
  ggplot(aes(x=fct_reorder(_____, _____), y=mean_bill))+
  geom_col()
penguin_clean |> 
  group_by(species)|> 
  summarize(mean_bill=mean(bill_length))|>
  ggplot(aes(x=fct_reorder(species, mean_bill), y=mean_bill, fill=species))+
  geom_col()+
  labs(x="Species", y = "Mean bill length (mm)", title="Mean bill length of penguins in the Palmer Archipelago")
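
To see what fct_reorder() does on its own, here is a small sketch, assuming the forcats package is installed. The data frame mean_bills and its numbers are made up for this example and are not results from the penguin data.

```r
library(forcats)

# Hypothetical summary table with invented mean bill lengths
mean_bills <- data.frame(
  species = c("Adelie", "Chinstrap", "Gentoo"),
  mean_bill = c(38.8, 48.8, 47.6)
)

# By default, categories are ordered alphabetically
levels(factor(mean_bills$species))

# fct_reorder() orders the categories by a second variable, smallest first
levels(fct_reorder(mean_bills$species, mean_bills$mean_bill))
```

ggplot draws the bars in the order of these levels, which is why wrapping the x variable in fct_reorder() changes the order of the columns.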

Exercise 4

Create a density plot of penguin flipper length. Include a line displaying the mean flipper length. Additionally, separate the data by island, so that you have one density curve per island.

Part A

First, create a density curve of penguin flipper length.

Use the geometry command geom_density().

ggplot(_______, aes(x=______))+
  _________

#OR you can use the pipe: 

_____ |> 
  ggplot(aes(x=_____))+ 
    ________
ggplot(penguin_clean, aes(x=flipper_length))+
  geom_density()

#OR you can use the pipe: 

#penguin_clean|>
#ggplot( aes(x=flipper_length))+
#  geom_density()

Part B

Next, separate out your density curve and create one density curve for each island. Hint: You will not need to add a whole new line of code to do this, just add a command into your existing code.

To separate the plot by island, you will need to adjust the color in the aesthetics section of ggplot.

ggplot(penguin_clean, aes(x=flipper_length, color=______))+
  geom_density()
ggplot(penguin_clean, aes(x=flipper_length, color=island))+
  geom_density()

Part C

Next, we’ll add a line displaying the mean flipper length for each island. To do this, we need to first create tidy, tiny data which displays the mean flipper lengths.

Create a new object called summary_data which holds a table of the mean flipper length for each island.

To check your work, run the following code in the code chunk below.

summary_data

If your code is successful, this will print the mean flipper length for each island.

You’ll need to use the following functions in your code: group_by(), summarize(), and mean().

summary_data <- ________ |> 
  group_by(_____)|> 
  summarize(_________)
summary_data <- penguin_clean |> 
  group_by(______)|> 
  summarize(mean_flipper = mean(______))
summary_data<- penguin_clean|> 
  group_by(island)|>
  summarize(mean_flipper=mean(flipper_length))

Part D

Now let’s add vertical lines to the graph indicating the mean flipper length for each island.

To do this, use the geometry geom_vline(). Use your summary_data to supply the data and the line positions in geom_vline().

First define your summary_data and mean_flipper, like you did in Part C.

Then create the ggplot, like you did in Parts A and B. Add a geom_vline() to the ggplot.

The syntax for geom_vline() is geom_vline(data=____, aes(xintercept=_____, color=_____)).

summary_data<- penguin_clean|> 
  group_by(island)|>
  summarize(mean_flipper=mean(flipper_length))

ggplot(penguin_clean, aes(x=flipper_length, color=island))+
  geom_density()+
  geom_vline(data=_______, aes(xintercept=_______, color=______))
summary_data<- penguin_clean|> 
  group_by(island)|>
  summarize(mean_flipper=mean(flipper_length))

ggplot(penguin_clean, aes(x=flipper_length, color=island))+
  geom_density()+
  geom_vline(data=summary_data, aes(xintercept=mean_flipper, color=island))

Part E

Finally, add labels to your plot including x and y axis labels and a title.

Just add labs() to your existing code with a +. It is not a geometry, but it is added in the same way.

summary_data<- penguin_clean|> 
  group_by(island)|>
  summarize(mean_flipper=mean(flipper_length))

ggplot(penguin_clean, aes(x=flipper_length, color=island))+
  geom_density()+
  geom_vline(data=_______, aes(xintercept=_______, color=______))+
  labs(title="_____", x="_____", y="_____")
summary_data<- penguin_clean|> 
  group_by(island)|>
  summarize(mean_flipper=mean(flipper_length))

ggplot(penguin_clean, aes(x=flipper_length, color=island))+
  geom_density()+
  geom_vline(data=summary_data, aes(xintercept=mean_flipper, color=island))+
  labs(title="Flipper Length Density Plot for Penguins on Biscoe, Dream, and Torgersen Islands", x="Flipper Length (mm)", y="Density")
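
One idea from this exercise is worth a second look: a single layer can use its own dataset. Here is a small sketch on made-up data, assuming dplyr is installed. The data frames toy_penguins and toy_summary and their values are assumptions for this example.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data with invented flipper lengths
toy_penguins <- data.frame(
  island = c("Biscoe", "Biscoe", "Biscoe", "Dream", "Dream", "Dream"),
  flipper_length = c(210, 214, 218, 188, 190, 195)
)

# One row per island, holding that island's mean
toy_summary <- toy_penguins |>
  group_by(island) |>
  summarize(mean_flipper = mean(flipper_length))

# The density layer uses toy_penguins; the vline layer uses toy_summary
p <- ggplot(toy_penguins, aes(x = flipper_length, color = island)) +
  geom_density() +
  geom_vline(data = toy_summary, aes(xintercept = mean_flipper, color = island))

# Layer 2 is the geom_vline() layer; look at where its lines are drawn
layer_data(p, 2)$xintercept
```

Passing data= inside a geometry overrides the dataset for that layer only, so the rest of the plot keeps using the full data.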

Congratulations, you have finished this tutorial!!