In this tutorial, we’ll be looking at how to create plots using the package ggplot.
We’ll start by learning about how ggplot works.
Syntax
First, let’s consider the syntax when we write code with ggplot. There are many commands that can be used within the ggplot() function. They are all connected using a +.
General syntax for ggplot():
ggplot(data, aes(x=variable1, y=variable2))+
geom_x()
OR
dataframe|>
ggplot(aes(x=variable1, y=variable2))+
geom_x()
Specifying variables: X is the explanatory variable and Y is the outcome variable
Aesthetics
Within ggplot code, you can specify the aesthetics with various variables including color, size, variables, number of bins (in a histogram), etc.
For example the following code specifies the x & y variables, point size & color:
geom_smooth(method="lm", se=FALSE): Creates a line of best fit
geom_point(): Creates points (scatterplot) for each data point
geom_bar() : Creates a bar graph, aggregates data for you
Syntax: geom_bar(position="___") Can choose “stack” (bars on top of each other), “dodge” (bars side by side) or “fill” (bars stacked, scaled to 100%).
geom_col(): Creates a bar graph with pre-aggregated data
geom_histogram(): Creates a histogram
Syntax: geom_histogram(bins=X) Specify number of bins
geom_density(): Creates a density graph
Syntax: geom_density(adjust=X) X is a ratio, how smooth the graph is
labs() : Add a title and axes labels to your graph
And more!
Other functions to use with GGplot
You can use the pipe to incorporate other tidyverse functions to organize your data before plotting.
Our Dataset
We will work with a dataset with measurements for 3 different species of penguins on islands in the Palmer Archipelago, Antarctica. We’ll be focusing on the measurements for species, location, bill length (mm), bill depth (mm), flipper length (mm), body mass (g), sex, and year the data was taken. Here’s more information about the data set!
Here is a preview of the data you will be working with:
Exercises
Here are some practice exercises. You can check if your code works using the “check your work” button after each problem.
If you get stuck, you can also click for hints and for an example answer. Remember, there’s many ways to solve each problem, so the code in the “answer” box isn’t the only way to solve it!
Exercise 1
In this exercise, you will create 3 scatter plots, one for each species of penguin, displaying penguins’ body mass and flipper length. Additionally include a line of best fit.
The different parts of this exercise will walk you through the process of tidying the data and creating the scatter plot.
Part A
First, let’s clean the data and remove any rows with missing values. Create a new dataset called penguin_clean which doesn’t contain NA values.
Check your work
To check your work, run the following code in the code chunk below.
n_miss(penguin_clean)
If your code works, this should print 0.
Hint
penguin_clean<-na.omit(_____)
Answer
penguin_clean<-na.omit(penguin)
Part B
Now use your clean data to construct a scatter plot with flipper_length as the explanatory variable and body_mass as the outcome variable.
Hint
You’ll need to use ggplot, specify the data, the aesthetics (which includes the x and y variables) and then add a geometry to your plot to create points.
ggplot(penguin_clean, aes(x=flipper_length, y=body_mass))+geom_point()#OR you could use the pipe to start with penguin_clean and then create your plot: #penguin_clean |> ggplot(aes(x=flipper_length, y=body_mass))+ geom_point()
Part C
Now, add a line of best fit to your plot.
Hint
Add one more piece of geometry to your code from part B to your plot that creates a line of best fit.
ggplot(penguin_clean, aes(x=flipper_length, y=body_mass))+geom_point()+geom_smooth(method="lm", se=FALSE)#OR you could use the pipe function!
Part D
Finally, let’s separate the plot into 3 different plots, one for each species.
Hint
Use the function facet_wrap() and be sure to define your facet. Add this as a geometry to your code.
Hint
#Here's the syntax: facet_wrap(~_____)
Answer
ggplot(penguin_clean, aes(x=flipper_length, y=body_mass))+geom_point()+geom_smooth(method="lm", se=FALSE)+facet_wrap(~species)#OR you could use the pipe!
Exercise 2
Create a bar graph which displays the number of penguins in each species. Also show which islands each penguin species is on by adjusting the fill and position of the bar graphs.
Part A
First, create a bar graph that shows the number of penguins in each species.
Hint
geom_bar() is used when the data still needs to be aggregated.
geom_col() is used when the data is already aggregated.
Now that the bars are filled in by island, let’s adjust them again so that they don’t stack on top of each other. Instead, place all the bars next to each other.
Hint
Adjust the position of bars in the geom_bar() geometry.
Create a bar graph displaying the mean bill length for each species of penguin.
Part A
First, we need to make our data tiny so that we can then input it into a plot. So, start by using the summarize() function to display the mean bill length of penguins by species.
Hint
In addition to the summarize() function, you’ll need to also use the group_by() function to group by species. Use the pipe to string both of these functions together.
Finally, we want to order the columns in the bar graph from smallest to largest. To do this, we’ll use fct_reorder(). Find where it goes in your current code. Hint: You will not have to write a new line.
Feel free to also add other features to your graph, like a title, colored bars, etc!
Hint
What are you trying to reorder on the graph? Are you trying to reorder the x variable or y variable?
penguin_clean |>group_by(species)|>summarize(mean_bill=mean(bill_length))|>ggplot(aes(x=fct_reorder(species, mean_bill), y=mean_bill, fill=species))+geom_col()+labs(x="Species", y ="Mean bill length (mm)", title="Mean bill length of penguins in the Palmer Archipelago")
Exercise 4
Create a density plot of penguin flipper length. Include a line displaying the mean flipper length. Additionally, separate the data by island, so that you have one density curve per island.
Part A
First, create a density curve of penguin flipper length.
Hint
Use the geometry command geom_density().
Hint
ggplot(_______, aes(x=______))+ _________#OR you can use the pipe: _____ |>ggplot(aes(x=_____))+ ________
Answer
ggplot(penguin_clean, aes(x=flipper_length))+geom_density()#OR you can use the pipe: #penguin_clean|>#ggplot( aes(x=flipper_length))+# geom_density()
Part B
Next, separate out your density curve and create one density curve for each island. Hint: You will not need to add a whole new line of code to do this, just add a command into your existing code.
Hint
To separate the plot by island, you will need to adjust the color in the aesthetics section of ggplot.
Next, we’ll add a line displaying the mean flipper length for each island. To do this, we need to first create tidy, tiny data which displays the mean flipper lengths.
Create a new variable called summary_data which displays a chart with the mean flipper length for each island.
Check your work
To check your work, run the following code in the code chunk below.
summary_data
If your code is successful, this will print the mean flipper length for each island.
Hint
You’ll need to use the following functions in your code: group_by(), summarize(), and mean().