ggplot(data, mapping = aes()) +
geom_boxplot()
Lesson 5: Boxplots and scatterplots
1 Introduction
Welcome to lesson five, where you will learn the basics of graphing using boxplots and scatterplots.
See the home page for more details on how to use this tutorial and for troubleshooting tips!
In the previous lesson, we introduced GGPlot and made a histogram. In this lesson, we’ll continue to use GGPlot, but learn new ways to customize plots.
We will work with the physiology dataset, called data
, that you have already seen. Type the word data to recall what it contains.
2 Boxplots
Boxplots are an excellent graphing option for many datasets, because they are valid regardless of whether or not the data are normally distributed. To create a boxplot, you need one variable that is categorical and one that is continuous.
The command geom_boxplot()
will create a boxplot. We will add this as a geom_object()
to our ggplot()
function using a +
. The structure of the code looks like this:
2.1 Specifying the type of variable
Before we create a boxplot, we need to tell R what kind of variables we are inputting.
When creating a boxplot, our x variable should be categorical (called a factor variable in R language). A categorical variable has defined categories or groups.
Our y variable should be continuous. A continuous variable can take any numerical value.
Which variable in our dataset is a categorical variable?
✗RQ
✗heart_rate
✓Condition
In data
the different measurements are categorized as having been made under either exercise (1) or resting (2) conditions. So Condition
is a categorical variable because it has two different categories (or groups) that the measurements are placed into
Before we can create a boxplot, we have to tell R that whether our variable is a categorical variable. In R language, a categorical variable is called a “factor” variable. You can use this command to convert a variable to a categorical (or factor) variable: data$variableName <- as.factor(data$variableName)
.
Try that now with the categorical variable in data
.
Run the following code to check your work. If the output is “Factor w/ 2 levels….”, then you wrote the right code! If not, try again.
The Condition
variable should be categorical. So convert it to a factor variable (in R language).
$Condition <- as.factor(data$Condition) data
Finally, we can create a boxplot! Specify the x variable as Condition
and use heart_rate
for the y variable.
Here’s the basic structure for your code again:
ggplot(data, mapping = aes()) +
geom_boxplot()
Add the x and y variables to create a boxplot. As an extra bonus, add labels and color!
ggplot(data, mapping = aes(x= _______, y= ________)) +
geom_boxplot()
Make sure you use the proper capitalization and spelling for the x variable, exactly as it is shown in our data
.
If you get a weird looking boxplot, return to the previous code chunk and make sure you converted Condition
to a factor variable.
$Condition <- as.factor(data$Condition)
data
ggplot(data, mapping = aes(x= Condition, y= heart_rate)) +
geom_boxplot()
If your code worked, you should see a boxplot for each condition, 1 (exercise) and 2 (resting). The boxplot displays the median, quartiles and outliers for each condition.
Here’s an image that illustrates how to interpret a boxplot:
Looking at the boxplot you created, what is the 1st quartile for condition 2 (resting)?
✗65
✓53
✗73
2.2 Modifying a Boxplot
2.2.1 Adding labels
Now, try adding labels to your boxplot. We’ll use the same function we used in lesson 4: labs(x="_____", y="______", title="_________")
. Try it now!
Don’t forget to use a +
to chain on the labs()
function after geom_boxplot()
.
ggplot(______, mapping = aes(x= _______, y= ________)) +
geom_boxplot()+
labs(x="____", y="_____", title="_______")
ggplot(data, mapping = aes(x= Condition, y= heart_rate)) +
geom_boxplot()+
labs(x="Condition", y="Heart Rate", title="Boxplot of heart rate during exercise and rest")
But these labels do not specify whether “1” and “2” refer to resting or exercise conditions. We can add the following function to change the labels on the x axis: scale_x_discrete(labels=c("firstLabel", "secondLabel"))
Give it a try!
ggplot(data, mapping = aes(x= Condition, y= heart_rate)) +
geom_boxplot() +
labs(x="Condition", y="Heart Rate", title="Boxplot of heart rate during exercise and rest") +
scale_x_discrete(labels=c("_______", "_______"))
ggplot(data, mapping = aes(x= Condition, y= heart_rate)) +
geom_boxplot()+
labs(x="Condition", y="Heart Rate", title="Boxplot of heart rate during exercise and rest")+
scale_x_discrete(labels=c("Exercise", "Resting"))
3 Scatterplots
Suppose we wanted to look at a scatterplot of the relationship between heart rate and respiratory quotient (RQ).
We can create a scatter plot by adding the geom geom_point()
to the main ggplot()
function. This will create a point for every data value.
Follow the same overall structure that you used to create a boxplot and histogram. Try to create a scatterplot displaying the relationship between heart_rate
(on the x-axis) and RQ
(on the y-axis).
Recall the overall structure should be:
ggplot(data, mapping = aes(____________))+
_________ () geom_
ggplot(data, mapping = aes(x= _______, y= ________)) +
geom_point()
ggplot(data, mapping = aes(x= heart_rate, y= RQ)) +
geom_point()
3.1 Adjust plot style
Just like a boxplot, we can adjust the color, labels, and even the point size and shape.
3.1.1 Point size, color, shape
You can adjust the size, color and shape of the point by adding arguments to the geom_point()
function.
For example: geom_point(color="red", shape="circle", size=1.5)
Give it a try! Create a scatter plot displaying the relationship between heart_rate
(on the x-axis) and RQ
(on the y-axis) and change the color, size and shape of the points.
Try different shapes and colors - just type in their name and see if they work! R might not understand all the colors or shapes, but you can experiment and see which ones work or look it up online or check out this linked pdf to see the colors that can be specified by name.
Use the same code that you used in the code chunk just before this (you can just copy and paste!). Then add color, shape, and size specifications inside the parentheses of geom_point()
.
Don’t forget to use quotation marks around your color and shape! Do NOT use quotation marks around the size.
ggplot(data, mapping=aes(x=heart_rate, y=RQ)) +
geom_point(color="______", shape="_______", size=________)
ggplot(data, mapping=aes(x=heart_rate, y=RQ)) +
geom_point(color="blue", shape="triangle", size=2)
3.1.2 Axis limits
You can also adjust the axis limits by chaining on the functions: xlim(minLimit, maxLimit)
and ylim(minLimit, maxLimit)
.
For example:
ggplot(data, mapping=aes(x=_____, y=_____)) +
geom_point() +
xlim(____, ____) +
ylim(____, _____)
Let’s put everything together now! Create a scatterplot displaying the relationship between heart_rate
(on the x axis) and RQ
(on the y axis). Choose a color, size and shape for the points. Add reasonable x and y limits. And finally, add x and y axes labels and a title to your plot.
(Feel free to copy and paste your code from previous code chunks)
Check that you have quotation marks and parentheses in the right spots.
Did you chain all the different functions together with a +
?
Did you spell all your variable names correctly?
ggplot(_____, mapping=aes(x=______, y=______)) +
geom_point(color="______", shape="_______", size=________) +
xlim(____, ____) +
ylim(____, ____) +
labs(x="_____", y="_____", title="_____")
ggplot(data, mapping=aes(x=heart_rate, y=RQ)) +
geom_point(color="blue", shape="triangle", size=2) +
xlim(40, 90) +
ylim(0.7, 1.1) +
labs(x="Heart Rate", y="Respiratory Quotient", title="Relationship between Heart Rate and Respiratory Quotient")
Now you’ve learned the basics of creating a scatterplot when both the x and y variables are continuous. You know how to alter many aspects of your figure’s appearance to make it clearer.
3.2 Scatterplots with a categorical variable
Next, we’ll look at making scatterplots when the independent (x) variable is categorical.
Suppose you wanted to create a graph that compared the respiratory quotients (RQ) under the two conditions: exercise and resting. Remember that the Condition
variable is categorical because it sorts the data values into two categories.
3.2.1 Specifying Type of Variable
First, let’s tell R that Condition
is a factor (categorical) variable. We already did this when we made a boxplot, so skip this step if you already did it and haven’t refreshed the page since.
Do you remember how to change a variable to a factor variable?
3.2.2 Scatterplot
Now, create a scatterplot the RQ
s of the two Condition
groups, exercise and resting.
Your x variable is Condition
and your y variable is heart_rate
ggplot(data, mapping=aes(x=Condition, y=RQ))+
geom_point()
When you have one categorical variable and one continous variable, a scatterplot is only a good option if there are only a few data values, as in this example. Otherwise, use a boxplot.
3.2.3 Axis labels on categorical scatterplot
Let’s add labels to our plot. You know how to add axes labels and a title, but how can we change the unhelpful labels “1” and “2”? We will use the same command we used when making a boxplot. Refer back to your code when you made a boxplot, and see if you can add labels to your scatterplot. Add some color too!
The command you need to add is scale_x_discrete(labels(c("________", "_________")))
.
ggplot(____, mapping=aes(x=______, y=______))+
geom_point(color="red")+
labs(x="_______", y="_______", title="_______")
scale_x_discrete(labels(c("________", "_________")))
ggplot(data, mapping=aes(x=Condition, y=RQ))+
geom_point(color="red")+
labs(x="Condition", y="Respiratory Quotient", title="Respiratory Quotient at Rest or while Exercising")+
scale_x_discrete(labels=c("Exercise", "Resting"))
4 Congratulations
That’s it! Now you’re an expert in making boxplots and scatterplots. In lesson six, you’ll learn how to make two kinds of bar graphs.