Scatter Plots
1 What is a scatter plot?
A scatter plot compares two numerical variables, one on the x-axis and one on the y-axis. It can be used to show trends over time or the correlation between two variables.
Here is the plot framework:
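As a minimal sketch of that framework, assuming the ggplot2 package is installed: the names df, x_variable, and y_variable below are hypothetical placeholders for your own data frame and columns, and the numbers are made up only so the example runs.

```r
library(ggplot2)

# Hypothetical data frame standing in for your own data
df <- data.frame(
  x_variable = c(1, 2, 3, 4, 5),
  y_variable = c(2.1, 3.9, 6.2, 7.8, 10.1)
)

# data: the data frame to plot
# aes(): which columns are mapped to the x and y axes
p <- ggplot(data = df, mapping = aes(x = x_variable, y = y_variable)) +
  geom_point()  # draws one point per row of df

p  # printing the object displays the plot
```

Swapping in a real data frame and its column names is all that changes from one scatter plot to the next.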
2 Penguin Data
In the penguin data set, we can examine the relationship between bill depth and bill length.
Note: To generate a scatter plot, we add the geom_point() function to a ggplot() call.
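library(ggplot2)         # provides ggplot() and geom_point()
library(palmerpenguins)  # assumed source of the penguins data set
ggplot(data=penguins, mapping=aes(x=bill_length_mm, y=bill_depth_mm))+
geom_point()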
3 Line of best fit
So we have these dots, but how do we know the relationship between the x- and y-variables?
To perform regression analysis, that is, to describe how the x and y variables are related, we can use the linear model equation y = mx + b. Here m is the slope (the change in y for each one-unit increase in x) and b is the y-intercept (the value of y where the line crosses the y-axis, at x = 0).
All this information can be found using this skeletal formula:
variable = lm(data = df, formula = y_variable ~ x_variable)
variable
summary(variable)
Here is the skeletal formula applied to our scatter plot of penguin bill depth vs. bill length. Below is the output: the summary statistics associated with this linear model.
billDepthAndLengthModel=lm(data=penguins, formula=bill_depth_mm~bill_length_mm)
billDepthAndLengthModel
summary(billDepthAndLengthModel)
While this output is intimidating, we only need two numbers for our purposes: the estimated intercept (b) and the slope (m). Both appear in the Estimate column of the coefficients table, with the slope listed directly below the intercept.
Our line of regression can be represented in the equation y = -0.08502x + 20.88574
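To see what the equation does, here is a minimal sketch that plugs one bill length into it. The helper name predict_depth and the example value of 40 mm are hypothetical, chosen only for illustration; the slope and intercept are copied from the equation above.

```r
# Coefficients copied from the regression equation above
m <- -0.08502   # slope: change in bill depth (mm) per 1 mm of bill length
b <- 20.88574   # y-intercept: predicted bill depth when bill length is 0

# Hypothetical helper (not part of any package): applies y = mx + b
predict_depth <- function(bill_length_mm) {
  m * bill_length_mm + b
}

predicted <- predict_depth(40)  # 40 mm is an arbitrary example value
predicted
# → 17.48494
```

Because the slope is negative, a longer bill gives a smaller predicted bill depth under this model.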
Here is the same scatter plot with the line of best fit:
ggplot(data=penguins, mapping=aes(x=bill_length_mm, y=bill_depth_mm))+
geom_point()+
geom_smooth(method="lm")
4 Chick Data
Next, we will use the ChickWeight data set to look at the change in weight (variable: “weight”) over time in days (variable: “Time”) for chick 4.
First, we must filter the ChickWeight data to include only data from chick 4 by creating a new variable, “chick4”.
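One way to do this filtering, sketched here with base R's subset() (an assumption; dplyr's filter() would work just as well):

```r
# ChickWeight ships with R in the datasets package, so nothing needs installing.
# Keep only the rows where the Chick column identifies chick 4.
chick4 <- subset(ChickWeight, Chick == "4")

# Look at the first few rows to confirm only chick 4 remains
head(chick4)
```

The Chick column is a factor, so comparing it against the label "4" avoids any confusion with the factor's internal ordering.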
Now let’s create a scatter plot of chick 4’s weight over time. Don’t worry about adding a title or axis labels. Make sure to add a line of best fit!
R is case sensitive! The data set is named “ChickWeight” and the variables are “Time” and “weight”. Which variable belongs on the x-axis? Which variable belongs on the y-axis?
ggplot(data=chick4, mapping=aes(x=Time, y=weight))+
geom_point()+
geom_smooth(method=lm)
Let’s generate the line of best fit summary table for the chick4 data, using the model code from above.
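A minimal sketch of that step follows the same pattern as the penguin model. The name chick4Model is a hypothetical choice, and chick4 is rebuilt here with base R's subset() so the example runs on its own.

```r
# Rebuild the chick 4 subset so this example is self-contained
chick4 <- subset(ChickWeight, Chick == "4")

# Fit weight as a linear function of Time (y-variable ~ x-variable)
chick4Model <- lm(formula = weight ~ Time, data = chick4)

# Full summary table, including the Estimate column
summary(chick4Model)

# Just the two numbers needed for y = mx + b:
# "(Intercept)" is b and "Time" is m
coef(chick4Model)
```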
Write the linear regression equation (y=mx+b) based on the model summary.
Make sure you have the slope and y-intercept in their respective spots!
This equation shows the relationship between the x-variable (time) and the y-variable (weight) for chick 4 from the ChickWeight data set.