Lesson 6: Bar Graphs

1 Introduction

Welcome. In this lesson, you will learn how to create simple bar graphs and clustered bar graphs, and how to add error bars to both types of graphs.

We’ll continue to use ggplot, which was introduced in lesson 4.

See the home page for more details on how to use this tutorial and for troubleshooting tips!

2 Dataset

A dataset called bargraphdata has been loaded into this tutorial.

Type its name so that you can see what it contains.

It would also be good to understand the structure of the different variables. Type the command that allows you to see the structure of bargraphdata.

Code editor
Answer

str(bargraphdata)

This dataset contains three variables. The data come from an experimental field study that lasted three years and involved 40 study plots. Each plot received one of two different experimental treatments.

The first variable, var1, is a measurement that was made in each of the plots; this variable is numerical.
The second variable, treatment, refers to whether a particular study plot received treatment A or treatment B; this is a factor (category) variable.
The third variable, year, refers to the year in which the measurement was made.

3 Bar Graphs

Suppose you want to make a bar graph that compares the means of var1 for each of the two treatments. For now, imagine that you do not care about the year the measurement was made.

3.1 Identify data

The first thing you need to do is to identify the data that will be used to make each bar. Right now all of the data values for both bars are part of the same variable, var1. So we’ll need to sort our data into groups (treatment A and B) and then calculate the mean for each group.

Which functions do you think we could use to group our data and then calculate the mean for each group?

Hint: You learned about it in lesson two.

✗summarize() and select() and mean()

✗filter() and mean()

✓group_by() and summarize() and mean()

If you weren’t sure which function to choose or what these functions do, review lesson 2.

First, we’ll use group_by() to group our data by different treatment groups. Then we use summarize() to calculate the mean of each group. Try this now:

First, specify your dataset, called bargraphdata. Remember to use the pipe to string the functions together.

Here’s the basic setup:

bargraphdata |> 
  group_by(______) |>
  summarize(avgVar1 = ________)

bargraphdata |> 
  group_by(treatment) |>
  summarize(avgVar1 = mean(var1))

You should get a chart displaying the means of each treatment group. But, we need to be able to input that data into our graph, so we’ll save it as a new variable.

Copy and paste your code from the previous exercise and save it to a new variable called tmtsummary.

Remember to use <- to assign data to a new variable name.

Now we can graph! We’ll use the same structure we did to create a histogram, boxplot and scatterplot, but we’ll use the geometry geom_col() to make our graph. Chain it onto the main ggplot() function using a + and use tmtsummary as your data.

Your x variable will be treatment and your y variable will be the name that you created for the mean of variable 1. In my answer example, it is called avgVar1, but you can use any name, as long as you defined it in the code chunk above.

Here’s the basic structure:

ggplot(____, mapping = aes(______))+
  geom_ ______()

ggplot(tmtsummary, mapping = aes(x=treatment, y=avgVar1))+
  geom_col()

3.2 Labels and color

Now let’s learn how to add color to the graph. We can color the bars by treatment, so that each treatment is a different color. Use your same code as before and type fill=treatment into the aes() command, after the x and y variables. Use a comma to separate the fill color from the x and y variables.

Give it a try now! Also add labels to your graph.

Here’s the basic structure:

ggplot(tmtsummary, mapping = aes(x=treatment, y=avgVar1, _______))+
  geom_col()+
  labs(________)

ggplot(tmtsummary, mapping = aes(x=treatment, y=avgVar1, fill=treatment))+
  geom_col()+
  labs(x="Treatment", y="Average Variable 1", title = "Bar graph of the Mean of Variable 1 for Treatment A and B")

3.3 Adding error bars

In lesson three, you learned about the importance of describing the variability in the data as well as its central tendency. This bar graph does not show the variability in var1. You can do that by adding error bars to the bars. The error bars can represent whatever measure of variability you choose: standard deviation, standard error, or 95% confidence limits.

3.3.1 Compute measure of variability

To add error bars to your graph, you first need to compute the values for the measure of variability you have chosen to use. In this example, let’s use the standard deviation.

We’ll return to our group_by() and summarize() functions. Use the exact same code you used to create tmtsummary and add a calculation for standard deviation in summarize(). Save this summarized data to the same variable name tmtsummary. Give it at try!

Your summarize command will have the structure:

summarize(avgVar1 = mean(var1),
          ______ = _____(____))

All the other code will be the same as when you first defined tmtsummary

tmtsummary <- ________ |> 
  group_by(______)|>
  summarize(avgVar1 = mean(var1),
            _____ = sd(_____))

tmtsummary <- bargraphdata |> 
  group_by(treatment)|>
  summarize(avgVar1 = mean(var1),
            sdVar1 = sd(var1))

Now we have our standard deviation data and we can use that to create error bars. The geom function for error bars is geom_errorbar(). We’ll add an aesthetics argument to specify the max and min for the error bar.

Here’s how the error bar code will look:

geom_errorbar(mapping=aes(ymin = _______,
                          ymax = _______))

Try adding this command to your bar graph. Fill in the ymin and the ymax. Note that: the minimum error bar value is the ‘mean of var1 - sd of var1’ and the maximum error bar value is the ‘mean of var1 + sd of var1’.

Copy and paste your code from the previous graph and chain on the geom_errorbar() function.

Your geom_errorbar() function should look like:

geom_errorbar(mapping=aes(ymin = avgVar1-sdVar1,
                          ymax = avgVar1+sdVar1))

Did you define the data tmtsummary and include a calculation for standard deviation in it? Are all your variables spelled correctly? Are there enough parentheses?

ggplot(tmtsummary, mapping = aes(x=treatment, y=avgVar1, fill=treatment))+
  geom_col()+
  labs(x="Treatment", y="Average Variable 1", title = "Bar graph of the Mean of Variable 1 for Treatment A and B")+
  geom_errorbar(mapping=aes(ymin = avgVar1-sdVar1,
                          ymax = avgVar1+sdVar1))

Bonus challenge question

Can you think of a way to use the pipe (|>) to join the group_by()/summarize() functions with ggplot()?

You won’t need to define a new variable tmtsummary if you use this method. Give it a try - you may find it’s actually easier than the previous method!

After the summarize() function, simply pipe your data to ggplot(). Then, in ggplot(), you don’t need to specify a dataset because it’s included in the pipe!

Here’s the general setup:

bargraphdata |> 
  group_by()|>
  summarize() |> 
  ggplot() + 
    geom_col()

If you would rather show standard error bars and you have loaded plotrix, you can substitute std.error for sd and get a bargraph with standard error bars.

4 Clustered bar graph

Suppose you were interested in a bar graph that not only compared the treatments, but also displayed if and how the years differed from each other. You can do this with what is called a clustered bar graph.

In this case, a clustered bar graph would have six different bars, one for each combination of treatment (A and B) and year (1, 2, and 3).

4.1 Specifying data and variables

To create a clustered bar graph, we’ll chain on the command facet_wrap(~variableName) to our chain of ggplot functions. We can copy all our previous code, with just a few edits.

Below you’ll see the code we used to create our non-clustered graph in the previous step. (Yours may look slightly different, but that’s ok!)
You’ll also see code to make a clustered bar graph, clustered by year. What’s different from the non-clustered graph code?
Finally, run the clustered bar graph code in the code editor!

Here’s the code you’ll use to create a clustered bar graph code. Compare it to the code you used to create a bar graph in the previous step. What is different?

tmtsummary<- bargraphdata |> 
  group_by(treatment, year)|>
  summarize(avgVar1 = mean(var1),
            sdVar1 = sd(var1)) 

ggplot(tmtsummary, mapping = aes(x=treatment, y=avgVar1, fill=treatment))+
  geom_col()+
  labs(x="Treatment", y="Average Variable 1", title = "Bar graph of the Mean of Variable 1 for Treatment A and B")+
  geom_errorbar(mapping=aes(ymin = avgVar1-sdVar1,
                            ymax = avgVar1+sdVar1))+
  facet_wrap(~year)

Here’s our code for the bar graph created in the previous step. Compare this code to the clustered bar graph code. What is different?

tmtsummary<- bargraphdata |> 
  group_by(treatment)|>
  summarize(avgVar1 = mean(var1),
            sdVar1 = sd(var1)) 

ggplot(tmtsummary, mapping = aes(x=treatment, y=avgVar1, fill=treatment))+
  geom_col()+
  labs(x="Treatment", y="Average Variable 1", title = "Bar graph of the Mean of Variable 1 for Treatment A and B")+
  geom_errorbar(mapping=aes(ymin = avgVar1-sdVar1,
                            ymax = avgVar1+sdVar1))

Copy and paste the code for the clustered bar graph to see what it does!

Here are the edits made for the clustered bar graph:

Line 2: group_by(treatment, year): We’re now interested in displaying treatment and year, so we want to group by both of those variables
Line 11: facet_wrap(~year): Tells R to cluster the bar graphs by year.

4.1.1 Changing the cluster variable

What if you wanted your bar graph to be clustered by treatment instead of year?

Give it a try! It’s not quite as simple as changing the facet_wrap() to treatment… there’s a couple other edits you have to make. Mess around a bit and see if you can figure it out. After you try it yourself, use the hints to help you figure out what else you need to edit in your code.

In line 11, change facet_wrap(~year) to facet_wrap(~treatment)

Line 7: You’ll need to change your x variable. It won’t be treatment anymore because treatment is now the cluster variable NOT the variable on the x axis.

Hopefully you changed your x variable to year. Finally, also change the fill color to year (also line 7 in the aes() argument).

tmtsummary<- bargraphdata |> 
  group_by(treatment, year)|>
  summarize(avgVar1 = mean(var1),
            sdVar1 = sd(var1)) 

ggplot(tmtsummary, mapping = aes(x=year, y=avgVar1, fill=year))+
  geom_col()+
  labs(x="Treatment", y="Average Variable 1", title = "Bar graph of the Mean of Variable 1 for Treatment A and B")+
  geom_errorbar(mapping=aes(ymin = avgVar1-sdVar1,
                            ymax = avgVar1+sdVar1))+
  facet_wrap(~treatment)

You may notice that the legend looks a little odd - it is continuous and doesn’t show 3 separate values. This is because R thinks the year data is numerical, but we are treating it as a categorical (factor) variable.

Try changing the year data to factor data and then rerunning the code to fix the legend. If you need a refresher on how to do switch a variable to a factor variable, check out lesson 5.

5 Congratulations

That’s it! Now you know how to use bar graphs to di splay your data, including adding error bars and creating clustered bar graphs. Remember that if the distribution of data values is highly non-normal, boxplots are a more appropriate form of graphical display than bar graphs.

Next lesson, we’ll learn some basic statistics.