Lesson 7: Inferential Statistics

1 Introduction

In this lesson you will learn how to perform some basic inferential statistics.

When testing scientific hypotheses, we are trying to decide whether the pattern in our data supports the hypothesis or not.

But some apparent patterns can be due to chance alone. Statistical inference gives us a way to compute the likelihood that an apparent pattern in our data could be due merely to chance.

1.1 Statistical Significance

Most scientists only accept a pattern as supporting a hypothesis if it is very unlikely that the pattern could be due to chance. By convention, we define an event as ‘very unlikely’ if it has a probability of .05 (5%) or less.

Statistical inference involves using the data to compute the probability – the p-value – that a pattern could have arisen by chance. If the p-value associated with a test is less than or equal to 0.05, we conclude that the pattern is statistically significant. In other words, the pattern is highly unlikely to have arisen by chance.
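As a quick illustration of a p-value (a minimal sketch, separate from the tutorial dataset): suppose a coin lands heads 9 times out of 10 flips. R’s binom.test() function computes the probability of a result at least that extreme if the coin were really fair:

# Two-sided test: how surprising is 9 heads in 10 flips of a fair coin?
binom.test(9, 10, p=0.5)

The p-value it reports (about 0.02) is less than 0.05, so we would call this pattern statistically significant.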

2 Dataset

In this tutorial you will work with a sample dataset called sampledata that has already been imported into the tutorial.

Write the command for looking at the structure of sampledata so you can see what it contains.

The structure function is str(datafilename).

str(sampledata)

You should also look at the values in the table. Write the command to display the data file.

Just type the name of the file:
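sampledata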

You can see that sampledata contains 30 different observations (rows) and 5 variables (columns).

  • weight, length, and repro are numerical variables.

  • temp is a factor variable with two levels, A and B.

  • diet is an integer variable, but in this data set we will want to use it as a factor variable: the experimental organisms were fed one of three possible types of diet.

2.0.1 Factor variable

Write a command that will convert diet to a factor variable.

You will need to use the as.factor() function.

Don’t forget to first reference your data set and then your variable, separating them with a $.

The basic structure is _____$_____ <- as.factor(_____$______)

sampledata$diet<-as.factor(sampledata$diet)
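To confirm that the conversion worked, you can look at the structure again; diet should now be listed as a factor with three levels:

str(sampledata)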

3 Statistical Analysis with Continuous Variables

Let’s start by thinking about situations where both the independent and dependent variables are continuous. You are interested in knowing if there is an association or a relationship between the values of the two variables. Does the value of one predict the value of the other?

We will work with the variables for weight and for length.

What is the most appropriate kind of plot for looking at the possible relationship between two continuous variables?

histogram

scatterplot

barplot

boxplot

The answer is a scatterplot: each observation appears as a single point, so you can see whether the two variables tend to rise and fall together.

Use ggplot to write some code to plot the relationship between weight and length. Put length on the x-axis. If you need a review on using ggplot, refer to the cheatsheet or lesson 5.

Recall the basic structure of ggplot:

ggplot(_____, mapping=aes(x=______, y=______))+
  geom_point()
ggplot(sampledata, mapping=aes(x=length, y=weight))+
  geom_point()

3.1 Regression Analysis

We can carry out a regression analysis to find the equation for the best-fit line through the points.

The following command performs a regression of the y-variable on the x-variable and stores it in a variable called model. lm stands for ‘linear model.’

model<-lm(sampledata$yVariable~sampledata$xVariable)

  • Notice that we put the y variable first and the x variable second, separated by a tilde ~. The tilde ~ means “explained by”: i.e. the y variable (weight) is explained by the x variable (length).

Write the command to perform a linear regression of weight on length for sampledata, and store it in a variable called weightreg.

weightreg<-lm(_____$______ ~ ______$______)

weightreg<-lm(sampledata$weight~sampledata$length)

You can find the slope of the line and the y-intercept (the values m and b in the equation y = mx + b) by typing the name of the variable you created from your regression.

Write the code to find the slope and y-intercept of the best-fit line.

You named your regression variable weightreg. Just type that in.
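weightreg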

The output shows the values of the line’s y-intercept (-5.1502) and its slope (0.8149).
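If you want just those two numbers by themselves, R’s coef() function extracts them from the model object, intercept first and slope second:

coef(weightreg)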

3.1.1 Add a regression line to your graph

To see the line superimposed on your graph, you can chain on the command:

geom_smooth(method="lm", se=FALSE)

Add the regression line command to your scatterplot.

Start with:

ggplot(sampledata, mapping=aes(x=length, y=weight))+
  geom_point()

and then chain on the regression command with a +.

ggplot(sampledata, mapping=aes(x=length, y=weight))+
  geom_point()+
  geom_smooth(method="lm", se=FALSE)

3.1.2 Is it statistically significant?

The points fit the line quite well. It seems unlikely that this association could have happened just by accident, though it IS possible. We would like to know how likely a pattern this strong would be if there were actually no relationship between these two variables, that is, if the slope of the true line through them were zero (a flat line).

We can find the p-value by using the command summary(modelName).

Write the command to find the probability that the true slope of the line is zero. Don’t forget to use the name you gave your model variable.

summary(weightreg)

In the ‘coefficients’ table, the line for sampledata$length shows a p-value, 4.7 x 10^-8, that is much smaller than 0.05. This means that you can conclude that the effect of length on weight in this sample is highly statistically significant.

4 Statistical Analysis with Continuous and Categorical Variables

Now consider a different data analysis situation. Suppose you did a study where you measured some continuous dependent variable under two different conditions, and you want to know if the two groups differ.

Again using sampledata, consider the categorical variable temp. Suppose we want to know whether weight differs depending on whether the temp category is A or B. In this situation, we would perform a t-test.

4.1 Visualize data

First it is a good idea to look at the data visually. We can most easily do this with a boxplot.

Use the variable that defines the categories (temp) as the x variable.

In lesson 4, you used ggplot to make both boxplots and scatterplots.

Write code to make a boxplot of weights for the two temp categories.

The general structure is:

ggplot(______, aes(x=_____, y=_____))+
  geom_boxplot()
ggplot(sampledata, aes(x=temp, y=weight))+
  geom_boxplot()

You can see that the medians for the two sets of data values are different, but the values overlap quite a bit. Are they different enough that the difference is unlikely to be due to chance?

4.2 Two-sample t-test

We can find out by performing a t-test. You used a t-test in lesson 3 to calculate the confidence interval for a single variable; this is called a one-sample t-test. This time you will do a two-sample t-test, which is a way of finding out whether two groups of data differ significantly.

There are two ways of writing the t-test command.

  • If the data are in one variable, and the classification categories are in another, as they are in this case, we would write: t.test(data$dependentVar ~ data$categoricalVar), using real file and variable names in place of these placeholders.

Write the command to perform a t-test to determine if weights for the different temperatures (temp A and temp B) are significantly different.

t.test(sampledata$weight~sampledata$temp)

The output shows a p-value (in this case, 0.2308) for the probability that a difference this large could be due to chance alone. Since p is greater than 0.05, we should conclude that the differences in weights are NOT statistically significant. The two groups might really have similar weights, but just by chance, we sampled heavier individuals in one group than in the other.

Sometimes the data for the two groups we would like to compare with a t-test are in two different variables of our data table.

  • When that is the case, we use this command: t.test(varA, varB).

    • For example, if we had two separate variables for weight, based on the temperature group (instead of all of them listed in one variable “weight”), then we could use this command, as sketched below.
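As a minimal sketch of that situation (sampledata does not actually store its weights this way), you could split weight into two vectors by temperature and compare them with the two-vector form; the result is identical to the formula version used above:

# Hypothetical illustration: split weight into one vector per temp group
weightsA<-sampledata$weight[sampledata$temp=="A"]
weightsB<-sampledata$weight[sampledata$temp=="B"]
t.test(weightsA, weightsB)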

4.2.1 t-tests assume a normal distribution

It is important to remember, though, that t-tests are designed for data values with a normal distribution. Sometimes data are very non-normal. For example, go back and look at the variable repro.

You can see that there are a very large number of zeroes. This is often a good indication that the mean is not the center of the data distribution.
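A quick way to see this is a histogram of repro, using the same ggplot pattern as before (geom_histogram() will choose default bins and may print a message about them):

ggplot(sampledata, mapping=aes(x=repro))+
  geom_histogram()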

4.3 Wilcoxon test

There is an alternative to the t-test that does not require the data to have a normal distribution. It is called a Wilcoxon test (also sometimes called a Mann-Whitney test).

4.3.1 Visualize data

Before doing a statistical test, though, write some code to make a boxplot showing how the values of repro differ for the two temperatures.

The structure should be:

ggplot(_____, mapping=aes(x=_____, y=_____))+
  geom_boxplot()
ggplot(sampledata, mapping=aes(x=temp, y=repro))+
  geom_boxplot()

Once again, the medians differ, but the data values for the two groups overlap.

4.3.2 Wilcoxon test

To carry out a Wilcoxon test, use the command:

wilcox.test(data$depvar~data$catvar)

Write the command to perform a Wilcoxon test to determine whether values of repro differ for the two temp conditions.

Your dependent variable should be repro and your categorical variable should be temp.

wilcox.test(sampledata$repro~sampledata$temp)

It’s fine to ignore the warning message that appears whenever the data contain tied (duplicate) values.
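If you would rather not see that warning, wilcox.test() accepts an exact=FALSE argument, which tells R to use a normal approximation instead of attempting an exact p-value:

wilcox.test(sampledata$repro~sampledata$temp, exact=FALSE)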

You can see that the p-value (.076) tells you that the differences are NOT statistically significant, though they are nearly so.

4.4 ANOVA Test

Another data analysis situation you might encounter is the need to compare more than two groups with each other. This is done with a test called an analysis of variance (ANOVA).

Suppose we wanted to know whether diet had an effect on the length of the organisms in our sample. Notice that there are 3 diet categories.

4.4.1 Visualize data

Write a command to create a boxplot for length as a function of diet category.

Length is the dependent variable, diet is the independent variable.

ggplot(sampledata, mapping=aes(x=diet, y=length))+
  geom_boxplot()

We can see that length is quite different between diets 1 and 2, but not nearly as different between diets 2 and 3. Are these differences statistically significant?

4.4.2 ANOVA test

The command to perform an ANOVA is: model<-aov(depvar~indvar).

  • This syntax (dependent variable ~ independent variable) should be feeling familiar by now.

Write a command to perform an ANOVA on length as a function of diet, and assign the result to a variable called dietaov.

The syntax is dietaov<-aov(_____$______ ~ _____$diet)

dietaov<-aov(sampledata$length~sampledata$diet)

To see the output from the ANOVA, you can use the command summary(model), replacing the placeholder with your own model’s name.

Write a command to look at the output from your ANOVA.

summary(dietaov)

The output is a table of numbers.

  • The important line is the one labelled sampledata$diet. This reports the probability that the differences in length among the diets could be merely due to chance.

  • The p-value is very small, 0.00011, so the effect of diet on length is highly statistically significant.

However, recall that you are comparing 3 groups, and that the differences between diet groups 1 and 2 were larger than the differences between diet groups 2 and 3. Are ALL these differences statistically significant?

  • Not necessarily. A significant p-value in an ANOVA only means that at least one of the groups is different from the rest, not that ALL the groups are different from one another.

4.5 Tukey’s test

So to probe the situation more deeply, we should follow up this significant ANOVA with a Tukey’s test. A Tukey’s test looks at all the pairwise comparisons of groups, and gives a separate p-value for each comparison.

The command to perform a Tukey’s test is TukeyHSD(model).

  • HSD stands for ‘honestly significant difference.’

Write the command to perform a Tukey’s test on your model.

Note: Tukey’s test will only run if your categorical data is a “factor” variable. Earlier in this tutorial, you should have converted diet to a factor variable, but if you skipped that step, convert diet to a factor variable and rerun your ANOVA test.

TukeyHSD(dietaov)

Note: if you get an error, it might be because your diet data is not factor data. You’ll need to convert the diet variable to factor data first. Then, rerun the ANOVA command:

sampledata$diet<-as.factor(sampledata$diet)

dietaov<-aov(sampledata$length~sampledata$diet)

And finally, you can run Tukey’s test.

The output is in the form of a table that compares each of the 3 groups to the others, with the p-value for that particular comparison as the final column.

We can see that diet group 1 differs significantly from group 2, and that group 1 differs from group 3, but that diet group 2 does NOT differ significantly from diet group 3.
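If you would like a visual version of this table, the output of TukeyHSD() also has a built-in plot() method; any comparison whose confidence interval crosses zero is not statistically significant:

# Plot the pairwise confidence intervals from the Tukey’s test
plot(TukeyHSD(dietaov))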

4.6 Two-way ANOVA for multiple independent variables

ANOVA is a very versatile data analysis tool. For example, an ANOVA can be used to explore data from experiments that have more than one independent variable.

In our sampledata, there are two categorical variables, temp and diet.

  • In the study that generated these data, individuals were reared under one of two different temperature regimes, and were raised on one of three different diets.

  • We’d like to know how temp influenced body weight, how diet influenced body weight, and whether there was an interaction between the effects of temp and diet.

4.6.1 Visualize data

Let’s start, as always, by looking at a graph. In this case, we want to see a clustered boxplot that shows all 6 combinations of diet and temperature as separate boxes.

  • We’ll use ggplot() to create a boxplot of length, with temp as the independent variable.

  • Then we can use the command facet_wrap(~diet) to separate the temperatures into the three different diet categories. This creates a clustered boxplot.

  • You did something very similar with bar graphs in lesson 6. Refer to that for more information.

Write the code to create a boxplot of length, using temp for the independent variable and diet for the cluster variable.

The general structure is:

ggplot(_____, mapping=aes(x=_____, y=_____))+
  geom_boxplot()+
  facet_wrap(~______)
ggplot(sampledata, mapping=aes(x=temp, y=length))+
  geom_boxplot()+
  facet_wrap(~diet)

Notice there are 3 separated diet groups (1, 2, and 3), and within each group, the data for the two temps (A and B) are shown side-by-side. If instead you wanted two temp groups, with the 3 diets shown side-by-side within them, you would switch which variable is the x variable and which is the facet variable.

Try it now. Write a command to create a length boxplot that has two temperature groups with the three diet types shown side-by-side.

Use the same code as the previous graph, just switch diet and temp.

The general structure is:

ggplot(sampledata, mapping=aes(x=_____, y=length))+
  geom_boxplot()+
  facet_wrap(~______)
ggplot(sampledata, mapping=aes(x=diet, y=length))+
  geom_boxplot()+
  facet_wrap(~temp)

Note: If this doesn’t work, it’s probably because diet is not a factor variable. Make sure you convert it to a factor variable before graphing. Use this command: sampledata$diet<-as.factor(sampledata$diet).

4.6.2 Two-way ANOVA

We can assess the effect of both temperature and diet on length with a two-way ANOVA. The two-way refers to having two different independent variables, temp and diet in this case.

The command for performing a two-way ANOVA is:

model<-aov(data$depvar ~ data$cat1*data$cat2)

  • cat1 and cat2 are placeholders that represent the 2 independent variables.

Write a command to perform a two-way ANOVA on the effect of temp and diet on length, and store the result in a variable named length2way. Put temp before the * and diet after it.

length2way<-aov(sampledata$length~sampledata$temp*sampledata$diet)

Type a command to look at the output from your new two-way ANOVA.

summary(length2way)

The table of output gives the probability that the differences in length between the two temperatures could be due to chance alone, and the probability that the differences in length among the three diets could be due to chance alone. The interaction term assesses whether the effect of temperature depends on which diet type the organism received.

In this example, temperature did NOT have a significant effect on length (p=.205761), diet DID have a significant effect on length (p=.000124), and there was no interaction between temperature and diet (p=.154474).

In other words, the effect of temperature was the same for all diets, and the effect of diet was the same for both temperatures.
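One common way to visualize an interaction (or, as here, the lack of one) is base R’s interaction.plot(), which draws one line per diet across the two temperatures; roughly parallel lines indicate little or no interaction:

# Arguments: x-axis factor, trace factor, response variable
interaction.plot(sampledata$temp, sampledata$diet, sampledata$length)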

5 Congratulations!

Now you are familiar with some of the most basic and widely used tests of statistical inference. To learn more about statistical inference in general, consider taking Math 255. To learn more about carrying out statistical analyses in R, there are many online resources that can be easily found by Googling your specific situation.

You’ve also finished all the Bio 110 online tutorials! Excellent work! Hopefully you have a grasp of how to use R in biology classes. If you want a quick reference to what you’ve learned, check out the Bio110 Cheatsheet.