Cheatsheet for Data in the Wild Tutorials

The following cheatsheet is based on commands used in a course developed for Lewis & Clark College, called “Data in the Wild”. It covers basic R commands (including basic arithmetic and importing data), cleaning and mutating data, plotting with ggplot, and inferential statistics.

For more information and to download the tutorials used in conjunction with this cheatsheet, refer to the R files page.

Basic R Commands

Basic arithmetic
- Examples: 5*6, sqrt(9), abs(-3)
- Type help(command) to find information about any command
Create variables: Use <- or = to assign values to a variable.
- Example: x <- 7
Create a vector:
```
vector <- c(1,3,7)
vector
```
```
[1] 1 3 7
```

Create a data frame: data.frame(____)

df <- data.frame(var_1 = c(1,2,3) , var_2 = c("cat", "dog", "fish"))
df

  var_1 var_2
1     1   cat
2     2   dog
3     3  fish

Basic Statistics

mean(vector) : Calculates the mean of a given set of values.
median(vector) : Calculates the median of a given set of values.
sd(vector) : Calculates the standard deviation of a given set of values.
nrow(data): Calculates the total number of rows in a dataset
na.rm = TRUE : Remove NA values. Add this as an argument to any of the statistics calculations. E.g. mean(vector, na.rm=TRUE)
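
As a quick illustration of how na.rm changes the result, here is a minimal sketch. The vector `values` is made up for this example and is not part of the course data.

```r
# Hypothetical values for illustration; one entry is missing (NA)
values <- c(2, 4, 4, 6, NA)

mean(values)                   # NA, because one value is missing
mean(values, na.rm = TRUE)     # 4
median(values, na.rm = TRUE)   # 4
sd(values, na.rm = TRUE)       # about 1.63
```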

Working with a data set in the tidyverse

We use the tidyverse to analyze data in this cheatsheet. To install run the following code:

install.packages("tidyverse")

After installing, to load the tidyverse run:

library(tidyverse)

Importing and viewing a data set

dataFrame <- read_csv("myCSV.csv") : Creates a data frame from a file called myCSV.csv
write_csv(dataFrame, "myCSV.csv"): Creates a csv from a data frame
view(dataFrame): Enter view-mode to see the entire data frame
str(dataFrame): Gives the structure of data frame
glimpse(dataFrame): Take a quick look at a data frame
summary(dataFrame): Returns min, max, mean, median, and 1st/3rd quartiles for each numeric column in dataFrame
dataFrame$columnName: Calls up specific column from a data frame
unique(vector): Returns the unique values in a column/vector. To count them, use length(unique(vector))
length(vector): Counts the number of entries in a column/vector
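
The commands above can be tried together in a short round trip. This is only a sketch: the `pets` data frame and the temporary file path are hypothetical stand-ins for your own data and file name.

```r
library(tidyverse)

# Hypothetical data frame, written to a temporary file so the example is self-contained
pets <- tibble(name    = c("Ada", "Bo", "Cy", "Di"),
               species = c("cat", "dog", "cat", "fish"))
csv_path <- tempfile(fileext = ".csv")
write_csv(pets, csv_path)

pets_from_file <- read_csv(csv_path)     # read the file back into a data frame
glimpse(pets_from_file)                  # column names, types, and first values

unique(pets_from_file$species)           # the distinct values: "cat" "dog" "fish"
length(unique(pets_from_file$species))   # how many distinct values: 3
length(pets_from_file$species)           # total number of entries: 4
```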

Rearranging & Summarizing data

Use logical operators to combine conditions: & (and), | (or), ! (not)
Use comparison operators to describe conditions: <, >, ==, !=, <=, >=
group_by(.data, column): Takes a dataset and groups it by a column/variable
summarize(.data, summaryStat = statistic formula): Takes a dataset and outputs summary statistics that you define.
- n(): Calculates current group size. Use it inside summarize (or mutate), typically after group_by
- Combine group_by and summarize using the pipe (|>) to see summary statistics for specific groups/variables.
- The pipe: Use |> OR %>% to string functions and data together. Read as “AND THEN”.
Example:
```
  dataFrame |> 
    group_by(firstColumn)|> 
    summarize(mean_of_secondColumn = mean(secondColumn), 
              standard_deviation_of_secondColumn = sd(secondColumn),
              number_in_each_group = n(), 
              percent = n()/nrow(dataFrame)*100)
```
filter(.data=myDataFrame, column=="some value"): Subsets a dataframe by keeping only the rows that meet a condition. The condition can compare a column to a specific value or to a calculated one such as its mean or median.
select(.data, column(s)): Choose columns from a dataframe.
- column1 , column2 : choose column1 and column2
- column1:column5 : choose all columns between column1 and column5
- -column1: choose all columns except column1
mutate(.data, newColumn = yourFormula): Creates a new column in a dataset defined by a calculation you input.
arrange(.data, column): Sorts the rows by a specific column, in ascending order by default. Use arrange(.data, desc(column)) for descending order.
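
A minimal sketch of these verbs chained with the pipe is shown here. The `plants` data frame and its column names are invented for illustration.

```r
library(tidyverse)

# Hypothetical measurements for illustration
plants <- tibble(
  plot      = c("A", "A", "B", "B", "C"),
  height_cm = c(12, 30, 25, 8, 41),
  leaves    = c(4, 9, 7, 3, 12)
)

tall_plants <- plants |>
  filter(height_cm > 10 & plot != "C") |>   # keep rows that meet both conditions
  select(plot, height_cm) |>                # keep only these two columns
  mutate(height_m = height_cm / 100) |>     # add a new calculated column
  arrange(desc(height_cm))                  # sort from tallest to shortest

tall_plants
```

To filter against a calculated value instead of a fixed one, a condition such as filter(height_cm > mean(height_cm)) works the same way.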

Plotting with GGPlot

GGPlot Basics:

Basic structure:

ggplot(data, mapping=aes()) +
      geom_function()

Connect the different layers of a plot using a +
Basic Components:
- Data: data=dataSet: Define your data set
- Aesthetics: mapping = aes(variables): Define the variables. Can also specify color/fill for your graph and geometries.
- Geometry: geom_GeometricObject(): Define the type of plot
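
To make the three components concrete, here is a small sketch using R's built-in mtcars data set. The choice of data set and variables is an assumption made for this example only.

```r
library(tidyverse)

# Data: mtcars; Aesthetics: weight on x, miles per gallon on y, color by cylinder count;
# Geometry: one point per car
scatter <- ggplot(data = mtcars,
                  mapping = aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point()

scatter   # printing the object draws the plot
```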

Geometries

geom_histogram(): Creates a histogram
- Syntax: geom_histogram(bins=X) Specify number of bins
geom_point(): Creates points (scatterplot) for each data point
- Can specify color = ___ and shape = ____
geom_bar() : Creates a bar graph
- Syntax:
  - geom_bar(): Aggregates data for you
  - geom_bar(stat = "identity"): Creates a bar graph with pre-aggregated data that you input (equivalent to geom_col())
  - geom_bar(position="___") Can choose “stack” (bars on top of each other), “dodge” (bars side by side) or “fill” (bars stacked, scaled to 100%).
  - Add error bars: geom_errorbar(mapping=aes(ymin=___, ymax=___), width=___)
geom_boxplot(): Creates a boxplot
- Add error-bar style end caps to the whiskers: stat_boxplot(geom="errorbar")
geom_density(): Creates a density graph
- Syntax: geom_density(adjust=X) X is a multiplier on the default bandwidth. Larger values give a smoother graph
geom_smooth(method="lm", se=FALSE): Creates a line of best fit
labs() : Add a title and axes labels to your graph
facet_wrap(~ variable): Create a separate plot for each value of a given variable
And more!
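
Several of these pieces can be layered in one plot. The sketch below uses R's built-in ToothGrowth data set, an assumption for this example rather than course data, to combine a pre-aggregated bar graph, error bars, facets, and labels.

```r
library(tidyverse)

# Aggregate first: mean and standard deviation of tooth length per supplement and dose
growth_summary <- ToothGrowth |>
  group_by(supp, dose) |>
  summarize(mean_len = mean(len), sd_len = sd(len), .groups = "drop")

growth_plot <- ggplot(growth_summary, mapping = aes(x = factor(dose), y = mean_len)) +
  geom_bar(stat = "identity", fill = "steelblue") +            # bars from pre-aggregated data
  geom_errorbar(mapping = aes(ymin = mean_len - sd_len,
                              ymax = mean_len + sd_len),
                width = 0.2) +                                  # mean +/- 1 standard deviation
  facet_wrap(~ supp) +                                          # one panel per supplement type
  labs(title = "Tooth length by dose and supplement",
       x = "Dose (mg/day)", y = "Mean tooth length")

growth_plot
```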

Other useful GGPlot functions:

install.packages("ggplot2") and library(ggplot2): Install and load GGplot package. Note: ggplot2 is included in the tidyverse so you do NOT need to load both packages.
ggsave("yourTitleHere.jpg", currentPlotName): Saves “currentPlotName” as a jpg file called “yourTitleHere.jpg”.

Examples: Using GGPlot with other tidyverse functions

Use the pipe to string together functions and create tidy data before plotting.

The following examples use data from the palmerpenguins package.

Basic bar graph of median flipper length on each island:
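
One possible version of this graph is sketched below. It assumes the palmerpenguins package is installed, and the object name `flipper_by_island` is made up for this example.

```r
library(tidyverse)
library(palmerpenguins)   # run install.packages("palmerpenguins") first if needed

# Summarize first, then plot the pre-aggregated medians
flipper_by_island <- penguins |>
  group_by(island) |>
  summarize(median_flipper = median(flipper_length_mm, na.rm = TRUE))

island_plot <- ggplot(flipper_by_island, aes(x = island, y = median_flipper)) +
  geom_bar(stat = "identity") +
  labs(title = "Median flipper length by island",
       x = "Island", y = "Median flipper length (mm)")

island_plot
```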

Boxplot of Adelie penguin flipper length, by sex:
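
A sketch of one way to build this boxplot follows, again assuming the palmerpenguins package is installed. Dropping rows with a missing sex is a choice made for this example, and `adelie` is a made-up object name.

```r
library(tidyverse)
library(palmerpenguins)   # run install.packages("palmerpenguins") first if needed

# Keep only Adelie penguins whose sex was recorded
adelie <- penguins |>
  filter(species == "Adelie", !is.na(sex))

adelie_plot <- ggplot(adelie, aes(x = sex, y = flipper_length_mm)) +
  geom_boxplot() +
  labs(title = "Adelie flipper length by sex",
       x = "Sex", y = "Flipper length (mm)")

adelie_plot
```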

Inferential Statistics

Basic statistical tests

t.test(dependentVariable ~ independentVariable, data = data): 2 sample t-test - compare the means of 2 groups on a numerical measure
- Requires: a numerical dependent variable and a categorical independent variable with exactly 2 groups
aov(dependentVariable ~ independentVariable, data = data): ANOVA (analysis of variance) test - compare means among groups
- Requires: Independent variable - categorical; dependent variable - numerical
TukeyHSD(yourAnovaModel): Tukey’s HSD - Post Hoc test, run on the model returned by aov(), which tells you which specific means differ from each other
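
A minimal sketch of these tests is given here, using two data sets that ship with R (ToothGrowth and PlantGrowth). The data sets are an assumption for illustration. The formula goes first and the data frame is passed as data =, and TukeyHSD() takes the fitted aov model rather than the raw data.

```r
# Two-sample t-test: does tooth length differ between the 2 supplement types?
t_result <- t.test(len ~ supp, data = ToothGrowth)
t_result$p.value

# ANOVA: does plant weight differ among the 3 treatment groups?
anova_model <- aov(weight ~ group, data = PlantGrowth)
summary(anova_model)

# Post hoc test on the fitted ANOVA model: which pairs of groups differ?
tukey_result <- TukeyHSD(anova_model)
tukey_result
```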

Linear regression

cor(data$independentVariable, data$dependentVariable): Correlation coefficient - outputs a number between -1 and 1 which tells strength and direction of the relationship between 2 numeric variables
- Requires 2 numeric variables
lm(dependentVariable ~ independentVariable, data = data): Fit a linear model
- summary(yourLinearModel): View more details about a linear model, including regression coefficients and their p-values, residual quartiles, residual standard error, R-squared, and the F-statistic.
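
As a short sketch of the regression workflow, the example below uses R's built-in cars data set (speed and stopping distance). The data set and the model name `speed_model` are assumptions for illustration.

```r
# Strength and direction of the linear relationship between the two variables
speed_dist_cor <- cor(cars$speed, cars$dist)
speed_dist_cor

# Fit a linear model: stopping distance as a function of speed
speed_model <- lm(dist ~ speed, data = cars)

coef(speed_model)      # intercept and slope
summary(speed_model)   # coefficients, p-values, residuals, R-squared, F-statistic
```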