This cheatsheet is based on commands used in “Data in the Wild”, a course developed for Lewis & Clark College. It covers basic R commands (arithmetic, importing data), cleaning and mutating data, plotting with ggplot2, and inferential statistics.
For more information and to download the tutorials used in conjunction with this cheatsheet, refer to the R files page.
Basic arithmetic
Examples: 5*6, sqrt(9), abs(-3)
Type help(command) to find information about any command.
Create variables: Use <- or = to assign values to a variable.
x <- 7
Create a vector: vector <- c(1, 3, 7)
Create a data frame: data.frame(____)
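A short sketch of the basics above, using only base R. The names scores and pets, and the column names in the data frame, are made up for illustration.

```r
# Basic arithmetic: each expression returns a number
5 * 6      # 30
sqrt(9)    # 3
abs(-3)    # 3

# Assign a value to a variable
x <- 7

# Combine values into a vector with c()
scores <- c(1, 3, 7)

# Build a small data frame; each argument becomes a column
# (the column names and values here are hypothetical)
pets <- data.frame(
  name = c("Ada", "Bo", "Cy"),
  age  = c(2, 5, 9)
)
pets
```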
mean(vector): Calculates the mean of a given set of values.
median(vector): Calculates the median of a given set of values.
sd(vector): Calculates the standard deviation of a given set of values.
nrow(data): Calculates the total number of rows in a dataset.
na.rm = TRUE: Removes NA values. Add this as an argument to any of the statistics calculations, e.g. mean(vector, na.rm=TRUE).
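To see why na.rm matters, here is a small sketch with a made-up vector called heights that contains one missing value.

```r
# A hypothetical vector with one missing value
heights <- c(150, 162, NA, 171, 158)

mean(heights)                   # NA, because one value is missing
mean(heights, na.rm = TRUE)     # 160.25
median(heights, na.rm = TRUE)   # 160
sd(heights, na.rm = TRUE)       # standard deviation of the 4 known values

# nrow() works on data frames, so wrap the vector in one first
heightData <- data.frame(height = heights)
nrow(heightData)                # 5 (rows with NA are still rows)
```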
We use the tidyverse to analyze data in this cheatsheet. To install it, run install.packages("tidyverse"). After installing, load the tidyverse by running library(tidyverse).
dataFrame <- read_csv("myCSV.csv"): Creates a data frame from a file called myCSV.csv
write_csv(dataFrame, "myCSV.csv"): Creates a csv file from a data frame
view(dataFrame): Opens view mode to see the entire data frame
str(dataFrame): Gives the structure of a data frame
glimpse(dataFrame): Takes a quick look at a data frame
summary(dataFrame): Returns min, max, mean, median, and 1st/3rd quartiles for each numeric column in dataFrame
dataFrame$columnName: Calls up a specific column from a data frame
unique(vector): Returns the unique values in a column/vector. Wrap it in length() to count them.
length(vector): Counts the number of entries in a column/vector
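The sketch below round-trips a data frame through a csv file and then inspects it. It assumes the readr package (part of the tidyverse) is installed, uses the iris data set that ships with R as stand-in data, and writes to a temporary file so nothing in your working directory is touched. The name flowers is made up.

```r
library(readr)

# Write a built-in data frame to a temporary csv file, then read it back
path <- tempfile(fileext = ".csv")
write_csv(iris, path)
flowers <- read_csv(path, show_col_types = FALSE)

# Inspect the result
str(flowers)       # structure: column names and types
summary(flowers)   # min, max, mean, median, quartiles per numeric column

# Work with a single column
species <- flowers$Species
unique(species)            # the distinct species names
length(unique(species))    # how many distinct species there are
length(species)            # total number of entries in the column
```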
Use logical operators to combine conditions: & (and), | (or), ! (not)
Use comparison operators to describe conditions: <, >, ==, !=, <=, >=
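Here is how those operators behave on a vector. Each comparison is applied to every element and returns TRUE or FALSE for each one. The ages vector is made up for illustration.

```r
ages <- c(4, 12, 17, 25, 40)

ages >= 18               # FALSE FALSE FALSE  TRUE  TRUE
ages > 10 & ages < 20    # FALSE  TRUE  TRUE FALSE FALSE
ages < 5 | ages > 30     #  TRUE FALSE FALSE FALSE  TRUE
!(ages == 25)            #  TRUE  TRUE  TRUE FALSE  TRUE

# Use a condition inside [ ] to keep only the matching values
minors <- ages[ages != 25 & ages <= 17]
minors                   # 4 12 17
```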
group_by(.data, column): Takes a dataset and groups it by a column/variable.
summarize(.data, summaryStat = statistic formula): Takes a dataset and outputs summary statistics that you define.
n(): Calculates the current group size. Use it inside summarize, usually after group_by.
Combine group_by and summarize using the pipe (|>) to see summary statistics for specific groups/variables.
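A minimal sketch of group_by and summarize working together. It assumes dplyr (part of the tidyverse) is installed and uses the built-in mtcars data as a stand-in; the summary column names are made up.

```r
library(dplyr)

# One row per cylinder count, with three summary statistics each
byCylinder <- mtcars |>
  group_by(cyl) |>
  summarize(
    count    = n(),         # number of cars in each group
    meanMpg  = mean(mpg),   # average miles per gallon
    medianHp = median(hp)   # median horsepower
  )

byCylinder
```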
The pipe: Use |> or %>% to string functions and data together. Read it as “AND THEN”.
Example:
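A small base R illustration, assuming R 4.1 or later (the version that introduced |>). The pipe passes the value on its left as the first argument of the function on its right. The %>% pipe works the same way here but needs the tidyverse (or magrittr) loaded first.

```r
# Without the pipe: read from the inside out
nested <- sum(sqrt(c(1, 4, 9)))

# With the pipe: take the vector, AND THEN take square roots, AND THEN add them up
piped <- c(1, 4, 9) |> sqrt() |> sum()

piped   # 6, the same as the nested version
```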
filter(.data=myDataFrame, column=="some value"): Subsets a data frame by keeping the rows that meet a condition. The condition can compare a column to a specific value, a mean, a median, etc.
select(.data, column(s)): Chooses columns from a data frame.
  column1, column2: choose column1 and column2
  column1:column5: choose all columns from column1 through column5
  -column1: choose all columns except column1
mutate(data, newColumn = yourFormula): Creates a new column in a dataset, defined by a calculation you input.
arrange(data, column): Sorts the rows in ascending order by a specific column.
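The four verbs above chain together naturally. This sketch assumes dplyr is installed and uses the built-in mtcars data; the new column name powerToWeight is made up. Note that desc() is a dplyr helper that reverses the sort order, which is an addition to what the list above covers.

```r
library(dplyr)

efficientCars <- mtcars |>
  filter(cyl == 4, mpg > 25) |>          # keep 4-cylinder cars above 25 mpg
  select(mpg, hp, wt) |>                 # keep only these three columns
  mutate(powerToWeight = hp / wt) |>     # new column: horsepower per 1000 lbs
  arrange(desc(mpg))                     # sort from highest to lowest mpg

efficientCars
```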
Basic structure: ggplot(data = dataSet, mapping = aes(variables)) + geom_GeometricObject()
Connect different properties using a +
Basic Components:
Data: data=dataSet: Define your data set
Aesthetics: mapping = aes(variables): Define the variables. Can also specify color/fill for your graph and geometries.
Geometry: geom_GeometricObject(): Define the type of plot
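Putting the three components together gives the smallest useful plot. This is a sketch that assumes ggplot2 is installed and uses the built-in mtcars data as a stand-in for your own data set.

```r
library(ggplot2)

# Data + aesthetics + geometry, connected with +
weightPlot <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) +
  geom_point()

weightPlot   # printing the object draws the scatterplot
```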
geom_histogram(): Creates a histogram
  geom_histogram(bins=X): Specify the number of bins
geom_point(): Creates points (scatterplot) for each data point
  Can specify color = ___ and shape = ____
geom_bar(): Creates a bar graph
  geom_bar(): Aggregates data for you
  geom_bar(stat = "identity"): Creates a bar graph with pre-aggregated data that you input
  geom_bar(position="___"): Can choose “stack” (bars on top of each other), “dodge” (bars side by side) or “fill” (bars stacked, scaled to 100%).
  geom_errorbar(mapping=aes(ymin, ymax), width): Adds error bars from ymin to ymax; width sets the width of the end caps
geom_boxplot(): Creates a boxplot
  stat_boxplot(geom="errorbar"): Adds end caps to the boxplot whiskers
geom_density(): Creates a density graph
  geom_density(adjust=X): X multiplies the default smoothing bandwidth; larger values give a smoother graph
geom_smooth(method="lm", se=FALSE): Creates a line of best fit
labs(): Add a title and axis labels to your graph
facet_wrap(~ variable): Create separate plots for each level of a given variable
And more!
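Several of these pieces can be layered on one plot. The sketch below assumes ggplot2 is installed and uses the built-in mtcars data; the colors, shape number, and label text are arbitrary choices for illustration.

```r
library(ggplot2)

layeredPlot <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) +
  geom_point(color = "steelblue", shape = 17) +   # triangles instead of dots
  geom_smooth(method = "lm", se = FALSE) +        # line of best fit, no error band
  facet_wrap(~ cyl) +                             # one panel per cylinder count
  labs(
    title = "Fuel economy by weight",
    x = "Weight (1000 lbs)",
    y = "Miles per gallon"
  )

layeredPlot
```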
install.packages("ggplot2") and library(ggplot2): Install and load the ggplot2 package. Note: ggplot2 is included in the tidyverse, so you do NOT need to load both packages.
ggsave("yourTitleHere.jpg", currentPlotName): Saves “currentPlotName” as a jpg file called “yourTitleHere.jpg”.
Use the pipe to string together functions and create tidy data before plotting.
The following examples use data from the palmerpenguins package.
Here’s a preview of the penguin data, and here is a link to more information about the data.
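One way to produce such a preview yourself, assuming the palmerpenguins and dplyr packages are installed:

```r
library(palmerpenguins)
library(dplyr)

# One line per column: name, type, and the first few values
glimpse(penguins)
```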
Basic bar graph of median flipper length on each island:
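A possible version of this graph, offered as a sketch and not necessarily the code used in the course. It assumes palmerpenguins, dplyr and ggplot2 are installed. The names islandMedians and medianFlipper are made up. Because the medians are calculated before plotting, geom_bar needs stat = "identity".

```r
library(palmerpenguins)
library(dplyr)
library(ggplot2)

# Step 1: tidy the data with the pipe (one median per island)
islandMedians <- penguins |>
  group_by(island) |>
  summarize(medianFlipper = median(flipper_length_mm, na.rm = TRUE))

# Step 2: plot the pre-aggregated values
ggplot(data = islandMedians, mapping = aes(x = island, y = medianFlipper)) +
  geom_bar(stat = "identity") +
  labs(
    title = "Median flipper length by island",
    x = "Island",
    y = "Median flipper length (mm)"
  )
```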
Boxplot of Adelie penguin flipper length, by sex:
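A possible version of this boxplot, again as a sketch under the same package assumptions. Dropping penguins with a missing sex value is a choice made here so that no NA category appears on the x axis; the name adelie is made up.

```r
library(palmerpenguins)
library(dplyr)
library(ggplot2)

# Keep only Adelie penguins whose sex was recorded
adelie <- penguins |>
  filter(species == "Adelie", !is.na(sex))

ggplot(data = adelie, mapping = aes(x = sex, y = flipper_length_mm)) +
  stat_boxplot(geom = "errorbar", width = 0.3) +   # end caps on the whiskers
  geom_boxplot() +
  labs(
    title = "Adelie penguin flipper length by sex",
    x = "Sex",
    y = "Flipper length (mm)"
  )
```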
t.test(dependentVariable ~ independentVariable, data = data): 2-sample t-test - compares 2 groups on a numerical measure
aov(dependentVariable ~ independentVariable, data = data): ANOVA (analysis of variance) test - compares means among groups
TukeyHSD(yourAnovaModel): Tukey’s HSD - post hoc test which tells you which specific group means differ from each other. Takes the model returned by aov.
cor(data$independentVariable, data$dependentVariable): Correlation coefficient - outputs a number between -1 and 1 which tells the strength and direction of the linear relationship between 2 numeric variables
lm(dependentVariable ~ independentVariable, data = data): Fits a linear model
summary(yourLinearModel): Shows more details about a linear model, including regression coefficients, p-values for the coefficients, residual quartiles, residual standard error, and the F-statistic.
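A sketch of these tests on the built-in mtcars data, using only base R. In base R's formula interface the formula comes first and the data frame is passed with data =. The variable choices (mpg by transmission type, mpg by cylinder count, mpg against weight) and the names cylAnova and weightModel are for illustration only.

```r
# 2-sample t-test: does mpg differ between automatic (0) and manual (1) cars?
t.test(mpg ~ am, data = mtcars)

# ANOVA: does mean mpg differ among cylinder groups?
# factor() tells R to treat cyl as categories, not as a number
cylAnova <- aov(mpg ~ factor(cyl), data = mtcars)
summary(cylAnova)

# Post hoc test: which pairs of cylinder groups differ?
TukeyHSD(cylAnova)

# Correlation between weight and mpg
weightCor <- cor(mtcars$wt, mtcars$mpg)
weightCor   # about -0.87: a strong negative relationship

# Linear model: predict mpg from weight
weightModel <- lm(mpg ~ wt, data = mtcars)
summary(weightModel)
```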