Variables, functions, loading data

Author

Jeremy McWilliams

Learning Objectives

Gain some familiarity and comfort with rstudio
Review how to assign variables
Learn about and use functions
Learn about and use vectors
Write code to work with a data set, and calculate some descriptive statistics

Arithmetic

The grey rectangle below is a “code chunk”. Everything withing the grey area is interpreted as R code. To run the code, click the green triangle in the upper-right corner.

In this example, R can perform basic math:

2+2

[1] 4

Now it’s your turn. Enter code below to subtract ten from twenty-two:

Assigning Variables

We’ll be working a lot with variables throughout this semester. A variable is a name you give to some value. The value could be a single number, a word, a bunch of words, an entire data set, etc.

Most scripting languages use the “=” sign to assign a value to a variable, but R uses “<-”.

# assigns 10 to x
x<-10

It’s important to note that creating a new variable using code above doesn’t give you any output. Often it’s a good idea to print your variable to the screen, just to confirm it worked the way you intended:

# prints x
x

[1] 10

#Anything preceded by a "#" is a "comment". It does not get executed as code.
#Comments can be super helpful to provide info on your code.

Now it’s your turn. Create a variable “y”, set it equal to 7+9, and then print it out:

Functions

Coding languages, including R, have functions that help you quickly execute common tasks. Functions typically take the form of:

functionName(argument1, argument2, etc….)

Arguments are the inputs you send to a function, so it has all the information it needs to perform its operation.

For example, the function sqrt(number) takes the square root of a number. This lets us quickly compute the answer, rather than having to write the formula for a square root.

sqrt(9)

[1] 3

YOUR TURN: In the chunk below, create a variable z, set it equal to the square root of 90, and print it out:

z<-sqrt(90)
z

[1] 9.486833

One nice thing about rstudio is that you can readily access documentation for functions by using the “help” command:

help(sqrt)

The documentation appears in the lower right window in the “help” tab.

One key question: how do you know what functions exist, and what they do?
Answer: you Google what you’re trying to do! In the case of R, you might search “How do I do ‘x’ in R?”

Let’s say you are interested in calculating the absolute value (positive distance from zero) of -35 in R. Take a moment with your group/neighbors, and try to find the answer by searching the internet. In the code chunk below, use the function you found to compute this calculation:

# compute the absolute value of -35

Vectors

So far we’ve created variables that have single values (e.g. x<-7), but there are often cases where we need to assign multiple values to a variable. These types of variables are called vectors.

In order to create a vector, you can use the “c” function (c stands for “combine”). Here’s an example:

myFirstVector<-c(3,7,1,10)

myFirstVector

[1]  3  7  1 10

Now it’s your turn. Create a vector called mySecondVector, assign the values 8, -11, 100, 35 to it, and print it to the screen:

# use the "c" function to create mySecondVector:

Before moving on, let’s talk a little about variable naming conventions. We started out using x, y, and z when learning about variables. That technically works, but it’s better practice to be more descriptive in your variable names. The examples above and below use a syntax called “camel case”. This allows you to string words together without spaces, but preserves quick readability. From this point on, we’re going to create variables with camel case - you should too!

It may not be immediately clear what the utility of vectors is, so let’s take a look at a practical use case. Below is a vector containing the responses from you and your classmates (and U of Arizona students) on self-reported fishing skill (1 being low, 5 being high):

fishingSkill<-c(1,2,2,1,1,1,1,2,2,5,1,2,2,1,5,3,4,4,4,2,1,2,2,3,2,1,2,2,2,1,5,2,3,1,4,5,2,2,2,1,1,1,1,2,3,1,1,2,1,3,1,1,1,1,2,2,5,3,1,1,1,1,2,1,3,2,5,1,2,3,3,2,3,5,2)

fishingSkill

 [1] 1 2 2 1 1 1 1 2 2 5 1 2 2 1 5 3 4 4 4 2 1 2 2 3 2 1 2 2 2 1 5 2 3 1 4 5 2 2
[39] 2 1 1 1 1 2 3 1 1 2 1 3 1 1 1 1 2 2 5 3 1 1 1 1 2 1 3 2 5 1 2 3 3 2 3 5 2

Let’s say we’re interested in finding the average of all the responses. We can do this by use the mean function in R: (we’ll dive more into descriptive statistics next week, and how they differ from inferential)

avgFishingSkill<-mean(fishingSkill)

avgFishingSkill

[1] 2.146667

We can also calculate the median (the “middle” value, when data is in numerical order) with the median function:

medianFishingSkill<-median(fishingSkill)

medianFishingSkill

[1] 2

We can also calculate the standard deviation (a measurement of how spread apart the data is):

sdFishingSkill<-sd(fishingSkill)

sdFishingSkill

[1] 1.248711

Now it’s your turn. Given the vector below of self-reported cooking skill ranking, calculate its mean, median, and standard deviation:

cookingSkill<-c(4,5,4,1,2,5,4,4,4,4,3,2,2,2,4,4,4,3,3,5,3,3,4,4,5,5,3,4,3,3,4,4,3,4,1,4,3,4,1,4,2,3,2,4,2,2,2,4,4,1,5,3,3,4,2,4,2,3,3,2,1,3,4,2,5,5,3,5,4,2,4,4,3,3,4,4)

# calculate the mean



#calculate median



# calculate the standard deviation

Working with a data set

One of the most common uses of R is to load a data set into R as a variable, and then use that data to ask and answer questions with code. Let’s start off by loading a package called the Tidyverse. The Tidyverse is a series of functions written by data scientists to make working with data a little easier. We can load it by running the following command:

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Notice in the Files tab in the lower-right window, there is a file titled teamAntarcticaData.csv. This is a copy of the spreadsheet data from the Google form. Below, we can assign the entire data set to a variable using the read_csv function:

#load the data
teamAntarcticaData<-read_csv("teamAntarcticaData.csv")

Rows: 75 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): Timestamp, school, swim, animals, parkaColor, teamFlag, distance
dbl (5): fishing, cold, remote, bedsideManner, cooking

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

#print to screen
teamAntarcticaData

# A tibble: 75 × 12
   Timestamp       school fishing swim   cold animals remote parkaColor teamFlag
   <chr>           <chr>    <dbl> <chr> <dbl> <chr>    <dbl> <chr>      <chr>   
 1 8/30/2022 16:0… Unive…       1 Yes       4 Yes          4 Gold       Penguin 
 2 8/30/2022 16:0… Unive…       2 Yes       4 Yes          5 Blue       Bear    
 3 8/30/2022 16:0… Unive…       2 Yes       4 Yes          3 Green      Penguin 
 4 8/30/2022 16:0… Unive…       1 Yes       1 Yes          1 Blue       Seal    
 5 8/30/2022 16:0… Unive…       1 Yes       3 Yes          3 White      Sea Spi…
 6 8/30/2022 16:0… Unive…       1 Yes       3 Yes          3 hot pink   Penguin 
 7 8/30/2022 16:0… Unive…       1 Yes       2 Yes          3 Blue       Sea Spi…
 8 8/30/2022 16:0… Unive…       2 Yes       2 Yes          4 Blue       Penguin 
 9 8/30/2022 16:0… Unive…       2 Yes       2 Yes          5 White      Bear    
10 8/30/2022 16:0… Unive…       5 Yes       5 Yes          5 Blue       Penguin 
# ℹ 65 more rows
# ℹ 3 more variables: distance <chr>, bedsideManner <dbl>, cooking <dbl>

Earlier in this exercise we looked at the array of responses for both fishing and cooking aptitude, though in both cases the vectors were hand-typed (by me). A much more common way to acquire, and then use, a vector of data is to directly query the data set. You can get a vector (a.k.a. column) of data by using the following syntax:

dataSet$columnName

Let’s get all responses for fishing aptitude directly from the data set:

fishing<-teamAntarcticaData$fishing

fishing

 [1] 1 2 2 1 1 1 1 2 2 5 1 2 2 1 5 3 4 4 4 2 1 2 2 3 2 1 2 2 2 1 5 2 3 1 4 5 2 2
[39] 2 1 1 1 1 2 3 1 1 2 1 3 1 1 1 1 2 2 5 3 1 1 1 1 2 1 3 2 5 1 2 3 3 2 3 5 2

And just like before, we can calculate the mean, median, and standard deviation:

mean(fishing)

[1] 2.146667

median(fishing)

[1] 2

sd(fishing)

[1] 1.248711

Now it’s your turn:

Use the data set to get the column values for tolerance of cold (hint: after typing the $, use auto-complete to select the column name). Calculate its mean, median, and standard deviation.

# create a vector that contains the column values for cold tolerance



#calculate the mean



#calculate the median



# calculate the standard deviation

Now do the same for comfort level with being in a remote location:

# create a vector that contains the column values for comfort level with remote location




#calculate the mean



#calculate the median



# calculate the standard deviation

Now create a vector to get the responses for parka color. How is this data different from the other examples we’ve seen? What can we learn from the data?