This tutorial, which accompanies the textbook Statistics Using R: An Integrative Approach by Weinberg, Harel, & Abramowitz (Cambridge University Press, 2021 expected publication date), covers the basic use and manipulation of datasets, which are also referred to as data frames, in R. The activities covered in this tutorial are designed to help you understand the examples in each chapter and to complete end-of-chapter exercises in the textbook. It also is designed to help you learn some basic coding skills that will be helpful when working with data frames in R so as to aid your ability to complete more complex analyses and, ultimately, to learn the larger statistical concepts covered throughout the textbook.
To close this tutorial, you will need to exit this tab in your browser window and press Escape within the Console window of RStudio. Note that when you close the tutorial, your progress will be saved until you re-open it next time. To clear your progress before closing the tutorial, click the Start Over button at the bottom of the browser screen.
We will use the Framingham dataset, used throughout the textbook and contained within the sur package, which accompanies the textbook, as an example data frame in the tutorial. The Framingham dataset is based on a longitudinal study investigating factors relating to coronary heart disease. A more complete description of the Framingham dataset may be found in the textbook in Appendix A. Alternatively, you can type ?Framingham into the Console window and press Enter. This will cause the description of the dataset to open in the help tab of RStudio.
To find out more about any command or operator used in the tutorial, type a ? before its name in the Console window (in either R or RStudio) and press Enter, or click on the Help tab in the lower right window pane of RStudio.
The answers you provide to the coding exercises are not checked for correctness, but the solution to each exercise is available by clicking on the Solution button along the top of the code box.
In this section, we describe a couple of basic data structures in R. Then, we explain how to access datasets included in the sur package. Finally, we briefly review common commands for reading in datasets from outside sources other than the sur package.
A data frame is a particular type of data structure in R that is organized by rows and columns. Typically, each row holds the values recorded for one observation or subject, and each column holds the values of one variable. The columns in a data frame may be named (e.g., by the name of the variable stored in that column), and different columns may hold different types or classes of values (e.g., one column may hold numbers and another may hold non-numerical character strings). Each column in a data frame is a vector, meaning that all the values (or elements) in that column are of the same type or class (e.g., they are all numerical or all character strings). Other data structures exist in R, including matrices and lists, but they are beyond the scope of this tutorial.
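As a rough illustration of the row-and-column structure described above, the sketch below builds a small data frame by hand. The object name toy and its values are made up for this example and are not part of the sur package.

```r
# Toy data frame with hypothetical values (not taken from the sur package)
toy <- data.frame(
  ID  = c(1, 2, 3),
  SEX = c("Men", "Women", "Women"),
  AGE = c(52, 41, 44),
  stringsAsFactors = FALSE  # keep SEX as character strings in every R version
)

toy             # each row is one observation; each column is one variable
class(toy$ID)   # every value in this column is numeric
class(toy$SEX)  # every value in this column is a character string
```

Each column holds values of a single class, while different columns of the same data frame may hold different classes.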
All datasets used in Statistics Using R: An Integrative Approach are contained within the sur package. They become available as data frames once the package is installed and loaded: run install.packages("sur") once on each computer you use, and then run library(sur) each time RStudio is opened. For instance, we simply have to type Framingham to see the Framingham dataset printed by R. Type Framingham below. Then click the Run Code button or place the cursor on the line of code and use a keyboard shortcut: Command+Enter for Mac or Ctrl+Enter for Windows and Linux.
Framingham
Now that we have accessed and printed the Framingham dataset to the console, we can see some of the information included in the dataset: each row appears to represent data for an individual with a specific identification number (given by the ID column). The data for each individual appear to include both numeric measurements and categorical information. We will inspect this data frame in more detail in the coming sections.
There are many types of objects in R. As noted above, data frames are a type of data structure, holding a collection of variables. When the name of a data frame is typed into the console, R prints its contents. A package is also an object in R, but it contains an assortment of related data, functions, and other code. If we want R to do something with these objects (other than simply printing data frame contents), we have to give R a command, also known as a function. For example, we used the function install.packages to install the sur package to our library. R knew which package we wanted to install because we listed sur as an argument of install.packages: we put "sur" in the parentheses following the command. Arguments to functions may tell R on what object the command should act, or even how to act on it. Likewise, when we wanted to access the contents of the sur package within an R session, we used the library function followed by sur in parentheses, telling R to open the sur package from our library. Note that install.packages required the argument sur to be in quotes while library did not.
If you would like to read into R as a data frame a dataset that is not part of the sur package, but that is, instead, from an outside source, you may do so with one of several R functions. Two popular such R functions are the following:
read.csv – allows you to read in files with comma-separated values (CSV).
read.delim – allows you to read in delimited text files. By default it expects values separated by tabs, and its sep argument can be set to another separator, such as a space or a comma. This makes it convenient for a greater variety of file layouts to be read in and converted into data frames.
These functions can even read data in from web addresses, so that the user does not have to download and save the file before reading it into R. We will not be practicing these commands within this tutorial, since they are not needed to access our datasets, but users should know them for when they need to conduct analyses on other datasets. To practice these commands, see the end-of-chapter exercises for Chapter 1 in the Statistics Using R textbook. For further information on these functions, type ?read.csv or ?read.delim into the Console window and press Enter, or search for the commands in the Help tab of the lower righthand windowpane of RStudio.
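The following is a minimal sketch of both functions, not a substitute for the end-of-chapter exercises. It assumes two tiny files written to a temporary location; the file contents and the object names csv_path, tab_path, from_csv, and from_tab are invented for this example.

```r
# Write a tiny comma-separated file to a temporary location (hypothetical data)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("ID,AGE", "1,52", "2,41"), csv_path)
from_csv <- read.csv(csv_path)

# Write the same data as a tab-separated file
tab_path <- tempfile(fileext = ".txt")
writeLines(c("ID\tAGE", "1\t52", "2\t41"), tab_path)
from_tab <- read.delim(tab_path)

# Both results are data frames with the same columns and values
from_csv
from_tab
all.equal(from_csv, from_tab)
```

In practice, the temporary path would be replaced by the location of your own file or by a web address.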
In this section, we show how to view the top and bottom rows of a data frame, how to quickly obtain the dimensions of a data frame, and how to initially examine the structure and variables of a data frame.
Now that you know how to access a dataset, we will show you how to obtain some initial information about it using the Framingham dataset as an example. As you will see below, when you simply type Framingham, only some observations (rows) and variables (columns) will be printed in the output window at one time. While this tutorial allows scrolling, outside of this tutorial R has a maximum number of rows and columns it will print to the console at one time. To overcome this limitation and be able to access information more readily in the dataset, we introduce a number of different R commands. In particular, to get an overview of what the data look like we may view the first n rows of the data frame by using the head command and typing, not simply Framingham, but head(Framingham). By default, R sets n to be 6. The appropriate code is given below: we use the function head with the argument Framingham to tell R to print the first 6 rows of Framingham in the output window. To see the output, hit the Run Code button, or use the shortcut Command+Enter for Mac or Ctrl+Enter for Windows and Linux.
head(Framingham)
Analogously, we can use the tail command to print the last n rows of a data frame in the output window. The default for n for this command also is 6. To print the last 3 rows instead of 6, we specify that n is to be equal to 3 by adding the argument n = 3 in the tail command after a comma as shown below. Click the Run Code button or use a keyboard shortcut to run the code below.
tail(Framingham, n = 3)
We can see the row numbers of the Framingham dataset printed alongside the data frame in an unnamed column on the left side. From these row numbers we can tell that R printed rows numbered 1-6 when we used the head command under default settings, and R printed rows 398-400 when we used the tail command with the argument n = 3.
As noted earlier, a data frame in R has data arranged in rows and columns, where, typically, the rows represent observations and the columns represent variables. Accordingly, to determine how many observations a data frame has, we simply need to find out the row dimensionality of the data frame. Likewise, to determine how many variables a data frame has, we simply need to find out the column dimensionality of the data frame. To do so, we use the command dim, which stands for dimension, and type dim(Framingham). Try this command in the code box below.
dim(Framingham)
The dim command returns a vector with the number of rows (observations) as the first element and the number of columns (variables) as the second element.
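To see head, tail, and dim side by side without loading the sur package, here is a small sketch on a made-up data frame. The name toy and its ten rows are assumptions for illustration only.

```r
# Toy data frame with ten rows (hypothetical values)
toy <- data.frame(
  ID  = 1:10,
  AGE = c(52, 41, 44, 60, 36, 61, 36, 57, 52, 51)
)

head(toy)          # first 6 rows, the default value of n
tail(toy, n = 3)   # last 3 rows only

dims <- dim(toy)   # number of rows first, number of columns second
dims[1]            # number of observations
dims[2]            # number of variables
```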
A useful command for learning more about the variables in a dataset is the str command, which stands for structure. From this command we may learn about (1) the way in which a dataset is structured (for the Framingham dataset, the data are structured as a data.frame as defined earlier), (2) how many rows and columns the dataset has, and (3) the name of each variable along with whether the variable is numeric or non-numeric. Variables that are listed as numeric (noted as num) are either ratio- or interval-levelled, and variables that are listed as factor (noted as Factor) are either nominal- or ordinal-levelled. Details on working with these two types/classes of variables will be covered in the Data and Variable Types section of the tutorial. A single data frame may contain both numeric and factor variables.
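As a small sketch of what str reports, the example below uses a made-up data frame with one numeric and one factor variable. The name toy and its values are assumptions for illustration.

```r
# Toy data frame with one numeric and one factor variable (hypothetical values)
toy <- data.frame(
  AGE = c(52, 41, 44),
  SEX = factor(c("Men", "Women", "Women"))
)

str(toy)                        # structure, dimensions, and one line per variable
col_classes <- sapply(toy, class)
col_classes                     # the class of each column, by variable name
```

The sapply call is simply another way to list the class of each column; it is not used elsewhere in this tutorial.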
Inspect the output of running str on Framingham below.
str(Framingham)
It is often the case that a particular analysis will involve not all of the variables or not all of the observations in a data frame, but only a subset of each of them. To carry out an analysis on only a subset of variables or observations, it is first necessary to select that subset of variables or observations. R refers to this as subsetting the data. In this section we review methods for subsetting the data and, in particular, for selecting a particular subset of variables (i.e., columns of the data frame) or particular subset of observations (i.e., rows of the data frame). As we will describe in detail in the following sections, a subset of columns may be selected by identifying the names of the variables stored by those columns. Columns also may be selected by identifying their placement within the data frame (e.g., 1st, 2nd, 20th, etc.). We refer to this number as a column’s index (plural: indices).
In R, data frames can be subsetted by selecting rows and/or columns. These rows and columns can be selected by name or by index. First, we will look at how to select a single column by name. Recall the output of str(Framingham), which is shown below.
## 'data.frame':    400 obs. of  33 variables:
##  $ ID       : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ SEX      : Factor w/ 2 levels "Men","Women": 1 1 1 1 1 1 1 1 1 1 ...
##  $ TOTCHOL1 : num  260 195 185 278 210 235 212 176 215 245 ...
##  $ AGE1     : num  52 41 44 60 36 61 36 57 52 51 ...
##  $ SYSBP1   : num  142 139 115 160 112 ...
##  $ DIABP1   : num  89 88 69 96 85.5 81 98 97 80 69 ...
##  $ CURSMOKE1: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ CIGPDAY1 : Factor w/ 22 levels "not a current smoker",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BMI1     : num  26.4 26.9 22.3 26.4 21.9 ...
##  $ DIABETES1: Factor w/ 2 levels "Not a diabetic",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BPMEDS1  : Factor w/ 2 levels "Not currently used",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ HEARTRTE1: num  76 85 65 55 71 56 72 68 70 85 ...
##  $ GLUCOSE1 : num  79 65 82 75 77 90 75 94 87 NA ...
##  $ PREVCHD1 : Factor w/ 2 levels "Free of CHD",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ TIME1    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ TIMECHD1 : num  8766 8766 5317 3367 8766 ...
##  $ TOTCHOL3 : num  280 162 216 236 NA 290 201 150 230 NA ...
##  $ AGE3     : num  64 53 56 71 NA 73 48 69 64 NA ...
##  $ SYSBP3   : num  168 152 117 156 NA ...
##  $ DIABP3   : num  100 101 70 96 NA 75 110 96.5 82 NA ...
##  $ CURSMOKE3: Factor w/ 2 levels "No","Yes": 1 1 2 1 NA 1 1 1 1 NA ...
##  $ CIGPDAY3 : num  0 0 20 0 NA 0 0 0 0 NA ...
##  $ BMI3     : num  25.7 26.4 21.4 22.2 NA ...
##  $ DIABETES3: Factor w/ 2 levels "Not a diabetic",..: 1 1 1 2 NA 1 1 2 1 NA ...
##  $ BPMEDS3  : Factor w/ 2 levels "Not currently used",..: 1 1 1 NA NA 1 1 1 1 NA ...
##  $ HEARTRTE3: num  92 105 72 60 NA 72 60 80 72 NA ...
##  $ GLUCOSE3 : num  82 78 49 NA NA 71 66 125 87 NA ...
##  $ PREVCHD3 : Factor w/ 2 levels "Free of CHD",..: 1 1 1 2 NA 1 1 1 1 NA ...
##  $ TIME3    : num  4438 4411 4070 4173 NA ...
##  $ HDLC3    : num  44 31 42 45 NA 52 48 40 49 NA ...
##  $ LDLC3    : num  236 91 174 171 NA 238 153 110 181 NA ...
##  $ TIMECHD3 : num  8766 8766 5317 3367 NA ...
##  $ ANYCHD4  : Factor w/ 2 levels "CHD Event Did Not Occur",..: 1 1 2 2 1 2 1 2 1 1 ...
Notice that each variable name in the output has a $ in front of it. This operator is how we reference a specific column within data frames in R. Say we want to call just the SEX variable from the Framingham dataset. We would simply enter Framingham$SEX and R would print the values of the SEX variable as the output. Try selecting just the variable AGE1 from Framingham using the $, and run the code below. Remember, you can always view the solution by clicking the Solution button at the top of the code box.
Framingham$AGE1
Once a variable is selected, there are other operations that can be performed on that variable beyond simply printing the values of that variable in the output window. If we would like to compute the mean of the variable AGE1, for example, we would use the mean command as shown below. Other commands for summarizing the values of a variable follow the same format and are given throughout the textbook.
mean(Framingham$AGE1)
In this section, we use index values to describe how to access a single value or multiple values within a data frame. A standard way to access a value in any row-by-column array is to specify the value’s row index and column index, in that order. In R we use brackets [ ], instead of the $, following the name of the data frame and place the row and column indices within the brackets separated by a comma. Thus, if we wanted to access and print the value in the second row, fourth column of the Framingham dataset, we would type Framingham[2,4].
To verify that we will indeed obtain the value in the second row and fourth column of the Framingham dataset, use the head command to print the first three rows of Framingham in the output window. Then type Framingham[2,4] and verify that it matches the value in the second row and fourth column.
head(Framingham, n = 3)
Framingham[2,4]
From the head output, we note that the variable AGE1 occupies the fourth column. To select all the values in the entire fourth column we, once again, use the bracket operator, but rather than specifying a single value for the row, as we did earlier, we now leave the row index blank: Framingham[,4]. Alternatively, we can use the column name AGE1 within the bracket instead of the number 4: Framingham[,"AGE1"]. Note that when we use the column name within brackets, we must place the name within quotes; we do not use quotes when using the $ operator.
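The bracket patterns described so far can be sketched on a small made-up data frame. The name toy, its column names, and its values are assumptions for illustration and do not come from the sur package.

```r
# Toy data frame whose fourth column is AGE (hypothetical values)
toy <- data.frame(
  ID      = c(1, 2, 3),
  SEX     = c("Men", "Men", "Women"),
  TOTCHOL = c(260, 195, 185),
  AGE     = c(52, 41, 44),
  stringsAsFactors = FALSE
)

toy[2, 4]      # single value: second row, fourth column
toy[, 4]       # whole fourth column, selected by index
toy[, "AGE"]   # same column, selected by name (quotes required)
toy$AGE        # same column, selected with $ (no quotes)
```

All three column selections return the same vector of values.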
In the space below, access and print all the values in the TOTCHOL1 variable from Framingham in three ways: (1) use the $ operator, (2) use brackets with the column index number, and (3) use brackets with the variable name in quotes. We can find the column’s index number by examining the output from having executed the head command in the previous exercise. Verify that the outputs from the execution of these three commands are identical by checking that the first three values are the same.
Framingham$TOTCHOL1
Framingham[,3]
Framingham[,"TOTCHOL1"]
If, instead of accessing and printing all the values of a single variable, we wanted to access and print all the values of several variables that are in sequence in the dataset, we can do so using a colon, :, within the brackets. The variables may be referred to either by their column indices or by their names, although the colon shortcut applies to indices only. For example, if we wanted to access and print all the values of the four variables TOTCHOL1, AGE1, SYSBP1, and DIABP1, we note from our earlier work using the head command that these four variables are in columns 3, 4, 5, and 6. Accordingly, we may use the colon operator to access them by referencing their indices as follows: Framingham[, 3:6]. Here 3 and 6 are, respectively, the first and last indices of the four variables of interest. Another way to access these four variables is by their names, using the following command: Framingham[, c("TOTCHOL1", "AGE1", "SYSBP1", "DIABP1")]. When more than one variable is named, the names need to be combined into a single vector using the c function, which stands for concatenate. It also is worth noting that the names of the variables within the brackets must be in quotes.
If instead of selecting all rows for a specific column, we wanted to select all columns for specific rows, we again use brackets. By analogy, we now leave the column entry blank within the brackets. Thus, if we wanted to access and print the values of all the variables (columns) for just the first subject, we would type Framingham[1,]. If, instead, we wanted a subset of sequential rows, we would, as before, use the colon operator, :, separating the first and last indices in the sequential set. For example, if we wanted to access the values for all the variables for only the first three rows, we would type Framingham[1:3,]. Note that this is identical to calling head(Framingham, n = 3).
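Here is a minimal sketch of these column and row selections on a made-up data frame. The name toy, its column names, and its values are assumptions for illustration.

```r
# Toy data frame with five rows and five columns (hypothetical values)
toy <- data.frame(
  ID      = 1:5,
  TOTCHOL = c(260, 195, 185, 278, 210),
  AGE     = c(52, 41, 44, 60, 36),
  SYSBP   = c(142, 139, 115, 160, 112),
  DIABP   = c(89, 88, 69, 96, 85.5)
)

by_index <- toy[, 2:5]                                   # columns 2 through 5
by_name  <- toy[, c("TOTCHOL", "AGE", "SYSBP", "DIABP")] # same columns by name

toy[1, ]       # all columns for the first row
toy[1:3, ]     # all columns for rows 1 through 3

# Compare the first three rows obtained with brackets and with head
all.equal(toy[1:3, ], head(toy, n = 3))
```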
If we want to select indices that are not sequential, we can use the c function within the brackets to group the indices of interest together. For instance, if we want all the rows for the second, fifth, and ninth columns, we would call Framingham[,c(2,5,9)]. We can also use c within the brackets to refer to multiple columns by name. Try calling all rows for the SYSBP1 and BMI1 variables below. Note again that variable names need to be in quotes when using brackets to subset.
Framingham[,c("SYSBP1","BMI1")]
There are other commands besides str and head for obtaining information about a data frame. One such command is names, which prints the names of the variables in a dataset in the Console window. Its syntax is similar to that of the str and head commands. Execute the names command on Framingham to see the names of the variables appear in the output window in the same order as they are in the dataset.
names(Framingham)
For this next exercise, we have already created a new vector of 400 values called X that is available for use in this tutorial. X is not part of the Framingham dataset, but since X has the same number of values as the number of rows in Framingham, we can add it to the data frame as a new column. The quickest way to do this is to use an equals sign, =, to assign X as a new variable in the dataset. So that it is clear that we want this new variable to be part of the Framingham dataset, we assign the variable X to a name that includes the name of the dataset as well, Framingham$new_var. The code for this is displayed below. To be clear, the left side of the equation tells R that we are adding a new column to Framingham and we are naming this column new_var; the right side of the equation tells R that new_var will be getting the data contained in our outside variable X.
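The same assignment pattern can be sketched on a made-up data frame. The names toy and x and their values are assumptions for illustration; x plays the role of the outside vector and has one value per row.

```r
# Toy data frame and an outside vector with one value per row (hypothetical values)
toy <- data.frame(ID = 1:4, AGE = c(52, 41, 44, 60))
x <- c(10, 20, 30, 40)

length(x) == nrow(toy)   # the lengths must match before x can become a column

toy$new_var = x          # left side names the new column; right side supplies the data
names(toy)               # new_var now appears at the end of the list of names
```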
The line of code below will not produce any output in the console when run. To verify that the data of X has been added as a new variable in Framingham called new_var, add code to check the variable names of Framingham using the names command.
Framingham$new_var = X
names(Framingham)
new_var is not a particularly good name for a variable, as it tells us nothing about what that variable measures, nor does it stick to the all-capitals naming convention of the Framingham dataset. Since new_var is just a fake variable that we made up for practice, let’s rename it FAKE. We can use the names command and brackets to assign a new name to new_var. On reviewing the output produced by names(Framingham), we note that the output consists of a single row of names. Output consisting of either a single row or a single column may be described as an array of only one dimension. Arrays of one dimension are called vectors. As we have learned, by contrast, data frames consist of two dimensions, both rows and columns. Because X was added as a new variable at the end of the list of variables in the Framingham dataset, and the Framingham dataset originally had 33 variables, new_var became the 34th variable in that dataset. To refer to new_var given that it is the 34th element of the vector of names, we use the names(Framingham) command followed by the number 34 in brackets as follows: names(Framingham)[34]. To change the name of new_var to FAKE, we assign the name FAKE to the 34th element in the names(Framingham) vector using the equals sign =. In writing this code, we must remember to place the name FAKE in quotes because quotes need to be used when we refer to the name of a variable. As a distinction, when we refer to the set of values of a variable, quotes are not used. Write the code to execute this name change and then print all the variable names again to verify that the name new_var has been changed to FAKE.
names(Framingham)[34] = "FAKE"
names(Framingham)
Sometimes we want to remove columns from a data frame, perhaps because they were created in error or we decide they are unnecessary. We can remove a column by assigning it the object NULL. Access the FAKE column from Framingham and assign it NULL using the = operator. Then check that the column has been removed by using the names command.
Framingham$FAKE = NULL
names(Framingham)
Sometimes we want to look at values of a variable just for a certain group or just for a certain condition, or we want to compare statistics on a variable by group. We can do this using brackets and a logical statement. In R, a logical statement is a statement that is evaluated as either TRUE or FALSE. For instance, 3 < 5 states that 3 is less than 5. When we enter this in R, the returned value is TRUE. If we try 3 > 5, we would get back FALSE. We can use relational operators for characters as well: "dog" == "cat" comes back FALSE, but "dog" == "dog" comes back TRUE. Note that R is case sensitive, so "dog" == "DOG" also comes back FALSE. The following relational operators may be used to create logical statements:
< means less than
<= means less than or equal to
> means greater than
>= means greater than or equal to
== means equal to
!= means not equal to
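Each operator in the list above can be tried directly on single values. The sketch below collects the results in a named vector; the name results and its labels are chosen only for this example.

```r
# One comparison per relational operator
results <- c(
  less_than        = 3 < 5,
  less_or_equal    = 3 <= 3,
  greater_than     = 5 > 3,
  greater_or_equal = 3 >= 5,
  equal_to         = "dog" == "dog",
  not_equal_to     = "dog" != "DOG"   # R is case sensitive, so these strings differ
)

results   # each element is either TRUE or FALSE
```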
If we specify a variable, which is an entire column of values, on the left side of the logical statement, each value in that variable will be checked against the right side. First, print the AGE1 variable in Framingham. Then, check if the values in this variable are less than 50. You will notice that the value returned is TRUE whenever the logical expression is true (i.e., when the value of AGE1 is less than 50) and otherwise the value returned will be FALSE.
Framingham$AGE1
Framingham$AGE1 < 50
In the previous example, the code Framingham$AGE1 < 50 produces a returned value of either TRUE or FALSE for each of the 400 values of AGE1 depending upon whether the value of AGE1 was less than 50 or not. We also can use a logical expression to subset a variable and select cases from the dataset for which the logical expression is true. To do so for this example, we would use the command Framingham$AGE1[Framingham$AGE1 < 50] to obtain the age values of only those cases with ages less than 50, as shown below. Said another way, we are asking R to return values from Framingham$AGE1, but only those for which the statement Framingham$AGE1 < 50 is true.
##   [1] 41 44 36 36 43 41 44 49 44 43 44 48 48 41 49 46 33 38 40 46 43 44 42 43 46
##  [26] 36 41 38 36 37 44 45 49 43 44 44 36 43 39 44 45 44 36 37 35 48 45 38 38 40
##  [51] 36 40 49 42 39 49 39 40 34 43 35 42 39 45 43 40 44 44 38 37 35 43 45 44 46
##  [76] 36 49 41 41 44 46 43 41 41 46 39 40 44 47 39 40 47 43 39 43 37 40 47 41 39
## [101] 42 43 43 42 45 46 38 39 47 46 40 47 39 48 40 47 39 44 41 39 49 45 38 42 41
## [126] 40 41 42 49 46 37 47 49 44 44 46 45 39 39 36 39 42 46 46 45 47 47 41 43 38
## [151] 45 38 37 47 37 38 46 45 49 42 42 44 46 45 43 40 43 39 40 36 44 44 43 43 48
## [176] 40 44 42 42 48 49 34 39 49 41 36 44 46 43 41 42 40 49 41 43 38 41 46 40 47
## [201] 36 44 49 40 43 46 39 36 43 46 43 36 43 48 39
Let’s suppose, instead, we had wanted the systolic blood pressure for only women. In this case, we would use the command: Framingham$SYSBP1[Framingham$SEX == "Women"]. Said differently, we would be subsetting the SYSBP1 variable by whether or not the case is a woman, obtaining as output the systolic blood pressure values of just the cases for which the SEX variable has the value "Women". Below, try subsetting the AGE1 variable in the same way to see just the ages of women.
Framingham$AGE1[Framingham$SEX == "Women"]
Let’s suppose we now wanted the age values of women whose systolic blood pressure is 130 or more. To obtain these results we would need to include two logical expressions within the brackets, in this case connected by an “and”. One logical expression would specify that SEX == "Women" and the other that SYSBP1 >= 130. In R, “and” is represented by the & operator and “or” is represented by the | operator. Accordingly, the code for subsetting age to those cases who are women and who have systolic blood pressure greater than or equal to 130 is: Framingham$AGE1[Framingham$SEX == "Women" & Framingham$SYSBP1 >= 130]. Below, try using the & operator to subset systolic blood pressure (SYSBP1) to just the observations for subjects who are women and are 60 years old or older.
Framingham$SYSBP1[Framingham$SEX == "Women" & Framingham$AGE1 >= 60]
Now try using the | operator to subset SYSBP1 for the youngest and oldest subjects: those younger than 35 or older than 65.
Framingham$SYSBP1[Framingham$AGE1 < 35 | Framingham$AGE1 > 65]
As mentioned earlier in the tutorial, each column of a data frame is a vector of values that are all of the same type. There are several classes of vectors in R, but in this section, we will limit the discussion to just a few important ones: numeric, character, logical, and factor. We can check the class of a vector with the class function.
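As a quick sketch, the class function can be applied to one small vector of each of the four classes. The vector names and values below are made up for this example.

```r
# One small vector of each class discussed in this section (hypothetical values)
num_vec <- c(1.5, 2, 3)
chr_vec <- c("a", "b", "c")
lgl_vec <- c(TRUE, FALSE, TRUE)
fct_vec <- factor(c("No", "Yes", "No"))

class(num_vec)   # numeric
class(chr_vec)   # character
class(lgl_vec)   # logical
class(fct_vec)   # factor
```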
Numeric data is exactly what it sounds like: numbers! Numeric vectors typically store their values in double precision, which allows for decimals, and these values can be mathematically operated upon. We might use numeric vectors to store values for interval- or ratio-level measurements. See Chapter 1 of Statistics Using R: An Integrative Approach for a review of measurement levels of variables.
Character data consist of strings of letters and/or numbers contained in quotes. Character vectors might hold nominal- or ordinal-level measurements, and may require conversion to factor vectors in later stages, but more on this shortly. Numbers in quotes are characters and, as such, cannot be mathematically operated on. Let’s look at a quick example of this using two vectors that we will create using the c function we first saw in the previous section. In the space below, vector A has been assigned the numbers 1, 4, 7, 5, and 0, all in quotes. Create a vector B that is assigned those same numbers, but without quotes.
A = c("1", "4", "7", "5", "0")
B = c(1, 4, 7, 5, 0)
In R, we can double every value in a numeric vector by multiplying that vector by 2 using the * as the multiplication operator. Check the class of each vector using the class command. Then try multiplying the numeric vector by two.
class(A)
class(B)
B*2
A*2
As we can see from the output, A*2 returns an error because the vector A is not a numeric vector. Since all of the elements in A contain only numbers, we can easily convert A from character data to numeric by applying the as.numeric function to A and assigning the result to the name A. This means we will be replacing A with a numeric version of itself and overwriting the previous character version. Further, instead of using class, we can use is.numeric to check if A is now numeric. The code for the conversion of A to numeric is shown below. Type the appropriate code to check if the conversion worked.
A = as.numeric(A)
is.numeric(A)
In an earlier section, we described how logical statements in R evaluate to either TRUE or FALSE. It follows that logical vectors contain only the elements TRUE or FALSE. Internally, R stores the values of TRUE and FALSE as 1 and 0, respectively. To demonstrate this, let’s print the logical statement that identifies whether a subject is less than 40 years old (based on the AGE1 variable), and then, let’s put this entire statement within the as.numeric command.
Framingham$AGE1 < 40
as.numeric(Framingham$AGE1 < 40)
Notice that each TRUE is represented by a 1 and each FALSE by a 0. This internal coding using the numbers 1 and 0 makes it possible to perform many operations on logical variables. For example, suppose we wanted to know the number of subjects under age 40 in the Framingham dataset. We know from the subsetting section of the tutorial that we can select those values using a logical statement in brackets. We could select these individuals and then get the length of the new vector using the length command. However, we could do this more efficiently by summing the logical statement using the sum command. R will add all the 1’s in the vector and return the total number of cases where a subject’s age is less than 40. Try this in the space below: take the sum of the logical statement that returns TRUE if an individual in Framingham is younger than 40. Note that you do not need to use the as.numeric command here because the sum command accesses the internal numeric codes, 1 and 0, directly.
sum(Framingham$AGE1 < 40)
Thus, we find that 57 of the 400 individuals in Framingham are under 40 years old.
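The two counting approaches can be compared on a short made-up vector. The name age and its ten values are assumptions for illustration, so the counts differ from the Framingham result above.

```r
# Ten hypothetical ages
age <- c(52, 41, 44, 60, 36, 61, 36, 57, 52, 51)

under_40 <- age < 40                 # logical vector: TRUE where age is below 40

count_by_length <- length(age[under_40])   # subset first, then count the values
count_by_sum    <- sum(under_40)           # add up the TRUEs directly

count_by_length
count_by_sum

# The same idea works with a compound condition
sum(age < 40 | age > 60)
```

Both approaches give the same count; summing the logical vector simply skips the intermediate subset.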
Factor vectors contain the elements of categorical variables, such as nominal- and ordinal-level measurements. R encodes (internally represents) the levels (categories) of the variable as numbers, but allows the labels of these levels to be strings of numbers or characters. When we classify a vector as a factor variable, R will enter the variable correctly into models as a categorical variable rather than as a numeric variable.
Recall that when we use the str command, R prints the class of each variable after its name. We also can check if a specific variable is a factor with is.factor. Again from Framingham, check if CURSMOKE1, the variable that indicates if a subject is a current smoker, is a factor variable. Then check what the categories of CURSMOKE1 are by running the levels command on this variable.
is.factor(Framingham$CURSMOKE1)
levels(Framingham$CURSMOKE1)
We can see that the levels are “No” and “Yes”, but if we wanted to see the underlying numeric coding, we can use the as.numeric command as we did earlier with respect to logical variables. When applied to a factor variable, the numeric vector that is produced contains the numeric values that are used to internally represent the categories. Try this for CURSMOKE1 below.
as.numeric(Framingham$CURSMOKE1)
This is helpful, but inefficient. Now let’s look at what the table command does when run on CURSMOKE1.
table(Framingham$CURSMOKE1)
## 
##  No Yes 
## 200 200
The table command provides counts of each level of the variable: half the subjects in Framingham are currently smokers and half are not. If we run table on two variables, R provides a tabulation across the combinations of levels of each variable. For example, if we run table(Framingham$SEX, Framingham$CURSMOKE1) we get the following output, which shows smokers and non-smokers by sex.
##        
##          No Yes
##   Men   100 100
##   Women 100 100
If we run table on a factor variable and its numeric conversion, we can see how the levels of that factor variable are represented numerically. Try this for CURSMOKE1 below.
table(Framingham$CURSMOKE1,as.numeric(Framingham$CURSMOKE1))
From the output we see that “No” is encoded as 1 and “Yes” is encoded as 2. Although one may refer to the different levels by their names rather than by the numerical values used to represent them internally, knowing which numerical value represents each level is important for the interpretation of results from statistical analyses. For more about this, see Statistics Using R: An Integrative Approach.
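The relationship between a factor's labels and its internal codes can be sketched on a small made-up factor. The vectors `smoke` and `dose` below are hypothetical and are not part of Framingham. The second half illustrates a point worth keeping in mind when labels are strings of numbers: as.numeric still returns the internal codes, not the labels.

```r
# Hypothetical smoking status, for illustration only
smoke <- factor(c("Yes", "No", "No", "Yes", "No"))

levels(smoke)       # levels are sorted alphabetically by default
as.numeric(smoke)   # internal codes, one per element

# Cross-tabulating the factor with its codes shows which code belongs to which level
codes <- table(smoke, as.numeric(smoke))
codes

# Caution: when the labels look like numbers, as.numeric still returns the codes
dose <- factor(c("10", "20", "10"))
dose_codes  <- as.numeric(dose)                 # internal codes
dose_values <- as.numeric(as.character(dose))   # the labels, converted to numbers
dose_codes
dose_values
```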
If, for some reason, you would like to alter the way in which the levels of a factor variable are represented numerically, you may do so by using the command relevel on that variable and setting the argument ref to the name of the level you want to have the value 1. The level assigned the number 1 is often called the reference level, category, or group. Try changing the “Yes” of CURSMOKE1 to have the value 1. Assign this to a new variable in Framingham called CURSMOKE_RL. Verify that “Yes” is now encoded as 1 and “No” is now encoded as 2 using the table command on CURSMOKE_RL and its numeric conversion.
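The effect of relevel can be sketched on a small made-up factor before applying it to Framingham. The names `smoke` and `smoke_rl` are hypothetical. Note that only the internal codes change; the label attached to each observation stays the same.

```r
# Hypothetical smoking status, for illustration only
smoke <- factor(c("Yes", "No", "No", "Yes", "No"))

# Make "Yes" the reference level (internal code 1)
smoke_rl <- relevel(smoke, ref = "Yes")

levels(smoke_rl)                        # "Yes" now comes first
table(smoke_rl, as.numeric(smoke_rl))   # "Yes" pairs with 1, "No" with 2

# The labels for each observation are unchanged by releveling
same_labels <- identical(as.character(smoke), as.character(smoke_rl))
same_labels
```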
Framingham$CURSMOKE_RL = relevel(Framingham$CURSMOKE1, ref = "Yes")
table(Framingham$CURSMOKE_RL,as.numeric(Framingham$CURSMOKE_RL))
Sometimes we need to recode a numeric variable into a factor variable. We will try this with SYSBP1 from Framingham. SYSBP1 contains numeric measurements of systolic blood pressure. Let’s assume we would like to recode this variable so that values less than 130 are grouped together under the category named “normal”, and values greater than or equal to 130 are grouped together under the category named “high”. To accomplish this, we will use the ifelse command. The ifelse command takes three arguments: a logical statement to be evaluated, values to return if the logical statement is true, and values to return if the logical statement is false.
First, let’s try an example of how to use ifelse. The vector x is available in our working environment and contains the numbers 1 through 10. We want to create a new vector y, such that any value less than 5 is recoded as “low,” and all other values are recoded as “high.” Below we have provided the code to print the x vector and the logical statement that evaluates whether a value in the x vector is less than 5. Notice that TRUE is returned for the first four entries, and FALSE thereafter.
We now use the ifelse command with its three arguments. The first argument, x < 5, checks whether each of the values of x is less than 5. The second argument specifies the value to be assigned (in this case, “low”) to each entry that satisfies the logical statement, x < 5, and for which the returned logical value is therefore TRUE. The third argument specifies the value to be assigned (in this case, “high”) to each entry that does NOT satisfy the logical statement, x < 5, and for which the returned logical value is therefore FALSE. Add two lines of code below: one to assign to the variable y the values produced from the ifelse command applied to x and another to print y to confirm our code worked.
x <- c(1:10)
x
x < 5
ifelse(x < 5, "low", "high")
y = ifelse(x < 5, "low", "high")
y
Now we will try this with the Framingham dataset. Use the ifelse command to recode an individual’s systolic blood pressure (SYSBP1) into a factor variable in such a way that if the blood pressure is greater than or equal to 130 (the returned value is TRUE), it is assigned the value “high”, and if it is not (the returned value is FALSE), it is assigned the value “normal”. Assign the result to a new variable in Framingham called SYSBP_CAT.
Framingham$SYSBP_CAT = ifelse(Framingham$SYSBP1 >= 130, "high", "normal")
Use the space below to run whatever code is needed to answer the following questions about SYSBP_CAT.
# use this space to run code
# any line that starts with '#' is a comment and will not be evaluated by R
SYSBP_CAT is a character vector, not a factor vector. To complete the conversion of our new variable into a factor variable, we would like to set the lowest group (the reference group to be internally coded by the number 1) to be “normal.” Note that if we do not set this explicitly, R will set “high” to 1 and “normal” to 2 because the default is to encode the groups alphabetically.
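The alphabetical default mentioned above can be seen in a minimal sketch on made-up values. The vector `bp` and the names derived from it are hypothetical. The last lines show, as a contrast, how supplying the levels explicitly changes which group receives the code 1.

```r
# Hypothetical systolic blood pressure readings, for illustration only
bp <- c(118, 142, 125, 131)

# ifelse returns a character vector, not a factor
bp_cat <- ifelse(bp >= 130, "high", "normal")
class(bp_cat)

# Without a levels argument, factor() orders the levels alphabetically
default_f <- factor(bp_cat)
levels(default_f)       # "high" comes before "normal"
as.numeric(default_f)

# With the levels given explicitly, "normal" receives the code 1
explicit_f <- factor(bp_cat, levels = c("normal", "high"))
levels(explicit_f)
as.numeric(explicit_f)
```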
Setting our reference group explicitly is easily done by calling the factor command on our SYSBP_CAT variable and adding a second argument called levels after a comma. We use the c function to list the levels in the order in which we would like them to be. Because we would like “normal” to be the first level (internally represented by the number 1), we would place “normal” as the first element in the c function. We would then set our levels argument of the factor command equal to our c function. In the space below, convert SYSBP_CAT to a factor with “normal” as the first level. Be sure to assign the result to the same variable so that the changes are saved in Framingham and the character version of the variable is overwritten by the factor version. Then, check that you were successful using the table command.
Framingham$SYSBP_CAT = factor(Framingham$SYSBP_CAT, levels = c("normal", "high"))
table(Framingham$SYSBP_CAT,as.numeric(Framingham$SYSBP_CAT))
By default, the numerical values assigned to factor variables are unordered in the sense that no level is considered greater or lesser than any other. Said differently, factor variables typically are considered to be nominal-leveled variables wherein the numbers assigned to levels are used merely to distinguish one level from another. Sometimes, however, a factor variable is ordinal-leveled, implying that an ordering of the values assigned to the levels of that variable is meaningful. In such instances, we would like our analytic results and plots to reflect that ordering. For example, a factor variable with levels “small”, “medium”, and “large” would be an ordinal-leveled factor variable, and it would therefore be important for an interpretation of results to reflect the fact that “large” is greater than “medium”, which is greater than “small.”
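The small/medium/large example above can be sketched as follows. The vector `sizes` and the two factor names are hypothetical. The sketch builds the same data as an unordered and as an ordered factor, and shows that order-based operations such as comparisons and max are meaningful only for the ordered version.

```r
# Hypothetical size labels, for illustration only
sizes <- c("medium", "small", "large", "small")

# Same levels, once as a nominal (unordered) factor and once as an ordered factor
size_nom <- factor(sizes, levels = c("small", "medium", "large"))
size_ord <- factor(sizes, levels = c("small", "medium", "large"), ordered = TRUE)

is.ordered(size_nom)
is.ordered(size_ord)

# Printing an ordered factor shows the ordering: Levels: small < medium < large
size_ord

# Comparisons and max respect the declared order of the levels
below_large <- size_ord < "large"
largest <- max(size_ord)
below_large
largest
```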
Let’s suppose we wanted to add two additional levels to SYSBP_CAT: “low” for systolic blood pressure below 90 and “elevated” for systolic blood pressure between 120 and 129.9, inclusive. Before recoding SYSBP_CAT, we would need to add “low” and “elevated” as levels of the factor variable. We do this using the factor function once more, but this time we add the two new categories to the levels argument.
Even though we have multiple options for the levels argument, we are going to use c("low", "normal", "elevated", "high") because we would like these levels to be ordered from least to greatest. We add ordering to our factor simply by setting the argument ordered to TRUE. The space below shows the code for assigning a new factoring of SYSBP_CAT to a variable named SYSBP_CAT2. Add the missing arguments to the factor command so that the new variable has all four levels and R knows that they are to be regarded as an ordered factor variable with the order as specified.
Framingham$SYSBP_CAT2 = factor(Framingham$SYSBP_CAT)
Framingham$SYSBP_CAT2 = factor(Framingham$SYSBP_CAT,
                                levels = c("low", "normal", "elevated", "high"),
                                ordered = TRUE)
If we print our new ordered variable, we see the ordering of the levels at the very bottom, as shown below.
Framingham$SYSBP_CAT2
So far we have allowed for the possibility of systolic blood pressure falling into one of four categories, but we have not yet told R how to distinguish when an individual has low or elevated blood pressure. This is why the output above shows four possible categories, but only “normal” or “high” actually being used. Now we need to recode our variable to include the two new categories: “low” for systolic blood pressure below 90 and “elevated” for systolic blood pressure between 120 and 129.9, inclusive. Below we have provided the code to recode SYSBP_CAT2 to “low” for any rows where systolic blood pressure (SYSBP1) is less than 90. Try recoding for the “elevated” category in a similar manner. Hint: We will need to evaluate two logical statements to cover the range for “elevated.”
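A minimal sketch of this kind of recoding on made-up readings may help. The vectors `bp`, `bp_cat`, and `f2` are hypothetical. The sketch uses `bp < 130` as the upper bound for “elevated”, which is an assumption made here so that a reading such as 129.95 would not be left in “normal”; it is equivalent to `<= 129.9` only when readings have at most one decimal place. The last lines illustrate why the new levels must be declared first: assigning a label that is not among the levels produces NA.

```r
# Hypothetical systolic blood pressure readings, for illustration only
bp <- c(85, 118, 125, 142)

# Start from the two-category coding, but declare all four ordered levels
bp_cat <- factor(ifelse(bp >= 130, "high", "normal"),
                 levels = c("low", "normal", "elevated", "high"),
                 ordered = TRUE)
table(bp_cat)   # "low" and "elevated" exist as levels but are not yet used

# Recode by assigning into the positions selected by a logical statement
bp_cat[bp < 90] <- "low"
bp_cat[bp >= 120 & bp < 130] <- "elevated"   # two conditions joined with &
table(bp_cat)

# Assigning a label that is not a declared level yields NA (with a warning)
f2 <- factor(c("normal", "high"))
suppressWarnings(f2[1] <- "low")
is.na(f2[1])
```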
Framingham$SYSBP_CAT2[Framingham$SYSBP1 < 90] = "low"
Framingham$SYSBP_CAT2[Framingham$SYSBP1 >= 120 & Framingham$SYSBP1 <= 129.9] = "elevated"
Use the space below to run code that shows counts of each level of SYSBP_CAT2, and then answer the following questions.
# use this space to run code
The output of using the table function on SYSBP_CAT2 reveals that the “low” level is entirely unused. We can drop this level using the droplevels command on our factor variable and assigning the results to the same variable, effectively overwriting it with the version that does not include “low.” Try this in the space below. Verify that the level has been dropped by running the levels command.
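The droplevels step can be sketched on a small made-up ordered factor. The name `bp_cat` is hypothetical. The sketch also checks that the remaining levels keep their order after the unused level is removed.

```r
# Hypothetical ordered factor in which the "low" level is declared but never used
bp_cat <- factor(c("normal", "elevated", "high", "normal"),
                 levels = c("low", "normal", "elevated", "high"),
                 ordered = TRUE)
table(bp_cat)   # "low" appears with a count of 0

# Remove unused levels and overwrite the original variable
bp_cat <- droplevels(bp_cat)

levels(bp_cat)       # "low" is gone
is.ordered(bp_cat)   # the factor is still ordered
```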
Framingham$SYSBP_CAT2 = droplevels(Framingham$SYSBP_CAT2)
levels(Framingham$SYSBP_CAT2)
Obtaining descriptive statistics about variables in a dataset is the first step of most analyses (and even the main objective in some cases!). In this section, we review how to obtain these statistics, what to do when data are missing, and what to do when analyses call for complete cases across more than one variable. See chapters 2 through 5 of Statistics Using R: An Integrative Approach for a more complete and in-depth discussion of these statistics and how to access them with R.
summary Command
To get a rough idea of what the distribution of each of our variables looks like, and whether they contain missing values, we can use the summary command. Use the summary command on Framingham in the code box below and inspect the results.
summary(Framingham)
As we can see from the output, summary gives us the name of each variable in Framingham and some basic descriptive statistics about each of them. For numeric variables, we get the minimum and maximum values, the mean and median, and the first and third quartiles (denoted “1st Qu.” and “3rd Qu.”, respectively). For categorical variables, we get the names of the groups and their counts. Thus, summary is a wonderful command for obtaining an overview of our data, but it is not the best choice when you need specific statistics for only certain variables.
If any variable has missing values, there will be an additional piece of information at the bottom of the list: a count of NA values. NA stands for “not available” and is the element/entry that R uses to note a missing value in the column. We will cover more on dealing with missing values later in this section.
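As a minimal sketch, the following applies summary to a small made-up data frame. The data frame `toy` and its columns are hypothetical. It contains one numeric variable with a missing value and one factor variable, so the output shows both kinds of summaries along with the NA count.

```r
# Hypothetical data frame, for illustration only
toy <- data.frame(
  age    = c(34, 51, NA, 47),
  smoker = factor(c("No", "Yes", "No", "No"))
)

# Overview of every variable: quartiles and mean for age, counts for smoker
summary(toy)

# summary also works on a single variable; the NA count is labeled "NA's"
age_summary <- summary(toy$age)
age_summary

smoker_counts <- summary(toy$smoker)
smoker_counts
```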
When there are no missing values in a dataset, it is very simple to obtain descriptive statistics about variables, such as those listed below.
length returns the length of a vector/variable, giving a count of observations for that variable
mean returns the mean value of a vector/variable
sd returns the standard deviation of a vector/variable
Use the space below to run code in order to answer the following questions about variables from the Framingham dataset.
# use this space to run code
In R, when values of a variable are missing from a vector or data frame, they are represented as NA, meaning “Not Available.” The Framingham dataset includes variables whose measurements were taken at a number of different time points. Because not all subjects participated in the study at all time points, we do not have values for some of the variables for some of the subjects. Rather than these spaces being left blank, the entries of variables for unavailable subjects are listed as NA. We can find the number of missing values in a vector/variable by running a logical statement to check if each value is NA and then taking the sum of the result. The code below shows how to find the number of missing values for the AGE3 variable, the age of the subject measured at time point 3. Add code that uses the length command to find the count of observations for AGE3.
sum(is.na(Framingham$AGE3))
length(Framingham$AGE3)
Even though AGE3 contains 92 missing values, R reports the length of the vector/variable as 400, the total number of observations in the dataset. The reason for this is that each NA occupies an element’s space in the vector and, as such, is still counted by the length function. To circumvent this issue, we use a function called na.omit on the vector to filter out the NA values from it. Then we feed this into the length function, in the same way that we fed the is.na result into the sum function above. Try this for AGE3 in the space below.
length(na.omit(Framingham$AGE3))
Now we see that AGE3 actually has only 308 non-missing values, not 400.
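The counting logic above can be checked on a small made-up vector. The vector `age3` below is hypothetical and is not the Framingham AGE3 variable. The missing and non-missing counts should add up to the length.

```r
# Hypothetical ages with two missing values, for illustration only
age3 <- c(61, NA, 55, 70, NA)

n_missing  <- sum(is.na(age3))        # number of NA values
n_total    <- length(age3)            # NA values still occupy positions
n_observed <- length(na.omit(age3))   # positions that hold actual values

n_missing
n_total
n_observed

# An equivalent way to count the non-missing values
sum(!is.na(age3))
```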
Fortunately, many functions in R, including mean and sd, come with an optional argument na.rm that, when set to TRUE, removes all the NA values before running the function. In the space below, try running mean and sd for AGE3 without the na.rm argument, and then with it set to TRUE.
mean(Framingham$AGE3)
sd(Framingham$AGE3)
mean(Framingham$AGE3, na.rm = TRUE)
sd(Framingham$AGE3, na.rm = TRUE)
From the output, we can see that when there are missing values in a vector and we do not include the na.rm argument, R returns NA as the calculation’s result. In order to obtain the result we seek, based on the non-missing values only, we must include the na.rm argument to remove the NA values. Alternatively, we may use the command na.omit to achieve the same result, as shown below.
mean(na.omit(Framingham$AGE3))
## [1] 59.87662
sd(na.omit(Framingham$AGE3))
## [1] 7.94839
In order to find the correlation, for example, between two variables, such as height and weight, in a sample of individuals, we would need to have the height and weight measures for each individual in that sample. Because each pair of height and weight values comes from a single individual, height and weight are said to be paired. When variables are paired, we must have non-missing values on both of the paired variables in order to run the analysis and obtain the results we seek. Accordingly, we need to use code that limits the analysis to only those rows that have non-missing values on both variables of interest (i.e., rows for which a check of whether both entries are non-missing returns TRUE).
In another context, suppose we wish to compute the correlation between diastolic blood pressure measured at time 1 (DIABP1) and at time 3 (DIABP3). Because the measurements at the two time periods belong to the same person, they are considered to be paired. To limit the analysis to those individuals who have non-missing/complete data on both paired measures, we ask whether DIABP1 and DIABP3 have non-missing values by using the command complete.cases. This command does the opposite of is.na: complete.cases checks to see if each element of a vector is not an NA value and returns TRUE if the value is non-missing and FALSE if it is missing.
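A minimal sketch of selecting complete pairs, using two made-up vectors (`dbp1` and `dbp3` are hypothetical stand-ins, not the Framingham variables). It also shows two shortcuts from base R that give the same result as combining two separate checks with &: complete.cases accepts several vectors at once, and cor has a use argument that can restrict the calculation to complete pairs.

```r
# Hypothetical paired measurements with some missing values, for illustration only
dbp1 <- c(80, 92, NA, 75, 88)
dbp3 <- c(78, NA, 85, 70, 90)

# TRUE only where both members of the pair are non-missing
both <- complete.cases(dbp1) & complete.cases(dbp3)
both

# complete.cases can also check several vectors in one call
both_alt <- complete.cases(dbp1, dbp3)

# Keep only the complete pairs
dbp1[both]
dbp3[both]

# Correlation on the complete pairs, computed two equivalent ways
r_subset <- cor(dbp1[both], dbp3[both])
r_use    <- cor(dbp1, dbp3, use = "complete.obs")
r_subset
r_use
```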
Use brackets and the solution from the previous question to subset DIABP1 to only those values where both DIABP1 and DIABP3 are non-missing. Then do the same for DIABP3.
Framingham$DIABP1[complete.cases(Framingham$DIABP1) & complete.cases(Framingham$DIABP3)]
Framingham$DIABP3[complete.cases(Framingham$DIABP1) & complete.cases(Framingham$DIABP3)]
In this final section, we present a new dataset: the NELS dataset, available in your environment as NELS. Code boxes will be available to help answer quiz questions about the dataset using the skills learned in the previous sections. We encourage you to try to use commands from memory as much as possible, but solution code is available using the Solution button at the top of the code box in case you need assistance. Keep in mind that in R there are often multiple ways to obtain the information sought, so sometimes your approach to finding the solution will not match that of the solution code provided, even though you were still successful in finding the correct information.
Let’s start with some basic information about the dataset. Use the empty code box below to run any commands necessary to answer the quiz questions for this section. Suggested solutions are available by clicking on the Solution button on the code box.
# use this box to run code
# get observation and variable counts
dim(NELS)
# check variable data classes
str(NELS)
Let’s look at some of the variables more closely now.
# use this box to run code
# overview of variables (including NAs)
summary(NELS)
# or check individual variables for missing values
sum(is.na(NELS$hwkin12))
sum(is.na(NELS$famsize))
# maximum of slfcnc08: find within summary(NELS) or use the following
max(NELS$slfcnc08)
# mean of ses
mean(NELS$ses)
# mean of achsls08
mean(NELS$achsls08, na.rm=TRUE)
mean(na.omit(NELS$achsls08))
# non-missing achsls08
length(na.omit(NELS$achsls08))
length(NELS$achsls08[complete.cases(NELS$achsls08)])
# non-missing approg
length(na.omit(NELS$approg))
Now, let’s dig a little deeper and investigate more specific details about our data.
# use this box to run code
# check levels of region variable
levels(NELS$region)
# females from Northeast, males from South
table(NELS$region,NELS$gender)
# mean family size of students from the West
mean(NELS$famsize[NELS$region=="West"])
# standard deviation of first 20 slfcnc10 (3 ways)
sd(NELS[1:20,"slfcnc10"])
sd(NELS[1:20, 10])
sd(NELS$slfcnc10[1:20])
# 151st student cigarette use (2 ways)
NELS[151,"cigarett"]
NELS$cigarett[151]
# complete pairs of parmarl8 and nursery
sum(complete.cases(NELS$parmarl8) & complete.cases(NELS$nursery))
Finally, let’s create some variables and answer questions related to them.
In the code box below, add a variable to NELS called achmatdiff, which is the difference in math achievement scores from 8th to 12th grade for each student. Remember that you can check variable names and descriptions by running ?NELS in the Console window of R or RStudio. You can use the - operator to subtract one column from another by row.
# create achmatdiff
NELS$achmatdiff = NELS$achmat12 - NELS$achmat08
Now, use the code box below to run any code necessary to answer the following questions about our new variable.
# use this box to run code
# minimum, maximum, and mean
summary(NELS$achmatdiff)
# missing values
sum(is.na(NELS$achmatdiff))
# standard deviation
sd(NELS$achmatdiff)
Next, let’s recode achmatdiff to a categorical variable called achmatcat, which has the value “negative” when achmatdiff has a value less than zero, and “positive” everywhere else. Check the class of achmatcat; if it is not a factor variable, change it so that it is. Then check that the levels are “negative” and “positive.”
NELS$achmatcat = ifelse(NELS$achmatdiff < 0, "negative", "positive")
class(NELS$achmatcat)
NELS$achmatcat = factor(NELS$achmatcat)
levels(NELS$achmatcat)
Finally, let’s inspect achmatcat and check that we seem to have created it correctly. Use the code box below to answer the following questions.
# use this box to run code
# first 10 rows of achmatdiff and achmatcat
NELS[1:10, c("achmatdiff","achmatcat")]
# factor encoding for achmatcat
table(NELS$achmatcat,as.numeric(NELS$achmatcat))
# proportion positive
table(NELS$achmatcat)
258/500
# achmatcat by region
table(NELS$achmatcat,NELS$region)
# average ses for "negative"
mean(NELS$ses[NELS$achmatcat == "negative"])
Now that you’ve successfully completed this tutorial, you should be well prepared to begin your study of statistics using the textbook Statistics Using R: An Integrative Approach by Weinberg, Harel, and Abramowitz.