This vignette will walk a reader through the tbl_summary() function, and the various functions available to modify and make additions to an existing table summary object.
To start, a quick note on the {magrittr} package’s pipe function, %>%. By default the pipe operator puts whatever is on the left hand side of %>% into the first argument of the function on the right hand side. The pipe function can be used to make the code relating to tbl_summary() easier to use, but it is not required. Here are a few examples of how %>% translates into typical R notation.
x %>% f() is equivalent to f(x)
x %>% f(y) is equivalent to f(x, y)
y %>% f(x, .) is equivalent to f(x, y)
z %>% f(x, y, arg = .) is equivalent to f(x, y, arg = z)
Here’s how this translates into the use of tbl_summary().
mtcars %>% tbl_summary() is equivalent to tbl_summary(mtcars)
mtcars %>% tbl_summary(by = am) is equivalent to tbl_summary(mtcars, by = am)
tbl_summary(mtcars, by = am) %>% add_p() is equivalent to
    tbl = tbl_summary(mtcars, by = am)
    add_p(tbl)
Before going through the tutorial, install {gtsummary} and {gt}.
install.packages("gtsummary")
remotes::install_github("rstudio/gt")
library(gtsummary)
library(dplyr)
If you experience issues installing {gt} on Windows, install Rtools from CRAN, restart R, and attempt installation again. The vignettes hosted on https://cran.r-project.org do not use the {gt} package to print tables. For examples with {gt}, browse to the {gtsummary} website.
Examples use the {gt} package to generate tables. The knitr::kable() function is used instead if the {gt} package is not available, or if the user requests it with options(gtsummary.print_engine = "kable").
We’ll be using the trial data set throughout this example. This set contains data from 200 patients randomized to a new drug therapy or placebo. The outcome is a binary tumor response. Each variable in the data frame has been assigned an attribute label (e.g. attr(trial$trt, "label") == "Treatment Randomization") with the {labelled} package. These labels are displayed in the output table by default. A data frame without labels will print variable names.
trt      Treatment Randomization
age      Age, yrs
marker   Marker Level, ng/mL
stage    T Stage
grade    Grade
response Tumor Response
death    Patient Died
ttdeath  Years from Randomization to Death/Censor
| trt | age | marker | stage | grade | response | death | ttdeath |
|---|---|---|---|---|---|---|---|
| Drug | 23 | 0.160 | T3 | I | 1 | 0 | 24.00 | 
| Drug | 9 | 1.107 | T4 | III | 1 | 1 | 14.90 | 
| Drug | 31 | 0.277 | T1 | I | 1 | 0 | 24.00 | 
| Placebo | 46 | 2.067 | T4 | II | 1 | 1 | 10.21 | 
| Drug | 51 | 2.767 | T2 | II | 0 | 1 | 20.02 | 
| Drug | 39 | 0.613 | T1 | III | 1 | 1 | 16.55 | 
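The label attribute described above can be set and read with base R alone. The snippet below is a minimal sketch on a small made-up data frame: `df` and its values are illustrative stand-ins, not the trial data, and only the two label strings are taken from the listing above.

```r
# Illustrative stand-in for a labelled data frame (not the real trial data)
df <- data.frame(
  trt = c("Drug", "Placebo"),
  age = c(23, 46)
)

# Attach a "label" attribute to each column
attr(df$trt, "label") <- "Treatment Randomization"
attr(df$age, "label") <- "Age, yrs"

# Collect the labels of all columns into a named character vector
labels <- vapply(df, function(col) attr(col, "label"), character(1))
print(labels)
```

Columns that carry no label attribute would make the `vapply()` call fail, so real code may want to fall back to the column name in that case.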
The default output from tbl_summary() is meant to be publication ready. Let’s start by creating a table of summary statistics from the trial data set. The tbl_summary() function requires only a data set as input, and returns descriptive statistics for each column in the data frame.
For brevity, we keep only a subset of the variables in the trial data set.
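The code behind this first table is not shown here; a call along the following lines would be one way to build it. The name trial2 matches the object used later in this vignette, and the selected columns are an assumption based on the variables that appear in the table.

```r
library(gtsummary)
library(dplyr)

# keep only the variables shown in the table (assumed subset)
trial2 <- trial %>% select(trt, marker, stage)

# summarize every column with the default statistics
tbl_summary(trial2)
```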
| Characteristic¹ | N = 200 |
|---|---|
| Treatment Randomization | |
| Drug | 107 (54%) |
| Placebo | 93 (46%) |
| Marker Level, ng/mL | 0.69 (0.22, 1.41) |
| Unknown | 9 |
| T Stage | |
| T1 | 51 (26%) |
| T2 | 49 (24%) |
| T3 | 42 (21%) |
| T4 | 58 (29%) |

¹ Statistics presented: n (%); median (IQR)
This is a great table, but for trial data the summary statistics should be split by randomization group. While reporting p-values for baseline characteristics in a randomized trial isn’t recommended, we’ll do it here as an illustration of the feature. To compare two or more groups, include add_p() with the function call.
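One way to produce such a split table with p-values is sketched here, under the assumption that trial2 holds the trt, marker, and stage columns of trial.

```r
library(gtsummary)
library(dplyr)

# assumed subset of the trial data
trial2 <- trial %>% select(trt, marker, stage)

# split the summary by treatment arm, then append a p-value column
tbl_by_trt <- trial2 %>%
  tbl_summary(by = trt) %>%
  add_p()

tbl_by_trt
```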
| Characteristic¹ | Drug, N = 107 | Placebo, N = 93 | p-value² |
|---|---|---|---|
| Marker Level, ng/mL | 0.60 (0.22, 1.21) | 0.72 (0.22, 1.53) | 0.4 |
| Unknown | 5 | 4 | |
| T Stage | | | 0.13 |
| T1 | 25 (23%) | 26 (28%) | |
| T2 | 26 (24%) | 23 (25%) | |
| T3 | 29 (27%) | 13 (14%) | |
| T4 | 27 (25%) | 31 (33%) | |

¹ Statistics presented: median (IQR); n (%)
² Statistical tests performed: Wilcoxon rank-sum test; chi-square test of independence
There are four primary ways to customize the output of the summary table.
tbl_summary() function input arguments
add_*() functions
{gtsummary} functions to format the table
{gt} package functions
The tbl_summary() function includes many input options for modifying the appearance.
label       specify the variable labels printed in table  
type        specify the variable type (e.g. continuous, categorical, etc.)
statistic   change the summary statistics presented  
digits      number of digits the summary statistics will be rounded to  
missing     whether to display a row with the number of missing observations 
sort        change the sorting of categorical levels by frequency
percent     print column, row, or cell percentages
The {gtsummary} package has built-in functions for adding to results from tbl_summary(). The following functions add columns and/or information to the summary table.
add_p()           add p-values to the output comparing values across groups   
add_overall()     add a column with overall summary statistics   
add_n()           add a column with N (or N missing) for each variable   
add_stat_label()  add a column showing a label for the summary statistics shown in each row   
add_q()           add a column of q values to control for multiple comparisons
The {gtsummary} package comes with functions specifically made to modify and format summary tables.
modify_header()         relabel columns in summary table  
bold_labels()           bold variable labels  
bold_levels()           bold variable levels  
italicize_labels()      italicize variable labels  
italicize_levels()      italicize variable levels  
bold_p()                bold significant p-values
The {gt} package is packed with many great functions for modifying table output, too many to list here. Review the package’s website for a full listing. https://gt.rstudio.com/index.html
To use the {gt} package functions with {gtsummary} tables, the summary table must first be converted into a gt object. To this end, use the as_gt() function after modifications have been completed with {gtsummary} functions.
trial %>%
  tbl_summary(by = trt, missing = "no") %>%
  add_n() %>%
  as_gt() %>%
  <gt functions>
The code below builds the standard table of summary statistics split by treatment randomization, with the modifications noted in its comments.
trial2 %>%
  # build base summary table
  tbl_summary(
    by = trt,
    # change variable labels
    label = list(vars(marker) ~ "Marker, ng/mL",
                 vars(stage) ~ "Clinical T Stage"),
    # change statistics printed in table
    statistic = list(all_continuous() ~ "{mean} ({sd})",
                     all_categorical() ~ "{n} / {N} ({p}%)"),
    digits = list("marker" ~ c(1, 2))
  ) %>%
  # add p-values, report t-test, round large p-values to two decimal places
  add_p(test = list(vars(marker) ~ "t.test"),
        pvalue_fun = function(x) style_pvalue(x, digits = 2)) %>%
  # add statistic labels
  add_stat_label() %>%
  # bold variable labels, italicize levels
  bold_labels() %>%
  italicize_levels() %>%
  # bold p-values under a given threshold (default is 0.05)
  bold_p(t = 0.2) %>%
  # include percent in headers
  modify_header(stat_by = "**{level}**, N = {n} ({style_percent(p, symbol = TRUE)})")
| Characteristic | Statistic | Drug, N = 107 (54%) | Placebo, N = 93 (46%) | p-value¹ |
|---|---|---|---|---|
| Marker, ng/mL | mean (SD) | 0.9 (0.89) | 1.0 (0.83) | 0.61 |
| Unknown | n | 5 | 4 | |
| Clinical T Stage | | | | 0.13 |
| T1 | n / N (%) | 25 / 107 (23%) | 26 / 93 (28%) | |
| T2 | n / N (%) | 26 / 107 (24%) | 23 / 93 (25%) | |
| T3 | n / N (%) | 29 / 107 (27%) | 13 / 93 (14%) | |
| T4 | n / N (%) | 27 / 107 (25%) | 31 / 93 (33%) | |

¹ Statistical tests performed: t-test; chi-square test of independence
Each of the modification functions has additional options outlined in its help file.
The {gtsummary} package includes a set of functions meant to help the user specify function arguments for groups of variables. For example, if all continuous variables will be summarized in tbl_summary() as minimum and maximum, the all_continuous() function can be used: statistic = all_continuous() ~ "{min} to {max}"
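As a fuller sketch of this helper in context, the call below summarizes two continuous variables from trial as a range. The choice of age and marker is illustrative, as is the object name range_tbl.

```r
library(gtsummary)
library(dplyr)

range_tbl <- trial %>%
  select(age, marker) %>%
  tbl_summary(
    # apply the same statistic template to every continuous variable
    statistic = all_continuous() ~ "{min} to {max}"
  )

range_tbl
```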
The set of select helper functions includes the functions in the {tidyselect} package (used throughout the tidyverse) and functions specific to {gtsummary}. There are four types of select helpers.
Functions in the {tidyselect} package used throughout the tidyverse, including vars() from the {dplyr} package.
Functions to select groups of variables based on their attributes like class or type.
Functions to select groups of variables based on their display type in tbl_summary()
List variables in a vector, e.g. "age" or c("age", "stage")
| {tidyselect} | attribute | summary type | character vector | 
|---|---|---|---|
| starts_with(), ends_with(), contains(), matches(), one_of(), everything(), num_range(), last_col(), vars() | all_numeric(), all_integer(), all_logical(), all_factor(), all_character(), all_double() | all_continuous(), all_categorical(), all_dichotomous() | "age" or c("age", "stage") |
The select helpers are primarily used in tbl_summary() and its related functions, e.g. add_p(). We’ll review a few examples illustrating their use.
In the example below, we report the mean and standard deviation for continuous variables, and percentages for all categorical variables. We’ll report t-tests rather than Wilcoxon rank-sum tests for continuous variables, and Fisher’s exact test for response.
Note that dichotomous variables are, by default, included with all_categorical(). Use all_categorical(dichotomous = FALSE) to exclude dichotomous variables.
trial %>%
  dplyr::select(trt, response, age, stage, marker, grade) %>%
  tbl_summary(
    by = trt,
    type = list(c("response", "grade") ~ "categorical"), # select by variables in a vector
    statistic = list(all_continuous() ~ "{mean} ({sd})", all_categorical() ~ "{p}%") # select by summary type/attribute
  ) %>%
  add_p(test = list(contains("response") ~ "fisher.test", # select using functions in tidyselect
                    all_continuous() ~ "t.test"))
| Characteristic¹ | Drug, N = 107 | Placebo, N = 93 | p-value² |
|---|---|---|---|
| Tumor Response | | | 0.019 |
| 0 | 49% | 66% | |
| 1 | 51% | 34% | |
| Unknown | 4 | 5 | |
| Age, yrs | 48 (15) | 46 (13) | 0.4 |
| Unknown | 6 | 3 | |
| T Stage | | | 0.13 |
| T1 | 23% | 28% | |
| T2 | 24% | 25% | |
| T3 | 27% | 14% | |
| T4 | 25% | 33% | |
| Marker Level, ng/mL | 0.90 (0.89) | 0.96 (0.83) | 0.6 |
| Unknown | 5 | 4 | |
| Grade | | | 0.3 |
| I | 36% | 31% | |
| II | 32% | 26% | |
| III | 33% | 43% | |

¹ Statistics presented: %; mean (SD)
² Statistical tests performed: Fisher's exact test; t-test; chi-square test of independence
Reproducible reports are an important part of good practice. We often need to report the results from a table in the text of an R Markdown report. Inline reporting has been made simple with inline_text().
First create a basic summary table.
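The inline calls later in this section refer to an object named tab1. Its definition is not shown, so the assignment below is an assumption consistent with the table: a summary of trial2 (taken here to be the trt, marker, and stage columns of trial) split by treatment.

```r
library(gtsummary)
library(dplyr)

# assumed subset of the trial data
trial2 <- trial %>% select(trt, marker, stage)

# save the table so its cells can be referenced inline
tab1 <- tbl_summary(trial2, by = trt)
tab1
```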
| Characteristic¹ | Drug, N = 107 | Placebo, N = 93 |
|---|---|---|
| Marker Level, ng/mL | 0.60 (0.22, 1.21) | 0.72 (0.22, 1.53) |
| Unknown | 5 | 4 |
| T Stage | | |
| T1 | 25 (23%) | 26 (28%) |
| T2 | 26 (24%) | 23 (25%) |
| T3 | 29 (27%) | 13 (14%) |
| T4 | 27 (25%) | 31 (33%) |

¹ Statistics presented: median (IQR); n (%)
To report the median (IQR) of the marker levels in each group, use the following commands inline.
The median (IQR) marker level in the drug and placebo groups are
`r inline_text(tab1, variable = "marker", column = "Drug")` and `r inline_text(tab1, variable = "marker", column = "Placebo")`, respectively.
Here’s how the line will appear in your report.
The median (IQR) marker level in the drug and placebo groups are 0.60 (0.22, 1.21) and 0.72 (0.22, 1.53), respectively.
If you display a statistic from a categorical variable, include the level argument.
`r inline_text(tab1, variable = "stage", level = "T1", column = "Drug")` resolves to “25 (23%)”
For more detail about inline code, refer to the RStudio documentation page.
When you print output from tbl_summary() to the R console or in an R Markdown document, default printing functions are called in the background: print.tbl_summary() and knit_print.tbl_summary(). The true output from tbl_summary() is a named list, but when you print the object, a formatted version of .$table_body is displayed. All formatting and modifications are made using the {gt} package.
tbl_summary(trial2) %>% names()
#> [1] "gt_calls"     "kable_calls"  "table_body"   "table_header"
#> [5] "meta_data"    "inputs"       "N"            "call_list"
These are the additional data stored in the tbl_summary() output list.
table_body   data frame with summary statistics  
meta_data    data frame that is one row per variable with data about each  
by, df_by    the by variable name, and a  data frame with information about the by variable  
call_list    named list of each function called on the `tbl_summary` object  
inputs       inputs from the `tbl_summary()` function call  
gt_calls     named list of {gt} function calls  
kable_calls  named list of function calls to be applied before knitr::kable()
gt_calls is a named list of saved {gt} function calls. The {gt} calls are run when the object is printed to the console or in an R markdown document. Here’s an example of the first few calls saved with tbl_summary():
tbl_summary(trial2) %>% purrr::pluck("gt_calls") %>% head(n = 5)
#> $gt
#> gt::gt(data = x$table_body)
#> 
#> $cols_align
#> gt::cols_align(align = 'center') %>% gt::cols_align(align = 'left', columns = gt::vars(label))
#> 
#> $fmt_missing
#> gt::fmt_missing(columns = gt::everything(), missing_text = '')
#> 
#> $tab_style_text_indent
#> gt::tab_style(style = gt::cell_text(indent = gt::px(10), align = 'left'),locations = gt::cells_data(columns = gt::vars(label),rows = row_type != 'label'))
#> 
#> $footnote_stat_label
#> gt::tab_footnote(footnote = 'Statistics presented: n (%); median (IQR)',locations = gt::cells_column_labels(columns = gt::vars(label)))
The {gt} functions are called in the order they appear, always beginning with the gt() function.
If the user does not want a specific {gt} function to run (i.e. would like to change default printing), any {gt} call can be excluded in the as_gt() function by specifying the exclude argument. For example, the tbl_summary() call creates many named {gt} function calls: gt, cols_align, fmt_missing, tab_style_text_indent, footnote_stat_label, cols_label, cols_hide, fmt. Any of these can be excluded. In the example below, the default footnote will be excluded from the output.
After the as_gt() function is run, additional formatting may be added to the table using {gt} formatting functions. In the example below, a spanning header for the by = variable is included with the {gt} function tab_spanner().
tbl_summary(trial2, by = trt) %>%
  as_gt(exclude = "footnote_stat_label") %>%
  gt::tab_spanner(label = "Randomization Group",
                  columns = gt::starts_with("stat_"))
| Characteristic | Randomization Group | |
|---|---|---|
| | Drug, N = 107 | Placebo, N = 93 |
| Marker Level, ng/mL | 0.60 (0.22, 1.21) | 0.72 (0.22, 1.53) |
| Unknown | 5 | 4 |
| T Stage | | |
| T1 | 25 (23%) | 26 (28%) |
| T2 | 26 (24%) | 23 (25%) |
| T3 | 29 (27%) | 13 (14%) |
| T4 | 27 (25%) | 31 (33%) |
The tbl_summary() function and its related functions have sensible defaults for rounding and formatting results. If you would like to change the defaults, there are a few options. The defaults can be changed for a single script by adding an options() command to the script. They can also be set in the project- or user-level R profile, .Rprofile.
The following parameters are available to be set:
| Description | Example | Functions | 
|---|---|---|
| Formatting and rounding p-values | options(gtsummary.pvalue_fun = function(x) gtsummary::style_pvalue(x, digits = 2)) | add_p(), tbl_regression(), tbl_uvregression() |
| Formatting and rounding percentages | options(gtsummary.tbl_summary.percent_fun = function(x) sprintf("%.2f", 100 * x)) | tbl_summary() |
| Print tables with gt or kable | options(gtsummary.print_engine = "kable") or options(gtsummary.print_engine = "gt") | All tbl_*() functions |
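A minimal sketch of setting one of these defaults at the top of a script, using only base R. The option name comes from the table above; the two-decimal formatter and the names `old` and `fmt` are illustrative, and calling the stored function at the end only demonstrates what was saved.

```r
# store a formatting function (not its result) as the default,
# keeping the previous setting so it can be restored
old <- options(
  gtsummary.tbl_summary.percent_fun = function(x) sprintf("%.2f", 100 * x)
)

# the option now holds a function that can be called later
fmt <- getOption("gtsummary.tbl_summary.percent_fun")
fmt(0.4567)

# restore the previous setting when finished
options(old)
```

Saving the return value of options() and passing it back at the end keeps the change local to the script.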
When setting default rounding/formatting functions, set the default to a function object rather than an evaluated function. For example, if you want to round percentages to 3 significant figures use,