library(injurytools)
library(dplyr)
library(knitr)
library(kableExtra)
Once the data have been collected and prepared, the natural next step is to summarize and explore them. From a sports-applied point of view, one wants to know how many injuries have occurred and of what type, how often they have occurred, and what the load has been.
This document shows convenient functions from
injurytools for describing sports injury data in terms of the
measures used in sports injury epidemiology (see Bahr and Holme (2003) and Waldén
et al. (2023)). These measures are explained first, and then the
injsummary() and injprev()
functions are illustrated.
As Hodgson Phillips (2000) states,
“Sports injuries occur when athletes are exposed to their given sport and they occur under specific conditions, at a known time and place.”
Thus, when attempting to describe the distribution of injuries it is necessary to relate this to the population at risk over a specified time period. This is why the fundamental unit of measurement is the rate.
A rate is a measure that consists of a numerator and a denominator over a period of time, so it reflects the speed at which new “injury-related” events occur. The denominator data can be a number of different things (e.g. number of minutes trained/played, number of matches played).
Injury incidence rate is the number of new injury cases (\(I\)) per unit of player-exposure time, i.e.
\[ I_{r} = \frac{I}{\Delta T}\] where \(\Delta T\) is the total time under risk of the study population.
Injury burden rate is the number of days lost (\(n_d\)) per unit of player-exposure time, i.e.
\[I_{br} = \frac{n_d}{\Delta T}\] where \(\Delta T\) is the total time under risk of the study population.
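As a minimal sketch of the two definitions above, the following base R code computes both rates from made-up per-player totals. The data, the variable names and the choice of 1000 hours as the reporting unit are assumptions for illustration only; the code does not use injurytools.

```r
# Hypothetical per-player totals (made-up numbers, for illustration only)
players <- data.frame(
  player      = c("player_a", "player_b", "player_c"),
  n_injuries  = c(2, 0, 1),          # new injury cases, I
  n_days_lost = c(30, 0, 12),        # days lost, n_d
  minutes     = c(1800, 2700, 900)   # exposure time in minutes
)

# Total time at risk of the whole group (Delta T), converted to hours
total_hours <- sum(players$minutes) / 60

# Rates per 1000 hours of player-exposure (assumed reporting unit)
incidence_rate <- sum(players$n_injuries)  / total_hours * 1000
burden_rate    <- sum(players$n_days_lost) / total_hours * 1000

round(c(incidence = incidence_rate, burden = burden_rate), 2)
```

Note that the group rate divides the total number of events by the total exposure; it is not the average of the individual players' rates.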
NOTE 1: as Bahr, Clarsen, and
Ekstrand (2018)
state, injury incidence (likelihood) and injury burden (severity) should
be reported and assessed in conjunction rather than in isolation. In
this regard, see the risk matrix plot provided by the
gg_injriskmatrix() function.
NOTE 2: neither injury incidence (\(I_r\)) nor injury burden (\(I_{br}\)) are ratios, and they are not interpreted as probabilities; they are rates, and their unit is (person-time)\(^{-1}\) (e.g. per 1000h of player-exposure, per player-season, etc.).
NOTE 3: as Waldén et al. (2023) point out, incidence-based measures that provide a standard time-window for the population at risk (injuries per hour) are preferable to measures for which the time at risk varies across individuals (injuries per athletic-exposure, injuries per number of matches), because time-based measures better facilitate comparison across sports.
Prevalence (period prevalence) is a proportion: the number of players who have reported the injury of interest (\(X\)) divided by the total player-population at risk at any time during the specified period of time (\(\Delta T\) time window). This includes players who already had the condition at the start of the study period as well as those who acquired it during that period.
\[P = \frac{X}{N}\] where \(X\) is the number of players with the injury of interest and \(N\) the total number of players in the study at any point in the window of time \(\Delta T\).
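To make the "at any time during the window" part of the definition concrete, here is a small base R sketch. The squad, the injury dates and the helper `period_prevalence()` are all hypothetical and are not part of injurytools; a player counts as injured in the window if any of their injury spells overlaps it, including spells that started before the window.

```r
# Hypothetical injury records, one row per injury (made-up players and dates)
injuries <- data.frame(
  player         = c("A", "B", "B", "C"),
  date_injured   = as.Date(c("2018-12-20", "2019-01-10", "2019-03-02", "2019-02-15")),
  date_recovered = as.Date(c("2019-01-05", "2019-01-20", "2019-03-10", "2019-02-28"))
)
squad <- c("A", "B", "C", "D", "E")  # hypothetical population at risk, N = 5

# Illustrative helper: proportion of squad players with at least one
# injury spell overlapping the window [start, end]
period_prevalence <- function(injuries, squad, start, end) {
  overlaps <- injuries$date_injured <= end & injuries$date_recovered >= start
  injured_players <- unique(injuries$player[overlaps])
  length(intersect(injured_players, squad)) / length(squad)
}

prev_january <- period_prevalence(injuries, squad,
                                  as.Date("2019-01-01"), as.Date("2019-01-31"))
prev_january
```

In this toy example player A was injured in December but was still recovering in early January, so A counts towards the January prevalence together with player B.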
Again, as in the prepare-data
vignette, we use the data sets available from the
injurytools package: data from Liverpool Football Club
men’s first team players over two consecutive seasons, 2017-2018 and
2018-2019, scraped from the https://www.transfermarkt.com/ website.
injsummary()
df_exposures <- prepare_exp(raw_df_exposures, player = "player_name",
                            date = "year", time_expo = "minutes_played")
df_injuries  <- prepare_inj(raw_df_injuries, player = "player_name",
                            date_injured = "from", date_recovered = "until")
injd         <- prepare_all(data_exposures = df_exposures,
                            data_injuries  = df_injuries,
                            exp_unit = "matches_minutes")
Now, the preprocessed data is passed to injsummary() to
calculate injury summary statistics:
injds <- injsummary(injd)
#> Warning in injsummary_unit(unit, injds, quiet): 
#>   Exposure time unit is matches_minutes
#>   So... Injury incidence and injury burden are calculated per 100 player-matches of exposure (90 minutes times 100)
#> Warning in injsummary_unit(unit, injds_overall, quiet): 
#>   Exposure time unit is matches_minutes
#>   So... Injury incidence and injury burden are calculated per 100 player-matches of exposure (90 minutes times 100)
Note that the function throws some warning messages (unless
quiet = TRUE). They are thrown to make it clear what the
exposure time unit is.
injsummary() returns a list of two
elements. Here the output is stored in the injds object,
which has the following structure:
str(injds, 1)
#> List of 2
#>  $ playerwise: tibble [28 × 9] (S3: tbl_df/tbl/data.frame)
#>  $ overall   : tibble [1 × 14] (S3: tbl_df/tbl/data.frame)
#>  - attr(*, "class")= chr [1:2] "injds" "list"
#>  - attr(*, "unit_exposure")= chr "matches_minutes"
#>  - attr(*, "unit_timerisk")= chr "100 player-match"
The two elements are data frames (two tables), which can be accessed by typing
injds[[1]] (or injds[["playerwise"]]) and
injds[[2]] (or injds[["overall"]]).
To present the results in a tidier and more comprehensible way (instead of R code styled output), the following can be done:
# format the 'playerwise' data frame for output as a table
injds[[1]] %>% 
  arrange(desc(injincidence)) %>% # sort by decreasing order of injincidence
  head(10) %>%
  kable(digits = 2, col.names = c("Player", "N injuries", "N days lost", 
                                  "Mean days lost", "Median days lost", "IQR days lost",
                                  "Total exposure", "Incidence", "Burden"))
| Player | N injuries | N days lost | Mean days lost | Median days lost | IQR days lost | Total exposure | Incidence | Burden | 
|---|---|---|---|---|---|---|---|---|
| adam-lallana | 6 | 302 | 43.14 | 43.0 | 18.5-52.5 | 700 | 77.14 | 3882.86 | 
| daniel-sturridge | 3 | 122 | 30.50 | 33.5 | 12-52 | 927 | 29.13 | 1184.47 | 
| divock-origi | 1 | 5 | 2.50 | 2.5 | 1.25-3.75 | 366 | 24.59 | 122.95 | 
| philippe-coutinho | 3 | 62 | 15.50 | 18.5 | 9-25 | 1117 | 24.17 | 499.55 | 
| naby-keita | 3 | 89 | 22.25 | 19.0 | 13.5-27.75 | 1393 | 19.38 | 575.02 | 
| dejan-lovren | 6 | 160 | 22.86 | 13.0 | 9-28.5 | 3109 | 17.37 | 463.17 | 
| jordan-henderson | 8 | 91 | 10.11 | 7.0 | 4-11 | 4154 | 17.33 | 197.16 | 
| xherdan-shaqiri | 2 | 67 | 33.50 | 33.5 | 23.25-43.75 | 1057 | 17.03 | 570.48 | 
| fabinho | 3 | 22 | 5.50 | 5.5 | 1.5-9.5 | 2013 | 13.41 | 98.36 | 
| james-milner | 5 | 48 | 8.00 | 9.0 | 6.25-11.75 | 3548 | 12.68 | 121.76 | 
The table above is produced using R Markdown and, in
particular, the knitr::kable() function.
# format the table of total incidence and burden (main columns)
injds[[2]] %>% 
  select(1:8) %>% 
  data.frame(row.names = "TOTAL") %>% 
  kable(digits = 2,
        col.names = c("N injuries", "N days lost", "Mean days lost",
                      "Median days lost", "IQR days lost",
                      "Total exposure", "Incidence", "Burden"),
        row.names = TRUE) %>% 
  kable_styling(full_width = FALSE)
| | N injuries | N days lost | Mean days lost | Median days lost | IQR days lost | Total exposure | Incidence | Burden | 
|---|---|---|---|---|---|---|---|---|
| TOTAL | 82 | 2049 | 18.97 | 7.5 | 1-20.25 | 74690 | 9.88 | 246.9 | 
Note that, to provide numbers that are easy to interpret and to avoid small decimals, injury incidence and injury burden are reported ‘per 100 player-match exposure’. Since in this example exposure time is minutes played in matches, we multiply the rates by 90*100 (i.e. a football match lasts 90 minutes). Therefore, the reported incidence rate is estimated by \(\hat{I}_r = \frac{82}{74690}\times90\times100\).
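The arithmetic can be checked by hand. The sketch below uses only the totals shown in the table above and base R; the variable names are chosen here for illustration.

```r
# Totals as reported in the 'overall' table above
n_injuries     <- 82
n_days_lost    <- 2049
total_exposure <- 74690   # minutes played in matches

# 90 minutes per match, reported per 100 player-matches
scale <- 90 * 100

incidence <- n_injuries  / total_exposure * scale
burden    <- n_days_lost / total_exposure * scale

round(c(incidence = incidence, burden = burden), 2)
```

Rounded to two decimals, both values agree with the Incidence (9.88) and Burden (246.9) columns of the table.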
# format the table of total incidence and burden (point + ci estimates)
injds_tot_cis <- injds[[2]] %>% 
  select(7:last_col()) %>% 
  data.frame(row.names = "TOTAL")
injds_tot_cis$ci_injincidence <- paste0("[", round(injds_tot_cis$injincidence_lower, 1),
                                    ", ",round(injds_tot_cis$injincidence_upper, 1), "]")
injds_tot_cis$ci_injburden <- paste0("[", round(injds_tot_cis$injburden_lower, 1),
                                    ", ",round(injds_tot_cis$injburden_upper, 1), "]")
conf_level <- attr(injds, "conf_level")*100
injds_tot_cis %>% 
  select(1, 9, 2, 10) %>% 
  kable(digits = 2,
        col.names = c("Incidence",  paste0("CI", conf_level, "% for \\(I_r\\)"), 
                      "Burden", paste0("CI", conf_level, "% for \\(I_{br}\\)")))
| | Incidence | CI% for \(I_r\) | Burden | CI% for \(I_{br}\) | 
|---|---|---|---|---|
| TOTAL | 9.88 | [8.1, 11.7] | 246.9 | [237.9, 255.9] | 
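How injsummary() computes these intervals is not described in this document. Purely as an illustration, the sketch below assumes a Poisson model for the counts and a normal-approximation (Wald) interval, taking the standard error of a count to be its square root. The helper `wald_rate_ci()` and the critical value `qnorm(0.95)` are assumptions made here, not the package's documented method. With these choices the bounds agree with the table to one decimal, which is a numerical observation rather than a statement about the function's internals.

```r
# Totals from the 'overall' table above
n_injuries     <- 82
n_days_lost    <- 2049
total_exposure <- 74690        # minutes played in matches
scale          <- 90 * 100     # per 100 player-matches

# Assumption: Wald interval for a Poisson count, se(count) = sqrt(count)
wald_rate_ci <- function(count, exposure, scale, z) {
  rate <- count / exposure * scale
  se   <- sqrt(count) / exposure * scale
  c(estimate = rate, lower = rate - z * se, upper = rate + z * se)
}

z <- qnorm(0.95)  # assumed critical value (corresponds to a two-sided 90% level)

ci_incidence <- wald_rate_ci(n_injuries,  total_exposure, scale, z)
ci_burden    <- wald_rate_ci(n_days_lost, total_exposure, scale, z)

round(ci_incidence, 1)
round(ci_burden, 1)
```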
The players with the highest injury incidence rate (all types of injuries) were Adam Lallana and Daniel Sturridge, with 77.1 and 29.1 injuries per 100 player-matches respectively. The team’s overall injury incidence was 9.9 injuries per 100 player-matches and the injury burden was 246.9 days lost per 100 player-matches.
These summaries can be done by injury type:
injstats_pertype <- injsummary(injd, var_type_injury = "injury_type", quiet = TRUE)
These are the team’s results regarding injury incidence and injury burden according to injury type:
injstats_pertype[["overall"]] %>% 
  select(1:5, 7:11) %>% 
  mutate(ninjuries2 = paste0(ninjuries, " (", percent_ninjuries, ")"),
         ndayslost2 = paste0(ndayslost, " (", percent_dayslost, ")"),
         median_dayslost2 = paste0(median_dayslost, " (", iqr_dayslost, ")")) %>% 
  select(1, 11:13, 8:10) %>% 
  arrange(desc(injburden)) %>% 
  kable(digits = 2,
        col.names = c("Type of injury", "N injuries (%)", "N days lost (%)",
                      "Median days lost (IQR)",
                      "Total exposure", "Incidence", "Burden"),
        row.names = TRUE) %>% 
  kable_styling(full_width = FALSE)
| | Type of injury | N injuries (%) | N days lost (%) | Median days lost (IQR) | Total exposure | Incidence | Burden | 
|---|---|---|---|---|---|---|---|
| 1 | Muscle | 25 (30.49) | 735 (35.87) | 21 (12-36) | 67266 | 3.34 | 98.34 | 
| 2 | Ligament | 9 (10.98) | 596 (29.09) | 28 (7-54) | 67266 | 1.20 | 79.74 | 
| 3 | Unknown | 21 (25.61) | 332 (16.2) | 7 (4-18) | 67266 | 2.81 | 44.42 | 
| 4 | Concussion | 16 (19.51) | 213 (10.4) | 10.5 (5.75-14.5) | 67266 | 2.14 | 28.50 | 
| 5 | Bone | 11 (13.41) | 173 (8.44) | 9 (4.5-16.5) | 67266 | 1.47 | 23.15 | 
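The percentages and rates in this table follow directly from the counts. The base R sketch below copies the counts and the exposure value from the table and recomputes the derived columns; it is an illustration of the arithmetic, not the code injsummary() runs.

```r
# Counts by injury type, copied from the table above
by_type <- data.frame(
  type_injury = c("Muscle", "Ligament", "Unknown", "Concussion", "Bone"),
  ninjuries   = c(25, 9, 21, 16, 11),
  ndayslost   = c(735, 596, 332, 213, 173)
)
total_exposure <- 67266     # minutes, as shown in the 'Total exposure' column
scale          <- 90 * 100  # per 100 player-matches

# Share of each type in the total number of injuries and of days lost
by_type$percent_ninjuries <- round(by_type$ninjuries / sum(by_type$ninjuries) * 100, 2)
by_type$percent_dayslost  <- round(by_type$ndayslost / sum(by_type$ndayslost) * 100, 2)

# Type-specific incidence and burden rates
by_type$injincidence <- round(by_type$ninjuries / total_exposure * scale, 2)
by_type$injburden    <- round(by_type$ndayslost / total_exposure * scale, 2)

# Sort by decreasing burden, as in the table
by_type[order(-by_type$injburden), ]
```

Sorting by burden rather than incidence changes the ranking: ligament injuries are the least frequent type but have the second highest burden.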
injprev()
df_exposures <- prepare_exp(raw_df_exposures, player = "player_name",
                            date = "year", time_expo = "minutes_played")
df_injuries  <- prepare_inj(raw_df_injuries, player = "player_name",
                            date_injured = "from", date_recovered = "until")
injd         <- prepare_all(data_exposures = df_exposures,
                            data_injuries  = df_injuries,
                            exp_unit = "matches_minutes")
We calculate the injury prevalence and the proportions of injury-free players on a season basis:
availability_table1 <- injprev(injd, by = "season")
availability_table1
#> # A tibble: 4 × 5
#>   season           type_injury     n n_player  prop
#>   <fct>            <fct>       <int>    <int> <dbl>
#> 1 season 2017/2018 Available       7       23  30.4
#> 2 season 2017/2018 Injured        16       23  69.6
#> 3 season 2018/2019 Available       2       19  10.5
#> 4 season 2018/2019 Injured        17       19  89.5
Making use of knitr::kable():
kable(availability_table1,
      col.names = c("Season", "Availability", "N", "Total", "%"))
| Season | Availability | N | Total | % | 
|---|---|---|---|---|
| season 2017/2018 | Available | 7 | 23 | 30.4 | 
| season 2017/2018 | Injured | 16 | 23 | 69.6 | 
| season 2018/2019 | Available | 2 | 19 | 10.5 | 
| season 2018/2019 | Injured | 17 | 19 | 89.5 | 
Overall, the proportion of injured players was higher in the 2018/2019 season (89.5%, 17 of 19 players) than in the previous season (69.6%, 16 of 23 players). Let’s calculate it monthly:
availability_table2 <- injprev(injd, by = "monthly")
## compare two seasons July and August
availability_table2 %>%
  group_by(season) %>% 
  slice(1:4)
#> # A tibble: 8 × 6
#> # Groups:   season [2]
#>   season           month type_injury     n n_player  prop
#>   <fct>            <fct> <fct>       <int>    <int> <dbl>
#> 1 season 2017/2018 Jul   Available      21       23  91.3
#> 2 season 2017/2018 Jul   Injured         2       23   8.7
#> 3 season 2017/2018 Aug   Available      18       23  78.3
#> 4 season 2017/2018 Aug   Injured         5       23  21.7
#> 5 season 2018/2019 Jul   Available      16       19  84.2
#> 6 season 2018/2019 Jul   Injured         3       19  15.8
#> 7 season 2018/2019 Aug   Available      15       19  78.9
#> 8 season 2018/2019 Aug   Injured         4       19  21.1
## compare two seasons January and February
availability_table2 %>%
  group_by(season) %>% 
  slice(13:16)
#> # A tibble: 8 × 6
#> # Groups:   season [2]
#>   season           month type_injury     n n_player  prop
#>   <fct>            <fct> <fct>       <int>    <int> <dbl>
#> 1 season 2017/2018 Jan   Available      18       23  78.3
#> 2 season 2017/2018 Jan   Injured         5       23  21.7
#> 3 season 2017/2018 Feb   Available      21       23  91.3
#> 4 season 2017/2018 Feb   Injured         2       23   8.7
#> 5 season 2018/2019 Jan   Available       9       19  47.4
#> 6 season 2018/2019 Jan   Injured        10       19  52.6
#> 7 season 2018/2019 Feb   Available      12       19  63.2
#> 8 season 2018/2019 Feb   Injured         7       19  36.8
On a monthly basis, the differences in player availability in the Liverpool FC men’s first team were larger during the winter months of January and February, with more injured players in the 2018/2019 season.
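The size of these differences can be read off the two outputs above. As a small base R sketch (the data frame and its column names are made up here; the percentages are copied from the printed tibbles), the change in the percentage of injured players between seasons is:

```r
# Percentage of injured players per month, copied from the outputs above
injured <- data.frame(
  month        = c("Jul", "Aug", "Jan", "Feb"),
  prop_2017_18 = c(8.7, 21.7, 21.7, 8.7),
  prop_2018_19 = c(15.8, 21.1, 52.6, 36.8)
)

# Difference between seasons, in percentage points
injured$diff_pp <- round(injured$prop_2018_19 - injured$prop_2017_18, 1)
injured
```

The differences are around 30 percentage points in January and February, against less than 10 in July and August.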
availability_table3 <- injprev(injd, by = "monthly", var_type_injury = "injury_type")
Tidy up:
## season 1
availability_table3 %>% 
  filter(season == "season 2017/2018", month == "Jan") %>% 
  kable(col.names = c("Season", "Month", "Availability", "N", "Total", "%"),
        caption = "Season 2017/2018") %>% 
  kable_styling(full_width = FALSE, position = "float_left")
## season 2
availability_table3 %>% 
  filter(season == "season 2018/2019", month == "Jan") %>% 
  kable(col.names = c("Season", "Month", "Availability", "N", "Total", "%"),
        caption = "Season 2018/2019") %>% 
  kable_styling(full_width = FALSE, position = "left")
| Season | Month | Availability | N | Total | % | 
|---|---|---|---|---|---|
| season 2017/2018 | Jan | Available | 18 | 23 | 78.3 | 
| season 2017/2018 | Jan | Ligament | 1 | 23 | 4.3 | 
| season 2017/2018 | Jan | Muscle | 3 | 23 | 13.0 | 
| season 2017/2018 | Jan | Unknown | 1 | 23 | 4.3 | 
| Season | Month | Availability | N | Total | % | 
|---|---|---|---|---|---|
| season 2018/2019 | Jan | Available | 9 | 19 | 47.4 | 
| season 2018/2019 | Jan | Bone | 1 | 19 | 5.3 | 
| season 2018/2019 | Jan | Concussion | 2 | 19 | 10.5 | 
| season 2018/2019 | Jan | Ligament | 1 | 19 | 5.3 | 
| season 2018/2019 | Jan | Muscle | 3 | 19 | 15.8 | 
| season 2018/2019 | Jan | Unknown | 4 | 19 | 21.1 | 
In the near future, the negative binomial
(method = "negbin" argument), zero-inflated Poisson
("zinfpois") and zero-inflated negative binomial
("zinfnb") methods will be available in the injsummary()
function.
Finally, this document shows how to perform descriptive analyses for injury epidemiology. Naturally, these analyses may be followed by further statistical inference or multivariate regression analyses, either to infer properties of the player/athlete population (e.g. to test whether the injury incidence rates of two cohorts differ) or to evaluate the influence of independent factors (e.g. previous injuries, workload) on the injuries that occurred.