---
title: "Dispersion analysis"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Dispersion analysis}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(tlda)
```

### Overview

The `tlda` package offers various functions for corpus-linguistic dispersion analysis (see Gries 2020; Sönning 2025). The dispersion measures that are implemented are parts-based, which means that they assess the distribution of an item across corpus parts. For any type of dispersion analysis carried out with the `tlda` package, you therefore need to supply two variables:

- the *subfrequencies*, i.e. the number of occurrences of the item in each corpus part
- the *size of the corpus parts*

You can provide these data in two forms, and the choice between them mainly depends on whether you want to measure the dispersion of a *single* item only, or for *multiple* items simultaneously. 

- Single item: *vector form*. You can provide the subfrequencies and part sizes as two separate vectors.
- Multiple items: *term-document matrix*. This is a summary table that includes subfrequencies for multiple items, where rows represent items, and columns represent corpus parts. (explained in more detail below)

Accordingly, there are two versions of each function: The function `disp()`, for instance, calculates dispersion measures for data provided in vector form. The `_tdm` suffix in a function name indicates the version that works with a term-document matrix. The function `disp_tdm()` therefore offers the same functionality as `disp()`, the only difference being that it takes data provided in matrix format.


### Term-document matrix (TDM)

When analyzing the dispersion of multiple (and possibly all) items in a corpus, it is convenient (and efficient) to use what is referred to as a term-document matrix (TDM). This is a tabular arrangement that lists the subfrequencies for multiple items. The following is an excerpt from a TDM for ICE-GB (Nelson et al. 2002), where items represent word forms. It shows a selection of 10 items and 8 corpus parts (i.e. text files):

```{r echo=FALSE}
set.seed(8)
tdm_excerpt <- biber150_ice_gb[sample(2:151, size = 10), sample(1:500, 8)]
tdm_excerpt
```

Importantly, the rows in a TDM represent items and the columns represent corpus parts. A 'proper' TDM includes all words occurring in the corpus. In that case, we can retrieve the size of the corpus parts by summing the subfrequencies in each column (i.e. for each corpus part). This can be done in R with the function `addmargins()`, which appends the column sums as a new row at the end of the table (the argument `margins = 1` tells it to sum over columns rather than rows).

```{r eval=FALSE}
addmargins(tdm, margin = 1)
```

There are many settings, however, where we work with 'improper' TDMs, where the rows only represent a selection of items in the corpus. This is the case if we focus on certain lexical items of interest, or if we are dealing with phraseological or syntactic structures. In such a case, the size of corpus parts cannot be recovered from the TDM; instead, it needs to be supplied as a separate row. 

The `tldr` package includes a number of TDMs that hold distributional information for a selection of 150 lexical items (a list compiled by Biber et al. 2016 for the evaluation of dispersion measures). In these TDMs, the first row (`word_count`) records the part sizes (i.e. the number of word tokens). The TDM begins like this:

```{r}
biber150_ice_gb[1:10, 1:8]
```

The part sizes can alternatively be given in the last row of the table. In general, you need to tell the function where to find them.


### Dispersion scores: Directionality of scaling

Parts-based dispersion scores range between 0 and 1, and the conventional scaling (see Juilland & Chang-Rodriguez 1964; Carroll 1970; Rosengren 1971) assigns higher scores to items that are more dispersed, i.e. that show a wider spread or a more even/balanced distribution across corpus parts. Unfortunately, some more recent dispersion measures have the reverse scaling, which can cause confusion. For this reason, the functions in the `tlda` package explicitly control the directionality of scaling and treat it separately (!) from the original formula. The default setting for all (!) dispersion measures uses conventional scaling, where higher values reflect higher dispersion (`directionality = "conventional"`). The reverse scaling can be requested by specifying `directionality = "gries"`.


### Overview of functions

The functions `disp()` and `disp_tdm()` calculate seven dispersion measures:

- *R~rel~*  (relative range)
- *D*  (Juilland's *D*)
- *D~2~*  (Carroll's *D~2~*)
- *S*  (Rosengren's *S*)
- *D~P~*  (Gries's *deviation of proportions*)
- *D~A~*  (Burch et al. 2017)
- *D~KL~*  (based on the Kullback-Leibler divergence)

However, these functions do not provide finer control over the way in which a specific dispersion measure is calculated. This limitation concerns indices for which multiple formulas (or versions, or computational procedures) exist in the literature:

- *Range*: Different versions exist (absolute range, relative range, relative range with size)
- *D~P~*: Different formulas in found in the literature
- *D~A~*: There is a basic and a computationally more efficient (approximate) procedure
- *D~KL~*: Various methods exist for standardizing the Kullback-Leibler divergence to the unit interval [0,1]

By default, `disp()` and `disp_tdm()` print information that tells you which version of these measures it uses. This printout also reminds you of the directionality of scaling that has been applied. The following code returns dispersion scores for the item *a* in ICE-GB (Nelson et al. 2002): 

```{r}
disp(
  subfreq = biber150_ice_gb[2,], # row 2 in the TDM represents "a"
  partsize = biber150_ice_gb[1,] # row 1 in the TDM contains the part sizes
)
```

While `disp()` and `disp_tdm()` use sensible default settings, you may want to have more control over the way in which these four indices are calculated. You can then turn to the following functions:

- `disp_R()` / `disp_R_tdm()`: This function allows you to use different versions of *range*: 
  - absolute range (the number of corpus parts containing the item)
  - relative range (the proporiton of corpus parts containing the item)
  - relative range with size (which takes into account the size of the corpus parts)
- `disp_DP()` / `disp_DP_tdm()`: Here you can choose between different formulas that have been suggested for Gries's *deviation of proportions*:
  - original version in Gries (2008)
  - modification described in Lijffijt & Gries (2012)
  - modification described in Egbert et al. (2020)
- `disp_DA()` / `disp_DA_tdm()`: You can select the computational procedure:
  - basic, which implements the formula
  - shortcut, which is quicker and yields a close approximation to the basic version
- `disp_DKL()` / `disp_DKL_tdm()`: Allows for the choice between different methods for standardizing KLD scores (see Gries 2024: 90)
- `disp_S()` / `disp_S_tdm()`: For Rosengren's *S*, uses sensible defaults for frequency adjustment


### Frequency adjustment

The `tlda` package also allows you to adjust dispersion scores for the frequency of the item, by specifying `freq_adjust = TRUE`. This addresses an important concern raised by Gries (2022, 2024), namely that all parts-based dispersion measures provide a score that in fact blends information on dispersion and frequency. He therefore proposed a method for 'partialing out' frequency, i.e. to remove its unwanted effect on the dispersion score we obtain.

To address this issue, Gries (2022; 2024) suggests that the dispersion score for a specific item should be re-expressed based on its dispersion potential in the corpus at hand. The dispersion potential refers to the lowest and highest possible score an item can obtain given its overall corpus frequency as well as the number (and size) of the corpus parts. Dispersion is then re-expressed relative to these endpoints, where the dispersion pessimum is set to 0, and the dispersion optimum to 1 (using conventional scaling). The frequency-adjusted score falls between these bounds and expresses how close the observed distribution is to the theoretical maximum and minimum. In Gries (2024), this is referred to as the *min-max transformation*.

This adjustment therefore requires a maximally and a minimally dispersed distribution of the item across the parts. These hypothetical extremes may be determined in different ways. Gries (2022, 2024) suggests a computationally expensive strategy, which uses a trial-and-error approach to find the distribution that maximizes the *dispersion score* yielded by a particular index (e.g. *D~KL~*). The `tlda` package uses a different approach. It finds the maximally and minimally dispersed distribution independently of the specific measure(s) applied. Instead, it operates based on the distributional features of *pervasiveness* or *evenness*. This is to say that it constructs a distribution based on what we may consider, conceptually, as "highly dispersed". The user must decide whether extremes should represent distributions where the item is maximally/minimally *pervasive* (`freq_adjust_method = "pervasive"`), i.e. whether it is spread as widely or narrowly as possible across corpus parts, or whether the extremes should mark distributions that are maximally/minimally even (`freq_adjust_method = "even"`), i.e. whether the spread of the item across corpus parts is maximally balanced or maximally concentrated (bursty/clumpy). More details and explanations can be found in the vignette `vignette("frequency-adjustment")`.



### Calculating dispersion for a single item

We illustrate the use of the `tlda` package by focussing on the distribution of the item *actually* in the Spoken BNC2014. The "corpus parts" are the 668 speakers. We start by extracting the distributional information for *actually* from the built-in TDM `biber150_spokenBNC2014`.

```{r}
speaker_word_count <- biber150_spokenBNC2014["word_count",]
subfreq_actually <- biber150_spokenBNC2014["actually",]
```


#### General function `disp()`

We start with the umbrella function `disp()`:

```{r}
disp(
  subfreq = subfreq_actually,
  partsize = speaker_word_count
)
```

We can change to the reverse scaling by supplying `directionality = "gries"`. Note how the information given in the printout changes accordingly.

```{r}
disp(
  subfreq = subfreq_actually,
  partsize = speaker_word_count,
  directionality = "gries"
)
```

For frequency-adjusted scores, we supply `freq_adjust = TRUE`. By default, the method `even` is used, which gives priority to evenness when building a minimally and maximally dispersed reference distribution (see above).

```{r}
disp(
  subfreq = subfreq_actually,
  partsize = speaker_word_count,
  freq_adjust = TRUE
)
```

#### Functions for specific dispersion measures

If you require more flexibility, you can use one of the functions for specific dispersion measures. 


For *Range*, the function `disp_R()` offers three choices: 

- relative range (`relative`), i.e. the *proportion* of corpus parts containing at least one occurrence of the item (default)
- absolute range (`absolute`), i.e. the *number* of corpus parts containing at least one occurrence of the item
- relative range with size (`relative_withsize`), the proportional version that takes into account the size of the corpus parts (see Gries 2022: 179-180; Gries 2024: 27-28). 

The following code returns relative range with size:

```{r}
disp_R(
  subfreq = subfreq_actually,
  partsize = speaker_word_count,
  type = "relative_withsize"
)
```


For Gries's *deviation of proportions*, the function `disp_DP()` allows you to choose from three different formulas that are found in the literature. The following code uses the original formula in Gries (2008), but with `conventional` (!) scaling (see information in print-out):

```{r}
disp_DP(
  subfreq = subfreq_actually,
  partsize = speaker_word_count,
  formula = "gries_2008"
)
```

We can also compare the scores produced by the three formulas. We now suppress the printing of the score (`print_score = FALSE`) and the background information (`verbose = FALSE`):

```{r}
compare_DPs <- rbind(
  disp_DP(subfreq = subfreq_actually,
          partsize = speaker_word_count,
          formula = "gries_2008",
          verbose = FALSE, print_score = FALSE),
  disp_DP(subfreq = subfreq_actually,
          partsize = speaker_word_count,
          formula = "lijffijt_gries_2012",
          verbose = FALSE, print_score = FALSE),
  disp_DP(subfreq = subfreq_actually,
          partsize = speaker_word_count,
          formula = "egbert_etal_2020",
          verbose = FALSE, print_score = FALSE
  ))

rownames(compare_DPs) <- c(
  "Gries (2008)",
  "Lijffijt & Gries (2012)",
  "Egbert et al. (2020)"
)
compare_DPs
```

For *D~A~*, three computational procedures are available, The `basic` version is a direct implementation of the actual formula for this measure, which is computationally expensive if the number of corpus parts is large. Wilcox (1973: 343) gives a `shortcut` version, which is much quicker (see this [blog post](https://lsoenning.github.io/posts/2023-12-11_computation_DA/)). Finally, `shortcut_mod` is a slightly adapted form of the shortcut (EXPERIMENTAL), which ensures that scores do not exceed 1 (conventional scaling). The following code uses Wilcox's (1973) quicker approximate procedure:

```{r}
disp_DA(
  subfreq = subfreq_actually,
  partsize = speaker_word_count,
  procedure = "shortcut"
  )
```

Again, we can compare the scores produced by the different computational procedures, again suppressing the printing of the score and background information:

```{r}
compare_DAs <- rbind(
  disp_DA(subfreq = subfreq_actually,
          partsize = speaker_word_count,
          procedure = "basic",
          verbose = FALSE, print_score = FALSE),
  disp_DA(subfreq = subfreq_actually,
          partsize = speaker_word_count,
          procedure = "shortcut",
          verbose = FALSE, print_score = FALSE),
  disp_DA(subfreq = subfreq_actually,
          partsize = speaker_word_count,
          procedure = "shortcut_mod",
          verbose = FALSE, print_score = FALSE
  ))

rownames(compare_DAs) <- c(
  "Basic procedure",
  "Shortcut",
  "Shortcut (modified)"
)
compare_DAs
```


For *D~KL~*, we may opt for a specific standardization method, which refers to the transformation that maps the Kullback-Leibler divergence to the unit interval [0,1]. The method used in Gries (2021: 20) can be implemented using `standardization = "base_e"`:

```{r}
disp_DKL(
  subfreq = subfreq_actually,
  partsize = speaker_word_count,
  standardization = "base_e"
)
```

We compare the output of different standardization methods:

```{r}
compare_DKLs <- rbind(
  disp_DKL(subfreq = subfreq_actually,
          partsize = speaker_word_count,
          standardization = "o2p",
          verbose = FALSE, print_score = FALSE),
  disp_DKL(subfreq = subfreq_actually,
          partsize = speaker_word_count,
          standardization = "base_e",
          verbose = FALSE, print_score = FALSE),
  disp_DKL(subfreq = subfreq_actually,
          partsize = speaker_word_count,
          standardization = "base_2",
          verbose = FALSE, print_score = FALSE
  ))

rownames(compare_DKLs) <- c(
  "Odds-to-probability",
  "Base e",
  "Base 2"
)
compare_DKLs
```


### Calculating dispersion for a term-document matrix

All of the operations can be applied to a term-document matrix (TDM) using the `_tdm`-suffixed variants of the functions used above. In these functions, the arguments `subfreq` and `partsize` are replaced by the two following:

- `tdm` A term-document matrix, where rows represent items and columns represent corpus parts; it must also contain a row giving the size of the corpus parts (first or last row in the TDM)
- `row_partsize` Character string indicating which row in the TDM contains the size of the corpus parts.

In the TDMs that ship with the `tlda` package, it is the first row that contains the part sizes. You can obtain dispersion scores for all 150 items in the TDM for ICE-GB (`biber150_ice_gb`) as follows. Since the function `disp_tdm()` returns seven indices, the output is a matrix, which we store as a new object `DM_ice_gb`.

```{r}
DM_ice_gb <- disp_tdm(
    tdm = biber150_ice_gb, 
    row_partsize = "first",
    print_score = FALSE,
    verbose = FALSE)
```

We convert the matrix to a data frame, since this makes it easier to work with the results here.

```{r}
DM_ice_gb <- data.frame(DM_ice_gb)
```

We then only print the first ten items, and round scores to two decimal places:

```{r}
round(DM_ice_gb[1:10,], 2)
```

We can now inspect the distribution of scores for these 150 items visually. For instance, we may be interested in the distribution of the *D~P~* scores for the 150 items.

```{r fig.width=3.5, fig.height=2.5, warning=FALSE, message=FALSE, fig.align='center', out.width="35%"}
par(mar = c(4, 4, 1, 0.3), xpd = TRUE)

hist(
  DM_ice_gb$DP, 
  main = NULL, 
  xlab = "DP", 
  xlim = c(0,1), 
  breaks = seq(0,1,.05), 
  col = "grey60")
```

The same plot for Juilland's *D* demonstrates its sensitivity to the number of corpus parts: Most scores are bunched up near 1, since we are calculating dispersion across 500 corpus parts (i.e. text files).

```{r fig.width=3.5, fig.height=2.5, warning=FALSE, message=FALSE, fig.align='center', out.width="35%"}
par(mar = c(4, 4, 1, 0.3), xpd = TRUE)

hist(
  DM_ice_gb$D, 
  main = NULL, 
  xlab = "DP", 
  xlim = c(0,1), 
  breaks = seq(0,1,.05), 
  col = "grey60")
```

The same is true, although less dramatically, for Carroll's *D~2~*:

```{r fig.width=3.5, fig.height=2.5, warning=FALSE, message=FALSE, fig.align='center', out.width="35%"}
par(mar = c(4, 4, 1, 0.3), xpd = TRUE)

hist(
  DM_ice_gb$D2, 
  main = NULL, 
  xlab = "DP", 
  xlim = c(0,1), 
  breaks = seq(0,1,.05), 
  col = "grey60")
```

To inspect the correlation between the scores produced by different measures, we can draw a scatterplot matrix:

```{r fig.width=5, fig.height=5, fig.align='center'}
pairs(DM_ice_gb, gap = 0, cex = .5, cex.labels = 1)
```

To inspect the association of dispersion scores with frequency, we add a new column to the data frame. To obtain the corpus frequency for the 150 items, we add up their subfrequencies by summing across the rows in the TDM `biber150_ice_gb`, excluding row 1 (!), which contains the sizes of the corpus parts.

```{r}
DM_ice_gb$frequency <- rowSums(biber150_ice_gb[-1,])
```

Now we can look at the association between dispersion scores and the (logged) corpus frequency of these 150 items. The following scatterplot looks at *D~P~*:

```{r fig.width=3, fig.height=2.5, warning=FALSE, message=FALSE, fig.align='center', out.width="35%"}
par(mar = c(4, 4, 1, 0.3), xpd = TRUE)

plot(
  DM_ice_gb$DP ~ log(DM_ice_gb$frequency),
  xlab = "Log frequency",
  ylab = "DP",
  ylim = c(0,1))

```

We can express this association using Spearman's rank correlation coefficient:

```{r}
cor(
  DM_ice_gb$DP, 
  log(DM_ice_gb$frequency), 
  method = "spearman",
  use = "complete.obs")
```
Let us now adjust scores for frequency by repeating the above steps but supplying `freq_adjust = TRUE` to the `disp_tdm()` function. Note that this takes a bit longer to run.

```{r}
DM_ice_gb_nofreq <- disp_tdm(
    tdm = biber150_ice_gb, 
    row_partsize = "first",
    freq_adjust = TRUE,
    freq_adjust_method = "even",
    print_score = FALSE,
    verbose = FALSE)

DM_ice_gb_nofreq <- data.frame(DM_ice_gb_nofreq)

DM_ice_gb_nofreq$frequency <- rowSums(biber150_ice_gb[-1,])

str(DM_ice_gb_nofreq)
```

The association with frequency is now attenuated:

```{r fig.width=3, fig.height=2.5, warning=FALSE, message=FALSE, fig.align='center', out.width="35%"}
oldpar <- par(mar = c(5.1, 4.1, 4.1, 2.1))

par(mar = c(4, 4, 1, 0.3), xpd = TRUE)

plot(
  DM_ice_gb_nofreq$DP_nofreq ~ log(DM_ice_gb_nofreq$frequency),
  xlab = "Log frequency",
  ylab = "DP",
  ylim = c(0,1))

par(oldpar)
```


This is also reflected in Spearman's rank correlation coefficient:

```{r}
cor(
  DM_ice_gb_nofreq$DP_nofreq, 
  log(DM_ice_gb_nofreq$frequency), 
  method = "spearman",
  use = "complete.obs")
```



### References

Biber, Douglas, Randi Reppen, Erin Schnur & Romy Ghanem. 2016. On the (non)utility of Juilland’s *D* to measure lexical dispersion in large corpora. *International Journal of Corpus Linguistics* 21(4). 439–464. doi: [10.1075/ijcl.21.4.01bib](https://doi.org/10.1075/ijcl.21.4.01bib) 

Burch, Brent, Jesse Egbert & Douglas Biber. 2017. Measuring and interpreting lexical dispersion in corpus linguistics. *Journal of Research Design and Statistics in Linguistics and Communication Science* 3(2). 189–216. doi: [10.1558/jrds.33066](https://doi.org/10.1558/jrds.33066)

Carroll, John B. 1970. An alternative to Juilland’s usage coefficient for lexical frequencies and a proposal for a standard frequency index. *Computer Studies in the Humanities and Verbal Behaviour* 3(2). 61–65. doi: [10.1002/j.2333-8504.1970.tb00778.x](https://doi.org/10.1002/j.2333-8504.1970.tb00778.x)

Egbert, Jesse, Brent Burch & Douglas Biber. 2020. Lexical dispersion and corpus design. *International Journal of Corpus Linguistics* 25(1). 89–115. doi: [10.1075/ijcl.18010.egb](https://doi.org/10.1075/ijcl.18010.egb)

Gries, Stefan Th. 2008. Dispersions and adjusted frequencies in corpora. *International Journal of Corpus Linguistics* 13(4). 403–437. doi: [10.1075/ijcl.13.4.02gri](https://doi.org/10.1075/ijcl.13.4.02gri)

Gries, Stefan Th. 2020. Analyzing dispersion. In Magali Paquot & Stefan Th. Gries (eds.), *A practical handbook of corpus linguistics*, 99–118. New York: Springer. doi: [10.1007/978-3-030-46216-1_5](https://doi.org/10.1007/978-3-030-46216-1_5)

Gries, Stefan Th. 2021. A new approach to (key) keywords analysis: Using frequency, and now also dispersion. *Research in Corpus Linguistics* 9(2). 1−33. doi: [10.32714/ricl.09.02.02](https://doi.org/10.32714/ricl.09.02.02)

Juilland, Alphonse G. & Eugenio Chang-Rodriguez. 1964. *Frequency dictionary of Spanish words.* The Hague: Mouton de Gruyter. doi: [10.1515/9783112415467](https://doi.org/10.1515/9783112415467)

Keniston, Hayward. 1920. Common words in Spanish. *Hispania* 3(2). 85–96. doi: [10.2307/331305](https://doi.org/10.2307/331305)

Lijffijt, Jefrey & Stefan Th. Gries. 2012. Correction to Stefan Th. Gries’ ‘Dispersions and adjusted frequencies in corpora’. *International Journal of Corpus Linguistics* 17(1). 147–149. doi: [10.1075/ijcl.17.1.08lij](https://doi.org/10.1075/ijcl.17.1.08lij)

Lyne, Anthony A. 1985. *The vocabulary of French business correspondence*. Paris: Slatkine-Champion.

Nelson, Gerald, Sean Wallis & Bas Aarts. 2002. *Exploring Natural Language: Working with the British Component of the International Corpus of English*. Amsterdam: John Benjamins. doi: [10.1075/veaw.g29](https://doi.org/10.1075/veaw.g29)

Rosengren, Inger. 1971. The quantitative concept of language and its relation to the structure of frequency dictionaries. *Etudes de linguistique appliquee (Nouvelle Serie)* 1. 103–127.

Soenning, Lukas. 2025. Advancing our understanding of dispersion measures in corpus research. *Corpora*. doi: [10.3366/cor.2025.0326](https://doi.org/10.3366/cor.2025.0326)