
Dispersion analysis

library(tlda)

Overview

The tlda package offers various functions for corpus-linguistic dispersion analysis (see Gries 2020; Sönning 2025). The dispersion measures that are implemented are parts-based, which means that they assess the distribution of an item across corpus parts. For any type of dispersion analysis carried out with the tlda package, you therefore need to supply two variables:

- the subfrequencies, i.e. the number of times the item occurs in each corpus part
- the sizes of the corpus parts, measured in word tokens

You can provide these data in two forms, and the choice between them mainly depends on whether you want to measure the dispersion of a single item only, or for multiple items simultaneously.

Accordingly, there are two versions of each function: The function disp(), for instance, calculates dispersion measures for data provided in vector form. The _tdm suffix in a function name indicates the version that works with a term-document matrix. The function disp_tdm() therefore offers the same functionality as disp(), the only difference being that it takes data provided in matrix format.
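As a minimal sketch with invented numbers, the vector form looks like this:

subfreq  <- c(3, 0, 5, 1)              # occurrences of the item in each part
partsize <- c(2000, 1500, 1800, 2200)  # sizes of the parts (word tokens)

disp(subfreq = subfreq, partsize = partsize)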

Term-document matrix (TDM)

When analyzing the dispersion of multiple (and possibly all) items in a corpus, it is convenient (and efficient) to use what is referred to as a term-document matrix (TDM). This is a tabular arrangement that lists the subfrequencies for multiple items. The following is an excerpt from a TDM for ICE-GB (Nelson et al. 2002), where items represent word forms. It shows a selection of 10 items and 8 corpus parts (i.e. text files):

#>          w2c-006 s2a-023 w2a-010 w1b-004 s1b-010 w2a-011 s2b-015 w2d-010
#> ninety         0       0       0       0       0       0       0       0
#> from           3       6       8       8       3       8      15      11
#> large          0       0       0       0       0       0       1       0
#> another        1       0       0       1       1       0       2       2
#> somebody       0       2       0       0       0       0       0       0
#> a             54      30      40      31      61      33      40      79
#> time           4       4       4       0       4       0       6       4
#> oh             0       0       0       0       6       0       0       0
#> twenty         0       2       0       0       1       0       1       0
#> may            1       4       5       6       2      14       0       9

Importantly, the rows in a TDM represent items and the columns represent corpus parts. A ‘proper’ TDM includes all words occurring in the corpus. In that case, we can retrieve the size of the corpus parts by summing the subfrequencies in each column (i.e. for each corpus part). This can be done in R with the function addmargins(), which appends the column sums as a new row at the end of the table (the argument margin = 1 tells it to sum over columns rather than rows).

addmargins(tdm, margin = 1)

There are many settings, however, where we work with ‘improper’ TDMs, where the rows only represent a selection of items in the corpus. This is the case if we focus on certain lexical items of interest, or if we are dealing with phraseological or syntactic structures. In such a case, the size of corpus parts cannot be recovered from the TDM; instead, it needs to be supplied as a separate row.
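As an illustration with invented numbers, such an ‘improper’ TDM for two hypothetical syntactic structures could be assembled as follows, with the part sizes supplied as its first row:

tdm_small <- rbind(
  word_count  = c(2000, 1500, 1800),  # part sizes (word tokens)
  structure_a = c(12, 4, 9),          # subfrequencies of hypothetical items
  structure_b = c(3, 0, 5)
)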

The tlda package includes a number of TDMs that hold distributional information for a selection of 150 lexical items (a list compiled by Biber et al. 2016 for the evaluation of dispersion measures). In these TDMs, the first row (word_count) records the part sizes (i.e. the number of word tokens). The TDM begins like this:

biber150_ice_gb[1:10, 1:8]
#>            s1a-001 s1a-002 s1a-003 s1a-004 s1a-005 s1a-006 s1a-007 s1a-008
#> word_count    2195    2159    2287    2290    2120    2060    2025    2177
#> a               50      38      44      67      35      34      37      29
#> able             2       4       4       0       0       0       0       0
#> actually         3       6       2       2       6       3       0       8
#> after            0       0       0       0       4       1       1       0
#> against          0       0       0       0       0       0       0       0
#> ah               1       0       0       0       1       6       1       2
#> aha              0       0       0       0       0       0       0       0
#> all              2       5       6       9       7       5       8      13
#> among            0       0       0       0       0       0       0       0

The part sizes can alternatively be given in the last row of the table. In general, you need to tell the function where to find them.
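In the _tdm functions introduced below, this is done via the row_partsize argument:

disp_tdm(tdm = biber150_ice_gb, row_partsize = "first")

For a TDM whose last row holds the part sizes, row_partsize = "last" should be the appropriate setting.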

Dispersion scores: Directionality of scaling

Parts-based dispersion scores range between 0 and 1, and the conventional scaling (see Juilland & Chang-Rodriguez 1964; Carroll 1970; Rosengren 1971) assigns higher scores to items that are more dispersed, i.e. that show a wider spread or a more even/balanced distribution across corpus parts. Unfortunately, some more recent dispersion measures have the reverse scaling, which can cause confusion. For this reason, the functions in the tlda package explicitly control the directionality of scaling and treat it separately (!) from the original formula. The default setting for all (!) dispersion measures uses conventional scaling, where higher values reflect higher dispersion (directionality = "conventional"). The reverse scaling can be requested by specifying directionality = "gries".

Overview of functions

The functions disp() and disp_tdm() calculate seven dispersion measures:

- Rrel: relative range, i.e. the proportion of corpus parts containing at least one occurrence of the item
- D: Juilland's D (Juilland & Chang-Rodriguez 1964)
- D2: Carroll's D2 (Carroll 1970)
- S: Rosengren's S (Rosengren 1971)
- DP: Gries's deviation of proportions (Gries 2008)
- DA (Burch et al. 2017)
- DKL: a measure based on the Kullback-Leibler divergence (Gries 2021)

However, these functions do not provide finer control over the way in which a specific dispersion measure is calculated. This limitation concerns indices for which multiple formulas (or versions, or computational procedures) exist in the literature:

- Range, which may be expressed in absolute or relative terms
- Gries's deviation of proportions DP, for which several formulas have been proposed
- DA, which can be computed using different procedures
- DKL, which can be standardized to the unit interval [0,1] in different ways

By default, disp() and disp_tdm() print information that tells you which version of these measures they use. This printout also reminds you of the directionality of scaling that has been applied. The following code returns dispersion scores for the item a in ICE-GB (Nelson et al. 2002):

disp(
  subfreq = biber150_ice_gb[2,], # row 2 in the TDM represents "a"
  partsize = biber150_ice_gb[1,] # row 1 in the TDM contains the part sizes
)
#>      Rrel         D        D2         S        DP        DA       DKL 
#> 1.0000000 0.9870710 0.9933147 0.9792124 0.8883657 0.8392881 0.9439996
#> 
#> Scores follow conventional scaling:
#>   0 = maximally uneven/bursty/concentrated distribution (pessimum)
#>   1 = maximally even/dispersed/balanced distribution (optimum)
#> 
#> For Gries's DP, the function uses the modified version suggested by
#>   Egbert et al. (2020)
#> 
#> For DKL, standardization to the unit interval [0,1] is based on the
#>   odds-to-probability transformation, see Gries (2024: 90)

While disp() and disp_tdm() use sensible default settings, you may want to have more control over the way in which these four indices are calculated. You can then turn to the following functions:

- disp_R() for Range
- disp_DP() for Gries's deviation of proportions
- disp_DA() for DA
- disp_DKL() for DKL

Frequency adjustment

The tlda package also allows you to adjust dispersion scores for the frequency of the item, by specifying freq_adjust = TRUE. This addresses an important concern raised by Gries (2022, 2024), namely that all parts-based dispersion measures provide a score that in fact blends information on dispersion and frequency. He therefore proposed a method for ‘partialing out’ frequency, i.e. to remove its unwanted effect on the dispersion score we obtain.

To address this issue, Gries (2022; 2024) suggests that the dispersion score for a specific item should be re-expressed based on its dispersion potential in the corpus at hand. The dispersion potential refers to the lowest and highest possible score an item can obtain given its overall corpus frequency as well as the number (and size) of the corpus parts. Dispersion is then re-expressed relative to these endpoints, where the dispersion pessimum is set to 0, and the dispersion optimum to 1 (using conventional scaling). The frequency-adjusted score falls between these bounds and expresses how close the observed distribution is to the theoretical maximum and minimum. In Gries (2024), this is referred to as the min-max transformation.
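As a sketch with invented numbers: if an item's observed score is 0.75, and the lowest and highest scores it could possibly attain (given its frequency) are 0.20 and 0.95, the frequency-adjusted score is

(0.75 - 0.20) / (0.95 - 0.20)
#> [1] 0.7333333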

This adjustment therefore requires a maximally and a minimally dispersed distribution of the item across the parts. These hypothetical extremes may be determined in different ways. Gries (2022, 2024) suggests a computationally expensive strategy, which uses a trial-and-error approach to find the distribution that maximizes the dispersion score yielded by a particular index (e.g. DKL). The tlda package takes a different approach: it constructs the maximally and minimally dispersed distributions independently of the specific measure(s) applied, based on the distributional features of pervasiveness and evenness, i.e. on what we may consider, conceptually, as a “highly dispersed” distribution. The user must decide whether the extremes should represent distributions where the item is maximally/minimally pervasive (freq_adjust_method = "pervasive"), i.e. spread as widely or as narrowly as possible across corpus parts, or maximally/minimally even (freq_adjust_method = "even"), i.e. spread in a maximally balanced or maximally concentrated (bursty/clumpy) way. More details and explanations can be found in the vignette vignette("frequency-adjustment").

Calculating dispersion for a single item

We illustrate the use of the tlda package by focussing on the distribution of the item actually in the Spoken BNC2014. The “corpus parts” are the 668 speakers. We start by extracting the distributional information for actually from the built-in TDM biber150_spokenBNC2014.

speaker_word_count <- biber150_spokenBNC2014["word_count",]
subfreq_actually <- biber150_spokenBNC2014["actually",]

General function disp()

We start with the umbrella function disp():

disp(
  subfreq = subfreq_actually,
  partsize = speaker_word_count
)
#>      Rrel         D        D2         S        DP        DA       DKL 
#> 0.8817365 0.9662460 0.9440567 0.8908132 0.7461380 0.5466095 0.7642072
#> 
#> Scores follow conventional scaling:
#>   0 = maximally uneven/bursty/concentrated distribution (pessimum)
#>   1 = maximally even/dispersed/balanced distribution (optimum)
#> 
#> For Gries's DP, the function uses the modified version suggested by
#>   Egbert et al. (2020)
#> 
#> For DKL, standardization to the unit interval [0,1] is based on the
#>   odds-to-probability transformation, see Gries (2024: 90)

We can change to the reverse scaling by supplying directionality = "gries". Note how the information given in the printout changes accordingly.

disp(
  subfreq = subfreq_actually,
  partsize = speaker_word_count,
  directionality = "gries"
)
#>       Rrel          D         D2          S         DP         DA        DKL 
#> 0.11826347 0.03375403 0.05594325 0.10918680 0.25386196 0.45339049 0.23579282
#> 
#> Scores follow scaling used by Gries (2008):
#>   0 = maximally even/dispersed/balanced distribution (optimum)
#>   1 = maximally uneven/bursty/concentrated distribution (pessimum)
#> 
#> For Gries's DP, the function uses the modified version suggested by
#>   Egbert et al. (2020)
#> 
#> For DKL, standardization to the unit interval [0,1] is based on the
#>   odds-to-probability transformation, see Gries (2024: 90)
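Note that the two scalings are complementary: a score under directionality = "gries" is simply 1 minus its conventional counterpart. A quick check with the DP scores from the two printouts:

1 - 0.7461380  # conventional DP for "actually"
#> [1] 0.253862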

For frequency-adjusted scores, we supply freq_adjust = TRUE. By default, the method even is used, which gives priority to evenness when building a minimally and maximally dispersed reference distribution (see above).

disp(
  subfreq = subfreq_actually,
  partsize = speaker_word_count,
  freq_adjust = TRUE
)
#> Rrel_nofreq    D_nofreq   D2_nofreq    S_nofreq   DP_nofreq   DA_nofreq 
#>   0.9004975   0.8074350   0.8752113   0.8910372   0.7492699   0.5568662 
#>  DKL_nofreq 
#>   0.7396345
#> 
#> Dispersion scores are adjusted for frequency using the min-max
#>   transformation (see Gries 2024: 196-208); please note that the
#>   method implemented here does not work well if corpus parts differ
#>   considerably in size; see vignette('frequency-adjustment')
#> 
#> Scores follow conventional scaling:
#>   0 = maximally uneven/bursty/concentrated distribution (pessimum)
#>   1 = maximally even/dispersed/balanced distribution (optimum)
#> 
#> For Gries's DP, the function uses the modified version suggested by
#>   Egbert et al. (2020)
#> 
#> For DKL, standardization to the unit interval [0,1] is based on the
#>   odds-to-probability transformation, see Gries (2024: 90)

Functions for specific dispersion measures

If you require more flexibility, you can use one of the functions for specific dispersion measures.

For Range, the function disp_R() offers three choices:

- absolute range, i.e. the number of corpus parts containing at least one occurrence of the item
- relative range, i.e. the proportion of corpus parts containing at least one occurrence of the item
- relative range with size (type = "relative_withsize"), which additionally takes the size of the corpus parts into account (Gries 2022: 179-180)

The following code returns relative range with size:

disp_R(
  subfreq = subfreq_actually,
  partsize = speaker_word_count,
  type = "relative_withsize"
)
#> Rrel_withsize 
#>     0.9914442
#> 
#> Scores represent relative range, i.e. the proportion of corpus parts
#>   containing at least one occurrence of the item. The size of the
#>   corpus parts is taken into account, see Gries (2022: 179-180),
#>   Gries (2024: 27-28)
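Conceptually, plain relative range is the proportion of corpus parts with at least one occurrence of the item, while the version with size weights each part by its share of the corpus. A rough sketch of this distinction (illustrative only, not necessarily the exact computation used by disp_R()):

occurs <- subfreq_actually > 0
mean(occurs)                                              # relative range
sum(speaker_word_count[occurs]) / sum(speaker_word_count) # weighted by size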

For Gries's deviation of proportions, the function disp_DP() allows you to choose from three different formulas found in the literature. The following code uses the original formula in Gries (2008), but with conventional (!) scaling (see the information in the printout):

disp_DP(
  subfreq = subfreq_actually,
  partsize = speaker_word_count,
  formula = "gries_2008"
)
#> DP_nofreq 
#> 0.7492699
#> 
#> The dispersion score is adjusted for frequency using the min-max
#>   transformation (see Gries 2024: 196-208); please note that the
#>   method implemented here does not work well if corpus parts differ
#>   considerably in size; see vignette('frequency-adjustment')
#> 
#> Scores follow conventional scaling:
#>   0 = maximally uneven/bursty/concentrated distribution (pessimum)
#>   1 = maximally even/dispersed/balanced distribution (optimum)
#> 
#> Computed using the original version proposed by Gries (2008)
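For reference, the core of Gries's (2008) measure compares two proportion vectors: the distribution of the item across the parts and the distribution of the part sizes. A bare-bones transcription follows (this sketch omits the frequency adjustment reported in the printout above, so its result will differ slightly):

v <- subfreq_actually / sum(subfreq_actually)      # observed proportions
s <- speaker_word_count / sum(speaker_word_count)  # expected proportions
1 - sum(abs(v - s)) / 2                            # DP, conventional scaling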

We can also compare the scores produced by the three formulas. We now suppress the printing of the score (print_score = FALSE) and the background information (verbose = FALSE):

compare_DPs <- rbind(
  disp_DP(subfreq = subfreq_actually,
          partsize = speaker_word_count,
          formula = "gries_2008",
          verbose = FALSE, print_score = FALSE),
  disp_DP(subfreq = subfreq_actually,
          partsize = speaker_word_count,
          formula = "lijffijt_gries_2012",
          verbose = FALSE, print_score = FALSE),
  disp_DP(subfreq = subfreq_actually,
          partsize = speaker_word_count,
          formula = "egbert_etal_2020",
          verbose = FALSE, print_score = FALSE
  ))

rownames(compare_DPs) <- c(
  "Gries (2008)",
  "Lijffijt & Gries (2012)",
  "Egbert et al. (2020)"
)
compare_DPs
#>                         DP_nofreq
#> Gries (2008)            0.7492699
#> Lijffijt & Gries (2012) 0.7492653
#> Egbert et al. (2020)    0.7492653

For DA, three computational procedures are available. The basic version (procedure = "basic") is a direct implementation of the actual formula for this measure, which is computationally expensive if the number of corpus parts is large. Wilcox (1973: 343) gives a shortcut version (procedure = "shortcut"), which is much quicker. Finally, procedure = "shortcut_mod" is a slightly adapted form of the shortcut (EXPERIMENTAL), which ensures that scores do not exceed 1 (conventional scaling). The following code uses Wilcox's (1973) quicker approximate procedure:

disp_DA(
  subfreq = subfreq_actually,
  partsize = speaker_word_count,
  procedure = "shortcut"
  )
#>        DA 
#> 0.5481088
#> 
#> Scores follow conventional scaling:
#>   0 = maximally uneven/bursty/concentrated distribution (pessimum)
#>   1 = maximally even/dispersed/balanced distribution (optimum)
#> 
#> Computed using the computational shortcut suggested by
#>   Wilcox (1973: 343, 'MDA', column 4)
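The cost of the basic procedure stems from comparing every pair of corpus parts. The following self-contained sketch (illustrating the general idea, not tlda's internal code) shows how such a sum of absolute pairwise differences, naively a quadratic-time computation, can be recovered from the sorted values in a single pass:

x <- runif(1000)
n <- length(x)
slow <- sum(abs(outer(x, x, "-"))) / (n * (n - 1))                  # all pairs
fast <- 2 * sum((2 * seq_len(n) - n - 1) * sort(x)) / (n * (n - 1)) # sorted
all.equal(slow, fast)
#> [1] TRUE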

We can again compare the scores produced by the different computational procedures, suppressing the printing of the score and the background information:

compare_DAs <- rbind(
  disp_DA(subfreq = subfreq_actually,
          partsize = speaker_word_count,
          procedure = "basic",
          verbose = FALSE, print_score = FALSE),
  disp_DA(subfreq = subfreq_actually,
          partsize = speaker_word_count,
          procedure = "shortcut",
          verbose = FALSE, print_score = FALSE),
  disp_DA(subfreq = subfreq_actually,
          partsize = speaker_word_count,
          procedure = "shortcut_mod",
          verbose = FALSE, print_score = FALSE
  ))

rownames(compare_DAs) <- c(
  "Basic procedure",
  "Shortcut",
  "Shortcut (modified)"
)
compare_DAs
#>                            DA
#> Basic procedure     0.5466095
#> Shortcut            0.5481088
#> Shortcut (modified) 0.5472882

For DKL, we can choose the standardization method, i.e. the transformation that maps the Kullback-Leibler divergence to the unit interval [0,1]. The method used in Gries (2021: 20) is selected with standardization = "base_e":

disp_DKL(
  subfreq = subfreq_actually,
  partsize = speaker_word_count,
  standardization = "base_e"
)
#>       DKL 
#> 0.7345144
#> 
#> Scores follow conventional scaling:
#>   0 = maximally uneven/bursty/concentrated distribution (pessimum)
#>   1 = maximally even/dispersed/balanced distribution (optimum)
#> 
#> Standardization to the unit interval [0,1] using base e,
#>   see Gries (2021: 20)

We compare the output of different standardization methods:

compare_DKLs <- rbind(
  disp_DKL(subfreq = subfreq_actually,
          partsize = speaker_word_count,
          standardization = "o2p",
          verbose = FALSE, print_score = FALSE),
  disp_DKL(subfreq = subfreq_actually,
          partsize = speaker_word_count,
          standardization = "base_e",
          verbose = FALSE, print_score = FALSE),
  disp_DKL(subfreq = subfreq_actually,
          partsize = speaker_word_count,
          standardization = "base_2",
          verbose = FALSE, print_score = FALSE
  ))

rownames(compare_DKLs) <- c(
  "Odds-to-probability",
  "Base e",
  "Base 2"
)
compare_DKLs
#>                           DKL
#> Odds-to-probability 0.7642072
#> Base e              0.7345144
#> Base 2              0.8074553
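The three scores are consistent with simple transformations of the same underlying Kullback-Leibler divergence: on conventional scaling, the odds-to-probability method corresponds to 1/(1 + KLD), base e to exp(-KLD), and base 2 to 2^(-KLD). We can back out the divergence from the first score and (approximately) recover the other two:

kld <- 1 / 0.7642072 - 1  # invert the odds-to-probability score: approx. 0.3085
exp(-kld)                 # approx. 0.7345, the base-e score above
2^(-kld)                  # approx. 0.8075, the base-2 score above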

Calculating dispersion for a term-document matrix

All of the operations can be applied to a term-document matrix (TDM) using the _tdm-suffixed variants of the functions used above. In these functions, the arguments subfreq and partsize are replaced by the following two:

- tdm: a term-document matrix, with items in rows and corpus parts in columns
- row_partsize: the location of the row recording the sizes of the corpus parts

In the TDMs that ship with the tlda package, it is the first row that contains the part sizes. You can obtain dispersion scores for all 150 items in the TDM for ICE-GB (biber150_ice_gb) as follows. Since the function disp_tdm() returns seven indices, the output is a matrix, which we store as a new object DM_ice_gb.

DM_ice_gb <- disp_tdm(
    tdm = biber150_ice_gb, 
    row_partsize = "first",
    print_score = FALSE,
    verbose = FALSE)
#> Warning in disp_tdm(tdm = biber150_ice_gb, row_partsize = "first", print_score = FALSE, : 
#>   For some item(s), all subfrequencies are 0; returning NA in this case

We convert the matrix to a data frame, since this makes it easier to work with the results here.

DM_ice_gb <- data.frame(DM_ice_gb)

We then only print the first ten items, and round scores to two decimal places:

round(DM_ice_gb[1:10,], 2)
#>          Rrel    D   D2    S   DP   DA  DKL
#> a        1.00 0.99 0.99 0.98 0.89 0.84 0.94
#> able     0.45 0.93 0.84 0.42 0.46 0.31 0.42
#> actually 0.58 0.93 0.86 0.48 0.46 0.32 0.44
#> after    0.71 0.95 0.91 0.64 0.59 0.45 0.55
#> against  0.38 0.92 0.81 0.34 0.38 0.25 0.37
#> ah       0.17 0.86 0.67 0.14 0.17 0.10 0.25
#> aha      0.01 0.50 0.22 0.01 0.01 0.01 0.12
#> all      0.97 0.97 0.96 0.88 0.74 0.64 0.76
#> among    0.21 0.89 0.72 0.20 0.21 0.15 0.29
#> an       0.98 0.97 0.97 0.91 0.76 0.67 0.80

We can now inspect the distribution of scores for these 150 items visually. For instance, we may be interested in the distribution of the DP scores.

par(mar = c(4, 4, 1, 0.3), xpd = TRUE)

hist(
  DM_ice_gb$DP, 
  main = NULL, 
  xlab = "DP", 
  xlim = c(0,1), 
  breaks = seq(0,1,.05), 
  col = "grey60")

The same plot for Juilland’s D demonstrates its sensitivity to the number of corpus parts: Most scores are bunched up near 1, since we are calculating dispersion across 500 corpus parts (i.e. text files).

par(mar = c(4, 4, 1, 0.3), xpd = TRUE)

hist(
  DM_ice_gb$D, 
  main = NULL, 
  xlab = "DP", 
  xlim = c(0,1), 
  breaks = seq(0,1,.05), 
  col = "grey60")

The same is true, although less dramatically, for Carroll’s D2:

par(mar = c(4, 4, 1, 0.3), xpd = TRUE)

hist(
  DM_ice_gb$D2, 
  main = NULL, 
  xlab = "DP", 
  xlim = c(0,1), 
  breaks = seq(0,1,.05), 
  col = "grey60")

To inspect the correlation between the scores produced by different measures, we can draw a scatterplot matrix:

pairs(DM_ice_gb, gap = 0, cex = .5, cex.labels = 1)

To inspect the association of dispersion scores with frequency, we add a new column to the data frame. To obtain the corpus frequency for the 150 items, we add up their subfrequencies by summing across the rows in the TDM biber150_ice_gb, excluding row 1 (!), which contains the sizes of the corpus parts.

DM_ice_gb$frequency <- rowSums(biber150_ice_gb[-1,])

Now we can look at the association between dispersion scores and the (logged) corpus frequency of these 150 items. The following scatterplot looks at DP:

par(mar = c(4, 4, 1, 0.3), xpd = TRUE)

plot(
  DM_ice_gb$DP ~ log(DM_ice_gb$frequency),
  xlab = "Log frequency",
  ylab = "DP",
  ylim = c(0,1))

We can express this association using Spearman’s rank correlation coefficient:

cor(
  DM_ice_gb$DP, 
  log(DM_ice_gb$frequency), 
  method = "spearman",
  use = "complete.obs")
#> [1] 0.9576285

Let us now adjust scores for frequency by repeating the above steps but supplying freq_adjust = TRUE to the disp_tdm() function. Note that this takes a bit longer to run.

DM_ice_gb_nofreq <- disp_tdm(
    tdm = biber150_ice_gb, 
    row_partsize = "first",
    freq_adjust = TRUE,
    freq_adjust_method = "even",
    print_score = FALSE,
    verbose = FALSE)
#> Warning in disp_tdm(tdm = biber150_ice_gb, row_partsize = "first", freq_adjust = TRUE, : 
#>   For some item(s), all subfrequencies are 0; returning NA in this case
#> Warning in disp_tdm(tdm = biber150_ice_gb, row_partsize = "first", freq_adjust = TRUE, : 
#>   For some item(s), the corpus frequency is 1; no frequency adjustment
#>   made in this case; function returns unadjusted dispersion score

DM_ice_gb_nofreq <- data.frame(DM_ice_gb_nofreq)

DM_ice_gb_nofreq$frequency <- rowSums(biber150_ice_gb[-1,])

str(DM_ice_gb_nofreq)
#> 'data.frame':    150 obs. of  8 variables:
#>  $ Rrel_nofreq: num  1 0.581 0.577 0.707 0.526 ...
#>  $ D_nofreq   : num  0.956 0.957 0.939 0.962 0.945 ...
#>  $ D2_nofreq  : num  0.989 0.879 0.859 0.914 0.852 ...
#>  $ S_nofreq   : num  0.979 0.52 0.482 0.645 0.456 ...
#>  $ DP_nofreq  : num  0.889 0.566 0.484 0.646 0.509 ...
#>  $ DA_nofreq  : num  0.839 0.411 0.337 0.509 0.351 ...
#>  $ DKL_nofreq : num  0.934 0.485 0.38 0.525 0.444 ...
#>  $ frequency  : num  20483 390 1067 871 364 ...

The association with frequency is now attenuated:

oldpar <- par(mar = c(4, 4, 1, 0.3), xpd = TRUE)

plot(
  DM_ice_gb_nofreq$DP_nofreq ~ log(DM_ice_gb_nofreq$frequency),
  xlab = "Log frequency",
  ylab = "DP",
  ylim = c(0,1))


par(oldpar)

This is also reflected in Spearman’s rank correlation coefficient:

cor(
  DM_ice_gb_nofreq$DP_nofreq, 
  log(DM_ice_gb_nofreq$frequency), 
  method = "spearman",
  use = "complete.obs")
#> [1] 0.389141

References

Biber, Douglas, Randi Reppen, Erin Schnur & Romy Ghanem. 2016. On the (non)utility of Juilland’s D to measure lexical dispersion in large corpora. International Journal of Corpus Linguistics 21(4). 439–464. doi: 10.1075/ijcl.21.4.01bib

Burch, Brent, Jesse Egbert & Douglas Biber. 2017. Measuring and interpreting lexical dispersion in corpus linguistics. Journal of Research Design and Statistics in Linguistics and Communication Science 3(2). 189–216. doi: 10.1558/jrds.33066

Carroll, John B. 1970. An alternative to Juilland’s usage coefficient for lexical frequencies and a proposal for a standard frequency index. Computer Studies in the Humanities and Verbal Behaviour 3(2). 61–65. doi: 10.1002/j.2333-8504.1970.tb00778.x

Egbert, Jesse, Brent Burch & Douglas Biber. 2020. Lexical dispersion and corpus design. International Journal of Corpus Linguistics 25(1). 89–115. doi: 10.1075/ijcl.18010.egb

Gries, Stefan Th. 2008. Dispersions and adjusted frequencies in corpora. International Journal of Corpus Linguistics 13(4). 403–437. doi: 10.1075/ijcl.13.4.02gri

Gries, Stefan Th. 2020. Analyzing dispersion. In Magali Paquot & Stefan Th. Gries (eds.), A practical handbook of corpus linguistics, 99–118. New York: Springer. doi: 10.1007/978-3-030-46216-1_5

Gries, Stefan Th. 2021. A new approach to (key) keywords analysis: Using frequency, and now also dispersion. Research in Corpus Linguistics 9(2). 1–33. doi: 10.32714/ricl.09.02.02

Gries, Stefan Th. 2022. What do (some of) our dispersion measures measure (most)? Dispersion? Journal of Second Language Studies 5(2). 171–205.

Juilland, Alphonse G. & Eugenio Chang-Rodriguez. 1964. Frequency dictionary of Spanish words. The Hague: Mouton de Gruyter. doi: 10.1515/9783112415467

Keniston, Hayward. 1920. Common words in Spanish. Hispania 3(2). 85–96. doi: 10.2307/331305

Lijffijt, Jefrey & Stefan Th. Gries. 2012. Correction to Stefan Th. Gries’ ‘Dispersions and adjusted frequencies in corpora’. International Journal of Corpus Linguistics 17(1). 147–149. doi: 10.1075/ijcl.17.1.08lij

Lyne, Anthony A. 1985. The vocabulary of French business correspondence. Paris: Slatkine-Champion.

Nelson, Gerald, Sean Wallis & Bas Aarts. 2002. Exploring Natural Language: Working with the British Component of the International Corpus of English. Amsterdam: John Benjamins. doi: 10.1075/veaw.g29

Rosengren, Inger. 1971. The quantitative concept of language and its relation to the structure of frequency dictionaries. Études de linguistique appliquée (Nouvelle Série) 1. 103–127.

Sönning, Lukas. 2025. Advancing our understanding of dispersion measures in corpus research. Corpora. doi: 10.3366/cor.2025.0326

Wilcox, Allen R. 1973. Indices of qualitative variation and political measurement. Western Political Quarterly 26(2). 325–343.
