
Frequency-adjusted dispersion scores

library(tlda)

Frequency adjustment: Background

A fundamental issue in dispersion analysis is that all of the most commonly used parts-based dispersion measures respond not just to the dispersion of an item across corpus parts, but also to its corpus frequency (Gries 2022, 2024). Importantly, this undesirable behavior of the indices is not due to the natural association between frequency and dispersion: more frequent items (e.g. function words) tend to have wider semantic generality and are therefore not tied to particular contexts of language use. Instead, the blending of dispersion and frequency information results from the way the scores are calculated; it is, in some sense, a flaw in the formulas themselves.

Gries (2022: 184-191; 2024: 196-208) proposed a strategy for ‘partialing out’ the frequency component from a dispersion score. The method is conceptually simple: Given an item’s observed corpus frequency and the number (and size) of corpus parts, we determine the lowest and the highest level of dispersion this item could theoretically have achieved in this corpus. This is done by finding two hypothetical distributions of subfrequencies (i.e. occurrences of the item in each corpus part): the one that produces the lowest level of dispersion, and the one that produces the highest level of dispersion. The observed score is then re-expressed relative to these bounds, on a scale from 0 to 1.

The figure below illustrates this strategy. Suppose we calculate a dispersion score of .05 for an item. We then determine the dispersion minimum (.01) and the dispersion maximum (.11). If we map .01 (the minimum) to 0 and .11 (the maximum) to 1, we can express the observed score relative to these limits: (.05 - .01) / (.11 - .01) = .40. This frequency-adjusted dispersion score of .40 tells us that the observed score lies somewhat closer to the minimum than to the maximum. This rescaling is referred to as a min-max transformation.
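The min-max transformation itself is simple arithmetic. A minimal base-R sketch (the function name is ours, not part of tlda), using the toy numbers from the figure:

min_max <- function(score, min_score, max_score) {
  (score - min_score) / (max_score - min_score)
}

min_max(0.05, min_score = 0.01, max_score = 0.11)
#> [1] 0.4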

Finding the dispersion extremes for an item

The min-max transformation requires two extreme distributions as reference points. The question, then, is how to find the subfrequencies that produce these hypothetical bounds. It seems that there are two broad approaches to this problem.

The first strategy is to build distributions that are maximally or minimally dispersed based on our linguistic understanding of this distributional feature; that is, we build extreme distributions based on the conceptual meaning of ‘dispersion’. This method is not guaranteed to produce the lowest possible dispersion score for each dispersion measure: for one or several dispersion indices, there may be an alternative set of subfrequencies that yields an even lower score.

The second strategy is to find distributions that minimize the score returned by a particular dispersion measure. Dispersion extremes are then defined based on the output of a specific formula. This search for distributional extremes relies on a trial-and-error process, as many different candidate distributions must be considered. The disadvantage of this method is that it is computationally expensive; further, it will return different dispersion extremes for different dispersion measures. This may be considered unreasonable from a linguistic point of view.

Gries’s (2024) approach combines the two methods: the dispersion minimum is constructed conceptually, the dispersion maximum via a trial-and-error process.

The tlda package implements the first approach and builds extreme distributions based on the conceptual meaning of ‘dispersion’. This term has two partly overlapping meanings: Dispersion can be understood as referring to the pervasiveness of an item, i.e. how widely it is spread across corpus parts. It can also be understood as denoting the evenness of an item’s distribution, i.e. how evenly it is spread across corpus parts. As will be illustrated shortly, we can construct extreme distributions based on these two concepts, pervasiveness and evenness. This approach defines hypothetical extremes based on the construct of interest, and in terms of our linguistic understanding of what characterizes, in the context of a particular application, a ‘dispersed’ distribution. Further, the hypothetical sets of subfrequencies are the same irrespective of the specific index we wish to use.

In the next two sections, we provide graphical illustrations of this conceptual approach. Following this, we point out a number of issues that arise if corpus parts differ considerably in size. It will be seen that the conceptual approach, which is used in part by Gries (2024) and in full in the tlda package, then runs into problems: the dispersion score based on the actually observed subfrequencies may lie above or below the scores yielded by the conceptually defined extremes. By default, the tlda package constrains frequency-adjusted scores to the unit interval; this only sidesteps the issue, however, rather than solving it. We therefore close with a number of recommendations, depending on the dispersion measure of interest.

Illustrative data

We will illustrate the conceptual approach to finding dispersion extremes using an imaginary set of corpus data, which is visualized below. Our corpus includes 30 texts, which represent our corpus parts. These texts differ in length – in the figure below, each square represents a word token. The longest text has 30 words, the shortest one 2 words. The filled squares denote occurrences of the item; overall, there are 61 occurrences, which means that the item has a corpus frequency of 61.

Illustrative set of data.

Our task is now to reallocate these 61 occurrences across corpus parts in a way that produces a minimally dispersed and a maximally dispersed set of subfrequencies. The way we assign observations to corpus parts depends on which meaning of ‘dispersion’ we prefer for the task at hand: pervasiveness or evenness. We will illustrate how these distributional features influence the hypothetical extremes we design.

Finding extremes based on pervasiveness

If we foreground pervasiveness, our goal is to design sets of subfrequencies in which the item occurs in (i) as few corpus parts as possible (dispersion minimum) and (ii) as many corpus parts as possible (dispersion maximum).

To find the dispersion minimum, we reallocate all occurrences of the item to the largest corpus part(s), as this ensures that it is spread across the smallest possible number of corpus parts. This is illustrated in the following figure.

A **minimally pervasive** distribution.

To create a distribution with the widest possible spread of the item across corpus parts, we first arrange the corpus parts by size, in decreasing order (for our illustrative data, this has already been done). Then one occurrence of the item is assigned to each part, starting with the largest part and continuing toward the smallest. If the number of occurrences of the item exceeds the number of corpus parts, we start afresh with the largest corpus part; if the smallest corpus part(s) are already “full” and cannot hold further occurrences, the allocation skips them and returns to the largest corpus part. The result of this allocation process is illustrated in the figure below.

A **maximally pervasive** distribution.
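For concreteness, the following base-R sketch constructs both pervasiveness-based extremes from an item’s corpus frequency f and a vector of part sizes. It is an illustrative implementation of the allocation scheme just described, not the package’s internal code, and all names are ours:

extremes_pervasive <- function(f, partsize) {
  stopifnot(f <= sum(partsize))  # the corpus must be able to hold all occurrences
  partsize <- sort(partsize, decreasing = TRUE)
  k <- length(partsize)

  # Minimum: pour all occurrences into the largest part(s)
  dist_min <- numeric(k)
  left <- f
  for (i in seq_len(k)) {
    dist_min[i] <- min(left, partsize[i])
    left <- left - dist_min[i]
  }

  # Maximum: deal out occurrences one at a time, largest part first,
  # skipping parts that are already full
  dist_max <- numeric(k)
  left <- f
  while (left > 0) {
    open <- which(dist_max < partsize)
    fill <- head(open, left)
    dist_max[fill] <- dist_max[fill] + 1
    left <- left - length(fill)
  }

  list(min = dist_min, max = dist_max)
}

extremes_pervasive(f = 10, partsize = c(6, 4, 3, 2))
#> $min
#> [1] 6 4 0 0
#> 
#> $max
#> [1] 3 3 2 2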

Finding extremes based on evenness

If we instead prefer to build extreme subfrequencies based on evenness, we first need to decide on how to define evenness. The idea used by Gries’s (2008) deviation of proportions seems like a suitable approach: If an item is distributed evenly across corpus parts, it should be spread across these in proportion to their size. Thus, if part A accounts for 20% of the words in the corpus, it should also contain roughly 20% of the occurrences of the item. The distribution of subfrequencies is then proportional to that of the part sizes, and may be said to be balanced or even.
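To make this concrete, here is the standard formula behind the deviation of proportions in base R (the function and variable names are ours): for each part, it compares the share of the item’s occurrences the part contains with the share of the corpus the part represents.

prop_DP <- function(subfreq, partsize) {
  prop_obs <- subfreq / sum(subfreq)    # observed share of the item's occurrences
  prop_exp <- partsize / sum(partsize)  # expected share = share of corpus size
  sum(abs(prop_obs - prop_exp)) / 2     # 0 = perfectly even; near 1 = very uneven
}

prop_DP(subfreq = c(20, 10, 5, 0), partsize = c(400, 300, 200, 100))
#> [1] 0.1714286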

In a maximally uneven (i.e. minimally dispersed) distribution, occurrences would be clustered in the smallest corpus part(s), where we least expect them. The following figure shows the set of maximally uneven subfrequencies. We start by filling in the smallest corpus part and then, if the number of occurrences exceeds the length of the smallest corpus part, we continue with the next-smallest and so on.

A **minimally even** distribution.

To construct a maximally even (i.e. maximally dispersed) distribution of subfrequencies, we assign occurrences to the corpus parts in proportion to their size. The figure below shows a set of evenly dispersed subfrequencies for our illustrative data.

A **maximally even** distribution.
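Analogously, the following sketch constructs both evenness-based extremes (again an illustration of the scheme described above, not the package’s internal code). Since subfrequencies must be whole numbers, proportional allocation requires a rounding rule; largest-remainder rounding, used here, is one possible choice:

extremes_even <- function(f, partsize) {
  stopifnot(f <= sum(partsize))
  partsize <- sort(partsize, decreasing = TRUE)
  k <- length(partsize)

  # Minimum: cluster all occurrences in the smallest part(s),
  # spilling over to the next-smallest once a part is full
  dist_min <- numeric(k)
  left <- f
  for (i in rev(seq_len(k))) {
    dist_min[i] <- min(left, partsize[i])
    left <- left - dist_min[i]
  }

  # Maximum: allocate in proportion to part size, rounding down and
  # then topping up the parts with the largest remainders
  exact <- f * partsize / sum(partsize)
  dist_max <- floor(exact)
  shortfall <- f - sum(dist_max)
  topup <- order(exact - dist_max, decreasing = TRUE)[seq_len(shortfall)]
  dist_max[topup] <- dist_max[topup] + 1

  list(min = dist_min, max = dist_max)
}

extremes_even(f = 10, partsize = c(6, 4, 3, 2))
#> $min
#> [1] 1 4 3 2
#> 
#> $max
#> [1] 4 3 2 1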

Issues

Linguistically implausible distributions

One issue that cannot be ignored is that the minimally dispersed distributions created by these methods are linguistically unrealistic: in many situations they imagine corpus parts that consist of only one word type, repeated many (perhaps thousands of) times. A more realistic scheme might put an upper limit on the number of times an item can occur in a corpus part. For instance, its proportional share could be capped at a certain normalized frequency, perhaps around 5% (the frequency of *the* in spoken and written English) or a somewhat higher value.
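Such a cap is easy to add to an allocation scheme. The following variant of the minimally pervasive construction (purely illustrative; the 5% default mirrors the figure mentioned above) fills parts only up to a fixed share of their size:

fill_min_capped <- function(f, partsize, cap = 0.05) {
  limit <- floor(cap * partsize)  # most occurrences a part may hold under the cap
  stopifnot(f <= sum(limit))      # the cap must leave room for all occurrences
  out <- numeric(length(partsize))
  left <- f
  for (i in order(partsize, decreasing = TRUE)) {
    out[i] <- min(left, limit[i])
    left <- left - out[i]
  }
  out
}

fill_min_capped(f = 160, partsize = c(2000, 1000, 500, 100))
#> [1] 100  50  10   0

With the cap in place, the occurrences spill over into three parts rather than being stacked in a single one.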

Dispersion scores may exceed minimum or maximum

Another issue arises when we are dealing with corpus parts of (very) different length. Distributions that would be considered minimally/maximally pervasive or minimally/maximally even may then yield a dispersion score that is less extreme than the one based on the actual distribution of subfrequencies. Whether a particular measure is affected depends on how the dispersion extremes were constructed, i.e. whether they were composed to minimize/maximize evenness or pervasiveness. As discussed below, Range is vulnerable if the extremes are based on evenness; S, DP, and DKL are vulnerable if they are based on pervasiveness; and D, D2, and DA can misbehave under both schemes.

This translates into the following preliminary recommendations: for Range, construct dispersion extremes based on pervasiveness; for S, DP, and DKL, construct them based on evenness; for D, D2, and DA, frequency-adjusted scores should be interpreted with caution under either scheme.

For S, DP, DKL, and Range, these recommendations are implemented as defaults in the corresponding functions (i.e. disp_S()/disp_S_tdm(), disp_DP()/disp_DP_tdm(), disp_DKL()/disp_DKL_tdm(), disp_R()/disp_R_tdm()).

While more work is required to understand how dispersion measures respond to different dispersion extremes, some observations may be made. It is easy to see why Range will be affected if dispersion extremes are defined based on evenness: for a minimally even distribution, we fill in corpus parts starting with the smallest one(s). The number of parts containing the item may therefore exceed that in the actual corpus, especially if the smallest parts are very small.

The reason why S, DP, and DKL do not work well with maximally/minimally pervasive distributions may be that these allocation schemes completely disregard the size of corpus parts, which can lead to very disproportionate distributions of subfrequencies. Thus, if we allocate one occurrence to each corpus part, this may produce an exceptionally high normalized frequency for the shortest part, but a very low one for the longest part.

It is less straightforward to see why D, D2, and DA may perform poorly under both schemes. What these measures have in common, however, is that they rely on normalized subfrequencies. Since the allocation scheme fills in corpus parts until they are “full” (see the first issue above), extremely high normalized frequencies may result. This is especially problematic if the smallest corpus parts are very small.
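A toy calculation illustrates how quickly normalized subfrequencies explode in tiny parts (the numbers are made up):

subfreq  <- c(1, 1, 1)       # one occurrence in each of three parts
partsize <- c(1000, 100, 2)  # part sizes in running words
subfreq / partsize           # normalized subfrequencies
#> [1] 0.001 0.010 0.500

A single occurrence in a 2-word part corresponds to a normalized frequency of .5, orders of magnitude above the other parts.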

Biber et al.’s (2016) set of lexical items seems particularly suitable for further investigation of these issues: coupled with a set of corpus parts that differ dramatically in size, it allows us to identify which measures and items are affected.

The built-in data set biber150_spokenBNC1994 treats individual speakers in the Spoken BNC1994 as corpus parts. The word counts are distributed very unevenly across speakers, ranging from 1 to 68,136. This extremely skewed distribution of part sizes is apparent in the following histogram:

# Adjust plot margins, storing the previous settings so they can be restored
oldpar <- par(mar = c(4, 4, 1, 0.3), xpd = TRUE)

hist(
  biber150_spokenBNC1994[1, ],   # the first row holds the part sizes
  main = NULL,
  xlab = "Size of corpus parts (speakers)",
  breaks = seq(0, 70000, length.out = 40),
  col = "grey60")

# Restore the previous graphics settings
par(oldpar)

We start by calculating frequency-adjusted scores for the 150 items in this dataset. Note that we use unit_interval = FALSE to avoid forcing adjusted scores into the unit interval (i.e. replacing scores smaller than 0 with 0, and scores greater than 1 with 1).

DM_even <- disp_tdm(
  biber150_spokenBNC1994, 
  row_partsize = "first",
  freq_adjust = TRUE,
  freq_adjust_method = "even",
  unit_interval = FALSE,
  print_score = FALSE,
  verbose = FALSE,
  suppress_warning = TRUE)

DM_pervasive <- disp_tdm(
  biber150_spokenBNC1994, 
  row_partsize = "first",
  freq_adjust = TRUE,
  freq_adjust_method = "pervasive",
  unit_interval = FALSE,
  print_score = FALSE,
  verbose = FALSE,
  suppress_warning = TRUE)

We then examine, for each dispersion measure, the range of scores across the 150 items, starting with the adjusted scores based on freq_adjust_method = "even":

round(
  apply(
    DM_even,
    2,
    range, na.rm = TRUE),
  2)
#>      Rrel_nofreq D_nofreq D2_nofreq S_nofreq DP_nofreq DA_nofreq DKL_nofreq
#> [1,]       -0.07    -4.35     -1.89     0.00      0.00     -0.77       0.10
#> [2,]        1.00     0.63      0.85     0.99      0.93      0.60       0.96

Then we count, for each measure, how many items receive adjusted scores outside the unit interval:

apply(
  DM_even,
  2,
  function(x){
    sum(x < 0 | x > 1, na.rm = TRUE)
  })
#> Rrel_nofreq    D_nofreq   D2_nofreq    S_nofreq   DP_nofreq   DA_nofreq 
#>           3          67          15           0           0          19 
#>  DKL_nofreq 
#>           0

Next, we consider the range of the adjusted scores based on freq_adjust_method = "pervasive":

round(
  apply(
    DM_pervasive,
    2,
    range, na.rm = TRUE),
  2)
#>      Rrel_nofreq D_nofreq D2_nofreq S_nofreq DP_nofreq DA_nofreq DKL_nofreq
#> [1,]           0     0.00      0.00    -0.67     -0.53      0.00      -1.29
#> [2,]           1     1.06      1.16     1.76      2.05      2.49       4.06

Again, the number of items with adjusted scores outside the unit interval, for each measure:

apply(
  DM_pervasive,
  2,
  function(x){
    sum(x < 0 | x > 1, na.rm = TRUE)
  })
#> Rrel_nofreq    D_nofreq   D2_nofreq    S_nofreq   DP_nofreq   DA_nofreq 
#>           0          60          61          75          88          62 
#>  DKL_nofreq 
#>          92

References

Biber, Douglas, Randi Reppen, Erin Schnur & Romy Ghanem. 2016. On the (non)utility of Juilland’s D to measure lexical dispersion in large corpora. International Journal of Corpus Linguistics 21(4). 439–464. doi: 10.1075/ijcl.21.4.01bib

Gries, Stefan Th. 2008. Dispersions and adjusted frequencies in corpora. International Journal of Corpus Linguistics 13(4). 403–437. doi: 10.1075/ijcl.13.4.02gri

Gries, Stefan Th. 2022. What do (most of) our dispersion measures measure (most)? Dispersion? Journal of Second Language Studies 5(2). 171–205. doi: 10.1075/jsls.21029.gri

Gries, Stefan Th. 2024. Frequency, dispersion, association, and keyness: Revising and tupleizing corpus-linguistic measures. Amsterdam: Benjamins. doi: 10.1075/scl.115
