The tlda package offers various functions for corpus-linguistic dispersion analysis (see Gries 2020; Sönning 2025). The dispersion measures that are implemented are parts-based, which means that they assess the distribution of an item across corpus parts. For any type of dispersion analysis carried out with the tlda package, you therefore need to supply two variables:

- the subfrequencies, i.e. the number of occurrences of the item in each corpus part
- the part sizes, i.e. the number of word tokens in each corpus part
You can provide these data in two forms; the choice between them mainly depends on whether you want to measure the dispersion of a single item only or of multiple items simultaneously.
Accordingly, there are two versions of each function: The function disp(), for instance, calculates dispersion measures for data provided in vector form. The _tdm suffix in a function name indicates the version that works with a term-document matrix. The function disp_tdm() therefore offers the same functionality as disp(), the only difference being that it takes data provided in matrix format.
When analyzing the dispersion of multiple (and possibly all) items in a corpus, it is convenient (and efficient) to use what is referred to as a term-document matrix (TDM). This is a tabular arrangement that lists the subfrequencies for multiple items. The following is an excerpt from a TDM for ICE-GB (Nelson et al. 2002), where items represent word forms. It shows a selection of 10 items and 8 corpus parts (i.e. text files):
#> w2c-006 s2a-023 w2a-010 w1b-004 s1b-010 w2a-011 s2b-015 w2d-010
#> ninety 0 0 0 0 0 0 0 0
#> from 3 6 8 8 3 8 15 11
#> large 0 0 0 0 0 0 1 0
#> another 1 0 0 1 1 0 2 2
#> somebody 0 2 0 0 0 0 0 0
#> a 54 30 40 31 61 33 40 79
#> time 4 4 4 0 4 0 6 4
#> oh 0 0 0 0 6 0 0 0
#> twenty 0 2 0 0 1 0 1 0
#> may 1 4 5 6 2 14 0 9
Importantly, the rows in a TDM represent items and the columns represent corpus parts. A ‘proper’ TDM includes all words occurring in the corpus. In that case, we can retrieve the size of the corpus parts by summing the subfrequencies in each column (i.e. for each corpus part). This can be done in R with the function addmargins(), which appends the column sums as a new row at the end of the table (setting the argument margin = 1 appends a row of column sums rather than a column of row sums).
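As a minimal sketch, assuming a toy TDM with made-up counts (in a ‘proper’ TDM, which contains all items, the column sums would equal the part sizes):

tdm_toy <- matrix(
  c(3, 6, 8,
    1, 0, 2),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("from", "another"), c("part1", "part2", "part3")))

addmargins(tdm_toy, margin = 1)  # appends a row of column sums (labelled "Sum")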
There are many settings, however, where we work with ‘improper’ TDMs, where the rows only represent a selection of items in the corpus. This is the case if we focus on certain lexical items of interest, or if we are dealing with phraseological or syntactic structures. In such a case, the size of corpus parts cannot be recovered from the TDM; instead, it needs to be supplied as a separate row.
The tlda package includes a number of TDMs that hold distributional information for a selection of 150 lexical items (a list compiled by Biber et al. 2016 for the evaluation of dispersion measures). In these TDMs, the first row (word_count) records the part sizes (i.e. the number of word tokens). The TDM begins like this:
biber150_ice_gb[1:10, 1:8]
#> s1a-001 s1a-002 s1a-003 s1a-004 s1a-005 s1a-006 s1a-007 s1a-008
#> word_count 2195 2159 2287 2290 2120 2060 2025 2177
#> a 50 38 44 67 35 34 37 29
#> able 2 4 4 0 0 0 0 0
#> actually 3 6 2 2 6 3 0 8
#> after 0 0 0 0 4 1 1 0
#> against 0 0 0 0 0 0 0 0
#> ah 1 0 0 0 1 6 1 2
#> aha 0 0 0 0 0 0 0 0
#> all 2 5 6 9 7 5 8 13
#> among 0 0 0 0 0 0 0 0
The part sizes can alternatively be given in the last row of the table. In general, you need to tell the function where to find them.
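For example, if your own TDM stored the part sizes in its last row, the call could look like the following sketch (my_tdm is a hypothetical object, and it is assumed here that row_partsize accepts "last" as well as "first"; the argument is explained in the section on the _tdm functions below):

disp_tdm(
  tdm = my_tdm,          # hypothetical TDM with part sizes in the last row
  row_partsize = "last"  # tell the function where to find the part sizes
)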
Parts-based dispersion scores range between 0 and 1, and the conventional scaling (see Juilland & Chang-Rodriguez 1964; Carroll 1970; Rosengren 1971) assigns higher scores to items that are more dispersed, i.e. that show a wider spread or a more even/balanced distribution across corpus parts. Unfortunately, some more recent dispersion measures have the reverse scaling, which can cause confusion. For this reason, the functions in the tlda package explicitly control the directionality of scaling and treat it separately (!) from the original formula. The default setting for all (!) dispersion measures uses conventional scaling, where higher values reflect higher dispersion (directionality = "conventional"). The reverse scaling can be requested by specifying directionality = "gries".
The functions disp() and disp_tdm() calculate seven dispersion measures: relative range (Rrel), Juilland’s D, Carroll’s D2, Rosengren’s S, Gries’s deviation of proportions (DP), DA (Burch et al. 2017), and DKL, which is based on the Kullback-Leibler divergence.
However, these functions do not provide finer control over the way in which a specific dispersion measure is calculated. This limitation concerns indices for which multiple formulas (or versions, or computational procedures) exist in the literature: range, DP, DA, and DKL.
By default, disp() and disp_tdm() print information that tells you which version of these measures they use. This printout also reminds you of the directionality of scaling that has been applied. The following code returns dispersion scores for the item a in ICE-GB (Nelson et al. 2002):
disp(
subfreq = biber150_ice_gb[2,], # row 2 in the TDM represents "a"
partsize = biber150_ice_gb[1,] # row 1 in the TDM contains the part sizes
)
#> Rrel D D2 S DP DA DKL
#> 1.0000000 0.9870710 0.9933147 0.9792124 0.8883657 0.8392881 0.9439996
#>
#> Scores follow conventional scaling:
#> 0 = maximally uneven/bursty/concentrated distribution (pessimum)
#> 1 = maximally even/dispersed/balanced distribution (optimum)
#>
#> For Gries's DP, the function uses the modified version suggested by
#> Egbert et al. (2020)
#>
#> For DKL, standardization to the unit interval [0,1] is based on the
#> odds-to-probability transformation, see Gries (2024: 90)
While disp() and disp_tdm() use sensible default settings, you may want to have more control over the way in which these indices are calculated. You can then turn to the following functions:

- disp_R() / disp_R_tdm(): allows you to use different versions of range (see below)
- disp_DP() / disp_DP_tdm(): here you can choose between different formulas that have been suggested for Gries’s deviation of proportions
- disp_DA() / disp_DA_tdm(): you can select the computational procedure
- disp_DKL() / disp_DKL_tdm(): allows for the choice between different methods for standardizing KLD scores (see Gries 2024: 90)
- disp_S() / disp_S_tdm(): for Rosengren’s S, uses sensible defaults for frequency adjustment

The tlda package also allows you to adjust dispersion scores for the frequency of the item, by specifying freq_adjust = TRUE. This addresses an important concern raised by Gries (2022, 2024), namely that all parts-based dispersion measures provide a score that in fact blends information on dispersion and frequency. He therefore proposed a method for ‘partialing out’ frequency, i.e. for removing its unwanted effect on the dispersion score we obtain.
To address this issue, Gries (2022; 2024) suggests that the dispersion score for a specific item should be re-expressed based on its dispersion potential in the corpus at hand. The dispersion potential refers to the lowest and highest possible score an item can obtain given its overall corpus frequency as well as the number (and size) of the corpus parts. Dispersion is then re-expressed relative to these endpoints, where the dispersion pessimum is set to 0, and the dispersion optimum to 1 (using conventional scaling). The frequency-adjusted score falls between these bounds and expresses how close the observed distribution is to the theoretical maximum and minimum. In Gries (2024), this is referred to as the min-max transformation.
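As a minimal sketch of the arithmetic involved (the function name and the scores below are illustrative, not part of the package):

# Min-max transformation: re-express an observed score relative to the
# dispersion potential (conventional scaling: pessimum -> 0, optimum -> 1)
min_max_adjust <- function(d_obs, d_min, d_max) {
  (d_obs - d_min) / (d_max - d_min)
}
min_max_adjust(d_obs = 0.75, d_min = 0.20, d_max = 0.98)
#> [1] 0.7051282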
This adjustment therefore requires a maximally and a minimally dispersed distribution of the item across the parts. These hypothetical extremes may be determined in different ways. Gries (2022, 2024) suggests a computationally expensive strategy, which uses a trial-and-error approach to find the distribution that maximizes the dispersion score yielded by a particular index (e.g. DKL). The tlda package uses a different approach: it finds the maximally and minimally dispersed distributions independently of the specific measure(s) applied. Instead, it operates based on the distributional features of pervasiveness or evenness. This is to say that it constructs a distribution based on what we may consider, conceptually, as “highly dispersed”. The user must decide whether the extremes should represent distributions where the item is maximally/minimally pervasive (freq_adjust_method = "pervasive"), i.e. whether it is spread as widely or narrowly as possible across corpus parts, or whether the extremes should mark distributions that are maximally/minimally even (freq_adjust_method = "even"), i.e. whether the spread of the item across corpus parts is maximally balanced or maximally concentrated (bursty/clumpy). More details and explanations can be found in the vignette vignette("frequency-adjustment").
We illustrate the use of the tlda package by focussing on the distribution of the item actually in the Spoken BNC2014. The “corpus parts” are the 668 speakers. We start by extracting the distributional information for actually from the built-in TDM biber150_spokenBNC2014.
speaker_word_count <- biber150_spokenBNC2014["word_count",]
subfreq_actually <- biber150_spokenBNC2014["actually",]
disp()
We start with the umbrella function disp():
disp(
subfreq = subfreq_actually,
partsize = speaker_word_count
)
#> Rrel D D2 S DP DA DKL
#> 0.8817365 0.9662460 0.9440567 0.8908132 0.7461380 0.5466095 0.7642072
#>
#> Scores follow conventional scaling:
#> 0 = maximally uneven/bursty/concentrated distribution (pessimum)
#> 1 = maximally even/dispersed/balanced distribution (optimum)
#>
#> For Gries's DP, the function uses the modified version suggested by
#> Egbert et al. (2020)
#>
#> For DKL, standardization to the unit interval [0,1] is based on the
#> odds-to-probability transformation, see Gries (2024: 90)
We can change to the reverse scaling by supplying directionality = "gries". Note how the information given in the printout changes accordingly.
disp(
subfreq = subfreq_actually,
partsize = speaker_word_count,
directionality = "gries"
)
#> Rrel D D2 S DP DA DKL
#> 0.11826347 0.03375403 0.05594325 0.10918680 0.25386196 0.45339049 0.23579282
#>
#> Scores follow scaling used by Gries (2008):
#> 0 = maximally even/dispersed/balanced distribution (optimum)
#> 1 = maximally uneven/bursty/concentrated distribution (pessimum)
#>
#> For Gries's DP, the function uses the modified version suggested by
#> Egbert et al. (2020)
#>
#> For DKL, standardization to the unit interval [0,1] is based on the
#> odds-to-probability transformation, see Gries (2024: 90)
For frequency-adjusted scores, we supply freq_adjust = TRUE. By default, the method even is used, which gives priority to evenness when building a minimally and maximally dispersed reference distribution (see above).
disp(
subfreq = subfreq_actually,
partsize = speaker_word_count,
freq_adjust = TRUE
)
#> Rrel_nofreq D_nofreq D2_nofreq S_nofreq DP_nofreq DA_nofreq
#> 0.9004975 0.8074350 0.8752113 0.8910372 0.7492699 0.5568662
#> DKL_nofreq
#> 0.7396345
#>
#> Dispersion scores are adjusted for frequency using the min-max
#> transformation (see Gries 2024: 196-208); please note that the
#> method implemented here does not work well if corpus parts differ
#> considerably in size; see vignette('frequency-adjustment')
#>
#> Scores follow conventional scaling:
#> 0 = maximally uneven/bursty/concentrated distribution (pessimum)
#> 1 = maximally even/dispersed/balanced distribution (optimum)
#>
#> For Gries's DP, the function uses the modified version suggested by
#> Egbert et al. (2020)
#>
#> For DKL, standardization to the unit interval [0,1] is based on the
#> odds-to-probability transformation, see Gries (2024: 90)
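To give priority to pervasiveness instead, we would set freq_adjust_method = "pervasive" (output not shown here):

disp(
  subfreq = subfreq_actually,
  partsize = speaker_word_count,
  freq_adjust = TRUE,
  freq_adjust_method = "pervasive"
)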
If you require more flexibility, you can use one of the functions for specific dispersion measures.
For Range, the function disp_R() offers three choices:

- relative range (type = "relative"), i.e. the proportion of corpus parts containing at least one occurrence of the item (the default)
- absolute range (type = "absolute"), i.e. the number of corpus parts containing at least one occurrence of the item
- relative range with size (type = "relative_withsize"), the proportional version that takes into account the size of the corpus parts (see Gries 2022: 179-180; Gries 2024: 27-28)

The following code returns relative range with size:
disp_R(
subfreq = subfreq_actually,
partsize = speaker_word_count,
type = "relative_withsize"
)
#> Rrel_withsize
#> 0.9914442
#>
#> Scores represent relative range, i.e. the proportion of corpus parts
#> containing at least one occurrence of the item. The size of the
#> corpus parts is taken into account, see Gries (2022: 179-180),
#> Gries (2024: 27-28)
For Gries’s deviation of proportions, the function disp_DP() allows you to choose from three different formulas that are found in the literature. The following code uses the original formula in Gries (2008), but with conventional (!) scaling (see information in the printout):
disp_DP(
subfreq = subfreq_actually,
partsize = speaker_word_count,
formula = "gries_2008"
)
#> DP_nofreq
#> 0.7492699
#>
#> The dispersion score is adjusted for frequency using the min-max
#> transformation (see Gries 2024: 196-208); please note that the
#> method implemented here does not work well if corpus parts differ
#> considerably in size; see vignette('frequency-adjustment')
#>
#> Scores follow conventional scaling:
#> 0 = maximally uneven/bursty/concentrated distribution (pessimum)
#> 1 = maximally even/dispersed/balanced distribution (optimum)
#>
#> Computed using the original version proposed by Gries (2008)
We can also compare the scores produced by the three formulas. We now suppress the printing of the score (print_score = FALSE) and the background information (verbose = FALSE):
compare_DPs <- rbind(
disp_DP(subfreq = subfreq_actually,
partsize = speaker_word_count,
formula = "gries_2008",
verbose = FALSE, print_score = FALSE),
disp_DP(subfreq = subfreq_actually,
partsize = speaker_word_count,
formula = "lijffijt_gries_2012",
verbose = FALSE, print_score = FALSE),
disp_DP(subfreq = subfreq_actually,
partsize = speaker_word_count,
formula = "egbert_etal_2020",
verbose = FALSE, print_score = FALSE
))
rownames(compare_DPs) <- c(
"Gries (2008)",
"Lijffijt & Gries (2012)",
"Egbert et al. (2020)"
)
compare_DPs
#> DP_nofreq
#> Gries (2008) 0.7492699
#> Lijffijt & Gries (2012) 0.7492653
#> Egbert et al. (2020) 0.7492653
For DA, three computational procedures are available. The basic version is a direct implementation of the actual formula for this measure, which is computationally expensive if the number of corpus parts is large. Wilcox (1973: 343) gives a shortcut version, which is much quicker (see this blog post). Finally, shortcut_mod is a slightly adapted form of the shortcut (EXPERIMENTAL), which ensures that scores do not exceed 1 (conventional scaling). The following code uses Wilcox’s (1973) quicker approximate procedure:
disp_DA(
subfreq = subfreq_actually,
partsize = speaker_word_count,
procedure = "shortcut"
)
#> DA
#> 0.5481088
#>
#> Scores follow conventional scaling:
#> 0 = maximally uneven/bursty/concentrated distribution (pessimum)
#> 1 = maximally even/dispersed/balanced distribution (optimum)
#>
#> Computed using the computational shortcut suggested by
#> Wilcox (1967: 343, 'MDA', column 4)
Again, we can compare the scores produced by the different computational procedures, suppressing the printing of the score and the background information:
compare_DAs <- rbind(
disp_DA(subfreq = subfreq_actually,
partsize = speaker_word_count,
procedure = "basic",
verbose = FALSE, print_score = FALSE),
disp_DA(subfreq = subfreq_actually,
partsize = speaker_word_count,
procedure = "shortcut",
verbose = FALSE, print_score = FALSE),
disp_DA(subfreq = subfreq_actually,
partsize = speaker_word_count,
procedure = "shortcut_mod",
verbose = FALSE, print_score = FALSE
))
rownames(compare_DAs) <- c(
"Basic procedure",
"Shortcut",
"Shortcut (modified)"
)
compare_DAs
#> DA
#> Basic procedure 0.5466095
#> Shortcut 0.5481088
#> Shortcut (modified) 0.5472882
For DKL, we may opt for a specific standardization method, which refers to the transformation that maps the Kullback-Leibler divergence to the unit interval [0,1]. The method used in Gries (2021: 20) can be implemented using standardization = "base_e":
disp_DKL(
subfreq = subfreq_actually,
partsize = speaker_word_count,
standardization = "base_e"
)
#> DKL
#> 0.7345144
#>
#> Scores follow conventional scaling:
#> 0 = maximally uneven/bursty/concentrated distribution (pessimum)
#> 1 = maximally even/dispersed/balanced distribution (optimum)
#>
#> Standardization to the unit interval [0,1] using base e,
#> see Gries (2021: 20)
We compare the output of different standardization methods:
compare_DKLs <- rbind(
disp_DKL(subfreq = subfreq_actually,
partsize = speaker_word_count,
standardization = "o2p",
verbose = FALSE, print_score = FALSE),
disp_DKL(subfreq = subfreq_actually,
partsize = speaker_word_count,
standardization = "base_e",
verbose = FALSE, print_score = FALSE),
disp_DKL(subfreq = subfreq_actually,
partsize = speaker_word_count,
standardization = "base_2",
verbose = FALSE, print_score = FALSE
))
rownames(compare_DKLs) <- c(
"Odds-to-probability",
"Base e",
"Base 2"
)
compare_DKLs
#> DKL
#> Odds-to-probability 0.7642072
#> Base e 0.7345144
#> Base 2 0.8074553
All of the operations can be applied to a term-document matrix (TDM) using the _tdm-suffixed variants of the functions used above. In these functions, the arguments subfreq and partsize are replaced by the following two:

- tdm: a term-document matrix, where rows represent items and columns represent corpus parts; it must also contain a row giving the size of the corpus parts (first or last row in the TDM)
- row_partsize: a character string indicating which row in the TDM contains the size of the corpus parts

In the TDMs that ship with the tlda package, it is the first row that contains the part sizes. You can obtain dispersion scores for all 150 items in the TDM for ICE-GB (biber150_ice_gb) as follows. Since the function disp_tdm() returns seven indices, the output is a matrix, which we store as a new object DM_ice_gb.
DM_ice_gb <- disp_tdm(
tdm = biber150_ice_gb,
row_partsize = "first",
print_score = FALSE,
verbose = FALSE)
#> Warning in disp_tdm(tdm = biber150_ice_gb, row_partsize = "first", print_score = FALSE, :
#> For some item(s), all subfrequencies are 0; returning NA in this case
We convert the matrix to a data frame, since this makes it easier to work with the results.
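DM_ice_gb <- data.frame(DM_ice_gb)  # convert the matrix to a data frame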
We then print only the first ten items, rounding scores to two decimal places:
round(DM_ice_gb[1:10,], 2)
#> Rrel D D2 S DP DA DKL
#> a 1.00 0.99 0.99 0.98 0.89 0.84 0.94
#> able 0.45 0.93 0.84 0.42 0.46 0.31 0.42
#> actually 0.58 0.93 0.86 0.48 0.46 0.32 0.44
#> after 0.71 0.95 0.91 0.64 0.59 0.45 0.55
#> against 0.38 0.92 0.81 0.34 0.38 0.25 0.37
#> ah 0.17 0.86 0.67 0.14 0.17 0.10 0.25
#> aha 0.01 0.50 0.22 0.01 0.01 0.01 0.12
#> all 0.97 0.97 0.96 0.88 0.74 0.64 0.76
#> among 0.21 0.89 0.72 0.20 0.21 0.15 0.29
#> an 0.98 0.97 0.97 0.91 0.76 0.67 0.80
We can now inspect the distribution of scores for these 150 items visually. For instance, we may be interested in the distribution of the DP scores.
par(mar = c(4, 4, 1, 0.3), xpd = TRUE)
hist(
DM_ice_gb$DP,
main = NULL,
xlab = "DP",
xlim = c(0,1),
breaks = seq(0,1,.05),
col = "grey60")
The same plot for Juilland’s D demonstrates its sensitivity to the number of corpus parts: Most scores are bunched up near 1, since we are calculating dispersion across 500 corpus parts (i.e. text files).
par(mar = c(4, 4, 1, 0.3), xpd = TRUE)
hist(
DM_ice_gb$D,
main = NULL,
xlab = "DP",
xlim = c(0,1),
breaks = seq(0,1,.05),
col = "grey60")
The same is true, although less dramatically, for Carroll’s D2:
par(mar = c(4, 4, 1, 0.3), xpd = TRUE)
hist(
DM_ice_gb$D2,
main = NULL,
xlab = "DP",
xlim = c(0,1),
breaks = seq(0,1,.05),
col = "grey60")
To inspect the correlation between the scores produced by different measures, we can draw a scatterplot matrix:
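# A minimal sketch using base R's pairs() function (plot settings illustrative):
pairs(DM_ice_gb, cex = 0.5)  # pairwise scatterplots of the seven measures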
To inspect the association of dispersion scores with frequency, we add a new column to the data frame. To obtain the corpus frequency of the 150 items, we add up their subfrequencies by summing across the rows in the TDM biber150_ice_gb, excluding row 1 (!), which contains the sizes of the corpus parts.
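DM_ice_gb$frequency <- rowSums(biber150_ice_gb[-1,])  # corpus frequency of each item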
Now we can look at the association between dispersion scores and the (logged) corpus frequency of these 150 items. The following scatterplot looks at DP:
par(mar = c(4, 4, 1, 0.3), xpd = TRUE)
plot(
DM_ice_gb$DP ~ log(DM_ice_gb$frequency),
xlab = "Log frequency",
ylab = "DP",
ylim = c(0,1))
We can express this association using Spearman’s rank correlation coefficient:
cor(
DM_ice_gb$DP,
log(DM_ice_gb$frequency),
method = "spearman",
use = "complete.obs")
#> [1] 0.9576285
Let us now adjust scores for frequency by repeating the above steps but supplying freq_adjust = TRUE to the disp_tdm() function. Note that this takes a bit longer to run.
DM_ice_gb_nofreq <- disp_tdm(
tdm = biber150_ice_gb,
row_partsize = "first",
freq_adjust = TRUE,
freq_adjust_method = "even",
print_score = FALSE,
verbose = FALSE)
#> Warning in disp_tdm(tdm = biber150_ice_gb, row_partsize = "first", freq_adjust = TRUE, :
#> For some item(s), all subfrequencies are 0; returning NA in this case
#> Warning in disp_tdm(tdm = biber150_ice_gb, row_partsize = "first", freq_adjust = TRUE, :
#> For some item(s), the corpus frequency is 1; no frequency adjustment
#> made in this case; function returns unadjusted dispersion score
DM_ice_gb_nofreq <- data.frame(DM_ice_gb_nofreq)
DM_ice_gb_nofreq$frequency <- rowSums(biber150_ice_gb[-1,])
str(DM_ice_gb_nofreq)
#> 'data.frame': 150 obs. of 8 variables:
#> $ Rrel_nofreq: num 1 0.581 0.577 0.707 0.526 ...
#> $ D_nofreq : num 0.956 0.957 0.939 0.962 0.945 ...
#> $ D2_nofreq : num 0.989 0.879 0.859 0.914 0.852 ...
#> $ S_nofreq : num 0.979 0.52 0.482 0.645 0.456 ...
#> $ DP_nofreq : num 0.889 0.566 0.484 0.646 0.509 ...
#> $ DA_nofreq : num 0.839 0.411 0.337 0.509 0.351 ...
#> $ DKL_nofreq : num 0.934 0.485 0.38 0.525 0.444 ...
#> $ frequency : num 20483 390 1067 871 364 ...
The association with frequency is now attenuated:
oldpar <- par(mar = c(5.1, 4.1, 4.1, 2.1))
par(mar = c(4, 4, 1, 0.3), xpd = TRUE)
plot(
DM_ice_gb_nofreq$DP_nofreq ~ log(DM_ice_gb_nofreq$frequency),
xlab = "Log frequency",
ylab = "DP",
ylim = c(0,1))
This is also reflected in Spearman’s rank correlation coefficient:
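cor(
  DM_ice_gb_nofreq$DP_nofreq,
  log(DM_ice_gb_nofreq$frequency),
  method = "spearman",
  use = "complete.obs")  # output not shown; mirrors the earlier call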
Biber, Douglas, Randi Reppen, Erin Schnur & Romy Ghanem. 2016. On the (non)utility of Juilland’s D to measure lexical dispersion in large corpora. International Journal of Corpus Linguistics 21(4). 439–464. doi: 10.1075/ijcl.21.4.01bib
Burch, Brent, Jesse Egbert & Douglas Biber. 2017. Measuring and interpreting lexical dispersion in corpus linguistics. Journal of Research Design and Statistics in Linguistics and Communication Science 3(2). 189–216. doi: 10.1558/jrds.33066
Carroll, John B. 1970. An alternative to Juilland’s usage coefficient for lexical frequencies and a proposal for a standard frequency index. Computer Studies in the Humanities and Verbal Behaviour 3(2). 61–65. doi: 10.1002/j.2333-8504.1970.tb00778.x
Egbert, Jesse, Brent Burch & Douglas Biber. 2020. Lexical dispersion and corpus design. International Journal of Corpus Linguistics 25(1). 89–115. doi: 10.1075/ijcl.18010.egb
Gries, Stefan Th. 2008. Dispersions and adjusted frequencies in corpora. International Journal of Corpus Linguistics 13(4). 403–437. doi: 10.1075/ijcl.13.4.02gri
Gries, Stefan Th. 2020. Analyzing dispersion. In Magali Paquot & Stefan Th. Gries (eds.), A practical handbook of corpus linguistics, 99–118. New York: Springer. doi: 10.1007/978-3-030-46216-1_5
Gries, Stefan Th. 2021. A new approach to (key) keywords analysis: Using frequency, and now also dispersion. Research in Corpus Linguistics 9(2). 1–33. doi: 10.32714/ricl.09.02.02
Juilland, Alphonse G. & Eugenio Chang-Rodriguez. 1964. Frequency dictionary of Spanish words. The Hague: Mouton de Gruyter. doi: 10.1515/9783112415467
Keniston, Hayward. 1920. Common words in Spanish. Hispania 3(2). 85–96. doi: 10.2307/331305
Lijffijt, Jefrey & Stefan Th. Gries. 2012. Correction to Stefan Th. Gries’ ‘Dispersions and adjusted frequencies in corpora’. International Journal of Corpus Linguistics 17(1). 147–149. doi: 10.1075/ijcl.17.1.08lij
Lyne, Anthony A. 1985. The vocabulary of French business correspondence. Paris: Slatkine-Champion.
Nelson, Gerald, Sean Wallis & Bas Aarts. 2002. Exploring Natural Language: Working with the British Component of the International Corpus of English. Amsterdam: John Benjamins. doi: 10.1075/veaw.g29
Rosengren, Inger. 1971. The quantitative concept of language and its relation to the structure of frequency dictionaries. Études de linguistique appliquée (Nouvelle Série) 1. 103–127.
Sönning, Lukas. 2025. Advancing our understanding of dispersion measures in corpus research. Corpora. doi: 10.3366/cor.2025.0326