This package includes a number of utility functions and resources for analyzing linguistic data. At the moment, the focus is on corpus-linguistic dispersion analysis (see Gries 2020; Sönning 2025), which quantifies how widely and/or evenly an item is distributed across corpus parts. This kind of analysis requires two variables: the subfrequencies of the item (i.e. its number of occurrences in each corpus part) and the sizes of the corpus parts.
The package includes functions that allow you to calculate seven different parts-based dispersion measures, including their frequency-adjusted versions. Subfrequencies and part sizes can be supplied in two forms, either as vectors or as a term-document matrix. For some measures, different formulas are found in the literature, and the user can choose among them. The following indices are implemented: the relative range Rrel (Keniston 1920), Juilland's D (Juilland & Chang-Rodriguez 1964), Carroll's D2 (Carroll 1970), Rosengren's S (Rosengren 1971), Gries's DP (Gries 2008; Lijffijt & Gries 2012), DA (Burch et al. 2017), and DKL (Gries 2021).
The function disp() calculates seven dispersion measures based on two vectors:

- subfreq: a vector of subfrequencies, i.e. the number of occurrences of the item in each corpus part
- partsize: a vector with the sizes of the corpus parts

The argument directionality controls the scaling of the scores:

- conventional: higher values reflect a more even distribution
- gries: higher values reflect a less even distribution

The function prints information about the directionality of scaling and details about the formula used.
As an example, we will use data from Lyne’s (1985) classic study and consider the distribution of the French lemma ALLEMAND across the ten (nearly) equal-sized parts (‘Tenths’) of his corpus of French business correspondence. The part sizes are taken from Figure 1 (p. 85) and the subfrequencies from Appendix I (p. 299).
library(tlda)
x <- c(2, 0, 1, 1, 3, 0, 3, 0, 0, 0)
y <- c(8143, 8058, 8271, 8125, 7959, 7941, 8146, 8001, 8003, 7930)
disp(
subfreq = x,
partsize = y,
directionality = "conventional"
)
#>      Rrel         D        D2         S        DP        DA       DKL
#> 0.5000000 0.6038797 0.6521444 0.4763652 0.5009295 0.3081640 0.4665078
#>
#> Scores follow conventional scaling:
#> 0 = maximally uneven/bursty/concentrated distribution (pessimum)
#> 1 = maximally even/dispersed/balanced distribution (optimum)
#>
#> For Gries's DP, the function uses the modified version suggested by
#> Egbert et al. (2020)
#>
#> For DKL, standardization to the unit interval [0,1] is based on the
#> odds-to-probability transformation, see Gries (2024: 90)
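To give a sense of what lies behind one of these scores, the following sketch computes Gries's (2008) original DP directly from the two vectors. Since disp() uses the modified version suggested by Egbert et al. (2020), this hand calculation is only an approximation of the printed score, not the package's exact formula.

expected <- y / sum(y)   # relative sizes of the corpus parts
observed <- x / sum(x)   # share of the item's occurrences falling into each part
DP <- sum(abs(observed - expected)) / 2
1 - DP   # conventional scaling: higher values reflect a more even distribution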
If we prefer the reversed scaling used by Gries (2008), we can change the value of the argument directionality, like so:
disp(
subfreq = x,
partsize = y,
directionality = "gries"
)
#>      Rrel         D        D2         S        DP        DA       DKL
#> 0.5000000 0.3961203 0.3478556 0.5236348 0.4990705 0.6918360 0.5334922
#>
#> Scores follow scaling used by Gries (2008):
#> 0 = maximally even/dispersed/balanced distribution (optimum)
#> 1 = maximally uneven/bursty/concentrated distribution (pessimum)
#>
#> For Gries's DP, the function uses the modified version suggested by
#> Egbert et al. (2020)
#>
#> For DKL, standardization to the unit interval [0,1] is based on the
#> odds-to-probability transformation, see Gries (2024: 90)
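The two scalings are mirror images of one another: in the output above, each conventional score equals 1 minus the corresponding Gries-scaled score. Assuming that disp() returns the score vector it prints, this can be checked directly:

scores_conv  <- disp(subfreq = x, partsize = y, directionality = "conventional")
scores_gries <- disp(subfreq = x, partsize = y, directionality = "gries")
all.equal(scores_conv, 1 - scores_gries)   # TRUE for the example data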
To calculate dispersion for multiple items, it makes sense to provide the data in the form of a term-document matrix. In this tabular arrangement, each row represents an item and each column a corpus part.
A number of example data sets are shipped with the tlda package, including biber150_ice_gb, a term-document matrix recording the text-level subfrequencies for Biber et al.'s (2016) 150 lexical items in ICE-GB (Nelson et al. 2002). Importantly, the first row gives the number of word tokens in each text file. This is an excerpt from the matrix:
biber150_ice_gb[1:5, 1:5]
#>            s1a-001 s1a-002 s1a-003 s1a-004 s1a-005
#> word_count    2195    2159    2287    2290    2120
#> a               50      38      44      67      35
#> able             2       4       4       0       0
#> actually         3       6       2       2       6
#> after            0       0       0       0       4
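If your own data are not yet available in this shape, a small term-document matrix with the part sizes in the first row can be assembled with base R. The items, counts, and part names below are invented purely for illustration:

tdm_example <- rbind(
  word_count = c(5000, 4800, 5200, 5100),   # sizes of four hypothetical corpus parts
  example    = c(12, 0, 7, 3),              # subfrequencies of the item 'example'
  however    = c(4, 9, 1, 0)                # subfrequencies of the item 'however'
)
colnames(tdm_example) <- paste0("part_", 1:4)

A matrix of this form can then be passed, with row_partsize = "first", to the function introduced next.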
The function disp_tdm() calculates seven dispersion measures for each item in the matrix. The output is therefore also a matrix. In this function, the arguments subfreq and partsize are replaced by the following two:

- tdm: a term-document matrix, where rows represent items and columns corpus parts; it must also contain a row giving the size of the corpus parts (first or last row in the TDM)
- row_partsize: a character string indicating which row in the TDM contains the size of the corpus parts

The following call calculates dispersion scores for the first ten items in the term-document matrix (rounded to two decimal places).
disp_tdm(
tdm = biber150_ice_gb[1:11,],
row_partsize = "first",
digits = 2,
print_score = TRUE,
verbose = FALSE)
#> Rrel D D2 S DP DA DKL
#> a 1.00 0.99 0.99 0.98 0.89 0.84 0.94
#> able 0.45 0.93 0.84 0.42 0.46 0.31 0.42
#> actually 0.58 0.93 0.86 0.48 0.46 0.32 0.44
#> after 0.71 0.95 0.91 0.64 0.59 0.45 0.55
#> against 0.38 0.92 0.81 0.34 0.38 0.25 0.37
#> ah 0.17 0.86 0.67 0.14 0.17 0.10 0.25
#> aha 0.01 0.50 0.22 0.01 0.01 0.01 0.12
#> all 0.97 0.97 0.96 0.88 0.74 0.64 0.76
#> among 0.21 0.89 0.72 0.20 0.21 0.15 0.29
#> an 0.98 0.97 0.97 0.91 0.76 0.67 0.80
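Because the result is a plain numeric matrix with one row per item, it can be post-processed with base R. The following sketch, which assumes that disp_tdm() returns the score matrix shown above, ranks the ten items from most to least evenly dispersed according to DP:

res <- disp_tdm(tdm = biber150_ice_gb[1:11, ], row_partsize = "first", verbose = FALSE)
res[order(res[, "DP"], decreasing = TRUE), ]   # items sorted by DP, highest first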
Install the latest tlda release from CRAN:
install.packages("tlda")
Or install the development version of tlda from GitHub with:
# install.packages("pak")
pak::pak("lsoenning/tlda")
Biber, Douglas, Randi Reppen, Erin Schnur & Romy Ghanem. 2016. On the (non)utility of Juilland’s D to measure lexical dispersion in large corpora. International Journal of Corpus Linguistics 21(4). 439–464. doi: 10.1075/ijcl.21.4.01bib
Burch, Brent, Jesse Egbert & Douglas Biber. 2017. Measuring and interpreting lexical dispersion in corpus linguistics. Journal of Research Design and Statistics in Linguistics and Communication Science 3(2). 189–216. doi: 10.1558/jrds.33066
Carroll, John B. 1970. An alternative to Juilland’s usage coefficient for lexical frequencies and a proposal for a standard frequency index. Computer Studies in the Humanities and Verbal Behaviour 3(2). 61–65. doi: 10.1002/j.2333-8504.1970.tb00778.x
Egbert, Jesse, Brent Burch & Douglas Biber. 2020. Lexical dispersion and corpus design. International Journal of Corpus Linguistics 25(1). 89–115. doi: 10.1075/ijcl.18010.egb
Gries, Stefan Th. 2008. Dispersions and adjusted frequencies in corpora. International Journal of Corpus Linguistics 13(4). 403–437. doi: 10.1075/ijcl.13.4.02gri
Gries, Stefan Th. 2020. Analyzing dispersion. In Magali Paquot & Stefan Th. Gries (eds.), A practical handbook of corpus linguistics, 99–118. New York: Springer. doi: 10.1007/978-3-030-46216-1_5
Gries, Stefan Th. 2021. A new approach to (key) keywords analysis: Using frequency, and now also dispersion. Research in Corpus Linguistics 9(2). 1−33. doi: 10.32714/ricl.09.02.02
Juilland, Alphonse G. & Eugenio Chang-Rodriguez. 1964. Frequency dictionary of Spanish words. The Hague: Mouton de Gruyter. doi: 10.1515/9783112415467
Keniston, Hayward. 1920. Common words in Spanish. Hispania 3(2). 85–96. doi: 10.2307/331305
Lijffijt, Jefrey & Stefan Th. Gries. 2012. Correction to Stefan Th. Gries’ ‘Dispersions and adjusted frequencies in corpora’. International Journal of Corpus Linguistics 17(1). 147–149. doi: 10.1075/ijcl.17.1.08lij
Lyne, Anthony A. 1985. The vocabulary of French business correspondence. Paris: Slatkine-Champion.
Nelson, Gerald, Sean Wallis & Bas Aarts. 2002. Exploring Natural Language: Working with the British Component of the International Corpus of English. Amsterdam: John Benjamins. doi: 10.1075/veaw.g29
Rosengren, Inger. 1971. The quantitative concept of language and its relation to the structure of frequency dictionaries. Études de linguistique appliquée (Nouvelle Série) 1. 103–127.
Sönning, Lukas. 2025. Advancing our understanding of dispersion measures in corpus research. Corpora.