The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.
The fastei
library implements the
Expectation-Maximization algorithm to estimate probabilities in
non-parametric Ecological Inference for the RxC case. It offers both an
exact method and four approximation methods suitable for large datasets.
The application demonstrated here is based on an electoral context
where, for each ballot box, we know the number of voters in each
demographic group and the number of votes received by each
candidate.
We use data from the First Round of the Chilean Presidential Election
of 2021, where at each ballot box we have the voters of eight age ranges
(groups), and candidate votes obtained. The function
get_eim_chile()
loads this data at either the country,
region, or electoral district level.
## eim ecological inference model
## Candidates' vote matrix (X) [b x c]:
## C1 C2 C3 C4 C5 C6 C7 C8
## 37M 36 91 19 48 2 7 2 5
## 38M 29 103 17 54 1 6 3 3
## 39M 32 81 13 73 1 5 5 1
## 40M 25 95 21 38 3 4 2 2
## 41M 31 86 19 48 0 4 5 2
## .
## .
## .
## Group-level voter matrix (W) [b x g]:
## X18.19 X20.29 X30.39 X40.49 X50.59 X60.69 X70.79 X80.
## 37M 5 27 11 50 23 39 36 19
## 38M 5 33 11 48 22 42 36 19
## 39M 4 19 11 71 29 30 35 11
## 40M 6 21 6 49 36 20 29 23
## 41M 3 27 6 55 23 29 33 19
## .
## .
## .
As shown in the output, it returns an eim object that contains two
matrices: the number of votes per candidate and the number of votes per
group. The rows of both matrices correspond to the specific ballot-box,
and the columns are candidates and groups, respectively. This eim object
is used as input to run the EM algorithm that estimates the voting
probabilities. In this example, it uses the default method
mult
, which is the most efficient in terms of runtime.
Running the algorithm is done by calling run_em()
.
## C1 C2 C3 C4 C5 C6
## X18.19 0.18021390 0.3812221 0.08594430 0.3191075 0.010454776 0.018692204
## X20.29 0.20222421 0.3427176 0.04921371 0.3472181 0.008598547 0.016276474
## X30.39 0.16208968 0.4129782 0.04180096 0.3493445 0.003027951 0.008108756
## X40.49 0.09194714 0.4897596 0.07860136 0.3271135 0.002118651 0.004123056
## X50.59 0.14863755 0.4140519 0.09393523 0.2893466 0.001088838 0.034452989
## X60.69 0.10069659 0.4989324 0.09071251 0.2744208 0.001833485 0.021835588
## X70.79 0.10508691 0.5207636 0.08438978 0.2354592 0.005996253 0.021857511
## X80. 0.07591972 0.5151956 0.13409775 0.2169307 0.006055633 0.031385766
## C7 C8
## X18.19 0.004027269 0.0003379456
## X20.29 0.018823328 0.0149280799
## X30.39 0.018774827 0.0038750467
## X40.49 0.002890563 0.0034461514
## X50.59 0.015023353 0.0034635700
## X60.69 0.007070275 0.0044983903
## X70.79 0.017167716 0.0092789396
## X80. 0.011800456 0.0086143403
Note that each row corresponds to the probability that a demographic
group (g
) voted for a candidate (c
). It is
worth noting how the estimated probabilities differ substantially across
groups.
We can compute the standard deviation of the estimated probabilities
using bootstrapping. This can be done with the function
bootstrap()
.
## C1 C2 C3 C4 C5 C6
## X18.19 0.08429831 0.11040608 0.04737099 0.06536731 0.03442258 0.03390104
## X20.29 0.06380323 0.08345643 0.04483642 0.06006561 0.02727428 0.03764120
## X30.39 0.05936512 0.06478433 0.04968344 0.07416344 0.02017893 0.02990580
## X40.49 0.04967044 0.06378142 0.04957294 0.05635050 0.02247519 0.03341556
## X50.59 0.04735214 0.06012512 0.03920637 0.05760971 0.01701997 0.03335058
## X60.69 0.04765265 0.05693243 0.04058074 0.05274889 0.01785652 0.03073866
## X70.79 0.05751887 0.06840338 0.04374161 0.06113917 0.02118821 0.03024047
## X80. 0.06637382 0.08292988 0.05394495 0.07826136 0.02380783 0.03046137
## C7 C8
## X18.19 0.03859014 0.02984271
## X20.29 0.03250173 0.02832356
## X30.39 0.03646492 0.02261705
## X40.49 0.02638713 0.02537271
## X50.59 0.03180943 0.02025445
## X60.69 0.03020492 0.01997521
## X70.79 0.03038882 0.02171418
## X80. 0.03174904 0.02352783
The standard deviations obtained in the district “Apoquindo” are low in general. One reason for this is the high number of ballot boxes in this district. In contrast, standard deviations of estimated probabilities in districts with fewer ballot boxes, such as “Navidad”, are larger.
eim_nav <- get_eim_chile(elect_district = "NAVIDAD")
eim_nav <- bootstrap(eim_nav, seed = 42, nboot = 30)
eim_nav$sd
## C1 C2 C3 C4 C5 C6 C7
## X18.19 0.2214451 0.1970844 0.1783613 0.3167673 0.11334607 0.13637371 0.15305458
## X20.29 0.1606269 0.1770246 0.1806477 0.1466003 0.07800926 0.09678682 0.16489767
## X30.39 0.1693693 0.2047489 0.1799093 0.1654572 0.09240416 0.11132817 0.11587464
## X40.49 0.1365256 0.1792028 0.1866755 0.1616137 0.08721944 0.11954829 0.10547254
## X50.59 0.1501858 0.1712607 0.1878540 0.1312316 0.06048176 0.12660728 0.10100224
## X60.69 0.1280791 0.1598155 0.1674383 0.1043716 0.05945490 0.12754337 0.08819076
## X70.79 0.1403220 0.1619747 0.1727775 0.1313626 0.08908717 0.12246832 0.10328463
## X80. 0.1755121 0.1795565 0.1929457 0.1752385 0.09301918 0.12165925 0.13777668
## C8
## X18.19 0.09203398
## X20.29 0.06509148
## X30.39 0.07590694
## X40.49 0.06716454
## X50.59 0.09324964
## X60.69 0.06153528
## X70.79 0.06626062
## X80. 0.11209605
It is possible to see the difference in a visual way by plotting the standard deviations of the estimated probabilities across groups.
library(ggplot2)
library(reshape2)
library(viridis)
plot_district <- function(matrix1, district1, matrix2, district2, sd = FALSE) {
value <- ifelse(sd == FALSE, "prob", "sd")
df1 <- melt(matrix1)
df2 <- melt(matrix2)
df1$Matrix <- district1
df2$Matrix <- district2
combined_df <- rbind(df1, df2)
color <- ifelse(value == "prob", "plasma", "viridis")
# Add text to each cell of the matrix
combined_df$label <- sprintf("%.2f", combined_df$value)
combined_df$text_color <- ifelse(combined_df$value > round(max(combined_df$value) * 0.75 + min(combined_df$value) * 0.25, 2), "black", "white")
districts <- sort(c(district1, district2))
# Call the plot
ggplot(combined_df, aes(x = Var2, y = Var1, fill = value)) +
geom_tile() +
geom_text(aes(label = label, color = text_color), size = 3) +
scale_fill_viridis(
name = value,
option = color,
) +
scale_color_identity() +
facet_wrap(~Matrix) +
coord_fixed() +
theme_bw() +
labs(
title = ifelse(value == "prob",
paste("Estimated probabilities in districts:", districts[1], "and", districts[2]),
paste("Standard deviation of estimated probabilities in districts:", districts[1], "and", districts[2])
),
x = "Candidates' votes", y = "Voters' age range", fill = value
)
}
plot_district(
matrix1 = eim_nav$sd, district1 = "Navidad",
matrix2 = eim_apo$sd, district2 = "Apoquindo", sd = TRUE
)
Navidad and Apoquindo standard deviation comparison
Demographic groups can be merged to have probability estimates with lower error. A greedy strategy that maximizes the variability of the distribution of groups across ballot boxes is used, which ensures standard deviations are below a specific threshold. The package provides the following function for the latter:
eim_nav_proxy <- get_agg_proxy(eim_nav, seed = 42, sd_threshold = 0.1, sd_statistic = "average")
eim_nav_proxy$group_agg
## [1] 4 8
As shown in the output, the heuristic finds a group aggregation that merges the first and last four age ranges, namely, macro-groups with voters aged 18-49, and older than 50. We can evaluate the effectiveness of this grouping by comparing the mean standard deviation to the original formulation:
## [1] 0.05028999
It is possible to see the aggregated standard deviations visually:
plot_matrix <- function(mat, sd = FALSE, y_labels = NULL) {
# Initial configurations
if (!sd) mat <- t(mat)
df <- reshape2::melt(mat)
colnames(df) <- c("Row", "Column", "Value")
df$Row <- factor(df$Row, levels = rev(sort(unique(df$Row))))
df$Column <- factor(df$Column, levels = sort(unique(df$Column)))
if (!sd) {
df$Label <- sprintf("%d", df$Value)
title_text <- "Voters distribution"
x_lab <- "Ballot Box"
y_lab <- "Dem. Group"
fill_lab <- "Voters"
df$text_color <- ifelse(df$Value > 30, "black", "white")
option <- "inferno"
start <- 0.5
limits <- NULL
} else {
df$Label <- sprintf("%.2f", df$Value)
title_text <- "Standard deviation of estimated probabilities on district: Navidad"
x_lab <- "Candidates' votes"
y_lab <- "Voters' age range"
fill_lab <- "sd"
df$text_color <- ifelse(df$Value > 0.13, "black", "white")
option <- "viridis"
start <- 0
limits <- c(0, 0.2)
}
# Plot
p <- ggplot(df, aes(x = Column, y = Row, fill = Value)) +
geom_tile() +
geom_text(aes(label = Label, color = text_color), size = 3) +
scale_color_identity() +
scale_fill_viridis_c(option = option, begin = start, limits = limits) +
coord_fixed() +
theme_bw() +
theme(axis.text.y = element_text(size = 7), axis.text.x = element_text(size = 7)) +
labs(
title = title_text,
x = x_lab,
y = y_lab,
fill = fill_lab
)
# Add custom y-axis labels if provided
if (!is.null(y_labels)) {
p <- p + scale_y_discrete(labels = y_labels)
}
p
}
We can visualize the standard deviations of the estimated probabilities.
Navidad aggregated standard deviation with proxy method
An exhaustive algorithm is also included in the package. It explores all combinations of adjacent groups in order to maximize the log-likelihood subject to having standard deviations below a given threshold. It might require substantial computation time; therefore it is recommended to use the default method, “mult”.
eim_nav_opt <- get_agg_opt(eim_nav, seed = 42, sd_threshold = 0.1, sd_statistic = "average")
eim_nav_opt$group_agg
## [1] 4 6 8
The optimal group aggregation differs slightly from one obtained before. In this case, the group aggregation consists in the three age range: 18-49, 50-69, and older than 70.
We can visualize the standard deviations of the estimated probabilities.
Navidad aggregated standard deviation with opt method
A relevant question is how significantly different are the probability estimates of two sets of data, such as two different districts. For instance, we can compare the estimated probabilities of the district “APOQUINDO” and “PROVIDENCIA”, which belong to the same ward.
Providencia and Apoquindo comparison
A Welch’s test can be applied to each component of the two probability matrix estimates:
comparison <- welchtest(
object1 = eim_prov,
object2 = eim_apo,
method = "mult",
nboot = 30,
seed = 42
)
round(comparison$pvals, 3)
## C1 C2 C3 C4 C5 C6 C7 C8
## X18.19 0.000 0.226 0.293 0.000 0.261 0.988 0.492 0.989
## X20.29 0.000 0.000 0.561 0.000 0.422 0.646 0.611 0.855
## X30.39 0.000 0.000 0.001 0.000 0.335 0.303 0.678 0.943
## X40.49 0.000 0.000 0.000 0.000 0.861 0.077 0.649 0.983
## X50.59 0.002 0.000 0.000 0.004 0.012 0.102 0.893 0.172
## X60.69 0.000 0.000 0.858 0.043 0.766 0.066 0.135 0.714
## X70.79 0.053 0.073 0.086 0.474 0.582 0.161 0.098 0.161
## X80. 0.340 0.184 0.252 0.323 0.395 0.497 0.958 0.852
In most cases, voting probabilities are not significantly different. On the other hand, it may be noteworthy to observe the difference between two electoral districts whose voting tendencies are expected to differ.
eim_gra <- get_eim_chile("LA GRANJA")
eim_gra <- run_em(eim_gra)
eim_bar <- get_eim_chile("LO BARNECHEA")
eim_bar <- run_em(eim_bar)
Lo Barnechea and La Granja comparison
comparison2 <- welchtest(
object1 = eim_gra,
object2 = eim_bar,
method = "mult",
nboot = 30,
seed = 42,
)
round(comparison2$pvals, 3)
## C1 C2 C3 C4 C5 C6 C7 C8
## X18.19 0.000 0 0.000 0.000 0.269 0.000 0.282 0.842
## X20.29 0.000 0 0.385 0.000 0.014 0.000 0.000 0.959
## X30.39 0.000 0 0.009 0.000 0.016 0.000 0.000 0.409
## X40.49 0.000 0 0.000 0.000 0.021 0.000 0.000 0.153
## X50.59 0.000 0 0.000 0.000 0.260 0.000 0.000 0.334
## X60.69 0.000 0 0.000 0.000 0.157 0.000 0.588 0.608
## X70.79 0.571 0 0.000 0.562 0.491 0.889 0.013 0.518
## X80. 0.980 0 0.000 0.396 0.953 0.097 0.032 0.360
It is possible to simulate an artificial election and get its values as an eim object with the simulated probability. This is useful for testing the package’s performance and comparing the different methods available.
eim_sim <- simulate_election(num_ballots = 15, num_groups = 2, num_candidates = 3, seed = 42)
eim_sim
## eim ecological inference model
## Candidates' vote matrix (X) [b x c]:
## [,1] [,2] [,3]
## [1,] 23 31 46
## [2,] 22 35 43
## [3,] 30 20 50
## [4,] 24 35 41
## [5,] 24 29 47
## .
## .
## .
## Group-level voter matrix (W) [b x g]:
## [,1] [,2]
## [1,] 75 25
## [2,] 77 23
## [3,] 73 27
## [4,] 80 20
## [5,] 73 27
## .
## .
## .
The real voting probabilities of each group for each candidate can be obtained as follows:
## [,1] [,2] [,3]
## [1,] 0.2588692 0.3615164 0.3796145
## [2,] 0.2913730 0.1829977 0.5256293
In this case, we can access these values since this is a simulation; however, this is not possible when working with real data.
The estimated voting probabilities are:
## [,1] [,2] [,3]
## [1,] 0.2398303 0.3292528 0.4309169
## [2,] 0.3014838 0.1827996 0.5157166
It is possible to provide a specific voting probability for each candidate for each group.
input_probability <- matrix(c(0.9, 0.05, 0.05, 0.2, 0.3, 0.5), nrow = 2, byrow = TRUE)
input_probability
## [,1] [,2] [,3]
## [1,] 0.9 0.05 0.05
## [2,] 0.2 0.30 0.50
eim_sim2 <- simulate_election(
num_ballots = 30, num_groups = 2, num_candidates = 3, seed = 42,
prob = input_probability
)
eim_sim2
## eim ecological inference model
## Candidates' vote matrix (X) [b x c]:
## [,1] [,2] [,3]
## [1,] 69 14 17
## [2,] 75 12 13
## [3,] 68 10 22
## [4,] 64 16 20
## [5,] 72 12 16
## .
## .
## .
## Group-level voter matrix (W) [b x g]:
## [,1] [,2]
## [1,] 79 21
## [2,] 78 22
## [3,] 73 27
## [4,] 70 30
## [5,] 73 27
## .
## .
## .
## [,1] [,2] [,3]
## [1,] 0.9 0.05 0.05
## [2,] 0.2 0.30 0.50
There is a parameter, lambda, that controls for the heterogeneity of
voters’ groups across ballot boxes. A value of lambda = 0 indicates that
voters are all randomly assigned in ballot boxes; in contrast to lambda
= 1, in which case voters are assigned in ballot boxes according to
their demographic group. This is explained in detail in the
documentation in simulate_election()
.
eim_sim3 <- simulate_election(
num_ballots = 20, num_groups = 4, num_candidates = 2, seed = 42,
lambda = 0.1
)
Voters’ heatmap for a low lambda value
Having the following estimated probabilities:
## [,1] [,2]
## [1,] 0.3686854 0.6313146
## [2,] 0.6627212 0.3372788
## [3,] 0.2112870 0.7887130
## [4,] 0.1805033 0.8194967
On the other side, it is possible to simulate an heterogeneous sample
eim_sim4 <- simulate_election(
num_ballots = 20, num_groups = 4, num_candidates = 2, seed = 42,
lambda = 0.9
)
Voters’ heatmap for a high lambda value
With the underlying probabilities
## [,1] [,2]
## [1,] 0.3731878 0.6268122
## [2,] 0.5431853 0.4568147
## [3,] 0.9365629 0.0634371
## [4,] 0.8711210 0.1288790
These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.