The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.

Demonstration of the package usage

Charles Thraves, Pablo Ubilla, Daniel Hermosilla

2025-05-14

The fastei library implements the Expectation-Maximization algorithm to estimate probabilities in non-parametric Ecological Inference for the RxC case. It offers both an exact method and four approximation methods suitable for large datasets. The application demonstrated here is based on an electoral context where, for each ballot box, we know the number of voters in each demographic group and the number of votes received by each candidate.

Estimate Voting Probabilities

We use data from the First Round of the Chilean Presidential Election of 2021, where at each ballot box we have the voters of eight age ranges (groups), and candidate votes obtained. The function get_eim_chile() loads this data at either the country, region, or electoral district level.

library(fastei)

eim_apo <- get_eim_chile(elect_district = "APOQUINDO")

eim_apo
## eim ecological inference model
## Candidates' vote matrix (X) [b x c]:
##     C1  C2 C3 C4 C5 C6 C7 C8
## 37M 36  91 19 48  2  7  2  5
## 38M 29 103 17 54  1  6  3  3
## 39M 32  81 13 73  1  5  5  1
## 40M 25  95 21 38  3  4  2  2
## 41M 31  86 19 48  0  4  5  2
## .
## .
## .
## Group-level voter matrix (W) [b x g]:
##     X18.19 X20.29 X30.39 X40.49 X50.59 X60.69 X70.79 X80.
## 37M      5     27     11     50     23     39     36   19
## 38M      5     33     11     48     22     42     36   19
## 39M      4     19     11     71     29     30     35   11
## 40M      6     21      6     49     36     20     29   23
## 41M      3     27      6     55     23     29     33   19
## .
## .
## .

As shown in the output, it returns an eim object that contains two matrices: the number of votes per candidate and the number of votes per group. The rows of both matrices correspond to the specific ballot-box, and the columns are candidates and groups, respectively. This eim object is used as input to run the EM algorithm that estimates the voting probabilities. In this example, it uses the default method mult, which is the most efficient in terms of runtime. Running the algorithm is done by calling run_em().

eim_apo <- run_em(eim_apo)
eim_apo$prob
##                C1        C2         C3        C4          C5          C6
## X18.19 0.18021390 0.3812221 0.08594430 0.3191075 0.010454776 0.018692204
## X20.29 0.20222421 0.3427176 0.04921371 0.3472181 0.008598547 0.016276474
## X30.39 0.16208968 0.4129782 0.04180096 0.3493445 0.003027951 0.008108756
## X40.49 0.09194714 0.4897596 0.07860136 0.3271135 0.002118651 0.004123056
## X50.59 0.14863755 0.4140519 0.09393523 0.2893466 0.001088838 0.034452989
## X60.69 0.10069659 0.4989324 0.09071251 0.2744208 0.001833485 0.021835588
## X70.79 0.10508691 0.5207636 0.08438978 0.2354592 0.005996253 0.021857511
## X80.   0.07591972 0.5151956 0.13409775 0.2169307 0.006055633 0.031385766
##                 C7           C8
## X18.19 0.004027269 0.0003379456
## X20.29 0.018823328 0.0149280799
## X30.39 0.018774827 0.0038750467
## X40.49 0.002890563 0.0034461514
## X50.59 0.015023353 0.0034635700
## X60.69 0.007070275 0.0044983903
## X70.79 0.017167716 0.0092789396
## X80.   0.011800456 0.0086143403

Note that each row corresponds to the probability that a demographic group (g) voted for a candidate (c). It is worth noting how the estimated probabilities differ substantially across groups.

Standard deviation estimates

We can compute the standard deviation of the estimated probabilities using bootstrapping. This can be done with the function bootstrap().

eim_apo <- bootstrap(eim_apo, seed = 42, nboot = 30)
eim_apo$sd
##                C1         C2         C3         C4         C5         C6
## X18.19 0.08429831 0.11040608 0.04737099 0.06536731 0.03442258 0.03390104
## X20.29 0.06380323 0.08345643 0.04483642 0.06006561 0.02727428 0.03764120
## X30.39 0.05936512 0.06478433 0.04968344 0.07416344 0.02017893 0.02990580
## X40.49 0.04967044 0.06378142 0.04957294 0.05635050 0.02247519 0.03341556
## X50.59 0.04735214 0.06012512 0.03920637 0.05760971 0.01701997 0.03335058
## X60.69 0.04765265 0.05693243 0.04058074 0.05274889 0.01785652 0.03073866
## X70.79 0.05751887 0.06840338 0.04374161 0.06113917 0.02118821 0.03024047
## X80.   0.06637382 0.08292988 0.05394495 0.07826136 0.02380783 0.03046137
##                C7         C8
## X18.19 0.03859014 0.02984271
## X20.29 0.03250173 0.02832356
## X30.39 0.03646492 0.02261705
## X40.49 0.02638713 0.02537271
## X50.59 0.03180943 0.02025445
## X60.69 0.03020492 0.01997521
## X70.79 0.03038882 0.02171418
## X80.   0.03174904 0.02352783

The standard deviations obtained in the district “Apoquindo” are low in general. One reason for this is the high number of ballot boxes in this district. In contrast, standard deviations of estimated probabilities in districts with fewer ballot boxes, such as “Navidad”, are larger.

eim_nav <- get_eim_chile(elect_district = "NAVIDAD")
eim_nav <- bootstrap(eim_nav, seed = 42, nboot = 30)
eim_nav$sd
##               C1        C2        C3        C4         C5         C6         C7
## X18.19 0.2214451 0.1970844 0.1783613 0.3167673 0.11334607 0.13637371 0.15305458
## X20.29 0.1606269 0.1770246 0.1806477 0.1466003 0.07800926 0.09678682 0.16489767
## X30.39 0.1693693 0.2047489 0.1799093 0.1654572 0.09240416 0.11132817 0.11587464
## X40.49 0.1365256 0.1792028 0.1866755 0.1616137 0.08721944 0.11954829 0.10547254
## X50.59 0.1501858 0.1712607 0.1878540 0.1312316 0.06048176 0.12660728 0.10100224
## X60.69 0.1280791 0.1598155 0.1674383 0.1043716 0.05945490 0.12754337 0.08819076
## X70.79 0.1403220 0.1619747 0.1727775 0.1313626 0.08908717 0.12246832 0.10328463
## X80.   0.1755121 0.1795565 0.1929457 0.1752385 0.09301918 0.12165925 0.13777668
##                C8
## X18.19 0.09203398
## X20.29 0.06509148
## X30.39 0.07590694
## X40.49 0.06716454
## X50.59 0.09324964
## X60.69 0.06153528
## X70.79 0.06626062
## X80.   0.11209605

It is possible to see the difference in a visual way by plotting the standard deviations of the estimated probabilities across groups.

library(ggplot2)
library(reshape2)
library(viridis)

plot_district <- function(matrix1, district1, matrix2, district2, sd = FALSE) {
    value <- ifelse(sd == FALSE, "prob", "sd")
    df1 <- melt(matrix1)
    df2 <- melt(matrix2)
    df1$Matrix <- district1
    df2$Matrix <- district2
    combined_df <- rbind(df1, df2)
    color <- ifelse(value == "prob", "plasma", "viridis")

    # Add text to each cell of the matrix
    combined_df$label <- sprintf("%.2f", combined_df$value)
    combined_df$text_color <- ifelse(combined_df$value > round(max(combined_df$value) * 0.75 + min(combined_df$value) * 0.25, 2), "black", "white")
    districts <- sort(c(district1, district2))

    # Call the plot
    ggplot(combined_df, aes(x = Var2, y = Var1, fill = value)) +
        geom_tile() +
        geom_text(aes(label = label, color = text_color), size = 3) +
        scale_fill_viridis(
            name = value,
            option = color,
        ) +
        scale_color_identity() +
        facet_wrap(~Matrix) +
        coord_fixed() +
        theme_bw() +
        labs(
            title = ifelse(value == "prob",
                paste("Estimated probabilities in districts:", districts[1], "and", districts[2]),
                paste("Standard deviation of estimated probabilities in districts:", districts[1], "and", districts[2])
            ),
            x = "Candidates' votes", y = "Voters' age range", fill = value
        )
}
plot_district(
    matrix1 = eim_nav$sd, district1 = "Navidad",
    matrix2 = eim_apo$sd, district2 = "Apoquindo", sd = TRUE
)
Navidad and Apoquindo standard deviation comparison

Navidad and Apoquindo standard deviation comparison

Reduce Estimation Error using Group Aggregation

Demographic groups can be merged to have probability estimates with lower error. A greedy strategy that maximizes the variability of the distribution of groups across ballot boxes is used, which ensures standard deviations are below a specific threshold. The package provides the following function for the latter:

eim_nav_proxy <- get_agg_proxy(eim_nav, seed = 42, sd_threshold = 0.1, sd_statistic = "average")
eim_nav_proxy$group_agg
## [1] 4 8

As shown in the output, the heuristic finds a group aggregation that merges the first and last four age ranges, namely, macro-groups with voters aged 18-49, and older than 50. We can evaluate the effectiveness of this grouping by comparing the mean standard deviation to the original formulation:

mean(eim_nav$sd) - mean(eim_nav_proxy$sd)
## [1] 0.05028999

It is possible to see the aggregated standard deviations visually:

plot_matrix <- function(mat, sd = FALSE, y_labels = NULL) {
    # Initial configurations
    if (!sd) mat <- t(mat)
    df <- reshape2::melt(mat)
    colnames(df) <- c("Row", "Column", "Value")
    df$Row <- factor(df$Row, levels = rev(sort(unique(df$Row))))
    df$Column <- factor(df$Column, levels = sort(unique(df$Column)))
    if (!sd) {
        df$Label <- sprintf("%d", df$Value)
        title_text <- "Voters distribution"
        x_lab <- "Ballot Box"
        y_lab <- "Dem. Group"
        fill_lab <- "Voters"
        df$text_color <- ifelse(df$Value > 30, "black", "white")
        option <- "inferno"
        start <- 0.5
        limits <- NULL
    } else {
        df$Label <- sprintf("%.2f", df$Value)
        title_text <- "Standard deviation of estimated probabilities on district: Navidad"
        x_lab <- "Candidates' votes"
        y_lab <- "Voters' age range"
        fill_lab <- "sd"
        df$text_color <- ifelse(df$Value > 0.13, "black", "white")
        option <- "viridis"
        start <- 0
        limits <- c(0, 0.2)
    }

    # Plot
    p <- ggplot(df, aes(x = Column, y = Row, fill = Value)) +
        geom_tile() +
        geom_text(aes(label = Label, color = text_color), size = 3) +
        scale_color_identity() +
        scale_fill_viridis_c(option = option, begin = start, limits = limits) +
        coord_fixed() +
        theme_bw() +
        theme(axis.text.y = element_text(size = 7), axis.text.x = element_text(size = 7)) +
        labs(
            title = title_text,
            x = x_lab,
            y = y_lab,
            fill = fill_lab
        )
    # Add custom y-axis labels if provided
    if (!is.null(y_labels)) {
        p <- p + scale_y_discrete(labels = y_labels)
    }
    p
}

We can visualize the standard deviations of the estimated probabilities.

plot_matrix(eim_nav_proxy$sd, sd = TRUE, y_labels = c("X18.49", "X50."))
Navidad aggregated standard deviation with proxy method

Navidad aggregated standard deviation with proxy method

An exhaustive algorithm is also included in the package. It explores all combinations of adjacent groups in order to maximize the log-likelihood subject to having standard deviations below a given threshold. It might require substantial computation time; therefore it is recommended to use the default method, “mult”.

eim_nav_opt <- get_agg_opt(eim_nav, seed = 42, sd_threshold = 0.1, sd_statistic = "average")
eim_nav_opt$group_agg
## [1] 4 6 8

The optimal group aggregation differs slightly from one obtained before. In this case, the group aggregation consists in the three age range: 18-49, 50-69, and older than 70.

We can visualize the standard deviations of the estimated probabilities.

plot_matrix(eim_nav_opt$sd, sd = TRUE, y_labels = c("X18.49", "X50.69", "X70."))
Navidad aggregated standard deviation with opt method

Navidad aggregated standard deviation with opt method

Test difference between estimates

A relevant question is how significantly different are the probability estimates of two sets of data, such as two different districts. For instance, we can compare the estimated probabilities of the district “APOQUINDO” and “PROVIDENCIA”, which belong to the same ward.

eim_prov <- get_eim_chile("PROVIDENCIA")
eim_prov <- run_em(eim_prov)
eim_apo <- run_em(eim_apo)
plot_district(eim_apo$prob, "Apoquindo", eim_prov$prob, "Providencia")
Providencia and Apoquindo comparison

Providencia and Apoquindo comparison

A Welch’s test can be applied to each component of the two probability matrix estimates:

comparison <- welchtest(
    object1 = eim_prov,
    object2 = eim_apo,
    method = "mult",
    nboot = 30,
    seed = 42
)

round(comparison$pvals, 3)
##           C1    C2    C3    C4    C5    C6    C7    C8
## X18.19 0.000 0.226 0.293 0.000 0.261 0.988 0.492 0.989
## X20.29 0.000 0.000 0.561 0.000 0.422 0.646 0.611 0.855
## X30.39 0.000 0.000 0.001 0.000 0.335 0.303 0.678 0.943
## X40.49 0.000 0.000 0.000 0.000 0.861 0.077 0.649 0.983
## X50.59 0.002 0.000 0.000 0.004 0.012 0.102 0.893 0.172
## X60.69 0.000 0.000 0.858 0.043 0.766 0.066 0.135 0.714
## X70.79 0.053 0.073 0.086 0.474 0.582 0.161 0.098 0.161
## X80.   0.340 0.184 0.252 0.323 0.395 0.497 0.958 0.852

In most cases, voting probabilities are not significantly different. On the other hand, it may be noteworthy to observe the difference between two electoral districts whose voting tendencies are expected to differ.

eim_gra <- get_eim_chile("LA GRANJA")
eim_gra <- run_em(eim_gra)
eim_bar <- get_eim_chile("LO BARNECHEA")
eim_bar <- run_em(eim_bar)
Lo Barnechea and La Granja comparison

Lo Barnechea and La Granja comparison

comparison2 <- welchtest(
    object1 = eim_gra,
    object2 = eim_bar,
    method = "mult",
    nboot = 30,
    seed = 42,
)

round(comparison2$pvals, 3)
##           C1 C2    C3    C4    C5    C6    C7    C8
## X18.19 0.000  0 0.000 0.000 0.269 0.000 0.282 0.842
## X20.29 0.000  0 0.385 0.000 0.014 0.000 0.000 0.959
## X30.39 0.000  0 0.009 0.000 0.016 0.000 0.000 0.409
## X40.49 0.000  0 0.000 0.000 0.021 0.000 0.000 0.153
## X50.59 0.000  0 0.000 0.000 0.260 0.000 0.000 0.334
## X60.69 0.000  0 0.000 0.000 0.157 0.000 0.588 0.608
## X70.79 0.571  0 0.000 0.562 0.491 0.889 0.013 0.518
## X80.   0.980  0 0.000 0.396 0.953 0.097 0.032 0.360

Simulating Election Results

It is possible to simulate an artificial election and get its values as an eim object with the simulated probability. This is useful for testing the package’s performance and comparing the different methods available.

eim_sim <- simulate_election(num_ballots = 15, num_groups = 2, num_candidates = 3, seed = 42)
eim_sim
## eim ecological inference model
## Candidates' vote matrix (X) [b x c]:
##      [,1] [,2] [,3]
## [1,]   23   31   46
## [2,]   22   35   43
## [3,]   30   20   50
## [4,]   24   35   41
## [5,]   24   29   47
## .
## .
## .
## Group-level voter matrix (W) [b x g]:
##      [,1] [,2]
## [1,]   75   25
## [2,]   77   23
## [3,]   73   27
## [4,]   80   20
## [5,]   73   27
## .
## .
## .

The real voting probabilities of each group for each candidate can be obtained as follows:

eim_sim$real_prob
##           [,1]      [,2]      [,3]
## [1,] 0.2588692 0.3615164 0.3796145
## [2,] 0.2913730 0.1829977 0.5256293

In this case, we can access these values since this is a simulation; however, this is not possible when working with real data.

The estimated voting probabilities are:

eim_sim <- run_em(eim_sim)
eim_sim$prob
##           [,1]      [,2]      [,3]
## [1,] 0.2398303 0.3292528 0.4309169
## [2,] 0.3014838 0.1827996 0.5157166

It is possible to provide a specific voting probability for each candidate for each group.

input_probability <- matrix(c(0.9, 0.05, 0.05, 0.2, 0.3, 0.5), nrow = 2, byrow = TRUE)
input_probability
##      [,1] [,2] [,3]
## [1,]  0.9 0.05 0.05
## [2,]  0.2 0.30 0.50
eim_sim2 <- simulate_election(
    num_ballots = 30, num_groups = 2, num_candidates = 3, seed = 42,
    prob = input_probability
)
eim_sim2
## eim ecological inference model
## Candidates' vote matrix (X) [b x c]:
##      [,1] [,2] [,3]
## [1,]   69   14   17
## [2,]   75   12   13
## [3,]   68   10   22
## [4,]   64   16   20
## [5,]   72   12   16
## .
## .
## .
## Group-level voter matrix (W) [b x g]:
##      [,1] [,2]
## [1,]   79   21
## [2,]   78   22
## [3,]   73   27
## [4,]   70   30
## [5,]   73   27
## .
## .
## .
eim_sim2$real_prob
##      [,1] [,2] [,3]
## [1,]  0.9 0.05 0.05
## [2,]  0.2 0.30 0.50

There is a parameter, lambda, that controls for the heterogeneity of voters’ groups across ballot boxes. A value of lambda = 0 indicates that voters are all randomly assigned in ballot boxes; in contrast to lambda = 1, in which case voters are assigned in ballot boxes according to their demographic group. This is explained in detail in the documentation in simulate_election().

eim_sim3 <- simulate_election(
    num_ballots = 20, num_groups = 4, num_candidates = 2, seed = 42,
    lambda = 0.1
)
plot_matrix(eim_sim3$W)
Voters' heatmap for a low lambda value

Voters’ heatmap for a low lambda value

Having the following estimated probabilities:

run_em(eim_sim3)$prob
##           [,1]      [,2]
## [1,] 0.3686854 0.6313146
## [2,] 0.6627212 0.3372788
## [3,] 0.2112870 0.7887130
## [4,] 0.1805033 0.8194967

On the other side, it is possible to simulate an heterogeneous sample

eim_sim4 <- simulate_election(
    num_ballots = 20, num_groups = 4, num_candidates = 2, seed = 42,
    lambda = 0.9
)
plot_matrix(eim_sim4$W)
Voters' heatmap for a high lambda value

Voters’ heatmap for a high lambda value

With the underlying probabilities

run_em(eim_sim4)$prob
##           [,1]      [,2]
## [1,] 0.3731878 0.6268122
## [2,] 0.5431853 0.4568147
## [3,] 0.9365629 0.0634371
## [4,] 0.8711210 0.1288790

These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.