The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.

ISCA - Guide

library(ISCA)

Why ISCA?

ISCA stands for “Inductive Subgroup Comparison Approach”, a data-driven approach to identifying and comparing groups in data originally proposed in Drouhot (2021). Specifically, ISCA identifies the main subgroups within Group 1 (a majority or reference group), and matches members of Group 2 (a minority group) to different subgroups from the majority based on social similarity. In other words, ISCA allows minority members to be compared to socially similar members of the majority. For instance, when comparing two ethnic groups (a majority and a minority group), ISCA might identify a subgroup comprising men of low social status in the majority group. It will then identify similar members of the minority group (also men of low social status) and match them to the appropriate subgroup within the majority, so that the comparison is more appropriate than would be if we compared all members to the majority group to all members of the minority groups. ISCA is especially useful when majority and minority groups to be compared are internally heterogeneous, resulting in different baselines within the majority group against which the minority group should be compared (again, to illustrate, for many comparisons it is best to study low status minority members vis-à-vis low status majority members rather than majority members in general).

On a more theoretical level, ISCA helps researchers avoid essentializing minority and majority groups as (e.g. ethnically, racially, religiously) homogenous categories in their quantitative analyses, and recover the internal social structures of social groups. Unlike common regression approach which obscure intra-group heterogeneity by yielding a single estimate for each pre-defined, fixed (nominal) group category, ISCA relies on a mixture of fuzzy clustering, Monte Carlo simulation, and regression analysis to assess minority and majority groups within a population as socially comparable subgroups, yielding an estimate for each subgroup. Estimates can then be compared across subgroups, as well as to a grand mean capturing traditional, single estimates at the group level. Together, these comparisons allow to see which empirical subgroups may be driving overall trends observed at the aggregate level, and, conversely, which pointed trends at the subgroup level get lost when only looking at nominal group-level estimates. For an in-depth discussion, see Drouhot 2021: 804-811. If you are already familiar with the logic of ISCA, you can skip the next section and move right on to the first function.

What is ISCA? Some Background (adapted from Drouhot 2021)

ISCA consists of three distinct methodological steps.

  1. Through fuzzy cluster analysis— a family of data partitioning techniques from the larger umbrella of unsupervised machine learning (Kaufman and Rousseeuw 2005; Molina and Garip 2019) — the first step is to inductively identify the key subgroups within the majority population.

  2. Minority groups are assigned to one of the subgroups making up the majority population on the basis of (social) similarity.

  3. Hypotheses are tested and outcomes of interest are modelled in these matched minority-majority subgroups.

Thus, ISCA effectively switches the relevant categories of analysis in intergroup comparisons from nominal (e.g., religious, ethnic, or otherwise categorically defined) groups to data-driven subgroups and allows for within as well as the more traditional between-group comparisons.

From the point of view of the integration / assimilation literature within the social sciences, ISCA allows for the study of distinct assimilation/integration pathways by taking intragroup heterogeneity among both the majority and minority population into account and is characterized by three key features.

  1. It is inductive, insofar as subgroups of interest emerge from the data rather than prenotions from the researcher. This is a point worth emphasizing: you could decide to compare low-income immigrants to low-income natives, for example. But you cannot, by definition, know in advance whether income is the right dimension to organize your comparison. Additionally, it is possible for several dimensions (income, gender, age, urban location, etc.) to consolidate into subgroups (Blau 1974). In other words, what matters for group heterogeneity may well be specific configurations of variables rather than specific variables. ISCA relies on the inherent reflexivity afforded by unsupervised machine-learning approaches: while domain-specific knowledge remains key to interpreting results, researchers need not impose assumptions about the structure of the data or the analytical appropriateness of a given social category in organizing heterogeneity within the data.

  2. ISCA is probabilistic because it does not rely on “hard” clustering assignment (such as that obtained with k-means clustering), where observations can belong to one cluster only, as such an approach would lead to reifying subgroups themselves and possibly overstating cross-cluster differences. Rather, it relies on fuzzy clustering, where membership in each cluster is uncertain and expressed through a membership score, which is then used to assign cases to groups in a stochastic and thus truly probabilistic manner.

    More specifically, each individual observation receives k membership scores, where k is the number of clusters. Membership scores range from 0 to 1, and the sum of each individual’s membership scores adds up to 1. If all individuals have a very high probability of belonging to only one cluster, then fuzzy and hard clustering do not differ much. However, if membership probabilities are balanced across clusters, hard clustering may result in arbitrary, and possibly misleading group assignments.1

  3. Finally, ISCA is iterative. As there exists significant uncertainty around subgroup boundaries (as expressed by membership scores from fuzzy clustering that are balanced), results following stochastic assignment may misrepresent the underlying uncertainty about cluster membership when membership is assigned only once. Thus, ISCA relies on Monte Carlo simulation and multiple iterations of the assignment and modelling steps to obtain stable results. Rather than a statistical or analytical nuisance, the procedure hence regards assignment uncertainty as meaningful since it reflects the blurry boundaries between ideal-typical subgroups making up the majority and minority categories of interest.

The following three functions allow researchers to conduct the ISCA in R.

Function / Step 1

Identifying Heterogeneity in Majority Group through Fuzzy Clustering and Stochastically Assign Individuals to a Subgroup

The function translates to step 1 and 2 (see Drouhot 2021: 808-10). First, identify and choose variables that are known to be associated with your outcome variable based on past research. In the original implementation, ISCA does not divide up a majority group sample directly in terms of the outcome variable but uses (demographic) variables that are known to be related to it. Note, however, that splitting groups directly in terms of the outcome of interest is of course possible, and can be appropriate depending on the research questions and related substantive concerns.

To obtain subgroups in the majority sample, the ISCA package then relies on the fuzzy c-means clustering algorithm (Bezdek, Ehrlich, and Full 1984) and specifically the cmeans() function of the e1071-package (Meyer et al. 2023). It is advisable to transform continuous variables in dummies (e.g., coded 0 for values below the median and 1 for values above) to avoid an arbitrary weighting of attributes due to different scales, which would affect the clustering results in undesirable ways.

You need to decide the number of clusters you would like to model. There is a large literature on this across social sciences, as well as computer and data science. Hence, a large number of diagnostic tests exist to determine the most appropriate number, each with strength and weakness; for instance, Drouhot (2021: 839-40 in Appendix B) used 6 tests from the fclust package (Ferraro, Giordani, and Serafini 2019). Additionally, users can compare multiple solutions and go for the one where each subgroup is the most easily interpretable. In particular, clustering solution yielding either one very large, multiple very small, or redundant groups should be approached with caution.

For transparency, the function uses the following arguments within the cmeans() function: iter.max=100, verbose=TRUE, dist="manhattan", method="cmeans", m=1.5 (default, but can be adjusted in the ISCA_random_assignments()-function), rate.par = NULL.

Once you know the number of clusters you want to model and the variables with which the clusters should be created, the first function from the ISCA-package can be used.

The ISCA_random_assignments()-function requires the following arguments:

The output is a dataframe that has one new column containing the cluster assignment for each probabilistic draw. This means that the output is based on the originally provided dataset and all variables including those not used to create the clusters remain unchanged so that they could be used in later analyses (e.g., for step 2 and 3). The output is the foundation of the other two functions in the ISCA-package. For purposes of illustration, the output below is from a simulated dataset containing 1000 observations. One observation can be interpreted as one individual. The output below only shows the first six individuals in this fictitious dataset.

data("sim_data")
head(sim_data)
##   female age education income religiosity discrimination native
## 1      1  74         6   1608           2              2      0
## 2      1  20         4    994           4              4      1
## 3      1  69         9   1227          10              4      0
## 4      0  21         6      0           2              5      0
## 5      1  64         6   2132          10              3      0
## 6      1  27         5   1974           1              4      1
ISCA_step1 <- ISCA_random_assignments(data=sim_data, filter=native, majority_group=1, minority_group=c(0), fuzzifier = 1.5,
                                      n_clusters=3, draws=5, cluster_vars= c("female", "age", "education", "income"))
## Iteration:   1, Error: 219.9604189517
## Iteration:   2, Error: 218.2123581526
## Iteration:   3, Error: 217.9846545604
## Iteration:   4, Error: 217.9588491548
## Iteration:   5 converged, Error: 217.9588491548
head(ISCA_step1)
##   female age education income religiosity discrimination native A1 A2 A3 A4 A5
## 1      1  20         4    994           4              4      1  2  3  1  2  2
## 2      1  27         5   1974           1              4      1  1  1  1  1  1
## 3      0  71         5   1042           4              4      1  2  2  2  2  2
## 4      1  53         4   2316           7              5      1  2  1  1  1  1
## 5      0  51         5   1213           1              5      1  2  2  2  2  2
## 6      1  56         5    682           5              6      1  3  3  2  3  3

In this example, four variables (i.e., female, age, education, income) are used to create 4 clusters to which majority and minority subjects are assigned probabilistically across 5 iterations. The variable in the data set that stores information on minority vs. majority status is called ‘native’. In it, value 1 indicates majority status, the values 0 indicates minority status. In that particular example, clustering results (as indicated by column A1 to A5; 5 multinomial assignments based on the distribution formed by membership function in the 3 clusters) shows substantial stability across assignments A1-A5, indicating that the distribution of membership functions is probably very skewed towards one cluster. Note that for observation #1, the assignment is less stable than other observations. In this particular example based on a fictitious and random data, the obtained subgroups do not coalesce on all variables; rather, income appear most influential in organizing subgroups (i.e., subgroups may be heterogeneous in terms of age and religiosity for instance, but homogenous in terms of income). Note that the discrimination and religiosity variables are part of the dataset but were not used to create the clusters because the clusters focus more on social structural indicators and the inclusion of the outcome variable is not always recommended (see Drouhot 2021).

Note also that the function probabilistically assigns members of both majority and minority groups to clusters. Specifically, the function first starts by probabilistically assigning majority group members to the empirically defined clusters, and then continues to probabilistically assign each minority group member to the cluster they most closely resemble. In other words, the clusters are created based only on the majority group, and minority group members are subsequently matched to a cluster of reference based on social similarity. This is why in the head(ISCA_step1) output above the first six individuals are all natives, indicated by the value one. When displaying the last 6 rows of ISCA_step1, we can see that the output contains minority members as well:

tail(ISCA_step1, 6)
##      female age education income religiosity discrimination native A1 A2 A3 A4
## 995       1  20         3    445           3              3      0  3  3  3  3
## 996       0  53         4    121           5              2      0  3  3  3  3
## 997       0  73         6    871           7              5      0  2  2  3  2
## 998       0  67         5    579           7              4      0  3  3  3  3
## 999       1  30         2      0           3              3      0  3  3  3  3
## 1000      1  59         6   2241           5              3      0  1  1  1  1
##      A5
## 995   3
## 996   3
## 997   2
## 998   3
## 999   3
## 1000  1

Function / Step 2

The second function gives an overview of the distribution of the variables of interest per cluster and across all iterations. That includes:

The function also computes a grand mean and cluster error for a count variable for each cluster.

The substantive goal for the analyst here is to better understand the obtained clusters, by interpreting them based on domain-specific and other expert knowledge of the populations and context at hand. When producing a descriptive table, remember that observations should first be filtered to include only members of the majority group since they are the ones determining the social structure obtained at step 1 (minority members are only matched to clusters of majority members they most closely resemble). Additionally, for purposes of comparison, the table could also include variables that were not part of the cluster building, for example other independent or dependent variables.

Variables of interest could be the variables that were used to build the clusters. This helps to understand the clusters and assign substantive meaning to them. In that case the dataset should first be filtered to include only members of the majority group. The table could also include variables that were not part of the cluster building, for example other independent or dependent variables.

Variables with only two values are automatically recognized as categorical, dichotomous variables and no grand standard deviation is calculated. If you work with multinomial factor variables, it is best to include them as dummy-variables.

The ISCA_clustertable()-function requires the following arguments:

majority_only <- ISCA_step1[ISCA_step1["native"] == 1, ]

result_ISCA_clustertable <- ISCA_clustertable(data = majority_only, 
                                              cluster_vars = c("native", "female", "age", "education", "income"),
                                              draws = 5)

head(result_ISCA_clustertable, 50)
##                   variable         1         2        3
## 1               assignment    1.0000    2.0000   3.0000
## 2        grand_mean_native    1.0000    1.0000   1.0000
## 3     cluster_error_native    0.0000    0.0000   0.0000
## 4          grand_sd_native    0.0000    0.0000   0.0000
## 5        grand_mean_female    0.4673    0.5089   0.5655
## 6     cluster_error_female    0.0089    0.0125   0.0168
## 7           grand_mean_age   48.2452   47.5843  48.2351
## 8        cluster_error_age    0.1220    0.4307   0.7271
## 9             grand_sd_age   18.2713   18.2838  18.0426
## 10    grand_mean_education    4.9916    5.0076   4.9839
## 11 cluster_error_education    0.0516    0.0386   0.0409
## 12      grand_sd_education    1.4005    1.2592   1.2286
## 13       grand_mean_income 1964.4856 1228.5364 369.6069
## 14    cluster_error_income   15.1964   30.7765  28.0879
## 15         grand_sd_income  469.1825  381.0639 410.0264
## 16        grand_mean_count  140.0000  187.8000 136.2000
## 17     cluster_error_count    3.5355    4.9699   7.1903

Remember that the function examples are based on a simple artificial dataset. Interpreting the clusters here is therefore not very straightforward and useful. With real life data, clearer patterns are likely to emerge, and are to be interpreted by the researcher based on domain-specific knowledge.

Function / Step 3

Within-Cluster Modelling of Outcomes and Cross-Group Comparisons

The third function runs OLS regressions for each cluster and iteration. The output contains a list with two entries which can be extracted. The first entry contains the OLS regression estimates for each cluster averaged across iterations, including estimates for a model which pools all clusters. The second entry contains averaged adjusted R-squared values per cluster to indicate model fit.

The ISCA_modeling()-function requires the following arguments:

Note that the goal here becomes to compare majority and minority groups within each cluster as well as differences between clusters within a given categorically defined groups, particularly whether minority consistently differ from majority members and whether predictors affect the main outcome of interest in a consistent manner across subgroups. Additionally, if a supplementary model at the categorical (i.e., nominal) group level is run, the comparison of the cluster-specific estimates with group-level estimates is particularly meaningful because it allows to see which subgroups drive aggregate trends, and conversely, which significant trends at the subgroup level get obscured at the aggregate level.

Currently, there are no tests for the differences in effects between clusters, although these can be obtained from ad hoc testing as such through the Paternoster test for the equality of coefficients (Paternoster et al. 1998).

ISCA_modeling_results <- ISCA_modeling(data= ISCA_step1, 
                                   model_spec="religiosity ~ native + female + age + education + discrimination",
                                   draws = 5, n_clusters = 3, weights = NULL)
## `mutate_if()` ignored the following grouping variables:
## • Column `cluster`
estimates <- as.data.frame(ISCA_modeling_results[1])
head(estimates, 30) # OLS estimates per cluster across iterations
##    cluster           term mean_coefficients mean_std.error mean_p_value
## 1        1    (Intercept)            4.1783         0.1869       0.0000
## 2        1            age           -0.0027         0.0043       0.6502
## 3        1 discrimination            0.2077         0.0458       0.1409
## 4        1      education            0.0930         0.0402       0.4670
## 5        1         female            0.3082         0.0648       0.3542
## 6        1         native           -1.3254         0.0934       0.0001
## 7        2    (Intercept)            4.8992         0.3641       0.0000
## 8        2            age           -0.0014         0.0041       0.6824
## 9        2 discrimination            0.0439         0.0706       0.5639
## 10       2      education           -0.0014         0.0368       0.7705
## 11       2         female            0.9220         0.0961       0.0012
## 12       2         native           -1.3263         0.1075       0.0000
## 13       3    (Intercept)            5.6342         0.3043       0.0000
## 14       3            age            0.0038         0.0018       0.6894
## 15       3 discrimination           -0.1685         0.0617       0.2333
## 16       3      education            0.0264         0.0274       0.8027
## 17       3         female            0.4384         0.1354       0.2077
## 18       3         native           -1.4836         0.1506       0.0000
## 19       0    (Intercept)            4.9147         0.5211       0.0000
## 20       0         native           -1.3733         0.1755       0.0000
## 21       0         female            0.5760         0.1753       0.0010
## 22       0            age           -0.0003         0.0048       0.9514
## 23       0      education            0.0363         0.0656       0.5798
## 24       0 discrimination            0.0269         0.0713       0.7063
rsquared <- as.data.frame(ISCA_modeling_results[2])
head(rsquared)  # Corresponding Adjusted R Squared values per cluster across iterations
##   mean.fit.cluster1 mean.fit.cluster2 mean.fit.cluster3 pooled.model
## 1        0.04760011        0.06899341        0.06387298   0.06104137

In this example, the variables native, female, age, education and discrimination are regressed on religiosity. No weights are specified. Results are averaged across 5 iterations and estimates are calculated for each of the three clusters. Additionally, there is a model 0 which pooled all clusters. Note that unlike in the second function (i.e., ISCA_clustertable()), clusters are not represented as columns but are instead stored in the “cluster” variable (i.e., as rows). This way, users can use the output for post-processing tasks, such as plotting the estimates.

The code below helps to transform the output so that clusters are depicted as columns. This might ease the comparison of the effect coefficients across clusters. Remember that 0 indicates the pooled model.

cluster_as_columns <- estimates %>%
  tidyr::pivot_longer(cols = c(mean_coefficients, mean_std.error, mean_p_value),
               names_to = "Statistics",
               values_to = "value") %>%
  tidyr::pivot_wider(names_from = cluster, values_from = value)

print(cluster_as_columns)
## # A tibble: 18 × 6
##    term           Statistics            `1`     `2`     `3`     `0`
##    <chr>          <chr>               <dbl>   <dbl>   <dbl>   <dbl>
##  1 (Intercept)    mean_coefficients  4.18    4.90    5.63    4.91  
##  2 (Intercept)    mean_std.error     0.187   0.364   0.304   0.521 
##  3 (Intercept)    mean_p_value       0       0       0       0     
##  4 age            mean_coefficients -0.0027 -0.0014  0.0038 -0.0003
##  5 age            mean_std.error     0.0043  0.0041  0.0018  0.0048
##  6 age            mean_p_value       0.650   0.682   0.689   0.951 
##  7 discrimination mean_coefficients  0.208   0.0439 -0.168   0.0269
##  8 discrimination mean_std.error     0.0458  0.0706  0.0617  0.0713
##  9 discrimination mean_p_value       0.141   0.564   0.233   0.706 
## 10 education      mean_coefficients  0.093  -0.0014  0.0264  0.0363
## 11 education      mean_std.error     0.0402  0.0368  0.0274  0.0656
## 12 education      mean_p_value       0.467   0.770   0.803   0.580 
## 13 female         mean_coefficients  0.308   0.922   0.438   0.576 
## 14 female         mean_std.error     0.0648  0.0961  0.135   0.175 
## 15 female         mean_p_value       0.354   0.0012  0.208   0.001 
## 16 native         mean_coefficients -1.33   -1.33   -1.48   -1.37  
## 17 native         mean_std.error     0.0934  0.108   0.151   0.176 
## 18 native         mean_p_value       0.0001  0       0       0

References

Bezdek, James C., Robert Ehrlich, and William Full. 1984. “FCM: The Fuzzy c-Means Clustering Algorithm.” Computers and Geosciences 10 (2-3): 191–203. https://doi.org/10.1016/0098-3004(84)90020-7.
Blau, Peter M. 1974. “MParameters of Social Structure.” American Sociological Review 39: 615–35.
Drouhot, Lucas. 2021. “Cracks in the Melting Pot? Religiosity and Assimilation Among the Diverse Muslim Population in France.” American Journal of Sociology 126 (4): 795–851. https://doi.org/10.1086/712804.
Ferraro, M. B., P. Giordani, and A. Serafini. 2019. “Fclust: An r Package for Fuzzy Clustering.” The R Journal 11. https://journal.r-project.org/archive/2019/RJ-2019-017/RJ-2019-017.pdf.
Kaufman, Leonard, and Peter J. Rousseeuw. 2005. Finding Groups in Data: An Introduction to Cluster Analysis. Hoboken, N.J.: Wiley.
Meyer, David, Evgenia Dimitriadou, Kurt Hornik, Andreas Weingessel, and Friedrich Leisch. 2023. E1071: Misc Functions of the Department of Statistics, Probability Theory Group (Formerly: E1071), TU Wien. https://CRAN.R-project.org/package=e1071.
Molina, Mario, and Filiz Garip. 2019. “Machine Learning for Sociology.” Annual Review of Sociology 45: 27–45.
Paternoster, Raymond, Robert Brame, Paul Mazerolle, and Alex Piquero. 1998. “Using the Correct Statistical Test for the Equality of Regression Coefficients.” Criminology 36 (4): 859–66. https://doi.org/10.1111/j.1745-9125.1998.tb01268.x.

  1. Membership scores are calculated as follows: \(w_{i_j}=\frac{1}{\sum_{k=1}^{c} (|| x_i - c_j || / || x_i - c_k ||)^{2/(m-1)}}\)↩︎

These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.