Introduction
Analysts typically work with raw or unitary (student-level) data, as many have access to student information systems or data warehouses that store information at the student level. Most functions in the DisImpact package are designed with such data structures in mind. However, when analysts collaborate with other data providers or have limited access to data, the data provided are typically summarized or aggregated to protect student privacy. For example, the California Community Colleges Chancellor’s Office (CCCCO) Student Success Metrics (SSM) dashboard allows users to download the data that underlie the visualizations.
This data set is summarized by cohort, outcome, time window, and value, meaning each row corresponds to a data point in a visualization within the dashboard. The DisImpact package allows one to calculate disproportionate impact (DI) for such a data structure using the di_iterate_on_long function, which is very similar to the di_iterate function illustrated in the Scaling DI Calculations vignette.
Load DisImpact and toy data set
First, load the necessary packages.
Second, load a toy data set.
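The code for these two steps is not shown here. A minimal sketch, assuming the toy data set is the ssm_cohort object that ships with DisImpact (the name used throughout the rest of this vignette) and that dplyr supplies the pipe and filtering verbs used later:

```r
# Assumption: both packages are already installed from CRAN.
library(DisImpact)  # di_iterate_on_long() and the ssm_cohort toy data set
library(dplyr)      # %>%, filter(), group_by(), tally()

# Load the toy data set into the workspace and check its dimensions.
data(ssm_cohort)
dim(ssm_cohort)
```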
## [1] 5760 20
cohort | localeName | academicYear | metricID | title | categoryID | disagg1 | subgroup1 | disagg2 | subgroup2 | value | denom | perc | dataType | missingFlag | ferpaFlag | X20 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
After 3 Years | Community College A | 2015 | SM 504C3 | Completed Transfer-Level Math and English | 501 | Age | 19 or Less | Gender | All Other Values | NA | NA | NA | Percent | 1 | 1 | NA |
After 3 Years | Community College A | 2015 | SM 504C3 | Completed Transfer-Level Math and English | 501 | Age | 19 or Less | Gender | Female | 169 | 957 | 0.17659 | Percent | 0 | 0 | NA |
After 3 Years | Community College A | 2015 | SM 504C3 | Completed Transfer-Level Math and English | 501 | Age | 19 or Less | Gender | Male | 182 | 1149 | 0.15840 | Percent | 0 | 0 | NA |
After 3 Years | Community College A | 2015 | SM 504C3 | Completed Transfer-Level Math and English | 501 | Age | 19 or Less | None | None | 353 | 2131 | 0.16565 | Percent | 0 | 0 | NA |
After 3 Years | Community College A | 2015 | SM 504C3 | Completed Transfer-Level Math and English | 501 | Age | 20 to 24 | Gender | All Other Values | NA | NA | NA | Percent | 1 | 1 | NA |
After 3 Years | Community College A | 2015 | SM 504C3 | Completed Transfer-Level Math and English | 501 | Age | 20 to 24 | Gender | Female | NA | NA | NA | Percent | 1 | 1 | NA |
To get a description of each variable, type ?ssm_cohort in the R console.
Select relevant rows
In the following code, we select rows that correspond to the outcomes of interest (categoryLabel), the disaggregations of interest (disagg1), and all non-missing and non-FERPA-suppressed groups:
# Keep the outcomes and first-level disaggregations of interest
d_relevant <- ssm_cohort %>%
  filter(
    categoryLabel %in% c('Completed Both Transfer-Level Math and English Within the District in the First Year Aligned with SCFF'
                         , 'Attained the Vision Goal Definition of Completion'
                         , 'Earned an Associate Degree'
                         , 'Transferred to a Four-Year Postsecondary Institution'
                         )
    , disagg1 %in% c('Ethnicity', 'Foster Youth', 'Veterans')
    , disagg2 == 'None' # There's also Gender
    , missingFlag == 0
    , ferpaFlag == 0
  )
# Count rows for each group within each disaggregation
d_relevant %>%
  group_by(disagg1, subgroup1) %>%
  tally
## # A tibble: 13 x 3
## # Groups: disagg1 [3]
## disagg1 subgroup1 n
## <chr> <chr> <int>
## 1 Ethnicity All Masked Values 14
## 2 Ethnicity Asian 14
## 3 Ethnicity Black or African American 4
## 4 Ethnicity Filipino 14
## 5 Ethnicity Hispanic 14
## 6 Ethnicity Two or More Races 14
## 7 Ethnicity White 14
## 8 Foster Youth All Masked Values 11
## 9 Foster Youth Foster Youth 3
## 10 Foster Youth Not Foster Youth 3
## 11 Veterans All Masked Values 12
## 12 Veterans Not Veteran 2
## 13 Veterans Veteran 1
In the following code, we select rows similar to the previous selection, but with each group in the first level of disaggregation further disaggregated by gender (disagg2):
# Same selection as before, but keep the rows broken out by gender
d_relevant_gender <- ssm_cohort %>%
  filter(
    categoryLabel %in% c('Completed Both Transfer-Level Math and English Within the District in the First Year Aligned with SCFF'
                         , 'Attained the Vision Goal Definition of Completion'
                         , 'Earned an Associate Degree'
                         , 'Transferred to a Four-Year Postsecondary Institution'
                         )
    , disagg1 %in% c('Ethnicity', 'Foster Youth', 'Veterans')
    # , disagg2 == 'None' # There's also Gender
    , disagg2 == 'Gender'
    , missingFlag == 0
    , ferpaFlag == 0
  )
# Count rows for each combination of first- and second-level groups
d_relevant_gender %>%
  group_by(disagg1, subgroup1, disagg2, subgroup2) %>%
  tally
## # A tibble: 21 x 5
## # Groups: disagg1, subgroup1, disagg2 [12]
## disagg1 subgroup1 disagg2 subgroup2 n
## <chr> <chr> <chr> <chr> <int>
## 1 Ethnicity All Masked Values Gender All Other Values 14
## 2 Ethnicity Asian Gender Female 14
## 3 Ethnicity Asian Gender Male 14
## 4 Ethnicity Filipino Gender Female 10
## 5 Ethnicity Filipino Gender Male 7
## 6 Ethnicity Hispanic Gender Female 14
## 7 Ethnicity Hispanic Gender Male 14
## 8 Ethnicity Two or More Races Gender Female 13
## 9 Ethnicity Two or More Races Gender Male 13
## 10 Ethnicity White Gender Female 14
## # ... with 11 more rows
For an ethnicity group like Asian (or any group specified by the disaggregation variable disagg1), the data set d_relevant would have a single row for the group, and the data set d_relevant_gender would have multiple rows, one corresponding to each gender class.
Execute di_iterate_on_long on a data set
Let’s illustrate the di_iterate_on_long function with some key arguments:
data: A data frame for which to iterate DI calculations for a set of variables.
num_var: A variable name (character value) from data where the variable stores success counts (the numerator in success rates). Success rates are calculated by aggregating num_var and denom_var for each unique combination of values in disagg_var_col, group_var_col, disagg_var_col_2, group_var_col_2, cohort_var_col, and summarize_by_vars. If such combinations are unique (single row), then rows are not collapsed.
denom_var: A variable name (character value) from data where the variable stores the group size (the denominator in success rates).
disagg_var_col: A variable name (character value) from data where the variable stores the different disaggregation scenarios. The disaggregation variable could include such values as ‘Ethnicity’, ‘Age Group’, and ‘Foster Youth’, corresponding to three disaggregation scenarios.
group_var_col: A variable name (character value) from data where the variable stores the group name for each group within a level of disaggregation specified in disagg_var_col. For example, the group names could include ‘Asian’, ‘White’, ‘Black’, ‘Latinx’, ‘Native American’, and ‘Other’ for a disaggregation on ethnicity; ‘Under 18’, ‘18-21’, ‘22-25’, and ‘25+’ for an age group disaggregation; and ‘Yes’ and ‘No’ for a foster youth status disaggregation.
disagg_var_col_2: (Optional) A variable name (character value) from data where the variable stores an optional second disaggregation variable, which allows for the intersectionality of variables listed in disagg_var_col and disagg_var_col_2. The second disaggregation variable could describe something not in disagg_var_col, such as ‘Gender’, which would require all groups described in group_var_col to be broken out by gender.
group_var_col_2: (Optional) A variable name (character value) from data where the variable stores the group name for each group within a second level of disaggregation specified in disagg_var_col_2. For example, the group names could include ‘Male’, ‘Female’, ‘Non-binary’, and ‘Unknown’ if ‘Gender’ is a value in the variable disagg_var_col_2.
cohort_var_col: (Optional) A variable name (character value) from data where the variable stores the cohort label for the data described in each row.
summarize_by_vars: (Optional) A character vector of variable names in data for which num_var and denom_var are used for aggregation to calculate success rates for the disproportionate impact (DI) analysis set up by disagg_var_col, group_var_col, disagg_var_col_2, and group_var_col_2. For example, summarize_by_vars=c('Outcome') could specify a single variable/column that describes the outcome or metric in num_var, where the outcome values might include ‘Completion of Transfer-Level Math’, ‘Completion of Transfer-Level English’, ‘Transfer’, and ‘Associate Degree’.
To see the details of these and other arguments, type ?di_iterate_on_long in the R console.
# Example 1: By outcome, cohort
di_summ_1 <- di_iterate_on_long(data=d_relevant
  , num_var='value'
  , denom_var='denom'
  , disagg_var_col='disagg1'
  , group_var_col='subgroup1'
  , cohort_var_col='academicYear'
  , summarize_by_vars=c('categoryLabel', 'cohort')
  , ppg_reference_groups='all but current' # PPG-1
  , di_80_index_reference_groups='all but current' # Relative rates analogous to PPG-1 for reference group
)
## Joining, by = c("cohort", "academicYear", "categoryLabel", "disagg1")
## Joining, by = c("cohort", "academicYear", "categoryLabel", "disagg1", "subgroup1")
## Joining, by = c("categoryLabel", "cohort", "disagg1", "academicYear")
## Joining, by = c("..scenario..", "..group..")
## [1] 120
## [1] 120
## categoryLabel
## 1 Completed Both Transfer-Level Math and English Within the District in the First Year Aligned with SCFF
## 2 Completed Both Transfer-Level Math and English Within the District in the First Year Aligned with SCFF
## 3 Completed Both Transfer-Level Math and English Within the District in the First Year Aligned with SCFF
## 4 Completed Both Transfer-Level Math and English Within the District in the First Year Aligned with SCFF
## 5 Completed Both Transfer-Level Math and English Within the District in the First Year Aligned with SCFF
## 6 Completed Both Transfer-Level Math and English Within the District in the First Year Aligned with SCFF
## cohort disagg1 academicYear subgroup1 n success
## 1 After 3 Years Ethnicity 2015 All Masked Values 132 10
## 2 After 3 Years Ethnicity 2015 Asian 938 115
## 3 After 3 Years Ethnicity 2015 Filipino 97 16
## 4 After 3 Years Ethnicity 2015 Hispanic 903 69
## 5 After 3 Years Ethnicity 2015 Two or More Races 162 21
## 6 After 3 Years Ethnicity 2015 White 1178 133
## pct ppg_reference ppg_reference_group moe pct_lo
## 1 0.07575758 0.1079927 all but current 0.08529805 -0.009540476
## 2 0.12260128 0.1007282 all but current 0.03199813 0.090603145
## 3 0.16494845 0.1050407 all but current 0.09950392 0.065444529
## 4 0.07641196 0.1176705 all but current 0.03261236 0.043799602
## 5 0.12962963 0.1056034 all but current 0.07699607 0.052633558
## 6 0.11290323 0.1034946 all but current 0.03000000 0.082903226
## pct_hi di_indicator_ppg success_needed_not_di_ppg
## 1 0.1610556 0 0
## 2 0.1545994 0 0
## 3 0.2644524 0 0
## 4 0.1090243 1 8
## 5 0.2066257 0 0
## 6 0.1429032 0 0
## success_needed_full_parity_ppg di_prop_index di_indicator_prop_index
## 1 5 0.7097070 1
## 2 0 1.1485450 0
## 3 0 1.5452589 0
## 4 38 0.7158373 1
## 5 0 1.2143875 0
## 6 0 1.0576923 0
## success_needed_not_di_prop_index success_needed_full_parity_prop_index
## 1 2 5
## 2 0 0
## 3 0 0
## 4 11 38
## 5 0 0
## 6 0 0
## di_80_index_reference_group di_80_index di_indicator_80_index
## 1 all but current 0.7015066 1
## 2 all but current 1.2171501 0
## 3 all but current 1.5703282 0
## 4 all but current 0.6493721 1
## 5 all but current 1.2275132 0
## 6 all but current 1.0909091 0
## success_needed_not_di_80_index success_needed_full_parity_80_index
## 1 2 5
## 2 0 0
## 3 0 0
## 4 17 38
## 5 0 0
## 6 0 0
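The six rows printed above can be reproduced by hand, which may help when reading the output columns. The sketch below uses only base R and the n and success counts copied from that output. The margin-of-error formula (1.96 * sqrt(0.25 / n) with a floor of 0.03), the 0.8 cutoffs, and the success-needed calculations are assumptions chosen because they match the printed values; see ?di_iterate_on_long for the package's own definitions.

```r
# Counts copied from the Example 1 output (Ethnicity, 2015 cohort, first outcome).
groups  <- c('All Masked Values', 'Asian', 'Filipino', 'Hispanic',
             'Two or More Races', 'White')
n       <- setNames(c(132, 938, 97, 903, 162, 1178), groups)
success <- setNames(c(10, 115, 16, 69, 21, 133), groups)

# Success rate for each group.
pct <- success / n

# 'all but current' reference: the rate of every other group combined (PPG-1).
ppg_reference <- (sum(success) - success) / (sum(n) - n)

# Assumed margin of error: 1.96 * sqrt(0.25 / n), never smaller than 0.03.
moe <- pmax(1.96 * sqrt(0.25 / n), 0.03)

# PPG flags a group when its rate falls below the reference by more than the MOE.
di_indicator_ppg <- as.integer(pct <= ppg_reference - moe)

# Proportionality index: share of successes divided by share of the cohort.
di_prop_index <- (success / sum(success)) / (n / sum(n))

# 80% index with the 'all but current' reference: ratio of the two rates.
di_80_index <- pct / ppg_reference

# Assumed: additional successes needed to clear the PPG threshold / reach parity.
success_needed_not_di_ppg      <- ceiling(pmax(ppg_reference - moe - pct, 0) * n)
success_needed_full_parity_ppg <- ceiling(pmax(ppg_reference - pct, 0) * n)

data.frame(pct, ppg_reference, moe, di_indicator_ppg, di_prop_index, di_80_index,
           success_needed_not_di_ppg, success_needed_full_parity_ppg)
```

Only the Hispanic group is flagged by PPG-1 here, while both the proportionality index and the 80% index (values below 0.8) also flag All Masked Values, which matches the indicator columns in the output.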
To calculate DI with cohort years collapsed, omit the cohort_var_col argument. Rows that share the same values of disagg_var_col, group_var_col, and the variables in summarize_by_vars are then aggregated (collapsed):
# Example 2: by outcome, collapse cohort academic years
di_summ_2 <- di_iterate_on_long(data=d_relevant
  , num_var='value'
  , denom_var='denom'
  , disagg_var_col='disagg1'
  , group_var_col='subgroup1'
  # , cohort_var_col='academicYear'
  , summarize_by_vars=c('categoryLabel', 'cohort')
  , ppg_reference_groups='all but current'
  , di_80_index_reference_groups='all but current'
)
## Joining, by = c("cohort", "categoryLabel", "disagg1")
## Joining, by = c("cohort", "categoryLabel", "disagg1", "subgroup1")
## Joining, by = c("categoryLabel", "cohort", "disagg1")
## Joining, by = c("..scenario..", "..group..")
## [1] 43
## [1] 120
## categoryLabel
## 1 Completed Both Transfer-Level Math and English Within the District in the First Year Aligned with SCFF
## 2 Completed Both Transfer-Level Math and English Within the District in the First Year Aligned with SCFF
## 3 Completed Both Transfer-Level Math and English Within the District in the First Year Aligned with SCFF
## 4 Completed Both Transfer-Level Math and English Within the District in the First Year Aligned with SCFF
## 5 Completed Both Transfer-Level Math and English Within the District in the First Year Aligned with SCFF
## 6 Completed Both Transfer-Level Math and English Within the District in the First Year Aligned with SCFF
## cohort disagg1 subgroup1 n success pct
## 1 After 3 Years Ethnicity All Masked Values 798 117 0.1466165
## 2 After 3 Years Ethnicity Asian 6391 1316 0.2059146
## 3 After 3 Years Ethnicity Filipino 622 174 0.2797428
## 4 After 3 Years Ethnicity Hispanic 4757 647 0.1360101
## 5 After 3 Years Ethnicity Two or More Races 984 251 0.2550813
## 6 After 3 Years Ethnicity White 6797 1262 0.1856701
## ppg_reference ppg_reference_group moe pct_lo pct_hi
## 1 0.1865634 all but current 0.03469162 0.1119249 0.1813082
## 2 0.1754724 all but current 0.03000000 0.1759146 0.2359146
## 3 0.1820249 all but current 0.03929442 0.2404483 0.3190372
## 4 0.1998851 all but current 0.03000000 0.1060101 0.1660101
## 5 0.1814533 all but current 0.03124126 0.2238400 0.2863226
## 6 0.1846685 all but current 0.03000000 0.1556701 0.2156701
## di_indicator_ppg success_needed_not_di_ppg success_needed_full_parity_ppg
## 1 1 5 32
## 2 0 0 0
## 3 0 0 0
## 4 1 162 304
## 5 0 0 0
## 6 0 0 0
## di_prop_index di_indicator_prop_index success_needed_not_di_prop_index
## 1 0.7925135 1 2
## 2 1.1130399 0 0
## 3 1.5121070 0 0
## 4 0.7351819 1 71
## 5 1.3788032 0 0
## 6 1.0036118 0 0
## success_needed_full_parity_prop_index di_80_index_reference_group di_80_index
## 1 32 all but current 0.7858807
## 2 0 all but current 1.1734871
## 3 0 all but current 1.5368383
## 4 304 all but current 0.6804415
## 5 0 all but current 1.4057685
## 6 0 all but current 1.0054242
## di_indicator_80_index success_needed_not_di_80_index
## 1 1 3
## 2 0 0
## 3 0 0
## 4 1 114
## 5 0 0
## 6 0 0
## success_needed_full_parity_80_index
## 1 32
## 2 0
## 3 0
## 4 304
## 5 0
## 6 0
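Collapsing cohort years amounts to summing the numerator and denominator across rows before computing a rate, which is not the same as averaging the yearly rates. A minimal base R sketch of that idea, using a hypothetical two-row data frame (the counts are made up for illustration and are not from ssm_cohort):

```r
# Hypothetical summarized rows: one group observed in two cohort years.
toy <- data.frame(
  academicYear = c(2015, 2016),
  subgroup1    = c('Group A', 'Group A'),
  value        = c(10, 30),    # success counts (numerator)
  denom        = c(100, 150)   # group sizes (denominator)
)

# Collapse the years: sum the counts within each group, then compute the rate.
collapsed <- aggregate(cbind(value, denom) ~ subgroup1, data = toy, FUN = sum)
collapsed$pct <- collapsed$value / collapsed$denom
collapsed

# For comparison: the unweighted mean of the yearly rates gives a different answer.
mean_of_rates <- mean(toy$value / toy$denom)
mean_of_rates
```

Summing the counts weights each year by its group size, so larger cohorts contribute more to the collapsed rate.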
Second layer of disaggregation / Intersectionality
Sometimes, users may want to incorporate a second layer of disaggregation, intersecting the first variable with a second one (e.g., gender). The Intersectionality vignette discusses this in some detail. One could do this using the second derived data set, d_relevant_gender, which contains summarized data with rows split out by gender:
# Example 3: by outcome, intersecting gender
di_summ_3 <- di_iterate_on_long(data=d_relevant_gender
  , num_var='value'
  , denom_var='denom'
  , disagg_var_col='disagg1'
  , group_var_col='subgroup1'
  , disagg_var_col_2='disagg2'
  , group_var_col_2='subgroup2'
  , cohort_var_col='academicYear'
  , summarize_by_vars=c('categoryLabel', 'cohort')
  , ppg_reference_groups='overall'
  , di_80_index_reference_groups='all but current'
)
## Joining, by = c("cohort", "academicYear", "categoryLabel", "disagg1", "disagg2")
## Joining, by = c("cohort", "academicYear", "categoryLabel", "disagg1", "subgroup1", "disagg2", "subgroup2")
## Joining, by = c("categoryLabel", "cohort", "disagg1", "disagg2", "academicYear")
## Joining, by = c("..scenario..", "..group..")
## [1] 201
## [1] 201
## categoryLabel
## 1 Completed Both Transfer-Level Math and English Within the District in the First Year Aligned with SCFF
## 2 Completed Both Transfer-Level Math and English Within the District in the First Year Aligned with SCFF
## 3 Completed Both Transfer-Level Math and English Within the District in the First Year Aligned with SCFF
## 4 Completed Both Transfer-Level Math and English Within the District in the First Year Aligned with SCFF
## 5 Completed Both Transfer-Level Math and English Within the District in the First Year Aligned with SCFF
## 6 Completed Both Transfer-Level Math and English Within the District in the First Year Aligned with SCFF
## cohort disagg1 disagg2 academicYear subgroup1
## 1 After 3 Years Ethnicity Gender 2015 All Masked Values
## 2 After 3 Years Ethnicity Gender 2015 Asian
## 3 After 3 Years Ethnicity Gender 2015 Asian
## 4 After 3 Years Ethnicity Gender 2015 Hispanic
## 5 After 3 Years Ethnicity Gender 2015 Hispanic
## 6 After 3 Years Ethnicity Gender 2015 Two or More Races
## subgroup2 n success pct ppg_reference ppg_reference_group
## 1 All Other Values 444 46 0.10360360 0.1064011 overall
## 2 Female 490 54 0.11020408 0.1064011 overall
## 3 Male 427 60 0.14051522 0.1064011 overall
## 4 Female 457 34 0.07439825 0.1064011 overall
## 5 Male 438 35 0.07990868 0.1064011 overall
## 6 Male 98 13 0.13265306 0.1064011 overall
## moe pct_lo pct_hi di_indicator_ppg success_needed_not_di_ppg
## 1 0.04650874 0.05709486 0.1501123 0 0
## 2 0.04427189 0.06593219 0.1544760 0 0
## 3 0.04742552 0.09308970 0.1879407 0 0
## 4 0.04584247 0.02855578 0.1202407 0 0
## 5 0.04682621 0.03308246 0.1267349 0 0
## 6 0.09899495 0.03365811 0.2316480 0 0
## success_needed_full_parity_ppg di_prop_index di_indicator_prop_index
## 1 2 0.9737077 0
## 2 0 1.0357416 0
## 3 0 1.3206177 0
## 4 15 0.6992242 1
## 5 12 0.7510134 1
## 6 0 1.2467260 0
## success_needed_not_di_prop_index success_needed_full_parity_prop_index
## 1 0 2
## 2 0 0
## 3 0 0
## 4 6 17
## 5 3 14
## 6 0 0
## di_80_index_reference_group di_80_index di_indicator_80_index
## 1 all but current 0.9700203 0
## 2 all but current 1.0417730 0
## 3 all but current 1.3818822 0
## 4 all but current 0.6691466 1
## 5 all but current 0.7253068 1
## 6 all but current 1.2556108 0
## success_needed_not_di_80_index success_needed_full_parity_80_index
## 1 0 2
## 2 0 0
## 3 0 0
## 4 7 17
## 5 4 14
## 6 0 0
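In the output above, the Hispanic Female row (row 4) is flagged by the proportionality index and the 80% index but not by PPG. The base R sketch below suggests why, using values copied from that row. The margin-of-error formula and the 0.8 cutoff are assumptions that are consistent with the printed values, not a statement of the package's internals.

```r
# Values copied from row 4 of the Example 3 output (Hispanic, Female, 2015).
n             <- 457
success       <- 34
ppg_reference <- 0.1064011   # 'overall' reference rate, as printed
di_prop_index <- 0.6992242   # proportionality index, as printed

pct <- success / n

# Assumed margin of error: 1.96 * sqrt(0.25 / n), never smaller than 0.03.
moe <- max(1.96 * sqrt(0.25 / n), 0.03)

# PPG only flags the group if its rate is below the reference by more than the MOE.
di_indicator_ppg <- as.integer(pct <= ppg_reference - moe)

# Assumed cutoff of 0.8 for the proportionality index.
di_indicator_prop_index <- as.integer(di_prop_index < 0.8)

c(pct = pct, moe = moe, threshold = ppg_reference - moe,
  di_indicator_ppg = di_indicator_ppg,
  di_indicator_prop_index = di_indicator_prop_index)
```

With a group of 457, the margin of error is roughly 4.6 percentage points, so a gap of about 3.2 points from the overall rate does not trigger the PPG flag even though the ratio-based index falls below 0.8.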
Additional Information
For additional illustrations of various parameter changes in di_iterate_on_long, please see the Scaling DI Calculations vignette, as the di_iterate_on_long function is very similar to di_iterate, which is applied to a unitary data set.
Appendix: R and R Package Versions
This vignette was generated using an R session with the following packages. Readers who replicate the code may see some discrepancies caused by version mismatches.
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 17763)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=C
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] knitr_1.33 dplyr_1.0.6 DisImpact_0.0.16
##
## loaded via a namespace (and not attached):
## [1] magrittr_2.0.1 tidyselect_1.1.1 R6_2.5.0 rlang_0.4.11
## [5] fansi_0.5.0 highr_0.9 stringr_1.4.0 tools_4.0.2
## [9] xfun_0.23 utf8_1.2.1 cli_3.1.0 jquerylib_0.1.4
## [13] htmltools_0.5.1.1 ellipsis_0.3.2 yaml_2.2.1 digest_0.6.27
## [17] tibble_3.1.2 lifecycle_1.0.0 crayon_1.4.1 tidyr_1.1.3
## [21] purrr_0.3.4 sass_0.4.0 prettydoc_0.4.1 vctrs_0.3.8
## [25] glue_1.4.2 evaluate_0.14 rmarkdown_2.9 stringi_1.4.6
## [29] compiler_4.0.2 bslib_0.2.5.1 pillar_1.6.1 generics_0.1.0
## [33] jsonlite_1.7.2 pkgconfig_2.0.3