Linking survey data with SGICs (Subject Generated Identification-Codes)? Awesome! Just remember, you need to validate those IDs. That’s how you get clean data and make sure the link-up goes smoothly.
This vignette shows you:
How to perform plausibility checks on different SGIC components.
How to perform plausibility checks on non-SGIC variables that may serve as additional identifiers.
How to detect duplicate cases using a combination of variables as unique identifiers.
To check the plausibility of ID-related variables in a dataset, trustmebro provides several functions beginning with the prefix inspect. Every inspect-function returns a boolean value, indicating whether a value has passed or failed the plausibility check.
We'll start by loading trustmebro and dplyr:
library(trustmebro)
library(dplyr)
The survey data we use is the trustmebro::sailor_students dataset. It contains fictional assessment data for students from the Sailor Moon universe.
sailor_students
#> # A tibble: 12 × 6
#>    sgic        school class  gender  testscore_langauge testscore_calculus
#>    <chr>       <chr>  <chr>  <chr>                <dbl>              <dbl>
#>  1 "MUC__0308" 54321  "3-B " "Male"                 425                394
#>  2 "HÄT 2701"  22345  "2-A"  "???"                 4596                123
#>  3 "MUK3801"   22345  " 2-B" "Femal…               2456               9485
#>  4 "SAM10"     22345  "3-B"  "Femal…               2345                  3
#>  5 "T0601"     65432  "1-C"  "Femal…               1234                 NA
#>  6 " UIT3006 " 12345  "3-3"  <NA>                   123                394
#>  7 "@@@@@@"    <NA>   "3_2 " "Femal…                 56               2938
#>  8 <NA>        12345  "3@41" " Fe…                  986               3948
#>  9 " "         unkown <NA>   "Femal…                284                205
#> 10 "MOA2210"   12345  " "    "Femal…                105                 21
#> 11 "MUK3801"   22345  "2-B"  "Femal…               9586                934
#> 12 "T0601"     65432  "1-C"  "Femal…                 NA                764
The variable sgic stores SGICs created by students. Each SGIC is a seven-character string created according to the following instructions:
Characters 1-3 (letters):
  First letter of given name (1st character)
  Last letter of given name (2nd character)
  First letter of family name (3rd character)
Characters 4-7 (digits):
  Day of birth (4th and 5th characters)
  Month of birth (6th and 7th characters)
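The construction rules above can be sketched in base R. Note that make_sgic is a hypothetical helper written for illustration and is not part of trustmebro; the upper-casing and zero-padding are assumptions based on the IDs shown in the dataset.

```r
# Hypothetical helper (not part of trustmebro): builds an SGIC from
# a name and a day/month of birth, following the rules listed above.
make_sgic <- function(given_name, family_name, birth_day, birth_month) {
  given <- toupper(given_name)    # assumption: letters are stored in upper case
  family <- toupper(family_name)
  paste0(
    substr(given, 1, 1),                        # first letter of given name
    substr(given, nchar(given), nchar(given)),  # last letter of given name
    substr(family, 1, 1),                       # first letter of family name
    sprintf("%02d", as.integer(birth_day)),     # day of birth, zero-padded
    sprintf("%02d", as.integer(birth_month))    # month of birth, zero-padded
  )
}

make_sgic("Usagi", "Tsukino", 30, 6)
#> [1] "UIT3006"
```

A student who follows the instructions correctly would produce an ID of this shape, which is what the plausibility checks in this vignette test for.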
We can use trustmebro::inspect_characterid to check if the provided SGICs adhere to the expected pattern of three letters followed by four digits. The expected structure can be defined using the regular expression "^[A-Za-z]{3}[0-9]{4}$", which we can then pass to the function using the pattern = argument. For seamless integration into your data workflow, this function can be conveniently combined with dplyr::mutate:
sailor_students %>%
  mutate(structure_check =
           inspect_characterid(
             sgic, pattern = "^[A-Za-z]{3}[0-9]{4}$")) %>%
  select(sgic, structure_check)
#> # A tibble: 12 × 2
#>    sgic        structure_check
#>    <chr>       <lgl>
#>  1 "MUC__0308" FALSE
#>  2 "HÄT 2701"  FALSE
#>  3 "MUK3801"   TRUE
#>  4 "SAM10"     FALSE
#>  5 "T0601"     FALSE
#>  6 " UIT3006 " FALSE
#>  7 "@@@@@@"    FALSE
#>  8 <NA>        FALSE
#>  9 " "         FALSE
#> 10 "MOA2210"   TRUE
#> 11 "MUK3801"   TRUE
#> 12 "T0601"     FALSE
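To see why individual IDs pass or fail, the same regular expression can be tried out with base R's grepl. This is only a sketch of the matching logic, not the package's implementation; the trimws step is an assumed cleaning choice that shows how surrounding whitespace alone can cause a failure.

```r
pattern <- "^[A-Za-z]{3}[0-9]{4}$"
ids <- c("MUK3801", "SAM10", " UIT3006 ", "MUC__0308", NA)

# grepl() returns FALSE for NA, so missing IDs fail the check
raw_check <- grepl(pattern, ids)
raw_check
#> [1]  TRUE FALSE FALSE FALSE FALSE

# Assumption: stripping leading/trailing whitespace is acceptable cleaning
trimmed_check <- grepl(pattern, trimws(ids))
trimmed_check
#> [1]  TRUE FALSE  TRUE FALSE FALSE
```

The anchors ^ and $ matter here: without them, any string that merely contains three letters followed by four digits would pass.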
We created trustmebro::inspect_characterid with SGICs in mind, but of course, any other non-SGIC strings can also be checked using a specified regular expression.
Since the SGIC should end with a date of birth, you can verify the plausibility of this date of birth using trustmebro::inspect_birthdaymonth. This function checks if a string contains exactly four digits representing a valid date of birth. As before, you can combine trustmebro::inspect_birthdaymonth with dplyr::mutate to generate a plausibility check variable:
sailor_students %>%
  mutate(birthdate_check =
           inspect_birthdaymonth(sgic)) %>%
  select(sgic, birthdate_check)
#> # A tibble: 12 × 2
#>    sgic        birthdate_check
#>    <chr>       <lgl>
#>  1 "MUC__0308" TRUE
#>  2 "HÄT 2701"  TRUE
#>  3 "MUK3801"   FALSE
#>  4 "SAM10"     FALSE
#>  5 "T0601"     TRUE
#>  6 " UIT3006 " TRUE
#>  7 "@@@@@@"    FALSE
#>  8 <NA>        FALSE
#>  9 " "         FALSE
#> 10 "MOA2210"   TRUE
#> 11 "MUK3801"   FALSE
#> 12 "T0601"     TRUE
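The logic of such a check can be approximated in base R as follows. This is a rough sketch, not the package's implementation: check_daymonth is a hypothetical helper, and it assumes that the digits are extracted from the string, that exactly four must remain, and that a day of 1-31 and a month of 1-12 count as valid (month-specific day limits are not considered).

```r
# Hypothetical helper (not part of trustmebro)
check_daymonth <- function(x) {
  digits <- gsub("[^0-9]", "", x)                # keep only the digits
  ok_length <- !is.na(x) & nchar(digits) == 4    # exactly four digits required
  day <- suppressWarnings(as.integer(substr(digits, 1, 2)))
  month <- suppressWarnings(as.integer(substr(digits, 3, 4)))
  ok_length & !is.na(day) & !is.na(month) &
    day >= 1 & day <= 31 & month >= 1 & month <= 12
}

check_daymonth(c("MUC__0308", "MUK3801", "SAM10", NA, "T0601"))
#> [1]  TRUE FALSE FALSE FALSE  TRUE
```

"MUK3801" fails because 38 is not a valid day, while "SAM10" fails because it contains only two digits.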
Some SGICs contain only the day or only the month a person was born. In this case, you can use trustmebro::inspect_birthday or trustmebro::inspect_birthmonth accordingly.
Besides an SGIC, other variables in a given dataset might be used to identify cases. As mentioned above, trustmebro::inspect_characterid can be used for any string that should follow a specific pattern. Furthermore, this package also provides functions for checking other data types beyond strings.
We can use trustmebro::inspect_numberid to check if a number matches an expected length. In our dataset, school should be a five-digit number. Combined with dplyr::mutate, we can add a plausibility variable for the school number, just as we did before:
sailor_students %>%
  mutate(school_check =
           inspect_numberid(school, 5)) %>%
  select(school, school_check)
#> # A tibble: 12 × 2
#>    school school_check
#>    <chr>  <lgl>
#>  1 54321  TRUE
#>  2 22345  TRUE
#>  3 22345  TRUE
#>  4 22345  TRUE
#>  5 65432  TRUE
#>  6 12345  TRUE
#>  7 <NA>   FALSE
#>  8 12345  TRUE
#>  9 unkown FALSE
#> 10 12345  TRUE
#> 11 22345  TRUE
#> 12 65432  TRUE
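A comparable length check can be written in base R with a regular expression. This is a minimal sketch under the assumption that "number of expected length" means a string consisting of exactly n digits; has_n_digits is a hypothetical helper, not a trustmebro function.

```r
# Hypothetical helper (not part of trustmebro)
has_n_digits <- function(x, n) {
  x <- as.character(x)
  # Build a pattern such as "^[0-9]{5}$" and treat NA as a failed check
  !is.na(x) & grepl(paste0("^[0-9]{", n, "}$"), x)
}

has_n_digits(c("54321", "unkown", NA, "1234"), 5)
#> [1]  TRUE FALSE FALSE FALSE
```

Keeping the identifier as a character vector, as in sailor_students, avoids surprises from numeric formatting when values are converted to strings.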
In the process of using non-SGIC variables as identifiers, categorical data is often recoded to ensure consistency within a workflow. We can use trustmebro::inspect_valinvec to check if a value exists in a recode map. The recode map should be a named vector, where the names represent the keys. In our dataset, we want to inspect whether all values in gender conform to the recode map stored in recode_gender.
The function checks if a value is present as a key. Combine with dplyr::mutate to add a variable that contains the check results:
sailor_students %>%
  mutate(gender_check =
           inspect_valinvec(gender, recode_gender)) %>%
  select(gender, gender_check)
#> # A tibble: 12 × 2
#>    gender    gender_check
#>    <chr>     <lgl>
#>  1 "Male"    TRUE
#>  2 "???"     FALSE
#>  3 "Female"  TRUE
#>  4 "Female " FALSE
#>  5 "Female"  TRUE
#>  6 <NA>      FALSE
#>  7 "Female"  TRUE
#>  8 " Female" FALSE
#>  9 "Female"  TRUE
#> 10 "Female"  TRUE
#> 11 "Female"  TRUE
#> 12 "Female"  TRUE
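The key lookup itself can be illustrated in base R. The map below is hypothetical: the keys "Male" and "Female" are consistent with the results shown above, but the recoded values "M" and "F" are assumptions made purely for this sketch.

```r
# Hypothetical recode map: names are the keys, values are the recoded labels
recode_gender <- c(Male = "M", Female = "F")

gender <- c("Male", "???", "Female", "Female ", NA, " Female")

# A value passes if it appears among the names (keys) of the map
gender_check <- gender %in% names(recode_gender)
gender_check
#> [1]  TRUE FALSE  TRUE FALSE FALSE FALSE

# Values that are not valid keys become NA when the map is applied
recoded <- unname(recode_gender[gender])
```

The check is an exact string match, which is why "Female " and " Female" fail even though they differ from a valid key only by whitespace.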
So far, we’ve checked if sgic, school and gender contain plausible values. Last, we want to ensure that these variables, when used together as identifiers, uniquely identify a single case and that there are no duplicate entries based on these variables. trustmebro::find_dupes checks whether the combination of identifiers is unique by adding a has_dupes variable to the dataset. To find duplicates in your data, use it like this:
sailor_students %>% find_dupes(school, sgic, gender) %>%
  select(school, sgic, gender, has_dupes)
#> # A tibble: 12 × 4
#>    school sgic        gender    has_dupes
#>    <chr>  <chr>       <chr>     <lgl>
#>  1 54321  "MUC__0308" "Male"    FALSE
#>  2 22345  "HÄT 2701"  "???"     FALSE
#>  3 22345  "MUK3801"   "Female"  TRUE
#>  4 22345  "SAM10"     "Female " FALSE
#>  5 65432  "T0601"     "Female"  TRUE
#>  6 12345  " UIT3006 " <NA>      FALSE
#>  7 <NA>   "@@@@@@"    "Female"  FALSE
#>  8 12345  <NA>        " Female" FALSE
#>  9 unkown " "         "Female"  FALSE
#> 10 12345  "MOA2210"   "Female"  FALSE
#> 11 22345  "MUK3801"   "Female"  TRUE
#> 12 65432  "T0601"     "Female"  TRUE
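For comparison, a similar flag can be computed in base R with duplicated. This is a sketch of the general technique on a small made-up data frame, not a description of how find_dupes is implemented.

```r
# Small illustrative data frame (values modelled on sailor_students)
students <- data.frame(
  school = c("22345", "22345", "65432", "22345"),
  sgic   = c("MUK3801", "SAM10", "T0601", "MUK3801"),
  gender = c("Female", "Female", "Female", "Female"),
  stringsAsFactors = FALSE
)

keys <- students[, c("school", "sgic", "gender")]

# duplicated() flags only later repeats, so scan from both ends
# to mark every row that shares its key combination with another row
students$has_dupes <- duplicated(keys) | duplicated(keys, fromLast = TRUE)
students$has_dupes
#> [1]  TRUE FALSE FALSE  TRUE
```

Flagging all members of a duplicate group, rather than only the repeats, makes it easier to review the affected cases side by side before deciding which one to keep.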