Title: | National Institutes of Health Brain Development Cohorts Data Hub Tools |
Description: | A suite of functions to work with data from the National Institutes of Health Brain Development Cohorts Data Hub. The package provides tools to create, clean, process, and filter datasets and associated metadata. These utilities are intended to simplify reproducible data-preparation for future research. |
URL: | https://software.nbdc-datahub.org/NBDCtools/ |
Version: | 1.0.1 |
Depends: | R (≥ 4.3.0) |
Imports: | arrow, chk, cli, dplyr, glue, magrittr, purrr, readr, stringr, sjlabelled, jsonlite, hms, tidyr, rlang, utils, tibble, stats, sjmisc, haven |
License: | GPL (≥ 3) |
Encoding: | UTF-8 |
RoxygenNote: | 7.3.2 |
Suggests: | testthat (≥ 3.0.0), usethis, knitr, rmarkdown, naniar |
Config/testthat/edition: | 3 |
Config/Needs/website: | rmarkdown |
LazyData: | true |
VignetteBuilder: | knitr |
NeedsCompilation: | no |
Packaged: | 2025-09-05 23:14:46 UTC; lz |
Author: | Janosch Linkersdoerfer |
Maintainer: | Janosch Linkersdoerfer <dairc.service@gmail.com> |
Repository: | CRAN |
Date/Publication: | 2025-09-10 08:40:02 UTC |
NBDCtools: National Institutes of Health Brain Development Cohorts Data Hub Tools
Description
A suite of functions to work with data from the National Institutes of Health Brain Development Cohorts Data Hub. The package provides tools to create, clean, process, and filter datasets and associated metadata. These utilities are intended to simplify reproducible data-preparation for future research.
Author(s)
Maintainer: Janosch Linkersdoerfer dairc.service@gmail.com (ORCID)
Authors:
Le Zhang lezhang100@gmail.com (ORCID)
See Also
Useful links:
Benchmark Models
Description
Benchmark Models
Usage
benchmark_models
Format
A list of benchmark models to estimate time and memory usage for loading data.
time_small: A model for estimating time for small datasets (n_var < 1000).
time_large: A model for estimating time for larger datasets (n_var >= 1000).
ram: A model for estimating RAM usage based on the number of variables.
Internal use only: This dataset is used internally by some functions and used in the package vignettes. It is not intended for direct use by the end user.
Convert column names in a data frame
Description
This function renames columns in a data frame to another type of column name specified in the data dictionary.
For example, this can be used to convert the ABCD column names introduced in
the 6.0 release to the previously used column names. If you instead want to
convert the column names in a file, use convert_names_file().
Note: Please use this function with caution and make sure that the data in
the converted column is equivalent to the data in the original column. Also,
please make sure that the names can be mapped one-to-one. Some variables in
the ABCD data dictionary have been collapsed from previous releases and thus
might have multiple names in the name_to column that map to a single name
(see skip_sep_check argument below).
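The renaming described above can be pictured as a lookup in the data dictionary. The following base R sketch is illustrative only and is not the package's implementation; the toy dictionary and the column names new_a, new_b, old_a, and old_b are hypothetical.

```r
# Toy stand-in for a data dictionary with two name types (hypothetical names)
dd <- data.frame(
  name     = c("new_a", "new_b"),
  name_nda = c("old_a", "old_b")
)
data <- data.frame(participant_id = "sub-001", new_a = 1, new_b = 2)
ignore_cols <- "participant_id"

# Look up each column in the "name" column of the dictionary
idx <- match(names(data), dd$name)

# Rename only columns that are found in the dictionary and are not ignored
rename <- !is.na(idx) & !(names(data) %in% ignore_cols)
names(data)[rename] <- dd$name_nda[idx[rename]]

names(data)
```

Because the lookup assumes one target name per source name, a dictionary entry that holds several names for one variable would need to be resolved before renaming.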
Usage
convert_names_data(
data,
dd,
name_from = "name",
name_to,
ignore_cols = union(get_id_cols_abcd(), get_id_cols_hbcd()),
skip_sep_check = FALSE
)
Arguments
data |
tibble. The input data frame with columns to be renamed. |
dd |
tibble. The data dictionary table. One can use |
name_from |
character. The column name type in the data dictionary that
the columns in |
name_to |
character. The column name type in the data dictionary
that the columns in |
ignore_cols |
character vector. The columns to ignore (Default: identifier columns used in ABCD and HBCD). |
skip_sep_check |
logical. Whether to skip the check for
For columns with multiple names, it is recommended to use functions like
If |
Value
tibble. The data with renamed column names.
Examples
## Not run:
# rename columns to previous ABCD names used by NDA
convert_names_data(
data,
dd = get_dd("abcd"),
name_from = "name",
name_to = "name_nda"
)
# rename columns to Stata names
convert_names_data(
data,
dd = get_dd("abcd"),
name_from = "name",
name_to = "name_stata"
)
## End(Not run)
Convert column names in a file
Description
This function replaces all matched column names in a file with another type of column name specified in the data dictionary.
For example, this function can be used to convert script files that specify
previously used column names to the ABCD column names introduced in the
6.0 release. If you instead want to convert the column names in a data frame,
use convert_names_data().
Note: Please use this function with caution and make sure that the data in
the converted column is equivalent to the data in the original column. Also,
please make sure that the names can be mapped one-to-one. Some variables in
the ABCD data dictionary have been collapsed from previous releases and thus
might have multiple names in the name_from column that map to a single name
(see skip_sep_check argument below).
Usage
convert_names_file(
file_in,
file_out = NULL,
dd,
name_from,
name_to,
skip_sep_check = FALSE
)
Arguments
file_in |
character. The input file path. |
file_out |
character. The output file path. If not provided, defaults to the input file path with a "_converted" suffix. |
dd |
tibble. The data dictionary table. One can use |
name_from |
character. The column name type in the data dictionary that
the columns in |
name_to |
character. The column name type in the data dictionary
that the columns in |
skip_sep_check |
logical. Whether to skip the check for
For columns with multiple names, it is recommended to use functions like
If |
Details
Word matching
The function uses regex word boundaries (\\b) to match the names in the file
and to ensure exact word matches. This prevents partial matches within larger
words. For example, matching "age" will not match "cage" or "page".
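As a minimal illustration of word-boundary matching, the base R sketch below replaces only whole-word occurrences of a name. It is not the package's code; the replacement name new_age is hypothetical.

```r
# A line of script text that contains "age" both as a word and inside words
script_line <- "cage <- age + page; mean(age)"

# \\b marks a word boundary, so only the standalone "age" is matched
converted <- gsub("\\bage\\b", "new_age", script_line, perl = TRUE)

converted
```

Note that regex word characters include the underscore, so a name such as age_years would not be touched by this pattern either.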
Speed
The data dictionary returned by get_dd() is large, and the function
loops through all the names it contains.
If there are only a few names to replace,
it is best to trim the data dictionary to only those names
before using this function.
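Trimming the data dictionary can be done with ordinary subsetting. The sketch below uses a toy data frame as a stand-in for the output of get_dd(); the variable names are hypothetical, and only the column names name and name_nda are taken from the examples in this manual.

```r
# Toy stand-in for a data dictionary (hypothetical variable names)
dd <- data.frame(
  name     = c("new_a", "new_b", "new_c"),
  name_nda = c("old_a", "old_b", "old_c")
)

# Names that actually occur in the file to be converted
names_needed <- c("old_a", "old_c")

# Keep only the dictionary rows for those names
dd_small <- dd[dd$name_nda %in% names_needed, ]

nrow(dd_small)
```

The trimmed table could then be supplied as the dd argument in place of the full data dictionary.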
Value
character. The path to the output file with converted names, returned invisibly.
Examples
## Not run:
convert_names_file(
file_in = "analysis_script.R",
dd = get_dd("abcd"),
name_from = "name_nda",
name_to = "name"
)
# Specify custom output file
convert_names_file(
file_in = "analysis_script.py",
file_out = "analysis_script_new.py",
dd = get_dd("abcd"),
name_from = "name_nda",
name_to = "name"
)
## End(Not run)
Create BIDS sidecar
Description
Creates a Brain Imaging Data Structure (BIDS) JSON sidecar file from the metadata (data dictionary and levels table). Returns the JSON object or writes it to a file.
Usage
create_bids_sidecar(
data,
study,
release = "latest",
var_coding = "values",
metadata_description = "Dataset exported using NBDCtools",
path_out = NULL,
pretty = TRUE
)
Arguments
data |
tibble. The raw data or data with labels, see
|
study |
character. NBDC study (One of |
release |
character. Release version (Default: |
var_coding |
character. the variable coding, one of "values", "labels".
If the data is processed with |
metadata_description |
string, the description of the metadata |
path_out |
character. the path to the output file.
If |
pretty |
logical. Whether to pretty print the json. |
Value
the json object or the path to the json file
Examples
## Not run:
# study has no default value and must be supplied
data |> create_bids_sidecar(study = "abcd")
data |> create_bids_sidecar(study = "abcd", path_out = "data.json")
## End(Not run)
Create a dataset
Description
This high-level function simplifies the process of creating a dataset from
the ABCD or HBCD Study data by allowing users to create an analysis-ready
dataset in a single step. It executes the lower-level functions provided in
the NBDCtools package in sequence to load, join, and transform the data.
The function expects study data to be stored as one .parquet or .tsv file
per database table within a specified directory, provided as dir_data.
Variables specified in vars and tables will be full-joined together,
while variables specified in vars_add and tables_add will be left-joined
to these variables. For more details, see join_tabulated().
In addition to the main create_dataset() function, there are two
study-specific variations:
- create_dataset_abcd(): for the ABCD study.
- create_dataset_hbcd(): for the HBCD study.
They have the same arguments as the create_dataset() function, except
that the study argument is set to the respective study by default, and
should not be set by the user.
Usage
create_dataset(
dir_data,
study,
vars = NULL,
tables = NULL,
vars_add = NULL,
tables_add = NULL,
release = "latest",
format = "parquet",
bypass_ram_check = FALSE,
categ_to_factor = TRUE,
add_labels = TRUE,
value_to_label = FALSE,
value_to_na = FALSE,
time_to_hms = FALSE,
bind_shadow = FALSE,
...
)
create_dataset_abcd(...)
create_dataset_hbcd(...)
Arguments
dir_data |
character. Path to the directory with the data files in
|
study |
character. NBDC study (One of |
vars |
character (vector). Name(s) of variable(s) to be joined.
(Default: |
tables |
character (vector). Name(s) of table(s) to be joined (Default:
|
vars_add |
character (vector). Name(s) of additional variable(s) to be
left-joined to the variables selected in |
tables_add |
character (vector). Name(s) of additional table(s) to be
left-joined to the variables selected in |
release |
character. Release version (Default: |
format |
character. Data format (One of |
bypass_ram_check |
logical. If This argument is only used for the ABCD study, as the HBCD data is small enough to be loaded without RAM issues with most personal computers. As HBCD data grows in the future, this may change. |
categ_to_factor |
logical. Whether to convert categorical
variables to factors class, see |
add_labels |
logical. Whether to add variable and value labels to the
variables, see |
value_to_label |
logical. Whether to convert the categorical
variables' numeric values to labels, see |
value_to_na |
logical. Whether to convert categorical
missingness/non-response codes to |
time_to_hms |
logical. Whether to convert time variables to
|
bind_shadow |
logical. Whether to bind the shadow matrix to the
dataset (Default: |
... |
additional arguments passed to downstream functions after
the |
Details
Order
This high-level function executes the different steps in the following order:
1. Read the data/shadow matrix using join_tabulated().
2. Convert categorical variables to factors using transf_factor().
3. Add labels to the variables and values using transf_label().
4. Convert categorical variables' numeric values to labels using transf_value_to_label().
5. Convert categorical missingness/non-response codes to NA using transf_value_to_na().
6. Convert time variables to hms class using transf_time_to_hms().
7. If bind_shadow and the study is "HBCD", replace the missing values in the shadow due to joining multiple datasets using shadow_replace_binding_missing().
8. Bind the shadow matrix to the data using shadow_bind_data().
Not all steps are executed by default. The above order represents the maximal order of execution.
bind_shadow
If bind_shadow is TRUE, the shadow matrix will be added to the data using
shadow_bind_data().
- HBCD study: For the HBCD study, this function uses the shadow matrix from the dir_data directory by default (the HBCD Study releases a _shadow.parquet/_shadow.tsv file per table that accompanies the data). Alternatively, one can set naniar_shadow = TRUE as part of the ... arguments to use naniar::as_shadow() to create a shadow matrix from the data.
- ABCD study: The ABCD Study does not currently release shadow matrices. If bind_shadow is set to TRUE, the function will create the shadow matrix from the data using naniar::as_shadow(); no extra naniar_shadow = TRUE argument is needed.
Value
A tibble with the analysis-ready dataset.
Examples
## Not run:
# most common use case
create_dataset(
dir_data = "6_0/data",
study = "abcd",
vars = c("var1", "var2", "var3")
)
# to handle tagged missingness
create_dataset(
dir_data = "1_0/data",
study = "hbcd",
vars = c("var1", "var2", "var3"),
value_to_na = TRUE
)
# to bind shadow matrices to the data
create_dataset(
dir_data = "1_0/data/",
study = "hbcd",
vars = c("var1", "var2", "var3"),
bind_shadow = TRUE
)
# to use the additional arguments
# for example in `value_to_na` option, the underlying function
# `transf_value_to_na()` has 2 more arguments,
# which can be passed to the `create_dataset()` function
create_dataset(
dir_data = "6_0/data",
study = "abcd",
vars = c("var1", "var2", "var3"),
value_to_na = TRUE,
missing_codes = c("999", "888", "777", "666", "555", "444", "333", "222"),
ignore_col_pattern = "__dk$|__dk__l$"
)
# use study specific functions
create_dataset_abcd(
dir_data = "6_0/data",
vars = c("var1", "var2", "var3")
)
## End(Not run)
Filter empty columns
Description
This function filters out columns that are empty.
Usage
filter_empty_cols(
data,
id_cols = union(get_id_cols_abcd(), get_id_cols_hbcd())
)
Arguments
data |
tibble. The data to be filtered. |
id_cols |
character (vector). The names of the ID columns to be excluded from the filtering (Default: identifier columns used in ABCD and HBCD). |
Value
A tibble with the filtered data.
Examples
data <- tibble::tibble(
participant_id = c("sub-001", "sub-002", "sub-003"),
session_id = c("ses-001", "ses-001", "ses-002"),
var1 = c(NA, NA, NA),
var2 = c(NA, NA, 2),
var3 = c(NA, NA, 3)
)
filter_empty_cols(data)
Filter empty rows
Description
This function filters out rows that are empty
Usage
filter_empty_rows(
data,
id_cols = union(get_id_cols_abcd(), get_id_cols_hbcd())
)
Arguments
data |
tibble. The data to be filtered. |
id_cols |
character (vector). The names of the ID columns to be excluded from the filtering (Default: identifier columns used in ABCD and HBCD). |
Value
A tibble with the filtered data.
Examples
data <- tibble::tibble(
participant_id = c("sub-001", "sub-002", "sub-003"),
session_id = c("ses-001", "ses-001", "ses-002"),
var1 = c(NA, NA, 1),
var2 = c(NA, NA, 2),
var3 = c(NA, NA, 3)
)
filter_empty_rows(data)
Filter ABCD events
Description
Given a (set of) condition(s), filters the events included in an ABCD dataset. Conditions can be specified as a vector of strings, where each string can be one of the following conditions:
- "core": events for the ABCD core study
- "annual": annual events for the ABCD core study
- "mid_year": mid-year events for the ABCD core study
- "substudy": events for ABCD substudies
- "covid": events for the COVID substudy
- "sdev": events for the Social Development substudy
- "even": even-numbered events
- "odd": odd-numbered events
- numerical expressions like >2 or <=5 to filter events by number
- any other string to be used as filter for the session_id column
The conditions can be combined with logical "and" or "or".
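To make the "and"/"or" combination concrete, the following base R sketch evaluates two conditions on a vector of session IDs and combines them both ways. It is a rough illustration under stated assumptions, not the package's implementation: it assumes that annual core events end in "A" and that the event number is the two digits after "ses-", as suggested by the example data in this manual.

```r
session_id <- c("ses-00A", "ses-01M", "ses-02A", "ses-03A", "ses-C01")

# Assumption: annual core events look like "ses-<two digits>A"
is_annual <- grepl("^ses-[0-9]{2}A$", session_id)

# Assumption: the event number is the two digits after "ses-";
# IDs that do not follow this pattern (e.g. substudy events) become NA
event_num <- suppressWarnings(
  as.integer(sub("^ses-([0-9]{2})[A-Z]$", "\\1", session_id))
)
is_gt_2 <- !is.na(event_num) & event_num > 2

# "and": a row must satisfy every condition
keep_and <- session_id[is_annual & is_gt_2]

# "or": a row must satisfy at least one condition
keep_or <- session_id[is_annual | is_gt_2]
```

With "and", only events that are both annual and numbered above 2 remain; with "or", every annual event is kept regardless of its number.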
Usage
filter_events_abcd(data, conditions, connect = "and")
Arguments
data |
tibble. The data to be filtered. |
conditions |
character (vector). The events to keep. |
connect |
character. Whether to connect the conditions with |
Value
A tibble with the filtered data.
Examples
data <- tibble::tribble(
~session_id, ~study, ~type,
"ses-00S", "core", "screener",
"ses-00M", "core", "mid-year",
"ses-00A", "core", "even",
"ses-01M", "core", "mid-year",
"ses-01A", "core", "odd",
"ses-02M", "core", "mid-year",
"ses-02A", "core", "even",
"ses-03M", "core", "mid-year",
"ses-03A", "core", "odd",
"ses-04M", "core", "mid-year",
"ses-04A", "core", "even",
"ses-05M", "core", "mid-year",
"ses-05A", "core", "odd",
"ses-06M", "core", "mid-year",
"ses-06A", "core", "even",
"ses-C01", "substudy", "covid",
"ses-C02", "substudy", "covid",
"ses-C03", "substudy", "covid",
"ses-C04", "substudy", "covid",
"ses-C05", "substudy", "covid",
"ses-C06", "substudy", "covid",
"ses-C07", "substudy", "covid",
"ses-S01", "substudy", "sdev",
"ses-S02", "substudy", "sdev",
"ses-S03", "substudy", "sdev",
"ses-S04", "substudy", "sdev",
"ses-S05", "substudy", "sdev"
)
# ABCD core study events
filter_events_abcd(data, c("core"))
# COVID substudy events
filter_events_abcd(data, c("covid"))
# imaging events
filter_events_abcd(data, c("annual", "even"))
# mid-years before year 5
filter_events_abcd(data, c("mid_year", "<5"))
# COVID or Social Development substudy events
filter_events_abcd(data, c("covid", "sdev"), connect = "or")
Filter ID/events
Description
Given a vector of ID/events (concatenated like
"{participant_id}_{session_id}"), or a dataframe
with participant_id and session_id columns,
this function filters the data to keep or alternatively
remove the rows for the given ID/events.
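The matching logic can be sketched in base R by building the concatenated key for every row and testing membership. This is an illustrative sketch only and does not claim to reproduce the package's internals.

```r
data <- data.frame(
  participant_id = c("sub-001", "sub-001", "sub-002"),
  session_id     = c("ses-001", "ses-002", "ses-001")
)
id_events <- c("sub-001_ses-002")

# Build "{participant_id}_{session_id}" for each row
key <- paste(data$participant_id, data$session_id, sep = "_")

# Keep matching rows ...
kept <- data[key %in% id_events, ]

# ... or remove them instead (the behaviour described for revert = TRUE)
removed <- data[!key %in% id_events, ]
```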
Usage
filter_id_events(data, id_events, revert = FALSE)
Arguments
data |
tibble. The data to be filtered. |
id_events |
character (vector) or dataframe. (Vector of) ID/event(s)
or a dataframe with |
revert |
logical. Whether to revert the filter, i.e., to keep only rows
NOT matching the |
Value
A tibble with the filtered data.
Examples
data <- tibble::tribble(
~participant_id, ~session_id,
"sub-001", "ses-001",
"sub-001", "ses-002",
"sub-002", "ses-001",
"sub-002", "ses-002",
"sub-003", "ses-001",
"sub-003", "ses-002"
)
# filter using a vector of ID/events
filter_id_events(
data,
id_events = c("sub-001_ses-001", "sub-003_ses-002")
)
# filter using a dataframe with participant_id and session_id
data_filter <- tibble::tibble(
participant_id = c("sub-001", "sub-003"),
session_id = c("ses-001", "ses-002")
)
filter_id_events(
data,
id_events = data_filter
)
# revert filter
filter_id_events(
data,
id_events = c("sub-001_ses-001", "sub-003_ses-002"),
revert = TRUE
)
Get data dictionary
Description
Retrieves data dictionary for a given study and release version. Allows for
filtering by variables and tables. Wrapper around get_metadata().
In addition to the main get_dd() function, there are two
study-specific variations:
- get_dd_abcd(): for the ABCD study.
- get_dd_hbcd(): for the HBCD study.
They have the same arguments as the get_dd() function, except
that the study argument is set to the respective study by default, and
should not be set by the user.
Usage
get_dd(study, release = "latest", vars = NULL, tables = NULL)
get_dd_abcd(...)
get_dd_hbcd(...)
Arguments
study |
character. The study name. One of "abcd" or "hbcd". |
release |
character. Release version (Default: |
vars |
character (vector). Vector with the names of variables to be included. |
tables |
character (vector). Vector with the names of tables to be included. |
... |
Additional arguments passed to the underlying
|
Value
Data frame with the data dictionary.
Examples
get_dd("abcd")
get_dd("hbcd", release = "1.0")
get_dd("abcd", vars = c("ab_g_dyn__visit_dtt", "ab_g_dyn__visit_age"))
get_dd("abcd", tables = "ab_g_dyn")
get_dd_abcd()
get_dd_hbcd(release = "1.0")
Get identifier columns
Description
Retrieves the identifier columns for a given study and release version.
In addition to the main get_id_cols() function, there are two
study-specific variations:
- get_id_cols_abcd(): for the ABCD study.
- get_id_cols_hbcd(): for the HBCD study.
They have the same arguments as the get_id_cols() function, except
that the study argument is set to the respective study by default, and
should not be set by the user.
Usage
get_id_cols(study, release = "latest")
get_id_cols_abcd(...)
get_id_cols_hbcd(...)
Arguments
study |
character. The study name. One of "abcd" or "hbcd". |
release |
character. Release version (Default: |
... |
Additional arguments passed to the underlying
|
Value
character vector with the identifier columns.
Examples
get_id_cols("abcd")
get_id_cols("hbcd")
get_id_cols_abcd(release = "6.0")
get_id_cols_hbcd(release = "1.0")
Get levels table
Description
Retrieves levels table for a given study and release version. Allows for
filtering by variables and tables. Wrapper around get_metadata().
In addition to the main get_levels() function, there are two
study-specific variations:
- get_levels_abcd(): for the ABCD study.
- get_levels_hbcd(): for the HBCD study.
They have the same arguments as the get_levels() function, except
that the study argument is set to the respective study by default, and
should not be set by the user.
Usage
get_levels(study, release = "latest", vars = NULL, tables = NULL)
get_levels_abcd(...)
get_levels_hbcd(...)
Arguments
study |
character. The study name. One of "abcd" or "hbcd". |
release |
character. Release version (Default: |
vars |
character (vector). Vector with the names of variables to be included. |
tables |
character (vector). Vector with the names of tables to be included. |
... |
Additional arguments passed to the underlying
|
Value
Data frame with the levels table.
Examples
get_levels("abcd")
get_levels("hbcd", release = "1.0")
get_levels("abcd", vars = c("ab_g_dyn__visit_type"))
get_levels("abcd", tables = "ab_g_dyn")
get_levels_abcd(release = "6.0")
get_levels_hbcd()
Get metadata
Description
Retrieves metadata (data dictionary, levels table, event map) for a given study and release version. Allows for filtering by variables and tables.
Usage
get_metadata(
study,
release = "latest",
vars = NULL,
tables = NULL,
type = "dd"
)
Arguments
study |
character. The study name. One of "abcd" or "hbcd". |
release |
character. Release version (Default: |
vars |
character (vector). Vector with the names of variables to be included. |
tables |
character (vector). Vector with the names of tables to be included. |
type |
character. Type of metadata to retrieve. One of |
Value
Data frame with the metadata.
Examples
get_metadata("abcd", type = "levels")
get_metadata("hbcd", release = "1.0")
get_metadata("abcd", vars = c("ab_g_dyn__visit_dtt", "ab_g_dyn__visit_age"))
get_metadata("abcd", tables = "ab_g_dyn")
get_metadata("abcd", type = "sessions")
Get sessions table
Description
Retrieves the sessions table for a given study and release version. Wrapper
around get_metadata().
In addition to the main get_sessions() function, there are two
study-specific variations:
- get_sessions_abcd(): for the ABCD study.
- get_sessions_hbcd(): for the HBCD study.
They have the same arguments as the get_sessions() function, except
that the study argument is set to the respective study by default, and
should not be set by the user.
Usage
get_sessions(study, release = "latest")
get_sessions_abcd(...)
get_sessions_hbcd(...)
Arguments
study |
character. The study name. One of "abcd" or "hbcd". |
release |
character. Release version (Default: |
... |
Additional arguments passed to the underlying
|
Value
Data frame with the sessions table.
Examples
get_sessions("abcd")
get_sessions("hbcd")
get_sessions_abcd(release = "6.0")
get_sessions_hbcd(release = "1.0")
Join by identifier set
Description
Internal helper for join_tabulated() that joins requested variables from
all tables that have the given (set of) identifier column(s).
Usage
join_by_identifier(
dir_data,
dd,
identifiers,
format = "parquet",
shadow = FALSE
)
Arguments
dir_data |
character. Path to the directory with the data files in
|
dd |
tibble. Data frame with the data dictionary. |
identifiers |
character (vector). Identifier column(s). |
format |
character. Data format (One of |
shadow |
logical. Whether to join the shadow matrix
instead of the data table (default: |
Value
A tibble with the joined variables for the given (set of) identifier column(s).
Join tabulated data
Description
Joins selected variables and/or whole tables from the tabulated data/shadow
files into a single data frame. Expects the data files to be stored in one
directory in .parquet or .tsv format, with one file per table following
the naming convention of the respective NBDC dataset (from the ABCD or HBCD
studies). Typically, this will be the rawdata/phenotype/ directory within
a BIDS dataset downloaded from the NBDC Data Hub.
Variables specified in vars and tables will be full-joined together,
i.e., all rows will be kept, even if they do not have a value for all
columns. Variables specified in vars_add will be left-joined to the
variables selected in vars and tables, i.e., only the values for already
existing rows will be added and no new rows will be created. This is useful
for adding variables to the dataset that are important for a given analysis
but are not the main variables of interest (e.g., design/nesting or
demographic information). By left-joining these variables, one avoids
creating new rows that contain only missing values for the main variables of
interest selected using vars and tables. If the same variables are
specified in vars/tables and vars_add/tables_add, the variables in
vars_add/tables_add will be ignored.
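The difference between the two join types can be seen in a small base R sketch. It uses merge() on toy data frames rather than the package's own functions, and all table contents and the site column are hypothetical.

```r
# Two "main" tables (as selected via vars/tables)
main_1 <- data.frame(participant_id = c("sub-001", "sub-002"), var_1 = c(1, 2))
main_2 <- data.frame(participant_id = c("sub-002", "sub-003"), var_2 = c(10, 20))

# An "additional" table (as selected via vars_add/tables_add)
extra <- data.frame(
  participant_id = c("sub-001", "sub-002", "sub-003", "sub-004"),
  site = c("a", "b", "c", "d")
)

# Full join: all rows from both main tables are kept
joined_main <- merge(main_1, main_2, by = "participant_id", all = TRUE)

# Left join: values are added to existing rows, no new rows are created,
# so "sub-004" does not appear in the result
joined <- merge(joined_main, extra, by = "participant_id", all.x = TRUE)
```

Here the full join yields three participants, and the left join adds the site column without introducing a row for the fourth participant, who has no values for the main variables.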
In addition to the main join_tabulated() function, there are two
study-specific variations:
- join_tabulated_abcd(): for the ABCD study.
- join_tabulated_hbcd(): for the HBCD study.
They have the same arguments as the join_tabulated() function, except
that the study argument is set to the respective study by default, and
should not be set by the user.
Usage
join_tabulated(
dir_data,
study,
vars = NULL,
tables = NULL,
vars_add = NULL,
tables_add = NULL,
release = "latest",
format = "parquet",
shadow = FALSE,
remove_empty_rows = TRUE,
bypass_ram_check = FALSE
)
join_tabulated_abcd(...)
join_tabulated_hbcd(...)
Arguments
dir_data |
character. Path to the directory with the data files in
|
study |
character. NBDC study (One of |
vars |
character (vector). Name(s) of variable(s) to be joined.
(Default: |
tables |
character (vector). Name(s) of table(s) to be joined (Default:
|
vars_add |
character (vector). Name(s) of additional variable(s) to be
left-joined to the variables selected in |
tables_add |
character (vector). Name(s) of additional table(s) to be
left-joined to the variables selected in |
release |
character. Release version (Default: |
format |
character. Data format (One of |
shadow |
logical. Whether to join the shadow matrix
instead of the data table (default: |
remove_empty_rows |
logical. Whether to filter out rows that have
all values missing in the joined variables, except for the
ID columns (default: |
bypass_ram_check |
logical. If This argument is only used for the ABCD study, as the HBCD data is small enough to be loaded without RAM issues with most personal computers. As HBCD data grows in the future, this may change. |
... |
Additional arguments passed to the underlying function
Note: Turning this parameter to |
Value
A tibble of data or shadow matrix with the joined variables.
Examples
## Not run:
join_tabulated(
dir_data = "path/to/data/",
vars = c("var_1", "var_2", "var_3"),
tables = c("table_1", "table_2"),
study = "abcd",
release = "6.0"
)
## End(Not run)
Read delimiter (tab/comma) separated values file correctly formatted
Description
Reads in a .tsv or .csv file with correctly formatted column types.
Uses readr::read_tsv()/readr::read_csv() internally and specifies the
column types explicitly using the col_types argument utilizing information
from the data dictionary. Returns only the identifier columns and the columns
specified in the data dictionary, i.e., all columns in the file that are not
specified in the data dictionary are ignored.
Usage
read_dsv_formatted(file, dd, action = "warn")
Arguments
file |
character. Path to the |
dd |
tibble. Data dictionary specifying the column types. Only columns specified in the data dictionary are read. |
action |
character. What to do if there are columns in the file that are
not specified in the data dictionary (One of |
Details
WHY THIS IS IMPORTANT: readr::read_tsv()/readr::read_csv() (like
other commands to load text files in R or other programming languages) by
default infers the column types from the data. This doesn't always work
perfectly. For example, it may interpret a column with only integers as a
double, or a column with only dates as a character. Sometimes a column may
even be read in completely empty because, by default,
readr::read_tsv()/readr::read_csv() only considers the first 1000 rows
when inferring the data type and interprets the column as an empty logical
vector if those rows are all empty. The NBDC datasets store categorical
data as integers formatted as character. By default,
readr::read_tsv()/readr::read_csv() may interpret them as numeric. By
specifying the column types explicitly based on what is defined in the
data dictionary, we can avoid these issues.
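The effect of type inference versus explicit column types can be reproduced with a few lines of base R. Note that this sketch swaps in read.delim() and its colClasses argument as a stand-in for readr's col_types, so it illustrates the idea rather than the function's actual internals; the column name code is hypothetical.

```r
# A tiny tab-separated table with a categorical code stored as digits
tsv <- "participant_id\tcode\nsub-001\t1\nsub-002\t2\n"

# Inferred types: the categorical code is read as a number
inferred <- read.delim(text = tsv)
class(inferred$code)

# Explicit types: the code is kept as character, as intended
explicit <- read.delim(
  text = tsv,
  colClasses = c(participant_id = "character", code = "character")
)
class(explicit$code)
```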
GENERAL RECOMMENDATION: Other file formats like .parquet correctly
store the column types and don't need to be handled explicitly. They also
offer other advantages like faster reading speed and smaller file sizes. As
such, these formats should generally be preferred over .tsv/.csv files.
However, if you have to work with .tsv/.csv files, this function can help
you avoid common pitfalls.
Value
A tibble with the data/shadow matrix read from the .tsv or .csv file.
Examples
## Not run:
dd <- NBDCtools::get_dd("abcd", "6.0")
read_dsv_formatted("path/to/file.tsv", dd)
## End(Not run)
Read file
Description
Internal helper for join_by_identifier() that reads
data/shadow matrix for a given file in either
.parquet or .tsv format.
Usage
read_file(file, dd, format)
Arguments
file |
character. Path to the file. |
dd |
tibble. Data frame with the data dictionary used to select columns
and determine the column types if reading from a |
format |
character. Data format (One of |
Value
A tibble with the data/shadow matrix from the file.
Bind the shadow matrix to the data
Description
This function binds the shadow matrix to the data.
Usage
shadow_bind_data(
data,
shadow = NULL,
naniar_shadow = FALSE,
id_cols = union(get_id_cols_abcd(), get_id_cols_hbcd()),
suffix = "_shadow"
)
Arguments
data |
tibble. The data. |
shadow |
tibble. The shadow matrix. If |
naniar_shadow |
logical. Whether to use |
id_cols |
character. The columns to join by (the identifier column(s))
in the data and shadow matrices (Default: identifier columns used in ABCD and
HBCD).
In |
suffix |
character. The suffix to add to the shadow columns.
Default is If |
Details
Data requirements
If naniar_shadow = FALSE and shadow is provided, the two dataframes
must have the same columns. The order of the columns does not matter, but
the ID columns must be the same in both dataframes. If there are extra
rows in the shadow matrix, they will be ignored.
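One way to picture the binding step is a join on the ID columns after suffixing the shadow columns. The base R sketch below is an assumption-laden illustration of that idea, not the package's implementation; it only borrows the default suffix "_shadow" from the usage above.

```r
id_cols <- c("participant_id", "session_id")

data <- data.frame(
  participant_id = c("1", "2"),
  session_id     = c("1", "2"),
  var1           = c(NA, 1)
)
# The shadow matrix has one extra row (participant "3")
shadow <- data.frame(
  participant_id = c("1", "2", "3"),
  session_id     = c("1", "2", "3"),
  var1           = c("Unknown", NA, NA)
)

# Add the suffix to all non-ID columns of the shadow matrix
value_cols <- setdiff(names(shadow), id_cols)
names(shadow)[match(value_cols, names(shadow))] <- paste0(value_cols, "_shadow")

# Left join on the ID columns: extra shadow rows are dropped
bound <- merge(data, shadow, by = id_cols, all.x = TRUE)

names(bound)
```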
ABCD and HBCD data
NBDC releases HBCD data with shadow matrices, which can be used for
the shadow argument. To work with ABCD data, the option for
now is to use naniar_shadow = TRUE, which will create a shadow matrix
from the data using naniar::as_shadow().
Value
a dataframe of the data matrix with shadow columns. It will be 2x the size of the original data matrix.
Examples
shadow <- tibble::tibble(
participant_id = c("1", "2", "3"),
session_id = c("1", "2", "3"),
var1 = c("Unknown", NA, NA),
var2 = c("Wish not to answer", NA, NA)
)
data <- tibble::tibble(
participant_id = c("1", "2", "3"),
session_id = c("1", "2", "3"),
var1 = c(NA, NA, 1),
var2 = c(NA, 2, NA)
)
shadow_bind_data(data, shadow)
## Not run:
shadow_bind_data(data, naniar_shadow = TRUE)
## End(Not run)
Fix missingness in shadow matrices that results from binding
Description
This function replaces the missing values in the shadow matrices.
This is done by checking whether the corresponding values in the data and
the shadow matrix are both NA. If they are, the value in the shadow
matrix is replaced with "Missing due to joining".
Usage
shadow_replace_binding_missing(
data,
shadow,
id_cols = union(get_id_cols_abcd(), get_id_cols_hbcd()),
replacement = "Missing due to joining"
)
Arguments
data |
tibble. The data. |
shadow |
tibble. The shadow matrix. |
id_cols |
character (vector). The possible unique identifier columns.
The data does not need to have all of these columns, but if they are
present, they will be used to identify unique rows (Default: identifier
columns used in ABCD and HBCD).
For example, the ABCD data usually has only |
replacement |
character. The value to replace the missing values with. |
Details
Data and shadow requirements: The two dataframes must have the same columns and the same number of rows. They must have the same column names, but the order of the columns does not matter. It is recommended to use the same column order and the same row order (by ID columns) in both dataframes, which saves some processing time.
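The replacement rule can be sketched in base R as a cell-wise comparison of two aligned tables. This sketch assumes that a cell is replaced when it is NA in both the data and the shadow matrix, and that rows and columns are already in the same order; it leaves out the ID columns for brevity and is not the package's implementation.

```r
data <- data.frame(
  var1 = c(NA, NA, 1),
  var2 = c(NA, 2, NA)
)
shadow <- data.frame(
  var1 = c("Unknown", NA, NA),
  var2 = c("Wish not to answer", NA, NA)
)

# Cells that are missing in the data AND have no reason in the shadow matrix
both_na <- is.na(data) & is.na(shadow)

# Fill those cells with the replacement value
shadow[both_na] <- "Missing due to joining"

shadow
```

Cells where the data holds a value keep their NA in the shadow matrix, since nothing is missing there.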
Value
A tibble of the shadow matrix with missing values replaced.
Examples
shadow <- tibble::tibble(
  participant_id = c("1", "2", "3"),
  session_id = c("1", "2", "3"),
  var1 = c("Unknown", NA, NA),
  var2 = c("Wish not to answer", NA, NA)
)
data <- tibble::tibble(
  participant_id = c("1", "2", "3"),
  session_id = c("1", "2", "3"),
  var1 = c(NA, NA, 1),
  var2 = c(NA, 2, NA)
)
shadow_replace_binding_missing(data, shadow)
Convert categorical columns to factor
Description
Based on the specifications in the data dictionary, transforms all categorical columns to factor.
Usage
transf_factor(data, study, release = "latest")
Arguments
data |
tibble. The data to be transformed. Columns are expected to be in the data dictionary. If not, they will be skipped. |
study |
character. NBDC study (One of |
release |
character. Release version (Default: |
Value
A tibble with the transformed data.
Examples
## Not run:
transf_factor(data, study = "abcd")
## End(Not run)
Add variable/value labels
Description
This function can add variable labels and value labels to the data. The variable labels are descriptive information about the column, and the value labels are the levels of the factor variables.
Usage
transf_label(
  data,
  study,
  release = "latest",
  add_var_label = TRUE,
  add_value_label = TRUE,
  id_cols_labels = c(participant_id = "Participant identifier", session_id =
    "Event identifier", run_id = "Run identifier")
)
Arguments
data |
tibble. The data to be transformed. |
study |
character. NBDC study (One of |
release |
character. Release version (Default: |
add_var_label |
logical. Whether to add variable labels (Default:
|
add_value_label |
logical. Whether to add value labels (Default:
|
id_cols_labels |
named character vector. A named vector of labels for the identifier columns, with the names being the column names and the values being the labels. |
Details
Two types of labels
At least one of add_var_label or add_value_label must be set to TRUE. If both are FALSE, an error will be raised.
Text columns
The transf_factor() function has a convert_text argument, which will convert text columns to unordered factors. When type-transformed data is used to add labels, these text-factor columns will not have labels at the variable level.
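To make the two label types concrete, here is a minimal base R sketch. It assumes the attribute convention used by haven and sjlabelled (a "label" attribute for the variable label and a named "labels" attribute for value labels); add_labels_sketch() and the example labels are hypothetical and are not part of NBDCtools.

```r
# Hypothetical helper: attach a variable label and/or value labels to a vector.
add_labels_sketch <- function(x, var_label, value_labels,
                              add_var_label = TRUE, add_value_label = TRUE) {
  # Mirror the documented rule: at least one label type must be requested.
  if (!add_var_label && !add_value_label) {
    stop("At least one of `add_var_label` or `add_value_label` must be TRUE.")
  }
  if (add_var_label) attr(x, "label") <- var_label        # variable label
  if (add_value_label) attr(x, "labels") <- value_labels  # value labels
  x
}

sex <- add_labels_sketch(
  factor(c("1", "2", "1")),
  var_label = "Sex of participant",
  value_labels = c(Male = "1", Female = "2")
)
attributes(sex)
```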
Value
A tibble with the labelled data.
See Also
transf_factor() for transforming categorical columns to factors.
Examples
## Not run:
transf_label(data, study = "abcd")
## End(Not run)
Convert time columns to hms format
Description
This function converts time columns to hms format.
Usage
transf_time_to_hms(data, study, release = "latest")
Arguments
data |
tibble. The data to be converted. |
study |
character. NBDC study (One of |
release |
character. Release version (Default: |
Details
Time columns in the input data are expected to be character strings in the format "HH:MM:SS". If a value is not in this format, the function will return NA for that row.
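The format rule can be illustrated with a small base R sketch. parse_hhmmss() is a hypothetical helper that returns seconds since midnight rather than an hms object, so it only shows the "malformed input becomes NA" behaviour described above and is not the package's conversion code.

```r
# Hypothetical helper: parse "HH:MM:SS" strings into seconds since midnight.
# Anything that does not match the pattern (including NA) yields NA.
parse_hhmmss <- function(x) {
  ok <- !is.na(x) & grepl("^[0-9]{2}:[0-9]{2}:[0-9]{2}$", x)
  out <- rep(NA_real_, length(x))
  parts <- strsplit(x[ok], ":", fixed = TRUE)
  out[ok] <- vapply(
    parts,
    function(p) sum(as.numeric(p) * c(3600, 60, 1)),
    numeric(1)
  )
  out
}

secs <- parse_hhmmss(c("08:30:00", "8:30", NA, "23:59:59"))
print(secs)
```

The sketch checks only the shape of the string, not whether hours, minutes, and seconds fall in valid ranges.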
Value
A tibble with time columns converted to hms format.
Examples
## Not run:
transf_time_to_hms(data, study = "abcd")
## End(Not run)
Convert values to labels for categorical variables
Description
Converts the values of categorical/factor columns (e.g., "1", "2") to their labels (e.g., "Male", "Female"). The value labels will be set to the values.
Usage
transf_value_to_label(data, transf_sess_id = FALSE)
Arguments
data |
tibble. The labelled dataset |
transf_sess_id |
logical. Whether to transform the |
Details
Input requirements
The data must be type transformed and labelled. See transf_factor() and transf_label() for details.
data <- data |> transf_factor(study = "abcd") |> transf_label(study = "abcd")
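A minimal base R sketch of the value-to-label idea is shown below. It assumes value labels are stored in a named "labels" attribute (names are the labels, values are the codes), as haven and sjlabelled do; the variable and labels are made up for illustration, and this is not the package's own code.

```r
# Hypothetical labelled factor: codes as levels, value labels as an attribute.
sex <- factor(c("1", "2", "1"), levels = c("1", "2"))
attr(sex, "labels") <- c(Male = "1", Female = "2")

value_labels <- attr(sex, "labels")

# Look up the label name that matches each level (code).
new_levels <- names(value_labels)[match(levels(sex), value_labels)]

# Rebuild the factor so that its values are the labels.
sex_labelled <- factor(new_levels[as.integer(sex)], levels = new_levels)
print(sex_labelled)
```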
Value
A tibble with factor columns transformed to labels.
Examples
## Not run:
transf_value_to_label(data)
transf_value_to_label(data, transf_sess_id = TRUE)
## End(Not run)
Convert categorical missingness/non-response codes to NA
Description
This function converts the missing codes in the dataset to NA in all factor columns. Examples of missing codes are 999, 888, 777, etc.
Usage
transf_value_to_na(
  data,
  missing_codes = c("999", "888", "777", "666", "555", "444", "333", "222"),
  ignore_col_pattern = "__dk$|__dk__l$",
  id_cols = union(get_id_cols_abcd(), get_id_cols_hbcd())
)
Arguments
data |
tibble. The labelled dataset and type converted data. |
missing_codes |
character vector. The missing codes to be converted to NA |
ignore_col_pattern |
character. A regex pattern to ignore columns that should not be converted to NA. |
id_cols |
character vector. The names of the ID columns to be excluded from the conversion (Default: identifier columns used in ABCD and HBCD). |
Details
Use case
This function works best with ABCD data, where the missing codes are strictly defined. For HBCD data, the missing codes are still under discussion; the function may work, but it may not behave as expected for missing codes that have not yet been decided.
For HBCD data, or for other arbitrary missing codes that one wishes to convert to NA, it is recommended to use the sjmisc::set_na_if() function instead.
Input requirements
The data must be type transformed and labelled. See transf_factor() and transf_label() for details.
data <- data |> transf_factor(study = "abcd") |> transf_label(study = "abcd")
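How the three arguments interact can be sketched in base R. The example below uses a shortened list of missing codes and made-up column names; it only illustrates the selection logic (factor columns, minus ID columns, minus columns matching the ignore pattern). Dropping the unused levels afterwards is a choice made for this sketch, not documented behaviour of transf_value_to_na().

```r
# Assumed settings for the sketch (shortened list of codes).
missing_codes <- c("999", "888", "777")
ignore_col_pattern <- "__dk$|__dk__l$"
id_cols <- c("participant_id", "session_id")

# Hypothetical type-transformed data.
data <- data.frame(
  participant_id = c("1", "2", "3"),
  session_id = c("a", "a", "a"),
  var1 = factor(c("1", "999", "2")),
  var2__dk = factor(c("777", "1", "1"))
)

# Target columns: factors that are neither ID columns nor ignored by pattern.
is_target <- vapply(data, is.factor, logical(1)) &
  !names(data) %in% id_cols &
  !grepl(ignore_col_pattern, names(data))

# Set missing codes to NA in the target columns and drop the unused levels.
data[is_target] <- lapply(data[is_target], function(col) {
  col[as.character(col) %in% missing_codes] <- NA
  droplevels(col)
})
print(data)
```

Here var1 loses its "999" code, while var2__dk is left untouched because its name matches the ignore pattern.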
Value
A tibble of the dataset with missing codes converted to NA.
Examples
## Not run:
data <- data |>
  transf_factor(study = "abcd") |>
  transf_label(study = "abcd")
transf_value_to_na(data)
## End(Not run)