| Type: | Package |
| Title: | A Curated Collection of 'Causal Inference' Datasets and Tools |
| Version: | 0.1.0 |
| Maintainer: | Tomás Valderrama <tomasvm2004@gmail.com> |
| Description: | Provides a comprehensive set of datasets and tools for 'causal inference' research. The package includes data from clinical trials, cancer studies, epidemiological surveys, environmental exposures, and health-related observational studies. Designed to facilitate causal analysis, risk assessment, and advanced statistical modeling, it leverages datasets from packages such as 'causalOT', 'survival', 'causalPAF', 'evident', 'melt', and 'sanon'. The package is inspired by the foundational work of Pearl (2009) <doi:10.1017/CBO9780511803161> on causal inference frameworks. |
| License: | GPL-3 |
| URL: | https://github.com/Toby-codigos/ForCausality, https://toby-codigos.github.io/ForCausality/ |
| BugReports: | https://github.com/Toby-codigos/ForCausality/issues |
| Encoding: | UTF-8 |
| LazyData: | true |
| Suggests: | ggplot2, dplyr, testthat (≥ 3.0.0), knitr, rmarkdown |
| RoxygenNote: | 7.3.3 |
| Config/testthat/edition: | 3 |
| VignetteBuilder: | knitr |
| NeedsCompilation: | no |
| Packaged: | 2025-10-22 02:18:33 UTC; tomis |
| Author: | Tomás Valderrama [aut, cre] |
| Depends: | R (≥ 3.5.0) |
| Repository: | CRAN |
| Date/Publication: | 2025-10-25 12:40:22 UTC |
ForCausality: A Curated Collection of Causal Inference Datasets and Tools
Description
Provides a comprehensive set of datasets and tools for causal inference research. The package includes data from clinical trials, cancer studies, epidemiological surveys, environmental exposures, and health-related observational studies.
Author(s)
Maintainer: Tomás Valderrama <tomasvm2004@gmail.com>
See Also
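Useful links:
- https://github.com/Toby-codigos/ForCausality
- https://toby-codigos.github.io/ForCausality/
- Report bugs at https://github.com/Toby-codigos/ForCausality/issues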
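To see which datasets are available before loading one, the package index can be queried from base R. The snippet below is a minimal sketch: the helper name list_package_datasets is hypothetical (it is not exported by ForCausality), and the call against ForCausality is guarded so the code also runs where the package is not installed.

```r
# Hypothetical helper (not part of ForCausality): list the datasets a
# package ships, using only utils::data() from base R.
list_package_datasets <- function(pkg) {
  info <- utils::data(package = pkg)$results
  sort(unname(info[, "Item"]))
}

# Guarded so the sketch still runs when ForCausality is not installed.
if (requireNamespace("ForCausality", quietly = TRUE)) {
  print(list_package_datasets("ForCausality"))
}
```

Because the package sets LazyData to true, each listed dataset can then be loaded with data() as shown in the Usage sections.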
Benzene Exposure and Chromosome Damage Data
Description
This dataset, Benzene_df, is a data frame containing indicators of chromosome damage related to benzene exposure, alcohol consumption, and smoking habits. The dataset consists of 78 observations and 5 variables, including age, exposure, and lifestyle factors. Some observations may contain missing values.
Usage
data(Benzene_df)
Format
A data frame with 78 observations and 5 variables:
- age
Age of the subject (integer)
- exposure
Benzene exposure indicator (integer)
- alcohol
Alcohol consumption indicator (integer)
- smoking
Smoking indicator (numeric)
- totalplus
Chromosome damage measure (numeric)
Details
The dataset name has been kept as 'Benzene_df' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the ForCausality package and assists users in identifying its specific characteristics. The suffix 'df' indicates that the dataset is a data frame. The original content has not been modified in any way.
Source
Data taken from the evident package version 1.0.4
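Since some observations may contain missing values, it can help to count them per column before any modeling. The following is a minimal sketch in base R; count_missing is a hypothetical helper and the toy data frame only mimics a few of the columns listed above.

```r
# Hypothetical helper: number of missing values in each column.
count_missing <- function(df) {
  colSums(is.na(df))
}

# Toy data with the same kind of columns as Benzene_df (values invented
# purely for illustration).
toy <- data.frame(
  age       = c(30L, NA, 45L),
  exposure  = c(1L, 0L, 1L),
  totalplus = c(0.4, NA, NA)
)
print(count_missing(toy))

# With the package installed, the same check applies to the real data.
if (requireNamespace("ForCausality", quietly = TRUE)) {
  data("Benzene_df", package = "ForCausality")
  print(count_missing(Benzene_df))
  # Keep only fully observed rows, if a complete-case analysis is intended.
  benzene_complete <- Benzene_df[complete.cases(Benzene_df), ]
  print(nrow(benzene_complete))
}
```

Whether dropping incomplete rows is appropriate depends on the analysis; the sketch only shows how to inspect the pattern.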
Clothianidin Concentration in Maize Plants
Description
This dataset, Cloth_df, is a data frame containing measurements of clothianidin concentration in maize plants under different treatments. The dataset consists of 102 observations and 3 variables, including block identifiers, treatment types, and measured concentrations. Some observations may contain missing values.
Usage
data(Cloth_df)
Format
A data frame with 102 observations and 3 variables:
- blk
Block identifier (factor)
- trt
Treatment type (factor)
- clo
Clothianidin concentration (numeric)
Details
The dataset name has been kept as 'Cloth_df' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the ForCausality package and assists users in identifying its specific characteristics. The suffix 'df' indicates that the dataset is a data frame. The original content has not been modified in any way.
Source
Data taken from the melt package version 1.11.4
Chemotherapy Data for Stage B/C Colon Cancer
Description
This dataset, Colon_df, contains data from a clinical trial of chemotherapy for patients with Stage B/C colon cancer. The dataset includes 1,858 observations and 16 variables, providing information on patient demographics, treatment assignment, disease characteristics, and outcomes. Some observations contain missing values.
Usage
data(Colon_df)
Format
A data frame with 1,858 observations and 16 variables:
- id
Patient identifier (numeric)
- study
Study number (numeric)
- rx
Treatment group (factor)
- sex
Sex of the patient (numeric)
- age
Age of the patient in years (numeric)
- obstruct
Obstruction present (numeric indicator)
- perfor
Perforation present (numeric indicator)
- adhere
Adherence to adjacent structures (numeric indicator)
- nodes
Number of lymph nodes with cancer (numeric)
- status
Censoring status (numeric indicator)
- differ
Tumor differentiation (numeric)
- extent
Extent of local spread (numeric)
- surg
Time from surgery to registration (numeric indicator: 0 = short, 1 = long)
- node4
More than 4 positive lymph nodes (numeric indicator)
- time
Days until event or censoring (numeric)
- etype
Type of event (numeric indicator: 1 = recurrence, 2 = death)
Details
The dataset name has been kept as 'Colon_df' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the ForCausality package and assists users in identifying its specific characteristics. The suffix 'df' indicates that the dataset is a data frame. The original content has not been modified in any way.
Source
Data taken from the survival package version 3.8-3
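Because etype distinguishes event types, analyses usually restrict the data to one event type at a time. The sketch below tabulates status by treatment arm for a chosen event type using base R only. It assumes the coding documented for the colon data in the survival package (etype 1 = recurrence, 2 = death); events_by_arm is a hypothetical helper and the toy rows are invented.

```r
# Hypothetical helper: cross-tabulate status by treatment arm for one
# event type. Assumes etype is coded 1 = recurrence, 2 = death.
events_by_arm <- function(df, event_type) {
  one_event <- df[which(df$etype == event_type), ]
  table(arm = droplevels(one_event$rx), status = one_event$status)
}

# Toy data: two patients, one row per patient and event type.
toy <- data.frame(
  id     = c(1, 1, 2, 2),
  rx     = factor(c("Obs", "Obs", "Lev", "Lev")),
  status = c(1, 1, 0, 1),
  etype  = c(1, 2, 1, 2)
)
print(events_by_arm(toy, event_type = 2))

# With the package installed, the same table for the real data.
if (requireNamespace("ForCausality", quietly = TRUE)) {
  data("Colon_df", package = "ForCausality")
  print(events_by_arm(Colon_df, event_type = 2))
}
```

Filtering on etype first avoids counting the same patient twice when several rows per patient are present.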
Breast Cancer Prognostic Data (German Breast Cancer Study Group)
Description
This dataset, Gbsg_df, provides prognostic factors for breast cancer patients from the German Breast Cancer Study Group (GBSG). The dataset includes 686 observations and 11 variables, containing information on patient demographics, tumor characteristics, hormone receptor status, and outcomes. Some observations contain missing values.
Usage
data(Gbsg_df)
Format
A data frame with 686 observations and 11 variables:
- pid
Patient identifier (integer)
- age
Age at diagnosis (integer)
- meno
Menopausal status (integer indicator)
- size
Tumor size in millimeters (integer)
- grade
Tumor grade (integer)
- nodes
Number of positive lymph nodes (integer)
- pgr
Progesterone receptor level (integer)
- er
Estrogen receptor level (integer)
- hormon
Hormonal therapy received (integer indicator)
- rfstime
Relapse-free survival time in days (integer)
- status
Event status: 0 = alive without recurrence, 1 = recurrence or death (integer indicator)
Details
The dataset name has been kept as 'Gbsg_df' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the ForCausality package and assists users in identifying its specific characteristics. The suffix 'df' indicates that the dataset is a data frame. The original content has not been modified in any way.
Source
Data taken from the survival package version 3.8-3
Lead Exposure Data
Description
This dataset, Lead_df, is a data frame comparing control and exposed groups under different hygiene and exposure levels. The dataset consists of 33 observations and 6 variables, including measures of exposure, hygiene, and calculated differences between groups. Some observations may contain missing values.
Usage
data(Lead_df)
Format
A data frame with 33 observations and 6 variables:
- control
Blood lead level of the control child (integer)
- exposed
Blood lead level of the exposed child (integer)
- level
Exposure level (factor with 3 levels: "high", "low", "medium")
- hyg
Hygiene level (factor with 3 levels: "good", "mod", "poor")
- both
Combined exposure and hygiene category (factor with 4 levels, e.g. "high.ok", "high.poor", ...)
- dif
Exposed-minus-control difference (integer)
Details
The dataset name has been kept as 'Lead_df' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the ForCausality package and assists users in identifying its specific characteristics. The suffix 'df' indicates that the dataset is a data frame. The original content has not been modified in any way.
Source
Data taken from the evident package version 1.0.4
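The dif column is derived from control and exposed, so a quick consistency check can confirm how it was computed before it is used in a paired analysis. The sketch below checks the sign convention rather than assuming it; dif_convention is a hypothetical helper and the toy values are invented.

```r
# Hypothetical helper: report which subtraction reproduces the dif column.
dif_convention <- function(df) {
  if (isTRUE(all(df$dif == df$exposed - df$control))) {
    "exposed - control"
  } else if (isTRUE(all(df$dif == df$control - df$exposed))) {
    "control - exposed"
  } else {
    "neither"
  }
}

# Toy data with the three numeric columns of Lead_df.
toy <- data.frame(
  control = c(10L, 12L),
  exposed = c(15L, 11L),
  dif     = c(5L, -1L)
)
print(dif_convention(toy))

# With the package installed, run the same check on the real data.
if (requireNamespace("ForCausality", quietly = TRUE)) {
  data("Lead_df", package = "ForCausality")
  print(dif_convention(Lead_df))
}
```

If any of the three columns contains a missing value, the helper returns "neither", which is a prompt to inspect the rows by hand.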
Mouse Cancer Trial Data
Description
This dataset, Mouse_df, provides data from a mouse cancer study (distributed as hoel in the survival package) in which male mice in control and germ-free groups were followed for cancer incidence. The dataset includes 181 observations and 4 variables, covering information on treatment assignment, survival time, outcome, and mouse identifiers. Some observations contain missing values.
Usage
data(Mouse_df)
Format
A data frame with 181 observations and 4 variables:
- trt
Treatment group (factor)
- days
Survival time in days (numeric)
- outcome
Trial outcome (factor)
- id
Mouse identifier (integer)
Details
The dataset name has been kept as 'Mouse_df' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the ForCausality package and assists users in identifying its specific characteristics. The suffix 'df' indicates that the dataset is a data frame. The original content has not been modified in any way.
Source
Data taken from the survival package version 3.8-3
Chronic Pain Clinical Trial Data
Description
This dataset, Pain_df, is a data frame containing clinical trial data for chronic pain treatments. The trial compared active treatment versus placebo across different clinical centers and diagnoses. The dataset consists of 193 observations and 4 variables. Some observations may contain missing values.
Usage
data(Pain_df)
Format
A data frame with 193 observations and 4 variables:
- treat
Treatment group (factor: active vs placebo)
- response
Response outcome (factor)
- center
Clinical trial center (factor)
- diagnosis
Diagnosis category (factor)
Details
The dataset name has been kept as 'Pain_df' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the ForCausality package and assists users in identifying its specific characteristics. The suffix 'df' indicates that the dataset is a data frame. The original content has not been modified in any way.
Source
Data taken from the sanon package version 1.6
Periodontal Disease Data
Description
This dataset, Periodontal_df, is a data frame containing information on smoking habits, demographics, and periodontal health indicators. The dataset consists of 882 observations and 12 variables, including smoking frequency, socioeconomic indicators, and periodontal measures. Some observations may contain missing values.
Usage
data(Periodontal_df)
Format
A data frame with 882 observations and 12 variables:
- SEQN
Sequence identifier (numeric)
- female
Sex indicator (numeric)
- age
Age in years (numeric)
- black
Race indicator for Black participants (numeric)
- educf
Education level (ordered factor with 5 levels)
- income
Income measure (numeric)
- cigsperday
Cigarettes smoked per day (numeric)
- either
Count of sites with periodontal disease (integer)
- neither
Count of sites without periodontal disease (integer)
- pcteither
Percentage of sites with periodontal disease (numeric)
- z
Treatment indicator: 1 = smoker, 0 = nonsmoker (numeric)
- mset
Matched set identifier (numeric)
Details
The dataset name has been kept as 'Periodontal_df' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the ForCausality package and assists users in identifying its specific characteristics. The suffix 'df' indicates that the dataset is a data frame. The original content has not been modified in any way.
Source
Data taken from the evident package version 1.0.4
External Control Trial Data for Post-partum Hemorrhage
Description
This dataset, Pph_df, provides data from an external control trial of treatments for post-partum hemorrhage. The dataset includes 802 observations and 17 variables, containing information on blood loss, treatment assignment, demographic characteristics, and educational background. Some observations contain missing values.
Usage
data(Pph_df)
Format
A data frame with 802 observations and 17 variables:
- cum_blood_20m
Cumulative blood loss at 20 minutes (numeric)
- tx
Treatment indicator (numeric)
- age
Age of the participant (numeric)
- no_educ
Indicator for no formal education (numeric)
- ...
Additional variables related to treatment and outcomes (numeric)
Details
The dataset name has been kept as 'Pph_df' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the ForCausality package and assists users in identifying its specific characteristics. The suffix 'df' indicates that the dataset is a data frame. The original content has not been modified in any way.
Source
Data taken from the causalOT package version 1.0.2
Respiratory Disorder Clinical Trial Data
Description
This dataset, Resp_df, is a data frame containing repeated measurements from a clinical trial on respiratory disorders under two treatment conditions. The dataset records demographic information (center, sex, age), baseline measures, and follow-up measurements across four visits. It consists of 111 observations and 9 variables. Some observations may contain missing values.
Usage
data(Resp_df)
Format
A data frame with 111 observations and 9 variables:
- center
Clinical trial center (factor)
- treatment
Treatment group (character)
- sex
Sex of the participant (character)
- age
Age of the participant (integer)
- baseline
Baseline measurement (integer)
- visit1
Measurement at visit 1 (integer)
- visit2
Measurement at visit 2 (integer)
- visit3
Measurement at visit 3 (integer)
- visit4
Measurement at visit 4 (integer)
Details
The dataset name has been kept as 'Resp_df' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the ForCausality package and assists users in identifying its specific characteristics. The suffix 'df' indicates that the dataset is a data frame. The original content has not been modified in any way.
Source
Data taken from the sanon package version 1.6
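The four visit columns are stored in wide format, while many repeated-measures tools expect one row per participant and visit. The sketch below converts the layout with base R's reshape(). The helper visits_to_long, the added subject column, and the score column name are all hypothetical choices for illustration, and the toy rows are invented.

```r
# Hypothetical helper: reshape visit1..visit4 into one row per
# participant and visit. Adds a 'subject' row identifier because the
# data frame has no participant id column of its own.
visits_to_long <- function(df) {
  visit_cols <- paste0("visit", 1:4)
  df$subject <- seq_len(nrow(df))
  long <- reshape(
    df,
    direction = "long",
    varying   = visit_cols,
    v.names   = "score",
    timevar   = "visit",
    times     = 1:4,
    idvar     = "subject"
  )
  long <- long[order(long$subject, long$visit), ]
  rownames(long) <- NULL
  long
}

# Toy data with a subset of the Resp_df columns.
toy <- data.frame(
  treatment = c("A", "P"),
  baseline  = c(2L, 3L),
  visit1    = c(1L, 3L),
  visit2    = c(2L, 3L),
  visit3    = c(3L, 2L),
  visit4    = c(4L, 1L)
)
long <- visits_to_long(toy)
print(long)

# With the package installed, the same conversion for the real data.
if (requireNamespace("ForCausality", quietly = TRUE)) {
  data("Resp_df", package = "ForCausality")
  print(head(visits_to_long(Resp_df)))
}
```

The baseline column is kept as a participant-level covariate rather than treated as a fifth time point; that is a design choice, not something the source data prescribes.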
Breast Cancer Prognostic Data (Rotterdam Study)
Description
This dataset, Rotterdam_df, provides prognostic factors for breast cancer patients used in the studies of Royston and Altman. The dataset includes 2,982 observations and 15 variables, covering patient demographics, tumor characteristics, treatments, and outcomes. Some observations contain missing values.
Usage
data(Rotterdam_df)
Format
A data frame with 2,982 observations and 15 variables:
- pid
Patient identifier (integer)
- year
Year of surgery (integer)
- age
Age at diagnosis (integer)
- meno
Menopausal status (integer indicator)
- size
Tumor size category (factor)
- grade
Tumor grade (integer)
- nodes
Number of positive lymph nodes (integer)
- pgr
Progesterone receptor level (integer)
- er
Estrogen receptor level (integer)
- hormon
Hormonal therapy received (integer indicator)
- chemo
Chemotherapy received (integer indicator)
- rtime
Days to relapse or last follow-up (numeric)
- recur
Recurrence indicator (integer)
- dtime
Days to death or last follow-up (numeric)
- death
Death indicator (integer)
Details
The dataset name has been kept as 'Rotterdam_df' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the ForCausality package and assists users in identifying its specific characteristics. The suffix 'df' indicates that the dataset is a data frame. The original content has not been modified in any way.
Source
Data taken from the survival package version 3.8-3
Seborrheic Dermatitis Clinical Trial Data
Description
This dataset, Sebor_df, is a data frame containing clinical trial data on seborrheic dermatitis, comparing test and placebo treatments. It records participant center, treatment assignment, dermatitis scores across three assessments, and severity indicators at the same points. The dataset consists of 167 observations and 8 variables. Some observations may contain missing values.
Usage
data(Sebor_df)
Format
A data frame with 167 observations and 8 variables:
- center
Clinical trial center (factor)
- treat
Treatment group: test or placebo (character)
- score1
Dermatitis score at assessment 1 (integer)
- score2
Dermatitis score at assessment 2 (integer)
- score3
Dermatitis score at assessment 3 (integer)
- severity1
Severity indicator at assessment 1 (integer)
- severity2
Severity indicator at assessment 2 (integer)
- severity3
Severity indicator at assessment 3 (integer)
Details
The dataset name has been kept as 'Sebor_df' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the ForCausality package and assists users in identifying its specific characteristics. The suffix 'df' indicates that the dataset is a data frame. The original content has not been modified in any way.
Source
Data taken from the sanon package version 1.6
Skin Condition Clinical Trial Data
Description
This dataset, Skin_df, is a data frame containing clinical trial data on skin conditions, comparing responses under placebo and test treatments. It includes participant center, treatment assignment, disease stage, and responses across three assessments. The dataset consists of 172 observations and 6 variables. Some observations may contain missing values.
Usage
data(Skin_df)
Format
A data frame with 172 observations and 6 variables:
- center
Clinical trial center (factor)
- treat
Treatment group: placebo or test (factor)
- stage
Disease stage (integer)
- res1
Response at assessment 1 (integer)
- res2
Response at assessment 2 (integer)
- res3
Response at assessment 3 (integer)
Details
The dataset name has been kept as 'Skin_df' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the ForCausality package and assists users in identifying its specific characteristics. The suffix 'df' indicates that the dataset is a data frame. The original content has not been modified in any way.
Source
Data taken from the sanon package version 1.6
Smoking and Homocysteine Data
Description
This dataset, SmokeH_df, is a data frame containing information on smoking, homocysteine levels, demographics, and socioeconomic indicators. The dataset consists of 2,475 observations and 15 variables, including biomarkers, smoking-related measures, age, education, and poverty ratio. Some observations contain missing values.
Usage
data(SmokeH_df)
Format
A data frame with 2,475 observations and 15 variables:
- SEQN
Participant identifier (integer)
- homocysteine
Homocysteine level (numeric)
- z
Treatment indicator (integer, 1 = daily smoker, 0 = never smoker)
- female
Sex indicator (integer, 1 = female, 0 = male)
- age
Age in years (integer)
- education
Education level (integer code)
- povertyr
Poverty ratio (numeric)
- bmi
Body mass index (numeric)
- cotinine
Cotinine level (numeric)
- st
Smoking type indicator (integer)
- stf
Smoking type (character string)
- age3
Age category (integer code)
- ed3
Education category (integer code)
- bmi3
BMI category (integer code)
- pov2
Poverty category (logical)
Details
The dataset name has been kept as 'SmokeH_df' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the ForCausality package and assists users in identifying its specific characteristics. The suffix 'df' indicates that the dataset is a data frame. The original content has not been modified in any way.
Source
Data taken from the evident package version 1.0.4
Ischemic Stroke Case-Control Data
Description
This dataset, Stroke_df, contains fictional case-control data for ischemic stroke, including exposures, risk factors, and confounders. The dataset includes 16,623 observations and 21 variables, covering demographic details, lifestyle factors, biomarkers, and comorbidities. Some observations contain missing values.
Usage
data(Stroke_df)
Format
A data frame with 16,623 observations and 21 variables:
- regionnn7
Geographic region (factor)
- case
Case indicator for ischemic stroke (numeric)
- esex
Sex of the participant (integer)
- eage
Age of the participant (integer)
- htnadmbp
Hypertension or blood pressure measure (numeric)
- nevfcur
Smoking status (factor)
- global_stress2
Perceived stress indicator (factor)
- whrs2tert
Waist-to-hip ratio tertiles (factor)
- phys
Physical activity indicator (factor)
- alcohfreqwk
Weekly alcohol consumption frequency (factor)
- dmhba1c2
Diabetes / HbA1c category (factor)
- cardiacrfcat
Cardiac risk factor category (factor)
- ahei3tert
Alternative Healthy Eating Index tertiles (factor)
- apob_apoatert
ApoB/ApoA ratio tertiles (factor)
- subeduc
Subject's education level (factor)
- moteduc
Mother’s education level (factor)
- fatduc
Father’s education level (factor)
- subhtn
Subject's hypertension indicator (factor)
- whr
Waist-to-hip ratio (numeric)
- apob_apoa
ApoB/ApoA continuous ratio (numeric)
- weights
Sample weights (numeric)
Details
The dataset name has been kept as 'Stroke_df' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the ForCausality package and assists users in identifying its specific characteristics. The suffix 'df' indicates that the dataset is a data frame. The original content has not been modified in any way.
Source
Data taken from the causalPAF package version 1.2.5
Thiamethoxam Application and Crop Yield Data
Description
This dataset, Thiam_df, is a data frame containing information on thiamethoxam applications and crop yield measurements in squash plants. The dataset consists of 165 observations and 11 variables, including treatment types, plant variety, replication, fruit counts, yield measures, and defoliation indicators. Some observations may contain missing values.
Usage
data(Thiam_df)
Format
A data frame with 165 observations and 11 variables:
- trt
Treatment type (factor)
- var
Plant variety (factor)
- rep
Replication block (factor)
- fruit
Number of fruits (numeric)
- avg_mass
Average fruit mass (numeric)
- mass
Total fruit mass (numeric)
- yield
Crop yield (numeric)
- visit
Pollinator visit count (numeric)
- foliage
Foliage measure (numeric)
- scb
Squash vine borer damage (numeric)
- defoliation
Defoliation percentage (numeric)
Details
The dataset name has been kept as 'Thiam_df' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the ForCausality package and assists users in identifying its specific characteristics. The suffix 'df' indicates that the dataset is a data frame. The original content has not been modified in any way.
Source
Data taken from the melt package version 1.11.4
Ursodeoxycholic Acid Trial Data
Description
This dataset, Udca_df, contains data from a clinical trial of ursodeoxycholic acid (UDCA). The dataset includes 1,360 observations and 8 variables, covering treatment assignment, disease stage, bilirubin levels, risk scores, follow-up time, and outcomes. Some observations contain missing values.
Usage
data(Udca_df)
Format
A data frame with 1,360 observations and 8 variables:
- id
Patient identifier (integer)
- trt
Treatment group (integer)
- stage
Disease stage (integer)
- bili
Bilirubin level (numeric)
- riskscore
Calculated risk score (numeric)
- futime
Follow-up time in days (numeric)
- status
Patient status indicator (numeric)
- endpoint
Endpoint description (character)
Details
The dataset name has been kept as 'Udca_df' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the ForCausality package and assists users in identifying its specific characteristics. The suffix 'df' indicates that the dataset is a data frame. The original content has not been modified in any way.
Source
Data taken from the survival package version 3.8-3