Type: Package
Title: A Curated Collection of 'Causal Inference' Datasets and Tools
Version: 0.1.0
Maintainer: Tomás Valderrama <tomasvm2004@gmail.com>
Description: Provides a comprehensive set of datasets and tools for 'causal inference' research. The package includes data from clinical trials, cancer studies, epidemiological surveys, environmental exposures, and health-related observational studies. Designed to facilitate causal analysis, risk assessment, and advanced statistical modeling, it leverages datasets from packages such as 'causalOT', 'survival', 'causalPAF', 'evident', 'melt', and 'sanon'. The package is inspired by the foundational work of Pearl (2009) <doi:10.1017/CBO9780511803161> on causal inference frameworks.
License: GPL-3
URL: https://github.com/Toby-codigos/ForCausality, https://toby-codigos.github.io/ForCausality/
BugReports: https://github.com/Toby-codigos/ForCausality/issues
Encoding: UTF-8
LazyData: true
Suggests: ggplot2, dplyr, testthat (≥ 3.0.0), knitr, rmarkdown
RoxygenNote: 7.3.3
Config/testthat/edition: 3
VignetteBuilder: knitr
NeedsCompilation: no
Packaged: 2025-10-22 02:18:33 UTC; tomis
Author: Tomás Valderrama [aut, cre]
Depends: R (≥ 3.5.0)
Repository: CRAN
Date/Publication: 2025-10-25 12:40:22 UTC

ForCausality: A Curated Collection of Causal Inference Datasets and Tools

Description

Provides a comprehensive set of datasets and tools for causal inference research. The package includes data from clinical trials, cancer studies, epidemiological surveys, environmental exposures, and health-related observational studies.

Author(s)

Maintainer: Tomás Valderrama <tomasvm2004@gmail.com>

See Also

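Useful links:

https://github.com/Toby-codigos/ForCausality

https://toby-codigos.github.io/ForCausality/

Report bugs at https://github.com/Toby-codigos/ForCausality/issues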
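
Examples

A minimal sketch of how the bundled datasets might be listed once the package is installed. The helper name list_datasets is hypothetical and not part of ForCausality; it only wraps base R's data() and returns an empty vector when the requested package is absent.

```r
# Hypothetical helper (not part of ForCausality): names of the datasets
# that an installed package ships, or character(0) if it is not installed.
list_datasets <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    return(character(0))
  }
  unname(data(package = pkg)$results[, "Item"])
}

# With ForCausality installed, this should include names such as "Benzene_df".
list_datasets("ForCausality")
```

Because LazyData is enabled, an individual dataset can also be reached as ForCausality::Benzene_df without calling data() first.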


Benzene Exposure and Chromosome Damage Data

Description

This dataset, Benzene_df, is a data frame containing indicators of chromosome damage related to benzene exposure, alcohol consumption, and smoking habits. The dataset consists of 78 observations and 5 variables, including age, exposure, and lifestyle factors. Some observations may contain missing values.

Usage

data(Benzene_df)

Format

A data frame with 78 observations and 5 variables:

age

Age of the subject (integer)

exposure

Benzene exposure indicator (integer)

alcohol

Alcohol consumption indicator (integer)

smoking

Smoking indicator (numeric)

totalplus

Chromosome damage measure (numeric)

Details

The dataset name has been kept as 'Benzene_df' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the ForCausality package and assists users in identifying its specific characteristics. The suffix 'df' indicates that the dataset is a data frame. The original content has not been modified in any way.

Source

Data taken from the evident package version 1.0.4
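
Examples

Since some observations may contain missing values, a first step could be to count them per column. The sketch below uses a hypothetical helper, count_missing, and a toy data frame with made-up values; the real data are only touched if ForCausality is installed.

```r
# Hypothetical helper (not part of ForCausality): number of NA entries
# in each column of a data frame.
count_missing <- function(df) {
  vapply(df, function(col) sum(is.na(col)), integer(1))
}

# Toy rows with made-up values, only to show the call.
toy <- data.frame(
  age       = c(34L, NA, 51L),
  smoking   = c(0, 1, NA),
  totalplus = c(2.5, NA, NA)
)
count_missing(toy)

if (requireNamespace("ForCausality", quietly = TRUE)) {
  data("Benzene_df", package = "ForCausality")
  print(count_missing(Benzene_df))
  # Keep only fully observed rows for a complete-case analysis.
  benzene_complete <- Benzene_df[complete.cases(Benzene_df), ]
}
```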


Clothianidin Concentration in Maize Plants

Description

This dataset, Cloth_df, is a data frame containing measurements of clothianidin concentration in maize plants under different treatments. The dataset consists of 102 observations and 3 variables, including block identifiers, treatment types, and measured concentrations. Some observations may contain missing values.

Usage

data(Cloth_df)

Format

A data frame with 102 observations and 3 variables:

blk

Block identifier (factor)

trt

Treatment type (factor)

clo

Clothianidin concentration (numeric)

Details

The dataset name has been kept as 'Cloth_df' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the ForCausality package and assists users in identifying its specific characteristics. The suffix 'df' indicates that the dataset is a data frame. The original content has not been modified in any way.

Source

Data taken from the melt package version 1.11.4
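
Examples

A short sketch of how mean concentration per treatment could be compared. The helper mean_by_treatment is hypothetical, and the toy treatment labels and values are made up; only the column names blk, trt, and clo come from the format above.

```r
# Hypothetical helper (not part of ForCausality): mean of clo within each
# level of trt. The formula interface of aggregate() drops rows with NA.
mean_by_treatment <- function(df) {
  aggregate(clo ~ trt, data = df, FUN = mean)
}

# Toy rows with made-up values, only to show the call.
toy <- data.frame(
  blk = factor(c(1, 1, 2, 2)),
  trt = factor(c("A", "B", "A", "B")),
  clo = c(1.0, 3.0, 2.0, NA)
)
mean_by_treatment(toy)

if (requireNamespace("ForCausality", quietly = TRUE)) {
  data("Cloth_df", package = "ForCausality")
  print(mean_by_treatment(Cloth_df))
}
```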


Chemotherapy Data for Stage B/C Colon Cancer

Description

This dataset, Colon_df, contains data from a clinical trial of chemotherapy for patients with Stage B/C colon cancer. The dataset includes 1,858 observations and 16 variables, providing information on patient demographics, treatment assignment, disease characteristics, and outcomes. Some observations contain missing values.

Usage

data(Colon_df)

Format

A data frame with 1,858 observations and 16 variables:

id

Patient identifier (numeric)

study

Study number (numeric)

rx

Treatment group (factor)

sex

Sex of the patient (numeric)

age

Age of the patient in years (numeric)

obstruct

Obstruction present (numeric indicator)

perfor

Perforation present (numeric indicator)

adhere

Adherence to adjacent structures (numeric indicator)

nodes

Number of lymph nodes with cancer (numeric)

status

Patient status (numeric indicator)

differ

Tumor differentiation (numeric)

extent

Extent of local spread (numeric)

surg

Time from surgery to registration (numeric indicator: 0 = short, 1 = long)

node4

More than 4 positive lymph nodes (numeric indicator)

time

Follow-up time in days (numeric)

etype

Type of event (numeric indicator: 1 = recurrence, 2 = death)

Details

The dataset name has been kept as 'Colon_df' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the ForCausality package and assists users in identifying its specific characteristics. The suffix 'df' indicates that the dataset is a data frame. The original content has not been modified in any way.

Source

Data taken from the survival package version 3.8-3
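
Examples

A rough sketch of a crude event rate per treatment arm. The helper event_rate is hypothetical. Restricting to etype == 2 assumes the coding documented for the original colon data in the survival package (2 = death); the toy rows are made up.

```r
# Hypothetical helper (not part of ForCausality): share of rows with an
# event (status == 1) within each level of a grouping column.
event_rate <- function(df, group, status = "status") {
  tapply(df[[status]], df[[group]], mean, na.rm = TRUE)
}

# Toy rows with made-up values, only to show the call.
toy <- data.frame(rx = c("Obs", "Obs", "Lev", "Lev"), status = c(1, 0, 1, 1))
event_rate(toy, "rx")

if (requireNamespace("ForCausality", quietly = TRUE)) {
  data("Colon_df", package = "ForCausality")
  deaths <- subset(Colon_df, etype == 2)  # assumption: 2 codes death
  print(event_rate(deaths, "rx"))
}
```

This ignores censoring and follow-up time, so it is only a descriptive check and not a substitute for a survival model.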


Breast Cancer Prognostic Data (German Breast Cancer Study Group)

Description

This dataset, Gbsg_df, provides prognostic factors for breast cancer patients from the German Breast Cancer Study Group (GBSG). The dataset includes 686 observations and 11 variables, containing information on patient demographics, tumor characteristics, hormone receptor status, and outcomes. Some observations contain missing values.

Usage

data(Gbsg_df)

Format

A data frame with 686 observations and 11 variables:

pid

Patient identifier (integer)

age

Age at diagnosis (integer)

meno

Menopausal status (integer indicator)

size

Tumor size in millimeters (integer)

grade

Tumor grade (integer)

nodes

Number of positive lymph nodes (integer)

pgr

Progesterone receptor level (integer)

er

Estrogen receptor level (integer)

hormon

Hormonal therapy received (integer indicator)

rfstime

Relapse-free survival time in days (integer)

status

Patient status (integer indicator)

Details

The dataset name has been kept as 'Gbsg_df' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the ForCausality package and assists users in identifying its specific characteristics. The suffix 'df' indicates that the dataset is a data frame. The original content has not been modified in any way.

Source

Data taken from the survival package version 3.8-3


Lead Exposure Data

Description

This dataset, Lead_df, is a data frame comparing control and exposed groups under different hygiene and exposure levels. The dataset consists of 33 observations and 6 variables, including measures of exposure, hygiene, and calculated differences between groups. Some observations may contain missing values.

Usage

data(Lead_df)

Format

A data frame with 33 observations and 6 variables:

control

Control group count (integer)

exposed

Exposed group count (integer)

level

Exposure level (factor with 3 levels: "high", "low", "medium")

hyg

Hygiene level (factor with 3 levels: "good", "mod", "poor")

both

Combined exposure and hygiene category (factor with 4 levels, e.g. "high.ok", "high.poor", ...)

dif

Difference between control and exposed (integer)

Details

The dataset name has been kept as 'Lead_df' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the ForCausality package and assists users in identifying its specific characteristics. The suffix 'df' indicates that the dataset is a data frame. The original content has not been modified in any way.

Source

Data taken from the evident package version 1.0.4


Mouse Cancer Trial Data

Description

This dataset, Mouse_df, provides data from mouse cancer trials used in studies by Royston and Altman. The dataset includes 181 observations and 4 variables, covering information on treatment assignment, survival time, outcome, and mouse identifiers. Some observations contain missing values.

Usage

data(Mouse_df)

Format

A data frame with 181 observations and 4 variables:

trt

Treatment group (factor)

days

Survival time in days (numeric)

outcome

Trial outcome (factor)

id

Mouse identifier (integer)

Details

The dataset name has been kept as 'Mouse_df' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the ForCausality package and assists users in identifying its specific characteristics. The suffix 'df' indicates that the dataset is a data frame. The original content has not been modified in any way.

Source

Data taken from the survival package version 3.8-3


Chronic Pain Clinical Trial Data

Description

This dataset, Pain_df, is a data frame containing clinical trial data for chronic pain treatments. The trial compared active treatment versus placebo across different clinical centers and diagnoses. The dataset consists of 193 observations and 4 variables. Some observations may contain missing values.

Usage

data(Pain_df)

Format

A data frame with 193 observations and 4 variables:

treat

Treatment group (factor: active vs placebo)

response

Response outcome (factor)

center

Clinical trial center (factor)

diagnosis

Diagnosis category (factor)

Details

The dataset name has been kept as 'Pain_df' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the ForCausality package and assists users in identifying its specific characteristics. The suffix 'df' indicates that the dataset is a data frame. The original content has not been modified in any way.

Source

Data taken from the sanon package version 1.6
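
Examples

One way to look at the active versus placebo comparison is a table of response shares within each treatment group. The helper response_share is hypothetical, and the toy response labels are made up; the actual factor levels in Pain_df may differ.

```r
# Hypothetical helper (not part of ForCausality): row-wise proportions of
# each response category within each treatment group.
response_share <- function(df) {
  prop.table(table(treat = df$treat, response = df$response), margin = 1)
}

# Toy rows with made-up labels, only to show the call.
toy <- data.frame(
  treat    = c("active", "active", "placebo", "placebo"),
  response = c("good", "poor", "poor", "poor")
)
response_share(toy)

if (requireNamespace("ForCausality", quietly = TRUE)) {
  data("Pain_df", package = "ForCausality")
  print(response_share(Pain_df))
}
```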


Periodontal Disease Data

Description

This dataset, Periodontal_df, is a data frame containing information on smoking habits, demographics, and periodontal health indicators. The dataset consists of 882 observations and 12 variables, including smoking frequency, socioeconomic indicators, and periodontal measures. Some observations may contain missing values.

Usage

data(Periodontal_df)

Format

A data frame with 882 observations and 12 variables:

SEQN

Sequence identifier (numeric)

female

Sex indicator (numeric)

age

Age in years (numeric)

black

Race indicator for Black participants (numeric)

educf

Education level (ordered factor with 5 levels)

income

Income measure (numeric)

cigsperday

Cigarettes smoked per day (numeric)

either

Count of sites with periodontal disease (integer)

neither

Count of sites without periodontal disease (integer)

pcteither

Percentage of sites with periodontal disease (numeric)

z

Standardized measure (numeric)

mset

Additional periodontal health indicator (numeric)

Details

The dataset name has been kept as 'Periodontal_df' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the ForCausality package and assists users in identifying its specific characteristics. The suffix 'df' indicates that the dataset is a data frame. The original content has not been modified in any way.

Source

Data taken from the evident package version 1.0.4


External Control Trial Data for Post-partum Hemorrhage

Description

This dataset, Pph_df, provides data from an external control trial of treatments for post-partum hemorrhage. The dataset includes 802 observations and 17 variables, containing information on blood loss, treatment assignment, demographic characteristics, and educational background. Some observations contain missing values.

Usage

data(Pph_df)

Format

A data frame with 802 observations and 17 variables:

cum_blood_20m

Cumulative blood loss at 20 minutes (numeric)

tx

Treatment indicator (numeric)

age

Age of the participant (numeric)

no_educ

Indicator for no formal education (numeric)

...

Additional variables related to treatment and outcomes (numeric)

Details

The dataset name has been kept as 'Pph_df' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the ForCausality package and assists users in identifying its specific characteristics. The suffix 'df' indicates that the dataset is a data frame. The original content has not been modified in any way.

Source

Data taken from the causalOT package version 1.0.2


Respiratory Disorder Clinical Trial Data

Description

This dataset, Resp_df, is a data frame containing repeated measurements from a clinical trial on respiratory disorders under two treatment conditions. The dataset records demographic information (center, sex, age), baseline measures, and follow-up measurements across four visits. It consists of 111 observations and 9 variables. Some observations may contain missing values.

Usage

data(Resp_df)

Format

A data frame with 111 observations and 9 variables:

center

Clinical trial center (factor)

treatment

Treatment group (character)

sex

Sex of the participant (character)

age

Age of the participant (integer)

baseline

Baseline measurement (integer)

visit1

Measurement at visit 1 (integer)

visit2

Measurement at visit 2 (integer)

visit3

Measurement at visit 3 (integer)

visit4

Measurement at visit 4 (integer)

Details

The dataset name has been kept as 'Resp_df' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the ForCausality package and assists users in identifying its specific characteristics. The suffix 'df' indicates that the dataset is a data frame. The original content has not been modified in any way.

Source

Data taken from the sanon package version 1.6
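
Examples

Because the four follow-up measurements sit in separate columns, repeated-measures tools often need the data in long form first. The sketch below shows one way to do that with base R's reshape(). The helper to_long and the added subject column are hypothetical; the toy values are made up.

```r
# Hypothetical helper (not part of ForCausality): stack visit1..visit4 into
# one row per subject and visit. Adds a 'subject' id based on row order.
to_long <- function(df, visit_cols = paste0("visit", 1:4)) {
  df$subject <- seq_len(nrow(df))
  long <- reshape(
    df,
    direction = "long",
    varying   = visit_cols,
    v.names   = "value",
    timevar   = "visit",
    times     = seq_along(visit_cols),
    idvar     = "subject"
  )
  long <- long[order(long$subject, long$visit), ]
  rownames(long) <- NULL
  long
}

# Toy rows with made-up values, only to show the call.
toy <- data.frame(
  treatment = c("A", "P"),
  baseline  = c(2L, 3L),
  visit1 = c(1L, 5L), visit2 = c(2L, 6L),
  visit3 = c(3L, 7L), visit4 = c(4L, 8L)
)
to_long(toy)

if (requireNamespace("ForCausality", quietly = TRUE)) {
  data("Resp_df", package = "ForCausality")
  resp_long <- to_long(Resp_df)
  print(head(resp_long))
}
```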


Breast Cancer Prognostic Data (Rotterdam Study)

Description

This dataset, Rotterdam_df, provides prognostic factors for breast cancer patients used in the studies of Royston and Altman. The dataset includes 2,982 observations and 15 variables, covering patient demographics, tumor characteristics, treatments, and outcomes. Some observations contain missing values.

Usage

data(Rotterdam_df)

Format

A data frame with 2,982 observations and 15 variables:

pid

Patient identifier (integer)

year

Year of surgery (integer)

age

Age at diagnosis (integer)

meno

Menopausal status (integer indicator)

size

Tumor size category (factor)

grade

Tumor grade (integer)

nodes

Number of positive lymph nodes (integer)

pgr

Progesterone receptor level (integer)

er

Estrogen receptor level (integer)

hormon

Hormonal therapy received (integer indicator)

chemo

Chemotherapy received (integer indicator)

rtime

Relapse-free survival time in days (numeric)

recur

Recurrence indicator (integer)

dtime

Time to death in days (numeric)

death

Death indicator (integer)

Details

The dataset name has been kept as 'Rotterdam_df' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the ForCausality package and assists users in identifying its specific characteristics. The suffix 'df' indicates that the dataset is a data frame. The original content has not been modified in any way.

Source

Data taken from the survival package version 3.8-3
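
Examples

The recurrence and death columns can be combined into a single recurrence-free survival endpoint. The sketch below shows one possible construction, not necessarily the one intended by the package author: an event is a recurrence or a death, and the time is rtime for patients with a recurrence and dtime otherwise. The helper add_rfs and the new column names rfs and rfstime are hypothetical; the toy values are made up.

```r
# Hypothetical helper (not part of ForCausality): add a combined
# recurrence-free survival indicator and time to a data frame that has
# recur, death, rtime, and dtime columns.
add_rfs <- function(df) {
  df$rfs     <- pmax(df$recur, df$death)                   # 1 if either event occurred
  df$rfstime <- ifelse(df$recur == 1, df$rtime, df$dtime)  # time of first relevant event
  df
}

# Toy rows with made-up values, only to show the call.
toy <- data.frame(
  recur = c(1, 0, 0),
  death = c(1, 1, 0),
  rtime = c(100, 500, 800),
  dtime = c(300, 500, 800)
)
add_rfs(toy)

if (requireNamespace("ForCausality", quietly = TRUE)) {
  data("Rotterdam_df", package = "ForCausality")
  rott <- add_rfs(Rotterdam_df)
  print(table(rott$rfs))
}
```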


Seborrheic Dermatitis Clinical Trial Data

Description

This dataset, Sebor_df, is a data frame containing clinical trial data on seborrheic dermatitis, comparing test and placebo treatments. It records participant center, treatment assignment, dermatitis scores across three assessments, and severity indicators at the same points. The dataset consists of 167 observations and 8 variables. Some observations may contain missing values.

Usage

data(Sebor_df)

Format

A data frame with 167 observations and 8 variables:

center

Clinical trial center (factor)

treat

Treatment group: test or placebo (character)

score1

Dermatitis score at assessment 1 (integer)

score2

Dermatitis score at assessment 2 (integer)

score3

Dermatitis score at assessment 3 (integer)

severity1

Severity indicator at assessment 1 (integer)

severity2

Severity indicator at assessment 2 (integer)

severity3

Severity indicator at assessment 3 (integer)

Details

The dataset name has been kept as 'Sebor_df' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the ForCausality package and assists users in identifying its specific characteristics. The suffix 'df' indicates that the dataset is a data frame. The original content has not been modified in any way.

Source

Data taken from the sanon package version 1.6


Skin Condition Clinical Trial Data

Description

This dataset, Skin_df, is a data frame containing clinical trial data on skin conditions, comparing responses under placebo and test treatments. It includes participant center, treatment assignment, disease stage, and responses across three assessments. The dataset consists of 172 observations and 6 variables. Some observations may contain missing values.

Usage

data(Skin_df)

Format

A data frame with 172 observations and 6 variables:

center

Clinical trial center (factor)

treat

Treatment group: placebo or test (factor)

stage

Disease stage (integer)

res1

Response at assessment 1 (integer)

res2

Response at assessment 2 (integer)

res3

Response at assessment 3 (integer)

Details

The dataset name has been kept as 'Skin_df' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the ForCausality package and assists users in identifying its specific characteristics. The suffix 'df' indicates that the dataset is a data frame. The original content has not been modified in any way.

Source

Data taken from the sanon package version 1.6


Smoking and Homocysteine Data

Description

This dataset, SmokeH_df, is a data frame containing information on smoking, homocysteine levels, demographics, and socioeconomic indicators. The dataset consists of 2,475 observations and 15 variables, including biomarkers, smoking-related measures, age, education, and poverty ratio. Some observations contain missing values.

Usage

data(SmokeH_df)

Format

A data frame with 2,475 observations and 15 variables:

SEQN

Participant identifier (integer)

homocysteine

Homocysteine level (numeric)

z

Z score indicator (integer)

female

Sex indicator (integer, 1 = female, 0 = male)

age

Age in years (integer)

education

Education level (integer code)

povertyr

Poverty ratio (numeric)

bmi

Body mass index (numeric)

cotinine

Cotinine level (numeric)

st

Smoking type indicator (integer)

stf

Smoking type (character string)

age3

Age category (integer code)

ed3

Education category (integer code)

bmi3

BMI category (integer code)

pov2

Poverty category (logical)

Details

The dataset name has been kept as 'SmokeH_df' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the ForCausality package and assists users in identifying its specific characteristics. The suffix 'df' indicates that the dataset is a data frame. The original content has not been modified in any way.

Source

Data taken from the evident package version 1.0.4


Ischemic Stroke Case-Control Data

Description

This dataset, Stroke_df, contains fictional case-control data for ischemic stroke, including exposures, risk factors, and confounders. The dataset includes 16,623 observations and 21 variables, covering demographic details, lifestyle factors, biomarkers, and comorbidities. Some observations contain missing values.

Usage

data(Stroke_df)

Format

A data frame with 16,623 observations and 21 variables:

regionnn7

Geographic region (factor)

case

Case indicator for ischemic stroke (numeric)

esex

Sex of the participant (integer)

eage

Age of the participant (integer)

htnadmbp

Hypertension or blood pressure measure (numeric)

nevfcur

Smoking status (factor)

global_stress2

Perceived stress indicator (factor)

whrs2tert

Waist-to-hip ratio tertiles (factor)

phys

Physical activity indicator (factor)

alcohfreqwk

Weekly alcohol consumption frequency (factor)

dmhba1c2

Diabetes / HbA1c category (factor)

cardiacrfcat

Cardiac risk factor category (factor)

ahei3tert

Alternative Healthy Eating Index tertiles (factor)

apob_apoatert

ApoB/ApoA ratio tertiles (factor)

subeduc

Subject’s education level (factor)

moteduc

Mother’s education level (factor)

fatduc

Father’s education level (factor)

subhtn

Subject’s hypertension indicator (factor)

whr

Waist-to-hip ratio (numeric)

apob_apoa

ApoB/ApoA continuous ratio (numeric)

weights

Sample weights (numeric)

Details

The dataset name has been kept as 'Stroke_df' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the ForCausality package and assists users in identifying its specific characteristics. The suffix 'df' indicates that the dataset is a data frame. The original content has not been modified in any way.

Source

Data taken from the causalPAF package version 1.2.5


Thiamethoxam Application and Crop Yield Data

Description

This dataset, Thiam_df, is a data frame containing information on thiamethoxam applications and crop yield measurements in squash plants. The dataset consists of 165 observations and 11 variables, including treatment types, plant variety, replication, fruit counts, yield measures, and defoliation indicators. Some observations may contain missing values.

Usage

data(Thiam_df)

Format

A data frame with 165 observations and 11 variables:

trt

Treatment type (factor)

var

Plant variety (factor)

rep

Replication block (factor)

fruit

Number of fruits (numeric)

avg_mass

Average fruit mass (numeric)

mass

Total fruit mass (numeric)

yield

Crop yield (numeric)

visit

Pollinator visit count (numeric)

foliage

Foliage measure (numeric)

scb

Squash vine borer damage (numeric)

defoliation

Defoliation percentage (numeric)

Details

The dataset name has been kept as 'Thiam_df' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the ForCausality package and assists users in identifying its specific characteristics. The suffix 'df' indicates that the dataset is a data frame. The original content has not been modified in any way.

Source

Data taken from the melt package version 1.11.4


Ursodeoxycholic Acid Trial Data

Description

This dataset, Udca_df, contains data from a clinical trial of ursodeoxycholic acid (UDCA). The dataset includes 1,360 observations and 8 variables, covering treatment assignment, disease stage, bilirubin levels, risk scores, follow-up time, and outcomes. Some observations contain missing values.

Usage

data(Udca_df)

Format

A data frame with 1,360 observations and 8 variables:

id

Patient identifier (integer)

trt

Treatment group (integer)

stage

Disease stage (integer)

bili

Bilirubin level (numeric)

riskscore

Calculated risk score (numeric)

futime

Follow-up time in days (numeric)

status

Patient status indicator (numeric)

endpoint

Endpoint description (character)

Details

The dataset name has been kept as 'Udca_df' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the ForCausality package and assists users in identifying its specific characteristics. The suffix 'df' indicates that the dataset is a data frame. The original content has not been modified in any way.

Source

Data taken from the survival package version 3.8-3
