ClinicalTrials.gov
ClinicalTrials.gov is a registry and results database of publicly and privately supported clinical studies of human participants conducted around the world. Users can search for information about, and results from, those trials. This package provides a set of functions to interact with the site's search and download features. Results are downloaded to temporary directories and returned as R objects.
The package is available on CRAN and can be installed as usual. To install the latest version from GitHub, use devtools::install_github(), as follows:
install.packages("devtools")
library(devtools)
install_github("sachsmc/rclinicaltrials")
The main function is clinicaltrials_search(). Here’s an example of its use:
library(rclinicaltrials)
library(ggplot2)
library(dplyr)
z <- clinicaltrials_search(query = 'lime+disease')
str(z)
## 'data.frame': 20 obs. of 8 variables:
## $ score : chr "0.021204" "0.020414" "0.0073972" "0.0072419" ...
## $ nct_id : chr "NCT01951924" "NCT01333202" "NCT01056133" "NCT01644682" ...
## $ url : chr "https://ClinicalTrials.gov/show/NCT01951924" "https://ClinicalTrials.gov/show/NCT01333202" "https://ClinicalTrials.gov/show/NCT01056133" "https://ClinicalTrials.gov/show/NCT01644682" ...
## $ title : chr "LIME Study (LFB IVIg MMN Efficacy Study)" "Fresh Lime Alone for Smoking Cessation" "Effect of Fish-oil on Non-alcoholic Steatohepatitis (NASH)" "Replacement of Insecticides to Control Visceral Leishmaniasis (VL)" ...
## $ status.text : chr "Completed" "Completed" "Completed" "Completed" ...
## $ condition_summary : chr "Motor Neuron Disease" "Tobacco Use Disorder" "Non-alcoholic Fatty Liver Disease; Non-alcoholic Steatohepatitis" "Cost-effective and Sustainable Vector Control Methods Will be Established to Reduce VL in India, Bangladesh and Nepal" ...
## $ intervention_summary: chr "Drug: Biological : I10E (Human normal Immunoglobulin for intravenous administration 100mg/mL); Drug: Biological: Kiovig® (Human"| __truncated__ "Other: Fresh lime" "Other: Omega-3 capsules-Fish Oil" "Other: IWFPL; Other: IDWL; Other: ITN" ...
## $ last_changed : chr "July 18, 2016" "April 8, 2011" "May 10, 2016" "February 16, 2015" ...
This gives you basic information about the trials. Before searching or downloading, you can determine how many results will be returned using the clinicaltrials_count() function:
clinicaltrials_count(query = "myeloma")
## [1] 2213
clinicaltrials_count(query = "29485tksrw@")
## [1] 0
The query can be a single string, which is passed to the “search terms” field on ClinicalTrials.gov. Terms can be combined using the logical operators AND, OR, and NOT. Advanced searches can be performed by passing a vector of key=value pairs as strings. For example, to count interventional studies of cancer:
clinicaltrials_count(query = c("type=Intr", "cond=cancer"))
## [1] 44787
The possible advanced search terms are listed in the advanced_search_terms data frame that comes with the package. The data frame contains the keys, a description of each, and a link to the help page that explains the possible values of each search term. To open the help page for cond, for instance, run browseURL(advanced_search_terms["cond", "help"]).
head(advanced_search_terms)
## keys description
## term term Search Terms
## recr recr Recruitment
## rslt rslt Study Results
## type type Study Type
## cond cond Conditions
## intr intr Interventions
## help
## term http://clinicaltrials.gov/ct2/help/search_terms
## recr http://clinicaltrials.gov/ct2/help/recruitment
## rslt http://clinicaltrials.gov/ct2/help/study_results
## type http://clinicaltrials.gov/ct2/help/study_type
## cond http://clinicaltrials.gov/ct2/help/conditions_instr
## intr http://clinicaltrials.gov/ct2/help/interventions_instr
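Because a mistyped key is easy to miss in a vector of key=value strings, it can help to check the keys before sending a query. The sketch below is only an illustration: unknown_keys() is a hypothetical helper, not part of rclinicaltrials, and known_keys holds just the six keys visible in the head() output above. With the package loaded, the full set would presumably come from advanced_search_terms$keys.

```r
# Keys copied from the head() output above (only the first six rows).
# Assumption: with the package loaded, advanced_search_terms$keys holds them all.
known_keys <- c("term", "recr", "rslt", "type", "cond", "intr")

# Hypothetical helper (not part of rclinicaltrials): return the keys of a
# "key=value" query vector that do not appear in the known set.
unknown_keys <- function(query, keys) {
  query_keys <- sub("=.*$", "", query)  # keep the text before the first "="
  setdiff(query_keys, keys)
}

ok  <- unknown_keys(c("type=Intr", "cond=cancer"), known_keys)
bad <- unknown_keys(c("type=Intr", "condition=cancer"), known_keys)
print(ok)
print(bad)
```

An empty result means every key was recognized; anything else lists the keys worth double-checking against the help pages.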
To download detailed study information, including results, use clinicaltrials_download():
y <- clinicaltrials_download(query = 'myeloma', count = 10, include_results = TRUE)
str(y)
## List of 2
## $ study_information:List of 6
## ..$ study_info :'data.frame': 10 obs. of 34 variables:
## .. ..$ org_study_id : chr [1:10] "970030" "970143" "970099" "970202" ...
## .. ..$ nct_id : chr [1:10] "NCT00001561" "NCT00001582" "NCT00001623" "NCT00001637" ...
## .. ..$ brief_title : chr [1:10] "Active Immunization of Sibling Bone Marrow Transplant Donors Against Purified Myeloma Protein of the Recipient Undergoing Allog"| __truncated__ "Investigation of the Human Immune Response in Normal Subjects and Patients With Disorders of the Immune System and Cancer" "Bone Marrow Transplant Studies for Safe and Effective Treatment of Leukemia" "Immunosuppressive Preparation Followed by Blood Cell Transplant for the Treatment of Blood Cancers in Older Adults" ...
## .. ..$ official_title : chr [1:10] "Active Immunization of Sibling Bone Marrow Transplant Donors Against Purified Myeloma Protein of the Recipient Undergoing Allog"| __truncated__ "Collection of Blood, Bone Marrow and Tissue Samples for the Investigation of the Human Immune Response, Lymphoma Biology and HT"| __truncated__ "HLA-Matched Peripheral Blood Mobilized Hematopoietic Precursor Cell Transplantation Followed by T Cell Add-Back for Hematologic"| __truncated__ "Low Intensity Preparative Regimen Followed by HLA-Matched, Peripheral Blood Mobilized Hematopoietic Precursor Cell Transplantat"| __truncated__ ...
## .. ..$ overall_status : chr [1:10] "Completed" "Recruiting" "Completed" "Completed" ...
## .. ..$ start_date : chr [1:10] "November 1996" "July 1997" "March 1997" "September 1997" ...
## .. ..$ completion_date : chr [1:10] "September 2005" NA NA NA ...
## .. ..$ lead_sponsor/agency : chr [1:10] "National Cancer Institute (NCI)" "National Cancer Institute (NCI)" "National Heart, Lung, and Blood Institute (NHLBI)" "National Heart, Lung, and Blood Institute (NHLBI)" ...
## .. ..$ phase : chr [1:10] "Phase 3" "N/A" "N/A" "Phase 2" ...
## .. ..$ study_type : chr [1:10] "Interventional" "Observational" "Interventional" "Interventional" ...
## .. ..$ study_design : chr [1:10] "Primary Purpose: Treatment" "Time Perspective: Prospective" "Allocation: Non-Randomized, Endpoint Classification: Safety/Efficacy Study, Intervention Model: Single Group Assignment, Maskin"| __truncated__ "Allocation: Non-Randomized, Endpoint Classification: Safety/Efficacy Study, Intervention Model: Single Group Assignment, Maskin"| __truncated__ ...
## .. ..$ enrollment : chr [1:10] "30" NA NA NA ...
## .. ..$ primary_condition : chr [1:10] "Graft vs Host Disease; Multiple Myeloma" "T-cell Lymphoma; B-Cell Lymphoma; ATL; Myeloma" "Graft vs Host Disease; Hematologic Neoplasm; Leukemia; Multiple Myeloma; Myelodysplastic Syndrome" "Chronic Lymphocytic Leukemia; Graft vs Host Disease; Leukemia; Myelodysplastic Syndrome; Myeloid Leukemia" ...
## .. ..$ eligibility.gender : chr [1:10] "Both" "Both" "Both" "Both" ...
## .. ..$ eligibility.minimum_age : chr [1:10] "18 Years" "1 Year" "10 Years" "55 Years" ...
## .. ..$ eligibility.maximum_age : chr [1:10] "60 Years" "99 Years" "55 Years" "71 Years" ...
## .. ..$ eligibility.healthy_volunteers : chr [1:10] "No" "No" "No" "No" ...
## .. ..$ sponsors.lead_sponsor.agency : chr [1:10] "National Cancer Institute (NCI)" "National Cancer Institute (NCI)" "National Heart, Lung, and Blood Institute (NHLBI)" "National Heart, Lung, and Blood Institute (NHLBI)" ...
## .. ..$ sponsors.lead_sponsor.agency_class: chr [1:10] "NIH" "NIH" "NIH" "NIH" ...
## .. ..$ date_disclaimer : chr [1:10] "ClinicalTrials.gov processed this data on December 30, 2016" "ClinicalTrials.gov processed this data on December 30, 2016" "ClinicalTrials.gov processed this data on December 30, 2016" "ClinicalTrials.gov processed this data on December 30, 2016" ...
## .. ..$ overall_official.last_name : chr [1:10] NA "Thomas A Waldmann, M.D." "A. John Barrett, M.D." "A. John Barrett, M.D." ...
## .. ..$ overall_official.role : chr [1:10] NA "Principal Investigator" "Principal Investigator" "Principal Investigator" ...
## .. ..$ overall_official.affiliation : chr [1:10] NA "National Cancer Institute (NCI)" "National Heart, Lung, and Blood Institute (NHLBI)" "National Heart, Lung, and Blood Institute (NHLBI)" ...
## .. ..$ enrollment.text : chr [1:10] NA "1000" "41" "28" ...
## .. ..$ enrollment..attrs : chr [1:10] NA "Anticipated" "Actual" "Actual" ...
## .. ..$ primary_outcome.measure : chr [1:10] NA "Create Biobank" NA NA ...
## .. ..$ primary_outcome.time_frame : chr [1:10] NA "Ongoing" NA NA ...
## .. ..$ primary_outcome.safety_issue : chr [1:10] NA "No" NA NA ...
## .. ..$ completion_date.text : chr [1:10] NA NA "January 2008" "December 2016" ...
## .. ..$ completion_date..attrs : chr [1:10] NA NA "Actual" "Actual" ...
## .. ..$ completion_date_type : chr [1:10] NA NA "Actual" "Actual" ...
## .. ..$ primary_outcome : chr [1:10] NA NA "To evaluate the feasibility of using G-CSF mobilized donor blood to transplant a predetermined dose of stem cells and T lymphoc"| __truncated__ "The proportion of patients with clinically significant acute GHVD (Grade II or higher) following the T depleted PBPC transplant"| __truncated__ ...
## .. ..$ sponsors.collaborator.agency : chr [1:10] NA NA NA NA ...
## .. ..$ sponsors.collaborator.agency_class: chr [1:10] NA NA NA NA ...
## ..$ locations :'data.frame': 14 obs. of 9 variables:
## .. ..$ name : chr [1:14] "National Cancer Institute (NCI)" "National Institutes of Health Clinical Center, 9000 Rockville Pike" "National Institutes of Health Clinical Center, 9000 Rockville Pike" "National Institutes of Health Clinical Center, 9000 Rockville Pike" ...
## .. ..$ address.city : chr [1:14] "Bethesda" "Bethesda" "Bethesda" "Bethesda" ...
## .. ..$ address.state : chr [1:14] "Maryland" "Maryland" "Maryland" "Maryland" ...
## .. ..$ address.zip : chr [1:14] "20892" "20892" "20892" "20892" ...
## .. ..$ address.country : chr [1:14] "United States" "United States" "United States" "United States" ...
## .. ..$ nct_id : chr [1:14] "NCT00001561" "NCT00001582" "NCT00001623" "NCT00001637" ...
## .. ..$ status : chr [1:14] NA "Recruiting" NA NA ...
## .. ..$ contact.last_name: chr [1:14] NA "For more information at the NIH Clinical Center contact National Cancer Institute Referral Office" NA NA ...
## .. ..$ contact.phone : chr [1:14] NA "(888) NCI-1937" NA NA ...
## ..$ arms :'data.frame': 0 obs. of 0 variables
## ..$ interventions:'data.frame': 15 obs. of 3 variables:
## .. ..$ intervention_type: chr [1:15] "Drug" "Drug" "Procedure" "Procedure" ...
## .. ..$ intervention_name: chr [1:15] "Myeloma Immunoglobulin Idiotype Vaccine-KLH" "GM-CSF" "Allogeneic Bone Marrow Transplant" "Blood cell transplantation" ...
## .. ..$ nct_id : chr [1:15] "NCT00001561" "NCT00001561" "NCT00001623" "NCT00001637" ...
## ..$ outcomes :'data.frame': 4 obs. of 5 variables:
## .. ..$ measure : chr [1:4] "Create Biobank" "To evaluate the feasibility of using G-CSF mobilized donor blood to transplant a predetermined dose of stem cells and T lymphoc"| __truncated__ "The proportion of patients with clinically significant acute GHVD (Grade II or higher) following the T depleted PBPC transplant"| __truncated__ "The proportion of patients with clinically significant acute GHVD (Grade II or higher) following the T depleted PBPC transplant"| __truncated__
## .. ..$ time_frame : chr [1:4] "Ongoing" NA NA "Day 45"
## .. ..$ safety_issue: chr [1:4] "No" NA NA "Yes"
## .. ..$ type : chr [1:4] "primary_outcome" "primary_outcome" "primary_outcome" "primary_outcome"
## .. ..$ nct_id : chr [1:4] "NCT00001582" "NCT00001623" "NCT00001637" "NCT00001873"
## ..$ textblocks : NULL
## $ study_results :List of 3
## ..$ participant_flow: NULL
## ..$ baseline_data : NULL
## ..$ outcome_data : NULL
This returns a list of data frames that share a common key variable, nct_id. Optionally, you can also get the long text fields and/or the study results (if available). Study results are returned as a list of data frames nested within the main list.
The data come from a relational database with many text fields, so it may take some effort to get them into a flat format for analysis. For that reason, clinicaltrials_download() returns a list of data frames rather than a single table. To merge data frames, use the nct_id key; otherwise, you can analyze them separately. They are organized into study information, locations, outcomes, interventions, results, and textblocks. Results, where available, is itself a list of three data frames: participant flow, baseline data, and outcome data.
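As a minimal sketch of such a merge, the example below joins two small toy data frames on nct_id. The column names follow the str() output above, but the rows and NCT numbers are made up for illustration and nothing is downloaded.

```r
# Toy stand-ins for two of the returned data frames. Column names follow the
# str() output above; the values are invented for illustration.
study_info <- data.frame(
  nct_id = c("NCT00000001", "NCT00000002"),
  overall_status = c("Completed", "Recruiting"),
  stringsAsFactors = FALSE
)
interventions <- data.frame(
  nct_id = c("NCT00000001", "NCT00000001", "NCT00000002"),
  intervention_type = c("Drug", "Procedure", "Drug"),
  stringsAsFactors = FALSE
)

# Inner join on the shared key: one row per intervention, with the
# study-level columns repeated for each of a study's interventions.
merged <- merge(study_info, interventions, by = "nct_id")
print(merged)
```

With a real download such as y above, the same call would presumably take y$study_information$study_info and y$study_information$interventions as its inputs.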
Results tables are stored in long format, so there are often multiple rows per study, each corresponding to a different group or outcome. Let’s look at an example: the cumulative enrollment of men and women in phase III interventional melanoma studies over time. The query can also be passed as a list of named items.
melanom <- clinicaltrials_search(query = c("cond=melanoma", "phase=2",
                                           "type=Intr", "rslt=With"),
                                 count = 1e6)
nrow(melanom)
## [1] 27
table(melanom$status.text)
##
## Active, not recruiting Completed Terminated
## 9 16 2
melanom2 <- clinicaltrials_search(query = list(cond = "melanoma", phase = "2",
                                               type = "Intr", rslt = "With"),
                                  count = 1e6)
nrow(melanom2)
## [1] 27
Now to download the data and summarize it:
melanom_information <- clinicaltrials_download(query = c("cond=melanoma", "phase=2",
                                                         "type=Intr", "rslt=With"),
                                               count = 1e6, include_results = TRUE)
summary(melanom_information$study_results$baseline_data)
## title units
## Gender :140 years : 52
## Age : 97 participants:461
## Race/Ethnicity, Customized: 97 Participants:225
## Region of Enrollment : 81 Years : 25
## Age, Customized : 41
## Site of lesion : 27
## (Other) :280
## param dispersion subtitle
## Median : 25 Full Range : 35 Length:763
## Number :674 Standard Deviation: 42 Class :character
## Mean : 52 NA's :686 Mode :character
## Count of Participants: 12
##
##
##
## group_id value lower_limit
## Length:763 Length:763 Length:763
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## upper_limit
## Length:763
## Class :character
## Mode :character
##
##
##
##
## description
## Stage IIIB: Ulcerated lesion and 1 lymph node or 2-3 nodes with micrometastasis, or any-depth lesion with no ulceration, and 1 lymph node or 2-3 nodes with macrometastasis; Stage IIIC: Ulcerated lesion and 1 lymph node with macrometastasis; 2-3 nodes with macrometastasis or =4 metastatic lymph nodes, matted lymph nodes, or in-transit met(s)/satellite(s); Stage IV: M1a: Spread to skin, subcutaneous tissue, or lymph nodes; normal lactate dehydrogenase (LDH) level; M1b: Spread to lungs, normal LDH; M1c: Spread to all other visceral organs, normal LDH or any distant disease with elevated LDH.: 18
## The "M" in the TNM (tumor, node, metastasis) system refers to distant metastaseswhether, and how far, the cancer has spread outside the original site. M0: There is no evidence that the cancer has spread beyond the original site. M1: The cancer has spread beyond the original site. M1a: The cancer has spread to other areas of skin, underneath the epidermis to the dermis (subcutaneous), or to lymph node(s). M1b: The cancer has spread to the lung(s) only. M1c: The cancer has spread to other organs and/or locations in the body with or without elevated LDH. : 16
## Breslow's Thickness is a measure of the vertical thickness of a cutaneous melanoma lesion and is reported in millimeters (mm). : 15
## Scale used to assess how a patient's disease is progressing, how the disease affects the daily living abilities of the patient: 0 = Fully active, able to carry on all pre-disease performance without restriction; 1 = Restricted in physically strenuous activity, ambulatory, able to carry out work of a light nature; 2 = Ambulatory and capable of all self-care but unable to carry out any work activities. Up and about > 50% of waking hours; 3 = Capable of only limited self care, confined to a bed or chair > 50% of waking hours; 4 = Completely disabled, confined to bed or chair; 5 = Dead. : 15
## ECOG-Eastern Cooperative Oncology Group (ECOG) Performance Status is used by doctors and researchers to assess how a participant's disease is progressing, assess how the disease affects the daily living activities of the participant and determine appropriate treatment and prognosis. 0 = Fully Active (Most Favorable Activity); 1 = Restricted activity but ambulatory; 2 = Ambulatory but unable to carry out work activities; 3 = Limited Self-Care; 4 = Completely Disabled, No self-care (Least Favorable Activity) : 15
## (Other) :174
## NA's :510
## arm nct_id spread
## Length:763 Length:763 Length:763
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
# Keep the per-arm gender rows, excluding the "Total" arm
gend_data <- subset(melanom_information$study_results$baseline_data,
                    title == "Gender" & arm != "Total")
# Number of participants per study and gender
gender_counts <- gend_data %>% group_by(nct_id, subtitle) %>%
  do( data.frame(
    count = sum(as.numeric(paste(.$value)), na.rm = TRUE)
  ))
# Extract the start year from dates such as "November 1996"
dates <- melanom_information$study_information$study_info[, c("nct_id", "start_date")]
dates$year <- sapply(strsplit(paste(dates$start_date), " "), function(d) as.numeric(d[2]))
counts <- merge(gender_counts, dates, by = "nct_id")
# Total per year and gender, then a running total within each gender
cts <- counts %>% group_by(year, subtitle) %>%
  summarize(count = sum(count)) %>%
  group_by(subtitle) %>%
  arrange(year, .by_group = TRUE) %>%
  mutate(cumulative = cumsum(count)) %>%
  ungroup()
colnames(cts)[2] <- "Gender"
ggplot(cts, aes(x = year, y = cumulative, color = Gender)) +
  geom_line() + geom_point() +
  labs(title = "Cumulative enrollment into Phase III, \n interventional trials in Melanoma, by gender") +
  scale_y_continuous("Cumulative Enrollment")
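Cumulative enrollment by gender means a running total taken separately within each gender, ordered by year. The sketch below shows that calculation in base R on a small toy table; the counts are made up for illustration and are not taken from the downloaded results.

```r
# Made-up counts, for illustration only (not real trial data).
toy <- data.frame(
  year   = c(2005, 2005, 2006, 2006, 2007, 2007),
  Gender = c("Female", "Male", "Female", "Male", "Female", "Male"),
  count  = c(10, 20, 5, 15, 30, 10)
)

# Sort by year, then take the running total separately for each gender.
toy <- toy[order(toy$year), ]
toy$cumulative <- ave(toy$count, toy$Gender, FUN = cumsum)
print(toy)
```

Here ave() applies cumsum() within each level of Gender and returns the values in the original row order, so the result can be stored as a new column directly.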