The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.

rcprd: An R package to simplify the extraction and processing of CPRD data, and create analysis-ready datasets

This article does not contain any real patient data. All patient data has been simulated but formatted to match the structure of CPRD Aurum data.

1 Introduction

The Clinical Practice Research Datalink (CPRD) is a large resource of Electronic Health Records from the UK, owned by the UK Medicines and Healthcare products Regulatory Agency (MHRA), and containing information on demography, medical history, test results and drug use of individuals registered with a general practice. The MHRA maintain two databases, CPRD GOLD, (Herrett et al. 2015) which contains data from general practices using the Vision computer system, and CPRD Aurum (Wolf et al. 2019), which contains data from general practices using the Egton Medical Information Systems (EMIS) computer system, EMIS Web. The primary care data is linked to hospital data, death registration data, cancer registry data, deprivation data and mental health services data, enabled by NHS digital.(Padmanabhan et al. 2019) As of 2016, the EMIS Web computer system was used by 4199 (56%) of the general practices in England.(Kontopantelis et al. 2018) As of September 2024, CPRD Aurum contained data on 47 million (16 million currently registered) individuals from 1,784 (1,596 currently contributing) general practices across the United Kingdom,(CPRD 2024a) and CPRD GOLD contained data on 21 million (2.9 million currently registered) individuals.(CPRD 2024b) CPRD is a widely used resource: since 2019, a PubMed search reveals there have been 540 studies published which contain “CPRD” in the title or abstract. Extraction of CPRD data and transformation into a format ready for statistical analysis is computationally demanding and requires a significant amount of work. There is limited published software available to aid researchers in the extraction and processing of CPRD data.Yimer et al. (2021)

rEHR (Springate et al. 2017) is an R package for manipulating and analysing electronic health record data, which works by creating an SQLite database on a fixed storage device (i.e. a disk drive), which is then subsequently queried to extract relevant information, faster than with conventional statistical analysis software. rEHR was designed to be database agnostic, and contains functionality for longitudinal data extraction, cutting data by time-varying covariates, matching controls to cases, converting the units of test data, and creating clinical code lists. rEHR is however no longer maintained, it works with an older version of R (3.3.2) and has been archived on CRAN. The aurumpipeline package (The Health Foundation Analytics Lab 2021) contains functions to clean and process CPRD Aurum data, which works by storing the data as parquet files on the disk drive, which are then subsequently queried to extract relevant data. However, aurumpipeline is not available on CRAN and is not provided with any reproducible examples. The R package drugprepr (Yimer et al. 2021) implements the algorithm of Pye et al. (2018) for preparing drug exposure data extracted from CPRD, however it does not deal with the initial data extraction and storing of data.

Given the many studies using CPRD data, and the limited availability software for data processing, this indicates that a large amount of research time is being spent duplicating the work of others in order to extract CPRD data. This study introduces rcprd, an R package designed to assist researchers in working with CPRD Aurum data and creating datasets which are ‘analysis-ready’. The main problem when working with CPRD Aurum data is the size of the raw data. Data on over 47 million individuals results in thousands of raw .txt files, and Terabytes of data, which can be cumbersome to work with. This is a particular issue for R users, as its infeasible to read all this data into the R workspace simultaneously, as R operates using physical memory (RAM). As suggested by Springate et al. (2017), rcprd bypassess this problem by creating an SQLite database which can be saved onto a fixed storage device. This SQLite database can then be queried for data of interest in order to build an analysis-ready dataset. rcprd then simplifies the process of querying the SQLite database with functions to extract variables such as “most recent test result”, “time until first event”, or “history of a specified condition”.

We start by discussing the structure of CPRD Aurum data and the approach taken by rcprd for processing this data, which draws heavily on the work of Springate et al. (2017). We then run through a worked example to showcase the functionality of rcprd, which has two main groups of functions. The first are to extract and store the data in a consistent manner. The second group is to query this data to extract patient level variables. We focus on CPRD Aurum, as opposed to CPRD GOLD, given there has been a considerable drop in the number of practices utilising Vision software in the last 10 years, limiting the research utility of the CPRD GOLD database. However, the rcprd package can also be used to manage linked secondary care (HES) and ONS death data, and is flexible to the point that it could be used to extract and store data from any electronic health record, which will be touched on in the discussion.

2 Data Structure and Extraction Process

2.1 Structure of CPRD Aurum data

We first define the terminology which will be used throughout this article:

The raw CPRD Aurum data is split into eight different file types: Consultation, DrugIssue, Observation, Patient, Practice, Problem, Referral, Staff. The data specification is available here: (CPRD 2022). For most research questions, the relevant files are Patient, Observation and DrugIssue. The Patient file contains information about registration into the database, date of death or lost to follow up, year of birth and gender. This file will be required to define a cohort. The observation file contains all medical diagnoses and tests, while DrugIssue contains information on prescriptions. Medical observations are identified by their medcodeid, whereas prescriptions are identified through their prodcodeid.

In order to facilitate data transfer, this data is commonly split by CPRD into numerous smaller files. The different patient files are denoted by the string set1, set2, set3 in the file name. Individuals in the same patient file will have the corresponding string (setX) in the files containing their medical or prescription data. However, there will be more than one Observation and DrugIssue file corresponding to each patient file. For example, the observation files for patients in set1, will have set1 in their file name, and then an extra suffix 1, 2, 3, etc. The same is true for the DrugIssue files. The naming structure for these is as follows:

where \(X\in{1,2,3,...}\) and \(Y\in{01,02,03,...}\). Note that the prefix to the file names may vary (i.e. the ‘aurum_allpatid’ part) however we expect the naming convention with regards to ‘set\(X\)’, file type, and ‘0\(Y\)’ to remain consistent. If this changes in the future, we will endeavour to update the rcprd as soon as possible.

3 Worked example for data extraction

3.1 Step 1: Defining a cohort

We have provided simulated patient, observation and drugissue files which will be utilisied in the worked example. The names of the files share the same naming convention given in section 2.1, and column names of the data match the real Aurum data. Numeric variables were simulated at random as integers between 1 and 100, date variables as a date between 01/01/1900 and 01/01/2000, gender as an integer 1 or 2, and year of birth as an integer between 1900 and 2000. Patient id and practice id were assigned manually. These files are contained in the inst/aurum_data directory of rcprd. After installing rcprd, this directory can be accessed using the command system.file("aurum_data", package = "rcprd"). This contains data on 12 fake patients, split across two patient files (set1 and set2) and three observation and drugissue files (all set1):

#devtools::install_github("alexpate30/rcprd")
#install.packages("rcprd") NOT YET ON CRAN
library(rcprd)
#devtools::load_all()
list.files(system.file("aurum_data", package = "rcprd"), pattern = ".txt")
#> [1] "aurum_allpatid_set1_extract_drugissue_001.txt"  
#> [2] "aurum_allpatid_set1_extract_drugissue_002.txt"  
#> [3] "aurum_allpatid_set1_extract_drugissue_003.txt"  
#> [4] "aurum_allpatid_set1_extract_observation_001.txt"
#> [5] "aurum_allpatid_set1_extract_observation_002.txt"
#> [6] "aurum_allpatid_set1_extract_observation_003.txt"
#> [7] "aurum_allpatid_set1_extract_patient_001.txt"    
#> [8] "aurum_allpatid_set2_extract_patient_001.txt"

The first step in most analyses is creating and defining a cohort of individuals, which will involve working with the patient files. Data from the patient files can be combined using the extract_cohort function. This will look in the directory specified through the filepath argument, for any file containing “patient” in the file name. All files will be read in and concatenated into a single dataset. In some circumstances, researchers may be provided with a list of patids which meet their inclusion/exclusion criteria. In this case, these can be specified through the patids argument (which requires a character vector). Suppose the individuals meeting the exclusion criteria are those with patid = 1, 3, 4 and 6. We would then specify:

pat <- extract_cohort(filepath = system.file("aurum_data", package = "rcprd"), patids = as.character(c(1,3,4,6)))
str(pat)
#> 'data.frame':    4 obs. of  12 variables:
#>  $ patid         : chr  "1" "3" "4" "6"
#>  $ pracid        : int  49 98 53 54
#>  $ usualgpstaffid: chr  "6" "43" "72" "11"
#>  $ gender        : int  2 1 2 1
#>  $ yob           : int  1984 1930 1915 1914
#>  $ mob           : int  NA NA NA NA
#>  $ emis_ddate    : Date, format: "1976-11-21" "1972-06-01" ...
#>  $ regstartdate  : Date, format: "1940-07-24" "1913-07-02" ...
#>  $ patienttypeid : int  58 81 10 85
#>  $ regenddate    : Date, format: "1996-08-25" "1997-04-24" ...
#>  $ acceptable    : int  1 1 0 1
#>  $ cprd_ddate    : Date, format: "1935-03-17" "1912-04-27" ...

In other circumstances, a user may need to apply the inclusion and exclusion criteria themselves. In this case, one would initially create a patient file for all individuals.

pat <- extract_cohort(filepath = system.file("aurum_data", package = "rcprd"))
str(pat)
#> 'data.frame':    12 obs. of  12 variables:
#>  $ patid         : chr  "1" "2" "3" "4" ...
#>  $ pracid        : int  49 79 98 53 62 54 49 79 98 53 ...
#>  $ usualgpstaffid: chr  "6" "11" "43" "72" ...
#>  $ gender        : int  2 1 1 2 2 1 2 1 1 2 ...
#>  $ yob           : int  1984 1932 1930 1915 1916 1914 1984 1932 1930 1915 ...
#>  $ mob           : int  NA NA NA NA NA NA NA NA NA NA ...
#>  $ emis_ddate    : Date, format: "1976-11-21" "1979-02-14" ...
#>  $ regstartdate  : Date, format: "1940-07-24" "1929-02-23" ...
#>  $ patienttypeid : int  58 21 81 10 45 85 58 21 81 10 ...
#>  $ regenddate    : Date, format: "1996-08-25" "1945-03-19" ...
#>  $ acceptable    : int  1 0 1 0 0 1 1 0 1 0 ...
#>  $ cprd_ddate    : Date, format: "1935-03-17" "1932-02-05" ...

The cohort of individuals would then be defined by applying study specific inclusion/exclusion criteria. For example, all individuals with > 1 day valid follow up aged 65+, after 1st January 2000. Such criteria can be applied solely using the information available in patient files. In this example, we define the individuals that met the inclusion criteria to be those with patid = 1, 3, 4 and 6.

pat <- subset(pat, patid %in% c(1,3,4,6))

Once the cohort has been defined, the next step is to extract medical/prescription data for these individuals.

3.2 Step 2: Reading in data and creating an SQLite database

Data for individuals in the cohort of interest is extracted from the .txt files and put into a SQLite database. This SQLite database is stored on a fixed storage device and can be queried when defining an analysis-ready dataset.

3.2.1 Add individual files to SQLite database using add_to_database

The function add_to_database can be used to add individual files to the SQLite database. Start by defining and connecting to your SQLite database. In this article we create a temporary database, but in practice this would be a permanent storage location. Specifically, file.path(tempdir(), "temp.sqlite") would be replaced by the desired file path and SQLite database name.

aurum_extract <- connect_database(file.path(tempdir(), "temp.sqlite"))

Next, we add medical diagnoses data from the observation files to this database using the add_to_database function. The simulated raw data provided with rcprd can be accessed using the system.file function. The vector of patient id’s that defines the cohort is defined through the subset_patids argument. Only data with patid’s matching this argument will be added to the SQLite database. The filetype argument will select an appropriate function for reading in the .txt files, and also defines the name of the table in the SQLite database that the files are added to. Note that for the first file, overwrite = TRUE is specified to create a new table. For the second and third file, append = TRUE is specified to append to an existing table.

add_to_database(filepath = system.file("aurum_data", "aurum_allpatid_set1_extract_observation_001.txt", package = "rcprd"), 
                filetype = "observation", subset_patids = c(1,3,4,6), db = aurum_extract, overwrite = TRUE)
add_to_database(filepath = system.file("aurum_data", "aurum_allpatid_set1_extract_observation_002.txt", package = "rcprd"), 
                filetype = "observation", subset_patids = c(1,3,4,6), db = aurum_extract, append = TRUE)
add_to_database(filepath = system.file("aurum_data", "aurum_allpatid_set1_extract_observation_003.txt", package = "rcprd"), 
                filetype = "observation", subset_patids = c(1,3,4,6), db = aurum_extract, append = TRUE)

We can then query this database, by selecting all rows from the observation table, and only printing the first 3. More details on how to query an SQLite database from within R is available in the documentation for R package RSQLite (Müller et al. 2024).

RSQLite::dbGetQuery(aurum_extract, 'SELECT * FROM observation', n = 3)
#>   patid consid pracid obsid obsdate enterdate staffid parentobsid
#> 1     1     33      1   100  -15931      -994      79          95
#> 2     1     66      1    46  -13782    -15232      34          17
#> 3     1     41      1    53  -20002      8845      35          79
#>           medcodeid value numunitid obstypeid numrangelow numrangehigh
#> 1   498521000006119    48        16        20          28           86
#> 2         401539014    22         1         2          27            8
#> 3 13483031000006114    17        78        13          87           41
#>   probobsid
#> 1        54
#> 2        35
#> 3        74

Note that when reading the raw data into R, the dates are converted into date formats, with a underlying numeric value where day 0 is 01/01/1970. When saved to the SQLite database, it is the underlying numeric values which is saved, hence the dates now appearing as numeric values. Next, the prescription data from the drugissue files is added to a table called drugissue. A single SQLite database may contain more than one table, so this data is added to a different table within the same SQLite database.

add_to_database(filepath = system.file("aurum_data", "aurum_allpatid_set1_extract_drugissue_001.txt", package = "rcprd"), 
                filetype = "drugissue", subset_patids = c(1,3,4,6), db = aurum_extract, overwrite = TRUE)
add_to_database(filepath = system.file("aurum_data", "aurum_allpatid_set1_extract_drugissue_002.txt", package = "rcprd"), 
                filetype = "drugissue", subset_patids = c(1,3,4,6), db = aurum_extract, append = TRUE)
add_to_database(filepath = system.file("aurum_data", "aurum_allpatid_set1_extract_drugissue_003.txt", package = "rcprd"), 
                filetype = "drugissue", subset_patids = c(1,3,4,6), db = aurum_extract, append = TRUE)

Again this table can be queried, by selecting all rows from the drugissue table, and only printing the first 3.

RSQLite::dbGetQuery(aurum_extract, 'SELECT * FROM drugissue', n = 3)
#>   patid issueid pracid probobsid drugrecid issuedate enterdate staffid
#> 1     1      93      1        88        83    -16118     -1013      98
#> 2     1      93      1        55        59    -13322    -12900      88
#> 3     1      16      1        22        82     -8677     -3543      50
#>         prodcodeid dosageid quantity quantunitid duration estnhscost quanunitid
#> 1 3092241000033113       58       18          33       27         12          6
#> 2   92041000033111       62       93          83       59         11         25
#> 3  971241000033111       87       43          83       88         65         92

Listing the tables in the SQLite database shows there are now two, named observation and drugissue.

RSQLite::dbListTables(aurum_extract)
#> [1] "drugissue"   "observation"

The add_to_database function allows specification of filetype = c("observation", "drugissue", "referral", "problem", "consultation", "hes_primary","death"), each corresponding to a specific function for reading in the corresponding .txt files with correct formatting. The "hes_primary" options correspond to the primary diagnoses file in linked HES APC data. The "death" file corresponds to the death file in the linked ONS data. If wanting to add other files to the SQLite database, a user defined function for reading in the raw .txt file can be specified through extract_txt_func, and a table name can be specified through tablename. This allows the user to add any .txt file to their SQLite database.

Finally, when manually adding files in this manner, it is good practice to close the connection to the SQLite database once finished.

RSQLite::dbDisconnect(aurum_extract)

3.2.2 Add all relevant files to SQLite database using cprd_extract

In practice, there will be a high number of files to add to the SQLite database and adding each one using add_to_database would be cumbersome. We now repeat the extraction but using the cprd_extract function, which is a wrapper for add_to_database, and will add all the files in a specified directory that contain a string matching the specified file type. Start by creating a connection to the database:

aurum_extract <- connect_database(file.path(tempdir(), "temp.sqlite"))

We then use cprd_extract to add all the observation files into the SQLite database. If the connection (aurum_extract) is to an existing database, which is the case here, it will be overwritten when running cprd_extract. The directory containing the files should be specified using filepath. It will only read in and add files with the text string specified in filetype, which takes values in c("observation", "drugissue", "referral", "problem", "consultation"). We then query the first three rows of this database, and note they are the same as previously.

### Extract data
cprd_extract(db = aurum_extract, 
             filepath = system.file("aurum_data", package = "rcprd"), 
             filetype = "observation", subset_patids = c(1,3,4,6), use_set = FALSE)
#>   |                                                                              |                                                                      |   0%
#> Adding C:/Users/mbrxsap3/AppData/Local/Temp/Rtmpu4y1rO/Rinst43ac58e6b4a/rcprd/aurum_data/aurum_allpatid_set1_extract_observation_001.txt 2024-11-11 22:30:40.655626
#>   |                                                                              |=======================                                               |  33%
#> Adding C:/Users/mbrxsap3/AppData/Local/Temp/Rtmpu4y1rO/Rinst43ac58e6b4a/rcprd/aurum_data/aurum_allpatid_set1_extract_observation_002.txt 2024-11-11 22:30:40.680561
#>   |                                                                              |===============================================                       |  67%
#> Adding C:/Users/mbrxsap3/AppData/Local/Temp/Rtmpu4y1rO/Rinst43ac58e6b4a/rcprd/aurum_data/aurum_allpatid_set1_extract_observation_003.txt 2024-11-11 22:30:40.703337
#>   |                                                                              |======================================================================| 100%

### Query first three rows
RSQLite::dbGetQuery(aurum_extract, 'SELECT * FROM observation', n = 3)
#>   patid consid pracid obsid obsdate enterdate staffid parentobsid
#> 1     1     33      1   100  -15931      -994      79          95
#> 2     1     66      1    46  -13782    -15232      34          17
#> 3     1     41      1    53  -20002      8845      35          79
#>           medcodeid value numunitid obstypeid numrangelow numrangehigh
#> 1   498521000006119    48        16        20          28           86
#> 2         401539014    22         1         2          27            8
#> 3 13483031000006114    17        78        13          87           41
#>   probobsid
#> 1        54
#> 2        35
#> 3        74

The process is then repeated for the drugissue files.

### Extract data
cprd_extract(db = aurum_extract, 
             filepath = system.file("aurum_data", package = "rcprd"), 
             filetype = "drugissue", subset_patids = c(1,3,4,6), use_set = FALSE)
#>   |                                                                              |                                                                      |   0%
#> Adding C:/Users/mbrxsap3/AppData/Local/Temp/Rtmpu4y1rO/Rinst43ac58e6b4a/rcprd/aurum_data/aurum_allpatid_set1_extract_drugissue_001.txt 2024-11-11 22:30:40.777615
#>   |                                                                              |=======================                                               |  33%
#> Adding C:/Users/mbrxsap3/AppData/Local/Temp/Rtmpu4y1rO/Rinst43ac58e6b4a/rcprd/aurum_data/aurum_allpatid_set1_extract_drugissue_002.txt 2024-11-11 22:30:40.801986
#>   |                                                                              |===============================================                       |  67%
#> Adding C:/Users/mbrxsap3/AppData/Local/Temp/Rtmpu4y1rO/Rinst43ac58e6b4a/rcprd/aurum_data/aurum_allpatid_set1_extract_drugissue_003.txt 2024-11-11 22:30:40.823978
#>   |                                                                              |======================================================================| 100%

### List tables
RSQLite::dbListTables(aurum_extract)
#> [1] "drugissue"   "observation"

### Query first three rows
RSQLite::dbGetQuery(aurum_extract, 'SELECT * FROM drugissue', n = 3)
#>   patid issueid pracid probobsid drugrecid issuedate enterdate staffid
#> 1     1      93      1        88        83    -16118     -1013      98
#> 2     1      93      1        55        59    -13322    -12900      88
#> 3     1      16      1        22        82     -8677     -3543      50
#>         prodcodeid dosageid quantity quantunitid duration estnhscost quanunitid
#> 1 3092241000033113       58       18          33       27         12          6
#> 2   92041000033111       62       93          83       59         11         25
#> 3  971241000033111       87       43          83       88         65         92

### Disconnect
RSQLite::dbDisconnect(aurum_extract)

The string to match on, function to read in the raw data, and the name of the table in the SQLite database, can be altered using the str_match, extract_txt_func and tablename arguments respectively. Note that this function may run for a considerable period of time when working with the entire CPRD AURUM database, and therefore it is not recommended to run interactively. While creation of the SQLite database may be time consuming, subsequent queries will be far more efficient, so this is short term pain for a long term gain.

3.2.3 Add all relevant files to SQLite database in a computationally efficient manner using the set functionality.

When the number of patients in your cohort is very large (for example millions, or tens of millions), the add_to_database function may perform very slowly. This is because for each observation in the file being added to the SQLite database, add_to_database checks to see whether the patid is contained in the vector subset_patids (a vector of length 20,000,000 in our case). We can utilise the structure of the CPRD AURUM data to speed up this process. If data has the set naming convention (see section 2.1), we know that we only need to search for patids from subset_patids, that are in the corresponding patient file. For example, when reading in file aurum_allpatid_set1_extract_observation_00Y.txt (for any Y), we only need to search whether patid is in the vector of patids from subset.patid, that are also in aurum_allpatid_set1_extract_patient_001.txt, which is much smaller vector. This can reduce the computation time for add_to_database and cprd_extract.

To achieve this, the subset_patids object should be a data frame with two required columns. The first column should be patid, the second should be set, reporting the corresponding value of set which the patient belongs to. The first step is therefore to create a patient file, which has an extra variable set, the number following the text string set in the patient file containing data for that patient. When reading in the patient files to create a cohort, this can be done by specifying set = TRUE. In this example, all individuals in our cohort come from the file with string set1, and therefore this variable is the same for all individuals in this cohort, however this will not be the case in practice.

pat <- extract_cohort(filepath = system.file("aurum_data", package = "rcprd"), patids = as.character(c(1,3,4,6)), set = TRUE)
str(pat)
#> 'data.frame':    4 obs. of  13 variables:
#>  $ patid         : chr  "1" "3" "4" "6"
#>  $ pracid        : int  49 98 53 54
#>  $ usualgpstaffid: chr  "6" "43" "72" "11"
#>  $ gender        : int  2 1 2 1
#>  $ yob           : int  1984 1930 1915 1914
#>  $ mob           : int  NA NA NA NA
#>  $ emis_ddate    : Date, format: "1976-11-21" "1972-06-01" ...
#>  $ regstartdate  : Date, format: "1940-07-24" "1913-07-02" ...
#>  $ patienttypeid : int  58 81 10 85
#>  $ regenddate    : Date, format: "1996-08-25" "1997-04-24" ...
#>  $ acceptable    : int  1 1 0 1
#>  $ cprd_ddate    : Date, format: "1935-03-17" "1912-04-27" ...
#>  $ set           : num  1 1 1 1

The patient file read in is the same as previously, with the addition of the set column. This file can be reduced to just the patid and set columns, and used as the input to subset_patids when running the add_to_database and cprd_extract functions. When extracting data from observation files with set1 in the name, it will only search for patient id’s with set == 1 in the data.frame provided to subset_patids.

### Create connection to SQLite database
aurum_extract <- connect_database(file.path(tempdir(), "temp.sqlite"))

### Add observation files
cprd_extract(db = aurum_extract, 
             filepath = system.file("aurum_data", package = "rcprd"), 
             filetype = "observation", 
             subset_patids = pat, 
             use_set = TRUE)
#>   |                                                                              |                                                                      |   0%
#> Adding C:/Users/mbrxsap3/AppData/Local/Temp/Rtmpu4y1rO/Rinst43ac58e6b4a/rcprd/aurum_data/aurum_allpatid_set1_extract_observation_001.txt 2024-11-11 22:30:40.965948
#>   |                                                                              |=======================                                               |  33%
#> Adding C:/Users/mbrxsap3/AppData/Local/Temp/Rtmpu4y1rO/Rinst43ac58e6b4a/rcprd/aurum_data/aurum_allpatid_set1_extract_observation_002.txt 2024-11-11 22:30:40.990244
#>   |                                                                              |===============================================                       |  67%
#> Adding C:/Users/mbrxsap3/AppData/Local/Temp/Rtmpu4y1rO/Rinst43ac58e6b4a/rcprd/aurum_data/aurum_allpatid_set1_extract_observation_003.txt 2024-11-11 22:30:41.012684
#>   |                                                                              |======================================================================| 100%

### Add drugissue files
cprd_extract(db = aurum_extract, 
             filepath = system.file("aurum_data", package = "rcprd"), 
             filetype = "drugissue", 
             subset_patids = pat, 
             use_set = TRUE)
#>   |                                                                              |                                                                      |   0%
#> Adding C:/Users/mbrxsap3/AppData/Local/Temp/Rtmpu4y1rO/Rinst43ac58e6b4a/rcprd/aurum_data/aurum_allpatid_set1_extract_drugissue_001.txt 2024-11-11 22:30:41.038475
#>   |                                                                              |=======================                                               |  33%
#> Adding C:/Users/mbrxsap3/AppData/Local/Temp/Rtmpu4y1rO/Rinst43ac58e6b4a/rcprd/aurum_data/aurum_allpatid_set1_extract_drugissue_002.txt 2024-11-11 22:30:41.065845
#>   |                                                                              |===============================================                       |  67%
#> Adding C:/Users/mbrxsap3/AppData/Local/Temp/Rtmpu4y1rO/Rinst43ac58e6b4a/rcprd/aurum_data/aurum_allpatid_set1_extract_drugissue_003.txt 2024-11-11 22:30:41.087364
#>   |                                                                              |======================================================================| 100%

### Query first three rows of each table
RSQLite::dbGetQuery(aurum_extract, 'SELECT * FROM observation', n = 3)
#>   patid consid pracid obsid obsdate enterdate staffid parentobsid
#> 1     1     33      1   100  -15931      -994      79          95
#> 2     1     66      1    46  -13782    -15232      34          17
#> 3     1     41      1    53  -20002      8845      35          79
#>           medcodeid value numunitid obstypeid numrangelow numrangehigh
#> 1   498521000006119    48        16        20          28           86
#> 2         401539014    22         1         2          27            8
#> 3 13483031000006114    17        78        13          87           41
#>   probobsid
#> 1        54
#> 2        35
#> 3        74
RSQLite::dbGetQuery(aurum_extract, 'SELECT * FROM drugissue', n = 3)
#>   patid issueid pracid probobsid drugrecid issuedate enterdate staffid
#> 1     1      93      1        88        83    -16118     -1013      98
#> 2     1      93      1        55        59    -13322    -12900      88
#> 3     1      16      1        22        82     -8677     -3543      50
#>         prodcodeid dosageid quantity quantunitid duration estnhscost quanunitid
#> 1 3092241000033113       58       18          33       27         12          6
#> 2   92041000033111       62       93          83       59         11         25
#> 3  971241000033111       87       43          83       88         65         92

Note that there is no difference compared to the previously extracted SQLite databases. The computational gains from applying the subsetting in this manner will not be realised in this example. We do not close the connection, as we will now move onto querying the database to extract variables for creating an analysis-ready dataset.

3.3 Step 3: Querying the SQLite database to extract variables

Once the data has been extracted and stored in an SQLite database, it can now be queried to create variables of interest. The normal process for extracting variables from electronic health records is to create code lists, a group of codes which denote the same condition. The database would then be queried for observations with medical codes matching those in the code list. A variable would then be defined based on this query. Whether this is a binary variable, indicating whether an individual has any record of a given code, or the most recent test result with the given code, or something much more complex. In CPRD Aurum, medical diagnoses and tests are identified from the observation file using medcodeids, and prescription data is identified from the drugissue file using prodcodeids. Creation of code lists is an important step of data extraction, and we refer elsewhere for details on best practice for developing code lists, and the limitations of working with code lists (Williams et al. 2019, 2017; Watson et al. 2017; Gulliford et al. 2009; Matthewman et al. 2024). The functions in this section are split into three groups:

These functions extract and query the data relative to an index date. The index date may be a fixed date (e.g. 1st January 2010), a date which is different for each individual (e.g. date age 50 reached), or a combination of the two (e.g., maximum of 1st January 2010 and date aged 50 reached). Note, if the inclusion/exclusion criteria for the cohort are dependent on medical diagnoses or prescriptions, the functions in this section will be necessary in order to apply these criteria, and further reduce the cohort (step 2.2).

3.3.1 Functions for extracting common variable types

There are functions to extract three common variable types, history of condition/medication prior to index date (extract_ho), time from the index date until first occurrence of a medical code/prescription or censoring (extract_time_until), and most recent test result(s) in a given time frame and valid range relative to the index date (extract_test_data).

The first, extract_ho, extracts a binary variable based on whether individual has a specified code recorded prior to index date. This can be applied to search for history of medical diagnoses or prescriptions. The index date ust be a variable in the cohort dataset, and is specified through the indexdt argument.

### Define codelist
codelist <- "187341000000114"

### Add an index date to cohort
pat$fup_start <- as.Date("01/01/2020", format = "%d/%m/%Y")

### Extract a history of type variable using extract_ho
ho <- extract_ho(cohort = pat, 
                 codelist_vector = codelist, 
                 indexdt = "fup_start", 
                 db_open = aurum_extract, 
                 tab = "observation",
                 return_output = TRUE)
str(ho)
#> 'data.frame':    4 obs. of  2 variables:
#>  $ patid: chr  "1" "3" "4" "6"
#>  $ ho   : int  1 0 0 1

The second is extract_time_until, which defines a time-to-event/survival variable. This has two components, the time until the first record of a specified code or censoring, and an indicator for whether event was observed or censored. To derive a variable of this type the cohort must also contain a time until censoring variable, which can be specified through censdt.

### Add an censoring date to cohort
pat$fup_end <- as.Date("01/01/2024", format = "%d/%m/%Y")

### Extract a time until variable using extract_time_until
time_until <- extract_time_until(cohort = pat, 
                                 codelist_vector = codelist, 
                                 indexdt = "fup_start", 
                                 censdt = "fup_end",
                                 db_open = aurum_extract, 
                                 tab = "observation",
                                 return_output = TRUE)
str(time_until)
#> 'data.frame':    4 obs. of  3 variables:
#>  $ patid        : chr  "1" "3" "4" "6"
#>  $ var_time     : num  1461 1461 1461 1461
#>  $ var_indicator: num  0 0 0 0

The third is extract_test, which will extract the most recent test result in a given time frame. The number of days before and after the index date to search for results are specified through time_post and time_prev respectively. Test results are identified from the observation file, using code lists. Lower and upper bounds can also be specified for the extracted data through lower_bound and upper_bound.

### Extract test data using extract_test_data
test_data <- extract_test_data(cohort = pat, 
                          codelist_vector = codelist, 
                          indexdt = "fup_start", 
                          db_open = aurum_extract,
                          time_post = 0,
                          time_prev = Inf,
                          return_output = TRUE)
str(test_data)
#> 'data.frame':    4 obs. of  2 variables:
#>  $ patid: chr  "1" "3" "4" "6"
#>  $ value: num  84 NA NA 28

More than one observation can be returned by specifying numobs. Metadata of the test result, such as the unit of measurement, date recorded, and the medical code, can be returned by settings numunitid = TRUE. A variation of this function, extract_test_data_var, will returns the standard deviation of the test data within the specified time and value range. Once all the variables of interest have been extracted, they can be merged into an analysis-ready dataset (step 4).

### Recursive merge
analysis.ready.pat <- Reduce(function(df1, df2) merge(df1, df2, by = "patid", all.x = TRUE), list(pat[,c("patid", "gender", "yob")], ho, time_until, test_data)) 
analysis.ready.pat
#>   patid gender  yob ho var_time var_indicator value
#> 1     1      2 1984  1     1461             0    84
#> 2     3      1 1930  0     1461             0    NA
#> 3     4      2 1915  0     1461             0    NA
#> 4     6      1 1914  1     1461             0    28

3.3.2 Functions for extracting specific variables

There are also a number of functions that can be used to extract specific variables:

  • extract_bmi: Derives BMI scores. Requires specification of codelist for BMI, height, and weight separately.
  • extract_cholhdl_ratio: Derives total cholesterol/high-density lipoprotein ratio. Requires specification of separate codelists for total cholesterol/high-density lipoprotein ratio, total cholesterol, and high-density lipoproteins separately.
  • extract_diabetes: Derives a categorical variable for history of type 1 diabetes, history of type 2 diabetes or no history of diabetes. Requires specification of separate codelists for type 1 and type 2 diabetes. Individuals with codes for both are designated as type 1.
  • extract_smoking: Derives a categorical variable for smoking status. Requires specification of seperate codelists for non-smoker, ex-smoker, light smoker, moderate smoker and heavy smoker. If the most recent smoking status is non-smoker, but there are historical codes which indicate smoking, then individual will be classified as an ex-smoker.

It was deemed that these variables required custom functions because their definitions did not fit into any of the variable types from section 3.3.1. In each case, a number of steps are taken in order to clean or manipulate the data in order to get the desired output. For example, height measurements recorded in centimeters are converted to metres in order to calculate BMI scores. This is done through the use of the numunitid variable in the observation file. For both BMI and cholesterol/high-density lipoprotein ratio, the variable can be either be identified directly, or calculated from the component mesaures. In each case, the component parts must be recorded in the specified time range relative to the index date. For smoking status, if an individuals most recent medical observation was recorded as a non-smoker, but their medical record shows previous smoking, the most recent record is changed to ex-smoker. The steps for cleaning the data and extracting these variables are provided in the vignette titled Details-on-algorithms-for-extracting-specific-variables. However, it is important to state, that the correct way to define a variable may change from study to study. Therefore when using these functions to extract variables, we encourage taking the time to ensure that the way the variable is extracted matches the definition in ones study.

3.3.3 Functions for database queries and custom variable extraction

These functions are utilised internally in the functions from sections 3.3.1 and 3.3.2. They have been provided to more easily enable package users to write their own functions for extracting variables that are not covered in the previous two sections.

The db_query function will query the SQLite database for observations where the medcodeid or prodcodeid is in a specified codelist. For example, we can query the observation table for all codes with medcodeid of 187341000000114.

db_query <- db_query(db_open = aurum_extract,
                     tab ="observation",
                     codelist_vector = "187341000000114")

db_query
#>     patid consid pracid  obsid obsdate enterdate staffid parentobsid
#>    <char> <char>  <int> <char>   <num>     <num>  <char>      <char>
#> 1:      1     42      1     81   -5373      4302      85          35
#> 2:      6     40      1     41  -14727     -6929      98          80
#>          medcodeid value numunitid obstypeid numrangelow numrangehigh probobsid
#>             <char> <num>     <int>     <int>       <num>        <num>    <char>
#> 1: 187341000000114    84        79        67          24           22         5
#> 2: 187341000000114    28        20         5          41           97        92

The combine_query_boolean function will assess whether each individual in a specified cohort (pat) has an observation in the queried data (obtained using db_query) within a specified time frame from the index date, returning a 0/1 vector. The cohort must contain a variable called indexdt containing the index date. This function is useful when defining ‘history of’ type variables, where we want to know if there is any record of a given condition prior to the index date.

### Add an index date to pat
pat$indexdt <- as.Date("01/01/2020", format = "%d/%m/%Y")

### Combine query with cohort creating a boolean variable denoting 'history of'
combine.query.boolean <- combine_query_boolean(cohort = pat,
                                               db_query = db_query,
                                               query_type = "med")
  
combine.query.boolean
#> [1] 1 0 0 1

The combine_query function will merge a cohort with the queried data and return a specified number of observations (numobs) within a specified time frame from the index date. This is useful when extracting test data and requiring access to the values of the tests, or when specifying variables that require > 1 observation within a certain time frame (i.e. two prescriptions within a month prior to index date). For queries from the observation table, the query type can be specified as "med" or "test". Inputting query_type = "med" will just return the date of the observations and the medcodeid.

### Combine query with cohort retaining most recent three records
combine.query <- combine_query(cohort = pat,
                               db_query = db_query,
                               query_type = "med",
                               numobs = 3)
  
combine.query
#>     patid       medcodeid obsdate
#>    <char>          <char>   <num>
#> 1:      1 187341000000114   -5373
#> 2:      6 187341000000114  -14727

For query_type = "test", the value and other relevant information will also be returned, and those with NA values removed (although this can be altered through argument value_na_rm). We then close the connection to the database.

### Extract a history of type variable using extract_ho
combine.query <- combine_query(cohort = pat,
                               db_query = db_query,
                               query_type = "test",
                               numobs = 3)
  
combine.query
#>     patid       medcodeid obsdate value numunitid numrangelow numrangehigh
#>    <char>          <char>   <num> <num>     <int>       <num>        <num>
#> 1:      1 187341000000114   -5373    84        79          24           22
#> 2:      6 187341000000114  -14727    28        20          41           97

### Disconnect
RSQLite::dbDisconnect(aurum_extract)

If the query was from the drugissue table, then query_type = "drug" should be specified, and the date of the observations and the prodcodeid will be returned. The functions in this section do little processing of the extracted data, and further manipulation is required in order to define most variables.

3.3.4 Saving extracted variables directly to a disk drive, and utilising rAURUMs suggested directory system

So far all extracted variables (using functions from section 3.3.1 and 3.3.2) have been read into the R workspace by specifying return_output = TRUE. When working with large cohorts it may be preferable to save the output directly onto a disk drive, by specifying out_save_disk = TRUE. The file path to save the output can be specified manually through the out_filepath argument. However, if this argument is left as NULL, rcprd will attempt to save the extracted variable into a directory “data/extraction/” relative to the working directory. The name of the file itself will be dependent on the variable name specified through argument varname. This can be a very convenient way to save the output directly to disk without having to repeatedly specify file paths and file names.

There is similar functionality when specifying the codelists. Codelists can be specified in two ways. The first is to read the codelist into R as a character vector and then specify through the argument codelist_vector, which has been done in all the previous examples. Alternatively, codelists stored on the disk drive can be referred to from the codelist argument in many rcprd functions, but requires a specific underlying directory structure. The codelist on the disk drive must be stored in a directory called “codelists/analysis/” relative to the working directory. The codelist must be a .csv file, and contain a column medcodeid, prodcodeid or ICD10 depending on the table being queried. The input to argument codelist should just be a character string of the name of the files (excluding the suffix ‘.csv’). The codelist_vector argument will take precedence over the codelist argument if both are specified.

Finally, there is similar functionality for accessing the SQLite database internally, rather than having to 1) open a connection, 2) use this as an input in the functions, and then 3) remember to close the connection. Instead, if the SQLite database is stored in a directory “data/sql/” relative to the working directory, the SQLite database can be referred to by name (a character string) with the argument db. A connection to the SQLite datbase will be opened internally within the function call, the SQLite database will be queried, and then the connection closed. Alternatively, a SQLite database stored anywhere on the disk drive can be accessed by specifying the full filepath (character string) with the argument db_filepath.

This workflow is advantageous as it avoids hard file paths which beneficial if wanting to move your code onto another computer system. Furthermore, once codelists and the SQLite database have been created and stored in the appropriate folders, they can simply be referred to by name, resulting in an easier workflow. The function create_directory_system() will create the directory system required to use rcprd in this way. To avoid repetition of the previous section, this is showcased just once using the extract_ho function. For the sake of this example, we start by setting the working directory to a directory called inst/example within rcprd. To maintain the new working directory across multiple R markdown code chunks, we use knitr::opts_knit$set. To follow this section, the user should simply set their working directory as usual using setwd().

## Set working directory
knitr::opts_knit$set(root.dir = tempdir())

Next, the create_directory_system() function can be used to generate the required directory structure.

suppressMessages(
  create_directory_system()
)

file.exists(file.path(tempdir(), "data"))
#> [1] TRUE
file.exists(file.path(tempdir(), "codelists"))
#> [1] TRUE
file.exists(file.path(tempdir(), "code"))
#> [1] TRUE

An SQLite database called “mydb.sqlite” is then created in the “data/sql” directory, using the same data from the previous examples:

## Open connection
aurum_extract <- connect_database("data/sql/mydb.sqlite")

## Add data to SQLite database using cprd_extract
cprd_extract(db = aurum_extract,
             filepath = system.file("aurum_data", package = "rcprd"),
             filetype = "observation", use_set = FALSE)
#>   |                                                                              |                                                                      |   0%
#> Adding C:/Users/mbrxsap3/AppData/Local/Temp/Rtmpu4y1rO/Rinst43ac58e6b4a/rcprd/aurum_data/aurum_allpatid_set1_extract_observation_001.txt 2024-11-11 22:30:42.75966
#>   |                                                                              |=======================                                               |  33%
#> Adding C:/Users/mbrxsap3/AppData/Local/Temp/Rtmpu4y1rO/Rinst43ac58e6b4a/rcprd/aurum_data/aurum_allpatid_set1_extract_observation_002.txt 2024-11-11 22:30:42.784241
#>   |                                                                              |===============================================                       |  67%
#> Adding C:/Users/mbrxsap3/AppData/Local/Temp/Rtmpu4y1rO/Rinst43ac58e6b4a/rcprd/aurum_data/aurum_allpatid_set1_extract_observation_003.txt 2024-11-11 22:30:42.809921
#>   |                                                                              |======================================================================| 100%

## Disconnect
RSQLite::dbDisconnect(aurum_extract)

Finally, a code list called mylist.csv is created and saved into the codelists/analysis/ directory.

### Define codelist
codelist <- data.frame(medcodeid = "187341000000114")

### Save codelist
write.csv(codelist, "codelists/analysis/mylist.csv")

The mydb.sqlite database can now be queried to create a ‘history of’ type variable using the codelist mylist.csv, with the output saved directly onto the disk drive.

extract_ho(cohort = pat,
           codelist = "mylist",
           indexdt = "fup_start",
           db = "mydb",
           tab = "observation",
           return_output = FALSE,
           out_save_disk = TRUE)

Note that in order to run extract_ho here, a connection to the SQLite database did not need to be created, the codelist did not need to be in the R workspace, and there is no output from this function. Instead the extracted variable has been saved onto the disk drive in an .rds file, and can be read in using:

readRDS("data/extraction/var_ho.rds")
#>   patid ho
#> 1     1  1
#> 3     3  0
#> 4     4  0
#> 6     6  1

This setup can be used in conjunction with any of the functions from step 3 (i.e. extract_test_var, extract_time_until or db_query).

3.3.5 Extracting longitudinal data/time varying covariates

All of the functions in section 3.3.1 and 3.3.2 have the option to extract data at a given time point post index date (specified through the t argument). This allows users to extract data at fixed intervals, which can be utilised for longitudinal analyses where time-varying covariates are required. If saving the extracted variables directly to the disk drive (out_save_disk = TRUE), the time at which data was extracted from, t, will be added to the file name by default.

4 Discussion

rcprd is an R package which allows users to process CPRD Aurum data in R in a consistent and computationally efficiency manner. It provides functionality to both read in and store data, and create analysis-ready datasets. The process avoids reading thousands of raw text files into the R workspace whenever a variable needs to be derived, minimising the risk of coding errors. rcprd enables the handling and storing the raw data, achieved through the creation of an SQLite database using RSQLite. The user can define their own functions for reading in the raw data, allowing these functions to be applied to other electronic health records, or future versions of CPRD Aurum which have different data structures. The functions for extraction of variables to create analysis-ready datasets involve are split into three groups: 1) Functions for extracting common variable types (history of a specified condition, time until event occurs, or most recent test result); 2) Functions for extracting specific variables; 3) Functions for database queries and custom variable extraction. These querying large data files that could not otherwise be handled in the R workspace. These functions uses computationally efficient SQL queries to query large datasets that could not be read into the R workspace, but no-user knowledge of SQL is required.

By utilising RSQLite for the storing and querying of the raw data, rcprd follows the suggested approach of rEHR (Springate et al. 2017). In many ways, rEHR is more comprehensive than rcprd, as it could also be used for case-control matching, cutting up a survival cohort by time-varying covariates, and constructing clinical code lists. Both packages provide functionality to query the underlying database for observations with specific medical or prescription codes without needing SQL experience, however differ in their method for doing so. rEHR functions return observations between specified dates, whether that is all clinical codes, or the first/last clinical code in that period. These functions can also be applied across multiple time periods (i.e. by year) simultaneously. In contrast, rcprd functions query the database and return observations in a time period relative to an index date, which may (or may not) be a different date for each patient. As well as functions to query the database, rcprd also provides functions which will extract specific variable types, again relative to a given index date. For example, a binary variable based on existence of a clinical code prior to the index date, a test result between a specified upper and lower bound, or a time-to-event/survival type variable. These functions can also be applied any number of days before/after the specified index date to allow extraction of data for longitudinal analyses. The approach of rcprd, extracting variables relative to an index date is common when building datasets to be used for development or validation of a clinical prediction model,(Riley et al. 2019) whereas the functions contained in rEHR are relevant for a wider range of epidemiological analyses, including case-control studies and reporting descriptive properties such as incidence/prevalence.

aurumpipeline takes a different approach to rEHR and rcprd by using parquet files to store the data as opposed to SQLite. Parquet files are efficient for data storage and are optimised for query performance, meaning this setup has a high ceiling in term of computational efficiency. aurumpipeline provides functions to query the raw data between two fixed dates, with the option to define a binary variable depending if specified medical codes are recorded in this time period. Beyond this, the arrow (Richardson et al. 2024) package is recommended for any further data base queries, meaning the derivation of other variables types will require user-developed functions.

The strength of rcprd is to simplify the complex process of turning raw CPRD data into an analysis-ready dataset, and does this by following the process of (Springate et al. 2017). Functions for extracting variables have been designed to be user friendly, to the extent that all that needs to be specified is the index date and code list, and a number of different common variable types can be derived. More basic functions are also provided, which simply return queries of the underlying data, in order to allow the user flexibility in defining their own functions for extracting other variables or summary statistics. The main limitation of this package is one inherent with all R packages, that they must be continuously maintained as R is updated. This is the main reason rcprd has been developed, in light of the archiving of rEHR. Package rights will be set-up so that rcprd can be maintained and taken over by other individuals. Another limitation is that the scope of this package is not comprehensive, for example in comparison to rEHR, and may not cover the needs of all statisticians/epidemiologists. However, as the scope and size of the package increases, so does the task of maintaining it. We believe in it’s current state, maintenance of rcprd is manageable going forwards. Furthermore, rcprd provides the foundations to build a data set for any type of analysis, some tasks will just require more user-input in order to define new functions around the database queries.

In summary, the main goal of this package is to reduce the duplication of time and effort among those using CPRD data for their research, allowing more time to be focused on other aspects of research projects. rcprd will be actively maintained for the foreseeable future. Suggestions for improvement are encouraged and can be posted on GitHub: https://github.com/alexpate30/calibmsm.

5 References

CPRD. 2022. CPRD Aurum Data Specification. Version 2.7.” https://www.cprd.com/.
———. 2024a. CPRD Aurum September 2024 Dataset.” https://www.cprd.com/doi/cprd-aurum-march-2024-dataset.
———. 2024b. CPRD GOLD September 2024 Dataset.” https://www.cprd.com/doi/cprd-gold-june-2024-dataset.
Gulliford, Martin C., Judith Charlton, Mark Ashworth, Anthony G. Rudd, Andre Michael Toschke, Brendan Delaney, Andy Grieve, et al. 2009. Selection of medical diagnostic codes for analysis of electronic patient records. Application to stroke in a primary care database.” PLoS ONE 4 (9). https://doi.org/10.1371/journal.pone.0007168.
Herrett, Emily, Arlene M. Gallagher, Krishnan Bhaskaran, Harriet Forbes, Rohini Mathur, Tjeerd van Staa, and Liam Smeeth. 2015. Data Resource Profile: Clinical Practice Research Datalink (CPRD).” International Journal of Epidemiology 44 (3): 827–36. https://doi.org/10.1093/ije/dyv098.
Kontopantelis, Evangelos, Richard John Stevens, Peter J. Helms, Duncan Edwards, Tim Doran, and Darren M. Ashcroft. 2018. Spatial distribution of clinical computer systems in primary care in England in 2016 and implications for primary care electronic medical record databases: A cross-sectional population study.” BMJ Open 8 (2): 1–7. https://doi.org/10.1136/bmjopen-2017-020738.
Matthewman, Julian, Kirsty Andresen, Anne Suffel, Liang-Yu Lin, Anna Schultze, John Tazare, Krishnan Bhaskaran, et al. 2024. Checklist and guidance on creating codelists for electronic health records research.” NIHR Open Research 4: 20. https://doi.org/10.3310/nihropenres.13550.1.
Müller, Kirill, Hadley Wickham, David A. James, and Seth Falcon. 2024. RSQLite: SQLite Interface for R.” https://rsqlite.r-dbi.org.
Padmanabhan, Shivani, Lucy Carty, Ellen Cameron, Rebecca E. Ghosh, Rachael Williams, and Helen Strongman. 2019. Approach to record linkage of primary care data from Clinical Practice Research Datalink to other health-related patient data: overview and implications.” European Journal of Epidemiology 34 (1): 91–99. https://doi.org/10.1007/s10654-018-0442-4.
Pye, Stephen R., Thérèse Sheppard, Rebecca M. Joseph, Mark Lunt, Nadyne Girard, Jennifer S. Haas, David W. Bates, et al. 2018. Assumptions made when preparing drug exposure data for analysis have an impact on results: An unreported step in pharmacoepidemiology studies.” Pharmacoepidemiology and Drug Safety 27 (7): 781–88. https://doi.org/10.1002/pds.4440.
Richardson, Niel, Ian Cook, Nic Crane, Dewey Dunnington, Romain Francois, Jonathan Keane, Dragos Moldovan-Grunfeld, Ooms Jeroen, and Jacob Wujciak-Jens. 2024. arrow: Integration to ’Apache’ ’Arrow’.” https://github.com/apache/arrow/.
Riley, Richard D., Danielle van der Windt, Peter Croft, and Karel G. M. Moons. 2019. Prognosis Research in Healthcare: Concepts, Methods, and Impact. Oxford: Oxford University Press.
Springate, David A., Rosa Parisi, Ivan Olier, David Reeves, and Evangelos Kontopantelis. 2017. rEHR: An R package for manipulating and analysing electronic health record data.” PLoS ONE 12 (2): 1–25. https://doi.org/10.1371/journal.pone.0171784.
The Health Foundation Analytics Lab. 2021. “Aurumpipeline.” https://github.com/HFAnalyticsLab/aurumpipeline.
Watson, Jessica, Brian D. Nicholson, Willie Hamilton, and Sarah Price. 2017. Identifying clinical features in primary care electronic health record studies: Methods for codelist development.” BMJ Open 7 (11): 1–9. https://doi.org/10.1136/bmjopen-2017-019637.
Williams, Richard, Benjamin Brown, Evan Kontopantelis, Tjeerd van Staa, and Niels Peek. 2019. Term sets: A transparent and reproducible representation of clinical code sets.” PLoS ONE 14 (2): 1–15. https://doi.org/10.1371/journal.pone.0212291.
Williams, Richard, Evangelos Kontopantelis, Iain Buchan, and Niels Peek. 2017. Clinical code set engineering for reusing EHR data for research: A review.” Journal of Biomedical Informatics 70: 1–13. https://doi.org/10.1016/j.jbi.2017.04.010.
Wolf, Achim, Daniel Dedman, Jennifer Campbell, Helen Booth, Darren Lunn, Jennifer Chapman, and Puja Myles. 2019. Data resource profile: Clinical Practice Research Datalink (CPRD) Aurum.” International Journal of Epidemiology 48 (6): 1740–1740G. https://doi.org/10.1093/ije/dyz034.
Yimer, Belay Birlie, David Selby, Meghna Jani, Goran Nenadic, Mark Lunt, and William G. Dixon. 2021. drugprepr: Prepare Electronic Prescription Record Data to Estimate Drug Exposure.” https://cran.r-project.org/package=drugprepr.

These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.