fhircrackr
is a package designed to help analyzing HL7 FHIR resources.
FHIR stands for Fast Healthcare Interoperability Resources and is a standard describing data formats and elements (known as “resources”) as well as an application programming interface (API) for exchanging electronic health records. The standard was created by the Health Level Seven International (HL7) health-care standards organization. For more information on the FHIR standard, visit https://www.hl7.org/fhir/.
While FHIR is a very useful standard to describe and exchange medical data in an interoperable way, it is not very useful for statistical analyses of said data. This is due to the fact that FHIR data is stored in many nested and interlinked resources instead of matrix-like structures.
Thus, to be able to do statistical analyses a tool is needed that allows converting these nested resources into data frames. This process of flattening FHIR resources is not trivial, as the unpredictable degree of nesting and connectedness of the resources makes generic solutions to this problem not feasible.
We therefore implemented a package that makes it possible to download FHIR resources from a server into R and to flatten these resources into (multiple) data frames.
The package is still under development. The CRAN version of the package contains all functions that are already stable, for more recent (but potentially unstable) developments, the development version of the package can be downloaded from GitHub using devtools::install_github("POLAR-fhiR/fhircrackr")
.
The complexity of the problem requires a couple of prerequisites both regarding knowledge and access to data. We will shortly list the preconditions to using the fhircrackr
package here:
First of all, you need the endpoint of the FHIR server you want to access. If you don’t have you own FHIR server, you can use one of the publicly available servers, such as https://hapi.fhir.org/baseR4 or http://fhir.hl7.de:8080/baseDstu3. The endpoint of a FHIR server is often referred to as [base].
To download ressources from the server, you should be familiar with FHIR search requests. FHIR search allows you to download sets of resources that match very specific requirements. As the focus of this package is dealing with FHIR resources in R, rather than the intricacies of FHIR search, we will mostly use simple examples of FHIR search requests. Most of them will have the form [base]/[type]?parameter(s)
, where [type]
refers to the type of resource you are looking for and parameter(s)
characterise specific properties those resources should have. http://hapi.fhir.org/baseR4/Patient?gender=female
for example downloads all Patient resources from the fhir server at http://hapi.fhir.org/baseR4/
that represent female patients.
In the first step, fhircrackr
downloads the resources in xml format into R. To specify which elements from the FHIR resources you want in your data frame, you should have at least some familiarity with XPath expressions. A good tutorial on XPath expressions can be found here.
In the following we’ll go through a typical workflow with fhircrackr
step by step.
We will start with a very simple example and use fhir_search()
to download Patient resources from a publicly available HAPI server after we’ve loaded the package with library(fhircrackr)
:
library(fhircrackr)
patient_bundles <- fhir_search(request="http://hapi.fhir.org/baseR4/Patient?", max_bundles=2)
The minimum information fhir_search()
requires is a string containing the full FHIR search request in the argument request
. In general, a fhir search request returns a bundle of the resources you requested. If there are a lot of resources matching your request, the search result isn’t returned in one big bundle but distributed over several of them. If the argument max_bundles
is set to its default Inf
, fhir_search()
will return all available bundles, meaning all resources matching your request. If you set it to 2
as in the above example, the download will stop after the first two bundles. Note that in this case, the result may not contain all the resources from the server matching your request.
If you want to connect to a fhir server that uses basic authentification, you can supply the arguments username
and password
.
Because endpoints can sometimes be hard to reach, fhir_search()
will start five attempts to connect to the endpoint before it gives up. With the arguments max_attempts
and delay_between_attempts
you can control this number as well the time interval between attempts.
As you can see in the next block of code, fhir_search()
returns a list of xml objects where each list element represents one bundle of resources, so a list of two xml objects in our case:
str(patient_bundles)
#> List of 2
#> $ http://hapi.fhir.org/baseR4/Patient? :List of 2
#> ..$ node:<externalptr>
#> ..$ doc :<externalptr>
#> ..- attr(*, "class")= chr [1:2] "xml_document" "xml_node"
#> $ http://hapi.fhir.org/baseR4?_getpages=4e4aef4b-d689-44fc-a51f-5e9e223d9d5a&_getpagesoffset=20&_count=20&_pretty=true&_bundletype=searchset:List of 2
#> ..$ node:<externalptr>
#> ..$ doc :<externalptr>
#> ..- attr(*, "class")= chr [1:2] "xml_document" "xml_node"
If for some reason you cannot connect to a FHIR server at the moment but want to explore the following functions anyway, the package provides an example list of bundles containing Patient and MedicationStatement resources. See ?medication_bundles
for how to use it.
Now we know that inside these xml objects there is the patient data somewhere. To get it out, we will use fhir_crack()
. The most important arguments fhir_crack()
takes is bundles
, the list of bundles that is returned by fhir_search()
and design
, an object that tells the function wich data to extract from the bundle. It returns a list of data.frames.
In general, design
is a named list containing one element per data frame that will be created. The element names of design
are going to be the names of the resulting data frames. It usually makes sense to create one data frame per type of resource. Because we have just downloaded resources of the type Patient, the design
here would be a list of length 1.
The elements of design
are lists themselves. They can be of length 1 or length 2, depending on the level of precision in extracting the attributes. There are three levels of precision in extracting the data for our data frame with fhir_crack()
:
If we want to extract all available attributes, the list describing the data frames inside design is a list of length 1, containing only an Xpath expression to the resource type we want to extract:
#define which elements of the resources are of interest
design1 <- list(
Patients = list(
".//Patient"
)
)
#Convert resources
list_of_tables <- fhir_crack(patient_bundles, design1)
#>
#> Patients
#> 1....................
#> 2....................
#> FHIR-Resources cracked.
#have look at part of the results
list_of_tables$Patients[1:5,5:10]
#> text.status.value text.div.div.class text.div.table.class
#> 1 generated hapiHeaderText hapiPropertyTable
#> 2 generated hapiHeaderText hapiPropertyTable
#> 3 generated hapiHeaderText hapiPropertyTable
#> 4 generated hapiHeaderText hapiPropertyTable
#> 5 generated hapiHeaderText hapiPropertyTable
#> identifier.type.coding.system.value identifier.type.coding.code.value
#> 1 http://hl7.org/fhir/v2/0203 MR
#> 2 http://hl7.org/fhir/v2/0203 MR
#> 3 http://hl7.org/fhir/v2/0203 MR
#> 4 http://hl7.org/fhir/v2/0203 MR
#> 5 http://hl7.org/fhir/v2/0203 MR
#> identifier.value.value
#> 1 4034a31f-57cf-4dde-944e-6a3522627bb3
#> 2 4f63e2ee-fe67-468f-98de-5b322de8c9ad
#> 3 c70c62d9-4cce-46b0-ae01-96202edba35f
#> 4 46561a78-1e69-42f2-9b5e-0f46e39c6189
#> 5 f1bb77f1-60c5-4a74-abeb-4a4353d63079
Note that this can easily become a rather wide and sparse data frame. This is due to the fact that every attribute appearing in at least one of the resources will be turned into a variable (i.e. column), even if none of the other resources contain this attribute. For those resources, the value on that attribute will be set to NA
. Depnding on the variability of the resources, the resulting data frame can contain a lot of NA
values. If a recource has multiple entries for an attribute, these are pasted together using the string provided in the argument sep
as a seperator. The column names in this option are automatically generated by pasting together the path to the respective attribute.
We can extract all attributes that are found on a certain level of the resource, if we specify this level in an XPath expression and provide it as the second element of the list describing the data.frame:
#define which elements of the resources are of interest
design2 <- list(
Patients = list(
".//Patient",
"./*/@value"
)
)
#Convert resources
list_of_tables <- fhir_crack(patient_bundles, design2)
#>
#> Patients
#> 1....................
#> 2....................
#> FHIR-Resources cracked.
#have look at the results
head(list_of_tables$Patients)
#> id.value gender.value birthDate.value active.value deceasedBoolean.value
#> 1 642193 <NA> <NA> <NA> <NA>
#> 2 642194 <NA> <NA> <NA> <NA>
#> 3 642201 <NA> <NA> <NA> <NA>
#> 4 642202 <NA> <NA> <NA> <NA>
#> 5 642209 <NA> <NA> <NA> <NA>
#> 6 642210 <NA> <NA> <NA> <NA>
#> multipleBirthBoolean.value
#> 1 <NA>
#> 2 <NA>
#> 3 <NA>
#> 4 <NA>
#> 5 <NA>
#> 6 <NA>
"./*/@value"
for example tells fhir_crack()
to extract all attributes that are located (exactly) one level below the root level.
If we know exactly which attributes we want to extract, we can specify them in a named list that we provide as the second element of the list describing the data.frame:
#define which elements of the resources are of interest
design3 <- list(
Patients = list(
".//Patient",
list(
PID = "id/@value",
NAME.USE = "name/use/@value",
NAME.GIVEN = "name/given/@value",
NAME.FAMILY = "name/family/@value",
GENDER = "gender/@value",
BIRTHDAY = "birthDate/@value"
)
)
)
#Convert resources
list_of_tables <- fhir_crack(patient_bundles, design3)
#>
#> Patients
#> 1....................
#> 2....................
#> FHIR-Resources cracked.
#have look at the results
head(list_of_tables$Patients)
#> PID NAME.USE NAME.GIVEN NAME.FAMILY GENDER BIRTHDAY
#> 1 642193 Bob Alexander
#> 2 642194 Caleb Cushing
#> 3 642201 Bob Alexander
#> 4 642202 Caleb Cushing
#> 5 642209 Bob Alexander
#> 6 642210 Caleb Cushing
This options will usually return the most tidy and clear data frames. You should always extract the resource id, because this is used to link to other resources you might also extract. If you are not sure which attributes are available or where they are located in the resource, it can be helpful to start by extracting all availabe attributes. Then you can get an overview over the available attributes and their location and continue by doing a second, more targeted extraction to get your final data frame.
Of course the previous example is using just one resource type. If you are interested in several types of resources, design
will have more elements and the result will be a list of several data frames.
The abstract form design
should therefore have is:
list(
#Option 1: extract all attributes
<Name of first data frame> = list(
<XPath to resource type>
),
#Option 2: extract attributes from certain level
<Name of second data frame> = list(
<XPath to resource type>,
<XPath indicating attribute level>
),
#Option 3: extract specific attributes
<Name of third data frame> = list(
<XPath to resource type>,
list(
<column name 1> = <XPath to attribute>,
<column name 2> = <XPath to attribute>
...
)
),
...
)
In reality your FHIR search requests are probably going to be slightly more complex than just asking for Patient resources. Consider the following example where we want to download MedicationStatements refering to a certain medication we specify with its snomed code and also the Patient resources the MedicationStatements are linked to.
When the FHIR search request gets longer, it can be helpful to build up the request piece by piece like this:
search_request <- paste0(
"https://hapi.fhir.org/baseR4/", #server endpoint
"MedicationStatement?", #look for MedicationsStatements
"code=http://snomed.info/ct|429374003", #only choose resources with this loinc code
"&_include=MedicationStatement:subject") #include the corresponding Patient resources
Then we can download the resources:
And convert them into to data frames, one for the MedicationStatements and one for the Patients:
design <- list(
MedicationStatement = list(
".//MedicationStatement",
list(
MS.ID = "id/@value",
STATUS.TEXT = "text/status/@value",
STATUS = "status/@value",
MEDICATION.SYSTEM = "medicationCodeableConcept/coding/system/@value",
MEDICATION.CODE = "medicationCodeableConcept/coding/code/@value",
MEDICATION.DISPLAY = "medicationCodeableConcept/coding/display/@value",
DOSAGE = "dosage/text/@value",
PATIENT = "subject/reference/@value",
LAST.UPDATE = "meta/lastUpdated/@value"
)
),
Patients = list(
".//Patient",
"./*/@value"
)
)
list_of_tables <- fhir_crack(medication_bundles, design)
#>
#> MedicationStatement
#> 1....................
#> 2....................
#> 3....................
#>
#> Patients
#> 1....................
#> 2....................
#> 3....................
#> FHIR-Resources cracked.
head(list_of_tables$MedicationStatement)
#> MS.ID STATUS.TEXT STATUS MEDICATION.SYSTEM MEDICATION.CODE
#> 1 30233 generated active http://snomed.info/ct 429374003
#> 2 42012 generated active http://snomed.info/ct 429374003
#> 3 42091 generated active http://snomed.info/ct 429374003
#> 4 45646 generated active http://snomed.info/ct 429374003
#> 5 45724 generated active http://snomed.info/ct 429374003
#> 6 45802 generated active http://snomed.info/ct 429374003
#> MEDICATION.DISPLAY DOSAGE PATIENT
#> 1 simvastatin 40mg 1 tab once daily Patient/30163
#> 2 simvastatin 40mg 1 tab once daily Patient/41945
#> 3 simvastatin 40mg 1 tab once daily Patient/42024
#> 4 simvastatin 40mg 1 tab once daily Patient/45579
#> 5 simvastatin 40mg 1 tab once daily Patient/45657
#> 6 simvastatin 40mg 1 tab once daily Patient/45735
#> LAST.UPDATE
#> 1 2019-09-26T14:34:44.543+00:00
#> 2 2019-10-09T20:12:49.778+00:00
#> 3 2019-10-09T22:44:05.728+00:00
#> 4 2019-10-11T16:17:42.365+00:00
#> 5 2019-10-11T16:30:24.411+00:00
#> 6 2019-10-11T16:32:05.206+00:00
head(list_of_tables$Patients)
#> id.value gender.value birthDate.value
#> 1 60096 male 2019-11-13
#> 2 49443 female 1970-10-19
#> 3 46213 female 2019-10-11
#> 4 45735 male 1970-10-11
#> 5 42024 female 1979-10-09
#> 6 58504 male 2019-11-08
As you can see, the result now contains two data frames, one for Patient resources and one for MedicationStatement resources.
A particularly complicated problem in flattening FHIR resources is caused by the fact that there can be multiple entries to an attribute. The profile your FHIR resources have been built by defines how often a particular attribute can appear in a resource. This is called the cardinality of the attribute. For example the Patient resource defined here can have zero or one birthdates but arbitratily many addresses. In general, fhir_crack()
will paste multiple entries for same attribute together, using the seperator provided by the sep
argument. In most cases this will work just fine, but there are some special cases that require a little more attention.
Let’s have a look at the following example, where we have a bundle containing just two Patient resources:
bundle<-xml2::read_xml(
"<Bundle>
<Patient>
<id value='id1'/>
<address>
<use value='home'/>
<city value='Amsterdam'/>
<type value='physical'/>
<country value='Netherlands'/>
</address>
<birthDate value='1992-02-06'/>
</Patient>
<Patient>
<id value='id2'/>
<address>
<use value='home'/>
<city value='Rome'/>
<type value='physical'/>
<country value='Italy'/>
</address>
<address>
<use value='work'/>
<city value='Stockholm'/>
<type value='postal'/>
<country value='Sweden'/>
</address>
<birthDate value='1980-05-23'/>
</Patient>
<Patient>
<id value='id3'/>
<address>
<use value='home'/>
<city value='Berlin'/>
</address>
<address>
<type value='postal'/>
<country value='France'/>
</address>
<address>
<use value='work'/>
<city value='London'/>
<type value='postal'/>
<country value='England'/>
</address>
<birthDate value='1974-12-25'/>
</Patient>
</Bundle>"
)
bundle_list<-list(bundle)
This bundle contains three Patient resources. The first resource has just one entry for the address attribute. The second Patient resource has two entries containing the same elements for the address attribute. The third Patient resource has a rather messy address attribute, with three entries containing different elements.
Let’s see what happens if we extract all attributes:
design1 <- list(
Patients = list(".//Patient")
)
df1 <- fhir_crack(bundle_list, design1, sep = " | ",)
#>
#> Patients
#> 1...
#> FHIR-Resources cracked.
df1$Patients
#> id.value address.use.value address.city.value address.type.value
#> 1 id1 home Amsterdam physical
#> 2 id2 home | work Rome | Stockholm physical | postal
#> 3 id3 home | work Berlin | London postal | postal
#> address.country.value birthDate.value
#> 1 Netherlands 1992-02-06
#> 2 Italy | Sweden 1980-05-23
#> 3 France | England 1974-12-25
As you can see, multiple entries for the same attribute (address) are pasted together. This works fine for Patient 2, but for Patient 3 you can see a problem with the number of entries that are displayed. The original Patient resource had three (inclomplete) address
entries, but because the first two of them use complementary elements (use
and city
vs. type
and country
), the resulting pasted entries look like there had just been two entries for the address
attribute.
You can counter this problem with the add_indices
argument and customize the appearance of the indices with brackets
:
design2 <- list(
Patients = list(".//Patient")
)
df2 <- fhir_crack(bundle_list, design1, sep = " ", add_indices = T, brackets = c("<", ">") )
#>
#> Patients
#> 1...
#> FHIR-Resources cracked.
df2$Patients
#> id.value address.use.value address.city.value
#> 1 <1>id1 <1.1>home <1.1>Amsterdam
#> 2 <1>id2 <1.1>home <2.1>work <1.1>Rome <2.1>Stockholm
#> 3 <1>id3 <1.1>home <3.1>work <1.1>Berlin <3.1>London
#> address.type.value address.country.value birthDate.value
#> 1 <1.1>physical <1.1>Netherlands <1>1992-02-06
#> 2 <1.1>physical <2.1>postal <1.1>Italy <2.1>Sweden <1>1980-05-23
#> 3 <2.1>postal <3.1>postal <2.1>France <3.1>England <1>1974-12-25
Now the indices display the entry the value belongs to. That way you can see that Patient resource 3 had three entries for the attribute address
and you can also see which attributes belong to which entry.
Of course this is a very specific that only occurs if your resources have multiple entries with complementary elements. In the vast majority of cases multiple entries in one resource will look identical, thus making numbering of those entries superfluous.
Since fhir_crack()
discards of all the data not specified in design
it makes sense to store the original search result for reproducibility and in case you realise later on that you need elements from the resources that you haven’t extracted at first.
There are two ways of saving the FHIR bundles you downloaded: Either you save them as R objects, or you write them to an xml file.
If you want to save the list of downloaded bundles as an .rda
or .RData
file, you cannot just R’s save()
or save_image()
on it, because this will break the external pointers in the xml objects representing your bundles. Instead, you have to serialize the bundles before saving and unserialize them after loading. For single xml objects the package xml2
proved serialization functions. For convenience, however, fhircrackr
provides the functions fhir_serialize()
and fhir_unserialize()
that you can use directly on the list of bundles returned by fhir_search()
:
#serialize bundles
serialized_bundles <- fhir_serialize(patient_bundles)
#have a look at them
head(serialized_bundles[[1]])
#> [1] 58 0a 00 00 00 03
#save
save(serialized_bundles, file="bundles.rda")
If you load this bundle again, you have to unserialize it, before you can work with it:
#load bundles
load("bundles.rda")
#unserialize
bundles <- fhir_unserialize(serialized_bundles)
#have a look
head(bundles[[1]])
#> $node
#> <pointer: 0x5644af907640>
#>
#> $doc
#> <pointer: 0x5644b05fb750>
After unserialization, the pointers are restored and you can continue to work with the bundles. Note that the example bundle medication_bundles
that is provided with the fhircrackr
package is also provided in its serialized form and has to be unserialized as described on its help page.
If you want to store the bundles in xml files instead of R objects, you can use the functions fhir_save()
and fhir_load()
. fhir_save()
takes a list of bundles in form of xml objects (as returned by fhir_search()
) and writes them into the directory specified in the argument directory
. Each bundle is saved as a seperate xml-file. If the folder defined in directory
doesn’t exist, it is created in the current working directory.
To read bundles saved with fhir_save()
back into R, you can use fhir_load()
:
fhir_load()
takes the name of the directory (or path to it) as its only argument. All xml-files in this directory will be read into R and returned as a list of bundles in xml format just as returned by fhir_search()
.
The capability statement documents a set of capabilities (behaviors) of a FHIR Server for a particular version of FHIR. You can download this statement using the function fhir_capability_statement()
:
cap <- fhir_capability_statement("http://hapi.fhir.org/baseR4/")
#>
#> Download completed. All available bundles were downloaded.
#> FHIR-Resources cracked.
fhir_capability_statement()
takes a FHIR server endpoint and returns a list of data frames containing all information from the capability statement of this server.
You can then access the parts that interest you, for example:
While we recommend extracting exactly one data frame per resource, it is technically possible to choose a different level per data frame:
design <- list(
MedCodes=list(".//medicationCodeableConcept/coding")
)
df <- fhir_crack(medication_bundles, design)
#>
#> MedCodes
#> 1....................
#> 2....................
#> 3....................
#> FHIR-Resources cracked.
head(df$MedCodes)
#> system.value code.value display.value
#> 1 http://snomed.info/ct 429374003 simvastatin 40mg
#> 2 http://snomed.info/ct 429374003 simvastatin 40mg
#> 3 http://snomed.info/ct 429374003 simvastatin 40mg
#> 4 http://snomed.info/ct 429374003 simvastatin 40mg
#> 5 http://snomed.info/ct 429374003 simvastatin 40mg
#> 6 http://snomed.info/ct 429374003 simvastatin 40mg
The above example shows that instead of the MedicationStatement resource, we can choose the MedicationCodeableConcept as the root level for our extraction. This can be useful to get a quick and relatively clean overview over the types of codes used on this level of the resource. It is however important to note that this mode of extraction makes it impossible to recognise if each row belongs to one ressource of if several of these rows came from the same ressource. This of course also means that you cannot link this information to data from other ressources because this extraction mode discards of that information.