NHSDataDictionaRy - a package for accessing the NHS Data Dictionary with web scraping and other useful functions

Context

This package has been commissioned by the NHS-R community and is intended to be used to web scrape the NHS Data Dictionary website for useful lookup tables. The package is maintained by Gary Hutson, Head of Advanced Analytics at Arden and GEM Commissioning Support Unit. To contact the maintainer directly you can navigate to this site.

Additionally, the package has been developed with generic web scraping functionality to allow other websites containing data tables and elements to be scraped.

Loading the package

To load the package, you can use the commands below (dplyr is loaded as well because the later examples use it):

library(NHSDataDictionaRy)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

This brings in the functions needed to work with the package. The subsections below show how to use the package as intended.

Text manipulation of the tibble

The NHSDataDictionaRy package provides four convenience functions, modelled on Microsoft Excel's text functions, for working with text data. These are left_xl, right_xl, mid_xl and len_xl.

I will demonstrate how these can be used on nhs_tibble, the tibble of nhs_data_elements extracted from the NHS Data Dictionary website, in the following subsections.

left_xl() function

The left_xl function expects two parameters: the first is the text to work with and the second is the number of characters to keep, counting from the left:

#Grab a sub set of the data frame
df <- nhs_tibble[10,]
result <- NHSDataDictionaRy::left_xl(df$link_name, 22)
print(result)
#> [1] "ACCESSIBLE INFORMATION"
class(result)
#> [1] "character"

right_xl() function

This works the same way as left_xl, but keeps the requested number of characters counting from the right of the text inward:

#Grab a sub set of the data frame
df <- nhs_tibble[10,]
result <- NHSDataDictionaRy::right_xl(df$link_name, 23)
print(result)
#> [1] "FORMAT CODE (SNOMED CT)"
class(result)
#> [1] "character"

mid_xl() function

This function takes a slightly different approach and expects three input parameters: the first is the text to extract from, the second is the position at which to start, and the third is the number of characters to return from that position. In the example, starting at character 12 and taking 20 characters returns "INFORMATION SPECIFIC":

#Grab a sub set of the data frame
df <- nhs_tibble[10,]
#Original string
original <- df$link_name
result <- NHSDataDictionaRy::mid_xl(original, 12, 20)
print(original); print(result)
#> [1] "ACCESSIBLE INFORMATION SPECIFIC INFORMATION FORMAT CODE (SNOMED CT)"
#> [1] "INFORMATION SPECIFIC"
class(result)
#> [1] "character"

len_xl() function

This is a simple but useful function, as it returns the number of characters in the string:

#Grab a sub set of the data frame
df <- nhs_tibble[10,]
#Original string
original <- df$link_name
string_length <- NHSDataDictionaRy::len_xl(original)
print(string_length)
#> [1] 67
class(string_length)
#> [1] "integer"
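
Judging by the outputs shown above, the four helpers behave like Excel's LEFT, RIGHT, MID and LEN. The block below is a minimal base R sketch of that behaviour, not the package's own implementation: the `*_sketch` names are hypothetical and the string is copied from the output above so that the block runs without the scraped tibble.

```r
# Base R sketches of Excel-style text helpers.
# The *_sketch names are hypothetical and are not part of NHSDataDictionaRy.
left_sketch <- function(text, num_chars) {
  # Keep the first num_chars characters
  substr(text, 1, num_chars)
}

right_sketch <- function(text, num_chars) {
  # Keep the last num_chars characters
  n <- nchar(text)
  substr(text, n - num_chars + 1, n)
}

mid_sketch <- function(text, start_num, num_chars) {
  # Keep num_chars characters, starting at position start_num
  substr(text, start_num, start_num + num_chars - 1)
}

# String copied from the link_name output shown above
original <- "ACCESSIBLE INFORMATION SPECIFIC INFORMATION FORMAT CODE (SNOMED CT)"

left_sketch(original, 22)
#> [1] "ACCESSIBLE INFORMATION"
right_sketch(original, 23)
#> [1] "FORMAT CODE (SNOMED CT)"
mid_sketch(original, 12, 20)
#> [1] "INFORMATION SPECIFIC"
nchar(original)
#> [1] 67
```

Because substr and nchar are vectorised, these sketches also accept a whole character column rather than a single value.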

Working with the NHS R Data Dictionary lookup

This package provides functionality for working with the nhs_data_elements extracted from the NHS Data Dictionary website. The two main functions for extracting elements are the tableR function and the xpathTextR function. Both can work with the returned tibble to extract useful lookups.

tableR function (utilising scrapeR function)

The scrapeR function is the workhorse, and tableR wraps its results in a tidy tibble. The example shows how to filter the returned tibble down to one data element and pass its URL and xpath to tableR, which scrapes the matching table into a tibble that can be used as a lookup:

# Filter by a specific lookup required
reduced_tibble <- 
  dplyr::filter(nhs_tibble, link_name == "ACTIVITY TREATMENT FUNCTION CODE")

#Use the tableR function to query the NHS Data Dictionary website and return the associated tibble

treatment_function_lookup <- NHSDataDictionaRy::tableR(url=reduced_tibble$full_url,
                          xpath = reduced_tibble$xpath_nat_code, 
                          title = "NHS Hospital Activity Treatment Function Codes")

treatment_function_meta <- NHSDataDictionaRy::tableR(url=reduced_tibble$full_url,
                                                     xpath=reduced_tibble$xpath_also_known,
                                                     title = "Activity Treatment Function Code Meta")

# The query has returned results; if the url does not have a lookup table an error will be thrown

print(head(treatment_function_lookup,10))
#> # A tibble: 10 x 4
#>    Code  Description                Dict_Type                DttmExtracted      
#>    <chr> <chr>                      <chr>                    <dttm>             
#>  1 100   General Surgery Service    NHS Hospital Activity T~ 2021-02-01 17:28:01
#>  2 101   Urology Service            NHS Hospital Activity T~ 2021-02-01 17:28:01
#>  3 102   Transplant Surgery Service NHS Hospital Activity T~ 2021-02-01 17:28:01
#>  4 103   Breast Surgery Service     NHS Hospital Activity T~ 2021-02-01 17:28:01
#>  5 104   Colorectal Surgery Service NHS Hospital Activity T~ 2021-02-01 17:28:01
#>  6 105   Hepatobiliary and Pancrea~ NHS Hospital Activity T~ 2021-02-01 17:28:01
#>  7 106   Upper Gastrointestinal Su~ NHS Hospital Activity T~ 2021-02-01 17:28:01
#>  8 107   Vascular Surgery Service   NHS Hospital Activity T~ 2021-02-01 17:28:01
#>  9 108   Spinal Surgery Service     NHS Hospital Activity T~ 2021-02-01 17:28:01
#> 10 109   Bariatric Surgery Service  NHS Hospital Activity T~ 2021-02-01 17:28:01
print(treatment_function_meta)
#> # A tibble: 2 x 3
#>   `apply(simple_lookup_table_scraped, 2~ Dict_Type           DttmExtracted      
#>   <chr>                                  <chr>               <dttm>             
#> 1 Plural                                 Activity Treatment~ 2021-02-01 17:28:01
#> 2 ACTIVITY TREATMENT FUNCTION CODES      Activity Treatment~ 2021-02-01 17:28:01

Not all lookups will have associated national code tables. If no table is returned, you will receive a message saying the lookup table is not available for this NHS Data Dictionary type.
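
When scraping several data elements in a loop, one missing table should not stop the whole run. The block below is a minimal sketch of one way to handle that with tryCatch. The helper name safe_lookup and the two stand-in fetch functions are hypothetical; they are used here only so the example runs without network access.

```r
# Hypothetical helper: run a scraping call and return NULL instead of
# stopping when the call raises an error.
safe_lookup <- function(fetch_fun, ...) {
  tryCatch(
    fetch_fun(...),
    error = function(e) {
      message("Lookup not available: ", conditionMessage(e))
      NULL
    }
  )
}

# Stand-in functions (assumptions) that mimic a successful and a failed scrape
working_fetch <- function(url) {
  data.frame(Code = "100", Description = "General Surgery Service")
}
failing_fetch <- function(url) {
  stop("no national code table at ", url)
}

found <- safe_lookup(working_fetch, url = "https://example.org/has-table")
missing <- safe_lookup(failing_fetch, url = "https://example.org/no-table")

# A NULL result marks a data element without a lookup table
is.null(found)
is.null(missing)
```

In practice fetch_fun could be NHSDataDictionaRy::tableR, with url, xpath and title supplied as named arguments in the same way as the calls above.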

Using my lookup with NHS data

Some lookups are needed again and again, and the mapping from specialty code to specialty description is one of them. I will use a made-up data frame to illustrate the use case for these lookups and the value of keeping them up to date:


act_aggregations <- tibble(SpecCode = as.character(c(101,102,103, 104, 105)),
                             ActivityCounts = round(rnorm(5,250,3),0), 
                             Month = rep("May", 5))

# Use dplyr to join the NHS activity by specialty code

act_aggregations %>% 
  left_join(treatment_function_lookup, by = c("SpecCode"="Code"))
#> # A tibble: 5 x 6
#>   SpecCode ActivityCounts Month Description     Dict_Type    DttmExtracted      
#>   <chr>             <dbl> <chr> <chr>           <chr>        <dttm>             
#> 1 101                 251 May   Urology Service NHS Hospita~ 2021-02-01 17:28:01
#> 2 102                 249 May   Transplant Sur~ NHS Hospita~ 2021-02-01 17:28:01
#> 3 103                 249 May   Breast Surgery~ NHS Hospita~ 2021-02-01 17:28:01
#> 4 104                 248 May   Colorectal Sur~ NHS Hospita~ 2021-02-01 17:28:01
#> 5 105                 251 May   Hepatobiliary ~ NHS Hospita~ 2021-02-01 17:28:01
  
# This easily joins the lookup on to your data
  

The benefit of having this in an R package is that you can retrieve the current version of an NHS lookup on demand, rather than relying on a massive data warehouse to capture this information.
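
After a join like the one above, it is worth checking for codes that found no match in the lookup, since a left join keeps those rows with missing descriptions. The block below is a small base R sketch of that check. The activity data is made up, the code "999" is a deliberately invalid value for illustration, and the lookup is a three-row stand-in copied from the output above rather than a live scrape.

```r
# Made-up activity data; "999" is a deliberately invalid code for illustration
activity <- data.frame(
  SpecCode = c("101", "102", "999"),
  ActivityCounts = c(251, 249, 250),
  stringsAsFactors = FALSE
)

# Small stand-in for the scraped lookup (rows copied from the output above)
lookup <- data.frame(
  Code = c("100", "101", "102"),
  Description = c("General Surgery Service",
                  "Urology Service",
                  "Transplant Surgery Service"),
  stringsAsFactors = FALSE
)

# Left join in base R: all.x = TRUE keeps every activity row
joined <- merge(activity, lookup,
                by.x = "SpecCode", by.y = "Code", all.x = TRUE)

# Codes with no match in the lookup carry NA in Description
unmatched <- joined$SpecCode[is.na(joined$Description)]
print(unmatched)
#> [1] "999"
```

With dplyr loaded, anti_join offers the same check in one call.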

xpathTextR function

This function has been provided to return elements from a website other than HTML tables, since the functions above work predominantly with tables. The example shows how this can be implemented; it requires retrieving the xpath via the Inspect command in Google Chrome (CTRL + SHIFT + I):


url <- "https://datadictionary.nhs.uk/data_elements/abbreviated_mental_test_score.html"
xpath_element <- '//*[@id="element_abbreviated_mental_test_score.description"]'

# Run the xpathTextR function to retrieve details of the element retrieved

NHSDataDictionaRy::xpathTextR(url, xpath_element)
#> $result
#> [1] "Description\n  \n  \n  \n    \n      ABBREVIATED MENTAL TEST SCORE\n is the \n              PERSON SCORE\n where the \n              ASSESSMENT TOOL TYPE\n is \n              'Abbreviated Mental Test Score'.        \n    The score is in the range 0 to 10.\n  \n\n"
#> 
#> $website_passed
#> [1] "https://datadictionary.nhs.uk/data_elements/abbreviated_mental_test_score.html"
#> 
#> $xpath_passed
#> [1] "//*[@id=\"element_abbreviated_mental_test_score.description\"]"
#> 
#> $html_node_result
#> {html_document}
#> <html xmlns="http://www.w3.org/1999/xhtml" xmlns:whc="http://www.oxygenxml.com/webhelp/components" xml:lang="en" lang="en" whc:version="21.1">
#> [1] <head>\n<link rel="shortcut icon" href="../oxygen-webhelp%5Ctemplate%5Cre ...
#> [2] <body class="wh_topic_page frmBody">\n        <a href="#wh_topic_body" cl ...
#> 
#> $datetime_access
#> [1] "2021-02-01 17:28:02 GMT"
#> 
#> $person_accessed
#> [1] "GARYH - LAPTOP-GE3S96EI"

The returned list contains the result (the text retrieved live from the website, which would need some cleaning), the website passed to the function, the xpath passed, the result of the node search, the date and time the list was generated, and the person and domain accessing the site.
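
The result text comes back with the page's line breaks and indentation intact. The block below is a minimal sketch of one way to clean it using base R only. The string is copied from the $result element shown above so the example runs offline; in practice it would be taken from the list that xpathTextR returns. The variable names are my own.

```r
# Raw text copied from the $result element shown above
raw_text <- "Description\n  \n  \n  \n    \n      ABBREVIATED MENTAL TEST SCORE\n is the \n              PERSON SCORE\n where the \n              ASSESSMENT TOOL TYPE\n is \n              'Abbreviated Mental Test Score'.        \n    The score is in the range 0 to 10.\n  \n\n"

# Collapse every run of whitespace (spaces and newlines) into a single space
clean_text <- gsub("[[:space:]]+", " ", raw_text)

# Remove the single leading and trailing space left behind
clean_text <- trimws(clean_text)

# Drop the "Description" heading that precedes the definition
clean_text <- sub("^Description ", "", clean_text)

print(clean_text)
#> [1] "ABBREVIATED MENTAL TEST SCORE is the PERSON SCORE where the ASSESSMENT TOOL TYPE is 'Abbreviated Mental Test Score'. The score is in the range 0 to 10."
```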

These functions could be used to scrape not just the NHS Data Dictionary website, but any website.

Wrapping up

There are lots of use cases for this, and I would like to keep iterating on this tool, so please contact me with suggestions of what could be included in future versions.