The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.

Parsing Europe PMC FTP files

Chris Stubben

August 26, 2024

The Europe PMC FTP includes 2.5 million open access articles separated into files with 10K articles each. Download and unzip a recent series of PMC ids and load into R using the readr package. A sample file with the first 10 articles is included in the tidypmc package.

library(readr)
pmcfile <- system.file("extdata/PMC6358576_PMC6358589.xml", package = "tidypmc")
pmc <- read_lines(pmcfile)

Find the start of the article nodes.

a1 <- grep("^<article ", pmc)
head(a1)
#  [1]  2 30 38 52 62 69
n <- length(a1)
n
#  [1] 10

Read a single article by collapsing the lines into a new line separated string.

library(xml2)
x1 <- paste(pmc[2:29], collapse="\n")
doc <- read_xml(x1)
doc
#  {xml_document}
#  <article article-type="case-report" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML">
#  [1] <front>\n  <journal-meta>\n    <journal-id journal-id-type="nlm-ta">ACG Case Rep J</journal-i ...
#  [2] <body>\n  <sec sec-type="intro" id="sec1">\n    <title>Introduction</title>\n    <p>Bezoars a ...
#  [3] <back>\n  <ref-list>\n    <title>References</title>\n    <ref id="B1">\n      <label>1.</labe ...

Loop through the articles and save the metadata and text below. All 10K articles takes about 10 minutes to run on a Mac laptop and returns 1.7M sentences.

library(tidypmc)
a1 <- c(a1, length(pmc))
met1 <- vector("list", n)
txt1 <- vector("list", n)
for(i in seq_len(n)){
  doc <- read_xml(paste(pmc[a1[i]:(a1[i+1]-1)], collapse="\n"))
  m1 <- pmc_metadata(doc)
  id <- m1$PMCID
  message("Parsing ", i, ". ", id)
  met1[[i]] <- m1
  txt1[[i]] <- pmc_text(doc)
}
#  Parsing 1. PMC6358576
#  Parsing 2. PMC6358577
#  Parsing 3. PMC6358578
#  Parsing 4. PMC6358579
#  Parsing 5. PMC6358580
#  Parsing 6. PMC6358581
#  Parsing 7. PMC6358585
#  Note: removing table-wrap nested in sec/p tag
#  Note: removing fig nested in sec/p tag
#  Parsing 8. PMC6358587
#  Note: removing table-wrap nested in sec/p tag
#  Note: removing fig nested in sec/p tag
#  Parsing 9. PMC6358588
#  Note: removing fig nested in sec/p tag
#  Parsing 10. PMC6358589
#  Note: removing table-wrap nested in sec/p tag
#  Note: removing fig nested in sec/p tag

Combine the list of metadata and text into tables.

library(dplyr)
met <- bind_rows(met1)
names(txt1) <- met$PMCID
txt <- bind_rows(txt1, .id="PMCID")
met
#  # A tibble: 10 × 13
#     PMCID Title Authors  Year Journal Volume Pages `Published online` `Date received` DOI   Publisher
#     <chr> <chr> <chr>   <int> <chr>   <chr>  <chr> <chr>              <chr>           <chr> <chr>    
#   1 PMC6… Endo… Dana B…  2018 ACG Ca… 5      e87   2018-12-5          2018-7-8        10.1… American…
#   2 PMC6… Chro… Scott …  2018 ACG Ca… 5      e94   2018-12-5          2018-5-5        10.1… American…
#   3 PMC6… Bile… Steffi…  2018 ACG Ca… 5      e88   2018-12-5          2018-5-7        10.1… American…
#   4 PMC6… New … Gordon…  2018 ACG Ca… 5      e92   2018-12-5          2018-3-3        10.1… American…
#   5 PMC6… Bile… Michae…  2018 ACG Ca… 5      e89   2018-12-5          2017-11-3       10.1… American…
#   6 PMC6… Fuso… Akshay…  2018 ACG Ca… 5      e99   2018-12-19         2018-3-8        10.1… American…
#   7 PMC6… Chor… Marcia…  2019 Genes … 20     56-68 2018-1-24          2017-9-1        10.1… Nature P…
#   8 PMC6… The … Tao Zh…  2019 Spinal… 57     141-… 2018-8-8           2017-12-19      10.1… Nature P…
#   9 PMC6… Natu… Marjol…  2019 Molecu… 20     115-… 2018-12-16         2018-10-22      10.1… Elsevier 
#  10 PMC6… Pred… Yury O…  2019 Molecu… 20     63-78 2018-11-16         2018-9-10       10.1… Elsevier 
#  # ℹ 2 more variables: Issue <chr>, PMID <chr>
txt
#  # A tibble: 1,073 × 5
#     PMCID      section      paragraph sentence text                                                  
#     <chr>      <chr>            <int>    <int> <chr>                                                 
#   1 PMC6358576 Title                1        1 Endoscopic versus Surgical Intervention for Jejunal B…
#   2 PMC6358576 Abstract             1        1 Bezoar-induced small bowel obstruction is a rare enti…
#   3 PMC6358576 Abstract             1        2 The cornerstone of treatment for intestinal bezoars h…
#   4 PMC6358576 Abstract             1        3 We present a patient with obstructive jejunal phytobe…
#   5 PMC6358576 Introduction         1        1 Bezoars are aggregates of undigested foreign material…
#   6 PMC6358576 Introduction         1        2 There are currently four classifications of bezoars: …
#   7 PMC6358576 Introduction         1        3 Endoscopic treatment of bezoars causing intestinal ob…
#   8 PMC6358576 Case Report          1        1 A 60-year old diabetic woman with a past cholecystect…
#   9 PMC6358576 Case Report          1        2 Physical examination revealed mild diffuse abdominal …
#  10 PMC6358576 Case Report          1        3 Computed tomography (CT) of the abdomen and pelvis re…
#  # ℹ 1,063 more rows

These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.