From R to RDF

From tidy data to RDF triples

This vignette demonstrates how to convert tidy R datasets into semantically enriched RDF triple structures, using the dataset and rdflib packages. These packages help you annotate variables with machine-readable concepts, units, and links to controlled vocabularies.

We’ll start with a small example of a tidy dataset representing countries (geo) with unique identifiers (rowid) and then show how to transform the dataset into RDF triples using standard vocabularies.

library(dataset)
library(rdflib)
data("gdp")

Creating a minimal semantically defined dataset

small_geo <- dataset_df(
  geo = defined(
    gdp$geo[1:3],
    label = "Geopolitical entity",
    concept = "http://purl.org/linked-data/sdmx/2009/dimension#refArea",
    namespace = "https://www.geonames.org/countries/$1/"
  ),
  identifier = c(
    obs = "https://dataset.dataobservatory.eu/examples/dataset.html#"
  )
)

The dataset has no creator or author, but the rows have identifiers that can be resolved with https://dataset.dataobservatory.eu/examples/dataset.html#. In real publishing scenarios, you would replace these with persistent URIs that identify actual datasets and their observations. For example, a DOI-based identifier such as:

https://doi.org/10.5281/zenodo.14917851#obs:1

So let’s see how this minimal dataset prints in R:

print(small_geo)
#> Unknown (2025): Untitled Dataset [dataset]
#>   rowid     geo       
#>   <defined> <defined>
#> 1 obs1      AD       
#> 2 obs2      AD       
#> 3 obs3      AD

A tidy dataset can always be pivoted to a three-column long format, in which every cell value of the tabular dataset is described by a subject-predicate-object triple.

triples_df <- dataset_to_triples(small_geo)
knitr::kable(triples_df)
|s |p |o |
|:--|:--|:--|
|https://dataset.dataobservatory.eu/examples/dataset.html#obs1 |http://purl.org/linked-data/sdmx/2009/dimension#refArea |https://www.geonames.org/countries/AD/ |
|https://dataset.dataobservatory.eu/examples/dataset.html#obs2 |http://purl.org/linked-data/sdmx/2009/dimension#refArea |https://www.geonames.org/countries/AD/ |
|https://dataset.dataobservatory.eu/examples/dataset.html#obs3 |http://purl.org/linked-data/sdmx/2009/dimension#refArea |https://www.geonames.org/countries/AD/ |

With format = "nt", the same triples are serialised as N-Triples:

ntriples <- dataset_to_triples(small_geo, format = "nt")
cat(ntriples, sep = "\n")
#> <https://dataset.dataobservatory.eu/examples/dataset.html#obs1> <http://purl.org/linked-data/sdmx/2009/dimension#refArea> <https://www.geonames.org/countries/AD/> .
#> <https://dataset.dataobservatory.eu/examples/dataset.html#obs2> <http://purl.org/linked-data/sdmx/2009/dimension#refArea> <https://www.geonames.org/countries/AD/> .
#> <https://dataset.dataobservatory.eu/examples/dataset.html#obs3> <http://purl.org/linked-data/sdmx/2009/dimension#refArea> <https://www.geonames.org/countries/AD/> .

Each row of your dataset becomes a subject, each variable a predicate, and each value either a URI or a typed literal (like a date or number), depending on how it is defined. The first statement in the example defines the intersection of the first row (the observation identified by the rowid, dataset.html#obs1) and the reference area column (defined by the refArea URI) as Andorra. The advantage of this approach is that the row and column definitions, as well as the coded cell values, have a permanent metadata definition.
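
The row-to-subject and column-to-predicate mapping described above can be sketched in base R. This is only an illustration of the idea, not how dataset_to_triples() is implemented: the helper wide_to_triples(), the toy data frame, and the https://example.org/ IRIs are all hypothetical.

```r
# Hypothetical helper (not part of the dataset package): pivot a wide
# data frame into subject-predicate-object rows.
wide_to_triples <- function(df, id_col, base_iri, predicates) {
  value_cols <- setdiff(names(df), id_col)
  do.call(rbind, lapply(value_cols, function(col) {
    data.frame(
      s = paste0(base_iri, df[[id_col]]), # row identifier becomes the subject
      p = predicates[[col]],              # column definition becomes the predicate
      o = as.character(df[[col]]),        # cell value becomes the object
      stringsAsFactors = FALSE
    )
  }))
}

# Toy data with placeholder values
toy <- data.frame(
  rowid = c("obs1", "obs2"),
  geo = c("AD", "LI"),
  year = c(2020L, 2020L)
)

triples <- wide_to_triples(
  toy,
  id_col = "rowid",
  base_iri = "https://example.org/dataset#",
  predicates = c(
    geo = "https://example.org/prop/geo",
    year = "https://example.org/prop/year"
  )
)
print(triples)
```

Two rows and two value columns give four triples, one per cell. The sketch keeps every object as a plain string; deciding whether an object is a URI or a typed literal is the part that the defined() metadata handles in the real workflow.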

RDF triples enable interoperability

The Resource Description Framework (RDF) represents data as subject–predicate–object triples. This allows your dataset to be machine-readable, linkable to external vocabularies, and queryable via SPARQL.

n_triple(
  s = "https://dataset.dataobservatory.eu/examples/dataset.html#obs1",
  p = "http://purl.org/dc/terms/title",
  o = "Small Country Dataset"
)
#> [1] "<https://dataset.dataobservatory.eu/examples/dataset.html#obs1> <http://purl.org/dc/terms/title> \"Small Country Dataset\"^^<http://www.w3.org/2001/XMLSchema#string> ."
# Write the N-Triples created earlier to a temporary file
temp_file <- tempfile(fileext = ".nt")
writeLines(ntriples, con = temp_file)

rdf_graph <- rdf()
rdf_parse(rdf_graph, doc = temp_file, format = "ntriples")
#> Total of 3 triples, stored in hashes
#> -------------------------------
#> <https://dataset.dataobservatory.eu/examples/dataset.html#obs1> <http://purl.org/linked-data/sdmx/2009/dimension#refArea> <https://www.geonames.org/countries/AD/> .
#> <https://dataset.dataobservatory.eu/examples/dataset.html#obs2> <http://purl.org/linked-data/sdmx/2009/dimension#refArea> <https://www.geonames.org/countries/AD/> .
#> <https://dataset.dataobservatory.eu/examples/dataset.html#obs3> <http://purl.org/linked-data/sdmx/2009/dimension#refArea> <https://www.geonames.org/countries/AD/> .
rdf_graph
#> Total of 3 triples, stored in hashes
#> -------------------------------
#> <https://dataset.dataobservatory.eu/examples/dataset.html#obs1> <http://purl.org/linked-data/sdmx/2009/dimension#refArea> <https://www.geonames.org/countries/AD/> .
#> <https://dataset.dataobservatory.eu/examples/dataset.html#obs2> <http://purl.org/linked-data/sdmx/2009/dimension#refArea> <https://www.geonames.org/countries/AD/> .
#> <https://dataset.dataobservatory.eu/examples/dataset.html#obs3> <http://purl.org/linked-data/sdmx/2009/dimension#refArea> <https://www.geonames.org/countries/AD/> .
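
Once triples are loaded into a graph, they can be queried with SPARQL. The following is a minimal sketch, assuming the rdflib package is installed. It builds its own two-statement graph with rdf_add() instead of reusing rdf_graph, and the https://example.org/ subject IRIs are placeholders chosen for illustration.

```r
library(rdflib)

# Build a small illustrative graph; the example.org IRIs are placeholders
demo_graph <- rdf()
ref_area <- "http://purl.org/linked-data/sdmx/2009/dimension#refArea"

rdf_add(demo_graph,
  subject = "https://example.org/dataset#obs1",
  predicate = ref_area,
  object = "https://www.geonames.org/countries/AD/",
  subjectType = "uri", objectType = "uri"
)
rdf_add(demo_graph,
  subject = "https://example.org/dataset#obs2",
  predicate = ref_area,
  object = "https://www.geonames.org/countries/LI/",
  subjectType = "uri", objectType = "uri"
)

# Select every observation together with its reference area
sparql <- paste(
  "SELECT ?obs ?area",
  "WHERE { ?obs <http://purl.org/linked-data/sdmx/2009/dimension#refArea> ?area . }"
)
areas <- rdf_query(demo_graph, sparql)
print(areas)

# Release the graph when it is no longer needed
rdf_free(demo_graph)
```

The result is a data frame with one column per SPARQL variable, so it can be passed on to ordinary R code.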

A simple, serverless scaffolding for publishing dataset_df objects on the web (with HTML + RDF exports) is available at https://github.com/dataobservatory-eu/dataset-template, which uses the example from this vignette.

Clean up

It is good practice to close connections and to remove large objects from memory:

# Clean up: delete file and clear RDF graph
unlink(temp_file)
rm(rdf_graph)
gc()
#>           used (Mb) gc trigger  (Mb) max used (Mb)
#> Ncells  983786 52.6    1913322 102.2  1444067 77.2
#> Vcells 1763779 13.5    8388608  64.0  3137445 24.0

Scale up

We build a slightly bigger graph, save it, and reload it.

small_country_dataset <- dataset_df(
  geo = defined(
    gdp$geo,
    label = "Country name",
    concept = "http://dd.eionet.europa.eu/vocabulary/eurostat/geo/",
    namespace = "https://www.geonames.org/countries/$1/"
  ),
  year = defined(
    gdp$year,
    label = "Reference Period (Year)",
    concept = "http://purl.org/linked-data/sdmx/2009/dimension#refPeriod"
  ),
  gdp = defined(
    gdp$gdp,
    label = "Gross Domestic Product",
    unit = "https://dd.eionet.europa.eu/vocabularyconcept/eurostat/unit/CP_MEUR",
    concept = "http://data.europa.eu/83i/aa/GDP"
  ),
  unit = gdp$unit,
  freq = defined(
    gdp$freq,
    label = "Frequency",
    concept = "http://purl.org/linked-data/sdmx/2009/code"
  ),
  identifier = c(
    obs = "https://dataset.dataobservatory.eu/examples/dataset.html#"
  ),
  dataset_bibentry = dublincore(
    title = "Small Country Dataset",
    creator = person("Jane", "Doe"),
    publisher = "Example Inc.",
    datasource = "https://doi.org/10.2908/NAIDA_10_GDP",
    rights = "CC-BY",
    coverage = "Andorra, Liechtenstein and San Marino"
  )
)
small_country_df_nt <- dataset_to_triples(
  small_country_dataset,
  format = "nt"
)

Elements 1, 11, 21, 31 and 41 of the serialised vector hold the five statements about the first observation:

## See elements 1, 11, 21, 31, 41
small_country_df_nt[c(1, 11, 21, 31, 41)]
#> [1] "<https://dataset.dataobservatory.eu/examples/dataset.html#obs1> <http://dd.eionet.europa.eu/vocabulary/eurostat/geo/> <https://www.geonames.org/countries/AD/> ."
#> [2] "<https://dataset.dataobservatory.eu/examples/dataset.html#obs1> <http://purl.org/linked-data/sdmx/2009/dimension#refPeriod> \"2020\"^^<xsd:integer> ."           
#> [3] "<https://dataset.dataobservatory.eu/examples/dataset.html#obs1> <http://data.europa.eu/83i/aa/GDP> \"2354.8\"^^<xsd:decimal> ."                                  
#> [4] "<https://dataset.dataobservatory.eu/examples/dataset.html#obs1> <http://example.com/prop/unit> \"CP_MEUR\"^^<xsd:string> ."                                      
#> [5] "<https://dataset.dataobservatory.eu/examples/dataset.html#obs1> <http://purl.org/linked-data/sdmx/2009/code> \"A\"^^<xsd:string> ."

The statements about Observation 1, i.e. Andorra’s national economy in 2020, are not serialised consecutively in the text file. This is not necessary, because each cell is precisely connected to its row (the first part of the triple) and its column (the second part of the triple). In effect, the entire mapping back to the original dataset is embedded in the flat text file, so it can easily be imported into a database.
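
Because every statement carries its own subject, the statements about one observation can be collected again no matter where they sit in the file. A minimal base R sketch, using a short hypothetical vector of N-Triples lines with placeholder https://example.org/ IRIs instead of small_country_df_nt:

```r
# Hypothetical N-Triples lines; the two obs1 statements are not adjacent
nt_lines <- c(
  "<https://example.org/dataset#obs1> <https://example.org/prop/geo> <https://www.geonames.org/countries/AD/> .",
  "<https://example.org/dataset#obs2> <https://example.org/prop/geo> <https://www.geonames.org/countries/LI/> .",
  "<https://example.org/dataset#obs1> <https://example.org/prop/year> \"2020\"^^<http://www.w3.org/2001/XMLSchema#integer> ."
)

# The closing angle bracket keeps obs1 from also matching obs10, obs11, ...
subject <- "<https://example.org/dataset#obs1>"
obs1_statements <- nt_lines[startsWith(nt_lines, subject)]

cat(obs1_statements, sep = "\n")
```

The same filter could be applied to small_country_df_nt with the subject IRI used in this vignette.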

Note: The .html# in these example IRIs does not mean the resource is an HTML file.
Any absolute IRI is valid in RDF. This form is used here only for illustration;
in practice, a bare namespace such as /dataset# is more conventional.

# Write the N-Triples created earlier to a temporary file
temp_file <- tempfile(fileext = ".nt")
writeLines(small_country_df_nt,
  con = temp_file
)

rdf_graph <- rdf()
rdf_parse(rdf_graph, doc = temp_file, format = "ntriples")
#> Total of 50 triples, stored in hashes
#> -------------------------------
#> <https://dataset.dataobservatory.eu/examples/dataset.html#obs8> <http://example.com/prop/unit> "CP_MEUR"^^<xsd:string> .
#> <https://dataset.dataobservatory.eu/examples/dataset.html#obs8> <http://dd.eionet.europa.eu/vocabulary/eurostat/geo/> <https://www.geonames.org/countries/SM/> .
#> <https://dataset.dataobservatory.eu/examples/dataset.html#obs6> <http://dd.eionet.europa.eu/vocabulary/eurostat/geo/> <https://www.geonames.org/countries/LI/> .
#> <https://dataset.dataobservatory.eu/examples/dataset.html#obs7> <http://dd.eionet.europa.eu/vocabulary/eurostat/geo/> <https://www.geonames.org/countries/LI/> .
#> <https://dataset.dataobservatory.eu/examples/dataset.html#obs6> <http://purl.org/linked-data/sdmx/2009/code> "A"^^<xsd:string> .
#> <https://dataset.dataobservatory.eu/examples/dataset.html#obs10> <http://purl.org/linked-data/sdmx/2009/code> "A"^^<xsd:string> .
#> <https://dataset.dataobservatory.eu/examples/dataset.html#obs7> <http://purl.org/linked-data/sdmx/2009/code> "A"^^<xsd:string> .
#> <https://dataset.dataobservatory.eu/examples/dataset.html#obs4> <http://data.europa.eu/83i/aa/GDP> "3119.5"^^<xsd:decimal> .
#> <https://dataset.dataobservatory.eu/examples/dataset.html#obs3> <http://example.com/prop/unit> "CP_MEUR"^^<xsd:string> .
#> <https://dataset.dataobservatory.eu/examples/dataset.html#obs2> <http://example.com/prop/unit> "CP_MEUR"^^<xsd:string> .
#> 
#> ... with 40 more triples
rdf_graph
#> Total of 50 triples, stored in hashes
#> -------------------------------
#> <https://dataset.dataobservatory.eu/examples/dataset.html#obs8> <http://example.com/prop/unit> "CP_MEUR"^^<xsd:string> .
#> <https://dataset.dataobservatory.eu/examples/dataset.html#obs8> <http://dd.eionet.europa.eu/vocabulary/eurostat/geo/> <https://www.geonames.org/countries/SM/> .
#> <https://dataset.dataobservatory.eu/examples/dataset.html#obs6> <http://dd.eionet.europa.eu/vocabulary/eurostat/geo/> <https://www.geonames.org/countries/LI/> .
#> <https://dataset.dataobservatory.eu/examples/dataset.html#obs7> <http://dd.eionet.europa.eu/vocabulary/eurostat/geo/> <https://www.geonames.org/countries/LI/> .
#> <https://dataset.dataobservatory.eu/examples/dataset.html#obs6> <http://purl.org/linked-data/sdmx/2009/code> "A"^^<xsd:string> .
#> <https://dataset.dataobservatory.eu/examples/dataset.html#obs10> <http://purl.org/linked-data/sdmx/2009/code> "A"^^<xsd:string> .
#> <https://dataset.dataobservatory.eu/examples/dataset.html#obs7> <http://purl.org/linked-data/sdmx/2009/code> "A"^^<xsd:string> .
#> <https://dataset.dataobservatory.eu/examples/dataset.html#obs4> <http://data.europa.eu/83i/aa/GDP> "3119.5"^^<xsd:decimal> .
#> <https://dataset.dataobservatory.eu/examples/dataset.html#obs3> <http://example.com/prop/unit> "CP_MEUR"^^<xsd:string> .
#> <https://dataset.dataobservatory.eu/examples/dataset.html#obs2> <http://example.com/prop/unit> "CP_MEUR"^^<xsd:string> .
#> 
#> ... with 40 more triples

Your dataset is now ready to be exported in a form that supports FAIR data publishing, for example as JSON-LD:

# Create temporary JSON-LD output file
jsonld_file <- tempfile(fileext = ".jsonld")

# Serialize (export) the entire graph to JSON-LD format
rdf_serialize(rdf_graph, doc = jsonld_file, format = "jsonld")

Read it back into R for display (only the first 30 lines are shown):

cat(readLines(jsonld_file)[1:30], sep = "\n")
#> {
#>   "@graph": [
#>     {
#>       "@id": "https://dataset.dataobservatory.eu/examples/dataset.html#obs1",
#>       "http://data.europa.eu/83i/aa/GDP": {
#>         "@type": "xsd:decimal",
#>         "@value": "2354.8"
#>       },
#>       "http://dd.eionet.europa.eu/vocabulary/eurostat/geo/": {
#>         "@id": "https://www.geonames.org/countries/AD/"
#>       },
#>       "http://example.com/prop/unit": {
#>         "@type": "xsd:string",
#>         "@value": "CP_MEUR"
#>       },
#>       "http://purl.org/linked-data/sdmx/2009/code": {
#>         "@type": "xsd:string",
#>         "@value": "A"
#>       },
#>       "http://purl.org/linked-data/sdmx/2009/dimension#refPeriod": {
#>         "@type": "xsd:integer",
#>         "@value": "2020"
#>       }
#>     },
#>     {
#>       "@id": "https://dataset.dataobservatory.eu/examples/dataset.html#obs10",
#>       "http://data.europa.eu/83i/aa/GDP": {
#>         "@type": "xsd:decimal",
#>         "@value": "1612.3"
#>       },
