Why Semantics Matter for R Data Frames


R users love data.frames and tibbles for tidy, rectangular data. But tidy data isn’t always meaningful data. What does a column labelled gdp actually represent? Euros? Millions? Per capita? Current prices? Constant 2010 prices? These questions matter, especially in statistics, open data publishing, and knowledge graph integration.

The dataset_df class extends the familiar data.frame structure with lightweight, semantically meaningful metadata. It helps you preserve the meaning of variables, units, identifiers, and dataset-level context.

From Tidy to Meaningful: An Example

Let’s build a small dataset_df directly, with semantically enriched columns created by defined():

library(dataset)

small_country_dataset <- dataset_df(
  country_name = defined(c("AD", "LI"),
    label = "Country name",
    concept = "http://data.europa.eu/bna/c_6c2bb82d",
    namespace = "https://www.geonames.org/countries/$1/"
  ),
  gdp = defined(c(3897, 7365),
    label = "Gross Domestic Product",
    unit = "million dollars",
    concept = "http://data.europa.eu/83i/aa/GDP"
  ),
  dataset_bibentry = dublincore(
    title = "Small Country Dataset",
    creator = person("Jane", "Doe"),
    publisher = "Example Inc."
  )
)

The defined() vectors attach metadata to each column:

label: a human-readable name
unit: an explicit measurement unit
concept: a URI identifying the concept measured
namespace: for generating full subject URIs when exporting to RDF
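
To make the role of namespace more concrete, here is a minimal base-R sketch of how a template such as "https://www.geonames.org/countries/$1/" could be expanded into full URIs. It assumes that "$1" is a placeholder for the cell value. The helper expand_namespace() is hypothetical and is not part of the dataset package, whose internal expansion may differ.

```r
# Hypothetical helper (not from the dataset package): substitute each value
# for the literal "$1" placeholder in a namespace template.
expand_namespace <- function(values, namespace) {
  vapply(
    values,
    function(value) sub("$1", value, namespace, fixed = TRUE),
    character(1),
    USE.NAMES = FALSE
  )
}

country_codes <- c("AD", "LI")
namespace <- "https://www.geonames.org/countries/$1/"

country_uris <- expand_namespace(country_codes, namespace)
print(country_uris)
```

Under that assumption, the short codes in the column stay compact in R, while an export step can still produce globally unique identifiers.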

The dataset_df() call also allows bibliographic metadata:

dataset_bibentry: Dublin Core metadata for citation, reuse, and provenance

Why Units Matter

Many statistical errors begin with a silent assumption about units. In Eurostat data, it’s common to see:

EUR: Euros
MIO_EUR: Millions of euros
PPS: Purchasing Power Standards

By making units explicit at the column level, you:

Prevent decimal-scale mistakes (e.g., thousands vs millions)
Avoid joining or averaging incompatible series
Gain confidence in your data exports (CSV, RDF, JSON-LD, etc.)

This is especially important in multi-currency and multi-country datasets such as those published by Eurostat, where harmonization is crucial.
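
The kind of mistake that explicit units prevent can be illustrated with a small base-R sketch. It stores the unit in a plain attribute and refuses to combine series whose units differ. The helpers with_unit() and combine_checked() are hypothetical and only mimic the idea; they are not how defined() vectors are implemented, and the EUR figures are invented for illustration.

```r
# Hypothetical helpers, not part of the dataset package: the unit lives in a
# plain base-R attribute.
with_unit <- function(x, unit) {
  attr(x, "unit") <- unit
  x
}

# Combine two series only when their units are identical.
combine_checked <- function(a, b) {
  unit_a <- attr(a, "unit")
  unit_b <- attr(b, "unit")
  if (!identical(unit_a, unit_b)) {
    stop("Unit mismatch: ", unit_a, " vs ", unit_b, call. = FALSE)
  }
  # c() drops attributes, so the shared unit is re-attached afterwards.
  with_unit(c(a, b), unit_a)
}

gdp_mio_eur <- with_unit(c(3897, 7365), "MIO_EUR")
gdp_eur     <- with_unit(c(3.2e9, 6.9e9), "EUR")  # illustrative values only

# Same unit: combining succeeds and keeps the unit.
ok <- combine_checked(gdp_mio_eur, gdp_mio_eur)

# Different units: the mismatch is reported instead of silently mixing scales.
result <- tryCatch(
  combine_checked(gdp_mio_eur, gdp_eur),
  error = function(e) conditionMessage(e)
)
print(result)
```

Without the unit attribute, both vectors would be plain numerics and c() would combine them without complaint, which is exactly the silent assumption described above.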

A Final Structure, Ready for Export

The enriched dataset_df object can be serialized to RDF using:

triples <- dataset_to_triples(small_country_dataset)

n_triples(mapply(n_triple, triples$s, triples$p, triples$o))
#> [1] "\"1\"^^<http://www.w3.org/2001/XMLSchema#string> \"rowid\"^^<http://www.w3.org/2001/XMLSchema#string> \"eg:1\"^^<http://www.w3.org/2001/XMLSchema#string> ."     
#> [2] "\"2\"^^<http://www.w3.org/2001/XMLSchema#string> \"rowid\"^^<http://www.w3.org/2001/XMLSchema#string> \"eg:2\"^^<http://www.w3.org/2001/XMLSchema#string> ."     
#> [3] "\"1\"^^<http://www.w3.org/2001/XMLSchema#string> \"country_name\"^^<http://www.w3.org/2001/XMLSchema#string> \"AD\"^^<http://www.w3.org/2001/XMLSchema#string> ."
#> [4] "\"2\"^^<http://www.w3.org/2001/XMLSchema#string> \"country_name\"^^<http://www.w3.org/2001/XMLSchema#string> \"LI\"^^<http://www.w3.org/2001/XMLSchema#string> ."
#> [5] "\"1\"^^<http://www.w3.org/2001/XMLSchema#string> \"gdp\"^^<http://www.w3.org/2001/XMLSchema#string> \"3897\"^^<http://www.w3.org/2001/XMLSchema#string> ."       
#> [6] "\"2\"^^<http://www.w3.org/2001/XMLSchema#string> \"gdp\"^^<http://www.w3.org/2001/XMLSchema#string> \"7365\"^^<http://www.w3.org/2001/XMLSchema#string> ."
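
The shape of this output (one statement per cell, with the row number as subject and the column name as predicate) can be reproduced conceptually in base R. The sketch below is an assumption about the reshaping step only: wide_to_triples() is a hypothetical function, not the implementation of dataset_to_triples(), and it does no URI or datatype handling.

```r
# Hypothetical sketch (not the package's implementation): reshape a wide
# data frame into subject-predicate-object rows, column by column.
wide_to_triples <- function(df) {
  pieces <- lapply(names(df), function(column) {
    data.frame(
      s = as.character(seq_len(nrow(df))),  # subject: row number
      p = column,                           # predicate: column name
      o = as.character(df[[column]]),       # object: cell value as text
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, pieces)
}

wide <- data.frame(
  rowid = c("eg:1", "eg:2"),
  country_name = c("AD", "LI"),
  gdp = c(3897, 7365),
  stringsAsFactors = FALSE
)

long <- wide_to_triples(wide)
print(long)
```

Each row of the resulting data frame corresponds to one line of the N-Triples listing above, before any typing or URI expansion is applied.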

This supports export to:

Wikibase via wbdataset
RDF Data Cube via datacube
DataCite or DCAT metadata formats

This vignette represents the final conceptual structure for dataset_df before its rOpenSci submission. Future work will build on this foundation without breaking it.

Summary: Why Use `dataset_df`

Feature	What It Adds
`label`	Human-readable variable name
`unit`	Explicit unit (e.g., `MIO_EUR`)
`concept`	URI identifying what is measured
`subject`	Dataset-level topical classification
`namespace`	Base URI for RDF subject identifiers
`dataset_bibentry`	Bibliographic metadata via Dublin Core

The dataset_df class is designed to remain fully compatible with the tidyverse data workflow, while offering a metadata structure suitable for:

Receiving SDMX-style statistical data into R
Exporting semantically meaningful datasets to DCAT, RDF, or Wikibase
Complying with open science repository requirements (e.g., DataCite, Zenodo)

Start tidy. Stay meaningful. Embrace dataset_df.
