
Why Semantics Matter for R Data Frames

library(dataset)

R users love data.frames and tibbles for tidy, rectangular data. But tidy data isn’t always meaningful data. What does a column labelled gdp actually represent? Euros? Millions? Per capita? Current prices? Constant 2010 prices? These questions matter—especially in statistics, open data publishing, and knowledge graph integration.

The dataset_df class extends the familiar data.frame structure with lightweight, semantically meaningful metadata. It helps you preserve the meaning of variables, units, identifiers, and dataset-level context.

From Tidy to Meaningful: An Example

Let’s start with a basic data frame and upgrade it to a dataset_df with semantically enriched columns using defined():

small_country_dataset <- dataset_df(
  country_name = defined(c("AD", "LI"),
    label = "Country name",
    concept = "http://data.europa.eu/bna/c_6c2bb82d",
    namespace = "https://www.geonames.org/countries/$1/"
  ),
  gdp = defined(c(3897, 7365),
    label = "Gross Domestic Product",
    unit = "million dollars",
    concept = "http://data.europa.eu/83i/aa/GDP"
  ),
  dataset_bibentry = dublincore(
    title = "Small Country Dataset",
    creator = person("Jane", "Doe"),
    publisher = "Example Inc."
  )
)

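The defined() vectors attach metadata to each column. In the example above, country_name carries a label, a concept URI, and a namespace, while gdp carries a label, a unit, and a concept URI.

The dataset_df() call also accepts bibliographic metadata through the dataset_bibentry argument. Here, dublincore() records the title, creator, and publisher of the dataset.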
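To make the idea of column-level metadata concrete, here is a minimal base-R sketch that stores a label, unit, and concept as plain attributes on a numeric vector. It is only an illustration of the general mechanism and does not reproduce how defined() is implemented. The attribute names are chosen to mirror the arguments used above.

```r
# Base-R sketch: metadata carried as attributes on a plain numeric vector.
# The attribute names mirror the defined() arguments for illustration only.
gdp <- structure(
  c(3897, 7365),
  label   = "Gross Domestic Product",
  unit    = "million dollars",
  concept = "http://data.europa.eu/83i/aa/GDP"
)

# The values still behave like an ordinary numeric vector ...
total_gdp <- sum(gdp)

# ... while the metadata can be read back when needed.
gdp_unit  <- attr(gdp, "unit")
gdp_label <- attr(gdp, "label")

# Caveat: base R drops custom attributes when a plain vector is subsetted,
# so the unit is lost here and subset_unit is NULL.
subset_unit <- attr(gdp[1], "unit")
```

The last line shows why bare attributes are fragile: they do not survive ordinary subsetting, which is one motivation for a dedicated vector class that keeps its metadata attached.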

Why Units Matter

Many statistical errors begin with a silent assumption about units. In Eurostat data, it’s common to see:

By making units explicit at the column level, you:

This is especially important in multi-currency and multi-country datasets such as those published by Eurostat, where harmonization is crucial.
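As a rough illustration of what an explicit unit makes possible, the base-R sketch below refuses to combine two vectors whose declared units differ. The helper combine_same_unit() is hypothetical and is not part of the dataset package. The unit code MIO_USD and the extra values are made up for the example.

```r
# Hypothetical helper: combine two vectors only if their "unit" attributes match.
combine_same_unit <- function(x, y) {
  unit_x <- attr(x, "unit")
  unit_y <- attr(y, "unit")
  if (is.null(unit_x) || is.null(unit_y)) {
    stop("Both vectors must declare a unit.")
  }
  if (!identical(unit_x, unit_y)) {
    stop(sprintf("Unit mismatch: '%s' vs '%s'.", unit_x, unit_y))
  }
  # Strip attributes before concatenating, then re-attach the shared unit.
  structure(c(as.vector(x), as.vector(y)), unit = unit_x)
}

# Illustrative values only; MIO_USD is an assumed unit code for this example.
gdp_eur      <- structure(c(3897, 7365), unit = "MIO_EUR")
gdp_more_eur <- structure(512, unit = "MIO_EUR")
gdp_usd      <- structure(4200, unit = "MIO_USD")

# Same unit: the vectors are combined and the unit is preserved.
combined <- combine_same_unit(gdp_eur, gdp_more_eur)

# Different units: the error is caught and its message is kept for inspection.
mismatch <- tryCatch(
  combine_same_unit(gdp_eur, gdp_usd),
  error = function(e) conditionMessage(e)
)
```

Without the unit attribute, c(gdp_eur, gdp_usd) would silently produce a vector that mixes euro and dollar figures.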

A Final Structure, Ready for Export

The enriched dataset_df object can be serialized to RDF using:

triples <- dataset_to_triples(small_country_dataset)

n_triples(mapply(n_triple, triples$s, triples$p, triples$o))
#> [1] "\"1\"^^<http://www.w3.org/2001/XMLSchema#string> \"rowid\"^^<http://www.w3.org/2001/XMLSchema#string> \"eg:1\"^^<http://www.w3.org/2001/XMLSchema#string> ."     
#> [2] "\"2\"^^<http://www.w3.org/2001/XMLSchema#string> \"rowid\"^^<http://www.w3.org/2001/XMLSchema#string> \"eg:2\"^^<http://www.w3.org/2001/XMLSchema#string> ."     
#> [3] "\"1\"^^<http://www.w3.org/2001/XMLSchema#string> \"country_name\"^^<http://www.w3.org/2001/XMLSchema#string> \"AD\"^^<http://www.w3.org/2001/XMLSchema#string> ."
#> [4] "\"2\"^^<http://www.w3.org/2001/XMLSchema#string> \"country_name\"^^<http://www.w3.org/2001/XMLSchema#string> \"LI\"^^<http://www.w3.org/2001/XMLSchema#string> ."
#> [5] "\"1\"^^<http://www.w3.org/2001/XMLSchema#string> \"gdp\"^^<http://www.w3.org/2001/XMLSchema#string> \"3897\"^^<http://www.w3.org/2001/XMLSchema#string> ."       
#> [6] "\"2\"^^<http://www.w3.org/2001/XMLSchema#string> \"gdp\"^^<http://www.w3.org/2001/XMLSchema#string> \"7365\"^^<http://www.w3.org/2001/XMLSchema#string> ."
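The subject-predicate-object shape of the output above can be illustrated with a small base-R sketch that reshapes a wide data frame into long form. The helper to_long_triples() is hypothetical. It is not the package's dataset_to_triples(), and it omits the rowid statements and the datatype annotations shown in the real output.

```r
# Hypothetical helper: reshape a wide data frame into subject-predicate-object rows.
to_long_triples <- function(df) {
  do.call(rbind, lapply(names(df), function(col) {
    data.frame(
      s = as.character(seq_len(nrow(df))),  # row number as the subject
      p = col,                              # column name as the predicate
      o = as.character(df[[col]]),          # cell value as the object
      stringsAsFactors = FALSE
    )
  }))
}

small <- data.frame(
  country_name = c("AD", "LI"),
  gdp = c(3897, 7365),
  stringsAsFactors = FALSE
)

# One row per cell: two columns times two rows gives four statements.
triples_sketch <- to_long_triples(small)
```

Each cell of the original table becomes one statement, which is why a two-row, two-column dataset yields four value triples in addition to the two rowid triples in the package output.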

This supports export to:

This vignette represents the final conceptual structure for dataset_df before its rOpenSci submission. Future work will build on this foundation without breaking it.

Summary: Why Use dataset_df

Feature           What It Adds
label             Human-readable variable name
unit              Explicit unit (e.g., MIO_EUR)
concept           URI identifying what is measured
subject           Dataset-level topical classification
namespace         Base URI for RDF subject identifiers
dataset_bibentry  Bibliographic metadata via Dublin Core

The dataset_df class is designed to remain fully compatible with the tidyverse data workflow, while offering a metadata structure suitable for:

Start tidy. Stay meaningful. Embrace dataset_df.
