R users love data.frames and tibbles for tidy, rectangular data. But tidy data isn’t always meaningful data. What does a column labelled gdp actually represent? Euros? Millions? Per capita? Current prices? Constant 2010 prices? These questions matter, especially in statistics, open data publishing, and knowledge graph integration.
The dataset_df class extends the familiar data.frame structure with lightweight, semantically meaningful metadata. It’s built for:
Tidyverse lovers who want better documentation and safer analysis
Open science workflows that need interoperable metadata
Semantic web users who want to export structured RDF data from R
dataset_df helps you preserve the meaning of variables, units, identifiers, and dataset-level context.
Let’s start with a basic data frame and upgrade it to a dataset_df with semantically enriched columns using defined():
library(dataset)

small_country_dataset <- dataset_df(
  # Each column is a defined() vector carrying its own metadata
  country_name = defined(c("AD", "LI"),
    label = "Country name",
    concept = "http://data.europa.eu/bna/c_6c2bb82d",
    namespace = "https://www.geonames.org/countries/$1/"
  ),
  gdp = defined(c(3897, 7365),
    label = "Gross Domestic Product",
    unit = "million dollars",
    concept = "http://data.europa.eu/83i/aa/GDP"
  ),
  # Dataset-level bibliographic metadata (Dublin Core)
  dataset_bibentry = dublincore(
    title = "Small Country Dataset",
    creator = person("Jane", "Doe"),
    publisher = "Example Inc."
  )
)
The defined() vectors attach metadata to each column:
label: a human-readable name
unit: an explicit measurement unit
concept: a URI identifying the concept measured
namespace: for generating full subject URIs when exporting to RDF
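To make the role of namespace more concrete, here is a minimal base R sketch of how a template such as the one used for country_name could be expanded into full URIs. It assumes that the $1 placeholder is replaced by each value; expand_namespace() is a hypothetical helper written for this illustration and is not part of the dataset package.

```r
# Hypothetical helper, not part of the dataset package: expand a namespace
# template by substituting each value for the literal `$1` placeholder.
expand_namespace <- function(values, namespace) {
  vapply(
    values,
    function(v) sub("$1", v, namespace, fixed = TRUE),
    character(1),
    USE.NAMES = FALSE
  )
}

country_uris <- expand_namespace(
  c("AD", "LI"),
  "https://www.geonames.org/countries/$1/"
)
print(country_uris)
```

With the two country codes from the example, this yields one URI per observation, ending in AD/ and LI/ respectively.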
The dataset_df() call also allows bibliographic metadata:
dataset_bibentry: Dublin Core metadata for citation, reuse, and provenance
Many statistical errors begin with a silent assumption about units. In Eurostat data, it’s common to see:
EUR: Euros
MIO_EUR: Millions of euros
PPS: Purchasing Power Standards
By making units explicit at the column level, you:
Prevent decimal-scale mistakes (e.g., thousands vs millions)
Avoid joining or averaging incompatible series
Gain confidence in your data exports (CSV, RDF, JSON-LD, etc.)
This is especially important in multi-currency and multi-country datasets such as those published by Eurostat, where harmonization is crucial.
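The benefit of explicit units can be sketched in a few lines of base R. The example below stores the unit as a plain attribute and refuses to combine two series whose units differ. with_unit() and combine_same_unit() are hypothetical helpers, the third value (1200) is made up for illustration, and none of this is meant to describe how the dataset package implements its own checks.

```r
# Hypothetical helpers for illustration only.
# A unit is stored as a plain attribute on a numeric vector.
with_unit <- function(x, unit) {
  attr(x, "unit") <- unit
  x
}

# Combine two series only if their units match; otherwise stop early.
combine_same_unit <- function(a, b) {
  unit_a <- attr(a, "unit")
  unit_b <- attr(b, "unit")
  if (!identical(unit_a, unit_b)) {
    stop("Unit mismatch: ", unit_a, " vs ", unit_b, call. = FALSE)
  }
  # c() drops the attribute, so the shared unit is attached again.
  with_unit(c(a, b), unit_a)
}

# Illustrative values: the same amounts in millions and in single euros.
gdp_mio <- with_unit(c(3897, 7365), "MIO_EUR")
gdp_eur <- with_unit(c(3.897e9, 7.365e9), "EUR")

# Same unit: the series are combined and the unit is kept.
combined <- combine_same_unit(gdp_mio, with_unit(1200, "MIO_EUR"))

# Different units: the mismatch is reported instead of silently mixing scales.
mismatch <- tryCatch(
  combine_same_unit(gdp_mio, gdp_eur),
  error = function(e) conditionMessage(e)
)
print(mismatch)
```

Failing loudly at the point of combination is the design choice here: a scale error of six orders of magnitude is caught before it can reach an average or a join.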
The enriched dataset_df object can be serialized to RDF using:
triples <- dataset_to_triples(small_country_dataset)
n_triples(mapply(n_triple, triples$s, triples$p, triples$o))
#> [1] "\"1\"^^<http://www.w3.org/2001/XMLSchema#string> \"rowid\"^^<http://www.w3.org/2001/XMLSchema#string> \"eg:1\"^^<http://www.w3.org/2001/XMLSchema#string> ."
#> [2] "\"2\"^^<http://www.w3.org/2001/XMLSchema#string> \"rowid\"^^<http://www.w3.org/2001/XMLSchema#string> \"eg:2\"^^<http://www.w3.org/2001/XMLSchema#string> ."
#> [3] "\"1\"^^<http://www.w3.org/2001/XMLSchema#string> \"country_name\"^^<http://www.w3.org/2001/XMLSchema#string> \"AD\"^^<http://www.w3.org/2001/XMLSchema#string> ."
#> [4] "\"2\"^^<http://www.w3.org/2001/XMLSchema#string> \"country_name\"^^<http://www.w3.org/2001/XMLSchema#string> \"LI\"^^<http://www.w3.org/2001/XMLSchema#string> ."
#> [5] "\"1\"^^<http://www.w3.org/2001/XMLSchema#string> \"gdp\"^^<http://www.w3.org/2001/XMLSchema#string> \"3897\"^^<http://www.w3.org/2001/XMLSchema#string> ."
#> [6] "\"2\"^^<http://www.w3.org/2001/XMLSchema#string> \"gdp\"^^<http://www.w3.org/2001/XMLSchema#string> \"7365\"^^<http://www.w3.org/2001/XMLSchema#string> ."
This supports export to:
Wikibase via wbdataset
RDF Data Cube via datacube
DataCite or DCAT metadata formats
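The serialization output shown earlier has one statement per cell: the row identifier as subject, the column name as predicate, and the cell value as object. The following base R sketch reproduces that long-form reshaping on a plain data.frame. to_triples() is a hypothetical function written for this illustration; dataset_to_triples() is the package's own function and may differ in detail.

```r
# Illustrative base R only; not the package's implementation.
to_triples <- function(df) {
  row_ids <- as.character(seq_len(nrow(df)))
  pieces <- lapply(names(df), function(col) {
    data.frame(
      s = row_ids,                 # subject: the row identifier
      p = col,                     # predicate: the column name
      o = as.character(df[[col]]), # object: the cell value as text
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, pieces)
}

small <- data.frame(
  country_name = c("AD", "LI"),
  gdp = c(3897, 7365),
  stringsAsFactors = FALSE
)

triples_sketch <- to_triples(small)
print(triples_sketch)
```

A two-row, two-column table becomes four statements. The sketch leaves out the rowid statements and the datatype annotations that appear in the real output.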
This vignette represents the final conceptual structure for dataset_df before its rOpenSci submission. Future work will build on this foundation without breaking it.
dataset_df
Feature | What It Adds |
---|---|
label | Human-readable variable name |
unit | Explicit unit (e.g., MIO_EUR) |
concept | URI identifying what is measured |
subject | Dataset-level topical classification |
namespace | Base URI for RDF subject identifiers |
dataset_bibentry | Bibliographic metadata via Dublin Core |
The dataset_df class is designed to remain fully compatible with the tidyverse data workflow, while offering a metadata structure suitable for:
Receiving SDMX-style statistical data into R
Exporting semantically meaningful datasets to DCAT, RDF, or Wikibase
Complying with open science repository requirements (e.g., DataCite, Zenodo)
Start tidy. Stay meaningful. Embrace dataset_df.