The dataset package enriches R’s native data structures with machine-readable metadata. It allows variables and datasets to carry semantic definitions (such as URIs, labels, units, and provenance), which makes them suitable for long-term reuse, FAIR-compliant publishing, and integration into semantic web systems.
Unlike most metadata packages, which attach metadata after the fact, dataset follows a semantic early-binding approach: metadata is embedded as soon as the data is created.
This vignette provides a high-level introduction. For details on key components, see the following:
vignette("defined", package = "dataset"): Semantic vectors with defined()
vignette("dataset_df", package = "dataset"): Structuring and metadata with dataset_df()
vignette("rdf", package = "dataset"): Exporting to RDF and Linked Data
vignette("bibrecord", package = "dataset"): Creating rich citation metadata using bibrecord()
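Hadley Wickham (2014) defines tidy data with three principles:
Each variable forms a column.
Each observation forms a row.
Each type of observational unit forms a table.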
This structure is ideal for analysis but lacks semantic clarity, particularly in the realistic but less-than-ideal scenario where an analyst works with several datasets received from various internet services. For example, two datasets might both contain a column named gdp, but one might be in euros and the other in dollars. Without metadata, tools cannot detect this mismatch.
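The mismatch described above can be sketched in base R, without the package. This is only an illustration under stated assumptions: the second series, the unit code "CP_MUSD", and the helper combine_checked() are hypothetical, and the unit is stored as a plain attribute rather than through any package API.

```r
# Two GDP columns from hypothetical sources; the values look comparable.
gdp_source_a <- c(2355, 2592, 2884) # assumed to be million euros
gdp_source_b <- c(2561, 2809, 3120) # assumed to be million US dollars

# Plain numeric vectors combine silently: nothing signals the unit mismatch.
combined <- c(gdp_source_a, gdp_source_b)

# Attach a unit as a plain base R attribute (illustration only).
attr(gdp_source_a, "unit") <- "CP_MEUR"
attr(gdp_source_b, "unit") <- "CP_MUSD" # hypothetical unit code

# Hypothetical helper: refuse to combine vectors whose units differ.
combine_checked <- function(x, y) {
  if (!identical(attr(x, "unit"), attr(y, "unit"))) {
    stop("Unit mismatch: ", attr(x, "unit"), " vs ", attr(y, "unit"))
  }
  structure(c(x, y), unit = attr(x, "unit"))
}

# With the unit recorded, the mismatch is caught instead of passing silently.
result <- tryCatch(
  combine_checked(gdp_source_a, gdp_source_b),
  error = function(e) conditionMessage(e)
)
print(result)
```

Once the unit travels with the vector, a tool can check it; a bare column name gives the tool nothing to check.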
The dataset package addresses this by allowing you to define variables explicitly, and to store dataset-level metadata within a tidy tibble.
Semantically rich vectors are columns of a data.frame that carry more than a simple column name: a long-form, human-readable label; a machine- and human-readable variable definition; and, if needed, a reference to an external resource that contains the codebook.
library(dataset)
gdp <- defined(
  c(2355, 2592, 2884),
  label = "Gross Domestic Product",
  unit = "CP_MEUR",
  concept = "http://data.europa.eu/83i/aa/GDP"
)
geo <- defined(
  rep("AD", 3),
  label = "Geopolitical Entity",
  concept = "http://purl.org/linked-data/sdmx/2009/dimension#refArea",
  namespace = "https://www.geonames.org/countries/$1/"
)
gdp
#> x: Gross Domestic Product
#> Defined as http://data.europa.eu/83i/aa/GDP, measured in CP_MEUR
#> [1] 2355 2592 2884
geo
#> x: Geopolitical Entity
#> Defined as http://purl.org/linked-data/sdmx/2009/dimension#refArea
#> [1] "AD" "AD" "AD"
In this case, we define geo as the geopolitical entity http://purl.org/linked-data/sdmx/2009/dimension#refArea, and we know that the AD value can resolve to Andorra: https://www.geonames.org/countries/AD/. These vectors now carry metadata you can inspect directly (including their label, unit, and concept URI), which will be preserved even after transformation or storage.
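How a coded value such as AD could resolve to a URL can be sketched with plain string substitution. This assumes that $1 in the namespace acts as a placeholder for the coded value, which is inferred from the example above; the package's own resolution logic may differ, and the second code "LI" is added only for illustration.

```r
# Namespace template from the example above; "$1" is assumed to be a placeholder.
namespace <- "https://www.geonames.org/countries/$1/"

# Coded values to resolve; "LI" is a hypothetical second value.
codes <- c("AD", "LI")

# Replace the literal "$1" with each code (fixed = TRUE disables regex parsing).
resolved <- vapply(
  codes,
  function(code) sub("$1", code, namespace, fixed = TRUE),
  character(1),
  USE.NAMES = FALSE
)
print(resolved)
```

Storing one template with the vector, rather than a full URL per value, keeps the data compact while still letting every coded value be expanded on demand.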
small_dataset <- dataset_df(
  geo = geo,
  gdp = gdp,
  identifier = c(gdp = "http://example.com/dataset#gdp"),
  dataset_bibentry = dublincore(
    title = "Small GDP Dataset",
    creator = person("Jane", "Doe", role = "aut"),
    publisher = "Small Repository",
    subject = "Gross Domestic Product"
  )
)
small_dataset
#> Doe (2025): Small GDP Dataset [dataset]
#>   rowid     geo       gdp
#>   <defined> <defined> <defined>
#> 1 gdp1      AD        2355
#> 2 gdp2      AD        2592
#> 3 gdp3      AD        2884
This dataset not only stores the variables and values, but also includes embedded metadata that supports precise interpretation and repository-level publication.
As Carl Boettiger has shown in the vignettes accompanying the rdflib R package (see: A tidyverse lover’s intro to RDF), tidy datasets can be retrofitted with rich metadata if they are pivoted to a strictly three-column long format.
Our package tries to lower the burden of such retrofitting through early binding and sensible defaults, which serialise both the dataset’s contents and its bibliographic data to this format for users who are not familiar with RDF.
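To make the three-column long format concrete, here is a base R sketch that pivots a small tidy table by hand. It is independent of the package: the data frame tidy and the column names subject, predicate, and object are chosen for illustration, and bare column names stand in for the URIs a real RDF serialisation would use.

```r
# A small tidy table with a row identifier and two variables.
tidy <- data.frame(
  rowid = c("gdp1", "gdp2", "gdp3"),
  geo = c("AD", "AD", "AD"),
  gdp = c(2355, 2592, 2884),
  stringsAsFactors = FALSE
)

# Every column except the identifier becomes a predicate.
value_columns <- setdiff(names(tidy), "rowid")

# One row per cell: which observation, which variable, which value.
long <- do.call(rbind, lapply(value_columns, function(column) {
  data.frame(
    subject = tidy$rowid,
    predicate = column,
    object = as.character(tidy[[column]]),
    stringsAsFactors = FALSE
  )
}))
print(long)
```

Each cell of the tidy table becomes one statement, so three rows and two variables yield six rows in the long form.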
You can convert any dataset_df object into a tidy 3-column representation (subject–predicate–object) using dataset_to_triples():
triples <- dataset_to_triples(small_dataset,
  format = "nt"
)
triples
#> [1] "<http://example.com/dataset#gdpgdp1> <http://purl.org/linked-data/sdmx/2009/dimension#refArea> <https://www.geonames.org/countries/AD/> ."
#> [2] "<http://example.com/dataset#gdpgdp2> <http://purl.org/linked-data/sdmx/2009/dimension#refArea> <https://www.geonames.org/countries/AD/> ."
#> [3] "<http://example.com/dataset#gdpgdp3> <http://purl.org/linked-data/sdmx/2009/dimension#refArea> <https://www.geonames.org/countries/AD/> ."
#> [4] "<http://example.com/dataset#gdpgdp1> <http://data.europa.eu/83i/aa/GDP> \"2355\"^^<xsd:decimal> ."
#> [5] "<http://example.com/dataset#gdpgdp2> <http://data.europa.eu/83i/aa/GDP> \"2592\"^^<xsd:decimal> ."
#> [6] "<http://example.com/dataset#gdpgdp3> <http://data.europa.eu/83i/aa/GDP> \"2884\"^^<xsd:decimal> ."
This 3-column format (subject–predicate–object) is compatible with semantic web tools such as SPARQL, rdflib, and triple stores.
mycon <- tempfile("my_dataset", fileext = ".nt")
my_description <- describe(x = small_dataset, con = mycon)
# Only three statements are shown:
readLines(mycon)[c(4, 8, 12)]
#> [1] "_:doejane <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Agent> ."
#> [2] "<http://example.com/dataset_tba/> <http://purl.org/dc/terms/title> \"Small GDP Dataset\"^^<http://www.w3.org/2001/XMLSchema#string> ."
#> [3] "<http://example.com/dataset_tba/> <http://purl.org/dc/terms/type> <http://purl.org/dc/dcmitype/Dataset> ."
## Show two lines of provenance:
provenance(small_dataset)[c(6, 7)]
#> [1] "<http://example.com/creation> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Activity> ."
#> [2] "<http://example.com/creation> <http://www.w3.org/ns/prov#generatedAtTime> \"2025-08-25T22:11:34Z\"^^<xsd:dateTime> ."
The dataset package enriches tidy data by attaching metadata from the start of the workflow. It helps avoid semantic mismatches, supports RDF publication, and meets interoperability standards like SDMX, DataCite, and Dublin Core. Use it when you need:
Meaningful variable descriptions and URIs
Dataset-level metadata embedded directly in .rds or .rda files
Easy export to RDF and semantic web formats
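The point about metadata embedded in .rds files rests on a base R behaviour: attributes attached to an object are serialised together with its values. The sketch below shows this with plain attributes standing in for the package's classes, so it illustrates the mechanism rather than the package's implementation.

```r
# A numeric vector with metadata stored as plain attributes (illustration only).
gdp_values <- structure(
  c(2355, 2592, 2884),
  label = "Gross Domestic Product",
  unit = "CP_MEUR"
)

# Write to a temporary .rds file and read it back.
path <- tempfile(fileext = ".rds")
saveRDS(gdp_values, path)
restored <- readRDS(path)

# The metadata survives the round trip along with the values.
print(attr(restored, "label"))
print(attr(restored, "unit"))

unlink(path) # clean up the temporary file
```

Because the metadata lives inside the object, no separate sidecar file has to be kept in sync with the data.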
For deeper examples, see:
vignette("defined", package = "dataset"): Working with semantic vectors
vignette("dataset_df", package = "dataset"): Dataset-level metadata and structure
vignette("rdf", package = "dataset"): Linked Data and export
vignette("bibrecord", package = "dataset"): Creating rich citation metadata using bibrecord()