
defined: Semantically Enriched Vectors

The dataset package extends R’s native data structures with machine-readable metadata. It follows a semantic early-binding approach, which means metadata is embedded as soon as the data is created, making datasets suitable for long-term reuse, FAIR-compliant publishing, and integration into semantic web systems.

defined works naturally with data structured according to tidy data principles (Wickham, 2014), where each variable is a column, each observation is a row, and each type of observational unit forms a table. It adds a semantic layer to individual vectors so that their meaning is explicit, consistent, and machine-readable.

This vignette focuses specifically on the defined function, which you can use to create a semantically enriched vector. For details on semantically enriched data frames, see vignette("dataset_df", package = "dataset").

Purpose

The defined() function helps you create semantically rich labelled vectors that are easier to interpret, exchange, and reuse.

By attaching metadata at creation time, defined prevents the loss of context and meaning that often occurs when data is exchanged or archived. This approach supports the FAIR data principles (Findable, Accessible, Interoperable, Reusable) and facilitates integration into semantic web systems.

Getting started

library(dataset)
data("gdp")

We’ll start by wrapping a numeric GDP vector using defined().

gdp_1 <- defined(
  gdp$gdp,
  label = "Gross Domestic Product",
  unit = "CP_MEUR",
  concept = "http://data.europa.eu/83i/aa/GDP"
)

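The defined() class builds on labelled vectors by adding richer metadata: in addition to the variable label, it records a unit of measurement, a concept definition, and an optional namespace.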

This is particularly useful for reproducible research, standards-compliant data, and long-term interoperability. The metadata is stored as ordinary R attributes, which keeps the class widely compatible: a defined vector can be used even in base R.

attributes(gdp_1)
#> $label
#> [1] "Gross Domestic Product"
#> 
#> $class
#> [1] "haven_labelled_defined" "haven_labelled"         "vctrs_vctr"            
#> [4] "double"                
#> 
#> $unit
#> [1] "CP_MEUR"
#> 
#> $concept
#> [1] "http://data.europa.eu/83i/aa/GDP"

The output shows that the actual S3 class is haven_labelled_defined, which inherits from haven_labelled (see labelled::labelled). Dataset summary headers abbreviate it as <defined>.
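Because the metadata lives in plain attributes, base R's attr() can read it without any helper function. The following is a minimal sketch; gdp_demo and its three values are illustrative stand-ins for gdp_1, not part of the package.

```r
library(dataset)

# Illustrative stand-in for gdp_1 with only three values.
gdp_demo <- defined(
  c(2354.8, 2593.9, 2883.7),
  label = "Gross Domestic Product",
  unit = "CP_MEUR",
  concept = "http://data.europa.eu/83i/aa/GDP"
)

# attr() is base R; it reads a single attribute by name.
attr(gdp_demo, "unit")
#> [1] "CP_MEUR"
attr(gdp_demo, "concept")
#> [1] "http://data.europa.eu/83i/aa/GDP"

# List the names of every attribute attached to the vector.
names(attributes(gdp_demo))
```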

Use the var_label(), var_unit() and var_concept() helper functions to set or retrieve metadata individually.

cat("Get the label only: ", var_label(gdp_1), "\n")
#> Get the label only:  Gross Domestic Product
cat("Get the unit only: ", var_unit(gdp_1), "\n")
#> Get the unit only:  CP_MEUR
cat("Get the concept definition only: ", var_concept(gdp_1), "\n")
#> Get the concept definition only:  http://data.europa.eu/83i/aa/GDP
cat("All attributes:\n")
#> All attributes:

Printing and summary

The most frequently used vector methods, such as print() and summary(), are implemented as expected:

print(gdp_1)
#> gdp_1: Gross Domestic Product
#> Defined as http://data.europa.eu/83i/aa/GDP, measured in CP_MEUR 
#>  [1] 2354.8 2593.9 2883.7 3119.5 5430.5 6423.7 6758.6 1265.1 1461.4 1612.3
summary(gdp_1)
#> Gross Domestic Product (CP_MEUR)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>    1265    1798    2739    3390    4853    6759

Handling ambiguity

If you try to concatenate a semantically under-specified vector with an existing defined vector, you get a deliberate error stating that some attributes are incompatible. This prevents combining values that differ in meaning, such as GDP figures expressed in different currencies.

gdp_2 <- defined(
  c(2523.6, 2725.8, 3013.2),
  label = "Gross Domestic Product"
)

In the following example, gdp_2 carries a label but no unit or concept, so it is defined less precisely than gdp_1.

c(gdp_1, gdp_2)
#> Error in vec_c():
#> ! Can't combine ..1 <haven_labelled_defined> and ..2 <haven_labelled_defined>.
#> ✖ Some attributes are incompatible.
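In a script or pipeline, this error can be caught so that a mismatch is reported instead of stopping the run. The sketch below uses base R's tryCatch(); gdp_a and gdp_b are illustrative vectors that mirror the situation above, and the fallback of returning NULL is just one possible choice.

```r
library(dataset)

# A fully specified vector and an under-specified one (illustrative values).
gdp_a <- defined(
  c(2354.8, 2593.9),
  label = "Gross Domestic Product",
  unit = "CP_MEUR",
  concept = "http://data.europa.eu/83i/aa/GDP"
)
gdp_b <- defined(
  c(2523.6, 2725.8),
  label = "Gross Domestic Product"
)

# Attempt the combination; on error, report the message and return NULL.
combined <- tryCatch(
  c(gdp_a, gdp_b),
  error = function(e) {
    message("Vectors not combined: ", conditionMessage(e))
    NULL
  }
)

# TRUE when the combination was refused.
is.null(combined)
```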

To resolve this, you can add the missing attributes so that the vectors are semantically compatible.

Let’s complete the definition of gdp_2, the GDP of the Faroe Islands:

var_unit(gdp_2) <- "CP_MEUR"
var_concept(gdp_2) <- "http://data.europa.eu/83i/aa/GDP"

Once the metadata matches, you can combine them.

new_gdp <- c(gdp_1, gdp_2)
summary(new_gdp)
#> Gross Domestic Product (CP_MEUR)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>    1265    2355    2726    3244    3120    6759

Using namespaces for coded values

You can also define variables that store codes (like country codes) with a namespace that points to a human- and machine-readable definition of those codes. In statistical datasets, such attribute columns describe characteristics of the observations or the measured variables.

country <- defined(
  c("AD", "LI", "SM"),
  label = "Country name",
  concept = "http://purl.org/linked-data/sdmx/2009/dimension#refArea",
  namespace = "https://www.geonames.org/countries/$1/"
)

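For example, with each code substituted for $1, the namespace definition above points to https://www.geonames.org/countries/AD/ for AD, https://www.geonames.org/countries/LI/ for LI, and https://www.geonames.org/countries/SM/ for SM.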

You can get or set the namespace of a defined vector with var_namespace().

var_namespace(country)
#> [1] "https://www.geonames.org/countries/$1/"
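To turn coded values into full URIs programmatically, the placeholder can be filled in with base R string functions. This is a sketch that assumes $1 marks the position of the code in the namespace template; expand_namespace() is a hypothetical helper written for this example, not a function of the dataset package.

```r
# Hypothetical helper: substitute each code for the literal "$1" placeholder.
expand_namespace <- function(codes, namespace) {
  vapply(
    codes,
    function(code) sub("$1", code, namespace, fixed = TRUE),
    character(1),
    USE.NAMES = FALSE
  )
}

country_uris <- expand_namespace(
  c("AD", "LI", "SM"),
  "https://www.geonames.org/countries/$1/"
)
country_uris
#> [1] "https://www.geonames.org/countries/AD/"
#> [2] "https://www.geonames.org/countries/LI/"
#> [3] "https://www.geonames.org/countries/SM/"
```

With a defined vector at hand, the same helper could be called as expand_namespace(as_character(country), var_namespace(country)).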

A URI such as http://publications.europa.eu/resource/authority/bna/c_6c2bb82d resolves to a machine-readable definition of geographical names.

The use of several defined vectors in a dataset_df object is explained in a separate vignette.

Basic usage

You can create defined vectors from character values as well as numeric values. Methods like as_character() and as_numeric() let you coerce back to base R types while controlling what happens to the metadata.

countries <- defined(
  c("AD", "LI"),
  label = "Country code",
  namespace = "https://www.geonames.org/countries/$1/"
)

countries
#> x: Country code
#> Defined vector 
#> [1] "AD" "LI"
as_character(countries)
#> [1] "AD" "LI"

Subsetting and coercion

Subsetting a defined vector works like subsetting any other vector.

gdp_1[1:2]
#> x: Gross Domestic Product
#> Defined as http://data.europa.eu/83i/aa/GDP, measured in CP_MEUR 
#> [1] 2354.8 2593.9
gdp_1[gdp_1 > 5000]
#> x: Gross Domestic Product
#> Defined as http://data.europa.eu/83i/aa/GDP, measured in CP_MEUR 
#> [1] 5430.5 6423.7 6758.6
as.vector(gdp_1)
#>  [1] 2354.8 2593.9 2883.7 3119.5 5430.5 6423.7 6758.6 1265.1 1461.4 1612.3
as.list(gdp_1)
#> [[1]]
#> x: Gross Domestic Product
#> Defined as http://data.europa.eu/83i/aa/GDP, measured in CP_MEUR 
#> [1] 2354.8
#> 
#> [[2]]
#> x: Gross Domestic Product
#> Defined as http://data.europa.eu/83i/aa/GDP, measured in CP_MEUR 
#> [1] 2593.9
#> 
#> [[3]]
#> x: Gross Domestic Product
#> Defined as http://data.europa.eu/83i/aa/GDP, measured in CP_MEUR 
#> [1] 2883.7
#> 
#> [[4]]
#> x: Gross Domestic Product
#> Defined as http://data.europa.eu/83i/aa/GDP, measured in CP_MEUR 
#> [1] 3119.5
#> 
#> [[5]]
#> x: Gross Domestic Product
#> Defined as http://data.europa.eu/83i/aa/GDP, measured in CP_MEUR 
#> [1] 5430.5
#> 
#> [[6]]
#> x: Gross Domestic Product
#> Defined as http://data.europa.eu/83i/aa/GDP, measured in CP_MEUR 
#> [1] 6423.7
#> 
#> [[7]]
#> x: Gross Domestic Product
#> Defined as http://data.europa.eu/83i/aa/GDP, measured in CP_MEUR 
#> [1] 6758.6
#> 
#> [[8]]
#> x: Gross Domestic Product
#> Defined as http://data.europa.eu/83i/aa/GDP, measured in CP_MEUR 
#> [1] 1265.1
#> 
#> [[9]]
#> x: Gross Domestic Product
#> Defined as http://data.europa.eu/83i/aa/GDP, measured in CP_MEUR 
#> [1] 1461.4
#> 
#> [[10]]
#> x: Gross Domestic Product
#> Defined as http://data.europa.eu/83i/aa/GDP, measured in CP_MEUR 
#> [1] 1612.3
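The printed subsets above still show the concept and the unit, which suggests that the metadata survives subsetting. A quick way to check this is to query the helpers on the subset itself; gdp_demo below is an illustrative vector, not part of the package.

```r
library(dataset)

# Illustrative vector with one value above 5000.
gdp_demo <- defined(
  c(2354.8, 2593.9, 5430.5),
  label = "Gross Domestic Product",
  unit = "CP_MEUR",
  concept = "http://data.europa.eu/83i/aa/GDP"
)

# Logical subsetting, as in gdp_1[gdp_1 > 5000].
large_gdp <- gdp_demo[gdp_demo > 5000]

# The helpers read the metadata carried by the subset.
var_unit(large_gdp)
var_concept(large_gdp)

# as.vector() returns the underlying values of the subset.
as.vector(large_gdp)
```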

Coerce to base R types

Use as_character() to convert to a character vector.

as_character(country)
#> [1] "AD" "LI" "SM"
as_character(c(gdp_1, gdp_2))
#>  [1] "2354.8" "2593.9" "2883.7" "3119.5" "5430.5" "6423.7" "6758.6" "1265.1"
#>  [9] "1461.4" "1612.3" "2523.6" "2725.8" "3013.2"

Use as_factor() to convert a categorical variable to a factor:

as_factor(country)
#> [1] AD LI SM
#> Levels: AD LI SM

Use as_numeric() to convert to a numeric vector.

as_numeric(c(gdp_1, gdp_2))
#>  [1] 2354.8 2593.9 2883.7 3119.5 5430.5 6423.7 6758.6 1265.1 1461.4 1612.3
#> [11] 2523.6 2725.8 3013.2

Conclusion

The defined() function provides a lightweight yet powerful way to make vectors self-descriptive by attaching semantic metadata directly to them. By combining a variable label, unit of measurement, concept definition, and optional namespace, defined ensures that each vector’s meaning is explicit, consistent, and machine-readable.

Because the metadata is embedded at creation time, it travels with the vector throughout your workflow, whether you are analysing, transforming, or exporting data.
This prevents context loss, supports the FAIR data principles (Findable, Accessible, Interoperable, Reusable), and facilitates integration with semantic web technologies.

defined vectors work seamlessly with the dataset_df class to create semantically enriched data frames where both datasets and their constituent variables carry rich, standardised metadata.
For more on creating semantically enriched datasets, see the dataset_df vignette.

For guidance on recording bibliographic metadata and citations, see the bibrecord vignette.
