Type:

Package

Title:

Parse StatCan PUMF Files

Version:

0.5.2

Description:

Facilitate working with Statistics Canada (StatCan) Public Use Microdata Files (PUMF). Enables downloading of available PUMF data, parsing of metadata from command files or other sources to infer the layout structure, variable labels and value labels as well as missing data values, and returns a connection to a 'DuckDB' database with the labelled data. Data and documentation come from Statistics Canada's Public Use Microdata Files https://www.statcan.gc.ca/en/microdata/pumf, distributed under the Statistics Canada Open Licence https://www.statcan.gc.ca/en/terms-conditions/open-licence.

License:

MIT + file LICENSE

Encoding:

UTF-8

Depends:

R (≥ 4.2)

Imports:

dplyr (≥ 1.1.0), readr, stringr, rlang, utils, tibble, rvest, httr, purrr, DBI, duckdb (≥ 1.5.2), duckplyr (≥ 1.2.1), dbplyr, haven (≥ 2.5.0), zip

Suggests:

rmarkdown, knitr, scales, ggplot2, testthat (≥ 3.0.0), withr, microbenchmark, DiagrammeR, DiagrammeRsvg, rsvg, pdftools, tidyr

URL:

https://github.com/mountainMath/canpumf, https://mountainmath.github.io/canpumf/

BugReports:

https://github.com/mountainMath/canpumf/issues

VignetteBuilder:

knitr

Language:

en-CA

Config/roxygen2/version:

8.0.0

NeedsCompilation:

Packaged:

2026-06-27 16:14:21 UTC; jens

Author:

Jens von Bergmann [aut, cre]

Maintainer:

Jens von Bergmann <jens@mountainmath.ca>

Repository:

CRAN

Date/Publication:

2026-07-03 12:10:09 UTC

Find the registered sibling whose config best fits an unregistered year

Description

Considers only plain four-digit-year keys ('series/2023') so multi-part versions (Census '1971/individuals_prov') never inherit across types. Prefers the newest sibling not later than 'version'; if the requested year predates every entry, falls back to the oldest registered sibling.

Usage

.pumf_registry_newest_sibling(series, version)

Value

the chosen sibling version string, or 'NULL' if no year-keyed sibling
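
The selection rule described above can be sketched in base R. This is only an illustration of the documented behaviour, not the package's internal code: 'pick_sibling()' and the 'registered' vector are hypothetical names, and returning 'NULL' for a non-year 'version' is an assumption.

```r
# Hypothetical helper illustrating the documented rule; not canpumf's code.
pick_sibling <- function(registered, version) {
  is_year <- function(x) grepl("^[0-9]{4}$", x)
  # Keep only plain four-digit-year keys ("2019", not "1971/individuals_prov")
  years <- registered[is_year(registered)]
  # Assumption: a non-year request has no year-keyed sibling to inherit from
  if (length(years) == 0L || !is_year(version)) return(NULL)
  yrs <- sort(as.integer(years))
  not_later <- yrs[yrs <= as.integer(version)]
  # Newest sibling not later than the request, else the oldest registered one
  chosen <- if (length(not_later) > 0L) max(not_later) else min(yrs)
  as.character(chosen)
}

registered <- c("2016", "2019", "2023", "1971/individuals_prov")
pick_sibling(registered, "2021")  # "2019"
pick_sibling(registered, "2010")  # "2016"
```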

Generate bootstrap weights for a PUMF dataset

Description

For a **DuckDB-backed lazy table** (the typical case), bootstrap replicate weights are written directly into the DuckDB file as a separate table and exposed through a persistent VIEW that joins the main survey table with the BSW columns. The returned 'tbl' references this view, so all downstream dplyr operations have access to every replicate.

Usage

add_bootstrap_weights(
  tbl,
  weight_col,
  id_col = NULL,
  strata_cols = NULL,
  n_replicates = 500L,
  prefix = "CPBSW",
  bsw_table = NULL,
  seed = NULL,
  overwrite = FALSE
)

Arguments

tbl

A lazy 'dplyr::tbl()' returned by [get_pumf()], **or** an in-memory 'data.frame' / 'tibble'.

weight_col

Name of the column holding the survey weights (string, e.g. '"PWEIGHT"').

id_col

Optional name of a column that uniquely identifies each row (DuckDB path only). If 'NULL' (default), the registry 'bsw_join_key' is used when available; otherwise 'pumf_row_id' is added to the main table.

strata_cols

Optional character vector of column names to stratify on. Resampling is performed independently within each unique combination of stratum values, preserving stratum sample sizes across replicates. For LFS, defaults to 'c("SURVYEAR", "SURVMNTH")' so each month is resampled separately. For other surveys, use the registry 'bsw_strata' field or pass explicitly (e.g. province, age group). Pass 'character(0)' to suppress the LFS default and generate unstratified weights.

n_replicates

Number of bootstrap replicates to generate (default '500L').

prefix

Column-name prefix for replicate columns (default '"CPBSW"'). Columns are named 'prefix1', 'prefix2', ...

bsw_table

Name of the DuckDB table that stores the replicate weights (DuckDB path only). Defaults to 'NULL', which auto-names it 'paste0("pumf_bsw_", tolower(weight_col))' so separate calls with different weight columns do not overwrite each other.

seed

Optional integer seed for reproducibility.

overwrite

If the 'bsw_table' already exists in the DuckDB file, regenerate and overwrite it when 'TRUE'. When 'FALSE' (default) the existing table is reused silently – no computation is performed.

Details

For an **in-memory 'data.frame' or 'tibble'**, bootstrap weights are generated entirely in memory and the augmented data frame is returned.

Bootstrap weights are generated by the rescaled bootstrap: for each replicate a sample of n rows is drawn with replacement; the bootstrap weight for row i in replicate b is 'original_weight[i] * count[i,b]', where 'count[i,b]' is the number of times row i appeared in draw b.
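
The weight formula above can be illustrated with a minimal in-memory sketch. It covers the unstratified case only and is not the package's implementation; 'make_bsw()' is a hypothetical helper, and only the 'CPBSW' prefix and the formula itself come from this page.

```r
# Illustrative sketch of the documented weight formula; not canpumf's code.
make_bsw <- function(weights, n_replicates = 5L, seed = NULL, prefix = "CPBSW") {
  if (!is.null(seed)) set.seed(seed)
  n <- length(weights)
  bsw <- vapply(seq_len(n_replicates), function(b) {
    draw <- sample.int(n, size = n, replace = TRUE)  # n rows drawn with replacement
    counts <- tabulate(draw, nbins = n)              # count[i, b]
    weights * counts                                 # original_weight[i] * count[i, b]
  }, numeric(n))
  colnames(bsw) <- paste0(prefix, seq_len(n_replicates))
  bsw
}

w <- c(10, 20, 30, 40)
bsw <- make_bsw(w, n_replicates = 3L, seed = 42L)
colSums(bsw / w)  # each replicate's counts sum to the number of rows
```

Because every replicate draws exactly n rows, the counts in each column sum to n, and a row that was not drawn receives a bootstrap weight of zero in that replicate.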

**Incremental re-runs (DuckDB path):** when a BSW table already exists the call only does the work needed to satisfy the request:

* **More replicates** than stored (and no new rows): the additional replicate columns are appended; existing columns are kept.
* **New rows** in the main table (some rows have no weights yet): because a bootstrap replicate resamples the full population, added rows invalidate the existing weights of their resampling universe, so those weights are deleted and regenerated. Unstratified, this regenerates every row; when 'strata_cols' are in effect, only the strata that gained rows are regenerated and complete strata keep their existing weights.
* **Neither:** the stored weights are reused without recomputation.

Pass 'overwrite = TRUE' to force a full fresh regeneration regardless.

**Multiple weight columns (hierarchical data):** by default 'bsw_table' is named after 'weight_col' (e.g. '"pumf_bsw_wstpwgt"'), so calling the function twice with different weight columns (e.g. household weight and person weight) produces two independent BSW tables and two separate views without any conflict.

**Connection note (DuckDB path):** calling this function fully shuts down the DuckDB in-process instance held by 'tbl' (because a write connection requires exclusive access). The input 'tbl' and any other lazy tables backed by the same DuckDB file become invalid after the call. Use the returned tbl instead.

**Filtered input tbls (DuckDB path):** bootstrap weights always cover the complete physical survey table. If 'tbl' has dplyr 'filter()' operations applied, they are captured and automatically re-applied to the returned VIEW tbl so the visible rows match the original subset. Other operations ('select()', 'mutate()', etc.) are not replayed – they would interfere with the BSW columns – so apply them manually to the returned tbl if needed.

**ID column (DuckDB path):** a stable row identifier is needed to link the main table to the BSW table. If 'id_col' is 'NULL' (the default):

* The survey registry 'bsw_join_key' is used when available (e.g. '"PEFAMID"' for SFS 2016-2023) – no table modification needed.
* Otherwise a 'pumf_row_id' column (DuckDB 'rowid') is added to the main survey table. The 'ALTER TABLE ADD COLUMN' is O(1); the 'UPDATE' that fills the values is O(n).

Value

* **DuckDB path:** a lazy 'dplyr::tbl()' backed by a persistent DuckDB VIEW that contains all original survey columns plus the 'n_replicates' bootstrap weight columns, with any input 'filter()' operations re-applied.
* **In-memory path:** the input 'data.frame' / 'tibble' with bootstrap weight columns appended so that 'n_replicates' replicates are present. If the input already carries replicate columns for 'prefix', only the additional ones are generated (existing columns are preserved); when it already has at least 'n_replicates', the data frame is returned unchanged.

Examples


sfs <- get_pumf("SFS", "2019")
if (!is.null(sfs)) {
  sfs_bsw <- add_bootstrap_weights(sfs, weight_col = "PWEIGHT",
                                   n_replicates = 200L, seed = 42L)
  bsw_info(sfs_bsw)
  close_pumf(sfs_bsw)
}

Add a harmonised gender/sex column to an LFS table

Description

LFS introduced 'GENDER' (with values '"Men+"' / '"Women+"' / '"Non-binary persons"') to replace the binary 'SEX' variable ('"Male"' / '"Female"') starting in 2020. In any given row exactly one of the two columns is non-'NA'. 'add_lfs_GENDER_SEX()' coalesces them into a single harmonised column, recoding 'SEX' values to the 'GENDER' scale so the result is consistent across all LFS vintages.

Usage

add_lfs_GENDER_SEX(tbl)

Arguments

tbl

A lazy 'dplyr::tbl()' returned by [get_pumf()] for an LFS survey, optionally passed through [label_pumf_columns()].

Details

Works on both unlabelled tables (columns 'SEX' / 'GENDER', output column named 'GENDER_SEX') and labelled tables produced by [label_pumf_columns()] (columns '"Sex of respondent"' / '"Gender of respondent"', output column named '"Gender/sex of respondent"').

The mapping applied to 'SEX' / '"Sex of respondent"' when the gender column is 'NA':

'"Male"' → '"Men+"'
'"Female"' → '"Women+"'

The output column is inserted after 'GENDER' / '"Gender of respondent"' when present, or after 'SEX' / '"Sex of respondent"' otherwise.
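
The coalesce-and-recode rule can be illustrated on a small in-memory data frame. This is a base-R sketch under simplifying assumptions (plain character columns rather than the factor-labelled lazy DuckDB table the function actually works on); 'lfs_rows' and 'sex_to_gender' are hypothetical names.

```r
# Illustrative base-R sketch of the coalesce-and-recode rule; not canpumf's code.
lfs_rows <- data.frame(
  SEX    = c("Male", "Female", NA, NA),
  GENDER = c(NA, NA, "Women+", "Non-binary persons"),
  stringsAsFactors = FALSE
)

# Mapping from the binary SEX scale to the GENDER scale, as documented above
sex_to_gender <- c(Male = "Men+", Female = "Women+")

lfs_rows$GENDER_SEX <- ifelse(
  is.na(lfs_rows$GENDER),
  unname(sex_to_gender[lfs_rows$SEX]),  # recode SEX when GENDER is missing
  lfs_rows$GENDER                       # otherwise keep GENDER as-is
)
lfs_rows
```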

Value

The same lazy table with a new harmonised gender/sex column.

Examples


lfs <- get_pumf("LFS")   # NULL if StatCan is unreachable
if (!is.null(lfs)) {
  # Unlabelled
  lfs |> add_lfs_GENDER_SEX() |>
    dplyr::count(SEX, GENDER, GENDER_SEX) |> dplyr::collect()

  # Labelled
  lfs |> label_pumf_columns() |> add_lfs_GENDER_SEX() |>
    dplyr::count(`Sex of respondent`, `Gender of respondent`,
                 `Gender/sex of respondent`) |> dplyr::collect()

  close_pumf(lfs)
}

Add a date column to an LFS table

Description

Creates a date column set to the first day of the survey month and inserts it immediately after the survey month column. Works on both unlabelled tables (columns 'SURVYEAR' / 'SURVMNTH', date column named 'SURVDATE') and labelled tables produced by [label_pumf_columns()] (columns '"Survey year"' / '"Survey month"', date column named '"Survey date"').

Usage

add_lfs_SURVDATE(tbl)

Arguments

tbl

A lazy 'dplyr::tbl()' returned by [get_pumf()] for an LFS survey, optionally passed through [label_pumf_columns()].

Value

The same lazy table with a new date column positioned after the survey month column.
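
As a rough illustration of the transformation, the following base-R sketch builds a first-of-month date and places it right after the month column. It assumes integer year and month values in an in-memory data frame, which may differ from the stored representation in the DuckDB table; the 'PROV' column is included only as a hypothetical trailing column.

```r
# Illustrative sketch only; assumes integer SURVYEAR / SURVMNTH values.
survey <- data.frame(
  SURVYEAR = c(2023L, 2023L, 2024L),
  SURVMNTH = c(1L, 12L, 6L),
  PROV     = c("ON", "BC", "QC"),   # hypothetical extra column
  stringsAsFactors = FALSE
)

# First day of the survey month
survey$SURVDATE <- as.Date(sprintf("%04d-%02d-01", survey$SURVYEAR, survey$SURVMNTH))

# Move the new column so it sits immediately after SURVMNTH
pos <- match("SURVMNTH", names(survey))
cols <- append(setdiff(names(survey), "SURVDATE"), "SURVDATE", after = pos)
survey <- survey[, cols]
survey
```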

Examples


lfs <- get_pumf("LFS", "2023")   # NULL if StatCan is unreachable
if (!is.null(lfs)) {
  # Unlabelled
  lfs |> add_lfs_SURVDATE() |> dplyr::select(SURVYEAR, SURVMNTH, SURVDATE) |>
    dplyr::distinct() |> dplyr::collect()

  # Labelled
  lfs |> label_pumf_columns() |> add_lfs_SURVDATE() |>
    dplyr::select(`Survey year`, `Survey month`, `Survey date`) |>
    dplyr::distinct() |> dplyr::collect()

  close_pumf(lfs)
}

Summarise bootstrap weight tables present in a PUMF DuckDB database

Description

Queries the DuckDB file backing a PUMF lazy table for bootstrap weight tables created by [add_bootstrap_weights()] and returns a one-row-per-table summary tibble. Returns an empty tibble (invisibly) when no BSW tables are found.

Usage

bsw_info(tbl)

Arguments

tbl

A lazy 'dplyr::tbl()' returned by [get_pumf()] or by [add_bootstrap_weights()].

Value

A tibble with columns:

'weight_col': The weight column the BSW table was built from (matched back to the case used in the main survey table).
'bsw_table': Name of the DuckDB table storing the weights.
'view_name': Name of the DuckDB VIEW joining survey + BSW.
'view_exists': Whether the companion VIEW is present.
'n_replicates': Number of bootstrap replicate columns.
'size_mb': Estimated table size in megabytes (from DuckDB metadata; 'NA' when unavailable).

Examples


sfs <- get_pumf("SFS", "2019")
if (!is.null(sfs)) {
  sfs_bsw <- add_bootstrap_weights(sfs, weight_col = "PWEIGHT", seed = 1L)
  bsw_info(sfs_bsw)
  close_pumf(sfs_bsw)
}

Warn when French label coverage is partial

Description

Emits a warning when French files were clearly present (some label_fr values are non-NA) but more than the threshold proportion of variable labels is missing. Surveys with no French files at all (all label_fr values are NA) do not trigger the warning.

Usage

check_bilingual_coverage(metadata, threshold = 0.2)

Arguments

metadata

List from read_metadata().

threshold

Proportion of missing label_fr above which a warning is emitted (default 0.2).

Value

NULL invisibly.
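
The threshold rule can be sketched as follows. This is a simplified illustration that takes a bare vector of French labels rather than the metadata list from read_metadata(), whose structure is not described here; 'warn_partial_french()' is a hypothetical name and the warning text is invented for the example.

```r
# Illustrative sketch of the documented rule; not canpumf's code.
warn_partial_french <- function(label_fr, threshold = 0.2) {
  # No French labels at all: the survey has no French files, so stay silent
  if (all(is.na(label_fr))) return(invisible(NULL))
  missing_share <- mean(is.na(label_fr))
  if (missing_share > threshold) {
    warning(sprintf("French labels missing for %.0f%% of variables",
                    100 * missing_share))
  }
  invisible(NULL)
}

warn_partial_french(c(NA, NA, NA))                       # silent: no French files
warn_partial_french(c("a", "b", "c", "d", NA, "f"))      # silent: 1/6 is below 0.2
```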

Close the DuckDB connection backing a PUMF lazy table

Description

Disconnects the DuckDB connection associated with 'x'. 'x' may be either a lazy 'dplyr::tbl()' returned by [get_pumf()] (the connection embedded in the tbl is closed) or a DuckDB connection object returned by [get_pumf_connection()] (closed directly). After calling this function the table or connection can no longer be queried.

Usage

close_pumf(x)

Arguments

x

A lazy 'dplyr::tbl()' returned by [get_pumf()], or a DuckDB connection returned by [get_pumf_connection()]. 'NULL' is accepted and is a no-op, so 'close_pumf()' can be called unconditionally on a [get_pumf()] result that may be 'NULL' (e.g. when Statistics Canada was unreachable).

Details

All lazy tables and sibling modules opened from one [get_pumf()] call share a single connection, so a single 'close_pumf()' on any of them releases it.

Closing is only necessary when you need to release the file lock – for example, before calling 'get_pumf(..., refresh = TRUE)' on the same survey, or before writing to the DuckDB from another process. Read-only connections (the default) do not block other readers.

Value

Invisibly 'NULL'.

Examples


sfs <- get_pumf("SFS", "2019")
if (!is.null(sfs)) {
  # ... analysis ...
  close_pumf(sfs)
}

# Also accepts a raw connection from get_pumf_connection()
con <- get_pumf_connection("SHS", "2017")
if (!is.null(con)) {
  DBI::dbListTables(con)
  close_pumf(con)
}

Detect parseable metadata formats in a PUMF directory tree

Description

Walks the directory tree under pumf_dir and returns every recognised command-file format. Multiple formats may be present in one release; all are returned so that the caller can run all applicable parsers and merge results.

Usage

detect_formats(pumf_dir, sps_mask = NULL)

Arguments

pumf_dir

Top-level version directory (i.e. the directory that contains the extracted PUMF zip contents).

Value

A named list (possibly empty) whose elements are a subset of "lfs_csv", "cpss_csv", "sas_cards", "spss_split", and "spss_mono". Each element is either a single path string or (for spss_mono) a list(eng = ..., fra = ...) where fra is NULL when no French file is found.

Get a Statistics Canada PUMF dataset as a lazy DuckDB table

Description

Main entry point for the canpumf package. Downloads (if needed), parses metadata, applies bilingual labels, and returns a lazy 'dplyr::tbl()' backed by a DuckDB file in the cache directory. Subsequent calls reuse the cached DuckDB without re-downloading.

Usage

get_pumf(
  series = NULL,
  version = NULL,
  lang = "eng",
  cache_path = getOption("canpumf.cache_path", tempdir()),
  refresh = FALSE,
  redownload = FALSE,
  read_only = TRUE,
  registry = NULL,
  module = NULL,
  register_connection = getOption("canpumf.register_connection", TRUE),
  ...
)

Arguments

series

Survey series acronym, e.g. '"SFS"', '"CHS"', '"LFS"', '"Census"', '"CPSS"'. See [list_canpumf_collection()] for all supported series and versions.

version

Version string (e.g. '"2019"', '"2021 (individuals)"', '"2023-06"'). For series with a single version omit or pass 'NULL'.

lang

'"eng"' (default) or '"fra"'. Selects which set of labels to apply. Each language creates a separate DuckDB table (created lazily on first request).

cache_path

Root cache directory. Defaults to 'getOption("canpumf.cache_path", tempdir())'. Set persistently in '.Rprofile' with 'options(canpumf.cache_path = "<path>")'.

refresh

'FALSE' (default) reuses cached data. 'TRUE' clears the DuckDB table and metadata and rebuilds from the already-extracted raw files (does not re-download). '"auto"' is accepted for LFS only and downloads all available versions not yet in the database.

redownload

If 'TRUE', delete the cached zip and extracted files and re-download from StatCan before rebuilding. Implies 'refresh = TRUE'. Not valid with 'refresh = "auto"'.

read_only

Open the DuckDB connection in read-only mode (default 'TRUE'). Pass 'FALSE' to allow write access, e.g. to persist custom views or derived tables in the DuckDB file. Use [close_pumf()] to release the connection when done.

registry

Optional custom configuration created by [pumf_registry_entry()] (or [pumf_registry()]), used to parse and build a survey that is not in the built-in registry, or to override fields of one that is. Applied only when a build actually happens – on an already-imported survey it has no effect unless 'refresh = TRUE' is also passed (a message is emitted in that case). Not supported for LFS. For a survey not in [list_canpumf_collection()], deposit the raw files under '<cache_path>/<series>/<version>/' first (there is no download URL).

module

For multi-module surveys (several linked files in one DuckDB, e.g. GSS cycle 16 / "Aging and Social Support" 2002, whose 'MAIN', 'CG4', 'CG6' and 'CR' files join on 'RECID'), selects which module table to return. 'NULL' (default) returns the survey's primary module; for a multi-module survey a one-time message then lists the sibling modules and shows how to open one. Use [pumf_module()] to open a sibling module on the *same* connection so the two tbls are joinable. Not supported for LFS.

register_connection

If 'TRUE' (default), the DuckDB connection backing the returned tbl may appear in the RStudio Connections pane (subject to RStudio/duckdb settings). Pass 'FALSE' to suppress that registration – useful when opening and closing many connections programmatically (e.g. iterating over surveys in a notebook), where the pane would otherwise be spammed. Defaults to 'getOption("canpumf.register_connection", TRUE)', so you can disable it globally with 'options(canpumf.register_connection = FALSE)'.

...

Accepts deprecated parameter names ('pumf_series', 'pumf_version', 'pumf_cache_path', 'layout_mask', 'file_mask', 'guess_numeric', 'timeout', 'refresh_layout') with a warning.

Details

The LFS is treated specially: all versions share a single 'LFS.duckdb' database. Pass 'version = "YYYY"' (annual) or '"YYYY-MM"' (monthly). 'refresh = "auto"' downloads every available LFS version that is not yet in the database; this is only valid for LFS.

Value

A lazy 'dplyr::tbl()' backed by a DuckDB connection. Data values are pre-labeled as factors. Call 'dplyr::collect()' to materialise a local tibble, [label_pumf_columns()] to rename columns to their human-readable labels, or [close_pumf()] to release the connection. Returns 'invisible(NULL)' with an informative message if the data must be downloaded but Statistics Canada is unreachable.

Examples


# Download and open the SFS 2019 as a lazy DuckDB table.
# get_pumf() returns NULL if Statistics Canada is unreachable.
sfs <- get_pumf("SFS", "2019")
if (!is.null(sfs)) {
  dplyr::glimpse(sfs)

  # Collect a local tibble (here, the first 100 records)
  sfs_local <- sfs |>
    head(100) |>
    dplyr::collect()

  # Release the connection when done
  close_pumf(sfs)
}

# French labels (opened and released on its own connection)
sfs_fr <- get_pumf("SFS", "2019", lang = "fra")
if (!is.null(sfs_fr)) close_pumf(sfs_fr)

Get a read-write DuckDB connection to a PUMF database

Description

Runs the full pipeline and returns a raw read-write [DBI::DBIConnection-class]. Use this when you need direct SQL access — to persist custom views, join derived tables, or inspect DuckDB internals. For everyday analysis use [get_pumf()], which returns a safer read-only lazy 'dplyr::tbl()'.

Usage

get_pumf_connection(
  series = NULL,
  version = NULL,
  lang = "eng",
  cache_path = getOption("canpumf.cache_path", tempdir()),
  refresh = FALSE,
  redownload = FALSE,
  ...
)

Arguments

series

Survey series acronym, e.g. '"SFS"', '"Census"'.

version

Version string, e.g. '"2019"'. 'NULL' for single-version series.

lang

'"eng"' (default) or '"fra"'.

cache_path

Root cache directory. Defaults to 'getOption("canpumf.cache_path", tempdir())'.

refresh

If 'TRUE', rebuild from already-extracted files (no re-download).

redownload

If 'TRUE', re-download and rebuild from scratch.

...

Accepts deprecated parameter names ('pumf_series', 'pumf_version', 'pumf_cache_path') with a warning.

Value

A [DBI::DBIConnection-class] in read-write mode. Disconnect with 'DBI::dbDisconnect(con, shutdown = TRUE)' when done. For a safer read-only lazy table use [get_pumf()] instead. Returns 'invisible(NULL)' with an informative message if the data must be downloaded but Statistics Canada is unreachable.

Examples


con <- get_pumf_connection("SFS", "2019")  # NULL if StatCan is unreachable
if (!is.null(con)) {
  tables <- DBI::dbListTables(con)
  DBI::dbGetQuery(con, sprintf('SELECT COUNT(*) AS n FROM "%s"', tables[1]))
  DBI::dbDisconnect(con, shutdown = TRUE)
}

Rename PUMF table columns to human-readable variable labels

Description

Takes a lazy 'dplyr::tbl()' returned by [get_pumf()] and returns the same lazy table with column names replaced by the variable labels from the survey metadata (e.g. 'PHHSIZE' becomes '"Household size"'). Duplicate labels are disambiguated by appending ' (VAR_NAME)'.

Usage

label_pumf_columns(tbl)

Arguments

tbl

A lazy 'dplyr::tbl()' returned by [get_pumf()].

Details

The 'tbl' must have been produced by [get_pumf()]; the function reads survey provenance (series, version, cache path, language) from the underlying DuckDB connection. Use [pumf_var_labels()] to inspect the name-to-label mapping without renaming.
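
The renaming rule from the description (replace names by labels, disambiguate duplicates, leave unlabelled columns alone) can be sketched on a plain named vector. Whether the package suffixes every member of a duplicate group or only the later ones is not stated here, so suffixing all of them is an assumption; all variable names other than 'PHHSIZE' are hypothetical.

```r
# Illustrative sketch of the renaming rule; not canpumf's code.
labels <- c(PHHSIZE = "Household size", PAGE = "Age", PAGEGR = "Age", PUNLAB = NA)

# Flag labels that occur more than once (assumption: all members get a suffix)
dup <- !is.na(labels) & (duplicated(labels) | duplicated(labels, fromLast = TRUE))

new_names <- ifelse(
  is.na(labels),
  names(labels),                                   # no label: keep the column name
  ifelse(dup, paste0(labels, " (", names(labels), ")"), labels)
)
unname(new_names)
```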

Value

A lazy 'dplyr::tbl()' with column names replaced by human-readable variable labels. Columns with no metadata label are left unchanged.

Examples


sfs <- get_pumf("SFS", "2019")
if (!is.null(sfs)) {
  sfs_labeled <- label_pumf_columns(sfs)
  colnames(sfs_labeled)
  close_pumf(sfs_labeled)
}

Get Labour Force Survey PUMF data from a shared longitudinal DuckDB

Description

Manages a single 'LFS.duckdb' file that accumulates all downloaded LFS versions. Each call either retrieves already-loaded data or downloads, parses, labels, and appends a new version.

Usage

lfs_get_pumf(
  version = NULL,
  lang = "eng",
  cache_path = getOption("canpumf.cache_path", tempdir()),
  refresh = FALSE,
  redownload = FALSE,
  read_only = TRUE
)

Arguments

version

LFS version string ('"YYYY"' or '"YYYY-MM"'), or 'NULL' to report database state and return the full table.

lang

'"eng"' (default) or '"fra"'.

cache_path

Root cache directory.

refresh

'FALSE' (default), 'TRUE' (re-parse and re-label the specified version from the cached raw files), or '"auto"' (download all versions not yet in the database). 'refresh = TRUE' requires a non-NULL 'version'.

redownload

If 'TRUE', delete the cached zip and extracted content for the specified version and re-download from StatCan before rebuilding. Implies 'refresh = TRUE'. Requires a non-NULL 'version'.

read_only

Open the DuckDB connection in read-only mode (default 'TRUE'). Pass 'FALSE' to allow write access to the LFS DuckDB.

Details

**Version types**:

- '"YYYY"' (e.g. '"2023"'): annual file released by StatCan after year-end.
- '"YYYY-MM"' (e.g. '"2024-06"'): monthly file for the current year.

When an annual file for year Y is loaded and monthly files for that year are already in the database, the monthly rows are replaced (supersession). Conversely, if an annual for year Y is already loaded, requesting a monthly for that year returns the annual data filtered to that month without re-downloading.

**Connection note**: the returned 'tbl' holds an open DuckDB connection. Loading a second version (i.e. calling 'lfs_get_pumf' again while holding the first result) requires the first tbl's connection to be closed first. Call [close_pumf()] on the result, or 'dplyr::collect()' it, before the next call.

Value

A lazy 'dplyr::tbl()', or 'invisible(NULL)' when 'version = NULL' and no data has been loaded.

List available LFS PUMF versions

Description

Scrapes the Statistics Canada LFS PUMF publication page and returns a tibble of all available annual and monthly versions with their download URLs. Requires an internet connection. For the broader collection of all supported surveys see [list_canpumf_collection()].

Usage

list_available_lfs_pumf_versions()

Value

A tibble with columns 'Date' (human-readable label from the StatCan page), 'version' (a string of the form '"YYYY"' for annual versions or '"YYYY-MM"' for monthly versions), and 'url' (direct download link). If the StatCan website is unreachable the function returns an empty tibble (with those columns) and a warning rather than erroring.

Examples


lfs_versions <- list_available_lfs_pumf_versions()
tail(lfs_versions)

List Statistics Canada PUMF datasets supported by canpumf

Description

Returns a tibble of all survey series and versions for which canpumf has download wrappers. Scrapes the StatCan website to discover Census versions; other series are hard-coded. Requires an internet connection.

Usage

list_canpumf_collection()

Value

A tibble with columns 'Title', 'Acronym', 'Version', 'Survey Number', and 'url'. The 'url' column contains the download URL or '"(EFT)"' for versions distributed via the Research Data Centre (EFT only). Pass 'Acronym' and 'Version' to [get_pumf()] to download a dataset.

Examples


collection <- list_canpumf_collection()
# Show all SFS versions
collection[collection$Acronym == "SFS", c("Acronym", "Version")]

List the contents of the local canpumf cache

Description

Scans the cache directory and returns a tibble describing every downloaded PUMF version — which raw files, parsed metadata, and DuckDB tables are present — along with their disk sizes.

Usage

list_pumf_cache(cache_path = getOption("canpumf.cache_path", tempdir()))

Arguments

cache_path

Root cache directory. Defaults to 'getOption("canpumf.cache_path", tempdir())'.

Details

For LFS surveys the DuckDB is a single shared file ('LFS.duckdb') that accumulates all versions; its total size is reported in 'duckdb_mb' for every LFS row. Use [remove_pumf_cache()] to free disk space.

Value

A tibble with columns:

'series': Survey series acronym.
'version': Version string.
'has_raw': 'TRUE' if a zip or extracted data files are present.
'has_metadata': 'TRUE' if a parsed 'metadata/' directory exists.
'has_duckdb': 'TRUE' if a DuckDB table is built for this version.
'raw_mb': Disk size of raw files in MB (excluding metadata and DuckDB).
'duckdb_mb': Disk size of the DuckDB file in MB. For LFS this is the total shared 'LFS.duckdb' size, repeated for each version row.

Returns a zero-row tibble with the same column structure if the cache directory does not exist or is empty.

Examples


list_pumf_cache()
# With an explicit cache path:
list_pumf_cache(cache_path = file.path(tempdir(), "pumf_cache"))

Overview of all built-in registry entries

Description

Overview of all built-in registry entries

Usage

list_pumf_registry()

Value

A tibble with one row per registered '(series, version)' and columns summarising the key configuration: 'file_mask', 'layout_mask', 'bsw_join_key', and 'data_fixups' (comma-separated fixup types present).

Examples

list_pumf_registry()

Crawl the full Statistics Canada PUMF catalogue (experimental)

Description

Scrapes the live StatCan "Public use microdata" listing and follows each survey to its product page to discover every PUMF series, its editions, and direct-download URLs. This is an exploratory counterpart to [list_canpumf_collection()], which returns only the curated set of surveys canpumf has tested download wrappers for.

Usage

list_statcan_pumf_catalogue(
  prefer = names(.statcan_format_tokens),
  max_surveys = NULL,
  surveys = NULL,
  verbose = TRUE,
  refresh = FALSE,
  cache_path = getOption("canpumf.cache_path")
)

Arguments

prefer

Character vector of format tokens in order of preference; the default puts CSV / flat text ahead of statistical-package formats.

max_surveys

Optional integer: only crawl the first N surveys (useful for a quick look — a full crawl issues a few hundred requests).

surveys

Optional character vector of catalogue ids to restrict to.

verbose

If 'TRUE', print progress as each survey is crawled.

refresh

If 'FALSE' (the default), the crawl result is cached and reused — a full crawl is expensive (hundreds of requests). Within a session it is held in memory; a *full* crawl (no 'max_surveys'/'surveys') is also persisted to disk under 'cache_path' so it survives across sessions. Set 'TRUE' to re-scrape the live catalogue and replace both caches, e.g. to pick up a newly released survey.

cache_path

Directory for the cross-session catalogue cache ('pumf_catalogue.rds'). Defaults to 'getOption("canpumf.cache_path")'; when unset there is no durable cache and only the in-session cache is used. A persisted catalogue older than 'getOption("canpumf.catalogue_max_age_days", 30)' triggers a staleness warning suggesting 'refresh = TRUE'. If a live crawl fails (StatCan unreachable) the last persisted copy is returned with a warning.

Details

The StatCan markup is irregular and this crawler is best-effort: surveys distributed only by Electronic File Transfer (EFT) report 'url = "(EFT)"', and some products may not be parsed. When an edition is offered in several formats the one highest in 'prefer' is kept (CSV/flat-text first).

Value

A tibble with one row per discovered edition: 'catalogue_id', 'Acronym', 'SeriesTitle', 'Title', 'survey_url', 'edition', 'format', 'url', and 'product_url'. 'SeriesTitle' is the plain-language series name matching the acronym (the catalogue title with the edition-specific tail and "Public Use Microdata File" boilerplate stripped). 'Title' is edition-specific: StatCan's own per-edition catalogue title where it carries one; otherwise, for *umbrella* products whose catalogue title is only the series name (e.g. the consolidated General Social Survey, or a census year's individuals/hierarchical pair), a synthesised '"<series> — <edition>"', where the structural edition descriptor disambiguates colliding years (GSS '"Cycle 16 (2002)"', census '"2021 (individuals)"'). 'edition' remains the reference period/variant. 'survey_url' is the survey's catalogue overview page (the L2 page the crawler followed); 'url'/'product_url' point at the individual edition's download and product page. 'Acronym' and 'SeriesTitle' are derived from the title since StatCan exposes no such field; they match the curated [list_canpumf_collection()] values for most surveys but are best-effort (Census 'Acronym' is hard-coded to '"Census"'). Surveys with no downloadable file get a single row with 'url = "(EFT)"'.

Examples


# Quick look at the first 5 surveys
head(list_statcan_pumf_catalogue(max_surveys = 5))

Merge metadata from multiple parser outputs

Description

Sources are applied in priority order: spss_mono > spss_split > sas_cards > lfs_csv > cpss_csv. For each variable / code, the highest-priority source provides the English label; French labels fill in from lower-priority sources when missing. Conflicting missing ranges or English code labels emit warnings.

Usage

merge_metadata(parsed_list)

Arguments

parsed_list

Named list of parser outputs (each a list with elements variables, codes, and layout).

Value

Single merged canonical metadata list.
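
The French-label fill-in described above amounts to a coalesce across sources in priority order. The sketch below illustrates that idea on named label vectors only; real parser outputs are lists with variables, codes, and layout elements, and conflict warnings are omitted. 'coalesce_by_priority()' and the sample labels are hypothetical.

```r
# Illustrative sketch of priority-ordered fill-in; not canpumf's code.
priority <- c("spss_mono", "spss_split", "sas_cards", "lfs_csv", "cpss_csv")

# Hypothetical parser outputs reduced to named vectors (variable -> French label)
label_fr <- list(
  sas_cards = c(INCOME = "Revenu total", SEX = "Sexe"),
  spss_mono = c(INCOME = NA, PROV = "Province", SEX = "Sexe du repondant")
)

coalesce_by_priority <- function(sources, priority) {
  # Visit sources from highest to lowest priority
  ordered <- sources[intersect(priority, names(sources))]
  vars <- unique(unlist(lapply(ordered, names)))
  out <- setNames(rep(NA_character_, length(vars)), vars)
  for (src in ordered) {
    # Fill only variables that are still missing and that this source labels
    fill <- names(src)[!is.na(src) & is.na(out[names(src)])]
    out[fill] <- src[fill]
  }
  out
}

merged_fr <- coalesce_by_priority(label_fr, priority)
merged_fr
```

Here 'spss_mono' wins for 'SEX' because it has the higher priority, while 'INCOME' falls through to 'sas_cards' because the higher-priority source has no label for it.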

Check whether canonical metadata exists for a version directory

Description

Check whether canonical metadata exists for a version directory

Usage

metadata_exists(version_dir, bare = FALSE)

Arguments

version_dir

Path to the version directory.

bare

When 'TRUE', treat 'version_dir' as the metadata directory itself (used for per-module subdirectories), checking it directly for 'variables.csv' instead of a 'metadata/' child.

Value

Logical.

Open PUMF documentation in the browser

Description

Scans the cached version directory for PDF documentation files and opens them interactively. If no PDFs are found, falls back to small text files (filtering out large FWF data files by size). When multiple candidate files exist, an interactive menu lets you choose which to open, with "Open all" as the last option. In non-interactive mode the first preferred-language file is opened automatically.

Usage

open_pumf_documentation(
  series = NULL,
  version = NULL,
  lang = NULL,
  cache_path = getOption("canpumf.cache_path", tempdir()),
  pumf_series = NULL,
  pumf_version = NULL,
  pumf_cache_path = NULL
)

Arguments

series

Survey series acronym (e.g. '"SFS"', '"Census"'), **or** a lazy 'dplyr::tbl()' / DuckDB connection returned by [get_pumf()]. When a tbl or connection is supplied, 'version', 'cache_path', and 'lang' are read from the connection provenance; explicit arguments take precedence.

version

Version string (e.g. '"2019"', '"2021 (individuals)"'). For LFS, omit to open documentation for the most recently downloaded version. Ignored when 'series' is a tbl or connection.

lang

'"eng"' (default) or '"fra"'. Documentation files whose names match the requested language are sorted first. When 'series' is a connection and 'lang' is not supplied, the connection's language is used.

cache_path

Root cache directory. Defaults to 'getOption("canpumf.cache_path", tempdir())'.

pumf_series

Deprecated; use 'series'.

pumf_version

Deprecated; use 'version'.

pumf_cache_path

Deprecated; use 'cache_path'.

Details

After opening documentation, emits a message listing any manual registry overrides (sentinel values, forced-numeric columns, column swaps, etc.) that were applied at import so values can be interpreted correctly.

Value

Invisibly, the file path(s) of the opened documentation, or 'invisible(NULL)' when no documentation is found or data has not been downloaded yet.

Examples

if (interactive()) {
# Open by series and version
open_pumf_documentation("SFS", "2019")

# Open from an existing tbl (reads provenance automatically)
sfs <- get_pumf("SFS", "2019")
open_pumf_documentation(sfs)
close_pumf(sfs)

# French documentation
open_pumf_documentation("SFS", "2019", lang = "fra")
}

Parse a CPSS variables.csv into canonical metadata

Description

The CPSS variables.csv is a single bilingual CSV. Variable rows have the Variable column filled; code rows have Variable empty and the Code column filled.

Usage

parse_cpss_csv(variables_path, encoding = "Latin1")

Arguments

variables_path

Path to the CPSS variables.csv file.

encoding

File encoding (default "Latin1").

Value

Named list with elements variables, codes, and layout (always NULL for CPSS CSV data).
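
The row convention described above (variable rows versus code rows) can be sketched as follows. The toy CSV is invented: only the Variable and Code columns are named in this documentation, and the Label column is an assumption made for the example.

```r
# Toy stand-in for a CPSS variables.csv (hypothetical content)
csv_text <- c(
  "Variable,Code,Label",
  "SEX,,Sex of respondent",
  ",1,Male",
  ",2,Female",
  "AGEGRP,,Age group",
  ",1,15 to 24"
)
raw <- read.csv(text = csv_text, colClasses = "character",
                stringsAsFactors = FALSE)

# Variable rows have the Variable column filled
is_var <- nzchar(raw$Variable)
# Carry each variable name forward onto the code rows that follow it
owner <- raw$Variable[is_var][cumsum(is_var)]

variables <- data.frame(name = raw$Variable[is_var],
                        label = raw$Label[is_var],
                        stringsAsFactors = FALSE)
codes <- data.frame(name = owner[!is_var],
                    val = raw$Code[!is_var],
                    label = raw$Label[!is_var],
                    stringsAsFactors = FALSE)
```

The sketch assumes the file starts with a variable row; a real file would also need the bilingual label columns split out.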

Parse an LFS codebook CSV into canonical metadata

Description

The LFS codebook is a single bilingual CSV with rows alternating between variable definitions (Field_Champ filled) and code values (Field_Champ NA).

Usage

parse_lfs_codebook(codebook_path, encoding = "CP1252")

Arguments

codebook_path

Path to the LFS codebook.csv file.

encoding

File encoding (default "CP1252").

Value

Named list with elements variables, codes, and layout (always NULL for LFS CSV data).

Parse a StatCan bilingual PDF frequency codebook

Description

Extracts variable names/labels and code-value labels from the bilingual PDF codebook shipped with surveys such as CPSS 1, which (unlike CPSS 2-6) ships no machine-readable variables.csv. The codebook lists each variable as a “Variable Name:/Concept:” block followed by an “Answer Categories” frequency table whose Code column supplies the value labels. Positions in the codebook are not used; this parser produces only variables and codes (no layout).

Usage

parse_pdf_codebook(eng_pdf, fra_pdf = NULL)

Arguments

eng_pdf

Path to the English codebook PDF.

fra_pdf

Optional path to the French codebook PDF.

Value

Named list with elements variables, codes, layout (always NULL), or NULL if no variable blocks were found.

Parse a StatCan PDF Data Dictionary for variable and code labels

Description

Extracts variable long-names and code-value labels from the bilingual PDF data dictionaries shipped with some older StatCan PUMF releases (e.g. SFS 1999). Positions in the PDF do not match the PUMF flat file; this parser produces only variables and codes (no layout).

Usage

parse_pdf_dictionary(eng_pdf, fra_pdf = NULL)

Arguments

eng_pdf

Path to the English Dictionary.pdf.

fra_pdf

Optional path to the French Dictionnaire.pdf.

Value

Named list with elements variables, codes, layout (always NULL).

Parse a set of SAS/SPSS reading-card files for PUMF metadata

Description

Handles surveys where metadata is stored in reading-card files: .lay (column positions), .lbe (English variable labels), .cde (English value labels), .mvs (missing values), with optional .lbf / .cdf for French.

Usage

parse_sas_cards(cards_dir, layout_mask = NULL, encoding = "Latin1")

Arguments

cards_dir

Path to the directory containing the reading-card files.

layout_mask

Optional string; used to filter when multiple files share the same extension (e.g. SHS 2017 has separate interview and diary files).

encoding

Character encoding (e.g. "Latin1", "CP1252").

Details

The .lay file supports two sub-formats detected automatically:

Reading-card: Lines like NAME start - end (A)
SAS input: Lines like @pos NAME $CHAR6.

Value

A list with elements variables (name, label_en, label_fr, type, missing_low, missing_high), codes (name, val, label_en, label_fr), and layout (name, start, end; NULL when no .lay file is found).
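
As a rough illustration of the two .lay sub-formats listed under Details, the sketch below turns one line of either shape into a name/start/end row. The function parse_lay_line_sketch and its regular expressions are assumptions based only on the two example lines quoted above; the package's own patterns may differ.

```r
# Hypothetical single-line parser for the two layout shapes
parse_lay_line_sketch <- function(line) {
  # Reading-card shape: NAME start - end (A)
  card <- regmatches(line, regexec("^\\s*(\\w+)\\s+(\\d+)\\s*-\\s*(\\d+)",
                                   line, perl = TRUE))[[1]]
  if (length(card) == 4) {
    return(data.frame(name = card[2], start = as.integer(card[3]),
                      end = as.integer(card[4]), stringsAsFactors = FALSE))
  }
  # SAS input shape: @pos NAME $CHAR6. (the informat width gives the field length)
  sas <- regmatches(line, regexec("^\\s*@(\\d+)\\s+(\\w+)\\s+\\$?[A-Za-z]*(\\d+)\\.",
                                  line, perl = TRUE))[[1]]
  if (length(sas) == 4) {
    start <- as.integer(sas[2])
    width <- as.integer(sas[4])
    return(data.frame(name = sas[3], start = start,
                      end = start + width - 1L, stringsAsFactors = FALSE))
  }
  NULL  # line matches neither shape
}

# Invented example lines, one per sub-format
layout <- do.call(rbind, lapply(
  c("PROV 1 - 2 (A)", "@3 CASEID $CHAR6."),
  parse_lay_line_sketch
))
```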

Parse a monolithic SPSS command file for PUMF metadata

Description

Handles Census-style monolithic SPSS files where a single .sps file contains DATA LIST, FORMATS, VARIABLE LABELS, VALUE LABELS, and optionally MISSING VALUES sections. Supports both the 2021 style (single-quoted labels, + continuation, /VAR headers in VALUE LABELS) and the 2016 style (double-quoted labels, / on its own line separating variable groups).

Usage

parse_spss_mono(eng_sps_path, fra_sps_path = NULL, encoding = "Latin1")

Arguments

eng_sps_path

Path to the English .sps file.

fra_sps_path

Optional path to the French .sps file. When provided, French labels are joined onto the canonical metadata. When NULL (default), all label_fr values are NA.

encoding

Character encoding of the file(s), e.g. "Latin1", "CP1252", or "UTF-8".

Value

Parse an SPSS `.sav` data file for embedded metadata

Description

Uses haven::read_sav() to extract variable labels, value labels, and SPSS format codes (which give type and decimal precision) without loading any data rows. This is the primary metadata source for surveys such as CIS 2016/2017 that ship only a .sav file and no separate SPSS command files.

Usage

parse_spss_sav(sav_path)

Arguments

sav_path

Path to the .sav file.

Value

Named list with elements variables, codes, and layout (always NULL – .sav files embed metadata but use a binary record format that does not translate to start/end positions).

Parse a set of SPSS split command files for PUMF metadata

Description

Handles surveys where metadata is spread across separate files: *_i.sps (layout), *vare.sps (English variable labels), *vale.sps (English value labels), *miss.sps (missing values), with optional *varf.sps / *valf.sps for French.

Usage

parse_spss_split(layout_dir, layout_mask = NULL, encoding = "Latin1")

Arguments

layout_dir

Path to the directory containing the SPSS split files.

layout_mask

Optional string or regex; passed to find_unique_layout_file() to disambiguate when multiple sets of split files exist (e.g. SFS has both EFAM_PUMF_* and bsweights_pumf_* files in the same directory).

encoding

Character encoding, e.g. "Latin1" or "CP1252".

Value

Build a labeled DuckDB table for a PUMF version

Description

Reads the canonical metadata from 'metadata/', reads the raw data file, optionally joins bootstrap weights, applies code labels as factors, converts numeric columns, and writes to a '.duckdb' file.

Usage

pumf_build_duckdb(
  version_dir,
  series,
  version,
  lang = "eng",
  layout_mask = NULL,
  file_mask = NULL,
  refresh = FALSE,
  db_path = NULL,
  meta_subdir = NULL,
  data_fixups = NULL,
  bsw_override = NULL
)

Arguments

version_dir

Path returned by [pumf_locate_or_download()].

series

Survey series acronym, e.g. '"SFS"'.

version

Version string, e.g. '"2019"'.

lang

'"eng"' (default) or '"fra"'.

layout_mask

Optional layout mask; used in the DuckDB table name.

file_mask

Optional regex to select the data file. Overrides registry.

refresh

If 'TRUE', drop and rewrite the DuckDB table.

db_path

Optional explicit path to the DuckDB file. Defaults to '<version_dir>/<series>_<version>.duckdb'. Multi-module surveys pass one shared path so every module table lands in the same file.

meta_subdir

Optional metadata subdirectory under 'metadata/' to read for this build. 'NULL' (default) uses 'metadata/' (the primary module); secondary modules pass their module id so 'metadata/<id>/' is read.

data_fixups

Optional 'data_fixups' list overriding the registry entry's for this build. Used by secondary modules, which supply their own complete fixup set (e.g. 'force_numeric'), replacing the entry's primary fixups.

bsw_override

Optional list of bootstrap-weight config ('bsw_mask', 'bsw_file_mask', 'bsw_join_key', 'bsw_drop_cols', 'bsw_strata') overriding the registry entry's BSW config for this build. Each module of a multi-module survey joins its own bootstrap weights (or none), so the caller passes that module's BSW config; an override whose fields are all 'NULL' means "this module has no bootstrap weights".

Details

Skips re-building if the named table already exists in the DuckDB file and 'refresh = FALSE'. Passing 'refresh = TRUE' drops and rewrites the table without re-downloading or re-extracting raw data.

Returns the db path and table name invisibly. Call [pumf_open_duckdb()] to open a read-only connection and get a lazy 'dplyr::tbl()'. Keeping Stage 3 and connection-opening separate prevents DuckDB file-lock conflicts when building multiple language tables for the same survey in one session.

Value

Invisibly, a named list with 'db_path' and 'table_name'.
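
The labelling step ("applies code labels as factors, converts numeric columns") can be pictured with a small base R sketch. The data, the codes table and the rule for unlabelled values are all assumptions made for illustration; how the package treats values without a code label is not stated here.

```r
# Toy raw data and a codes table in the canonical shape (name, val, label_en)
raw <- data.frame(SEX = c("1", "2", "2", "9"),
                  INCOME = c("52000", "61000", "48000", "75000"),
                  stringsAsFactors = FALSE)
codes <- data.frame(name = c("SEX", "SEX"),
                    val = c("1", "2"),
                    label_en = c("Male", "Female"),
                    stringsAsFactors = FALSE)

labelled <- raw
for (var in unique(codes$name)) {
  lookup <- codes[codes$name == var, ]
  # Values without a code label (here "9") become NA in this sketch
  labelled[[var]] <- factor(raw[[var]], levels = lookup$val,
                            labels = lookup$label_en)
}
# Columns without code labels are converted to numeric in this sketch
labelled$INCOME <- as.numeric(raw$INCOME)
```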

Locate or download a PUMF version directory

Description

Ensures the version directory exists and its zip has been extracted. With 'refresh = TRUE', clears the DuckDB file(s) and 'metadata/' subdirectory so that Stages 2 and 3 re-run, but does **not** re-download or re-extract raw data.

Usage

pumf_locate_or_download(
  series,
  version,
  cache_path = getOption("canpumf.cache_path", tempdir()),
  refresh = FALSE,
  redownload = FALSE
)

Arguments

series

Survey series acronym, e.g. '"SFS"'.

version

Version string, e.g. '"2019"' or '"2021 (individuals)"'.

cache_path

Root cache directory. Defaults to 'getOption("canpumf.cache_path", tempdir())'.

refresh

If 'TRUE', delete the '.duckdb' file and 'metadata/' dir so the downstream stages re-run. Raw zip and extracted files are untouched.

Details

For EFT-only surveys (older Census years), stops with an informative error asking the user to deposit the zip manually.

Value

The version directory path, invisibly.

Download and parse PUMF metadata without building a DuckDB table

Description

Runs Stage 1 (locate or download) and Stage 2 (parse metadata) and returns the full bilingual canonical metadata. Both 'label_en' and 'label_fr' columns are always returned regardless of language. This is useful for inspecting variable definitions and code labels before loading data with [get_pumf()].

Usage

pumf_metadata(
  series,
  version,
  cache_path = getOption("canpumf.cache_path", tempdir()),
  refresh = FALSE,
  redownload = FALSE,
  registry = NULL
)

Arguments

series

Survey series acronym, e.g. '"SFS"', '"LFS"', '"Census"'.

version

Version string, e.g. '"2019"', '"2021 (individuals)"'.

cache_path

Root cache directory. Defaults to 'getOption("canpumf.cache_path", tempdir())'.

refresh

If 'TRUE', re-parse metadata from the already-extracted raw command files (does not re-download).

redownload

If 'TRUE', delete the cached zip and extracted files and re-download from StatCan before re-parsing. Implies 'refresh = TRUE'.

registry

Optional custom configuration created by [pumf_registry_entry()] (or [pumf_registry()]) to drive metadata parsing for a survey not in the built-in registry, or to override fields of one that is. Not supported for LFS.

Value

A named list with three elements:

'variables': Tibble with columns 'name', 'label_en', 'label_fr', 'type', 'decimals', 'missing_low', 'missing_high'.
'codes': Tibble with columns 'name', 'val', 'label_en', 'label_fr', mapping numeric codes to their labels.
'layout': Tibble with columns 'name', 'start', 'end' for fixed-width data files; 'NULL' for CSV-format surveys.

Returns 'invisible(NULL)' with an informative message if the data must be downloaded but Statistics Canada is unreachable.

Examples


meta <- pumf_metadata("SFS", "2019")
if (!is.null(meta)) {
  meta$variables
  meta$codes[meta$codes$name == "PEFAMID", ]
}

Open a sibling module of a multi-module survey

Description

Some surveys ship several linked fixed-width files that share a respondent key (e.g. GSS cycle 16, "Aging and Social Support", 2002, whose MAIN, CG4, CG6 and CR files all join on 'RECID', with the person weight 'WGHT_PER' living only in MAIN). 'get_pumf()' returns the survey's primary module; 'pumf_module()' returns one of its sibling modules **on the same DuckDB connection**, so the two tbls are joinable on the shared key without opening a second connection.

Usage

pumf_module(tbl, module)

Arguments

tbl

A lazy tbl returned by [get_pumf()] for a multi-module survey.

module

Name of the module to open (e.g. '"CG4"'). See the survey's registry entry for available module ids.

Value

A lazy 'dplyr::tbl()' for the requested module, backed by the same connection as 'tbl'.

Examples


main <- get_pumf("GSS", "Cycle 16 (2002)") # primary module (MAIN), has WGHT_PER
if (!is.null(main)) {
  cg4 <- pumf_module(main, "CG4")          # caregiving module, same connection
  dplyr::left_join(main, cg4, by = "RECID")
  close_pumf(main)
}

Open a DuckDB table as a lazy dplyr tbl

Description

Opens a connection to the '.duckdb' file produced by [pumf_build_duckdb()] and returns a lazy 'dplyr::tbl()'. Use [close_pumf()] to release the connection when done.

Usage

pumf_open_duckdb(db_path, table_name, read_only = TRUE)

Arguments

db_path

Path to the '.duckdb' file.

table_name

Name of the table to open.

read_only

Open in read-only mode (default 'TRUE'). Pass 'FALSE' to allow write operations on the DuckDB file (e.g. to add custom views).

Value

A lazy 'dplyr::tbl()'.

Parse all metadata from a PUMF version directory

Description

Detects every parseable command-file format in version_dir, runs all applicable parsers, merges the results into the canonical schema, and writes metadata/variables.csv, metadata/codes.csv (and optionally metadata/layout.csv) under version_dir.

Usage

pumf_parse_metadata(
  version_dir,
  layout_mask = NULL,
  metadata_encoding = NULL,
  refresh = FALSE,
  meta_subdir = NULL
)

Arguments

version_dir

Path to the extracted version directory.

layout_mask

Optional string to disambiguate when multiple command-file sets coexist in one directory (e.g. "CDN" for Census); passed through to parse_spss_split and parse_sas_cards.

refresh

If TRUE, re-parse even if cached metadata exists.

Details

Idempotent: skips parsing if metadata/variables.csv already exists and refresh = FALSE.

Value

metadata_dir path invisibly.

Inspect a survey's registry configuration

Description

Returns the resolved configuration entry for a '(series, version)' pair: the built-in registry entry when one exists, otherwise an all-default entry. Useful for understanding the parsing strategy and overrides applied to a survey, and as a template for [pumf_registry_entry()].

Usage

pumf_registry(series, version)

Arguments

series

Survey series acronym, e.g. '"SFS"'.

version

Version string, e.g. '"2019"'.

Value

A classed '"pumf_registry_entry"' list of all configuration fields.

Examples

pumf_registry("SFS", "2019")

Construct a custom PUMF registry entry

Description

Builds a survey-configuration patch that can be passed to [get_pumf()] (or [pumf_metadata()]) via the 'registry' argument to drive parsing and building for a survey that is not in the built-in registry, or to override specific fields of one that is.

Usage

pumf_registry_entry(
  layout_mask = NULL,
  bsw_mask = NULL,
  bsw_file_mask = NULL,
  bsw_join_key = NULL,
  bsw_drop_cols = NULL,
  bsw_strata = NULL,
  file_mask = NULL,
  data_encoding = NULL,
  metadata_encoding = NULL,
  data_fixups = NULL,
  bundled_eng_sps = NULL,
  bundle_source = NULL,
  bundle_sps_mask = NULL,
  doc_mask = NULL,
  ...
)

Arguments

layout_mask

SPSS/SAS command-file disambiguator for split-file surveys; also becomes part of the DuckDB table name when set.

bsw_mask, bsw_file_mask, bsw_join_key, bsw_drop_cols, bsw_strata

Bootstrap weight join configuration.

file_mask

Regex selecting the data file (its extension also decides CSV vs fixed-width).

data_encoding, metadata_encoding

Encoding overrides (default '"CP1252"' in the pipeline).

data_fixups

A named list of pre-label fixups: any of 'str_pad', 'rename', 'cols_swap', 'na_values', 'force_numeric', 'force_character', 'force_integer', 'force_bigint', 'codes_supplement', 'missing_supplement', 'labels_supplement'. The 'force_character'/'force_integer'/'force_bigint' fields take character vectors of variable names and override the DuckDB storage type (VARCHAR / INTEGER / BIGINT) so geographic codes keep leading zeros and large IDs are not lost; a variable may appear in at most one 'force_*' set.

bundled_eng_sps, bundle_source, bundle_sps_mask, doc_mask

Advanced bundled-archive and documentation options.

...

Reserved; passing any unrecognised field name raises an error.

Details

Only the arguments you actually supply are recorded; unspecified fields fall back to the built-in entry (when overriding a known survey) or to the pipeline defaults (for a new survey). This makes the result a *patch* rather than a full replacement. Use [pumf_registry()] to inspect an existing entry as a starting template.

The custom registry covers parsing and building configuration only; it does not provide a download URL. For a survey not in [list_canpumf_collection()], deposit the raw zip (or extracted files) under '<cache_path>/<series>/<version>/' first, then call 'get_pumf(series, version, registry = ...)'.
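
The patch semantics described above resemble layering named lists, which base R's utils::modifyList() does. The sketch below is only an analogy under that assumption: the three lists are invented (field names are taken from the usage above, values are made up apart from the '"CP1252"' pipeline default), and the package may resolve entries differently.

```r
# Hypothetical layers: pipeline defaults, a built-in entry, and a user patch
pipeline_defaults <- list(data_encoding = "CP1252", metadata_encoding = "CP1252")
builtin_entry <- list(file_mask = "SURVEY\\.txt", layout_mask = "EFAM")
patch <- list(file_mask = "DATA\\.csv")  # only the field the user supplied

# Later lists win: defaults < built-in entry < user patch
resolved <- utils::modifyList(
  utils::modifyList(pipeline_defaults, builtin_entry),
  patch
)
```

Only 'file_mask' changes in the result; 'layout_mask' and both encodings keep the values from the lower layers.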

Value

A classed '"pumf_registry_entry"' list containing only the supplied fields.

Examples

## Not run: 
# New CSV survey not yet in the registry (raw files already in the cache):
entry <- pumf_registry_entry(
  file_mask   = "DATA\\.csv",
  data_fixups = list(force_numeric = "WEIGHT"))
get_pumf("NEWSURVEY", "2025", registry = entry)

## End(Not run)

List all registered survey keys

Description

List all registered survey keys

Usage

pumf_registry_keys()

Value

character vector of '"series/version"' keys

Look up survey registry configuration

Description

Returns the configuration entry for a given survey series and version, or 'NULL' if the survey falls back to auto-detection.

Usage

pumf_registry_lookup(series, version)

Arguments

series

survey series acronym (e.g. '"SFS"')

version

survey version string (e.g. '"2019"')

Value

named list of configuration fields, or 'NULL' if not in registry

Resolve version aliases

Description

Canonicalises user-supplied version strings for Census of Population. Any string starting with a four-digit year is parsed flexibly: the file type is detected by grepping for "hierarchical", "household", or "famil" (defaulting to "individuals"), and CMA vs provincial by grepping for "cma". The registry is then probed to find the correct canonical format for that year (e.g. '"1971/households_cma"', '"1986/households"', or '"2001 (households)"').

Usage

pumf_resolve_version(series, version)

Arguments

series

survey series acronym

version

raw version string supplied by the caller, or 'NULL'

Details

Examples of accepted inputs (case-insensitive keywords):

'"2021"' -> '"2021 (individuals)"'
'"1971"' -> '"1971/individuals_prov"'
'"1971 CMA"' -> '"1971/individuals_cma"'
'"1971 households CMA"' -> '"1971/households_cma"'
'"1986 families"' -> '"1986/families"'

Value

canonical version string (or 'NULL' if 'version' was 'NULL')
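
The keyword detection described above can be sketched in a few lines. The helper detect_census_parts_sketch is hypothetical and stops short of probing the registry, so it returns the detected parts rather than a canonical version string.

```r
# Hypothetical keyword detection for Census version strings
detect_census_parts_sketch <- function(version) {
  v <- tolower(version)
  # Only strings starting with a four-digit year are handled
  year <- regmatches(v, regexpr("^[0-9]{4}", v))
  if (length(year) == 0) return(NULL)
  type <- if (grepl("hierarchical", v)) {
    "hierarchical"
  } else if (grepl("household", v)) {
    "households"
  } else if (grepl("famil", v)) {
    "families"
  } else {
    "individuals"  # default when no file-type keyword is present
  }
  list(year = year, type = type, cma = grepl("cma", v))
}

parts <- detect_census_parts_sketch("1971 households CMA")
```

Turning these parts into '"1971/households_cma"' or '"2001 (households)"' would still require checking which format the registry uses for that year.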

Run the full three-stage PUMF pipeline for one survey version

Description

Convenience wrapper that:

1. Looks up the survey registry entry for '(series, version)'.
2. Calls [pumf_locate_or_download()] (Stage 1).
3. Calls [pumf_parse_metadata()] (Stage 2) with the registry's 'layout_mask' and 'metadata_encoding'.
4. Calls [pumf_build_duckdb()] (Stage 3).
5. Returns a lazy 'dplyr::tbl()' via [pumf_open_duckdb()].

Usage

pumf_run_pipeline(
  series,
  version,
  lang = "eng",
  cache_path = getOption("canpumf.cache_path", tempdir()),
  refresh = FALSE,
  redownload = FALSE,
  read_only = TRUE
)

Arguments

series

Survey series acronym, e.g. '"SFS"' or '"CHS"'.

version

Version string, e.g. '"2019"' or '"2021 (individuals)"'.

lang

'"eng"' (default) or '"fra"'.

cache_path

Root cache directory. Defaults to 'getOption("canpumf.cache_path", tempdir())'.

refresh

If 'TRUE', clear DuckDB and metadata and rebuild from already-extracted raw data. Does **not** re-download.

redownload

If 'TRUE', delete the zip and all extracted content and re-download from StatCan before rebuilding. Implies 'refresh = TRUE'.

read_only

Open the DuckDB in read-only mode (default 'TRUE').

Details

Each stage is idempotent: subsequent calls reuse cached results unless 'refresh = TRUE'.

Value

A lazy 'dplyr::tbl()' backed by a DuckDB connection.

Retrieve variable labels as a tibble

Description

Returns a tibble mapping short coded column names to their bilingual human-readable variable labels. Use this as a quick reference without renaming the table itself; to rename, use [label_pumf_columns()].

Usage

pumf_var_labels(tbl)

Arguments

tbl

A lazy 'dplyr::tbl()' returned by [get_pumf()].

Value

A tibble with columns 'name' (coded column name), 'label_en' (English label), and 'label_fr' (French label). Rows follow survey-metadata order.

Examples


sfs <- get_pumf("SFS", "2019")
if (!is.null(sfs)) {
  pumf_var_labels(sfs)
  close_pumf(sfs)
}

Read canonical PUMF metadata from CSV files

Description

Read canonical PUMF metadata from CSV files

Usage

read_metadata(metadata_dir)

Arguments

metadata_dir

Path to the metadata/ directory.

Value

A list with elements variables, codes, and layout (NULL when no layout.csv is present).

Remove bootstrap weight tables and views from a PUMF DuckDB database

Description

Drops the bootstrap weight table(s) created by [add_bootstrap_weights()] and their companion VIEWs from the DuckDB file. When all BSW tables have been removed and the main survey table has a 'pumf_row_id' column (added automatically by [add_bootstrap_weights()] when no natural key was available), that column is also dropped.

Usage

remove_bootstrap_weights(tbl, weight_col = NULL)

Arguments

tbl

A lazy 'dplyr::tbl()' returned by [get_pumf()] or by [add_bootstrap_weights()].

weight_col

Name of the weight column whose BSW table should be removed (e.g. '"PWEIGHT"'). If 'NULL' (default), **all** bootstrap weight tables (and their companion VIEWs) are removed.

Details

Like [add_bootstrap_weights()], this function requires brief exclusive write access: the read-only connection backing 'tbl' is shut down, the tables are dropped, and a fresh read-only connection is returned.

Value

A lazy 'dplyr::tbl()' backed by the original physical survey table (without BSW columns), with a fresh read-only DuckDB connection.

Examples


sfs <- get_pumf("SFS", "2019")
if (!is.null(sfs)) {
  sfs_bsw <- add_bootstrap_weights(sfs, weight_col = "PWEIGHT", seed = 1L)
  # Remove only the PWEIGHT BSW table
  sfs_clean <- remove_bootstrap_weights(sfs_bsw, weight_col = "PWEIGHT")
  close_pumf(sfs_clean)
}

Remove a PUMF version from the local cache

Description

Deletes the DuckDB table (and optionally the raw zip and extracted files) for one cached PUMF version.

Usage

remove_pumf_cache(
  series,
  version,
  keep_raw = TRUE,
  cache_path = getOption("canpumf.cache_path", tempdir())
)

Arguments

series

Survey series acronym, e.g. '"SFS"' or '"LFS"'.

version

Version string, e.g. '"2019"' or '"2023-06"'.

keep_raw

If 'TRUE' (default), keep the raw zip and extracted data so [get_pumf()] can rebuild without re-downloading. If 'FALSE', delete everything including raw files.

cache_path

Root cache directory. Defaults to 'getOption("canpumf.cache_path", tempdir())'.

Details

With the default 'keep_raw = TRUE', only the DuckDB and parsed 'metadata/' are removed; the raw zip and extracted data are left intact so that [get_pumf()] can rebuild without re-downloading. Set 'keep_raw = FALSE' to delete everything, freeing the full disk space.

For LFS surveys the DuckDB is shared across all versions. Removing one version deletes only that version's rows from the shared 'LFS.duckdb'; if it was the last loaded version the shared database file is also deleted.

Value

Invisibly 'NULL'.

Examples


# Remove only DuckDB and metadata, keep raw files for quick rebuild:
remove_pumf_cache("SFS", "2019")

# Remove everything including raw files:
remove_pumf_cache("SFS", "2019", keep_raw = FALSE)

Select language-specific labels from a metadata list

Description

Picks label_en or label_fr based on lang and adds a label column to variables and codes. When lang = "fra" and individual label_fr values are NA, those entries fall back to label_en and a warning is emitted listing the affected variable names.

Usage

select_labels(metadata, lang = "eng")

Arguments

metadata

List from read_metadata().

lang

"eng" (default) or "fra".

Value

Modified metadata list with an additional label column in variables and codes.
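
A minimal sketch of the fallback behaviour described above, applied to a variables table only. The function select_labels_sketch and the wording of its warning are hypothetical; the package's own message may read differently.

```r
# Hypothetical version of the label selection for a variables table
select_labels_sketch <- function(variables, lang = "eng") {
  if (lang == "fra") {
    missing_fr <- is.na(variables$label_fr)
    if (any(missing_fr)) {
      warning("No French label for: ",
              paste(variables$name[missing_fr], collapse = ", "),
              "; falling back to English.", call. = FALSE)
    }
    # Use the French label where present, the English one otherwise
    variables$label <- ifelse(missing_fr, variables$label_en, variables$label_fr)
  } else {
    variables$label <- variables$label_en
  }
  variables
}

vars <- data.frame(name = c("AGE", "SEX"),
                   label_en = c("Age", "Sex"),
                   label_fr = c(NA, "Sexe"),
                   stringsAsFactors = FALSE)
fra <- suppressWarnings(select_labels_sketch(vars, lang = "fra"))
```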

Write canonical PUMF metadata to CSV files

Description

Write canonical PUMF metadata to CSV files

Usage

write_metadata(metadata, metadata_dir)

Arguments

metadata

A list with elements variables, codes, and optionally layout. Each is a tibble matching the canonical schema.

metadata_dir

Path to the metadata/ directory; created if absent.

Value

metadata_dir invisibly.
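
The write/read pair can be pictured as a CSV round trip. The sketch below uses base R's write.csv() and read.csv() with hypothetical helper names; the file names variables.csv, codes.csv and layout.csv follow the description of pumf_parse_metadata(), while everything else is an assumption.

```r
# Hypothetical writer: one CSV per non-NULL metadata element
write_metadata_sketch <- function(metadata, metadata_dir) {
  dir.create(metadata_dir, recursive = TRUE, showWarnings = FALSE)
  for (part in c("variables", "codes", "layout")) {
    if (!is.null(metadata[[part]])) {
      write.csv(metadata[[part]],
                file.path(metadata_dir, paste0(part, ".csv")),
                row.names = FALSE)
    }
  }
  invisible(metadata_dir)
}

# Hypothetical reader: a missing CSV yields NULL for that element
read_metadata_sketch <- function(metadata_dir) {
  read_part <- function(part) {
    path <- file.path(metadata_dir, paste0(part, ".csv"))
    if (!file.exists(path)) return(NULL)
    read.csv(path, colClasses = "character", stringsAsFactors = FALSE)
  }
  list(variables = read_part("variables"),
       codes = read_part("codes"),
       layout = read_part("layout"))
}

# Toy metadata without a layout, as for a CSV-format survey
meta <- list(
  variables = data.frame(name = "SEX", label_en = "Sex", label_fr = "Sexe",
                         stringsAsFactors = FALSE),
  codes = data.frame(name = "SEX", val = "1", label_en = "Male",
                     label_fr = "Homme", stringsAsFactors = FALSE)
)
dir_out <- write_metadata_sketch(meta, tempfile("metadata_"))
round_trip <- read_metadata_sketch(dir_out)
```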
