Proper dataset documentation is crucial for reproducible research and effective data sharing. The {qtkit} package provides two main functions to help standardize and automate the documentation process:
create_data_origin()
: Creates standardized metadata about data(set) sources

create_data_dictionary()
: Generates the scaffolding for detailed variable-level documentation, or can use AI to generate descriptions to be reviewed and updated as necessary

Let's start by documenting the built-in mtcars dataset:
# Load the required packages
library(qtkit)
library(fs)     # file_temp()
library(dplyr)  # glimpse(), mutate()
library(readr)  # write_csv()

# Create a temporary file for our documentation
origin_file <- file_temp(ext = "csv")
# Create the origin documentation template
origin_doc <- create_data_origin(
file_path = origin_file,
return = TRUE
)
#> Data origin file created at `file_path`.
# View the template
origin_doc |>
glimpse()
#> Rows: 8
#> Columns: 2
#> $ attribute <chr> "Resource name", "Data source", "Data sampling frame", "Da…
#> $ description <chr> "The name of the resource.", "URL, DOI, etc.", "Language, …
The template provides fields for essential metadata. You can either open the CSV file in a spreadsheet editor or fill it out programmatically. Here's how you might fill it out for mtcars:
origin_doc |>
mutate(description = c(
"Motor Trend Car Road Tests",
"Henderson and Velleman (1981), Building multiple regression models interactively. Biometrics, 37, 391–411.",
"US automobile market, passenger vehicles",
"1973-74",
"Built-in R dataset (.rda)",
"Single data frame with 32 observations of 11 variables",
"Public Domain",
"Citation: Henderson and Velleman (1981)"
)) |>
write_csv(origin_file)
Create a basic data dictionary without AI assistance:
# Create a temporary file for our dictionary
dict_file <- file_temp(ext = "csv")
# Generate dictionary for iris dataset
iris_dict <- create_data_dictionary(
data = iris,
file_path = dict_file
)
# View the results
iris_dict |>
glimpse()
#> Rows: 5
#> Columns: 4
#> $ variable <chr> "Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Widt…
#> $ name <chr> NA, NA, NA, NA, NA
#> $ type <chr> "numeric", "numeric", "numeric", "numeric", "factor"
#> $ description <chr> NA, NA, NA, NA, NA
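The name and description columns are left as NA for you to fill in. As with the origin template, you can complete them programmatically; the values below are illustrative examples, not output from the package:

```r
library(dplyr)  # mutate()
library(readr)  # write_csv()

# Fill in the scaffolding by hand (example values are illustrative)
iris_dict |>
  mutate(
    name = c("Sepal Length", "Sepal Width", "Petal Length",
             "Petal Width", "Species"),
    description = c(
      "Length of the sepal in centimeters",
      "Width of the sepal in centimeters",
      "Length of the petal in centimeters",
      "Width of the petal in centimeters",
      "Iris species: setosa, versicolor, or virginica"
    )
  ) |>
  write_csv(dict_file)
```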
If you have an OpenAI API key, you can generate more detailed descriptions:
# Not run - requires API key
Sys.setenv(OPENAI_API_KEY = "your-api-key")
iris_dict_ai <- create_data_dictionary(
data = iris,
file_path = dict_file,
model = "gpt-4",
sample_n = 5
)
Example output might look like:
#> # A tibble: 2 × 4
#> variable name type description
#> <chr> <chr> <chr> <chr>
#> 1 Sepal.Length Sepal Length numeric Length of the sepal in centimeters
#> 2 Sepal.Width Sepal Width numeric Width of the sepal in centimeters
For larger datasets, you can use sampling and grouping:
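A sketch of what that might look like: sample_n appears in the call above and limits how many rows are sent to the model, while the per-group sampling with dplyr shown here is our own pre-processing step (an assumption for illustration, not an argument of create_data_dictionary):

```r
# Not run - requires API key
library(qtkit)
library(dplyr)

# Take a few rows per group so the model sees representative
# values from each level, then generate the dictionary
iris_sample <- iris |>
  group_by(Species) |>
  slice_sample(n = 2) |>
  ungroup()

iris_dict_grouped <- create_data_dictionary(
  data = iris_sample,
  file_path = dict_file,
  model = "gpt-4",
  sample_n = 5
)
```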
The {qtkit} package provides flexible tools for standardizing dataset documentation. By combining create_data_origin() and create_data_dictionary(), you can create comprehensive documentation that enhances reproducibility and data sharing. For the full function reference, see:
help(package = "qtkit")