Proper dataset documentation is crucial for reproducible research and effective data sharing. The {qtkit} package provides two main functions to help standardize and automate the documentation process:
create_data_origin()
: Creates standardized metadata about data(set) sources

create_data_dictionary()
: Generates the scaffolding for detailed variable-level documentation, or can use AI to generate descriptions to be reviewed and updated as necessary

Let's start by documenting the built-in mtcars dataset:
# Load the required packages
library(qtkit)
library(fs)     # file_temp()
library(dplyr)  # glimpse(), mutate()
library(readr)  # write_csv()

# Create a temporary file for our documentation
origin_file <- file_temp(ext = "csv")
# Create the origin documentation template
origin_doc <- create_data_origin(
file_path = origin_file,
return = TRUE
)
#> Data origin file created at `file_path`.
# View the template
origin_doc |>
glimpse()
#> Rows: 8
#> Columns: 2
#> $ attribute <chr> "Resource name", "Data source", "Data sampling frame", "Da…
#> $ description <chr> "The name of the resource.", "URL, DOI, etc.", "Language, …
The template provides fields for essential metadata. You can either open the CSV file in a spreadsheet editor or fill it out programmatically. Here's how you might fill it out for mtcars:
origin_doc |>
mutate(description = c(
"Motor Trend Car Road Tests",
"Henderson and Velleman (1981), Building multiple regression models interactively. Biometrics, 37, 391–411.",
"US automobile market, passenger vehicles",
"1973-74",
"Built-in R dataset (.rda)",
"Single data frame with 32 observations of 11 variables",
"Public Domain",
"Citation: Henderson and Velleman (1981)"
)) |>
write_csv(origin_file)
Create a basic data dictionary without AI assistance:
# Create a temporary file for our dictionary
dict_file <- file_temp(ext = "csv")
# Generate dictionary for iris dataset
iris_dict <- create_data_dictionary(
data = iris,
file_path = dict_file
)
# View the results
iris_dict |>
glimpse()
#> Rows: 5
#> Columns: 4
#> $ variable <chr> "Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Widt…
#> $ name <chr> NA, NA, NA, NA, NA
#> $ type <chr> "numeric", "numeric", "numeric", "numeric", "factor"
#> $ description <chr> NA, NA, NA, NA, NA
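The name and description columns are left as NA for you to fill in. As with the origin template, you can complete them programmatically; the values below are illustrative examples, not output from the package:

```r
library(dplyr)  # mutate()
library(readr)  # write_csv()

# Fill in the scaffolding by hand (example values are illustrative)
iris_dict |>
  mutate(
    name = c("Sepal Length", "Sepal Width", "Petal Length",
             "Petal Width", "Species"),
    description = c(
      "Length of the sepal in centimeters",
      "Width of the sepal in centimeters",
      "Length of the petal in centimeters",
      "Width of the petal in centimeters",
      "Iris species: setosa, versicolor, or virginica"
    )
  ) |>
  write_csv(dict_file)
```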
If you have an OpenAI API key, you can generate more detailed descriptions:
# Not run - requires API key
Sys.setenv(OPENAI_API_KEY = "your-api-key")
iris_dict_ai <- create_data_dictionary(
data = iris,
file_path = dict_file,
model = "gpt-4",
sample_n = 5
)
Example output might look like:
#> # A tibble: 2 × 4
#> variable name type description
#> <chr> <chr> <chr> <chr>
#> 1 Sepal.Length Sepal Length numeric Length of the sepal in centimeters
#> 2 Sepal.Width Sepal Width numeric Width of the sepal in centimeters
For larger datasets, you can use sampling and grouping:
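A sketch of what that might look like: sample_n appears in the call above and limits how many rows are sent to the model, while the per-group sampling with dplyr shown here is our own pre-processing step (an assumption for illustration, not an argument of create_data_dictionary):

```r
# Not run - requires API key
library(qtkit)
library(dplyr)

# Take a few rows per group so the model sees representative
# values from each level, then generate the dictionary
iris_sample <- iris |>
  group_by(Species) |>
  slice_sample(n = 2) |>
  ungroup()

iris_dict_grouped <- create_data_dictionary(
  data = iris_sample,
  file_path = dict_file,
  model = "gpt-4",
  sample_n = 5
)
```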
The {qtkit} package provides flexible tools for standardizing dataset documentation. By combining create_data_origin() and create_data_dictionary(), you can create comprehensive documentation that enhances reproducibility and data sharing. For the full function reference, see:
help(package = "qtkit")