
End-to-End Pipeline: From API to Multi-Sport Analysis

vald.extractor

2026-01-18

Introduction

The vald.extractor package provides a robust, production-ready pipeline for extracting, cleaning, and analyzing VALD ForceDecks data across multiple sports. This vignette demonstrates the complete workflow from API authentication to publication-ready visualizations.

Key Problems Solved

  1. Stability: Chunked batch processing prevents timeout errors when working with large datasets (1000+ tests)
  2. Data Cleaning: Automated sports taxonomy mapping standardizes inconsistent team/group names
  3. Reproducibility: All processing steps are documented and version-controlled
  4. Generic Analysis: Test-type suffix removal enables writing analysis code once that works for all test types

Step 1: Authentication and Data Extraction

First, set your VALD API credentials and extract test/trial data:

library(vald.extractor)

# Set credentials
valdr::set_credentials(
  client_id     = "your_client_id",
  client_secret = "your_client_secret",
  tenant_id     = "your_tenant_id",
  region        = "aue"
)

# Fetch data from 2020 onwards in chunks of 100 tests
vald_data <- fetch_vald_batch(
  start_date = "2020-01-01T00:00:00Z",
  chunk_size = 100,
  verbose = TRUE
)

# Extract components
tests_df <- vald_data$tests
trials_df <- vald_data$trials

cat("Extracted", nrow(tests_df), "tests and", nrow(trials_df), "trials\n")

Why chunking matters: Without chunking, large organizations with 5000+ tests will experience API timeout errors. The chunked approach processes 100 tests at a time, with fault-tolerant error handling that logs issues without halting the entire extraction.
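The chunking pattern can be sketched as follows. This is an illustrative simplification, not the package's internal implementation; `fetch_chunk()` is a hypothetical stand-in for a single API request:

```r
# Illustrative sketch of chunked, fault-tolerant fetching.
# fetch_chunk(from, to) is a hypothetical stand-in for one API request.
fetch_in_chunks <- function(n_tests, chunk_size = 100, fetch_chunk) {
  starts <- seq(1, n_tests, by = chunk_size)
  results <- vector("list", length(starts))
  for (i in seq_along(starts)) {
    from <- starts[i]
    to   <- min(from + chunk_size - 1, n_tests)
    results[[i]] <- tryCatch(
      fetch_chunk(from, to),
      error = function(e) {
        # Log the failure and keep going -- one bad chunk
        # should not abort the whole extraction.
        message(sprintf("ERROR on chunk %d (rows %d-%d): %s",
                        i, from, to, conditionMessage(e)))
        NULL
      }
    )
  }
  # rbind silently skips the NULL entries from failed chunks
  do.call(rbind, results)
}
```

A failed chunk thus costs at most `chunk_size` rows rather than the entire run.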

Step 2: Fetch and Standardize Metadata

Retrieve athlete profiles and group memberships via OAuth2:

# Fetch raw metadata
metadata <- fetch_vald_metadata(
  client_id     = "your_client_id",
  client_secret = "your_client_secret",
  tenant_id     = "your_tenant_id",
  region        = "aue"
)

# Standardize: unnest group memberships and create unified athlete records
athlete_metadata <- standardize_vald_metadata(
  profiles = metadata$profiles,
  groups   = metadata$groups
)

head(athlete_metadata)

Understanding the Metadata Structure

The VALD API stores group memberships as a nested array (groupIds). The standardize_vald_metadata() function:

  1. Unnests the array so each athlete-group pair gets its own row
  2. Joins with group names from the Groups API
  3. Collapses back to one row per athlete with all group names concatenated

Result: A clean metadata table where all_group_names contains “Football, U18, Elite” for an athlete in multiple groups.
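The three steps can be illustrated on a toy example (a simplified tidyr/dplyr sketch; the real function operates on the API's full profile and group tables):

```r
library(dplyr)
library(tidyr)

# Toy profiles with a nested groupIds list-column, as returned by the API
profiles <- tibble(
  profileId = c("a1", "a2"),
  groupIds  = list(c("g1", "g2"), "g1")
)
groups <- tibble(
  groupId   = c("g1", "g2"),
  groupName = c("Football", "U18")
)

athlete_groups <- profiles %>%
  unnest_longer(groupIds) %>%                         # 1. one row per athlete-group pair
  left_join(groups, by = c("groupIds" = "groupId")) %>%  # 2. attach group names
  group_by(profileId) %>%
  summarise(all_group_names = paste(groupName, collapse = ", "),
            .groups = "drop")                         # 3. collapse to one row per athlete
```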

Step 3: Apply Sports Taxonomy

Map inconsistent team names to standardized sports categories:

athlete_metadata <- classify_sports(
  data = athlete_metadata,
  group_col = "all_group_names",
  output_col = "sports_clean"
)

# Inspect the mapping
table(athlete_metadata$sports_clean)

The Value Add: This regex-based classification is the core innovation. Organizations often record the same sport under many inconsistent group names (e.g., “Football”, “FSI”, “TCFC”, “Soccer”). Without this automation, analysts spend hours manually categorizing athletes. The package includes patterns for 15+ sports and can be easily extended.
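A minimal sketch of how such regex mapping works (illustrative only; the real classify_sports() ships with its own taxonomy, and the patterns below are just examples):

```r
# Map free-text group names to a standardized sport label via regex.
# The patterns here are illustrative, not the package's built-in taxonomy.
classify_sport_name <- function(group_names,
                                patterns = c(Football   = "Football|Soccer|FSI",
                                             Basketball = "Basketball|BBall")) {
  vapply(group_names, function(x) {
    hit <- names(patterns)[vapply(patterns, grepl, logical(1),
                                  x = x, ignore.case = TRUE)]
    if (length(hit)) hit[1] else "Other"   # first matching pattern wins
  }, character(1), USE.NAMES = FALSE)
}
```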

Step 4: Transform to Wide Format and Join

Combine trials into tests, pivot to wide format, and merge with metadata:

library(dplyr)

# Join trials and tests
all_data <- left_join(trials_df, tests_df, by = c("testId", "athleteId"))

# Aggregate trials and pivot to wide format
structured_test_data <- all_data %>%
  group_by(athleteId, testId, testType, recordedUTC,
           recordedDateOffset, trialLimb, definition_name) %>%
  summarise(
    mean_result = mean(as.numeric(value), na.rm = TRUE),
    mean_weight = mean(as.numeric(weight), na.rm = TRUE),
    .groups = "drop"
  ) %>%
  mutate(
    TestTimestampUTC = lubridate::ymd_hms(recordedUTC),
    TestTimestampLocal = TestTimestampUTC + lubridate::minutes(recordedDateOffset),
    Testdate = as.Date(TestTimestampLocal)
  ) %>%
  select(athleteId, Testdate, testId, testType, trialLimb,
         definition_name, mean_result, mean_weight) %>%
  tidyr::pivot_wider(
    id_cols = c(athleteId, Testdate, testId, mean_weight),
    names_from = c(definition_name, trialLimb, testType),
    values_from = mean_result,
    names_glue = "{definition_name}_{trialLimb}_{testType}"
  ) %>%
  rename(Weight_on_Test_Day = mean_weight)

# Join with metadata
final_analysis_data <- structured_test_data %>%
  mutate(profileId = as.character(athleteId)) %>%
  left_join(
    athlete_metadata %>% mutate(profileId = as.character(profileId)),
    by = "profileId"
  ) %>%
  mutate(
    Testdate = as.Date(Testdate),
    dateofbirth = as.Date(dateOfBirth),
    age = as.numeric((Testdate - dateofbirth) / 365.25),
    sports = sports_clean
  )

cat("Final dataset:", nrow(final_analysis_data), "rows with",
    ncol(final_analysis_data), "columns\n")

Step 5: Split by Test Type

The “Don’t Repeat Yourself” (DRY) principle in action:

# Split into separate datasets per test type
test_datasets <- split_by_test(
  data = final_analysis_data,
  metadata_cols = c("profileId", "sex", "Testdate", "dateofbirth",
                    "age", "testId", "Weight_on_Test_Day", "sports")
)

# Access individual test types
cmj_data <- test_datasets$CMJ
dj_data <- test_datasets$DJ

# Crucially: column names are now generic
head(names(cmj_data))
# "profileId", "sex", "Testdate", "PEAK_FORCE_Both", "JUMP_HEIGHT_Both", ...
# Note: "_CMJ" suffix has been removed!

Why this matters: You can now write one analysis function that works for all test types:

analyze_peak_force <- function(test_data) {
  summary(test_data$PEAK_FORCE_Both)  # Works for CMJ, DJ, ISO, etc.
}

# Apply to all test types
lapply(test_datasets, analyze_peak_force)

Without suffix removal, you’d need separate code for PEAK_FORCE_Both_CMJ, PEAK_FORCE_Both_DJ, etc.
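For illustration, the suffix removal amounts to a simple anchored regex on the column names (a sketch; split_by_test() handles this internally):

```r
# Strip the trailing "_<testType>" suffix so the same analysis code
# applies to every test type.
strip_test_suffix <- function(cols, test_type) {
  sub(paste0("_", test_type, "$"), "", cols)
}

strip_test_suffix(c("PEAK_FORCE_Both_CMJ", "JUMP_HEIGHT_Both_CMJ"), "CMJ")
# "PEAK_FORCE_Both" "JUMP_HEIGHT_Both"
```

Anchoring the pattern with `$` ensures only the trailing suffix is removed, so a column belonging to another test type is left untouched.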

Step 6: Patch Missing Metadata (Optional)

Fix missing or incorrect demographic data:

# Create an Excel file with: profileId, sex, dateOfBirth
# Example: corrections.xlsx with rows like:
#   profileId         sex       dateOfBirth
#   abc123           Male      1995-03-15
#   def456           Female    1998-07-22

cmj_data <- patch_metadata(
  data = cmj_data,
  patch_file = "corrections.xlsx",
  patch_sheet = 1,
  id_col = "profileId",
  fields_to_patch = c("sex", "dateOfBirth")
)

# Verify corrections
table(cmj_data$sex)  # "Unknown" values should now be fixed
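Conceptually, the patching boils down to overriding missing or “Unknown” values with entries from the corrections table. A simplified sketch of that logic (`patch_field()` is a hypothetical helper, not the package API):

```r
library(dplyr)

# Override missing/"Unknown" values in one field with entries
# from a corrections table (simplified sketch of patch_metadata()).
patch_field <- function(data, patch, id_col = "profileId", field = "sex") {
  patched <- patch[, c(id_col, field)]
  names(patched)[2] <- paste0(field, "_patch")
  data %>%
    left_join(patched, by = id_col) %>%
    mutate(!!field := coalesce(
      na_if(.data[[field]], "Unknown"),      # treat "Unknown" as missing
      .data[[paste0(field, "_patch")]]       # fall back to the correction
    )) %>%
    select(-all_of(paste0(field, "_patch")))
}
```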

Step 7: Generate Summary Statistics

Create publication-ready summary tables:

cmj_summary <- summary_vald_metrics(
  data = cmj_data,
  group_vars = c("sex", "sports"),
  exclude_cols = c("profileId", "testId", "Testdate", "dateofbirth", "age")
)

# View summary
print(cmj_summary)

# Export to CSV
write.csv(cmj_summary, "cmj_summary_by_sport_sex.csv", row.names = FALSE)

Output example:

sex    sports      PEAK_FORCE_Both_Mean  PEAK_FORCE_Both_SD  PEAK_FORCE_Both_CV  PEAK_FORCE_Both_N
Male   Football    2450.32               245.67              10.02               45
Male   Basketball  2310.45               198.23              8.58                32
Female Football    1980.12               187.45              9.47                38

Step 8: Compare Across Groups

Create boxplots for cross-sectional comparisons:

plot_vald_compare(
  data = cmj_data,
  metric_col = "PEAK_FORCE_Both",
  group_col = "sports",
  fill_col = "sex",
  title = "CMJ Peak Force Comparison by Sport and Sex"
)

# Compare jump height
plot_vald_compare(
  data = cmj_data,
  metric_col = "JUMP_HEIGHT_Both",
  group_col = "sports",
  fill_col = "sex",
  title = "CMJ Jump Height Comparison"
)

Advanced: Multi-Test Analysis

Analyze multiple test types simultaneously:

# Define a function to extract a common metric across test types
compare_metric_across_tests <- function(test_datasets, metric = "PEAK_FORCE_Both") {

  results <- lapply(names(test_datasets), function(test_name) {
    test_data <- test_datasets[[test_name]]

    if (metric %in% names(test_data)) {
      data.frame(
        testType = test_name,
        metric = metric,
        mean = mean(test_data[[metric]], na.rm = TRUE),
        sd = sd(test_data[[metric]], na.rm = TRUE),
        n = sum(!is.na(test_data[[metric]]))
      )
    }
  })

  do.call(rbind, results)
}

# Compare peak force across CMJ, DJ, and ISO
force_comparison <- compare_metric_across_tests(test_datasets, "PEAK_FORCE_Both")
print(force_comparison)

Best Practices for Production Use

1. Schedule Regular Updates

# Weekly refresh script
library(vald.extractor)

# Fetch only new data since last update
last_update <- "2024-01-01T00:00:00Z"

new_data <- fetch_vald_batch(
  start_date = last_update,
  chunk_size = 100
)

# Append to existing database. If the start_date window overlaps the
# previous extraction, de-duplicate on testId after binding.
load("vald_database.RData")
updated_tests <- rbind(existing_tests, new_data$tests)
updated_trials <- rbind(existing_trials, new_data$trials)

save(updated_tests, updated_trials, file = "vald_database.RData")

2. Error Logging

The chunked extraction automatically logs errors without halting:

# Errors are printed to console with chunk information:
# "ERROR on chunk 23 (rows 2201-2300): API timeout"
# "Continuing to next chunk..."

# This ensures partial data extraction even if some chunks fail

3. Version Control Your Taxonomy

Store your sports classification rules in a separate config file:

# sports_taxonomy.R
sports_patterns <- list(
  Football = "Football|FSI|TCFC|MCFC|Soccer",
  Basketball = "Basketball|BBall",
  Cricket = "Cricket",
  # ... add your organization's patterns
)

# Then use in classify_sports()

Conclusion

The vald.extractor package transforms raw VALD API data into analysis-ready datasets: stable chunked extraction, automated sports classification, generic per-test-type tables, and publication-ready summaries. This workflow is production-tested with 10,000+ tests across 15+ sports and is designed for CRAN submission.

Next Steps

For issues or feature requests, visit: GitHub Issues
