--- title: "PONG2 Training: Building Custom KIR Prediction Models" author: "Norman Lab" output: rmarkdown::html_vignette: toc: true toc_depth: 3 fig_width: 7 fig_height: 5 vignette: > %\VignetteIndexEntry{PONG2 Training: Building Custom KIR Prediction Models} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", eval = FALSE ) ``` ## Overview This vignette provides a complete, step-by-step guide to training custom KIR prediction models using the `train` command in PONG2. Training is useful when: - You have a new or population-specific reference dataset - You want to update models with additional samples - You need locus-specific models (e.g. only `KIR3DL1` or `KIR2DL1`) - You want to experiment with different SNP quality or allele frequency filters --- ## Prerequisites | Requirement | Version | Notes | |-------------|---------|-------| | PLINK2 | ≥ 2.0 | Must be in PATH | | R | ≥ 4.0 | With PONG2 installed | | Reference PLINK files | — | chr19 covering the KIR locus | | KIR allele calls | — | CSV format (see below) | --- ## Step 1: Prepare Input Data ### 1a. Reference genotypes (`--bfile`) PLINK bed/bim/fam files containing SNPs in the KIR locus (chr19). #### Using the 1000 Genomes Project (1KGP) as reference panel The 1KGP is the recommended reference panel for training PONG2 models. Choose the appropriate panel for your genome assembly: | Assembly | 1KGP Reference Panel | Samples | URL | |----------|---------------------|---------|-----| | hg19 | Phase 3 | 2,504 | https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ | | hg38 | High Coverage | 3,202 | https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/ | Download and extract the KIR region: ```bash # ── hg19: 1KGP Phase 3 ────────────────────────────────────────────────────── wget https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/\ ALL.chr19.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz plink2 \ --vcf ALL.chr19.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz \ --chr 19 \ --from-bp 55000000 \ --to-bp 55400000 \ --make-bed \ --out reference_chr19_hg19 # ── hg38: 1KGP High Coverage ───────────────────────────────────────────────── wget https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/\ 1000G_2504_high_coverage/working/20220422_3202_phased_SNV_INDEL_SV/\ 1kGP_high_coverage_Illumina.chr19.filtered.SNV_INDEL_SV_phased_panel.vcf.gz plink2 \ --vcf 1kGP_high_coverage_Illumina.chr19.filtered.SNV_INDEL_SV_phased_panel.vcf.gz \ --chr 19 \ --from-bp 54000000 \ --to-bp 55000000 \ --make-bed \ --out reference_chr19_hg38 ``` #### Using your own reference dataset Extract chr19 from your full-genome PLINK files: ```bash plink2 --bfile full_reference --chr 19 --make-bed --out reference_chr19 ``` ### 1b. Known KIR allele calls (`--kfile`) A comma-separated CSV file with sample IDs and phased KIR allele calls. Each locus requires two columns representing the two haplotypes (`_h1` and `_h2`). #### Format ``` Sample,KIR2DL1_h1,KIR2DL1_h2,KIR3DL1_h1,KIR3DL1_h2 SAMPLE001,KIR2DL1*00101,KIR2DL1*00302,KIR3DL1*001,KIR3DL1*002 SAMPLE002,KIR2DL1*00104,KIR2DL1*0000,KIR3DL1*004,KIR3DL1*005 SAMPLE003,KIR2DL1*009,KIR2DL1*00601,KIR3DL1*00302,KIR3DL1*001 ``` #### Rules | Rule | Detail | |------|--------| | Header required | First row must be `Sample,_h1,_h2,...` | | Sample IDs | Must exactly match IDs in the PLINK `.fam` file | | Allele names | Use full standard nomenclature (e.g. `KIR3DL1*001`, `KIR2DL1*00201`) | | Null allele | Use `KIR*0000` (not blank, not `null`) | | Unresolved alleles | Rows with `*new` or `*unresolved` are automatically excluded | --- ## Step 2: Run Training ```bash pong2 train \ -i reference_chr19 \ -k kir_calls.csv \ -o models/KIR3DL1 \ -l KIR3DL1 \ -a hg38 \ -t 20 ``` ### With optional parameters ```bash pong2 train \ -i reference_chr19 \ -k kir_calls.csv \ -o models/KIR3DL1 \ -l KIR3DL1 \ -a hg38 \ -t 20 \ --nclassifier 200 \ --split 0.8 \ --kirmaf 0.005 \ --mac 5 ``` ### Key training parameters | Flag | Default | Description | |------|---------|-------------| | `--nclassifier` | `100` | Number of ensemble classifiers — higher = more accurate but slower | | `--split` | `0.7` | Proportion of samples for training (remainder used for validation) | | `--kirmaf` | `0.00` | Minimum KIR allele frequency — filters rare alleles from training | | `--mac` | `3` | Minimum allele count for SNPs — removes very rare variants | | `-r, --region` | Optimized default | Custom SNP region (e.g. `55281035-55295784` for hg19) | --- ## Step 3: Training Output After training completes, the output directory (`-o`) will contain: | File | Description | |------|-------------| | `_model.RData` | Trained prediction model — main output | | `_test.RData` | Test genotypes (only when `--split < 1`) | | `_split.RData` | Train/test split object (only when `--split < 1`) | > **Note:** Temporary files in `tmp/` are automatically removed after training completes. --- ## Step 4: Evaluate Model Performance If `--split < 1`, PONG2 holds out a validation set during training. ### Option A: Evaluate from the terminal (recommended) ```bash pong2 evaluate \ --model-dir models/KIR3DL1 \ --locus KIR3DL1 \ --threshold 0.5 ``` This outputs haplotype accuracy, genotype accuracy, call rate, and per-allele sensitivity/specificity, and saves a summary CSV to the model directory. ### Option B: Evaluate in R ```r library(PONG2) # Load saved objects path <- "models/KIR3DL1" mobj <- get(load(paste0(path, "_model.RData"))) test.geno <- get(load(paste0(path, "_test.RData"))) kirtab <- get(load(paste0(path, "_split.RData"))) model <- hlaModelFromObj(mobj) # Predict on test set pred <- kirPredict(model, test.geno, type = "response+prob", verbose = FALSE) # Evaluate using hlaCompareAllele comp <- hlaCompareAllele(kirtab$validation, pred, allele.limit = model, call.threshold = 0.5) # Overall accuracy cat(sprintf("Haplotype accuracy: %.1f%%\n", comp$overall$acc.haplo * 100)) cat(sprintf("Genotype accuracy: %.1f%%\n", comp$overall$acc.geno * 100)) cat(sprintf("Call rate: %.1f%%\n", comp$overall$call.rate * 100)) cat(sprintf("Test samples: %d\n", comp$overall$n.samp)) # Per-allele accuracy if (!is.null(comp$detail)) { allele_detail <- comp$detail[order(comp$detail$acc.haplo), ] print(allele_detail) } # Save summary eval_summary <- data.frame( Locus = "KIR3DL1", N_test = comp$overall$n.samp, Acc_Haplo = round(comp$overall$acc.haplo * 100, 2), Acc_Geno = round(comp$overall$acc.geno * 100, 2), Call_Rate = round(comp$overall$call.rate * 100, 2) ) write.csv(eval_summary, file = paste0(path, "_eval_summary.csv"), row.names = FALSE) ``` --- ## Step 5: Use a Custom Model for Imputation Once trained, use your model directly with `pong2 impute`: ```bash pong2 impute \ -i chr19_target \ -o results/ \ -l KIR3DL1 \ -a hg38 \ -m models/KIR3DL1/KIR3DL1_model.RData ``` Or load it directly in R: ```r library(PONG2) load("models/KIR3DL1/KIR3DL1_model.RData") model <- hlaModelFromObj(mobj) geno <- hlaBED2Geno("chr19_target.bed", "chr19_target.fam", "chr19_target.bim", import.chr = "19", assembly = "hg38") pred <- kirPredict(model, geno, type = "response+prob") ``` --- ## Troubleshooting | Error | Likely Cause | Fix | |-------|-------------|-----| | `No matching samples` | Sample IDs in `--kfile` don't match `.fam` | Check ID format — must match exactly | | `Insufficient training samples` | Too few overlapping samples (<10) | Verify PLINK and KIR file sample overlap | | `No SNPs found in region` | Wrong assembly or region coordinates | Check `--assembly` and `--region` | | `No model found for locus` | Locus name typo or unsupported locus | Check locus spelling (e.g. `KIR3DL1` not `KIR3DL`) | | Memory issues | Too many threads or large dataset | Reduce `--threads` or use HPC with more RAM | | Slow training | Insufficient threads | Increase `--threads` or reduce `--nclassifier` | | Low accuracy | Too few training samples or rare alleles | Increase sample size or adjust `--kirmaf` | --- ## Next Steps - See vignette [PONG2-imputation](https://normanlabucd.github.io/PONG2/articles/PONG2-imputation.html) for the full imputation workflow - See vignette [PONG2 Basics](https://normanlabucd.github.io/PONG2/articles/PONG2-basics.html) for installation and quick start - Run the complete end-to-end workflow script: [example/full_workflow.sh](https://github.com/NormanLabUCD/PONG2/blob/main/example/full_workflow.sh) - Report issues: [Open a GitHub issue](https://github.com/NormanLabUCD/PONG2/issues/new) Happy KIR model training! 🧬