PONG2 Training: Building Custom KIR Prediction Models

The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.

Norman Lab

Overview

This vignette provides a complete, step-by-step guide to training custom KIR prediction models using the train command in PONG2.

Training is useful when:

You have a new or population-specific reference dataset
You want to update models with additional samples
You need locus-specific models (e.g. only KIR3DL1 or KIR2DL1)
You want to experiment with different SNP quality or allele frequency filters

Prerequisites

Requirement	Version	Notes
PLINK2	≥ 2.0	Must be in PATH
R	≥ 4.0	With PONG2 installed
Reference PLINK files	—	chr19 covering the KIR locus
KIR allele calls	—	CSV format (see below)

Step 1: Prepare Input Data

1a. Reference genotypes (`--bfile`)

PLINK bed/bim/fam files containing SNPs in the KIR locus (chr19).

Using the 1000 Genomes Project (1KGP) as reference panel

The 1KGP is the recommended reference panel for training PONG2 models. Choose the appropriate panel for your genome assembly:

Assembly	1KGP Reference Panel	Samples	URL
hg19	Phase 3	2,504	https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/
hg38	High Coverage	3,202	https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/

Download and extract the KIR region:

# ── hg19: 1KGP Phase 3 ──────────────────────────────────────────────────────
wget https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/\
ALL.chr19.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz

plink2 \
  --vcf ALL.chr19.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz \
  --chr 19 \
  --from-bp 55000000 \
  --to-bp 55400000 \
  --make-bed \
  --out reference_chr19_hg19

# ── hg38: 1KGP High Coverage ─────────────────────────────────────────────────
wget https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/\
1000G_2504_high_coverage/working/20220422_3202_phased_SNV_INDEL_SV/\
1kGP_high_coverage_Illumina.chr19.filtered.SNV_INDEL_SV_phased_panel.vcf.gz

plink2 \
  --vcf 1kGP_high_coverage_Illumina.chr19.filtered.SNV_INDEL_SV_phased_panel.vcf.gz \
  --chr 19 \
  --from-bp 54000000 \
  --to-bp 55000000 \
  --make-bed \
  --out reference_chr19_hg38

Using your own reference dataset

Extract chr19 from your full-genome PLINK files:

plink2 --bfile full_reference --chr 19 --make-bed --out reference_chr19

1b. Known KIR allele calls (`--kfile`)

A comma-separated CSV file with sample IDs and phased KIR allele calls. Each locus requires two columns representing the two haplotypes (_h1 and _h2).

Format

Sample,KIR2DL1_h1,KIR2DL1_h2,KIR3DL1_h1,KIR3DL1_h2
SAMPLE001,KIR2DL1*00101,KIR2DL1*00302,KIR3DL1*001,KIR3DL1*002
SAMPLE002,KIR2DL1*00104,KIR2DL1*0000,KIR3DL1*004,KIR3DL1*005
SAMPLE003,KIR2DL1*009,KIR2DL1*00601,KIR3DL1*00302,KIR3DL1*001

Rules

Rule	Detail
Header required	First row must be `Sample,<locus>_h1,<locus>_h2,...`
Sample IDs	Must exactly match IDs in the PLINK `.fam` file
Allele names	Use full standard nomenclature (e.g. `KIR3DL1001`, `KIR2DL100201`)
Null allele	Use `KIR<locus>*0000` (not blank, not `null`)
Unresolved alleles	Rows with `new` or `unresolved` are automatically excluded

Step 2: Run Training

pong2 train \
  -i reference_chr19 \
  -k kir_calls.csv \
  -o models/KIR3DL1 \
  -l KIR3DL1 \
  -a hg38 \
  -t 20

With optional parameters

pong2 train \
  -i reference_chr19 \
  -k kir_calls.csv \
  -o models/KIR3DL1 \
  -l KIR3DL1 \
  -a hg38 \
  -t 20 \
  --nclassifier 200 \
  --split 0.8 \
  --kirmaf 0.005 \
  --mac 5

Key training parameters

Flag	Default	Description
`--nclassifier`	`100`	Number of ensemble classifiers — higher = more accurate but slower
`--split`	`0.7`	Proportion of samples for training (remainder used for validation)
`--kirmaf`	`0.00`	Minimum KIR allele frequency — filters rare alleles from training
`--mac`	`3`	Minimum allele count for SNPs — removes very rare variants
`-r, --region`	Optimized default	Custom SNP region (e.g. `55281035-55295784` for hg19)

Step 3: Training Output

After training completes, the output directory (-o) will contain:

File	Description
`<locus>_model.RData`	Trained prediction model — main output
`<locus>_test.RData`	Test genotypes (only when `--split < 1`)
`<locus>_split.RData`	Train/test split object (only when `--split < 1`)

Note: Temporary files in tmp/ are automatically removed after training completes.

Step 4: Evaluate Model Performance

If --split < 1, PONG2 holds out a validation set during training.

Option A: Evaluate from the terminal (recommended)

pong2 evaluate \
  --model-dir models/KIR3DL1 \
  --locus KIR3DL1 \
  --threshold 0.5

This outputs haplotype accuracy, genotype accuracy, call rate, and per-allele sensitivity/specificity, and saves a summary CSV to the model directory.

Option B: Evaluate in R

library(PONG2)

# Load saved objects
path      <- "models/KIR3DL1"
mobj      <- get(load(paste0(path, "_model.RData")))
test.geno <- get(load(paste0(path, "_test.RData")))
kirtab    <- get(load(paste0(path, "_split.RData")))
model     <- hlaModelFromObj(mobj)

# Predict on test set
pred <- kirPredict(model, test.geno, type = "response+prob", verbose = FALSE)

# Evaluate using hlaCompareAllele
comp <- hlaCompareAllele(kirtab$validation, pred,
                         allele.limit = model, call.threshold = 0.5)

# Overall accuracy
cat(sprintf("Haplotype accuracy: %.1f%%\n", comp$overall$acc.haplo * 100))
cat(sprintf("Genotype accuracy:  %.1f%%\n", comp$overall$acc.geno  * 100))
cat(sprintf("Call rate:          %.1f%%\n", comp$overall$call.rate  * 100))
cat(sprintf("Test samples:       %d\n",     comp$overall$n.samp))

# Per-allele accuracy
if (!is.null(comp$detail)) {
  allele_detail <- comp$detail[order(comp$detail$acc.haplo), ]
  print(allele_detail)
}

# Save summary
eval_summary <- data.frame(
  Locus     = "KIR3DL1",
  N_test    = comp$overall$n.samp,
  Acc_Haplo = round(comp$overall$acc.haplo  * 100, 2),
  Acc_Geno  = round(comp$overall$acc.geno   * 100, 2),
  Call_Rate = round(comp$overall$call.rate  * 100, 2)
)
write.csv(eval_summary, file = paste0(path, "_eval_summary.csv"), row.names = FALSE)

Step 5: Use a Custom Model for Imputation

Once trained, use your model directly with pong2 impute:

pong2 impute \
  -i chr19_target \
  -o results/ \
  -l KIR3DL1 \
  -a hg38 \
  -m models/KIR3DL1/KIR3DL1_model.RData

Or load it directly in R:

library(PONG2)

load("models/KIR3DL1/KIR3DL1_model.RData")
model <- hlaModelFromObj(mobj)

geno <- hlaBED2Geno("chr19_target.bed", "chr19_target.fam", "chr19_target.bim",
                    import.chr = "19", assembly = "hg38")
pred <- kirPredict(model, geno, type = "response+prob")

Troubleshooting

Error	Likely Cause	Fix
`No matching samples`	Sample IDs in `--kfile` don’t match `.fam`	Check ID format — must match exactly
`Insufficient training samples`	Too few overlapping samples (<10)	Verify PLINK and KIR file sample overlap
`No SNPs found in region`	Wrong assembly or region coordinates	Check `--assembly` and `--region`
`No model found for locus`	Locus name typo or unsupported locus	Check locus spelling (e.g. `KIR3DL1` not `KIR3DL`)
Memory issues	Too many threads or large dataset	Reduce `--threads` or use HPC with more RAM
Slow training	Insufficient threads	Increase `--threads` or reduce `--nclassifier`
Low accuracy	Too few training samples or rare alleles	Increase sample size or adjust `--kirmaf`

Next Steps

See vignette PONG2-imputation for the full imputation workflow
See vignette PONG2 Basics for installation and quick start
Run the complete end-to-end workflow script: example/full_workflow.sh
Report issues: Open a GitHub issue

Happy KIR model training! 🧬

These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.

PONG2 Training: Building Custom KIR Prediction Models

Norman Lab

Overview

Prerequisites

Step 1: Prepare Input Data

1a. Reference genotypes (--bfile)

Using the 1000 Genomes Project (1KGP) as reference panel

Using your own reference dataset

1b. Known KIR allele calls (--kfile)

Format

Rules

Step 2: Run Training

With optional parameters

Key training parameters

Step 3: Training Output

Step 4: Evaluate Model Performance

Option A: Evaluate from the terminal (recommended)

Option B: Evaluate in R

Step 5: Use a Custom Model for Imputation

Troubleshooting

Next Steps

1a. Reference genotypes (`--bfile`)

1b. Known KIR allele calls (`--kfile`)