The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.
This vignette provides a complete, step-by-step guide to training
custom KIR prediction models using the train command in
PONG2.
Training is useful when:
KIR3DL1 or
KIR2DL1)| Requirement | Version | Notes |
|---|---|---|
| PLINK2 | ≥ 2.0 | Must be in PATH |
| R | ≥ 4.0 | With PONG2 installed |
| Reference PLINK files | — | chr19 covering the KIR locus |
| KIR allele calls | — | CSV format (see below) |
--bfile)PLINK bed/bim/fam files containing SNPs in the KIR locus (chr19).
The 1KGP is the recommended reference panel for training PONG2 models. Choose the appropriate panel for your genome assembly:
| Assembly | 1KGP Reference Panel | Samples | URL |
|---|---|---|---|
| hg19 | Phase 3 | 2,504 | https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ |
| hg38 | High Coverage | 3,202 | https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/ |
Download and extract the KIR region:
# ── hg19: 1KGP Phase 3 ──────────────────────────────────────────────────────
wget https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/\
ALL.chr19.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz
plink2 \
--vcf ALL.chr19.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz \
--chr 19 \
--from-bp 55000000 \
--to-bp 55400000 \
--make-bed \
--out reference_chr19_hg19
# ── hg38: 1KGP High Coverage ─────────────────────────────────────────────────
wget https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/\
1000G_2504_high_coverage/working/20220422_3202_phased_SNV_INDEL_SV/\
1kGP_high_coverage_Illumina.chr19.filtered.SNV_INDEL_SV_phased_panel.vcf.gz
plink2 \
--vcf 1kGP_high_coverage_Illumina.chr19.filtered.SNV_INDEL_SV_phased_panel.vcf.gz \
--chr 19 \
--from-bp 54000000 \
--to-bp 55000000 \
--make-bed \
--out reference_chr19_hg38--kfile)A comma-separated CSV file with sample IDs and phased KIR allele
calls. Each locus requires two columns representing the two haplotypes
(_h1 and _h2).
Sample,KIR2DL1_h1,KIR2DL1_h2,KIR3DL1_h1,KIR3DL1_h2
SAMPLE001,KIR2DL1*00101,KIR2DL1*00302,KIR3DL1*001,KIR3DL1*002
SAMPLE002,KIR2DL1*00104,KIR2DL1*0000,KIR3DL1*004,KIR3DL1*005
SAMPLE003,KIR2DL1*009,KIR2DL1*00601,KIR3DL1*00302,KIR3DL1*001
| Rule | Detail |
|---|---|
| Header required | First row must be
Sample,<locus>_h1,<locus>_h2,... |
| Sample IDs | Must exactly match IDs in the PLINK .fam file |
| Allele names | Use full standard nomenclature (e.g. KIR3DL1*001,
KIR2DL1*00201) |
| Null allele | Use KIR<locus>*0000 (not blank, not
null) |
| Unresolved alleles | Rows with *new or *unresolved are
automatically excluded |
pong2 train \
-i reference_chr19 \
-k kir_calls.csv \
-o models/KIR3DL1 \
-l KIR3DL1 \
-a hg38 \
-t 20| Flag | Default | Description |
|---|---|---|
--nclassifier |
100 |
Number of ensemble classifiers — higher = more accurate but slower |
--split |
0.7 |
Proportion of samples for training (remainder used for validation) |
--kirmaf |
0.00 |
Minimum KIR allele frequency — filters rare alleles from training |
--mac |
3 |
Minimum allele count for SNPs — removes very rare variants |
-r, --region |
Optimized default | Custom SNP region (e.g. 55281035-55295784 for
hg19) |
After training completes, the output directory (-o) will
contain:
| File | Description |
|---|---|
<locus>_model.RData |
Trained prediction model — main output |
<locus>_test.RData |
Test genotypes (only when --split < 1) |
<locus>_split.RData |
Train/test split object (only when --split < 1) |
Note: Temporary files in
tmp/are automatically removed after training completes.
If --split < 1, PONG2 holds out a validation set
during training.
This outputs haplotype accuracy, genotype accuracy, call rate, and per-allele sensitivity/specificity, and saves a summary CSV to the model directory.
library(PONG2)
# Load saved objects
path <- "models/KIR3DL1"
mobj <- get(load(paste0(path, "_model.RData")))
test.geno <- get(load(paste0(path, "_test.RData")))
kirtab <- get(load(paste0(path, "_split.RData")))
model <- hlaModelFromObj(mobj)
# Predict on test set
pred <- kirPredict(model, test.geno, type = "response+prob", verbose = FALSE)
# Evaluate using hlaCompareAllele
comp <- hlaCompareAllele(kirtab$validation, pred,
allele.limit = model, call.threshold = 0.5)
# Overall accuracy
cat(sprintf("Haplotype accuracy: %.1f%%\n", comp$overall$acc.haplo * 100))
cat(sprintf("Genotype accuracy: %.1f%%\n", comp$overall$acc.geno * 100))
cat(sprintf("Call rate: %.1f%%\n", comp$overall$call.rate * 100))
cat(sprintf("Test samples: %d\n", comp$overall$n.samp))
# Per-allele accuracy
if (!is.null(comp$detail)) {
allele_detail <- comp$detail[order(comp$detail$acc.haplo), ]
print(allele_detail)
}
# Save summary
eval_summary <- data.frame(
Locus = "KIR3DL1",
N_test = comp$overall$n.samp,
Acc_Haplo = round(comp$overall$acc.haplo * 100, 2),
Acc_Geno = round(comp$overall$acc.geno * 100, 2),
Call_Rate = round(comp$overall$call.rate * 100, 2)
)
write.csv(eval_summary, file = paste0(path, "_eval_summary.csv"), row.names = FALSE)Once trained, use your model directly with
pong2 impute:
pong2 impute \
-i chr19_target \
-o results/ \
-l KIR3DL1 \
-a hg38 \
-m models/KIR3DL1/KIR3DL1_model.RDataOr load it directly in R:
library(PONG2)
load("models/KIR3DL1/KIR3DL1_model.RData")
model <- hlaModelFromObj(mobj)
geno <- hlaBED2Geno("chr19_target.bed", "chr19_target.fam", "chr19_target.bim",
import.chr = "19", assembly = "hg38")
pred <- kirPredict(model, geno, type = "response+prob")| Error | Likely Cause | Fix |
|---|---|---|
No matching samples |
Sample IDs in --kfile don’t match
.fam |
Check ID format — must match exactly |
Insufficient training samples |
Too few overlapping samples (<10) | Verify PLINK and KIR file sample overlap |
No SNPs found in region |
Wrong assembly or region coordinates | Check --assembly and --region |
No model found for locus |
Locus name typo or unsupported locus | Check locus spelling (e.g. KIR3DL1 not
KIR3DL) |
| Memory issues | Too many threads or large dataset | Reduce --threads or use HPC with more RAM |
| Slow training | Insufficient threads | Increase --threads or reduce
--nclassifier |
| Low accuracy | Too few training samples or rare alleles | Increase sample size or adjust --kirmaf |
Happy KIR model training! 🧬
These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.