---
title: "PONG2 Training: Building Custom KIR Prediction Models"
author: "Norman Lab"
output: 
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 3
    fig_width: 7
    fig_height: 5
vignette: >
  %\VignetteIndexEntry{PONG2 Training: Building Custom KIR Prediction Models}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = FALSE
)
```

## Overview

This vignette provides a complete, step-by-step guide to training custom KIR prediction
models using the `train` command in PONG2.

Training is useful when:

- You have a new or population-specific reference dataset
- You want to update models with additional samples
- You need locus-specific models (e.g. only `KIR3DL1` or `KIR2DL1`)
- You want to experiment with different SNP quality or allele frequency filters

---

## Prerequisites

| Requirement | Version | Notes |
|-------------|---------|-------|
| PLINK2 | ≥ 2.0 | Must be in PATH |
| R | ≥ 4.0 | With PONG2 installed |
| Reference PLINK files | — | chr19 covering the KIR locus |
| KIR allele calls | — | CSV format (see below) |

---

## Step 1: Prepare Input Data

### 1a. Reference genotypes (`--bfile`)

PLINK bed/bim/fam files containing SNPs in the KIR locus (chr19).

#### Using the 1000 Genomes Project (1KGP) as reference panel

The 1KGP is the recommended reference panel for training PONG2 models. Choose the
appropriate panel for your genome assembly:

| Assembly | 1KGP Reference Panel | Samples | URL |
|----------|---------------------|---------|-----|
| hg19 | Phase 3 | 2,504 | https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ |
| hg38 | High Coverage | 3,202 | https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/ |

Download and extract the KIR region:

```bash
# ── hg19: 1KGP Phase 3 ──────────────────────────────────────────────────────
wget https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/\
ALL.chr19.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz

plink2 \
  --vcf ALL.chr19.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz \
  --chr 19 \
  --from-bp 55000000 \
  --to-bp 55400000 \
  --make-bed \
  --out reference_chr19_hg19

# ── hg38: 1KGP High Coverage ─────────────────────────────────────────────────
wget https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/\
1000G_2504_high_coverage/working/20220422_3202_phased_SNV_INDEL_SV/\
1kGP_high_coverage_Illumina.chr19.filtered.SNV_INDEL_SV_phased_panel.vcf.gz

plink2 \
  --vcf 1kGP_high_coverage_Illumina.chr19.filtered.SNV_INDEL_SV_phased_panel.vcf.gz \
  --chr 19 \
  --from-bp 54000000 \
  --to-bp 55000000 \
  --make-bed \
  --out reference_chr19_hg38
```

#### Using your own reference dataset

Extract chr19 from your full-genome PLINK files:

```bash
plink2 --bfile full_reference --chr 19 --make-bed --out reference_chr19
```

### 1b. Known KIR allele calls (`--kfile`)

A comma-separated CSV file with sample IDs and phased KIR allele calls.
Each locus requires two columns representing the two haplotypes (`_h1` and `_h2`).

#### Format

```
Sample,KIR2DL1_h1,KIR2DL1_h2,KIR3DL1_h1,KIR3DL1_h2
SAMPLE001,KIR2DL1*00101,KIR2DL1*00302,KIR3DL1*001,KIR3DL1*002
SAMPLE002,KIR2DL1*00104,KIR2DL1*0000,KIR3DL1*004,KIR3DL1*005
SAMPLE003,KIR2DL1*009,KIR2DL1*00601,KIR3DL1*00302,KIR3DL1*001
```

#### Rules

| Rule | Detail |
|------|--------|
| Header required | First row must be `Sample,<locus>_h1,<locus>_h2,...` |
| Sample IDs | Must exactly match IDs in the PLINK `.fam` file |
| Allele names | Use full standard nomenclature (e.g. `KIR3DL1*001`, `KIR2DL1*00201`) |
| Null allele | Use `KIR<locus>*0000` (not blank, not `null`) |
| Unresolved alleles | Rows with `*new` or `*unresolved` are automatically excluded |

---

## Step 2: Run Training

```bash
pong2 train \
  -i reference_chr19 \
  -k kir_calls.csv \
  -o models/KIR3DL1 \
  -l KIR3DL1 \
  -a hg38 \
  -t 20
```

### With optional parameters

```bash
pong2 train \
  -i reference_chr19 \
  -k kir_calls.csv \
  -o models/KIR3DL1 \
  -l KIR3DL1 \
  -a hg38 \
  -t 20 \
  --nclassifier 200 \
  --split 0.8 \
  --kirmaf 0.005 \
  --mac 5
```

### Key training parameters

| Flag | Default | Description |
|------|---------|-------------|
| `--nclassifier` | `100` | Number of ensemble classifiers — higher = more accurate but slower |
| `--split` | `0.7` | Proportion of samples for training (remainder used for validation) |
| `--kirmaf` | `0.00` | Minimum KIR allele frequency — filters rare alleles from training |
| `--mac` | `3` | Minimum allele count for SNPs — removes very rare variants |
| `-r, --region` | Optimized default | Custom SNP region (e.g. `55281035-55295784` for hg19) |

---

## Step 3: Training Output

After training completes, the output directory (`-o`) will contain:

| File | Description |
|------|-------------|
| `<locus>_model.RData` | Trained prediction model — main output |
| `<locus>_test.RData` | Test genotypes (only when `--split < 1`) |
| `<locus>_split.RData` | Train/test split object (only when `--split < 1`) |

> **Note:** Temporary files in `tmp/` are automatically removed after training completes.

---

## Step 4: Evaluate Model Performance

If `--split < 1`, PONG2 holds out a validation set during training.

### Option A: Evaluate from the terminal (recommended)

```bash
pong2 evaluate \
  --model-dir models/KIR3DL1 \
  --locus KIR3DL1 \
  --threshold 0.5
```

This outputs haplotype accuracy, genotype accuracy, call rate, and per-allele
sensitivity/specificity, and saves a summary CSV to the model directory.

### Option B: Evaluate in R

```r
library(PONG2)

# Load saved objects
path      <- "models/KIR3DL1"
mobj      <- get(load(paste0(path, "_model.RData")))
test.geno <- get(load(paste0(path, "_test.RData")))
kirtab    <- get(load(paste0(path, "_split.RData")))
model     <- hlaModelFromObj(mobj)

# Predict on test set
pred <- kirPredict(model, test.geno, type = "response+prob", verbose = FALSE)

# Evaluate using hlaCompareAllele
comp <- hlaCompareAllele(kirtab$validation, pred,
                         allele.limit = model, call.threshold = 0.5)

# Overall accuracy
cat(sprintf("Haplotype accuracy: %.1f%%\n", comp$overall$acc.haplo * 100))
cat(sprintf("Genotype accuracy:  %.1f%%\n", comp$overall$acc.geno  * 100))
cat(sprintf("Call rate:          %.1f%%\n", comp$overall$call.rate  * 100))
cat(sprintf("Test samples:       %d\n",     comp$overall$n.samp))

# Per-allele accuracy
if (!is.null(comp$detail)) {
  allele_detail <- comp$detail[order(comp$detail$acc.haplo), ]
  print(allele_detail)
}

# Save summary
eval_summary <- data.frame(
  Locus     = "KIR3DL1",
  N_test    = comp$overall$n.samp,
  Acc_Haplo = round(comp$overall$acc.haplo  * 100, 2),
  Acc_Geno  = round(comp$overall$acc.geno   * 100, 2),
  Call_Rate = round(comp$overall$call.rate  * 100, 2)
)
write.csv(eval_summary, file = paste0(path, "_eval_summary.csv"), row.names = FALSE)
```

---

## Step 5: Use a Custom Model for Imputation

Once trained, use your model directly with `pong2 impute`:

```bash
pong2 impute \
  -i chr19_target \
  -o results/ \
  -l KIR3DL1 \
  -a hg38 \
  -m models/KIR3DL1/KIR3DL1_model.RData
```

Or load it directly in R:

```r
library(PONG2)

load("models/KIR3DL1/KIR3DL1_model.RData")
model <- hlaModelFromObj(mobj)

geno <- hlaBED2Geno("chr19_target.bed", "chr19_target.fam", "chr19_target.bim",
                    import.chr = "19", assembly = "hg38")
pred <- kirPredict(model, geno, type = "response+prob")
```

---

## Troubleshooting

| Error | Likely Cause | Fix |
|-------|-------------|-----|
| `No matching samples` | Sample IDs in `--kfile` don't match `.fam` | Check ID format — must match exactly |
| `Insufficient training samples` | Too few overlapping samples (<10) | Verify PLINK and KIR file sample overlap |
| `No SNPs found in region` | Wrong assembly or region coordinates | Check `--assembly` and `--region` |
| `No model found for locus` | Locus name typo or unsupported locus | Check locus spelling (e.g. `KIR3DL1` not `KIR3DL`) |
| Memory issues | Too many threads or large dataset | Reduce `--threads` or use HPC with more RAM |
| Slow training | Insufficient threads | Increase `--threads` or reduce `--nclassifier` |
| Low accuracy | Too few training samples or rare alleles | Increase sample size or adjust `--kirmaf` |

---

## Next Steps

- See vignette [PONG2-imputation](https://normanlabucd.github.io/PONG2/articles/PONG2-imputation.html)
  for the full imputation workflow
- See vignette [PONG2 Basics](https://normanlabucd.github.io/PONG2/articles/PONG2-basics.html)
  for installation and quick start
- Run the complete end-to-end workflow script: [example/full_workflow.sh](https://github.com/NormanLabUCD/PONG2/blob/main/example/full_workflow.sh)
- Report issues: [Open a GitHub issue](https://github.com/NormanLabUCD/PONG2/issues/new)

Happy KIR model training! 🧬
