---
title: "PONG2 Basics: Installation, Quick Start, and Core Usage"
author: "Norman Lab"
output: 
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 3
    fig_width: 7
    fig_height: 5
vignette: >
  %\VignetteIndexEntry{PONG2 Basics: Installation, Quick Start, and Core Usage}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = FALSE  # overridden per chunk
)
```

## Overview

PONG2 imputes killer-cell immunoglobulin-like receptor (KIR) alleles from SNP genotype data, at scales up to large biobank cohorts, by combining:

- Region-specific PLINK2 preprocessing
- C++-accelerated SNP filtering and matching
- Optional local minimac4 pre-imputation for missing variants
- Supervised allele prediction models tailored to the highly polymorphic KIR region
- Automatic chunked prediction for large biobank datasets (>2,000 samples)

It supports **hg19** and **hg38** assemblies and is particularly useful for studying immune response variation, HLA–KIR interactions, and disease associations in diverse populations.

---

## Features

- Multi-ancestry pre-trained models (EUR, AMR, AFR, EAS, SAS)
- Fast C++ backend for SNP matching and quality filtering
- Automatic hg19 / hg38 coordinate handling
- Configurable SNP missingness threshold
- Built-in local imputation fallback (`--fill-missing`) using minimac4
- Support for external pre-imputation (e.g. Michigan Imputation Server)
- Multi-threading via `--threads`
- Automatic chunked prediction for large biobank datasets (>2,000 samples)
- Force-run mode for low SNP match scenarios
- Built-in colorful help system (`pong2 --help`)

---

## Requirements

**R version:** ≥ 4.0

**Required R packages** (loaded at runtime):

- `PONG2` (this package)
- `readr`
- `tidyverse`
- `parallel`

**System tools** (must be in PATH):

| Tool | Version | Required when |
|------|---------|---------------|
| PLINK2 | ≥ 2.0 | Always |
| minimac4 | ≥ 4.1.6 | `--fill-missing` only |
| bgzip and tabix (HTSlib) | not specified | `--fill-missing` only |
| Eagle2 | ≥ 2.4 | Pre-phasing before `--fill-missing` |

---

## Installation

### From GitHub (recommended — latest version)

```r
# Install remotes if needed
if (!requireNamespace("remotes", quietly = TRUE)) install.packages("remotes")

# Install PONG2
remotes::install_github("NormanLabUCD/PONG2")
```

### From release tarball

Download [PONG2_1.0.0.tar.gz](https://github.com/NormanLabUCD/PONG2/releases/download/v1.0.0/PONG2_1.0.0.tar.gz) from the latest release:

```bash
# Standard install
R CMD INSTALL PONG2_1.0.0.tar.gz

# Custom library path
R CMD INSTALL --library=/your/custom/path PONG2_1.0.0.tar.gz
```

### CLI Setup

After installation, make the `pong2` script executable and add it to your PATH:

```bash
# Locate the pong2 script
PONG2_BIN=$(Rscript -e "cat(system.file('scripts', 'pong2', package='PONG2'))")

# Make executable
chmod +x "$PONG2_BIN"

# Add to PATH for this session. To make it permanent, add the same export,
# with the directory written out in full, to your ~/.bashrc or ~/.bash_profile
export PATH="$(dirname "$PONG2_BIN"):$PATH"
```

### Verify installation

```{r verify, eval = TRUE}
library(PONG2)
packageVersion("PONG2")
```

```bash
pong2 version
```
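It can also help to confirm that the system tools listed under Requirements are on your `PATH`. The following is a minimal sketch, not part of PONG2: `missing_tools` is a hypothetical helper, and the executable names (`plink2`, `minimac4`, `bgzip`, `tabix`, `eagle`) are assumptions that may differ on your system.

```bash
# Print the name of every listed tool that cannot be found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# plink2 is always required; the others only matter for --fill-missing.
# These executable names are assumptions and may differ per installation.
missing=$(missing_tools plink2 minimac4 bgzip tabix eagle)
if [ -n "$missing" ]; then
  printf 'Not found on PATH:\n%s\n' "$missing"
else
  echo "All system tools found."
fi
```

If only `plink2` is found, normal imputation with `-i` should still be possible, since the other tools are listed as required for `--fill-missing` only.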

---

## Quick Start Examples

### 1. Basic imputation

```bash
pong2 impute \
  -i data/target_chr19 \
  -o results/basic \
  -l KIR3DL1 \
  -a hg38 \
  -t 16
```

### 2. Imputation with missing SNP fill-in

Pre-phase your data first (see [Pre-phasing section](#pre-phasing-the-kir-region)), then:

```bash
pong2 impute \
  --vcf data/chr19.phased.vcf.gz \
  -o results/imputed \
  -l KIR3DL1 \
  -a hg38 \
  --fill-missing \
  -t 20
```

> **Note:** With `--fill-missing`, the pre-phased VCF passed to `--vcf` is the **only input** required.  
> PLINK bed/bim/fam files cannot store haplotype phase, so the pipeline derives everything it needs from the VCF.

### 3. Training a new model

```bash
pong2 train \
  -i data/reference_chr19 \
  -k data/kir_calls.csv \
  -o models/custom \
  -l KIR3DL1 \
  -a hg19 \
  -t 20
```

### 4. Evaluating a trained model

```bash
pong2 evaluate \
  --model-dir models/custom \
  --locus KIR3DL1 \
  --threshold 0.5
```

---

## Core Usage Reference

### Help

```bash
pong2 --help              # General overview + list of commands
pong2 --help impute       # Detailed help for imputation
pong2 --help train        # Detailed help for training
pong2 version             # Show version number
```

---

### `impute` command

```bash
pong2 impute [options]
```

#### Required flags

| Flag | Description | Example |
|------|-------------|---------|
| `-i, --bfile` | PLINK bed/bim/fam prefix (normal imputation) | `data/chr19` |
| `--vcf` | Pre-phased VCF file (required with `--fill-missing`) | `data/chr19.phased.vcf.gz` |
| `-o, --output` | Output directory (created if it doesn't exist) | `results/imputation` |
| `-l, --locus` | KIR locus to impute | `KIR3DL1` |
| `-a, --assembly` | Genome build | `hg19` or `hg38` |

> **Note:** `-i` and `--vcf` are mutually exclusive:
> - Normal imputation: use `-i` (PLINK bfile)
> - `--fill-missing`: use `--vcf` only (PLINK derived internally from VCF)
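
The two input modes can be summarized in a small wrapper. This is only an illustrative sketch: `impute_cmd` is a hypothetical helper, not something PONG2 provides, and it prints the command instead of running it. It uses only the flags documented in this section and assumes paths without spaces.

```bash
# Print (without running) a pong2 impute command for the chosen input mode.
# Usage: impute_cmd <bfile|vcf> <input> <outdir> <locus> <assembly>
impute_cmd() {
  mode=$1; input=$2; outdir=$3; locus=$4; assembly=$5
  case "$mode" in
    # Normal imputation: PLINK prefix via -i, no --fill-missing
    bfile) echo "pong2 impute -i $input -o $outdir -l $locus -a $assembly" ;;
    # Fill-in mode: pre-phased VCF via --vcf, always paired with --fill-missing
    vcf)   echo "pong2 impute --vcf $input -o $outdir -l $locus -a $assembly --fill-missing" ;;
    *)     echo "unknown mode: $mode (expected bfile or vcf)" >&2; return 1 ;;
  esac
}

impute_cmd bfile data/chr19 results/basic KIR3DL1 hg38
impute_cmd vcf data/chr19.phased.vcf.gz results/imputed KIR3DL1 hg38
```

Because each mode emits exactly one of `-i` or `--vcf`, the wrapper cannot produce a command that mixes the two.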

#### Optional flags

| Flag | Default | Description |
|------|---------|-------------|
| `--filter` | `0.005` | Allele frequency filter threshold (`0.005` or `0.01`) |
| `-t, --threads` | `4` | Number of CPU threads |
| `-f, --force` | `false` | Proceed even if SNP matching rate is below 50% |
| `--fill-missing` | `false` | Impute missing SNPs locally with minimac4 (requires `--vcf`) |

---

### `train` command

```bash
pong2 train [options]
```

#### Required flags

| Flag | Description | Example |
|------|-------------|---------|
| `-i, --bfile` | Reference PLINK bed/bim/fam prefix | `data/chr19` |
| `-k, --kfile` | CSV with sample IDs and phased KIR allele calls | `data/kir_calls.csv` |
| `-o, --output` | Directory to save trained model | `models/KIR3DL1` |
| `-l, --locus` | KIR locus to train | `KIR3DL1` |
| `-a, --assembly` | Genome build | `hg19` or `hg38` |

#### Optional flags

| Flag | Default | Description |
|------|---------|-------------|
| `-t, --threads` | `4` | Number of CPU threads |
| `--nclassifier` | `100` | Number of ensemble classifiers |
| `--split` | `0.7` | Train/validation split proportion |
| `--kirmaf` | `0.00` | Minimum KIR allele frequency filter |
| `--mac` | `3` | Minimum allele count for SNPs |
| `-r, --region` | Optimized default | Custom KIR region (e.g. `55281035-55295784`) |

#### KIR file format

The KIR file (`--kfile`) must be a comma-separated (CSV) file with a header row, one sample per line, and one allele column per haplotype:

```
Sample,KIR3DL1_h1,KIR3DL1_h2
HG00096,KIR3DL1*001,KIR3DL1*002
HG00097,KIR3DL1*005,KIR3DL1*015
HG00099,KIR3DL1*020,KIR3DL1*00302
```
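A quick format check before training can catch malformed rows early. The sketch below is an assumption-laden example rather than PONG2's own validation: `check_kfile` is a hypothetical helper, and the rules it enforces (header names `Sample`, `<locus>_h1`, `<locus>_h2`, and allele names starting with `<locus>*`) are inferred from the example above, not from a formal specification.

```bash
# Check a KIR call file against the layout shown in the example above.
# Usage: check_kfile <file.csv> <locus>
check_kfile() {
  awk -F',' -v locus="$2" '
    NR == 1 {
      # Header: Sample,<locus>_h1,<locus>_h2 (assumed from the example)
      if ($1 != "Sample" || $2 != (locus "_h1") || $3 != (locus "_h2")) {
        print "bad header: " $0; bad = 1
      }
      next
    }
    NF != 3 { print "line " NR ": expected 3 fields, found " NF; bad = 1; next }
    {
      # Both allele columns should start with "<locus>*"
      for (i = 2; i <= 3; i++)
        if (index($i, (locus "*")) != 1) {
          print "line " NR ": unexpected allele name " $i; bad = 1
        }
    }
    END { exit bad + 0 }
  ' "$1"
}

# Demonstration on a temporary copy of the example rows
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
Sample,KIR3DL1_h1,KIR3DL1_h2
HG00096,KIR3DL1*001,KIR3DL1*002
HG00097,KIR3DL1*005,KIR3DL1*015
EOF
check_kfile "$tmp" KIR3DL1 && echo "KIR file looks well-formed"
rm -f "$tmp"
```

The function exits non-zero and prints one message per problem, so it can gate a `pong2 train` call in a script.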

---

### `evaluate` command

Evaluate a trained model against the held-out validation set directly from the terminal:

```bash
pong2 evaluate [options]
```

| Flag | Description | Example |
|------|-------------|---------|
| `--model-dir` | Directory containing trained model files | `models/KIR3DL1` |
| `-l, --locus` | KIR locus to evaluate | `KIR3DL1` |
| `--threshold` | Minimum confidence threshold for calls | `0.5` |

```bash
pong2 evaluate \
  --model-dir models/KIR3DL1 \
  --locus KIR3DL1 \
  --threshold 0.5
```

> **Note:** Evaluation needs held-out validation data, so the model must have been trained with `--split` below 1.

---

## Pre-phasing the KIR Region

Pre-phasing is **required** before using `--fill-missing`. Use Eagle2 to phase your chr19 data:

### hg19

```bash
eagle \
  --bfile=chr19 \
  --geneticMapFile=genetic_map_hg19.txt.gz \
  --outPrefix=chr19.phased \
  --chrom=19 \
  --numThreads=20 \
  --bpStart=55000000 \
  --bpEnd=55400000
```

### hg38

```bash
eagle \
  --bfile=chr19 \
  --geneticMapFile=genetic_map_hg38.txt.gz \
  --outPrefix=chr19.phased \
  --chrom=19 \
  --numThreads=20 \
  --bpStart=54000000 \
  --bpEnd=55000000
```

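> The phased VCF (`chr19.phased.vcf.gz`) is passed directly to `--vcf`. Eagle2 writes VCF output only when it is given VCF input; for PLINK input (`--bfile`), the Eagle2 manual documents Oxford `.haps.gz`/`.sample` output instead. Confirm which files were produced and, if needed, rerun Eagle2 on a VCF export of the same data.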
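
The two commands above differ only in the genetic map and the base-pair window. As a small sketch, the window can be selected from the assembly name. `kir_window` is a hypothetical helper written for this example, and the coordinates are simply the ones used in this section.

```bash
# Echo the Eagle2 --bpStart and --bpEnd values used in this section.
kir_window() {
  case "$1" in
    hg19) echo "55000000 55400000" ;;
    hg38) echo "54000000 55000000" ;;
    *)    echo "unsupported assembly: $1" >&2; return 1 ;;
  esac
}

assembly=hg38
window=$(kir_window "$assembly")
bp_start=${window% *}   # text before the space
bp_end=${window#* }     # text after the space

# Print the assembly-specific Eagle2 options (nothing is executed here)
echo "--geneticMapFile=genetic_map_${assembly}.txt.gz --bpStart=$bp_start --bpEnd=$bp_end"
```

Keeping the coordinates in one place reduces the risk of pairing an hg19 window with an hg38 genetic map.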

---

## Improving Imputation Accuracy

> **Note: KIR-region SNP overlap between input data and 1KGP**  
> The overlap rate is computed between the SNPs in your input data and those in the 1000 Genomes Project (1KGP)
> reference panel within the KIR region, and it determines how to proceed:
>
> | Overlap Rate | Status | Action |
> |-------------|--------|--------|
> | ≥ 50% | Pass | Proceed with PONG2 directly |
> | < 50% | Fail | Run Eagle2 + minimac4 pre-imputation first |
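
To get a rough idea of the overlap before running PONG2, the rate can be approximated by hand. This sketch is not PONG2's own calculation: it assumes SNPs are matched by base-pair position only, and it assumes a hypothetical reference file with one KIR-region position per line. `overlap_pct` is a made-up helper name.

```bash
# Print the percentage of reference positions that also occur in a .bim file.
# Usage: overlap_pct <input.bim> <ref_positions.txt>
overlap_pct() {
  awk '
    NR == FNR { seen[$4] = 1; next }      # first file: .bim, column 4 = bp position
    { total++; if ($1 in seen) hit++ }    # second file: one reference position per line
    END { if (total > 0) printf "%.1f\n", 100 * hit / total; else print "0.0" }
  ' "$1" "$2"
}

# Toy data: 3 input SNPs, 4 reference positions, 2 shared
bim=$(mktemp); ref=$(mktemp)
printf '19\trs1\t0\t100\tA\tG\n19\trs2\t0\t200\tC\tT\n19\trs3\t0\t300\tA\tC\n' > "$bim"
printf '100\n200\n400\n500\n' > "$ref"

pct=$(overlap_pct "$bim" "$ref")
if awk -v p="$pct" 'BEGIN { exit !(p >= 50) }'; then
  echo "overlap ${pct}%: pass"
else
  echo "overlap ${pct}%: fail, pre-impute first"
fi
rm -f "$bim" "$ref"
```

With the toy data, two of the four reference positions are present, so the rate is 50.0% and the check passes at the 50% boundary.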

### Option A: Local pre-imputation (built-in, quick)

```bash
# Step 1: Pre-phase with Eagle2
eagle \
  --bfile=chr19 \
  --geneticMapFile=genetic_map_hg19.txt.gz \
  --outPrefix=chr19.phased \
  --chrom=19 \
  --numThreads=20 \
  --bpStart=55000000 \
  --bpEnd=55400000

# Step 2: Run PONG2 with --fill-missing (VCF only — no -i needed)
pong2 impute \
  --vcf chr19.phased.vcf.gz \
  -o results/imputed \
  -l KIR3DL1 \
  -a hg19 \
  --fill-missing \
  -t 20
```

### Option B: External pre-imputation (recommended for highest accuracy)

Pre-impute your chr19 data using a public server before running PONG2:

**Step 1:** Phase chr19 with Eagle2 (see [Pre-phasing the KIR Region](#pre-phasing-the-kir-region)).

**Step 2:** Upload the phased VCF to the [Michigan Imputation Server](https://imputationserver.sph.umich.edu/) or the [TOPMed Imputation Server](https://imputation.biodatacatalyst.nhlbi.nih.gov/) (recommended for diverse populations), using these settings:

- Reference Panel: TOPMed r5
- Chromosome: 19 only

**Step 3:** Download the imputed VCF and convert it to PLINK format:

```bash
plink2 \
  --vcf imputed.dose.vcf.gz dosage=DS \
  --make-bed \
  --out imputed_chr19
```
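Before moving on, it is worth confirming that the conversion produced a complete PLINK fileset. The helper below is a hypothetical sketch (`bfile_complete` is not a PONG2 or PLINK2 command). It only checks that the three files exist and are non-empty, not that their contents are valid.

```bash
# Succeed only if prefix.bed, prefix.bim and prefix.fam all exist and are non-empty.
bfile_complete() {
  for ext in bed bim fam; do
    [ -s "$1.$ext" ] || { echo "missing or empty: $1.$ext" >&2; return 1; }
  done
}

if bfile_complete imputed_chr19; then
  echo "imputed_chr19 is ready for pong2 impute -i"
else
  echo "run the plink2 conversion first"
fi
```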

**Step 4:** Run PONG2:

```bash
pong2 impute \
  -i imputed_chr19 \
  -o results/final \
  -l KIR3DL1 \
  -a hg38 \
  --filter 0.005
```

### Option C: Force imputation (not recommended)

Use `--force` to proceed even though the SNP match rate is below 50%. Expect lower accuracy when few SNPs match, and use this only if you understand that trade-off:

```bash
pong2 impute -i chr19 -o results -l KIR3DL1 -a hg19 --force
```

---

## Next Steps

- See the [PONG2-imputation](https://normanlabucd.github.io/PONG2/articles/PONG2-imputation.html) vignette for the detailed imputation workflow
- See the [PONG2-training](https://normanlabucd.github.io/PONG2/articles/PONG2-training.html) vignette for custom model training
- Run the complete end-to-end workflow script: [example/full_workflow.sh](https://github.com/NormanLabUCD/PONG2/blob/main/example/full_workflow.sh)
- Report issues: [Open a GitHub issue](https://github.com/NormanLabUCD/PONG2/issues/new)