pairwiseLLM provides a unified, extensible framework for
generating, submitting, and modeling pairwise comparisons of
writing quality using large language models (LLMs).
It includes helpers for generating and sampling comparison pairs, built-in and custom prompt templates and trait definitions, live and batch submission across multiple LLM providers, positional-bias diagnostics, and Bradley-Terry and Elo modeling of the results.
Several vignettes are available to demonstrate functionality, covering:

- basic function usage
- advanced batch processing workflows
- prompt evaluation and positional-bias diagnostics
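If the package is installed, the standard R helpers will list and open them. For example (the vignette name below is the one referenced later in this README):

``` r
# List the vignettes installed with the package
browseVignettes("pairwiseLLM")

# Open the prompt-template bias vignette
vignette("prompt-template-bias", package = "pairwiseLLM")
```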
The following models are confirmed to work for pairwise comparisons:
| Provider | Model | Reasoning Mode? |
|---|---|---|
| OpenAI | gpt-5.2 | ✅ Yes |
| OpenAI | gpt-5.1 | ✅ Yes |
| OpenAI | gpt-4o | ❌ No |
| OpenAI | gpt-4.1 | ❌ No |
| Anthropic | claude-sonnet-4-5 | ✅ Yes |
| Anthropic | claude-haiku-4-5 | ✅ Yes |
| Anthropic | claude-opus-4-5 | ✅ Yes |
| Google/Gemini | gemini-3-pro-preview | ✅ Yes |
| DeepSeek-AI¹ | DeepSeek-R1 | ✅ Yes |
| DeepSeek-AI¹ | DeepSeek-V3 | ❌ No |
| Moonshot-AI¹ | Kimi-K2-Instruct-0905 | ❌ No |
| Qwen¹ | Qwen3-235B-A22B-Instruct-2507 | ❌ No |
| Qwen² | qwen3:32b | ✅ Yes |
| Google² | gemma3:27b | ❌ No |
| Mistral² | mistral-small3.2:24b | ❌ No |

¹ via the together.ai API
² via Ollama on a local machine
Batch APIs are currently available for OpenAI, Anthropic, and Gemini only. Models accessed via Together.ai and Ollama are supported for live comparisons via `submit_llm_pairs()` / `llm_compare_pair()`; a live-only sketch follows the backend table below.
| Backend | Live | Batch |
|---|---|---|
| openai | ✅ | ✅ |
| anthropic | ✅ | ✅ |
| gemini | ✅ | ✅ |
| together | ✅ | ❌ |
| ollama | ✅ | ❌ |
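Since Together.ai and Ollama backends are live-only, a run against a local model uses the same live call demonstrated in the usage sections below. A minimal sketch, assuming an Ollama server is running locally, the `qwen3:32b` model has already been pulled, and `pairs` is a set of comparison pairs built with `make_pairs()` as shown later:

``` r
# Live pairwise comparisons against a local Ollama model (no batch API)
td   <- trait_description("overall_quality")
tmpl <- get_prompt_template("default")

res_local <- submit_llm_pairs(
  pairs = pairs,
  backend = "ollama",
  model = "qwen3:32b",
  trait_name = td$name,
  trait_description = td$description,
  prompt_template = tmpl
)
```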
Once the package is available on CRAN, install with:
install.packages("pairwiseLLM")To install the development version from GitHub:
# install.packages("pak")
pak::pak("shmercer/pairwiseLLM")Load the package:
library(pairwiseLLM)At a high level, pairwiseLLM workflows follow this
structure:
{TRAIT_NAME}, {TRAIT_DESCRIPTION},
{SAMPLE_1}, {SAMPLE_2}.The package provides helpers for each step.
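Pulling those steps together, here is a compact end-to-end sketch using the same functions demonstrated in the sections below (an OpenAI API key is assumed to be configured):

``` r
library(pairwiseLLM)

# 1. Build and sample pairs from the bundled example data
data("example_writing_samples")
pairs <- example_writing_samples |>
  make_pairs() |>
  sample_pairs(10, seed = 42) |>
  randomize_pair_order()

# 2. Choose a trait and prompt template
td   <- trait_description("overall_quality")
tmpl <- get_prompt_template("default")

# 3. Submit the comparisons live
res <- submit_llm_pairs(
  pairs = pairs,
  backend = "openai",
  model = "gpt-4o",
  trait_name = td$name,
  trait_description = td$description,
  prompt_template = tmpl
)

# 4. Fit a Bradley-Terry model to the judgments
bt_fit <- fit_bt_model(build_bt_data(res))
summarize_bt_fit(bt_fit)
```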
Use the unified API:
- `llm_compare_pair()` — compare one pair
- `submit_llm_pairs()` — compare many pairs at once

Example:

``` r
data("example_writing_samples")
pairs <- example_writing_samples |>
make_pairs() |>
sample_pairs(5, seed = 123) |>
randomize_pair_order()
td <- trait_description("overall_quality")
tmpl <- get_prompt_template("default")
res <- submit_llm_pairs(
pairs = pairs,
backend = "openai",
model = "gpt-4o",
trait_name = td$name,
trait_description = td$description,
prompt_template = tmpl
)
```

Large-scale runs use:
- `llm_submit_pairs_batch()`
- `llm_download_batch_results()`

Example:

``` r
batch <- llm_submit_pairs_batch(
backend = "anthropic",
model = "claude-sonnet-4-5",
pairs = pairs,
trait_name = td$name,
trait_description = td$description,
prompt_template = tmpl
)
results <- llm_download_batch_results(batch)
```

pairwiseLLM reads keys only from environment variables. Keys are never printed, never stored, and never written to disk.
You can verify which providers are available using:
``` r
check_llm_api_keys()
```

This returns a tibble showing whether R can see the required keys for OpenAI, Anthropic, Gemini, and Together.ai.
You may set keys temporarily for the current R session:

``` r
Sys.setenv(OPENAI_API_KEY = "your-key-here")
Sys.setenv(ANTHROPIC_API_KEY = "your-key-here")
Sys.setenv(GEMINI_API_KEY = "your-key-here")
Sys.setenv(TOGETHER_API_KEY = "your-key-here")
```

…but for normal use and for reproducible analyses, it is strongly recommended to store them in your `~/.Renviron` file.
Open your `~/.Renviron` file:

``` r
usethis::edit_r_environ()
```

Add the following lines:

```
OPENAI_API_KEY="your-openai-key"
ANTHROPIC_API_KEY="your-anthropic-key"
GEMINI_API_KEY="your-gemini-key"
TOGETHER_API_KEY="your-together-key"
```

Save the file, then restart R.
You can confirm that R now sees the keys:
``` r
check_llm_api_keys()
```

pairwiseLLM includes:

- several built-in prompt templates
- `register_prompt_template()` for registering custom templates
- `list_prompt_templates()` and `get_prompt_template()` for listing and retrieving templates

``` r
list_prompt_templates()
#> [1] "default" "test1" "test2" "test3" "test4" "test5"
```

``` r
tmpl <- get_prompt_template("default")
cat(substr(tmpl, 1, 400), "...\n")
#> You are a debate adjudicator. Your task is to weigh the comparative strengths of two writing samples regarding a specific trait.
#>
#> TRAIT: {TRAIT_NAME}
#> DEFINITION: {TRAIT_DESCRIPTION}
#>
#> SAMPLES:
#>
#> === SAMPLE_1 ===
#> {SAMPLE_1}
#>
#> === SAMPLE_2 ===
#> {SAMPLE_2}
#>
#> EVALUATION PROCESS (Mental Simulation):
#>
#> 1. **Advocate for SAMPLE_1**: Mentally list the single strongest point of evidence that makes SAMPLE_1 the ...
```

To register a custom template:

``` r
register_prompt_template("my_template", "
Compare two essays for {TRAIT_NAME}…
{TRAIT_NAME} is defined as {TRAIT_DESCRIPTION}.
SAMPLE 1:
{SAMPLE_1}
SAMPLE 2:
{SAMPLE_2}
<BETTER_SAMPLE>SAMPLE_1</BETTER_SAMPLE> or
<BETTER_SAMPLE>SAMPLE_2</BETTER_SAMPLE>
")Use it in a submission:
tmpl <- get_prompt_template("my_template")Traits define what “quality” means.
trait_description("overall_quality")
#> $name
#> [1] "Overall Quality"
#>
#> $description
#> [1] "Overall quality of the writing, considering how well ideas are expressed,\n how clearly the writing is organized, and how effective the language and\n conventions are."You can also provide custom traits:
trait_description(
custom_name = "Clarity",
custom_description = "How understandable, coherent, and well structured the ideas are."
)
```
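A custom trait plugs into a submission the same way as a built-in one. A minimal sketch, assuming the custom call returns the same `$name` / `$description` structure shown above for the built-in trait:

``` r
# Hypothetical illustration: compare the same pairs on a custom "Clarity" trait
td_clarity <- trait_description(
  custom_name = "Clarity",
  custom_description = "How understandable, coherent, and well structured the ideas are."
)

res_clarity <- submit_llm_pairs(
  pairs = pairs,
  backend = "openai",
  model = "gpt-4o",
  trait_name = td_clarity$name,
  trait_description = td_clarity$description,
  prompt_template = get_prompt_template("default")
)
```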
LLMs often show a first-position or second-position bias. pairwiseLLM includes explicit tools for testing this.
``` r
pairs_fwd <- make_pairs(example_writing_samples)
pairs_rev <- sample_reverse_pairs(pairs_fwd, reverse_pct = 1.0)
```

Submit:

``` r
res_fwd <- submit_llm_pairs(pairs_fwd, model = "gpt-4o", backend = "openai", ...)
res_rev <- submit_llm_pairs(pairs_rev, model = "gpt-4o", backend = "openai", ...)
```

Compute bias:

``` r
cons <- compute_reverse_consistency(res_fwd, res_rev)
bias <- check_positional_bias(cons)
cons$summary
bias$summary
```

Five included templates have been tested across different backend providers. Complete details are presented in the vignette: `vignette("prompt-template-bias")`.
Fit a Bradley-Terry model to the comparison results:

``` r
bt_data <- build_bt_data(res)
bt_fit <- fit_bt_model(bt_data)
summarize_bt_fit(bt_fit)
```

Fit an Elo model to the same results:

``` r
# res: output from submit_llm_pairs() / llm_submit_pairs_batch()
elo_data <- build_elo_data(res)
elo_fit <- fit_elo_model(elo_data, runs = 5)
elo_fit$elo
elo_fit$reliability
elo_fit$reliability_weighted
```

| Workflow | Use Case | Functions |
|---|---|---|
| Live | small or interactive runs | submit_llm_pairs, llm_compare_pair |
| Batch | large jobs, cost control | llm_submit_pairs_batch, llm_download_batch_results |
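For large jobs, the batch workflow feeds directly into the same modeling functions. A minimal sketch, reusing the `pairs`, `td`, and `tmpl` objects created earlier in this README:

``` r
# Batch submission (OpenAI batch API), then Elo scoring of the downloaded results
batch <- llm_submit_pairs_batch(
  backend = "openai",
  model = "gpt-4o",
  pairs = pairs,
  trait_name = td$name,
  trait_description = td$description,
  prompt_template = tmpl
)

results <- llm_download_batch_results(batch)

elo_fit <- fit_elo_model(build_elo_data(results), runs = 5)
elo_fit$elo
```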
Contributions to pairwiseLLM are very welcome!
If you encounter a problem:
Run:

``` r
devtools::session_info()
```

Include its output in your report, then open an issue at:
https://github.com/shmercer/pairwiseLLM/issues
MIT License. See LICENSE.
Mercer, S. H. (2025). pairwiseLLM: Pairwise writing quality comparisons with large language models (Version 1.0.0) [R package; Computer software]. https://github.com/shmercer/pairwiseLLM