pairwiseLLM provides a unified workflow for generating and analyzing pairwise comparisons of writing quality using LLM APIs (OpenAI, Anthropic, Gemini, Together) and local models via Ollama.
A typical workflow: check your API keys, load writing samples, create and sample pairs, randomize within-pair order, choose a trait and prompt template, submit the pairs to an LLM, and convert the results into a Bradley-Terry dataset for modeling.

For prompt evaluation and positional-bias diagnostics, see vignette("prompt-template-bias").

For advanced batch-processing workflows, see the batch processing vignette.
pairwiseLLM reads provider keys only from
environment variables, never from R options or global
variables.
| Provider | Environment Variable |
|---|---|
| OpenAI | OPENAI_API_KEY |
| Anthropic | ANTHROPIC_API_KEY |
| Gemini | GEMINI_API_KEY |
| Together | TOGETHER_API_KEY |
You should put these in your ~/.Renviron:
OPENAI_API_KEY="sk-..."
ANTHROPIC_API_KEY="..."
GEMINI_API_KEY="..."
TOGETHER_API_KEY="..."
Check which keys are available:
library(pairwiseLLM)
check_llm_api_keys()
#> All known LLM API keys are set: OPENAI_API_KEY, ANTHROPIC_API_KEY, GEMINI_API_KEY, TOGETHER_API_KEY.
#> # A tibble: 4 × 4
#> backend service env_var has_key
#> <chr> <chr> <chr> <lgl>
#> 1 openai OpenAI OPENAI_API_KEY TRUE
#> 2 anthropic Anthropic ANTHROPIC_API_KEY TRUE
#> 3 gemini Google Gemini GEMINI_API_KEY TRUE
#> 4 together Together.ai TOGETHER_API_KEY TRUE
Ollama runs locally and does not require an API key; it only requires that the Ollama server is running.
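A quick way to confirm the server is up (this is not a package function, just an httr2 ping of Ollama's default port, 11434):

# Ping the local Ollama server; NULL means it isn't reachable.
resp <- tryCatch(
  httr2::req_perform(httr2::request("http://localhost:11434/api/tags")),
  error = function(e) NULL
)
if (is.null(resp)) message("Start the Ollama server first.") else message("Ollama is running.")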
The package ships with 20 authentic student writing samples:
data("example_writing_samples", package = "pairwiseLLM")
dplyr::slice_head(example_writing_samples, n = 3)
#> # A tibble: 3 × 3
#> ID text quality_score
#> <chr> <chr> <int>
#> 1 S01 "Writing assessment is hard. People write different thing… 1
#> 2 S02 "It is hard to grade writing. Some are long and some are … 2
#> 3 S03 "Assessing writing is difficult because everyone writes d… 3

Each sample has:

- ID
- text
- quality_score

Create all unordered pairs:
pairs <- example_writing_samples |>
make_pairs()
dplyr::slice_head(pairs, n = 5)
#> # A tibble: 5 × 4
#> ID1 text1 ID2 text2
#> <chr> <chr> <chr> <chr>
#> 1 S01 "Writing assessment is hard. People write different things.… S02 "It …
#> 2 S01 "Writing assessment is hard. People write different things.… S03 "Ass…
#> 3 S01 "Writing assessment is hard. People write different things.… S04 "Gra…
#> 4 S01 "Writing assessment is hard. People write different things.… S05 "Wri…
#> 5 S01 "Writing assessment is hard. People write different things.… S06 "It …

Sample a subset of pairs:
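One simple approach is a random subset via dplyr (an illustrative sketch; the object name pairs_small and n = 10 are arbitrary choices, and the package may offer its own sampling helper):

# Draw a random subset of pairs to keep API costs down
set.seed(123)
pairs_small <- pairs |>
  dplyr::slice_sample(n = 10)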
Randomize SAMPLE_1 / SAMPLE_2 order:
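pairwiseLLM may ship its own randomization helper; as a base-R sketch, you can swap the two samples in a random half of the rows so neither sample is systematically presented first:

# Swap ID/text columns for a random half of the rows
set.seed(456)
swap <- sample(c(TRUE, FALSE), nrow(pairs_small), replace = TRUE)
pairs_small[swap, c("ID1", "text1", "ID2", "text2")] <-
  pairs_small[swap, c("ID2", "text2", "ID1", "text1")]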
td <- trait_description("overall_quality")
td
#> $name
#> [1] "Overall Quality"
#>
#> $description
#> [1] "Overall quality of the writing, considering how well ideas are expressed,\n how clearly the writing is organized, and how effective the language and\n conventions are."Or define your own:
Load default prompt:
tmpl <- set_prompt_template()
cat(substr(tmpl, 1, 300))
#> You are a debate adjudicator. Your task is to weigh the comparative strengths of two writing samples regarding a specific trait.
#>
#> TRAIT: {TRAIT_NAME}
#> DEFINITION: {TRAIT_DESCRIPTION}
#>
#> SAMPLES:
#>
#> === SAMPLE_1 ===
#> {SAMPLE_1}
#>
#> === SAMPLE_2 ===
#> {SAMPLE_2}
#>
#> EVALUATION PROCESS (Mental Simulation):
#>
#> 1. **Ad

Placeholders required:

- {TRAIT_NAME}
- {TRAIT_DESCRIPTION}
- {SAMPLE_1}
- {SAMPLE_2}

Load a template from file:
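For instance, with base R (my_template.txt is a hypothetical file containing all four placeholders):

# Read a custom prompt template from a plain-text file
tmpl <- paste(readLines("my_template.txt"), collapse = "\n")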
The unified wrapper works for OpenAI, Anthropic, Gemini, Together, and Ollama.
res_live <- submit_llm_pairs(
pairs = pairs_small,
backend = "openai", # also "anthropic", "gemini", "together", "ollama"
model = "gpt-4o",
trait_name = td$name,
trait_description = td$description,
prompt_template = tmpl
)

Preview results:
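For example:

# Peek at the first few comparisons
dplyr::slice_head(res_live, n = 3)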
Each row includes:
- pair_id
- sample1_id, sample2_id
- the <BETTER_SAMPLE> tag → better_sample and better_id

Convert LLM output to a 3-column BT dataset:
# res_live: output from submit_llm_pairs()
bt_data <- build_bt_data(res_live)
dplyr::slice_head(bt_data, n = 5)

You can also build a dataset for Elo modeling.
Fit model:
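The package's own fitting code isn't reproduced here. As a minimal sketch of what fitting involves, a Bradley-Terry model can be estimated with plain logistic regression on ability differences; the column names ID1, ID2, and win1 (1 if the first sample was judged better) are assumptions, so check names(bt_data) for the actual ones.

# Hand-rolled Bradley-Terry fit via logistic regression (illustrative)
ids <- sort(unique(c(bt_data$ID1, bt_data$ID2)))
X <- matrix(0, nrow(bt_data), length(ids), dimnames = list(NULL, ids))
X[cbind(seq_len(nrow(bt_data)), match(bt_data$ID1, ids))] <-  1
X[cbind(seq_len(nrow(bt_data)), match(bt_data$ID2, ids))] <- -1
X <- X[, -1, drop = FALSE]           # anchor the first sample's ability at 0
fit <- glm(bt_data$win1 ~ X - 1, family = binomial())
abilities <- setNames(c(0, coef(fit)), ids)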
Summarize results:
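Continuing the sketch above, you can rank the samples by estimated ability and sanity-check the ranking against the human quality_score bundled with the example data:

# Rank samples by estimated ability (higher = better writing)
sort(abilities, decreasing = TRUE)

# Spearman correlation with human ratings, over samples that
# actually appear in bt_data
common <- intersect(names(abilities), example_writing_samples$ID)
cor(abilities[common],
    example_writing_samples$quality_score[match(common, example_writing_samples$ID)],
    method = "spearman")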
The output includes an estimated quality score for each sample.
Most users will use the unified interface, but backend-specific helpers are available:
- OpenAI: submit_openai_pairs_live(), build_openai_batch_requests(), run_openai_batch_pipeline(), parse_openai_batch_output()
- Anthropic: submit_anthropic_pairs_live(), build_anthropic_batch_requests(), run_anthropic_batch_pipeline(), parse_anthropic_batch_output()
- Gemini: submit_gemini_pairs_live(), build_gemini_batch_requests(), run_gemini_batch_pipeline(), parse_gemini_batch_output()
- Together: together_compare_pair_live(), submit_together_pairs_live()
- Ollama: ollama_compare_pair_live(), submit_ollama_pairs_live()
- Utilities: check_llm_api_keys()
Tips:

- Use the default template or set include_thoughts = FALSE.
- Use batch APIs for >40 pairs.
- Use compute_reverse_consistency() + check_positional_bias() (see vignette("prompt-template-bias") for a full example).
Mercer, S. (2025). Getting started with pairwiseLLM (Version 1.0.0) [R package vignette]. In pairwiseLLM: Pairwise Comparison Tools for Large Language Model-Based Writing Evaluation. https://shmercer.github.io/pairwiseLLM/