The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.

llm-join: Fuzzy Dataframe Join in R with LLMs

Introduction

llm-join is an R package designed to use Large Language Models (LLMs, such as GPT-5, Claude, DeepSeek, etc.) for fuzzy joining of dataframes. When the key columns of two dataframes have spelling differences, are in different languages, or cannot be matched exactly, llm-join can automatically generate prompts and utilize LLMs to assist in high-quality joining.

Installation

You can install the development version of llmjoin from GitHub with:

devtools::install_github("evanliu3594/llmjoin")

Usage

1. setup your LLM services.

Please note that all information is stored strictly locally in your system configuration (run tools::R_user_dir("llmjoin", "config") to see the full path), and is never uploaded or shared.

library(llmjoin)

# OpenAI
set_llm(provider = "openai", key = "your-api-key")

# Claude (Anthropic)
set_llm(provider = "claude", key = "your-api-key")

# Gemini (via OpenAI-compatible endpoint)
set_llm(provider = "gemini", key = "your-api-key")

# Custom endpoint (Ollama, DeepSeek, Kimi, Grok, Qwen, etc.)
set_llm(provider = "openai",
        url = "https://your-custom-endpoint/v1/chat/completions",
        key = "your-api-key",
        model = "your-model-name")

2. use LLM-JOIN

Below examples used Deepseek-V4-Flash.

Example 1: Numbers ↔︎ Months matching

Match numeric month codes to month names — the LLM understands that “01” means January.

x <- data.frame(id = c("01", "02", "04"), value = c(10, 20, 40))
y <- data.frame(month = c("January", "Feb", "May"), amount = c(100, 200, 400))

llm_join(x, y, key1 = "id", key2 = "month")
#     month id value amount
# 1     Feb 02    20    200
# 2 January 01    10    100
# 3    <NA> 04    40     NA

Example 2: Fuzzy number matching

Match approximate or differently-formatted numeric identifiers — the LLM handles rounding, unit conversion, and format differences.

left <- data.frame(
  weight_kg = c(1.0, 2.5, 5.0),
  product   = c("Widget", "Gadget", "Thing")
)
right <- data.frame(
  weight_lb = c("2.2 lb", "5.5 lb", "11 lb"),
  price = c(4.99, 9.99, 19.99)
)

llm_join(left, right, key1 = "weight_kg", key2 = "weight_lb")
#   weight_lb weight_kg product price
# 1     11 lb       5.0   Thing 19.99
# 2    2.2 lb       1.0  Widget  4.99
# 3    5.5 lb       2.5  Gadget  9.99

Example 3: Country name ↔︎ code matching

Match country names to ISO codes — the LLM bridges different naming conventions, languages, and abbreviations.

left <- data.frame(
  country = c("China", "United States", "Germany", "日本"),
  sales = c(1500, 3200, 2100, 800)
)
right <- data.frame(
  code = c("CN", "US", "DE", "JP"),
  region = c("Asia", "Americas", "Europe", "Asia")
)

llm_join(left, right, key1 = "country", key2 = "code")
#   code       country sales   region
# 1   CN         China  1500     Asia
# 2   DE       Germany  2100   Europe
# 3   JP          日本   800     Asia
# 4   US United States  3200 Americas

Or if you don’t want to setup LLM services in R

x <- data.frame(id = c("01", "02", "04"), value = c(10, 20, 40))
y <- data.frame(month = c("January", "Feb", "May"), amount = c(100, 200, 400))

joint_prompt(unique(x["id"]), unique(y["month"])) |> writeClipboard()

Paste the prompts to ask your LLM model, and copy the answer, going back to R and continue run:

joint <- parse_joint(readr::clipboard(), key1 = "id", key2 = "month")

Reduce(\(x, y) merge(x, y, all.x = TRUE), list(x, joint, y))
#     month id value amount
# 1     Feb 02    20    200
# 2 January 01    10    100
# 3    <NA> 04    40     NA

License

MIT License

These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.