The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.

Title: Robust Probabilistic Matching for German Company Names
Version: 0.1.2
Description: A pipeline for matching messy company name strings against a clean dictionary (e.g., 'Orbis'). Implements a cascading strategy: Exact -> Fuzzy ('zoomerjoin') -> 'FTS5' ('SQLite') -> Rarity Weighted. References: Beniamino Green (2025) https://beniamino.org/zoomerjoin/; https://www.sqlite.org/fts5.html.
License: MIT + file LICENSE
Encoding: UTF-8
RoxygenNote: 7.3.2
Imports: data.table, stringi, stringdist, zoomerjoin, DBI, RSQLite, cli, progressr, httr, jsonlite, glue, purrr, readr, dplyr
Suggests: testthat
NeedsCompilation: no
Packaged: 2026-02-08 23:26:30 UTC; giulianetinginfrati
Author: Giulian Etingin-Frati [aut, cre]
Maintainer: Giulian Etingin-Frati <etingin-frati@kof.ethz.ch>
Repository: CRAN
Date/Publication: 2026-02-11 19:50:07 UTC

Internal Azure Chat Completion Wrapper (Custom Endpoint)

Description

Sends a request to a custom Azure-like endpoint (e.g. /openai/v1/responses).

Usage

azure_chat_request(
  system_msg,
  user_msg,
  endpoint,
  api_key,
  deployment,
  api_version = "2024-04-14"
)

Arguments

system_msg

String. The instructions for the LLM.

user_msg

String. The specific case to evaluate.

endpoint

String. Base URL.

api_key

String. API Key.

deployment

String. Model/Deployment name.

api_version

String. API version (unused in this custom path but kept for compatibility).

Value

A character string (the JSON response) or NULL on failure.


Match Company Names against a Dictionary

Description

Runs a cascading matching pipeline: Exact -> Fuzzy (Zoomer) -> FTS5 -> Rarity. Matches found in earlier steps are removed from subsequent steps.

Usage

match_companies(
  queries,
  dictionary,
  query_col = "company_name",
  dict_col = "company_name",
  unique_id_col = "query_id",
  dict_id_col = "orbis_id",
  threshold_jw = 0.8,
  threshold_zoomer = 0.4,
  threshold_rarity = 1,
  n_cores = 1
)

Arguments

queries

Data frame. Must contain columns specified in query_col and unique_id_col.

dictionary

Data frame. Must contain columns specified in dict_col and dict_id_col.

query_col

String. Column name for company names in queries.

dict_col

String. Column name for company names in dictionary.

unique_id_col

String. ID column in queries.

dict_id_col

String. ID column in dictionary.

threshold_jw

Numeric (0-1). Minimum Jaro-Winkler similarity. Default 0.8.

threshold_zoomer

Numeric (0-1). Jaccard threshold for blocking. Default 0.4.

threshold_rarity

Numeric. Minimum score for rarity matching. Default 1.0.

n_cores

Integer. Number of cores (reserved for future parallel implementation).

Value

A data.table containing query_id, dict_id, and match_type.

Examples

# Create sample query data
queries <- data.frame(
  query_id = 1:3,
  company_name = c("BMW", "Siemens AG", "Deutsche Bank")
)

# Create sample dictionary
dictionary <- data.frame(
  orbis_id = c("D001", "D002", "D003"),
  company_name = c("BMW AG", "Siemens Aktiengesellschaft", "Commerzbank AG")
)

# Match companies
results <- match_companies(
  queries = queries,
  dictionary = dictionary,
  query_col = "company_name",
  dict_col = "company_name",
  unique_id_col = "query_id",
  dict_id_col = "orbis_id"
)

print(results)

Normalize Company Names

Description

Standardizes company names by lowercasing, removing legal suffixes, translating characters to ASCII, and removing noise words.

Usage

normalize_company_name(x)

Arguments

x

A character vector of company names.

Value

A character vector of normalized names.

Examples

# Normalize a single company name
normalize_company_name("BMW AG")
normalize_company_name("Siemens GmbH & Co. KG")

# Normalize multiple names
companies <- c("Deutsche Bank AG", "VW Group", "BASF SE")
normalize_company_name(companies)

Validate Matches using LLM (Azure OpenAI)

Description

Sends doubtful matches (not "Perfect" or "Unmatched") to an LLM for verification. Supports resuming from interruptions via chunk files.

Usage

validate_matches_llm(
  data,
  query_name_col,
  dict_name_col,
  output_dir = tempdir(),
  filename_stem = "match_validation",
  batch_size = 20,
  api_key = Sys.getenv("AZURE_API_KEY"),
  endpoint = Sys.getenv("AZURE_ENDPOINT"),
  deployment = Sys.getenv("AZURE_DEPLOYMENT")
)

Arguments

data

Data frame. Must contain the columns specified by query_name_col and dict_name_col.

query_name_col

String. Column containing the user's query name (Employer).

dict_name_col

String. Column containing the dictionary match name (Registry).

output_dir

String. Directory to save temporary chunks and final results. Defaults to tempdir().

filename_stem

String. Base name for output files.

batch_size

Integer. Number of rows to process before saving a chunk.

api_key

String. Azure API Key. Defaults to Sys.getenv("AZURE_API_KEY").

endpoint

String. Azure Endpoint. Defaults to Sys.getenv("AZURE_ENDPOINT").

deployment

String. Deployment name. Defaults to Sys.getenv("AZURE_DEPLOYMENT").

Value

A data frame with added LLM_decision and LLM_reason columns.

Examples

## Not run: 
# Sample matched data
matched_data <- data.frame(
  employer_name = c("BMW", "Siemens"),
  registry_name = c("BMW AG", "SAP SE"),
  dict_id = c("D001", "D002"),
  match_type = c("Fuzzy", "Fuzzy")
)

# Validate using LLM (requires Azure credentials)
validated <- validate_matches_llm(
  data = matched_data,
  query_name_col = "employer_name",
  dict_name_col = "registry_name"
)

print(validated)

## End(Not run)

These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.