The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.
Fast multi-pattern string matching in R with the Aho-Corasick algorithm.
ahocorasick, powered by the Rust aho-corasick
crate, allows you to search for many substrings in one or more strings
with linear complexity in R. It builds a reusable automaton once and
then uses it for detection, counting, locating, and extraction and
replacement.
Aho-Corasick
is a multiple-pattern string matching algorithm designed to search for
many keywords in a text in a single pass. After preprocessing the
pattern set into an automaton, it finds all matches in
O(text_length + number_of_matches), instead of the naive
O(text_length × number_of_patterns) approach. This makes it
especially suitable for high-throughput tasks such as dictionary
matching, keyword extraction, rule-based text filtering, and large-scale
log or document analysis.
For more details on the Aho-Corasick algorithm, please see Aho and Corasick (1975).
Install the released version from CRAN:
install.packages("ahocorasick")Install the released version from R-universe/R-multiverse:
install.packages("ahocorasick", repos = "https://yousa-mirage.r-universe.dev")
# or
install.packages("ahocorasick", repos = "https://community.r-multiverse.org")Install the development version from source. You need Cargo and
rustc on PATH.
# install.packages("pak")
pak::pak("Yousa-Mirage/r-ahocorasick")Start with a character vector of patterns you want to search for,
then compile it into an <ac_automaton> object by
calling ac_build() that can be reused across many
documents.
library(ahocorasick)
patterns <- c("hello", "world", "fish")
doc <- c(
"this is my first hello world. hello!",
"nothing to see",
"fish and chips"
)
ac <- ac_build(patterns)You can set ascii_case_insensitive = TRUE to make the
search case-insensitive for ASCII characters (a-z and
A-Z) only.
ac_extract() returns one data frame per document, with
both the matched text and the pattern value that produced each
match.
patterns <- c("hello", "world", "fish")
doc <- c(
"this is my first hello world. hello!",
"nothing to see",
"fish and chips"
)
ac <- ac_build(patterns)
hits <- ac_extract(ac, doc)
hits[[1]]
#> matches patterns
#> 1 hello hello
#> 2 world world
#> 3 hello helloUse ac_extract_df() when you want one data frame for all
documents.
ac_extract_df(ac, doc)
#> doc_id matches patterns
#> 1 1 hello hello
#> 2 1 world world
#> 3 1 hello hello
#> 4 3 fish fishSet overlapping = TRUE to return overlapping matches.
This is only supported when the automaton was built with the default
match_kind = "standard".
ac_overlap <- ac_build(c("aba", "bab"))
ac_extract_df(ac_overlap, "ababa", overlapping = TRUE)
#> doc_id matches patterns
#> 1 1 aba aba
#> 2 1 bab bab
#> 3 1 aba abaac_locate() returns one data frame per document. Offsets
are 1-based character positions, inclusive on both sides, so they can be
used with substr().
patterns <- c("hello", "world", "fish")
doc <- c(
"this is my first hello world. hello!",
"nothing to see",
"fish and chips"
)
ac <- ac_build(patterns)
hits <- ac_locate(ac, doc)
hits[[1]]
#> pattern_id start end
#> 1 1 18 22
#> 2 2 24 28
#> 3 1 31 35Use ac_locate_df() when you want one data frame for all
documents.
ac_locate_df(ac, doc)
#> doc_id pattern_id start end
#> 1 1 1 18 22
#> 2 1 2 24 28
#> 3 1 1 31 35
#> 4 3 3 1 4Use ac_locate_bytes() when byte offsets are more useful
than R character positions. Byte offsets are 0-based and
byte_end is end-exclusive.
Use ac_detect() when you only need to know whether a
document contains at least one match, and ac_count() when
you need the number of matches.
patterns <- c("hello", "world", "fish")
doc <- c(
"this is my first hello world. hello!",
"nothing to see",
"fish and chips"
)
ac <- ac_build(patterns)
ac_detect(ac, doc)
#> [1] TRUE FALSE TRUE
ac_count(ac, doc)
#> [1] 3 0 1These functions are usually the cheapest way to answer yes/no or frequency questions without materializing the matched strings or offsets.
Use ac_replace() to replace every non-overlapping match
with the replacement for the matched pattern.
patterns <- c("fox", "brown", "quick")
doc <- "The quick brown fox."
ac <- ac_build(patterns)
ac_replace(ac, doc, c("sloth", "grey", "slow"))
#> [1] "The slow grey sloth."Use one replacement string to replace all patterns with the same value.
ac_replace(ac, doc, "")
#> [1] "The ."Replacement uses the automaton’s match semantics. If you want Polars
replace_many(..., leftmost = TRUE)-style priority, build
the automaton with match_kind = "leftmost_first".
ac_leftmost <- ac_build(
c("append", "appendage", "app"),
match_kind = "leftmost_first"
)
ac_replace(ac_leftmost, "append the app to the appendage", c("x", "y", "z"))
#> [1] "x the z to the xage"The ac_*_file() family applies the same automaton to one
or more files. Pass a vector of file paths and each function returns
results aligned with those files.
ac_files <- ac_build(c("hello", "world"))
paths <- c(tempfile(fileext = ".txt"), tempfile(fileext = ".txt"))
writeLines("hello world", paths[[1]])
writeLines("fish and chips", paths[[2]])
ac_detect_file(ac_files, paths)
#> [1] TRUE FALSE
ac_count_file(ac_files, paths)
#> [1] 2 0
ac_extract_file(ac_files, paths)
#> [[1]]
#> matches patterns
#> 1 hello hello
#> 2 world world
#>
#> [[2]]
#> [1] matches patterns
#> <0 rows> (or 0-length row.names)ac_detect_file(), ac_count_file(),
ac_extract_file(), and ac_replace_file()
support stream = TRUE when the automaton was built with the
default match_kind = "standard". This is useful for large
files when you want to avoid reading the whole file into memory at
once.
ac_locate_file() does not support
stream = TRUE. It returns R character offsets instead of
raw byte offsets, and converting streamed byte positions back into
character positions would require another pass over the file. Keeping
file location search non-streaming makes the behavior simpler and easier
to reason about.
These search functions work naturally inside
dplyr::mutate(). Scalar outputs like
ac_detect(), ac_count(), and
ac_replace() become ordinary columns, while
ac_extract() and ac_locate() can be stored as
list-columns.
patterns <- c("hello", "world", "fish")
ac <- ac_build(patterns)
docs <- tibble::tibble(
doc = c(
"this is my first hello world. hello!",
"nothing to see",
"fish and chips"
)
)
docs |>
dplyr::mutate(
detected = ac_detect(ac, doc),
n_matches = ac_count(ac, doc),
replaced = ac_replace(ac, doc, "[match]"),
extracted = ac_extract(ac, doc)
) |>
tidyr::unnest(extracted)
#> # A tibble: 4 × 6
#> doc detected n_matches replaced matches patterns
#> <chr> <lgl> <int> <chr> <chr> <chr>
#> 1 this is my first hello world. he… TRUE 3 this is… hello hello
#> 2 this is my first hello world. he… TRUE 3 this is… world world
#> 3 this is my first hello world. he… TRUE 3 this is… hello hello
#> 4 fish and chips TRUE 1 [match]… fish fishSearch functions let you choose how NA documents
behave.
doc_na <- c("hello", NA_character_, "world")
ac_na <- ac_build(c("hello", "world"))
ac_detect(ac_na, doc_na, na = "false")
#> [1] TRUE FALSE TRUE
ac_count(ac_na, doc_na, na = "zero")
#> [1] 1 0 1
ac_replace(ac_na, doc_na, "[match]", na = "empty")
#> [1] "[match]" "" "[match]"For list-column workflows, ac_locate(..., na = "empty")
and ac_extract(..., na = "empty") treat missing documents
as no matches.
ac_build() exposes three match kinds from the Rust
library.
standard (the default)The default, match_kind = "standard", returns the match
that finishes first. It is also the only mode that supports overlapping
search.
patterns <- c("content", "disco", "disc", "discontent", "winter")
haystack <- "This is the winter of my discontent"
ac <- ac_build(patterns)
ac_extract_df(ac, haystack)
#> doc_id matches patterns
#> 1 1 winter winter
#> 2 1 disc disc
ac_extract_df(ac, haystack, overlapping = TRUE)
#> doc_id matches patterns
#> 1 1 winter winter
#> 2 1 disc disc
#> 3 1 disco disco
#> 4 1 discontent discontent
#> 5 1 content contentleftmost_firstmatch_kind = "leftmost_first" returns the leftmost
non-overlapping match. If several patterns start at the same position,
the earlier pattern wins.
ac <- ac_build(
c("disco", "disc"),
match_kind = "leftmost_first"
)
ac_extract_df(ac, "discontent")
#> doc_id matches patterns
#> 1 1 disco discoleftmost_longestmatch_kind = "leftmost_longest" returns the leftmost
non-overlapping match. If several patterns start at the same position,
the longest pattern wins.
ac <- ac_build(
c("disco", "disc", "discontent"),
match_kind = "leftmost_longest"
)
ac_extract_df(ac, "discontent")
#> doc_id matches patterns
#> 1 1 discontent discontentimplementation controls the underlying automaton
implementation:
"auto" lets the Rust crate choose a heuristic
default."noncontiguous_nfa" is fast to build and uses moderate
memory."contiguous_nfa" uses memory efficiently and is usually
faster to search than a noncontiguous NFA."dfa" can be fastest to search, but may be slow to
build and can use much more memory.ac <- ac_build(
c("disco", "disc"),
implementation = "dfa"
)
ac_info(ac)$implementation
#> [1] "dfa"See the benchmark
article for results on English, Chinese, and mixed-text workloads.
In the current benchmark, ahocorasick is
fastest for detect and count
across the tested cases. The ac_extract_df() is also the
fastest for extract while preserving a
tidy long data-frame result.
As with any benchmark, real-world results will differ based on your particular situation. If performance is important to your application, measure the alternatives yourself!
Packages that developed to handle multiple string matching in R are less than in Python. These are what I found:
AhoCorasickTrie:
Implemented in C++. It only supports 128 ASCII characters and currently
does not support UTF-8/Unicode (such as CJK and emojis).Biostrings:
It is a special tool for bioinformatics, and its object system is
XString / DNAString / XStringSet.
It is not suitable as a drop-in multi-mode retrieval tool for general R
string vectors.r-polars:
Polars’ string expressions also use the Aho-Corasick algorithm to match
(based on the same crate as
this package). If your data is already in a DataFrame, I
recommend you use Polars for matching strings directly.It’s great to see that Rust is providing more and more modern, safe, high-performance open source tools for R (and also for Python) :)
Inspired by the Python package ahocorasick_rs.
Thanks for the excellent Rust aho-corasick
crate.
MIT © 2026 Yousa Mirage
These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.