textpress is an R toolkit for building text corpora and
searching them – no custom object classes, just plain data frames from
start to finish. It covers the full arc from URL to retrieved passage
through a consistent four-step API: Fetch,
Read, Process,
Search. Traditional tools (KWIC, BM25, dictionary
matching) sit alongside modern ones (semantic search, LLM-ready
chunking), all composing cleanly with `|>`.
## Installation

From CRAN:

```r
install.packages("textpress")
```

Development version:

```r
remotes::install_github("jaytimm/textpress")
```

## textpress API

Conventions: a corpus is a data frame with a `text` column plus identifier
column(s) passed to `by` (default `doc_id`). All outputs are plain data
frames or data.tables; pipe-friendly.
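To make the four-step arc concrete, here is a minimal end-to-end sketch using the functions documented below. The query string and the `url` column name in the fetch output are illustrative assumptions, not documented behavior.

```r
library(textpress)

# Sketch: URL discovery -> scrape -> sentence corpus, all plain data frames.
# The query and the `url` column name are assumptions for illustration.
candidates <- fetch_urls(query = "solar geoengineering", n_pages = 1)

corpus <- read_urls(candidates$url)   # list(text, meta)

sentences <- corpus$text |>
  nlp_split_paragraphs() |>
  nlp_split_sentences()
```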
### Fetch (`fetch_*`)

Find URLs and metadata – not full text. Pass results to `read_urls()` to get content.

- `fetch_urls(query, n_pages, date_filter)` – Search-engine query; returns candidate URLs with metadata.
- `fetch_wiki_urls(query, limit)` – Wikipedia article URLs matching a search phrase.
- `fetch_wiki_refs(url, n)` – External citation URLs from a Wikipedia article’s References section.

### Read (`read_*`)

Scrape and parse URLs into a structured corpus.
- `read_urls(urls, ...)` – Character vector of URLs → `list(text, meta)`. `text` is one row per node (headings, paragraphs, lists); `meta` is one row per URL. For Wikipedia, `exclude_wiki_refs = TRUE` drops References / See also / Bibliography sections.

### Process (`nlp_*`)

Prepare text for search or indexing.
- `nlp_split_paragraphs()` – Break documents into structural blocks.
- `nlp_split_sentences()` – Segment blocks into individual sentences.
- `nlp_tokenize_text()` – Normalize text into a clean token stream.
- `nlp_index_tokens()` – Build a weighted BM25 index for ranked retrieval.
- `nlp_roll_chunks()` – Roll sentences into fixed-size chunks with surrounding context (RAG-style).

### Search (`search_*`)

Four retrieval modes over the same corpus. Data-first and pipe-friendly.
| Function | Query type | Use case |
|---|---|---|
| `search_regex(corpus, query)` | Regex pattern | Specific strings; KWIC with inline highlighting. |
| `search_dict(corpus, terms)` | Term vector | Exact phrases and MWEs; built-in `dict_generations`, `dict_political`. |
| `search_index(index, query)` | Keywords | BM25 ranked retrieval over a token index. |
| `search_vector(embeddings, query)` | Numeric vector | Semantic nearest-neighbor search; use `util_fetch_embeddings()` to embed. |
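As a sketch of two of these modes over one corpus, assuming `sentences` is a data frame produced by `nlp_split_sentences()` with `doc_id` and `text` columns (the example queries are illustrative):

```r
library(textpress)

# KWIC-style regex search over the sentence corpus.
hits_kwic <- search_regex(sentences, query = "carbon (tax|pricing)")

# BM25: tokenize, build the weighted index once, then query by keyword.
index <- sentences |>
  nlp_tokenize_text() |>
  nlp_index_tokens()

hits_bm25 <- search_index(index, query = "emissions trading scheme")
```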
textpress is designed to compose cleanly into
retrieval-augmented generation pipelines.
**Hybrid retrieval** – run `search_index()` and `search_vector()` over the same chunks, then merge with reciprocal rank fusion (RRF). Chunks that rank well under both term frequency and meaning rise to the top.
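RRF itself is a few lines of base R. A minimal sketch, assuming each result set is a data frame already ordered by relevance and sharing a `chunk_id` column (an assumed column name, not documented output):

```r
# Reciprocal rank fusion: each result list contributes 1 / (k + rank) per
# chunk; summing across lists rewards chunks that rank well in both.
rrf_merge <- function(a, b, k = 60) {
  ranks <- rbind(
    data.frame(chunk_id = a$chunk_id, score = 1 / (k + seq_len(nrow(a)))),
    data.frame(chunk_id = b$chunk_id, score = 1 / (k + seq_len(nrow(b))))
  )
  fused <- aggregate(score ~ chunk_id, data = ranks, FUN = sum)
  fused[order(-fused$score), ]
}
```

Conceptually, `rrf_merge(search_index(...), search_vector(...))` would yield the fused ranking.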
**Context assembly** – `nlp_roll_chunks()` with `context_size > 0` gives each chunk a focal sentence plus surrounding context, so retrieved passages are self-contained when passed to an LLM.
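A sketch of that assembly step, assuming a sentence-level data frame as input and that `context_size` counts sentences of context on either side:

```r
library(textpress)

# Each output row carries a focal sentence plus one sentence of context
# on either side, so a retrieved chunk reads as a self-contained passage.
chunks <- sentences |>
  nlp_roll_chunks(context_size = 1)
```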
**Agent tool-calling** – the consistent API and plain data-frame outputs map naturally to tool use:
| Agent task | Function |
|---|---|
| “Find recent articles on X” | `fetch_urls()` |
| “Scrape these pages” | `read_urls()` |
| “Find all mentions of these entities” | `search_dict()` |
| “Follow citations from this Wikipedia article” | `fetch_wiki_refs()` |
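A hypothetical tool wrapper for an agent framework might look like the following; only `search_dict()` is from the package, while the helper name and the JSON serialization via jsonlite are assumptions:

```r
# Hypothetical agent tool: entity-mention lookup returning row-wise JSON.
# Assumes the jsonlite package is available.
find_mentions <- function(corpus, entities) {
  hits <- textpress::search_dict(corpus, terms = entities)
  jsonlite::toJSON(hits, dataframe = "rows")
}
```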
## Demos

- `fetch_urls()` + `read_urls()`
- `fetch_wiki_urls()` + `fetch_wiki_refs()`
- `search_regex()`, KWIC
- `search_dict()`, PMI co-occurrence

## License

MIT © Jason Timm

## Citation

```r
citation("textpress")
```

Report bugs or request features at https://github.com/jaytimm/textpress/issues.