
textpress


textpress is an R toolkit for building text corpora and searching them – no custom object classes, just plain data frames from start to finish. It covers the full arc from URL to retrieved passage through a consistent four-step API: Fetch, Read, Process, Search. Traditional tools (KWIC, BM25, dictionary matching) sit alongside modern ones (semantic search, LLM-ready chunking), all composing cleanly with |>.


Installation

From CRAN:

install.packages("textpress")

Development version:

remotes::install_github("jaytimm/textpress")

The textpress API

Conventions: a corpus is a data frame with a text column plus identifier column(s) passed to by (default doc_id). All outputs are plain data frames or data.tables, and everything is pipe-friendly.
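The corpus convention is just base R; a minimal sketch (the example documents and ids are made up):

```r
# A minimal corpus under textpress conventions: a plain data frame
# with a `text` column and a `doc_id` identifier column.
corpus <- data.frame(
  doc_id = c("d1", "d2"),
  text = c(
    "textpress keeps everything in plain data frames.",
    "Search results compose cleanly with the base pipe."
  )
)
str(corpus)
```

Because nothing here is a custom class, any data frame built this way — from a CSV, a database query, or a scrape — can feed directly into the nlp_* and search_* functions.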

1. Fetch (fetch_*)

Find URLs and metadata – not full text. Pass results to read_urls() to get content.

2. Read (read_*)

Scrape and parse URLs into a structured corpus.
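Fetch and Read chain naturally with the pipe. The sketch below is illustrative only: the query argument passed to fetch_urls() is an assumption, not a documented parameter, and the calls need network access, so they are guarded to run interactively with the package installed.

```r
# Illustrative Fetch -> Read pipeline. The `q` argument name is a
# hypothetical placeholder; check the package docs for real parameters.
ok <- requireNamespace("textpress", quietly = TRUE)
if (ok && interactive()) {
  corpus <- textpress::fetch_urls(q = "large language models") |>  # URLs + metadata
    textpress::read_urls()                                         # scrape into a corpus
  head(corpus)
}
```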

3. Process (nlp_*)

Prepare text for search or indexing.

4. Search (search_*)

Four retrieval modes over the same corpus. Data-first, pipe-friendly.

| Function | Query type | Use case |
|---|---|---|
| search_regex(corpus, query) | Regex pattern | Specific strings, KWIC with inline highlighting. |
| search_dict(corpus, terms) | Term vector | Exact phrases and MWEs; built-in dict_generations, dict_political. |
| search_index(index, query) | Keywords | BM25 ranked retrieval over a token index. |
| search_vector(embeddings, query) | Numeric vector | Semantic nearest-neighbor search; use util_fetch_embeddings() to embed. |
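A quick sketch of regex search over a toy corpus, using the search_regex(corpus, query) signature shown above; the call is guarded so the snippet runs even where textpress is not installed, and the pattern is just an example.

```r
# Toy corpus; search_regex() is called only if textpress is available.
corpus <- data.frame(
  doc_id = c("d1", "d2"),
  text = c("The press release went out Monday.",
           "No relevant terms appear in this sentence.")
)
if (requireNamespace("textpress", quietly = TRUE)) {
  hits <- textpress::search_regex(corpus, query = "press\\w*")
  print(hits)  # KWIC-style rows with the match highlighted inline
}
```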

RAG & LLM pipelines

textpress is designed to compose cleanly into retrieval-augmented generation pipelines.

Hybrid retrieval – run search_index() and search_vector() over the same chunks, then merge with reciprocal rank fusion (RRF). Chunks that rank well under both term frequency and meaning rise to the top.
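Reciprocal rank fusion itself needs nothing beyond base R. A minimal sketch, assuming each retriever returns chunk ids in rank order (the id vectors below are made up stand-ins for search_index() and search_vector() output):

```r
# Reciprocal rank fusion: score(d) = sum over rankings of 1 / (k + rank_d),
# where k = 60 is the conventional damping constant.
`%||%` <- function(a, b) if (is.null(a)) b else a

rrf <- function(rankings, k = 60) {
  scores <- list()
  for (ids in rankings) {
    for (i in seq_along(ids)) {
      scores[[ids[i]]] <- (scores[[ids[i]]] %||% 0) + 1 / (k + i)
    }
  }
  out <- data.frame(id = names(scores), score = unlist(scores),
                    row.names = NULL)
  out[order(-out$score), ]
}

bm25_ids <- c("c3", "c1", "c7")  # hypothetical BM25 ranking
vec_ids  <- c("c1", "c9", "c3")  # hypothetical semantic ranking
fused <- rrf(list(bm25_ids, vec_ids))
fused$id[1]  # "c1" -- ranked well by both retrievers
```

Chunks appearing high in both lists (here c1 and c3) dominate chunks that rank well in only one.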

Context assembly – nlp_roll_chunks() with context_size > 0 gives each chunk a focal sentence plus surrounding context, so retrieved passages are self-contained when passed to an LLM.
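A hedged sketch of that chunking step: only context_size is named in the text above, so the positional corpus argument and any output column names are assumptions, and the call is guarded on the package being installed.

```r
# Illustrative context-aware chunking; argument order beyond
# `context_size` is an assumption, not documented API.
corpus <- data.frame(
  doc_id = "d1",
  text = "First sentence. Second sentence. Third sentence."
)
if (requireNamespace("textpress", quietly = TRUE)) {
  chunks <- textpress::nlp_roll_chunks(corpus, context_size = 1)
  head(chunks)  # each row: a focal chunk plus its neighboring sentences
}
```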

Agent tool-calling – the consistent API and plain data-frame outputs map naturally to tool use:

| Agent task | Function |
|---|---|
| “Find recent articles on X” | fetch_urls() |
| “Scrape these pages” | read_urls() |
| “Find all mentions of these entities” | search_dict() |
| “Follow citations from this Wikipedia article” | fetch_wiki_refs() |

Vignettes


License

MIT © Jason Timm

Citation

citation("textpress")

Issues

Report bugs or request features at https://github.com/jaytimm/textpress/issues
