
textpress

A lightweight toolkit for text retrieval and NLP with a consistent API: Fetch, Read, Process, and Search. Functions cover the full pipeline from web data to text processing and indexing. Multiple search strategies: regex, BM25, cosine similarity, and dictionary matching. verb_noun naming; pipe-friendly; no heavy dependencies; outputs are plain data frames.


Installation

From CRAN:

install.packages("textpress")

Development version:

remotes::install_github("jaytimm/textpress")

The textpress API map

1. Data acquisition (fetch_*)

These functions talk to the outside world to find locations of information. They return URLs or metadata, not full text.

2. Ingestion (read_*)

Once you have locations, bring the data into R.

3. Processing (nlp_*)

Prepare raw text for analysis or indexing. Designed to be used with the pipe |>.
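To make the shape of the nlp_* stage concrete, here is a minimal base-R sketch of tokenization into the kind of long-format data frame these functions pass along. The function and column names here are illustrative assumptions, not textpress's actual API.

```r
# Hypothetical sketch: tokenize a character vector into a long-format
# data frame (one row per token), keyed by doc_id. Not textpress's
# implementation -- just the general shape of a tokenized corpus.
tokenize_df <- function(text, doc_id = seq_along(text)) {
  # lowercase, then split on runs of non-alphanumeric characters
  # (apostrophes kept so contractions survive)
  toks <- strsplit(tolower(text), "[^[:alnum:]']+")
  toks <- lapply(toks, function(t) t[nzchar(t)])  # drop empty tokens
  data.frame(doc_id = rep(doc_id, lengths(toks)),
             token  = unlist(toks))
}
```

A data frame like this pipes naturally into an indexing step.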

4. Retrieval (search_*)

Four ways to query your data. Subject-first: first argument is the data (corpus, index, or embeddings); the second is the query/needle. Pipe-friendly.

Function                              Primary input (needle)        Use case
search_regex(corpus, query, …)        Character (pattern)           Specific strings/patterns, KWIC.
search_dict(corpus, terms, …)         Character (vector of terms)   Exact phrases/MWEs; no partial-match risk.
search_index(index, query, …)         Character (keywords)          BM25 ranked retrieval.
search_vector(embeddings, query, …)   Numeric (vector/matrix)       Semantic neighbors.
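For intuition about the BM25 ranking behind search_index, here is a minimal base-R scorer. It is a sketch of the standard Okapi BM25 formula, not textpress's implementation; the smoothed idf variant and default k1/b values are assumptions.

```r
# Minimal BM25 sketch in base R: rank documents against a keyword query.
# Illustrative only -- search_index() in textpress may differ in details.
bm25_score <- function(docs, query, k1 = 1.2, b = 0.75) {
  toks  <- lapply(strsplit(tolower(docs), "\\W+"),
                  function(t) t[nzchar(t)])
  q     <- strsplit(tolower(query), "\\W+")[[1]]
  N     <- length(docs)
  dl    <- lengths(toks)        # document lengths
  avgdl <- mean(dl)
  score <- numeric(N)
  for (term in q) {
    df <- sum(vapply(toks, function(t) term %in% t, logical(1)))
    if (df == 0) next
    idf <- log((N - df + 0.5) / (df + 0.5) + 1)  # smoothed idf
    tf  <- vapply(toks, function(t) sum(t == term), numeric(1))
    score <- score +
      idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))
  }
  data.frame(doc_id = seq_len(N), score = score)
}
```

Note how length normalization favors shorter documents with the same term frequency.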

search_dict is the exact n-gram matcher: pass a vector of terms and get a table of where they appeared. It is optimized for high-speed extraction of thousands of specific terms (MWEs) across large corpora. Add categories afterwards with a left_join.
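The "no partial-match risk" property can be sketched in base R: escape each term, then require word boundaries so "Socket" never matches inside "Sockets". This is an illustration of the idea, not search_dict's actual implementation.

```r
# Hypothetical sketch of exact-term matching: which docs contain which
# terms, with word boundaries so partial matches are excluded.
match_terms <- function(corpus, terms) {
  # escape regex metacharacters, then anchor with \b word boundaries
  pats <- paste0("\\b",
                 gsub("([][{}()+*^$|\\\\?.])", "\\\\\\1", terms),
                 "\\b")
  hits <- lapply(seq_along(terms), function(i) {
    doc_ids <- which(grepl(pats[i], corpus, perl = TRUE))
    if (length(doc_ids) == 0) return(NULL)
    data.frame(doc_id = doc_ids, term = terms[i])
  })
  do.call(rbind, hits)
}
```

The result is a plain data frame, ready for a left_join against a term-to-category lookup table.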

Quick start (all four stages):

library(textpress)
links  <- fetch_urls("R high performance computing")
corpus <- read_urls(links$url)
corpus$doc_id <- seq_len(nrow(corpus))
toks   <- nlp_tokenize_text(corpus, by = "doc_id", include_spans = FALSE)
index  <- nlp_index_tokens(toks)
search_regex(corpus, "parallel|future", by = "doc_id")
search_dict(corpus, terms = c("OpenMP", "Socket"), by = "doc_id")
search_index(index, "distributed computing")
# search_vector(embeddings, query)  # use util_fetch_embeddings() for embeddings

Extension: Using textpress with LLMs & agents

While textpress is a general-purpose text toolkit, its design fits LLM-based workflows (e.g. RAG) and autonomous agents.

Lightweight RAG (retrieval-augmented generation)
You can build a local-first RAG pipeline without a heavy vector database.
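The retrieval step of such a pipeline reduces to cosine similarity between a query embedding and chunk embeddings, which search_vector handles. As a base-R sketch (toy numbers standing in for real embeddings; function and column names are assumptions):

```r
# Sketch of the retrieval step in a local-first RAG pipeline:
# cosine similarity between a query vector and chunk embeddings (rows),
# returning the top-k chunks. Illustrative, not textpress's code.
cosine_top_k <- function(embeddings, query, k = 2) {
  sims <- as.vector(embeddings %*% query) /
    (sqrt(rowSums(embeddings^2)) * sqrt(sum(query^2)))
  ord <- order(sims, decreasing = TRUE)[seq_len(min(k, nrow(embeddings)))]
  data.frame(chunk_id = ord, similarity = sims[ord])
}
```

The retrieved chunks can then be pasted into an LLM prompt as context.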

Tool-use for autonomous agents
If you are building an agent (e.g. via an R agent framework), textpress functions work well as tools: flat naming and predictable data-frame outputs make them easy for a model to call.
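One way this might look, sketched in base R: a named registry of tool functions, each taking simple arguments and returning a plain data frame the agent loop can serialize. The registry and dispatcher below are illustrative assumptions, not part of textpress or any particular agent framework.

```r
# Hypothetical tool registry for an agent loop: each entry is a function
# with a flat signature and a data-frame return value.
tools <- list(
  grep_corpus = function(corpus, pattern) {
    data.frame(doc_id  = which(grepl(pattern, corpus)),
               pattern = pattern)
  }
)

# Dispatch a tool call by name with a list of named arguments,
# as an agent runtime would after parsing a model's tool request.
call_tool <- function(name, args) do.call(tools[[name]], args)
```

Because every tool returns a data frame, results can be rendered back to the model uniformly.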


License

MIT © Jason Timm, MA, PhD

Citation

If you use this package in your research, please cite:

citation("textpress")

Issues

Report bugs or request features at https://github.com/jaytimm/textpress/issues

Contributing

Contributions welcome! Please open an issue or submit a pull request.
