A lightweight toolkit for text retrieval and NLP with a consistent API: Fetch, Read, Process, and Search. Functions cover the full pipeline from web data to text processing and indexing. Multiple search strategies – regex, BM25, cosine similarity, dictionary matching. Verb_noun naming; pipe-friendly; no heavy dependencies; outputs are plain data frames.
From CRAN:

    install.packages("textpress")

Development version:

    remotes::install_github("jaytimm/textpress")

Fetch (fetch_*)

These functions talk to the outside world to find locations of information. They return URLs or metadata, not full text.

- fetch_urls(): Web (general). Query search engines for a list of relevant links.
- fetch_wiki_urls(): Wikipedia. Find specific page titles/URLs.
- fetch_wiki_refs(): Wikipedia. Extract the external “References” URLs from a page.

Read (read_*)

Once you have locations, bring the data into R.

- read_urls(): takes a character vector of URLs and returns a data frame of cleaned text/markdown.

Process (nlp_*)

Prepare raw text for analysis or indexing. Designed to be used with the pipe |>.

- nlp_split_paragraphs(): Break large documents into structural blocks.
- nlp_split_sentences(): Refine blocks into individual sentences.
- nlp_tokenize_text(): Normalize text into a clean token stream.
- nlp_index_tokens(): Build a weighted BM25 index for ranked search.
- nlp_roll_chunks(): Roll units (e.g. sentences) into fixed-size chunks with optional context (RAG-style).

Search (search_*)

Four ways to query your data. All are subject-first: the first argument is the data (corpus, index, or embeddings) and the second is the query/needle, which keeps them pipe-friendly.

| Function | Primary input (needle) | Use case |
|---|---|---|
| search_regex(corpus, query, …) | Character (pattern) | Specific strings/patterns, KWIC. |
| search_dict(corpus, terms, …) | Character (vector of terms) | Exact phrases/MWEs; no partial-match risk. |
| search_index(index, query, …) | Character (keywords) | BM25 ranked retrieval. |
| search_vector(embeddings, query, …) | Numeric (vector/matrix) | Semantic neighbors. |
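
To make the BM25 row of the table more concrete, here is a minimal base-R sketch of BM25 scoring over a toy corpus. It is not textpress's implementation: the function name `bm25_scores()`, the simple alphanumeric tokenizer, and the defaults `k1 = 1.2` and `b = 0.75` are assumptions chosen for illustration only.

```r
# Toy BM25 scorer in base R (illustrative; not textpress internals).
# `docs` is a character vector of documents; `query` is one string of keywords.
bm25_scores <- function(docs, query, k1 = 1.2, b = 0.75) {
  # Lowercase and split on anything that is not a letter or digit.
  tokenize <- function(x) {
    toks <- strsplit(tolower(x), "[^a-z0-9]+")[[1]]
    toks[nzchar(toks)]
  }
  doc_toks <- lapply(docs, tokenize)
  q_terms  <- unique(tokenize(query))

  n_docs  <- length(doc_toks)
  doc_len <- lengths(doc_toks)
  avg_len <- mean(doc_len)

  scores <- numeric(n_docs)
  for (term in q_terms) {
    # Term frequency per document and number of documents containing the term.
    tf  <- vapply(doc_toks, function(d) sum(d == term), integer(1))
    df  <- sum(tf > 0)
    idf <- log((n_docs - df + 0.5) / (df + 0.5) + 1)
    # Saturating, length-normalized term-frequency component.
    scores <- scores + idf * (tf * (k1 + 1)) /
      (tf + k1 * (1 - b + b * doc_len / avg_len))
  }
  scores
}

docs <- c(
  "Parallel computing in R with the future package",
  "Distributed computing across many machines",
  "A recipe for sourdough bread"
)
scores <- bm25_scores(docs, "distributed computing")
ranked <- order(scores, decreasing = TRUE)
data.frame(doc_id = ranked, score = round(scores[ranked], 3))
```

The second document matches both query terms and ranks first; the third matches neither and scores zero. A real index would precompute term statistics once instead of rescanning the corpus for every query.
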
search_dict is the exact n-gram matcher: pass a vector of terms and get back a table of where they appeared. It is optimized for high-speed extraction of thousands of specific terms (MWEs) across large corpora. Add categories later with a left_join.
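
The match-then-join pattern described above can be sketched in base R. This is an illustration only, not how search_dict() works internally: `match_terms()`, the output columns `doc_id`, `term`, and `token_start`, and the category table are all hypothetical, and `merge(..., all.x = TRUE)` stands in for a dplyr `left_join()` to avoid extra dependencies.

```r
# Exact token n-gram matching over a small corpus (illustrative only;
# match_terms() is a hypothetical helper, not a textpress function).
match_terms <- function(corpus, terms) {
  term_toks <- strsplit(tolower(terms), " +")
  rows <- list()
  for (i in seq_len(nrow(corpus))) {
    toks <- strsplit(tolower(corpus$text[i]), "[^a-z0-9]+")[[1]]
    toks <- toks[nzchar(toks)]
    for (j in seq_along(terms)) {
      n <- length(term_toks[[j]])
      if (n == 0 || length(toks) < n) next
      # Slide a window of n tokens and keep only whole-token matches.
      for (s in seq_len(length(toks) - n + 1)) {
        if (identical(toks[s:(s + n - 1)], term_toks[[j]])) {
          rows[[length(rows) + 1]] <- data.frame(
            doc_id = corpus$doc_id[i], term = terms[j], token_start = s,
            stringsAsFactors = FALSE
          )
        }
      }
    }
  }
  if (length(rows) == 0) {
    return(data.frame(doc_id = integer(0), term = character(0),
                      token_start = integer(0), stringsAsFactors = FALSE))
  }
  do.call(rbind, rows)
}

corpus <- data.frame(
  doc_id = 1:2,
  text = c("OpenMP and socket clusters speed up high performance computing.",
           "Sockets are not matched, but a socket is."),
  stringsAsFactors = FALSE
)
terms <- c("OpenMP", "socket", "high performance computing")
hits  <- match_terms(corpus, terms)

# Hypothetical category table, attached after matching.
categories <- data.frame(
  term     = terms,
  category = c("library", "transport", "field"),
  stringsAsFactors = FALSE
)
# Base-R equivalent of a left join: keep every hit, add its category.
tagged <- merge(hits, categories, by = "term", all.x = TRUE)
tagged
```

Because matching works on whole tokens, "Sockets" does not count as a hit for "socket". Keeping categories in a separate table means the term list can be relabeled without rerunning the match.
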
Quick start (all four stages):

    library(textpress)

    # Fetch: find candidate URLs
    links <- fetch_urls("R high performance computing")

    # Read: pull cleaned text into a data frame and add a document id
    corpus <- read_urls(links$url)
    corpus$doc_id <- seq_len(nrow(corpus))

    # Process: tokenize, then build a BM25 index
    toks <- nlp_tokenize_text(corpus, by = "doc_id", include_spans = FALSE)
    index <- nlp_index_tokens(toks)

    # Search: regex, dictionary, and BM25
    search_regex(corpus, "parallel|future", by = "doc_id")
    search_dict(corpus, terms = c("OpenMP", "Socket"), by = "doc_id")
    search_index(index, "distributed computing")
    # search_vector(embeddings, query)  # use util_fetch_embeddings() for embeddings

While textpress is a general-purpose text toolkit, its design fits LLM-based workflows (e.g. RAG) and autonomous agents.

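
As background for the RAG workflows mentioned here, the idea of rolling sentences into fixed-size chunks with surrounding context can be sketched in base R. This does not show how nlp_roll_chunks() is implemented or called: `roll_chunks()` and its `size` and `context` arguments are hypothetical names used only for illustration.

```r
# Roll sentences into fixed-size chunks, each with neighboring context
# (illustrative only; roll_chunks() is a hypothetical helper).
roll_chunks <- function(sentences, size = 2, context = 1) {
  n <- length(sentences)
  if (n == 0) {
    return(data.frame(chunk_id = integer(0), chunk = character(0),
                      chunk_plus_context = character(0),
                      stringsAsFactors = FALSE))
  }
  starts <- seq(1, n, by = size)
  out <- lapply(seq_along(starts), function(k) {
    from <- starts[k]
    to   <- min(from + size - 1, n)   # last sentence in the chunk
    lo   <- max(1, from - context)    # extend backwards for context
    hi   <- min(n, to + context)      # extend forwards for context
    data.frame(
      chunk_id = k,
      chunk = paste(sentences[from:to], collapse = " "),
      chunk_plus_context = paste(sentences[lo:hi], collapse = " "),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, out)
}

sentences <- c("R is a language.", "It has packages.", "CRAN hosts them.",
               "Mirrors serve CRAN.", "Users install from mirrors.")
chunks <- roll_chunks(sentences, size = 2, context = 1)
chunks
```

One common pattern is to search over the narrow `chunk` column and pass the wider `chunk_plus_context` text to the LLM, so retrieval stays precise while the model still sees the neighboring sentences.
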
Lightweight RAG (retrieval-augmented generation)

You can build a local-first RAG pipeline without a heavy vector DB:

- Use search_index() (BM25) to pull relevant chunks by keyword; this is often more accurate for technical data than semantic search alone.
- Use nlp_split_paragraphs() and related functions so you send only relevant snippets to an LLM, cutting token cost and improving answers.
- Use search_dict() to extract known entities or IDs before calling an LLM, so the model does not hallucinate core facts.

Tool-use for autonomous agents

If you are building an agent in an R framework, textpress functions work well as tools: flat naming and predictable data-frame outputs make them easy for a model to call.

- fetch_urls(): agent “Search” tool.
- read_urls(): agent “Browse” tool.
- search_regex(): agent “Find in page” tool.
- search_dict(): agent “Entity extraction” tool (deterministic; reduces hallucination).

MIT © Jason Timm, MA, PhD

If you use this package in your research, please cite:

    citation("textpress")

Report bugs or request features at https://github.com/jaytimm/textpress/issues

Contributions welcome! Please open an issue or submit a pull request.