textpress is an R toolkit for building text corpora and
searching them – no custom object classes, just plain data frames from
start to finish. It covers the full arc from URL to retrieved passage
through a consistent four-step API: Fetch,
Read, Process,
Search. Traditional tools (KWIC, BM25, dictionary
matching) sit alongside modern ones (semantic search, LLM-ready
chunking), all composing cleanly with `|>`.
## Installation

From CRAN:

```r
install.packages("textpress")
```

Development version:

```r
remotes::install_github("jaytimm/textpress")
```

## textpress API

Conventions: a corpus is a data frame with a `text` column plus identifier
column(s) passed to `by` (default `doc_id`). All outputs are plain data
frames or data.tables; pipe-friendly.
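To make the four-step arc concrete, here is a minimal end-to-end sketch using the functions documented below. The query string and the `url` column name in the fetch output are illustrative assumptions, not documented behavior.

```r
library(textpress)

# Sketch: URL discovery -> scrape -> sentence corpus, all plain data frames.
# The query and the `url` column name are assumptions for illustration.
candidates <- fetch_urls(query = "solar geoengineering", n_pages = 1)

corpus <- read_urls(candidates$url)   # list(text, meta)

sentences <- corpus$text |>
  nlp_split_paragraphs() |>
  nlp_split_sentences()
```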
### Fetch (`fetch_*`)

Find URLs and metadata – not full text. Pass results to `read_urls()` to get content.

- `fetch_urls(query, n_pages, date_filter)` – Search-engine query; returns candidate URLs with metadata.
- `fetch_wiki_urls(query, limit)` – Wikipedia article URLs matching a search phrase.
- `fetch_wiki_refs(url, n)` – External citation URLs from a Wikipedia article’s References section.

### Read (`read_*`)

Scrape and parse URLs into a structured corpus.
- `read_urls(urls, ...)` – Character vector of URLs → `list(text, meta)`. `text` is one row per node (headings, paragraphs, lists); `meta` is one row per URL. For Wikipedia, `exclude_wiki_refs = TRUE` drops References / See also / Bibliography sections.

### Process (`nlp_*`)

Prepare text for search or indexing.
- `nlp_split_paragraphs()` – Break documents into structural blocks.
- `nlp_split_sentences()` – Segment blocks into individual sentences.
- `nlp_tokenize_text()` – Normalize text into a clean token stream.
- `nlp_index_tokens()` – Build a weighted BM25 index for ranked retrieval.
- `nlp_roll_chunks()` – Roll sentences into fixed-size chunks with surrounding context (RAG-style).

### Search (`search_*`)

Four retrieval modes over the same corpus. Data-first and pipe-friendly.
| Function | Query type | Use case |
|---|---|---|
| `search_regex(corpus, query)` | Regex pattern | Specific strings; KWIC with inline highlighting. |
| `search_dict(corpus, terms)` | Term vector | Exact phrases and MWEs; built-in `dict_generations`, `dict_political`. |
| `search_index(index, query)` | Keywords | BM25 ranked retrieval over a token index. |
| `search_vector(embeddings, query)` | Numeric vector | Semantic nearest-neighbor search; use `util_fetch_embeddings()` to embed. |
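As a sketch of two of these modes over one corpus, assuming `sentences` is a data frame produced by `nlp_split_sentences()` with `doc_id` and `text` columns (the example queries are illustrative):

```r
library(textpress)

# KWIC-style regex search over the sentence corpus.
hits_kwic <- search_regex(sentences, query = "carbon (tax|pricing)")

# BM25: tokenize, build the weighted index once, then query by keyword.
index <- sentences |>
  nlp_tokenize_text() |>
  nlp_index_tokens()

hits_bm25 <- search_index(index, query = "emissions trading scheme")
```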
textpress is designed to compose cleanly into
retrieval-augmented generation pipelines.
**Hybrid retrieval** – run `search_index()` and `search_vector()` over the same chunks, then merge with reciprocal rank fusion (RRF). Chunks that rank well under both term frequency and meaning rise to the top.
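RRF itself is a few lines of base R. A minimal sketch, assuming each result set is a data frame already ordered by relevance and sharing a `chunk_id` column (an assumed column name, not documented output):

```r
# Reciprocal rank fusion: each result list contributes 1 / (k + rank) per
# chunk; summing across lists rewards chunks that rank well in both.
rrf_merge <- function(a, b, k = 60) {
  ranks <- rbind(
    data.frame(chunk_id = a$chunk_id, score = 1 / (k + seq_len(nrow(a)))),
    data.frame(chunk_id = b$chunk_id, score = 1 / (k + seq_len(nrow(b))))
  )
  fused <- aggregate(score ~ chunk_id, data = ranks, FUN = sum)
  fused[order(-fused$score), ]
}
```

Conceptually, `rrf_merge(search_index(...), search_vector(...))` would yield the fused ranking.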
**Context assembly** – `nlp_roll_chunks()` with `context_size > 0` gives each chunk a focal sentence plus surrounding context, so retrieved passages are self-contained when passed to an LLM.
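A sketch of that assembly step, assuming a sentence-level data frame as input and that `context_size` counts sentences of context on either side:

```r
library(textpress)

# Each output row carries a focal sentence plus one sentence of context
# on either side, so a retrieved chunk reads as a self-contained passage.
chunks <- sentences |>
  nlp_roll_chunks(context_size = 1)
```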
**Agent tool-calling** – the consistent API and plain data-frame outputs map naturally to tool use:
| Agent task | Function |
|---|---|
| “Find recent articles on X” | `fetch_urls()` |
| “Scrape these pages” | `read_urls()` |
| “Find all mentions of these entities” | `search_dict()` |
| “Follow citations from this Wikipedia article” | `fetch_wiki_refs()` |
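A hypothetical tool wrapper for an agent framework might look like the following; only `search_dict()` is from the package, while the helper name and the JSON serialization via jsonlite are assumptions:

```r
# Hypothetical agent tool: entity-mention lookup returning row-wise JSON.
# Assumes the jsonlite package is available.
find_mentions <- function(corpus, entities) {
  hits <- textpress::search_dict(corpus, terms = entities)
  jsonlite::toJSON(hits, dataframe = "rows")
}
```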
## Demos

- `fetch_urls()` + `read_urls()`
- `fetch_wiki_urls()` + `fetch_wiki_refs()`
- `search_regex()`, KWIC
- `search_dict()`, PMI co-occurrence

## License

MIT © Jason Timm

## Citation

```r
citation("textpress")
```

Report bugs or request features at https://github.com/jaytimm/textpress/issues.