Beyond crawling, crawlee provides three helpers to turn collected
text into a retrieval-ready corpus for retrieval-augmented generation
(RAG): cr_chunk(), cr_embed() and
cr_export(). They operate on plain tibbles, so they slot in
right after cr_collect().
pages <- crawler("https://books.toscrape.com/") |>
  cr_options(max_requests = 100) |>
  cr_on_html(function(ctx) {
    ctx$push_data(list(
      url = ctx$request$url,
      title = ctx$page |> rvest::html_element("title") |> rvest::html_text2(),
      text = ctx$page |> rvest::html_element("body") |> rvest::html_text2()
    ))
    ctx$enqueue_links(glob = "*/catalogue/*")
  }) |>
  cr_run() |>
  cr_collect()

cr_chunk() splits text into overlapping windows. On a
data frame, name the text column; every other column is carried along as
per-chunk metadata (so each chunk keeps its url and
title).
chunks <- cr_chunk(pages, text = text, size = 1000, overlap = 200, by = "char")
chunks
#> columns: doc_id, chunk_id, chunk, text, n_chars, url, title

Use by = "word" to size chunks in words instead of
characters.
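To make the size and overlap arithmetic concrete, here is a minimal base-R sketch of overlapping character windows. chunk_chars() is a hypothetical helper written for illustration; it is not part of crawlee and may not match how cr_chunk() treats edge cases such as the final window.

```r
# Hypothetical helper (not crawlee's implementation): split one string
# into character windows of `size`, each sharing `overlap` characters
# with the previous window.
chunk_chars <- function(x, size = 1000, overlap = 200) {
  stopifnot(length(x) == 1, overlap >= 0, size > overlap)
  n <- nchar(x)
  if (n == 0) return(character(0))
  step <- size - overlap                 # how far each window advances
  starts <- seq(1, n, by = step)
  # Drop a trailing window when the previous one already reached the end
  starts <- starts[starts == 1 | starts + overlap <= n]
  substring(x, starts, pmin(starts + size - 1, n))
}

# 26 characters, windows of 10 that advance by 6
chunk_chars(paste(letters, collapse = ""), size = 10, overlap = 4)
```

With size = 1000 and overlap = 200, each window in this sketch starts 800 characters after the previous one, so consecutive chunks share their last and first 200 characters.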
cr_embed() is provider-agnostic:
crawlee never calls an embedding service itself. You pass
embed_fn, a function that maps a character vector to a
numeric matrix (one row per input) or a list of numeric vectors. It is
applied in batches and adds an embedding list-column.
# A real embedder typically calls an HTTP API (any provider) with httr2:
embed_fn <- function(texts) {
  # return a length(texts) x d numeric matrix
  resp <- httr2::request("https://api.example.com/v1/embeddings") |>
    httr2::req_auth_bearer_token(Sys.getenv("EMBEDDINGS_API_KEY")) |>
    httr2::req_body_json(list(input = texts)) |>
    httr2::req_perform()
  do.call(rbind, lapply(httr2::resp_body_json(resp)$data, \(x) unlist(x$embedding)))
}
embedded <- cr_embed(chunks, embed_fn, batch_size = 32)

For a quick local experiment you can pass any function, even a trivial one.
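As one possible stand-in, the sketch below builds a two-dimensional "embedding" from character and word counts. toy_embed_fn() is a hypothetical name and its output carries no semantic meaning; it only satisfies the contract described above (character vector in, numeric matrix with one row per input out), which is enough to test the wiring without an API key.

```r
# Toy embedder (hypothetical, for wiring tests only): one row per input,
# columns are the character count and the whitespace-separated word count.
toy_embed_fn <- function(texts) {
  n_chars <- as.numeric(nchar(texts))
  n_words <- as.numeric(lengths(strsplit(trimws(texts), "\\s+")))
  cbind(n_chars, n_words)
}

m <- toy_embed_fn(c("a light in the attic", "tipping the velvet"))
dim(m)  # 2 rows (one per text), 2 columns

# Assuming `chunks` from the chunking step, it would be passed like this:
# embedded <- cr_embed(chunks, toy_embed_fn, batch_size = 32)
```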
cr_export() writes the chunk table (with embeddings) to
a retrieval-friendly format. parquet and jsonl
preserve the embedding vectors natively; csv and
duckdb serialise them to a [...] string.
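If you read a csv or duckdb export back into R, the embeddings arrive as strings and need to be parsed. The sketch below assumes the string is a bracketed, comma-separated list of numbers such as "[0.1,0.2,0.3]"; the exact layout is an assumption, so check a real export before relying on it. Both helper names are hypothetical.

```r
# Assumption: embeddings are stored as "[v1,v2,...]". These helpers are
# illustrative and not part of crawlee.
serialise_embedding <- function(v) {
  paste0("[", paste(v, collapse = ","), "]")
}

parse_embedding <- function(s) {
  inner <- gsub("^\\s*\\[|\\]\\s*$", "", s)        # strip the brackets
  as.numeric(strsplit(inner, ",", fixed = TRUE)[[1]])
}

s <- serialise_embedding(c(0.5, 1, -2))
v <- parse_embedding(s)
```

For a whole column, lapply(df$embedding, parse_embedding) would rebuild a list-column of numeric vectors under the same assumption.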
crawler("https://books.toscrape.com/") |>
  cr_options(max_requests = 100) |>
  cr_on_html(function(ctx) {
    ctx$push_data(list(
      url = ctx$request$url,
      text = ctx$page |> rvest::html_element("body") |> rvest::html_text2()
    ))
    ctx$enqueue_links(glob = "*/catalogue/*")
  }) |>
  cr_run() |>
  cr_collect() |>
  cr_chunk(text = text, size = 1000, overlap = 200) |>
  cr_embed(embed_fn) |>
  cr_export("corpus.parquet", format = "parquet")

From here, load corpus.parquet into your vector store or
do nearest-neighbour search in R to retrieve chunks for a prompt.
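For small corpora, nearest-neighbour search in R can be a few lines of matrix algebra. The following is a minimal sketch using cosine similarity; top_k_cosine() is a hypothetical helper, and the toy matrix stands in for real embeddings. In a real pipeline you would first stack the list-column into a matrix, for example with do.call(rbind, embedded$embedding), assuming each element is a numeric vector of the same length.

```r
# Hypothetical helper: rows of `emb` ranked by cosine similarity to `query`.
top_k_cosine <- function(emb, query, k = 3) {
  emb_norm <- emb / sqrt(rowSums(emb^2))   # scale each row to unit length
  q_norm <- query / sqrt(sum(query^2))     # scale the query to unit length
  scores <- as.vector(emb_norm %*% q_norm)
  idx <- order(scores, decreasing = TRUE)[seq_len(min(k, length(scores)))]
  data.frame(row = idx, score = scores[idx])
}

# Toy 2-d embeddings for three chunks, plus a query vector that would
# normally come from the same embed_fn used for the chunks
emb <- rbind(c(1, 0), c(0, 1), c(1, 1))
res <- top_k_cosine(emb, query = c(1, 0.1), k = 2)
res
```

The row column indexes back into the chunk table, so the matching chunk text and its url can be looked up for the prompt. Brute-force scoring like this scales linearly with the number of chunks; a dedicated vector store is the usual choice once the corpus grows.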