A RAG pipeline


Beyond crawling, crawlee provides three helpers to turn collected text into a retrieval-ready corpus for retrieval-augmented generation (RAG): cr_chunk(), cr_embed() and cr_export(). They operate on plain tibbles, so they slot in right after cr_collect().

1. Crawl and collect text

library(crawlee)

# Crawl the site and collect one record (url, title, text) per page
pages <- crawler("https://books.toscrape.com/") |>
  cr_options(max_requests = 100) |>
  cr_on_html(function(ctx) {
    ctx$push_data(list(
      url   = ctx$request$url,
      title = ctx$page |> rvest::html_element("title") |> rvest::html_text2(),
      text  = ctx$page |> rvest::html_element("body") |> rvest::html_text2()
    ))
    ctx$enqueue_links(glob = "*/catalogue/*")
  }) |>
  cr_run() |>
  cr_collect()

2. Chunk

cr_chunk() splits text into overlapping windows. On a data frame, name the text column; every other column is carried along as per-chunk metadata (so each chunk keeps its url and title).

chunks <- cr_chunk(pages, text = text, size = 1000, overlap = 200, by = "char")
chunks
#> columns: doc_id, chunk_id, chunk, text, n_chars, url, title

Use by = "word" to size chunks in words instead of characters.
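
To make the size and overlap arguments concrete, the following base-R sketch builds overlapping character windows by hand. It illustrates the windowing idea only and is not crawlee's implementation: the helper name chunk_chars() and its stepping rule (each window starts size - overlap characters after the previous one) are assumptions.

```r
# Hypothetical helper: split one string into overlapping character windows.
# Assumption: consecutive windows start `size - overlap` characters apart.
chunk_chars <- function(x, size = 1000, overlap = 200) {
  stopifnot(is.character(x), length(x) == 1, overlap >= 0, size > overlap)
  n <- nchar(x)
  if (n <= size) return(x)            # short text: a single chunk
  step   <- size - overlap
  starts <- seq(1, n, by = step)
  # Drop trailing windows whose text is already fully covered by the previous one
  starts <- starts[c(TRUE, (starts[-length(starts)] + size - 1) < n)]
  substring(x, starts, pmin(starts + size - 1, n))
}

chunk_chars("abcdefghij", size = 4, overlap = 2)
#> [1] "abcd" "cdef" "efgh" "ghij"
```

Each window shares its first two characters with the end of the previous one, which is what keeps a sentence that straddles a boundary retrievable from at least one chunk.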

3. Embed

cr_embed() is provider-agnostic: crawlee never calls an embedding service itself. You pass embed_fn, a function that maps a character vector to a numeric matrix (one row per input) or a list of numeric vectors. It is applied in batches and adds an embedding list-column.

# A real embedder typically calls an HTTP API (any provider) with httr2:
embed_fn <- function(texts) {
  # return a length(texts) x d numeric matrix
  resp <- httr2::request("https://api.example.com/v1/embeddings") |>
    httr2::req_auth_bearer_token(Sys.getenv("EMBEDDINGS_API_KEY")) |>
    httr2::req_body_json(list(input = texts)) |>
    httr2::req_perform()
  do.call(rbind, lapply(httr2::resp_body_json(resp)$data, \(x) unlist(x$embedding)))
}

embedded <- cr_embed(chunks, embed_fn, batch_size = 32)

For a quick local experiment you can pass any function, even a trivial one:

fake_embed <- function(x) matrix(nchar(x), nrow = length(x), ncol = 1)
embedded <- cr_embed(chunks, fake_embed)
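
The batching contract described above can be sketched in base R. This is not crawlee's code: embed_in_batches() is a hypothetical helper that shows one way to call an embed_fn on slices of at most batch_size texts, accept either return shape (a matrix or a list of vectors), and end up with one numeric vector per input, ready to be stored as a list-column.

```r
# Hypothetical helper, for illustration only (not part of crawlee).
embed_in_batches <- function(texts, embed_fn, batch_size = 32) {
  # Group input positions into consecutive batches of at most batch_size
  batches <- split(seq_along(texts), ceiling(seq_along(texts) / batch_size))
  out <- vector("list", length(texts))
  for (idx in batches) {
    res <- embed_fn(texts[idx])
    # Normalise a matrix (one row per input) to a list of row vectors
    if (is.matrix(res)) res <- lapply(seq_len(nrow(res)), function(i) res[i, ])
    stopifnot(length(res) == length(idx))   # one embedding per input
    out[idx] <- res
  }
  out
}

fake_embed  <- function(x) matrix(nchar(x), nrow = length(x), ncol = 1)
chunks_demo <- data.frame(chunk = c("a", "bb", "ccc"))
chunks_demo$embedding <- embed_in_batches(chunks_demo$chunk, fake_embed, batch_size = 2)
```

Checking that each batch returns exactly one embedding per input catches the most common embed_fn mistake, a response that silently drops or reorders items.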

4. Export for retrieval

cr_export() writes the chunk table (with embeddings) to a retrieval-friendly format. parquet and jsonl preserve the embedding vectors natively; csv and duckdb serialise them to a [...] string.

cr_export(embedded, "corpus.parquet", format = "parquet")
cr_export(embedded, "corpus.jsonl", format = "jsonl")
cr_export(embedded, "corpus.duckdb", format = "duckdb", table = "chunks")

End to end

crawler("https://books.toscrape.com/") |>
  cr_options(max_requests = 100) |>
  cr_on_html(function(ctx) {
    ctx$push_data(list(
      url  = ctx$request$url,
      text = ctx$page |> rvest::html_element("body") |> rvest::html_text2()
    ))
    ctx$enqueue_links(glob = "*/catalogue/*")
  }) |>
  cr_run() |>
  cr_collect() |>
  cr_chunk(text = text, size = 1000, overlap = 200) |>
  cr_embed(embed_fn) |>
  cr_export("corpus.parquet", format = "parquet")

From here, load corpus.parquet into your vector store or do nearest-neighbour search in R to retrieve chunks for a prompt.
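
As a minimal sketch of the in-R option, the code below ranks chunks by cosine similarity to a query vector. It assumes the embeddings are available as a list-column of equal-length numeric vectors, as cr_embed() is described as producing. top_k_chunks() and the toy data are hypothetical, and in practice the query vector would come from the same embed_fn used for the corpus.

```r
# Hypothetical helper: return the k chunks most similar to query_vec.
top_k_chunks <- function(corpus, query_vec, k = 3) {
  emb  <- do.call(rbind, corpus$embedding)               # n x d matrix
  # Cosine similarity between every chunk embedding and the query
  sims <- as.vector(emb %*% query_vec) /
    (sqrt(rowSums(emb^2)) * sqrt(sum(query_vec^2)))
  ord  <- order(sims, decreasing = TRUE)[seq_len(min(k, nrow(corpus)))]
  cbind(corpus[ord, setdiff(names(corpus), "embedding"), drop = FALSE],
        similarity = sims[ord])
}

# Toy corpus with 2-dimensional embeddings
toy <- data.frame(chunk = c("cats", "dogs", "cars"))
toy$embedding <- list(c(1, 0), c(0.9, 0.1), c(0, 1))

top_k_chunks(toy, query_vec = c(1, 0), k = 2)   # "cats" first, then "dogs"
```

A brute-force scan like this is usually adequate for a corpus of a few thousand chunks; a dedicated vector store becomes worthwhile once the embedding matrix no longer fits comfortably in memory.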
