Getting started with crawlee

library(crawlee)

The mental model

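crawlee mirrors the architecture of Crawlee in pure R. A crawler owns the pieces used throughout this guide: a request queue of URLs to visit, a dataset of collected records, and the handlers that process each page.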

You build a crawler with crawler() and configure it with cr_* verbs that compose through the native pipe (|>).
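To make the composition pattern concrete, here is a toy sketch in base R. It assumes that each verb takes the crawler as its first argument and returns an updated copy, which is the usual convention for pipe-friendly functions. The `toy_*` functions are hypothetical stand-ins written for this illustration; they are not crawlee's implementation and say nothing about its internals.

```r
# Toy stand-ins, not part of crawlee: a "crawler" here is just a list.
toy_crawler <- function(url) {
  list(start_url = url, options = list(), handlers = list())
}

# Each verb takes the object first and returns a modified copy.
toy_options <- function(crawler, ...) {
  crawler$options <- utils::modifyList(crawler$options, list(...))
  crawler
}

toy_on_html <- function(crawler, handler) {
  crawler$handlers$html <- handler
  crawler
}

# Chained with the native pipe (requires R >= 4.1).
cfg <- toy_crawler("https://example.com") |>
  toy_options(delay = 0.5, max_depth = 2) |>
  toy_on_html(function(ctx) NULL)

# The pipe is only syntax: the chain above is equivalent to nested calls.
nested <- toy_on_html(
  toy_options(toy_crawler("https://example.com"), delay = 0.5, max_depth = 2),
  function(ctx) NULL
)
```

Because every verb returns the object it received, the order of configuration steps reads top to bottom and no intermediate variables are needed.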

A minimal crawl

resultado <- crawler("https://example.com") |>
  cr_options(delay = 0.5, max_depth = 2) |>
  cr_use_http() |>
  cr_on_html(function(ctx) {
    ctx$push_data(list(
      url    = ctx$request$url,
      titulo = ctx$page |> rvest::html_element("h1") |> rvest::html_text2()
    ))
    ctx$enqueue_links()
  }) |>
  cr_run() |>
  cr_collect()

The handler context

Every handler receives a context object, conventionally named ctx:

Element                  Description
ctx$request              The current request (url, label, depth, …).
ctx$response             The raw httr2 response.
ctx$page                 The parsed page (xml_document) for HTML/XML, else NULL.
ctx$push_data(data)      Append a record (list or data frame) to the dataset.
ctx$enqueue_links(...)   Discover and enqueue links from the page.
ctx$log                  Logging helpers (info(), success(), warn(), error()).
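Since a handler is an ordinary function of ctx, it can be tried out against a hand-built stand-in before running a real crawl. The sketch below uses only base R. `fake_ctx()` is a hypothetical helper that imitates the elements in the table above, and the assumption that `ctx$log$info()` takes a single message string is ours; check the package reference for the actual signatures.

```r
# Hypothetical stand-in for the handler context; not part of crawlee.
# It records pushed data and log messages in an environment for inspection.
fake_ctx <- function(url, depth = 0L) {
  store <- new.env()
  store$records <- list()
  store$messages <- character()
  list(
    request = list(url = url, label = NULL, depth = depth),
    response = NULL,
    page = NULL,
    push_data = function(data) {
      store$records[[length(store$records) + 1L]] <- data
      invisible(NULL)
    },
    enqueue_links = function(...) invisible(NULL),  # no-op in this mock
    log = list(
      # Assumed signature: one message string.
      info = function(msg) {
        store$messages <- c(store$messages, msg)
        invisible(NULL)
      }
    ),
    store = store
  )
}

# A handler that logs, stores one record, and follows links on shallow pages.
handler <- function(ctx) {
  ctx$log$info(paste("visiting", ctx$request$url))
  ctx$push_data(list(url = ctx$request$url, depth = ctx$request$depth))
  if (ctx$request$depth < 2L) ctx$enqueue_links()
}

ctx <- fake_ctx("https://example.com/a", depth = 1L)
handler(ctx)
```

After the call, `ctx$store$records` holds the single pushed record and `ctx$store$messages` holds the log line, so the handler's behaviour can be checked without any network access.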

Reproducibility

The request queue deduplicates URLs by a normalised key (see cr_normalize_url()), so the same page is never fetched twice and crawls are deterministic. Persistent, resumable storage backends (DuckDB, Parquet) are on the roadmap.
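The idea of deduplicating by a normalised key can be sketched in base R as follows. The rules used here (lowercase the scheme and authority, drop the fragment, drop a trailing slash when there is no query string) are assumptions chosen for illustration; the rules actually applied by cr_normalize_url() may differ. `normalize_url_sketch()` and `new_queue()` are hypothetical names, not crawlee functions.

```r
# Illustrative normalisation only; cr_normalize_url() may use other rules.
normalize_url_sketch <- function(url) {
  url <- sub("#.*$", "", url)  # drop the fragment
  m <- regmatches(
    url,
    regexec("^([A-Za-z][A-Za-z0-9+.-]*)://([^/?]+)(.*)$", url)
  )[[1]]
  if (length(m) == 0L) return(url)  # not an absolute URL: leave unchanged
  rest <- m[4]
  # Drop a trailing slash unless a query string is present.
  if (!grepl("?", rest, fixed = TRUE)) rest <- sub("/$", "", rest)
  paste0(tolower(m[2]), "://", tolower(m[3]), rest)
}

# A minimal queue that accepts a URL only if its key has not been seen.
new_queue <- function() {
  seen <- character()
  pending <- character()
  list(
    add = function(url) {
      key <- normalize_url_sketch(url)
      if (key %in% seen) return(invisible(FALSE))
      seen <<- c(seen, key)
      pending <<- c(pending, url)
      invisible(TRUE)
    },
    pending = function() pending
  )
}

q <- new_queue()
q$add("https://Example.com/docs/")       # accepted
q$add("https://example.com/docs#intro")  # same key as above, skipped
q$add("https://example.com/about")       # accepted
q$pending()
```

Only two URLs remain pending, because the first two share one key. Keying on the normalised form rather than the raw string is what keeps the set of fetched pages independent of how a link happened to be written.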