
Getting started with crawlee

library(crawlee)

The mental model

crawlee mirrors the architecture of Crawlee in pure R. A crawler owns:

You build a crawler with crawler() and configure it with cr_* verbs that compose through the native pipe (|>).
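
To make the composition concrete, here is a toy sketch in base R of why such verbs chain with |>: each one takes the object as its first argument and returns a modified copy. The toy_* names are hypothetical stand-ins, not part of crawlee, and the package's real internals may differ.

```r
# Hypothetical stand-ins for illustration only; none of these are crawlee functions.
toy_crawler <- function(start_url) {
  list(start_url = start_url, options = list(), handlers = list())
}

# A "verb" receives the crawler first and returns the updated crawler.
toy_options <- function(crawler, ...) {
  crawler$options <- utils::modifyList(crawler$options, list(...))
  crawler
}

toy_on_html <- function(crawler, handler) {
  crawler$handlers$html <- handler
  crawler
}

# Because every verb returns the crawler, the calls chain with the native pipe (R >= 4.1).
cfg <- toy_crawler("https://example.com") |>
  toy_options(delay = 0.5, max_depth = 2) |>
  toy_on_html(function(ctx) NULL)

str(cfg$options)
```

Nothing runs until a final verb consumes the configured object, which is the role cr_run() plays in the example that follows.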

A minimal crawl

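# Start from a seed URL and configure the crawl.
resultado <- crawler("https://example.com") |>
  cr_options(delay = 0.5, max_depth = 2) |>
  cr_use_http() |>
  cr_on_html(function(ctx) {
    # Record the page URL and the text of its first <h1>.
    ctx$push_data(list(
      url    = ctx$request$url,
      titulo = ctx$page |> rvest::html_element("h1") |> rvest::html_text2()
    ))
    # Queue the links found on this page.
    ctx$enqueue_links()
  }) |>
  # Run the crawl, then gather the pushed records.
  cr_run() |>
  cr_collect()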

The handler context

Every handler receives a context object, conventionally named ctx:

Element                 Description
ctx$request             The current request (url, label, depth, …).
ctx$response            The raw httr2 response.
ctx$page                The parsed page (xml_document) for HTML/XML, else NULL.
ctx$push_data(data)     Append a record (list or data frame) to the dataset.
ctx$enqueue_links(...)  Discover and enqueue links from the page.
ctx$log                 Logging helpers (info(), success(), warn(), error()).
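
A handler can be tried out in isolation by passing it a stand-in context. The sketch below builds a minimal mock with only request, push_data() and log$info(); mock_context() and its get_records() / get_messages() accessors are hypothetical helpers, and the ctx$log$info(msg) call form is an assumption based on the list above rather than a documented signature.

```r
# Stand-in context for illustration only; a real ctx is supplied by the crawler.
mock_context <- function(url, depth = 0) {
  records  <- list()
  messages <- character()
  list(
    request   = list(url = url, depth = depth),
    push_data = function(data) {
      records[[length(records) + 1]] <<- data   # append one record
      invisible(NULL)
    },
    log = list(
      info = function(msg) {
        messages <<- c(messages, msg)           # capture instead of printing
        invisible(NULL)
      }
    ),
    get_records  = function() records,
    get_messages = function() messages
  )
}

# A handler that uses only the request, the logger and the dataset.
handler <- function(ctx) {
  ctx$log$info(paste("visiting", ctx$request$url))
  ctx$push_data(list(url = ctx$request$url, depth = ctx$request$depth))
}

ctx <- mock_context("https://example.com", depth = 1)
handler(ctx)
str(ctx$get_records())
```

Keeping handlers as plain functions of ctx is what makes this kind of offline check possible without any network access.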

Reproducibility

The request queue deduplicates URLs by a normalised key (see cr_normalize_url()), so the same page is never fetched twice and crawls are deterministic. Persistent, resumable storage backends (DuckDB, Parquet) are on the roadmap.
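
As a rough illustration of deduplication by normalised key, the base-R sketch below drops URL fragments, lower-cases the scheme and host, and strips a trailing slash before checking whether a URL has been seen. normalize_url() and new_queue() here are hypothetical; the rules that cr_normalize_url() actually applies are not described above and may well differ.

```r
# Hypothetical normaliser: NOT the package's cr_normalize_url() implementation.
normalize_url <- function(url) {
  url <- sub("#.*$", "", url)  # drop the fragment
  m <- regmatches(
    url,
    regexec("^([A-Za-z][A-Za-z0-9+.-]*)://([^/?]+)(.*)$", url)
  )[[1]]
  if (length(m) == 0) return(url)  # not an absolute URL; leave unchanged
  rest <- m[4]
  # Strip trailing slashes only when there is no query string.
  if (!grepl("?", rest, fixed = TRUE)) rest <- sub("/+$", "", rest)
  paste0(tolower(m[2]), "://", tolower(m[3]), rest)
}

# A tiny in-memory queue that accepts a URL only if its key is new.
new_queue <- function() {
  seen  <- character()
  items <- character()
  list(
    add = function(url) {
      key <- normalize_url(url)
      if (key %in% seen) return(invisible(FALSE))
      seen  <<- c(seen, key)
      items <<- c(items, url)
      invisible(TRUE)
    },
    pending = function() items
  )
}

queue <- new_queue()
queue$add("https://Example.com/docs/")
queue$add("https://example.com/docs#intro")  # same key as the first, so skipped
queue$add("https://example.com/about")
queue$pending()
```

Because membership is decided on the key rather than the raw string, superficially different spellings of one address collapse to a single request.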
