crawlee mirrors the architecture of Crawlee in pure R. A crawler owns:
You build a crawler with crawler() and configure it with
cr_* verbs that compose through the native pipe
(|>).
Every handler receives a context object, conventionally named
ctx:
| Element | Description |
|---|---|
| ctx$request | The current request (url, label, depth, …). |
| ctx$response | The raw httr2 response. |
| ctx$page | The parsed page (xml_document) for HTML/XML, else NULL. |
| ctx$push_data(data) | Append a record (list or data frame) to the dataset. |
| ctx$enqueue_links(...) | Discover and enqueue links from the page. |
| ctx$log | Logging helpers (info(), success(), warn(), error()). |
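To make the shape of a handler concrete, here is a minimal base R sketch. The handler only touches elements listed in the table above. The make_mock_ctx() helper and its store field are hypothetical stand-ins written for this illustration so the handler can be exercised without the package; the real context object may be built quite differently.

```r
# A handler is a plain function of one argument, the context.
handle_page <- function(ctx) {
  ctx$log$info(paste("Visited", ctx$request$url))
  ctx$push_data(list(url = ctx$request$url, depth = ctx$request$depth))
  invisible(NULL)
}

# Hypothetical mock of the context (NOT the package's real object).
# It records pushed data and log messages in an environment.
make_mock_ctx <- function(url, depth = 0L) {
  store <- new.env()
  store$records <- list()
  store$messages <- character()
  list(
    request = list(url = url, label = NULL, depth = depth),
    response = NULL,
    page = NULL,
    push_data = function(data) {
      store$records[[length(store$records) + 1L]] <- data
      invisible(NULL)
    },
    enqueue_links = function(...) invisible(NULL),
    log = list(info = function(msg) {
      store$messages <- c(store$messages, msg)
      invisible(NULL)
    }),
    store = store  # assumption: exposed only so the mock can be inspected
  )
}

ctx <- make_mock_ctx("https://example.com/")
handle_page(ctx)
```

Writing handlers against a small mock like this is one way to unit-test extraction logic before running a real crawl.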
enqueue_links() accepts glob,
include/exclude patterns and a
same_domain flag (on by default), so you only follow the
links you care about:
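As a rough illustration of this kind of filtering, the base R sketch below applies a glob, an exclude pattern and a same-domain check to a vector of candidate links. The filter_links() function and its argument handling are assumptions made for this example, not the package's implementation; the actual matching rules of enqueue_links() may differ.

```r
# Hypothetical helper (NOT part of the package): keep only links that are on
# the same host as base_url, match `glob`, and do not match `exclude`.
filter_links <- function(links, base_url, glob = NULL, exclude = NULL,
                         same_domain = TRUE) {
  # Extract the lower-cased host from an absolute URL.
  host_of <- function(u) {
    tolower(sub("^[a-zA-Z][a-zA-Z0-9+.-]*://([^/:?#]+).*$", "\\1", u))
  }
  keep <- rep(TRUE, length(links))
  if (same_domain) keep <- keep & host_of(links) == host_of(base_url)
  # utils::glob2rx() turns a wildcard pattern into a regular expression.
  if (!is.null(glob)) keep <- keep & grepl(utils::glob2rx(glob), links)
  if (!is.null(exclude)) keep <- keep & !grepl(utils::glob2rx(exclude), links)
  links[keep]
}

links <- c(
  "https://example.com/blog/post-1",
  "https://example.com/blog/draft-2",
  "https://example.com/about",
  "https://other.org/blog/post-3"
)

kept <- filter_links(
  links,
  base_url = "https://example.com/",
  glob     = "https://example.com/blog/*",
  exclude  = "*/draft-*"
)
```

In this sketch only the first link survives: the draft is excluded, the about page fails the glob, and the last link is on another host.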
Requests enqueued with a label are routed to the
matching handler registered with
cr_on_html(..., label = "article").
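The routing idea can be sketched in base R as a named list of handlers keyed by label. Everything here is hypothetical: the handlers list, the route() function, and in particular the fallback to a "default" handler for unlabelled requests are assumptions of this sketch, not documented behaviour of the package.

```r
# Hypothetical registry (NOT the package's internals): one handler per label.
handlers <- list(
  article = function(request) paste("article handler:", request$url),
  default = function(request) paste("default handler:", request$url)
)

# Pick the handler whose name matches the request label.
# Assumption: requests without a known label fall back to "default".
route <- function(request, handlers) {
  label <- request$label
  key <- if (!is.null(label) && label %in% names(handlers)) label else "default"
  handlers[[key]](request)
}

labelled   <- route(list(url = "https://example.com/a/1", label = "article"), handlers)
unlabelled <- route(list(url = "https://example.com/", label = NULL), handlers)
```

Keeping one small handler per label tends to read better than a single handler that branches on the URL.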
The request queue deduplicates URLs by a normalised key (see
cr_normalize_url()), so the same page is never fetched
twice and crawls are deterministic. Persistent, resumable storage
backends (DuckDB, Parquet) are on the roadmap.
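To show why a normalised key makes deduplication work, here is a small in-memory queue in base R. The normalize_key() function is a hypothetical stand-in with three assumed rules (drop the fragment, lower-case scheme and host, drop a trailing slash); the real cr_normalize_url() may normalise differently. The new_queue() function is likewise an illustration, not the package's queue.

```r
# Hypothetical normaliser for a single URL (NOT cr_normalize_url()).
normalize_key <- function(url) {
  key <- sub("#.*$", "", url)  # drop the fragment
  # Split into "scheme://host" and the rest, then lower-case the first part.
  m <- regexec("^([A-Za-z][A-Za-z0-9+.-]*://[^/?]+)(.*)$", key)
  parts <- regmatches(key, m)[[1]]
  if (length(parts) == 3L) key <- paste0(tolower(parts[2]), parts[3])
  sub("/$", "", key)           # drop a trailing slash
}

# Minimal in-memory queue that refuses URLs whose key was already seen.
new_queue <- function() {
  seen <- new.env(hash = TRUE)
  urls <- character()
  list(
    add = function(url) {
      key <- normalize_key(url)
      if (exists(key, envir = seen, inherits = FALSE)) return(invisible(FALSE))
      assign(key, TRUE, envir = seen)
      urls <<- c(urls, url)
      invisible(TRUE)
    },
    pending = function() urls
  )
}

q <- new_queue()
first  <- q$add("https://Example.com/docs/")
second <- q$add("https://example.com/docs#intro")  # same key as the first
third  <- q$add("https://example.com/blog")
```

Because acceptance depends only on the key, the order and spelling in which equivalent URLs are discovered does not change which pages get fetched.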