A tidy R interface for reproducible web crawling — inspired by the architecture of Crawlee, implemented in pure R.
crawlee brings the unified-crawler idea to R: a deduplicating, resumable request queue, content-type aware handlers, structured storage and rich console logging via cli. It can crawl HTML pages, sitemaps, RSS and Atom feeds and PDF documents — with reproducibility as a first-class concern.
It is built entirely on the R web-scraping ecosystem (httr2, rvest, xml2, chromote) — no Node.js runtime required.
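To make "content-type aware handlers" concrete, here is a minimal base-R sketch of dispatching a response to a handler by MIME type. It is not crawlee's implementation: the `handlers` registry and the `dispatch()` helper are hypothetical names invented for illustration, and the handlers only return a label.

```r
# Hypothetical handler registry keyed by MIME type (illustration only,
# not crawlee's internal structure).
handlers <- list(
  "text/html"           = function(body) "html handler",
  "application/pdf"     = function(body) "pdf handler",
  "application/rss+xml" = function(body) "feed handler"
)

# Pick a handler from the Content-Type header value.
dispatch <- function(content_type, body, handlers) {
  # Drop parameters such as "; charset=UTF-8" and normalise case.
  mime <- tolower(trimws(strsplit(content_type, ";", fixed = TRUE)[[1]][1]))
  handler <- handlers[[mime]]
  if (is.null(handler)) return(NULL)  # no handler registered: skip the response
  handler(body)
}

html_result  <- dispatch("text/html; charset=UTF-8", "<h1>Hi</h1>", handlers)
other_result <- dispatch("image/png", "", handlers)
```

In this sketch an unregistered type simply yields `NULL`; a real crawler might log or count such responses instead.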
A crawl is a loop: requests flow through a deduplicating queue to a
fetch engine; each response is dispatched to a handler that extracts
data (push_data()) and discovers more links
(enqueue_links()), which flow back into the queue until it
drains.
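The loop described above can be sketched in a few lines of base R. This is an illustration under stated assumptions, not crawlee's code: `site` is a made-up in-memory link graph standing in for the network, and `crawl_sketch()` is a hypothetical helper whose two handler steps only mimic what push_data() and enqueue_links() are described as doing.

```r
# Hypothetical in-memory "site": each URL maps to the links found on that page.
site <- list(
  "https://example.com/"       = c("https://example.com/blog/a",
                                   "https://example.com/about"),
  "https://example.com/blog/a" = c("https://example.com/",
                                   "https://example.com/blog/b"),
  "https://example.com/blog/b" = character(0),
  "https://example.com/about"  = c("https://example.com/")
)

crawl_sketch <- function(start, site, max_depth = 2) {
  queue   <- list(list(url = start, depth = 0))  # pending requests
  seen    <- start                               # URLs already queued (dedup)
  dataset <- list()                              # rows collected so far

  while (length(queue) > 0) {
    req   <- queue[[1]]                          # take the oldest request
    queue <- queue[-1]

    links <- site[[req$url]]                     # "fetch": look the page up
    if (is.null(links)) next                     # unknown URL: nothing to handle

    # Handler, step 1: record one row for this page (the push_data() role).
    dataset[[length(dataset) + 1]] <- data.frame(
      url = req$url, depth = req$depth, stringsAsFactors = FALSE
    )

    # Handler, step 2: queue unseen links (the enqueue_links() role).
    if (req$depth < max_depth) {
      for (link in setdiff(links, seen)) {
        seen <- c(seen, link)
        queue[[length(queue) + 1]] <- list(url = link, depth = req$depth + 1)
      }
    }
  }
  do.call(rbind, dataset)                        # the queue has drained
}

result <- crawl_sketch("https://example.com/", site)
```

Because `seen` is updated at enqueue time, the back-links to the start page are never queued twice, which is what lets the loop terminate on a cyclic link graph.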
# install.packages("pak")
pak::pak("StrategicProjects/crawlee")

library(crawlee)
resultado <- crawler("https://example.com") |>
  cr_options(delay = 0.5, max_depth = 2, respect_robots = TRUE) |>
  cr_use_http() |>
  cr_on_html(function(ctx) {
    ctx$push_data(list(
      url = ctx$request$url,
      titulo = ctx$page |> rvest::html_element("h1") |> rvest::html_text2()
    ))
    ctx$enqueue_links(glob = "*/blog/*")
  }) |>
  cr_run() |>
  cr_collect()
resultado
#> # A tibble: 1 × 2
#>   url                 titulo
#>   <chr>               <chr>
#> 1 https://example.com Example Domain

- […] Suggests), loaded only when used.
- cr_* verbs compose with the native pipe and always return tibbles.
- robots.txt awareness by default.

| Milestone | Scope | Status |
|---|---|---|
| M1 | Core: queue, HTTP, HTML handlers, dataset, cli logs | ✅ |
| M2 | Sitemap & RSS discovery, robots.txt enforcement | ✅ |
| M3 | PDF / document handlers (pdftools) | ✅ |
| M4 | Headless browser backend (chromote) | ✅ |
| M5 | RAG helpers (chunking, embeddings, export) | ✅ |
| M6 | Persistent & resumable storage (jsonl/duckdb, cr_persist()) | ✅ |
| M7 | Parallel fetching (cr_parallel()) | ✅ |
| M8 | Autoscaling (cr_autoscale()) & streaming pool (cr_stream()) | ✅ |
| M9 | Adaptive streaming + per-host pacing | ✅ |

MIT © crawlee authors.