
crawlee

A tidy R interface for reproducible web crawling — inspired by the architecture of Crawlee, implemented in pure R.

crawlee brings the unified-crawler idea to R: a deduplicating, resumable request queue, content-type aware handlers, structured storage and rich console logging via cli. It can crawl HTML pages, sitemaps, RSS and Atom feeds and PDF documents — with reproducibility as a first-class concern.

It is built entirely on the R web-scraping ecosystem (httr2, rvest, xml2, chromote) — no Node.js runtime required.

How it works

A crawl is a loop: requests flow through a deduplicating queue to a fetch engine; each response is dispatched to a handler that extracts data (push_data()) and discovers more links (enqueue_links()), which flow back into the queue until it drains.
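The loop described above can be sketched in a few lines of base R. This is an illustrative toy, not crawlee's internal implementation: `crawl_sketch()`, `fetch_links` and the in-memory `site` list are hypothetical names, and the "fetch engine" is replaced by a lookup so the example runs without network access.

```r
# Toy "site": each URL maps to the links found on that page (made-up data).
site <- list(
  "https://example.com/"  = c("https://example.com/a", "https://example.com/b"),
  "https://example.com/a" = c("https://example.com/b", "https://example.com/"),
  "https://example.com/b" = c("https://example.com/a", "https://example.com/c"),
  "https://example.com/c" = character(0)
)

crawl_sketch <- function(start, fetch_links, max_depth = 2) {
  # Pending requests, processed first-in first-out.
  queue <- data.frame(url = start, depth = 0L, stringsAsFactors = FALSE)
  seen    <- start   # every URL that has ever been enqueued
  dataset <- list()  # records pushed by the "handler"

  while (nrow(queue) > 0) {
    req   <- queue[1, ]
    queue <- queue[-1, , drop = FALSE]

    # Stand-in for the fetch engine: look up the page's links.
    links <- fetch_links(req$url)

    # Stand-in for push_data(): store one record per visited page.
    dataset[[length(dataset) + 1]] <- data.frame(
      url = req$url, depth = req$depth, n_links = length(links),
      stringsAsFactors = FALSE
    )

    # Stand-in for enqueue_links(): only URLs never seen before re-enter the queue.
    if (req$depth < max_depth) {
      new  <- setdiff(links, seen)
      seen <- c(seen, new)
      if (length(new) > 0) {
        queue <- rbind(
          queue,
          data.frame(url = new, depth = req$depth + 1L, stringsAsFactors = FALSE)
        )
      }
    }
  }
  do.call(rbind, dataset)
}

visited <- crawl_sketch("https://example.com/", function(u) site[[u]])
```

Although the pages link back to each other, each URL is visited once, because `seen` records everything that has been enqueued. The loop ends when the queue is empty.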

[Figure: crawlee request lifecycle]

Architecture

[Figure: crawlee architecture]

Installation

# install.packages("pak")
pak::pak("StrategicProjects/crawlee")

Usage

library(crawlee)

resultado <- crawler("https://example.com") |>
  cr_options(delay = 0.5, max_depth = 2, respect_robots = TRUE) |>
  cr_use_http() |>
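  cr_on_html(function(ctx) {
    # Store one record per page: its URL and the text of its first <h1>.
    ctx$push_data(list(
      url    = ctx$request$url,
      titulo = ctx$page |> rvest::html_element("h1") |> rvest::html_text2()
    ))
    # Queue only the links on this page that match the glob.
    ctx$enqueue_links(glob = "*/blog/*")
  }) |>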
  cr_run() |>
  cr_collect()

resultado
#> # A tibble: 1 × 2
#>   url                 titulo
#>   <chr>               <chr>
#> 1 https://example.com Example Domain
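To give a feel for what a glob such as `"*/blog/*"` selects, here is a minimal base R sketch. It assumes shell-style wildcard semantics as implemented by `utils::glob2rx()`. How crawlee itself interprets the `glob` argument is not stated here and may differ. The `links` vector is made-up data.

```r
# Candidate links as they might be discovered on a page (hypothetical URLs).
links <- c(
  "https://example.com/blog/post-1",
  "https://example.com/about",
  "https://example.com/blog/2024/post-2"
)

# Translate the glob into a regular expression, then filter with grepl().
pattern <- utils::glob2rx("*/blog/*")
matched <- links[grepl(pattern, links)]
```

Under these assumptions `matched` keeps the two `/blog/` URLs and drops `/about`.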

Design principles

Reproducibility first — deduplicating, resumable request queue; runs are meant to be deterministic and re-runnable.
No heavy mandatory dependencies — DuckDB, chromote and pdftools are optional (Suggests), loaded only when used.
Tidy & predictable — cr_* verbs compose with the native pipe and always return tibbles.
A polite web citizen — rate limiting and robots.txt awareness by default.
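As an illustration of the rate-limiting principle, the sketch below computes how long to wait before the next request to a given host. It is one possible approach under stated assumptions, not crawlee's actual mechanism: `make_pacer()` is a hypothetical helper, and `delay` is assumed to be in seconds.

```r
# Returns a function that, for each URL, reports how many seconds to wait
# so that requests to the same host are at least `delay` seconds apart.
make_pacer <- function(delay = 0.5) {
  last <- numeric(0)  # named vector: host -> scheduled time of its last request
  function(url, now = as.numeric(Sys.time())) {
    # Crude host extraction, good enough for a sketch.
    host <- sub("^[a-z]+://([^/:]+).*$", "\\1", tolower(url))
    prev <- last[host]
    wait <- if (is.na(prev)) 0 else max(0, prev + delay - now)
    last[host] <<- now + wait  # remember when this request will actually go out
    wait
  }
}

pace <- make_pacer(delay = 0.5)
w1 <- pace("https://example.com/a", now = 100)    # first request to this host
w2 <- pace("https://example.com/b", now = 100.2)  # same host, too soon
w3 <- pace("https://other.org/",    now = 100.2)  # different host
```

In a real loop the caller would run `Sys.sleep(wait)` before fetching. The `now` argument is there only so the example is deterministic.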

Roadmap

Milestone	Scope	Status
M1	Core: queue, HTTP, HTML handlers, dataset, cli logs	✅
M2	Sitemap & RSS discovery, robots.txt enforcement	✅
M3	PDF / document handlers (`pdftools`)	✅
M4	Headless browser backend (`chromote`)	✅
M5	RAG helpers (chunking, embeddings, export)	✅
M6	Persistent & resumable storage (jsonl/duckdb, `cr_persist()`)	✅
M7	Parallel fetching (`cr_parallel()`)	✅
M8	Autoscaling (`cr_autoscale()`) & streaming pool (`cr_stream()`)	✅
M9	Adaptive streaming + per-host pacing	✅

License
