The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.

crawlee 0.1.0

First release. A tidy, native-R, Crawlee-inspired toolkit for reproducible web crawling.

Milestone M9 — adaptive & polite streaming

cr_stream(adaptive = TRUE, min, max) adapts the streaming pool’s in-flight target at run time (AIMD on back-pressure), like cr_autoscale() but for the continuous scheduler.
The streaming engine now paces launches per host (delay / robots.txt Crawl-delay): a host is not hit again until its interval has elapsed, while different hosts keep running in parallel.

Milestone M8 — autoscaling & streaming

cr_autoscale(min, max) adapts the parallel batch concurrency at run time (Crawlee autoscaled-pool style): additive-increase on clean batches, multiplicative-decrease on back-pressure (a transport failure or HTTP 429/500/502/503/504), clamped to [min, max].
cr_stream(concurrency) adds a continuous-pool scheduler (via httr2::req_perform_promise() + /): keeps concurrency requests in flight at all times, dispatching and refilling as each finishes — avoiding the batch engine’s “wait for the slowest” stall.

Milestone M7 — parallel fetching

cr_parallel(concurrency) enables concurrent fetching for the HTTP backend (Crawlee’s autoscaled-pool equivalent): the queue is drained in batches whose network I/O runs concurrently via httr2::req_perform_parallel(), while handlers still run sequentially in R (no shared-state hazard). robots.txt, retries, depth/request limits and queue checkpointing all still apply; delay/Crawl-delay are applied between batches.
The engine was refactored into shared dispatch/error steps used by both the sequential and parallel loops.

Milestone M6 — persistent & resumable storage

cr_persist() ties a crawl to a run directory: the request queue is checkpointed (queue.rds) during the run and restored on the next run, so a crawl resumes where it left off without re-fetching seen URLs.
Persistent [Dataset] backends: cr_dataset(backend = "jsonl") (append-only, schema-flexible) and "duckdb" (SQL-ready). The RequestQueue gained save()/restore()/set_path().
A reproducibility manifest (manifest.rds / manifest.json) records the start URLs, options snapshot and run stats.
cr_close() releases the browser session and DuckDB connection.

Milestone M5 — RAG

cr_chunk() splits text (a character vector or a data-frame column) into overlapping chunks, by character or word, carrying metadata per chunk.
cr_embed() attaches an embedding list-column via a user-supplied, provider-agnostic embedding function, applied in batches. crawlee never calls an external service itself.
cr_export() writes chunks (and embeddings) to Parquet, JSONL, CSV or DuckDB for retrieval.

Milestone M4 — headless browser

cr_use_browser() renders JavaScript-heavy pages with a headless Chrome/Chromium via , with wait and wait_selector controls. Handlers are unchanged (ctx$page, enqueue_links()); the context gains ctx$screenshot(), saved to the [KeyValueStore].
Fetch backends are now unified behind a normalised internal fetched object, so handlers behave identically regardless of HTTP vs browser.

Milestone M3 — documents

Content-type aware dispatch: each response is classified (html, pdf, other) and routed to the matching default handler; explicit request labels still take precedence.
cr_on_pdf() registers a PDF handler. Its context adds pdf_text() (per-page text via ), body_raw()/body_string() and save_body().
KeyValueStore plus cr_store() and ctx$save_body(): persist raw responses (PDFs, images, snapshots) on disk alongside the structured dataset.

Milestone M2 — discovery

cr_from_sitemap() enqueues URLs from a sitemap.xml, recursing into sitemap indexes, transparently handling gzipped sitemaps, with glob filters and a since filter on <lastmod> for incremental crawls.
cr_from_rss() enqueues items from RSS and Atom feeds, carrying item title and date into the request’s user_data.
robots.txt is now enforced when respect_robots = TRUE (the default): a native parser/matcher (User-agent grouping, */$ patterns, longest-match with Allow override, Crawl-delay), cached per host. Disallowed URLs are skipped and reported; Crawl-delay is honoured.

Milestone M1 — core

crawler() builds a stateful, pipe-friendly crawler.
RequestQueue: deduplicating (normalised unique_key), FIFO, resumable request queue with retry rescheduling.
cr_options() configures concurrency, depth, delay, retries, user agent and log verbosity.
cr_use_http() HTTP fetch backend (httr2); cr_use_browser() reserved.
cr_on_html() registers content handlers; handler context exposes push_data() and enqueue_links() (with glob/include/exclude and same-domain filtering).
Dataset append-only store; cr_run() drives the crawl and cr_collect() returns a tibble.
Rich console logging via cli.

These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.