The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.
This article covers the two sides of crawling at scale, following Crawlee’s Scaling our crawlers
and Avoid getting blocked guides: going faster
(concurrency) while staying polite (rate limits and
robots.txt).
By default crawlee is conservative and respectful:
robots.txt is honoured
(respect_robots = TRUE): disallowed URLs are skipped, and a
Crawl-delay directive is applied.user_agent so site
owners can identify your crawler;delay adds a pause between
requests;max_requests and
max_depth bound the crawl;max_retries) with backoff.The default engine is sequential. For higher throughput there are three concurrent engines; all keep handlers running sequentially in R (so your dataset and queue are never touched concurrently) — only the network I/O runs in parallel.
cr_parallel()Drains the queue in batches whose network requests run together.
cr_autoscale()Like cr_parallel(), but the batch size adapts at run
time (additive-increase on clean batches, halving on back-pressure such
as HTTP 429/503 or transport failures), staying within
[min, max].
cr_stream()Keeps concurrency requests in flight at all times: the
moment one finishes, its handler runs and the next request is pulled in.
This avoids the batch engines’ “wait for the slowest request in the
batch” stall and shines when response latency varies a lot.
| Engine | When to use |
|---|---|
| sequential (default) | small crawls; strict per-request pacing |
cr_parallel() |
steady throughput with a known good concurrency |
cr_autoscale() |
unknown/variable server capacity — let it find the level |
cr_stream() |
many pages with widely varying latency; maximum throughput |
Concurrency and politeness pull in opposite directions. The batch engines apply
delay/Crawl-delaybetween batches; the streaming engine treats concurrency itself as the throttle and does not enforce per-request pacing. For strict rate limits, prefer a batch engine with adelay.
Any engine composes with cr_persist() for resumable,
checkpointed runs:
These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.