Scaling and politeness


This article covers the two sides of crawling at scale, following Crawlee’s “Scaling our crawlers” and “Avoid getting blocked” guides: going faster (concurrency) while staying polite (rate limits and robots.txt).

Being a good web citizen

crawlee's defaults are conservative and respectful, and a few options keep a crawl polite:

robots.txt is honoured (respect_robots = TRUE): disallowed URLs are skipped, and a Crawl-delay directive is applied;
user_agent should be set to a descriptive value so site owners can identify your crawler;
delay adds a pause between requests;
max_requests and max_depth bound the crawl;
failed requests are retried (max_retries) with backoff.

crawler("https://books.toscrape.com/") |>
  cr_options(
    user_agent = "my-research-bot (you@example.com)",
    delay = 0.5, # seconds between requests
    max_requests = 500,
    max_depth = 4,
    respect_robots = TRUE
  ) |>
  cr_on_html(function(ctx) ctx$enqueue_links()) |>
  cr_run()
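
The list above mentions retries with backoff but does not state the schedule. As a rough sketch, assuming exponential backoff with a cap (a common choice, not confirmed to be what crawlee does), the wait before each retry could be computed as follows. backoff_delays() and its base, factor and cap arguments are hypothetical and not part of crawlee.

```r
# Hypothetical helper, not part of crawlee: wait (in seconds) before each retry.
# Assumes exponential growth (base * factor^(attempt - 1)) capped at `cap`.
backoff_delays <- function(max_retries, base = 1, factor = 2, cap = 30) {
  attempts <- seq_len(max_retries)
  pmin(cap, base * factor^(attempts - 1))
}

backoff_delays(5)
# 1 2 4 8 16

backoff_delays(7)
# 1 2 4 8 16 30 30  (the last two waits hit the cap)
```

Capping the wait keeps a long retry chain from stalling the crawl, while the growth gives a struggling server progressively more room to recover.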

Going faster

The default engine is sequential. For higher throughput there are three concurrent engines. All of them keep handlers running sequentially in R, so your dataset and queue are never touched concurrently; only the network I/O runs in parallel.

Fixed-concurrency batches — `cr_parallel()`

Drains the queue in fixed-size batches; the network requests within a batch run concurrently.

crawler("https://books.toscrape.com/") |>
  cr_parallel(concurrency = 8) |>
  cr_on_html(function(ctx) ctx$enqueue_links()) |>
  cr_run()

Adaptive batches — `cr_autoscale()`

Like cr_parallel(), but the batch size adapts at run time (additive-increase on clean batches, halving on back-pressure such as HTTP 429/503 or transport failures), staying within [min, max].

crawler("https://books.toscrape.com/") |>
  cr_autoscale(min = 2, max = 16) |>
  cr_on_html(function(ctx) ctx$enqueue_links()) |>
  cr_run()

Continuous streaming pool — `cr_stream()`

Keeps `concurrency` requests in flight at all times: the moment one finishes, its handler runs and the next request is pulled in. This avoids the batch engines’ “wait for the slowest request in the batch” stall and shines when response latency varies a lot.

crawler("https://books.toscrape.com/") |>
  cr_stream(concurrency = 10) |>
  cr_on_html(function(ctx) ctx$enqueue_links()) |>
  cr_run()
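
To see why the batch stall matters, here is a small simulation in base R that compares total network time for the two strategies on the same made-up latencies. It is a simplified model under stated assumptions: handler time is ignored, requests are taken in queue order, and batch_makespan() and stream_makespan() are hypothetical helpers, not crawlee functions.

```r
# Batch model: each batch takes as long as its slowest request.
batch_makespan <- function(latency, concurrency) {
  batches <- split(latency, ceiling(seq_along(latency) / concurrency))
  sum(vapply(batches, max, numeric(1)))
}

# Streaming model: each request goes to whichever slot frees up first.
stream_makespan <- function(latency, concurrency) {
  free_at <- numeric(concurrency)   # time at which each slot becomes idle
  for (l in latency) {
    slot <- which.min(free_at)
    free_at[slot] <- free_at[slot] + l
  }
  max(free_at)
}

# Made-up latencies in seconds: mostly fast pages with two slow ones.
latency <- c(5, 1, 1, 1, 5, 1, 1, 1)

batch_makespan(latency, concurrency = 2)
# 12
stream_makespan(latency, concurrency = 2)
# 8
```

With uniform latencies the two models give the same total; the gap opens only when a few slow responses hold up otherwise finished batches.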

Choosing an engine

Engine	When to use
sequential (default)	small crawls; strict per-request pacing
`cr_parallel()`	steady throughput with a known good concurrency
`cr_autoscale()`	unknown/variable server capacity — let it find the level
`cr_stream()`	many pages with widely varying latency; maximum throughput

Concurrency and politeness pull in opposite directions. The batch engines apply delay / Crawl-delay between batches; the streaming engine treats concurrency itself as the throttle and does not enforce per-request pacing. For strict rate limits, prefer a batch engine with a delay.
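
For a batch engine, the resulting request rate can be estimated with simple arithmetic. The sketch below assumes the delay is applied once between consecutive batches and that a batch lasts as long as its slowest request; both helpers are hypothetical and not part of crawlee.

```r
# Approximate requests per second for a batch engine:
# one batch of `concurrency` requests every (delay + batch_latency) seconds.
batch_rate <- function(concurrency, delay, batch_latency) {
  concurrency / (delay + batch_latency)
}

# Delay needed to stay at or below a target rate (never negative).
delay_for_rate <- function(concurrency, target_rate, batch_latency) {
  max(0, concurrency / target_rate - batch_latency)
}

batch_rate(8, delay = 0.5, batch_latency = 1.5)
# 4   (requests per second)

delay_for_rate(8, target_rate = 2, batch_latency = 1.5)
# 2.5 (seconds between batches)
```

If a site asks for at most a given request rate, the second helper gives a starting value for delay at a chosen concurrency; measured latencies will differ from the made-up ones used here.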

Combining with persistence

Any engine composes with cr_persist() for resumable, checkpointed runs:

crawler("https://books.toscrape.com/") |>
  cr_autoscale(min = 2, max = 16) |>
  cr_persist("runs/books", dataset = "duckdb") |>
  cr_on_html(function(ctx) ctx$enqueue_links()) |>
  cr_run()
