Storage and resumable runs

Crawlee separates three kinds of storage: the request queue (what to crawl), the dataset (structured results) and the key-value store (binary blobs). The crawlee package mirrors that split and adds a one-call setup for reproducible, resumable runs.

The dataset

Handlers call ctx$push_data() to append records; cr_collect() returns them as one tibble. By default the dataset lives in memory.

result <- crawler("https://books.toscrape.com/") |>
  cr_on_html(function(ctx) {
    ctx$push_data(list(url = ctx$request$url))
  }) |>
  cr_run() |>
  cr_collect()

For larger or longer crawls, choose a persistent backend with cr_dataset():

"jsonl" — append-only, schema-flexible, one JSON object per line;
"duckdb" — appended to a DuckDB table, ready for SQL.

crawler("https://books.toscrape.com/") |>
  cr_dataset(backend = "duckdb", path = "books.duckdb") |>
  cr_on_html(function(ctx) ctx$push_data(list(url = ctx$request$url))) |>
  cr_run()

Both persistent backends resume from an existing file: re-opening the same path keeps the rows already there.
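
To illustrate why an append-only file resumes naturally, here is a minimal base-R sketch of the "one JSON object per line" idea. It does not use crawlee: append_record() is a hypothetical helper, and its tiny encoder only handles flat character fields without quotes or backslashes.

```r
# Hypothetical helper (not part of crawlee): append one flat record
# as a single JSON line. Assumes plain character values only.
append_record <- function(path, record) {
  fields <- sprintf('"%s":"%s"', names(record), unlist(record))
  cat("{", paste(fields, collapse = ","), "}\n",
      sep = "", file = path, append = TRUE)
}

path <- tempfile(fileext = ".jsonl")

# First "run" writes two rows.
append_record(path, list(url = "https://books.toscrape.com/"))
append_record(path, list(url = "https://books.toscrape.com/catalogue/page-2.html"))

# Second "run" re-opens the same path and appends; earlier rows stay put.
append_record(path, list(url = "https://books.toscrape.com/catalogue/page-3.html"))

rows <- readLines(path)
length(rows)
```

Because every write only adds a line at the end, nothing written by an earlier run is touched, which is the property the persistent backends rely on.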

The key-value store

Use the key-value store for raw, non-tabular content — PDFs, images, page snapshots. ctx$save_body() writes the current response there, and cr_store() sets the directory.

crawler("https://example.com/report.pdf") |>
  cr_store("downloads") |>
  cr_on_pdf(function(ctx) {
    ctx$push_data(list(url = ctx$request$url, pages = length(ctx$pdf_text())))
    ctx$save_body(ext = "pdf") # -> downloads/<sanitised-url>.pdf
  }) |>
  cr_run()
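
The idea of storing a raw body under a name derived from its URL can be sketched in base R. This is only an illustration: sanitise_key() and save_blob() are hypothetical, and the replacement rule below is an assumption, not the naming scheme crawlee actually applies.

```r
# Hypothetical sanitiser: replace runs of characters that are unsafe
# in file names with a single underscore. crawlee's real rule may differ.
sanitise_key <- function(url) gsub("[^A-Za-z0-9._-]+", "_", url)

store_dir <- file.path(tempdir(), "downloads")
dir.create(store_dir, showWarnings = FALSE, recursive = TRUE)

# Write raw bytes to <store_dir>/<sanitised-url>.<ext> and return the path.
save_blob <- function(url, body, ext) {
  path <- file.path(store_dir, paste0(sanitise_key(url), ".", ext))
  writeBin(body, path)
  path
}

body <- charToRaw("%PDF-1.4 placeholder bytes")  # stand-in for a response body
path <- save_blob("https://example.com/report.pdf", body, "pdf")

# Read the file back and confirm the bytes round-trip unchanged.
stored <- readBin(path, what = "raw", n = file.size(path))
identical(stored, body)
```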

The request queue and reproducibility

The request queue deduplicates by a normalised key (see cr_normalize_url()), so each URL is fetched at most once and a crawl is deterministic. It can also persist its state — pending requests, seen keys, handled count — which is what makes a crawl resumable.
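
Deduplication by a normalised key can be sketched in a few lines of base R. The normalisation here (drop the fragment, lower-case scheme and host, strip trailing slashes) is an assumption chosen for illustration; it is not a description of what cr_normalize_url() does, and normalise_key(), new_queue() and enqueue() are hypothetical names.

```r
# Hypothetical normalisation, for illustration only.
normalise_key <- function(url) {
  key <- sub("#.*$", "", url)                      # drop the fragment
  m <- regexec("^([A-Za-z][A-Za-z0-9+.-]*://[^/?#]*)(.*)$", key)
  parts <- regmatches(key, m)[[1]]
  if (length(parts) == 3) {
    key <- paste0(tolower(parts[2]), parts[3])     # lower-case scheme + host
  }
  sub("/+$", "", key)                              # strip trailing slashes
}

# A queue is an environment holding the seen keys and the pending URLs.
new_queue <- function() {
  q <- new.env()
  q$seen <- character()
  q$pending <- character()
  q
}

# Add a URL only if its normalised key has not been seen before.
enqueue <- function(q, url) {
  key <- normalise_key(url)
  if (key %in% q$seen) return(invisible(FALSE))
  q$seen <- c(q$seen, key)
  q$pending <- c(q$pending, url)
  invisible(TRUE)
}

q <- new_queue()
enqueue(q, "https://books.toscrape.com/")
enqueue(q, "HTTPS://Books.ToScrape.com/#top")      # same key, skipped
enqueue(q, "https://books.toscrape.com/catalogue/page-2.html")
q$pending
```

Only two URLs end up pending, because the second call maps to the same key as the first.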

One-call setup: cr_persist()

cr_persist(dir) wires everything to a run directory:

the queue is checkpointed to queue.rds during the run;
the dataset uses a persistent backend (dataset.jsonl or dataset.duckdb);
ctx$save_body() writes under kv/;
a manifest (manifest.rds / manifest.json) records the start URLs, an options snapshot and run statistics.

crawl <- crawler("https://books.toscrape.com/") |>
  cr_persist("runs/books", dataset = "duckdb") |>
  cr_on_html(function(ctx) {
    ctx$push_data(list(url = ctx$request$url))
    ctx$enqueue_links(glob = "*/catalogue/*")
  }) |>
  cr_run()

data <- cr_collect(crawl)
cr_close(crawl) # release the DuckDB connection

Resuming

If a run is interrupted, run the exact same pipeline again. Because the state already exists in runs/books, cr_persist() restores it and the crawl continues where it left off — already-fetched URLs are skipped.

# Same code as above: it resumes instead of starting over.
crawler("https://books.toscrape.com/") |>
  cr_persist("runs/books", dataset = "duckdb") |>
  cr_on_html(function(ctx) {
    ctx$push_data(list(url = ctx$request$url))
    ctx$enqueue_links(glob = "*/catalogue/*")
  }) |>
  cr_run()

For the DuckDB backend, call cr_collect() before cr_close() — closing releases the connection.
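
The checkpoint-and-restore idea behind resuming can be sketched with saveRDS() and readRDS(). This is a toy model, not crawlee's implementation: the layout of the saved state (pending, seen, handled) follows the description above, but the actual contents of queue.rds are an assumption, and run() is a hypothetical function that "fetches" by simply popping the queue.

```r
run_dir <- file.path(tempdir(), "demo-run")
dir.create(run_dir, showWarnings = FALSE, recursive = TRUE)
state_file <- file.path(run_dir, "queue.rds")
unlink(state_file)  # make sure the demo starts from a clean directory

# Hypothetical crawl loop: restore state if a checkpoint exists,
# otherwise start from the given URLs. max_steps simulates an interruption.
run <- function(start_urls, max_steps) {
  state <- if (file.exists(state_file)) {
    readRDS(state_file)
  } else {
    list(pending = start_urls, seen = start_urls, handled = 0L)
  }
  steps <- 0L
  while (length(state$pending) > 0 && steps < max_steps) {
    state$pending <- state$pending[-1]   # pretend to fetch the next URL
    state$handled <- state$handled + 1L
    steps <- steps + 1L
    saveRDS(state, state_file)           # checkpoint after every request
  }
  state
}

urls <- c("https://books.toscrape.com/",
          "https://books.toscrape.com/catalogue/page-2.html",
          "https://books.toscrape.com/catalogue/page-3.html")

first  <- run(urls, max_steps = 2)    # "interrupted" after two requests
second <- run(urls, max_steps = Inf)  # same call again: picks up the rest
c(first = first$handled, second = second$handled)
```

The second call does not start over: it reads the checkpoint, finds one pending URL and handles only that one.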
