Crawlee separates three kinds of storage: the request queue (what to crawl), the dataset (structured results) and the key-value store (binary blobs). crawlee mirrors that split and adds a one-call setup for reproducible, resumable runs.
Handlers call ctx$push_data() to append records;
cr_collect() returns them as one tibble. By default the
dataset lives in memory.
result <- crawler("https://books.toscrape.com/") |>
  cr_on_html(function(ctx) {
    ctx$push_data(list(url = ctx$request$url))
  }) |>
  cr_run() |>
  cr_collect()

For larger or longer crawls, choose a persistent backend with cr_dataset():
- "jsonl": append-only, schema-flexible, one JSON object per line;
- "duckdb": appended to a DuckDB table, ready for SQL.

crawler("https://books.toscrape.com/") |>
  cr_dataset(backend = "duckdb", path = "books.duckdb") |>
  cr_on_html(function(ctx) ctx$push_data(list(url = ctx$request$url))) |>
  cr_run()

Both persistent backends resume from an existing file: re-opening the same path keeps the rows already there.
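To make the append-only, one-object-per-line format concrete, here is a minimal sketch that writes and re-reads a JSONL file with the jsonlite package. It is not crawlee's implementation: append_jsonl() and read_jsonl() are hypothetical helpers, and the title field on the second record is made up to show that lines need not share a schema.

```r
library(jsonlite)

# Hypothetical helper: append one record to the file as a single JSON line.
append_jsonl <- function(path, record) {
  line <- toJSON(record, auto_unbox = TRUE)
  cat(line, "\n", sep = "", file = path, append = TRUE)
  invisible(path)
}

# Hypothetical helper: read every line back into one data frame.
read_jsonl <- function(path) {
  stream_in(file(path), verbose = FALSE)
}

path <- tempfile(fileext = ".jsonl")

# First "run": one record.
append_jsonl(path, list(url = "https://books.toscrape.com/"))

# A later "run" on the same path appends; the earlier line is untouched.
append_jsonl(path, list(
  url = "https://books.toscrape.com/catalogue/page-2.html",
  title = "illustrative title"
))

rows <- read_jsonl(path)
print(rows)
```

Because each write only appends a line, an interrupted run leaves every earlier record intact, which is the property a resumable dataset relies on.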
Use the key-value store for raw, non-tabular content — PDFs, images,
page snapshots. ctx$save_body() writes the current response
there, and cr_store() sets the directory.
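The idea behind a key-value store for raw content can be sketched in base R: bytes go in under a key and come back unchanged, with no tabular structure imposed. The helpers kv_put() and kv_get() are hypothetical and only illustrate the concept; they are not how ctx$save_body() or cr_store() are implemented.

```r
# Hypothetical helper: write raw bytes to <dir>/<key>.
kv_put <- function(dir, key, bytes) {
  dir.create(dir, recursive = TRUE, showWarnings = FALSE)
  path <- file.path(dir, key)
  writeBin(bytes, path)
  invisible(path)
}

# Hypothetical helper: read the raw bytes stored under a key.
kv_get <- function(dir, key) {
  path <- file.path(dir, key)
  readBin(path, what = "raw", n = file.size(path))
}

store <- file.path(tempdir(), "kv")

# Illustrative payload standing in for a downloaded response body.
body <- charToRaw("%PDF-1.7 illustrative bytes")

kv_put(store, "report.pdf", body)
round_trip <- kv_get(store, "report.pdf")
print(identical(round_trip, body))
```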
The request queue deduplicates by a normalised key (see
cr_normalize_url()), so each URL is fetched at most once
and a crawl is deterministic. It can also persist its state — pending
requests, seen keys, handled count — which is what makes a crawl
resumable.
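The following base R sketch shows how deduplication by a normalised key and a persisted queue state fit together. Everything in it is an assumption made for illustration: normalize_key() applies only three simple rules (drop the fragment, lower-case scheme and host, strip trailing slashes), which may differ from what cr_normalize_url() does, and the queue helpers are hypothetical rather than crawlee's own.

```r
# Hypothetical normaliser; cr_normalize_url() may apply different rules.
normalize_key <- function(url) {
  url <- sub("#.*$", "", url)  # drop the fragment
  parts <- regmatches(
    url,
    regexec("^([A-Za-z][A-Za-z0-9+.-]*://[^/?]*)(.*)$", url)
  )[[1]]
  if (length(parts) == 3L) {
    url <- paste0(tolower(parts[2]), parts[3])  # lower-case scheme and host
  }
  sub("/+$", "", url)  # strip trailing slashes
}

new_queue <- function() {
  q <- new.env()
  q$pending <- character()
  q$seen <- character()
  q$handled <- 0L
  q
}

# Add a URL unless its normalised key was already seen.
enqueue <- function(q, url) {
  key <- normalize_key(url)
  if (key %in% q$seen) return(invisible(FALSE))
  q$seen <- c(q$seen, key)
  q$pending <- c(q$pending, url)
  invisible(TRUE)
}

# Take the next pending URL and count it as handled.
next_request <- function(q) {
  if (length(q$pending) == 0L) return(NULL)
  url <- q$pending[[1]]
  q$pending <- q$pending[-1]
  q$handled <- q$handled + 1L
  url
}

# Persist and restore the three pieces of state named above.
save_queue <- function(q, path) saveRDS(as.list(q), path)
load_queue <- function(path) list2env(readRDS(path), envir = new.env())

q <- new_queue()
enqueue(q, "https://Books.toscrape.com/")
enqueue(q, "https://books.toscrape.com/#top")  # same key as above, skipped
enqueue(q, "https://books.toscrape.com/catalogue/page-2.html")
first <- next_request(q)

state_file <- tempfile(fileext = ".rds")
save_queue(q, state_file)

# A "resumed" run: the seen keys survive, so the start URL is not re-added.
restored <- load_queue(state_file)
added_again <- enqueue(restored, "https://books.toscrape.com/")
```

Keeping the seen keys separate from the pending list is what lets a restored queue refuse URLs that were already fetched before the interruption.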
cr_persist()

cr_persist(dir) wires everything to a run directory:
- the request queue is saved to queue.rds during the run;
- the dataset is written to a persistent file (dataset.jsonl or dataset.duckdb);
- ctx$save_body() writes under kv/;
- a manifest (manifest.rds / manifest.json) records the start URLs, an options snapshot and run statistics.

crawl <- crawler("https://books.toscrape.com/") |>
  cr_persist("runs/books", dataset = "duckdb") |>
  cr_on_html(function(ctx) {
    ctx$push_data(list(url = ctx$request$url))
    ctx$enqueue_links(glob = "*/catalogue/*")
  }) |>
  cr_run()
data <- cr_collect(crawl)
cr_close(crawl) # release the DuckDB connection

If a run is interrupted, run the exact same pipeline again. Because the state already exists in runs/books, cr_persist() restores it and the crawl continues where it left off: already-fetched URLs are skipped.
# Same code as above: it resumes instead of starting over.
crawler("https://books.toscrape.com/") |>
  cr_persist("runs/books", dataset = "duckdb") |>
  cr_on_html(function(ctx) {
    ctx$push_data(list(url = ctx$request$url))
    ctx$enqueue_links(glob = "*/catalogue/*")
  }) |>
  cr_run()

For the DuckDB backend, call cr_collect() before cr_close(), because closing releases the connection.
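Once the connection has been released, the DuckDB file can also be opened directly for SQL with the DBI and duckdb packages. The sketch below makes two assumptions: that the file sits at runs/books/dataset.duckdb, as the file names above suggest, and nothing at all about the table name, which is why it only lists the tables. list_duckdb_tables() is a hypothetical helper.

```r
library(DBI)

# Hypothetical helper: list the tables in a DuckDB file without modifying it.
list_duckdb_tables <- function(db_path) {
  con <- dbConnect(duckdb::duckdb(), dbdir = db_path, read_only = TRUE)
  # Always release the connection, even if listing fails.
  on.exit(dbDisconnect(con, shutdown = TRUE), add = TRUE)
  dbListTables(con)
}

# Assumed location of the dataset inside the run directory.
db_path <- file.path("runs", "books", "dataset.duckdb")

if (file.exists(db_path)) {
  print(list_duckdb_tables(db_path))
}
```

With a table name in hand, dbGetQuery() on a connection opened the same way runs ordinary SQL against the crawl results.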