---
title: "Storage and resumable runs"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Storage and resumable runs}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>", eval = FALSE)
```

```{r setup}
library(crawlee)
```

[Crawlee](https://crawlee.dev) separates three kinds of storage: the **request
queue** (what to crawl), the **dataset** (structured results) and the
**key-value store** (binary blobs). crawlee mirrors that split and adds a
one-call setup for **reproducible, resumable** runs.

## The dataset

Handlers call `ctx$push_data()` to append records; `cr_collect()` returns them
as one tibble. By default the dataset lives in memory.

```{r}
result <- crawler("https://books.toscrape.com/") |>
  cr_on_html(function(ctx) {
    ctx$push_data(list(url = ctx$request$url))
  }) |>
  cr_run() |>
  cr_collect()
```

For larger or longer crawls, choose a **persistent backend** with
`cr_dataset()`:

* `"jsonl"` — append-only, schema-flexible, one JSON object per line;
* `"duckdb"` — appended to a DuckDB table, ready for SQL.

```{r}
crawler("https://books.toscrape.com/") |>
  cr_dataset(backend = "duckdb", path = "books.duckdb") |>
  cr_on_html(function(ctx) ctx$push_data(list(url = ctx$request$url))) |>
  cr_run()
```

Both persistent backends *resume* from an existing file: re-opening the same
path keeps the rows already there.

## The key-value store

Use the key-value store for raw, non-tabular content — PDFs, images, page
snapshots. `ctx$save_body()` writes the current response there, and
`cr_store()` sets the directory.

```{r}
crawler("https://example.com/report.pdf") |>
  cr_store("downloads") |>
  cr_on_pdf(function(ctx) {
    ctx$push_data(list(url = ctx$request$url, pages = length(ctx$pdf_text())))
    ctx$save_body(ext = "pdf") # -> downloads/<sanitised-url>.pdf
  }) |>
  cr_run()
```

## The request queue and reproducibility

The request queue deduplicates by a normalised key (see `cr_normalize_url()`),
so each URL is fetched at most once and a crawl is deterministic. It can also
persist its state — pending requests, seen keys, handled count — which is what
makes a crawl **resumable**.

## One-call setup: `cr_persist()`

`cr_persist(dir)` wires everything to a run directory:

* the queue is checkpointed to `queue.rds` during the run;
* the dataset uses a persistent backend (`dataset.jsonl` or `dataset.duckdb`);
* `ctx$save_body()` writes under `kv/`;
* a manifest (`manifest.rds` / `manifest.json`) records the start URLs, an
  options snapshot and run statistics.

```{r}
crawl <- crawler("https://books.toscrape.com/") |>
  cr_persist("runs/books", dataset = "duckdb") |>
  cr_on_html(function(ctx) {
    ctx$push_data(list(url = ctx$request$url))
    ctx$enqueue_links(glob = "*/catalogue/*")
  }) |>
  cr_run()

data <- cr_collect(crawl)
cr_close(crawl) # release the DuckDB connection
```

### Resuming

If a run is interrupted, **run the exact same pipeline again**. Because the
state already exists in `runs/books`, `cr_persist()` restores it and the crawl
continues where it left off — already-fetched URLs are skipped.

```{r}
# Same code as above: it resumes instead of starting over.
crawler("https://books.toscrape.com/") |>
  cr_persist("runs/books", dataset = "duckdb") |>
  cr_on_html(function(ctx) {
    ctx$push_data(list(url = ctx$request$url))
    ctx$enqueue_links(glob = "*/catalogue/*")
  }) |>
  cr_run()
```

> For the DuckDB backend, call `cr_collect()` before `cr_close()` — closing
> releases the connection.
