Crawling a website

This article follows the same path as the Crawlee fundamentals: start from a single page, then teach the crawler to follow links, control its scope, route different page types and discover URLs from a sitemap. The examples target books.toscrape.com, a public sandbox built for practising web scraping.

The model

A crawler owns three things:

a request queue — a deduplicating, resumable list of URLs to visit;
one or more handlers — functions run on each fetched page;
a dataset — the structured records your handlers produce.

You build a crawler with crawler() and configure it with cr_* verbs that compose through the native pipe (|>), then run it with cr_run().
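
To make the three parts concrete, here is a minimal base-R sketch of the same model run over an in-memory toy site rather than the network. It does not use the package: site, toy_crawl() and the handler signature are hypothetical names chosen for illustration only.

```r
# A toy "site": each page name maps to the links found on that page.
site <- list(
  home  = c("list", "about"),
  list  = c("book1", "book2", "home"),
  about = character(0),
  book1 = c("list"),
  book2 = c("list")
)

# Hypothetical crawl loop: queue + handler + dataset.
toy_crawl <- function(site, start, handler) {
  queue   <- start   # request queue: pages still to visit
  seen    <- start   # everything ever enqueued, used for deduplication
  dataset <- list()  # records produced by the handler

  while (length(queue) > 0) {
    page  <- queue[1]
    queue <- queue[-1]

    # Run the handler on the "fetched" page and store its record.
    dataset[[length(dataset) + 1]] <- handler(page, site[[page]])

    # Enqueue only links that have never been queued before.
    new_links <- setdiff(site[[page]], seen)
    queue <- c(queue, new_links)
    seen  <- c(seen, new_links)
  }

  # Bind the per-page records into one table.
  do.call(rbind, lapply(dataset, as.data.frame))
}

records <- toy_crawl(site, "home", function(page, links) {
  list(page = page, n_links = length(links))
})
records
```

The loop stops when the queue is empty, and each of the five pages appears exactly once in records even though several pages link back to list and home.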

Your first crawler

Fetch a single page and extract a couple of fields. The handler receives a context object (ctx) that exposes the parsed page (ctx$page), the current request (ctx$request) and the push_data() action, which adds a record to the dataset.

result <- crawler("https://books.toscrape.com/") |>
  cr_on_html(function(ctx) {
    ctx$push_data(list(
      url   = ctx$request$url,
      title = ctx$page |> rvest::html_element("title") |> rvest::html_text2()
    ))
  }) |>
  cr_run() |>
  cr_collect()

result

Following links

Real crawls discover new URLs as they go. ctx$enqueue_links() extracts links from the current page and adds them to the queue; the crawler keeps going until the queue drains. Because the queue deduplicates by a normalised URL, each page is visited at most once.

crawler("https://books.toscrape.com/") |>
  cr_on_html(function(ctx) {
    ctx$push_data(list(url = ctx$request$url))
    ctx$enqueue_links() # follow every same-domain link
  }) |>
  cr_options(max_requests = 50) |>
  cr_run()

By default, enqueue_links() follows only same-domain links, so a crawl cannot wander off across the whole web.
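
The effect of deduplicating by a normalised URL can be sketched in base R. The rules below (drop the fragment, lower-case the scheme and host, treat a bare host as host plus "/") are assumptions for illustration; the package's actual normalisation may differ, and normalise_url() and enqueue() are hypothetical helpers, not part of its API.

```r
# Hypothetical normaliser: the real package's rules may differ.
normalise_url <- function(url) {
  url <- sub("#.*$", "", url)  # drop the fragment
  # Split into "scheme://host" and the rest (path and query).
  parts <- regmatches(
    url,
    regexec("^([A-Za-z][A-Za-z0-9+.-]*://[^/?]*)(.*)$", url)
  )[[1]]
  if (length(parts) == 3) {
    # Lower-case scheme and host only; paths stay case-sensitive.
    url <- paste0(tolower(parts[2]), parts[3])
    # Treat a bare host as equivalent to host + "/".
    if (parts[3] == "") url <- paste0(url, "/")
  }
  url
}

# Hypothetical queue: a list with pending URLs and every key seen so far.
enqueue <- function(queue, urls) {
  keys <- vapply(urls, normalise_url, character(1), USE.NAMES = FALSE)
  new  <- !(keys %in% queue$seen) & !duplicated(keys)
  queue$pending <- c(queue$pending, keys[new])
  queue$seen    <- c(queue$seen, keys[new])
  queue
}

q <- list(pending = character(0), seen = character(0))
q <- enqueue(q, c(
  "https://books.toscrape.com/",
  "HTTPS://Books.ToScrape.com",
  "https://books.toscrape.com/#top",
  "https://books.toscrape.com/catalogue/page-2.html"
))
q$pending
```

Three of the four inputs collapse to the same key, so only two requests end up pending; enqueuing the same URLs again leaves the queue unchanged.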

Controlling scope

You rarely want every link. enqueue_links() accepts include and exclude patterns (glob is a shorthand for include) and a same_domain flag; the crawler itself enforces max_depth and max_requests, which are set through cr_options().

crawler("https://books.toscrape.com/") |>
  cr_options(max_depth = 3, max_requests = 200) |>
  cr_on_html(function(ctx) {
    ctx$push_data(list(url = ctx$request$url, depth = ctx$request$depth))
    ctx$enqueue_links(
      glob    = "*/catalogue/*", # only follow catalogue pages
      exclude = "*/category/*"
    )
  }) |>
  cr_run() |>
  cr_collect()
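
As a rough illustration of how include and exclude globs combine with a depth limit, the base-R sketch below filters a vector of candidate links. It assumes the globs behave like shell wildcards (translated with utils::glob2rx()) and that exclude wins over include; both are assumptions, and matches_any() and in_scope() are hypothetical helpers rather than package functions.

```r
# TRUE for each URL that matches at least one glob.
matches_any <- function(urls, globs) {
  hits <- lapply(globs, function(g) grepl(utils::glob2rx(g), urls))
  Reduce(`|`, hits, rep(FALSE, length(urls)))
}

# Keep links that are within max_depth, match an include glob (if any
# is given) and match no exclude glob. `depth` is the depth the links
# would have once enqueued.
in_scope <- function(urls, include = NULL, exclude = NULL,
                     depth = 0, max_depth = Inf) {
  keep <- rep(depth <= max_depth, length(urls))
  if (!is.null(include)) keep <- keep & matches_any(urls, include)
  if (!is.null(exclude)) keep <- keep & !matches_any(urls, exclude)
  urls[keep]
}

links <- c(
  "https://books.toscrape.com/catalogue/page-2.html",
  "https://books.toscrape.com/catalogue/category/books/travel_2/index.html",
  "https://books.toscrape.com/index.html"
)

kept <- in_scope(links,
                 include = "*/catalogue/*", exclude = "*/category/*",
                 depth = 1, max_depth = 3)
kept
```

Only the pagination link survives: the category page matches the exclude glob, and the site root does not match the include glob. Calling in_scope() with depth = 4 and max_depth = 3 returns an empty vector.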

Routing different page types

Most sites have a few kinds of page — listings vs. detail pages, say. Give a label when enqueuing and register a handler for that label. Listing pages enqueue detail pages; detail pages extract the data.

books <- crawler("https://books.toscrape.com/") |>
  # listing pages: enqueue book detail pages, labelled "book"
  cr_on_html(function(ctx) {
    ctx$enqueue_links(glob = "*/catalogue/*index.html", label = "book")
    ctx$enqueue_links(glob = "*/page-*.html") # pagination, default handler
  }) |>
  # detail pages
  cr_on_html(label = "book", function(ctx) {
    ctx$push_data(list(
      title = ctx$page |> rvest::html_element("h1") |> rvest::html_text2(),
      price = ctx$page |> rvest::html_element(".price_color") |> rvest::html_text2()
    ))
  }) |>
  cr_run() |>
  cr_collect()

books

A request’s label always wins over the content-kind default, so labelled routing and cr_on_html()/cr_on_pdf() defaults compose cleanly.
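
The precedence rule can be pictured as a two-step lookup: try the label first, then fall back to the handler for the content kind. The base-R sketch below is an assumption about the idea, not the package's internals; handlers, route() and the request fields kind and label are hypothetical.

```r
# Hypothetical handler registry: one table keyed by label,
# one keyed by content kind.
handlers <- list(
  label = list(
    book = function(req) paste("book handler:", req$url)
  ),
  kind = list(
    html = function(req) paste("default HTML handler:", req$url),
    pdf  = function(req) paste("default PDF handler:", req$url)
  )
)

# Pick the handler for a request: a registered label wins,
# otherwise use the default for the request's content kind.
route <- function(req, handlers) {
  if (!is.null(req$label) && !is.null(handlers$label[[req$label]])) {
    return(handlers$label[[req$label]])
  }
  handlers$kind[[req$kind]]
}

detail <- list(
  url   = "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
  kind  = "html",
  label = "book"
)
listing <- list(
  url  = "https://books.toscrape.com/catalogue/page-2.html",
  kind = "html"
)

route(detail, handlers)(detail)
route(listing, handlers)(listing)
```

Both requests are HTML, but only the unlabelled listing page reaches the default HTML handler. Falling back to the content-kind handler when a label has no registered handler is a choice made for this sketch.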

Crawling from a sitemap

When a site publishes a sitemap.xml, you can seed the queue directly from it instead of discovering links page by page — cr_from_sitemap() handles sitemap indexes and gzipped sitemaps, and can filter by glob or by <lastmod> date.

crawler() |>
  cr_from_sitemap("https://books.toscrape.com/sitemap.xml", label = "book") |>
  cr_on_html(label = "book", function(ctx) {
    ctx$push_data(list(url = ctx$request$url))
  }) |>
  cr_run() |>
  cr_collect()

The companion cr_from_rss() does the same for RSS and Atom feeds.
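
For readers curious what seeding from a sitemap involves, the sketch below parses a small inline sitemap with the xml2 package and filters entries by their <lastmod> date. It is an illustration of the idea under stated assumptions, not how cr_from_sitemap() is implemented; the example.com URLs and dates are invented sample data.

```r
library(xml2)

# Invented sample sitemap; the third entry has no <lastmod>.
sitemap_xml <- '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/a.html</loc><lastmod>2024-01-10</lastmod></url>
  <url><loc>https://example.com/b.html</loc><lastmod>2023-06-01</lastmod></url>
  <url><loc>https://example.com/c.html</loc></url>
</urlset>'

doc <- read_xml(sitemap_xml)
xml_ns_strip(doc)  # drop the default namespace so plain XPath works

entries <- xml_find_all(doc, "//url")
sitemap <- data.frame(
  loc     = xml_text(xml_find_first(entries, "loc")),
  lastmod = as.Date(xml_text(xml_find_first(entries, "lastmod")))
)

# Keep only entries modified on or after a cut-off date;
# entries without a <lastmod> are dropped here.
cutoff <- as.Date("2024-01-01")
recent <- sitemap[!is.na(sitemap$lastmod) & sitemap$lastmod >= cutoff, ]
recent$loc
```

Whether entries without a <lastmod> should be kept or dropped by a date filter is a design decision; this sketch drops them.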

Rendering JavaScript pages

If a page builds its content with JavaScript, the plain HTTP backend sees an empty shell. Switch to the headless-browser backend with cr_use_browser() (requires the package and a Chrome/Chromium install). Handlers are unchanged; you additionally get ctx$screenshot().

crawler("https://example.com") |>
  cr_use_browser(wait_selector = ".content") |>
  cr_on_html(function(ctx) {
    ctx$push_data(list(url = ctx$request$url))
    ctx$screenshot()
  }) |>
  cr_run()

Where next

Politeness & speed — robots.txt is respected by default; cr_options(delay = ) rate-limits, and cr_parallel() fetches concurrently.
Documents — cr_on_pdf() extracts text from PDFs; ctx$save_body() stores raw files in a key-value store.
Reproducible, resumable runs — cr_persist(dir) checkpoints the queue and persists the dataset, so an interrupted crawl continues where it left off.
RAG — cr_chunk(), cr_embed() and cr_export() turn crawled text into a retrieval-ready table.
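
As a taste of the chunking step, here is a base-R sketch of one common strategy: fixed-size word windows with overlap. The strategy actually used by cr_chunk() is not described here, so treat chunk_words() and its size and overlap parameters as hypothetical.

```r
# Split text into chunks of `size` words, each sharing `overlap`
# words with the previous chunk.
chunk_words <- function(text, size = 50, overlap = 10) {
  stopifnot(size > overlap, overlap >= 0)
  words <- strsplit(trimws(text), "\\s+")[[1]]
  words <- words[nzchar(words)]
  n <- length(words)
  if (n == 0) return(character(0))

  starts <- seq(1, n, by = size - overlap)
  # Drop trailing windows that would only repeat already-covered words.
  starts <- starts[starts == 1 | starts + overlap <= n]

  vapply(starts, function(s) {
    paste(words[s:min(s + size - 1, n)], collapse = " ")
  }, character(1))
}

text   <- paste(paste0("w", 1:12), collapse = " ")
chunks <- chunk_words(text, size = 5, overlap = 2)
chunks
```

With twelve words, a window of five and an overlap of two, the text is split into four chunks, and the last one holds the remaining three words.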
