
Crawling a website

library(crawlee)

This article follows the same path as the Crawlee fundamentals: start from a single page, then teach the crawler to follow links, control its scope, route different page types and discover URLs from a sitemap. The examples target books.toscrape.com, a public sandbox built for practising web scraping.

The model

A crawler owns three things:

You build a crawler with crawler() and configure it with cr_* verbs that compose through the native pipe (|>), then run it with cr_run().

Your first crawler

Fetch a single page and extract a couple of fields. The handler receives a context object (ctx) exposing the parsed page and the action push_data().

result <- crawler("https://books.toscrape.com/") |>
  cr_on_html(function(ctx) {
    ctx$push_data(list(
      url   = ctx$request$url,
      title = ctx$page |> rvest::html_element("title") |> rvest::html_text2()
    ))
  }) |>
  cr_run() |>
  cr_collect()

result
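Before running a crawl, it can help to try the selectors on a static snippet. The sketch below assumes that ctx$page behaves like an ordinary rvest/xml2 document, as the rvest calls in the handler suggest. The HTML string and the extract_fields() helper are made up for illustration and are not part of crawlee.

```r
library(rvest)

# Hypothetical helper mirroring the kind of extraction done inside a handler.
extract_fields <- function(page) {
  list(
    title   = page |> html_element("title") |> html_text2(),
    heading = page |> html_element("h1") |> html_text2()
  )
}

# Made-up HTML standing in for a downloaded page.
page <- read_html(
  "<html><head><title>Demo shop</title></head>
   <body><h1>All products</h1></body></html>"
)

fields <- extract_fields(page)
str(fields)
```

Once the selectors return what you expect, the same pipeline can be moved into the handler with ctx$page in place of page.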

Controlling scope

You rarely want every link. enqueue_links() takes glob (a shorthand for include), include/exclude patterns and a same_domain flag; the crawler itself enforces max_depth and max_requests.

crawler("https://books.toscrape.com/") |>
  cr_options(max_depth = 3, max_requests = 200) |>
  cr_on_html(function(ctx) {
    ctx$push_data(list(url = ctx$request$url, depth = ctx$request$depth))
    ctx$enqueue_links(
      glob    = "*/catalogue/*", # only follow catalogue pages
      exclude = "*/category/*"
    )
  }) |>
  cr_run() |>
  cr_collect()
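To get a feel for which URLs a pair of include and exclude globs would let through, the patterns can be tried on a few URLs in base R. This is only a sketch: url_in_scope() is a hypothetical helper, and it assumes globs behave like the wildcards handled by utils::glob2rx(), which may differ from the matching crawlee actually performs.

```r
# Hypothetical helper, not part of crawlee: TRUE when a single URL matches
# at least one include glob and no exclude glob.
url_in_scope <- function(url, include = NULL, exclude = NULL) {
  matches_any <- function(globs) {
    any(vapply(
      globs,
      function(g) grepl(utils::glob2rx(g), url),
      logical(1)
    ))
  }
  included <- is.null(include) || matches_any(include)
  excluded <- !is.null(exclude) && matches_any(exclude)
  included && !excluded
}

urls <- c(
  "https://books.toscrape.com/index.html",
  "https://books.toscrape.com/catalogue/page-2.html",
  "https://books.toscrape.com/catalogue/category/books/travel_2/index.html"
)

in_scope <- vapply(
  urls, url_in_scope, logical(1),
  include = "*/catalogue/*", exclude = "*/category/*"
)
unname(in_scope)
# → FALSE TRUE FALSE
```

Only the second URL is in scope: the first does not match the include glob, and the third matches the exclude glob.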

Routing different page types

Most sites have a few kinds of page, for example listing pages and detail pages. Give each request a label when enqueuing it and register a handler for that label. Listing pages enqueue detail pages; detail pages extract the data.

books <- crawler("https://books.toscrape.com/") |>
  # listing pages: enqueue book detail pages, labelled "book"
  cr_on_html(function(ctx) {
    ctx$enqueue_links(
      glob    = "*/catalogue/*index.html",
      exclude = "*/category/*", # category listings also end in index.html
      label   = "book"
    )
    ctx$enqueue_links(glob = "*/page-*.html") # pagination, default handler
  }) |>
  # detail pages
  cr_on_html(label = "book", function(ctx) {
    ctx$push_data(list(
      title = ctx$page |> rvest::html_element("h1") |> rvest::html_text2(),
      price = ctx$page |> rvest::html_element(".price_color") |> rvest::html_text2()
    ))
  }) |>
  cr_run() |>
  cr_collect()

books

A request’s label always wins over the content-kind default, so labelled routing and cr_on_html()/cr_on_pdf() defaults compose cleanly.
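The precedence rule can be pictured as a two-step lookup. The sketch below is not crawlee's implementation: pick_handler(), the by_label and by_kind lists, and the kind field on the request are all hypothetical names used only to illustrate "label first, content kind second".

```r
# Hypothetical dispatcher: a label with a registered handler takes
# precedence over the content-kind default.
pick_handler <- function(request, by_label, by_kind) {
  label <- request$label
  if (!is.null(label) && !is.null(by_label[[label]])) {
    return(by_label[[label]])
  }
  by_kind[[request$kind]]
}

by_label <- list(book = function(req) "book handler")
by_kind  <- list(
  html = function(req) "default html handler",
  pdf  = function(req) "default pdf handler"
)

labelled   <- list(url = "https://example.com/a", kind = "html", label = "book")
unlabelled <- list(url = "https://example.com/b", kind = "html", label = NULL)

pick_handler(labelled, by_label, by_kind)(labelled)
# → "book handler"
pick_handler(unlabelled, by_label, by_kind)(unlabelled)
# → "default html handler"
```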

Crawling from a sitemap

When a site publishes a sitemap.xml, you can seed the queue directly from it instead of discovering links page by page — cr_from_sitemap() handles sitemap indexes and gzipped sitemaps, and can filter by glob or by <lastmod> date.

crawler() |>
  cr_from_sitemap("https://books.toscrape.com/sitemap.xml", label = "book") |>
  cr_on_html(label = "book", function(ctx) {
    ctx$push_data(list(url = ctx$request$url))
  }) |>
  cr_run() |>
  cr_collect()

The companion cr_from_rss() does the same for RSS and Atom feeds.
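To see roughly what filtering a sitemap by glob and by <lastmod> date involves, the sketch below does it by hand with xml2 on a made-up sitemap string. It is not how cr_from_sitemap() is implemented; the URLs, dates and cutoff are invented, and the glob is assumed to behave like the wildcards handled by utils::glob2rx().

```r
library(xml2)

# Made-up sitemap content; a real one would be downloaded from the site.
sitemap_xml <- '<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/catalogue/a/index.html</loc><lastmod>2024-01-10</lastmod></url>
  <url><loc>https://example.com/catalogue/b/index.html</loc><lastmod>2024-06-01</lastmod></url>
  <url><loc>https://example.com/about.html</loc><lastmod>2024-07-15</lastmod></url>
</urlset>'

doc <- read_xml(sitemap_xml)
xml_ns_strip(doc) # drop the default namespace so plain XPath works

entries <- xml_find_all(doc, "//url")
sitemap <- data.frame(
  loc     = xml_text(xml_find_first(entries, "./loc")),
  lastmod = as.Date(xml_text(xml_find_first(entries, "./lastmod")))
)

# Keep catalogue pages modified on or after the (arbitrary) cutoff date.
keep <- grepl(utils::glob2rx("*/catalogue/*"), sitemap$loc) &
  sitemap$lastmod >= as.Date("2024-03-01")
seeds <- sitemap$loc[keep]
seeds
# → "https://example.com/catalogue/b/index.html"
```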

Rendering JavaScript pages

If a page builds its content with JavaScript, the plain HTTP backend sees an empty shell. Switch to the headless-browser backend with cr_use_browser() (requires the package and a Chrome/Chromium install). Handlers are unchanged; you additionally get ctx$screenshot().

crawler("https://example.com") |>
  cr_use_browser(wait_selector = ".content") |>
  cr_on_html(function(ctx) {
    ctx$push_data(list(url = ctx$request$url))
    ctx$screenshot()
  }) |>
  cr_run()

Where next
