The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.
This article follows the same path as the Crawlee fundamentals: start from a single
page, then teach the crawler to follow links,
control its scope, route different
page types and discover URLs from a sitemap. The
examples target books.toscrape.com, a
public sandbox built for practising web scraping.
A crawler owns three things:
You build a crawler with crawler() and configure it with
cr_* verbs that compose through the native pipe
(|>), then run it with cr_run().
Fetch a single page and extract a couple of fields. The handler
receives a context object (ctx) exposing the parsed page
and the action push_data().
Real crawls discover new URLs as they go.
ctx$enqueue_links() extracts links from the current page
and adds them to the queue; the crawler keeps going until the queue
drains. Because the queue deduplicates by a normalised URL, each page is
visited at most once.
crawler("https://books.toscrape.com/") |>
cr_on_html(function(ctx) {
ctx$push_data(list(url = ctx$request$url))
ctx$enqueue_links() # follow every same-domain link
}) |>
cr_options(max_requests = 50) |>
cr_run()enqueue_links() only follows same-domain links by
default, so a crawl cannot wander off across the whole web.
You rarely want every link. enqueue_links()
takes glob (a shorthand for include),
include/exclude patterns and a
same_domain flag; the crawler itself enforces
max_depth and max_requests.
crawler("https://books.toscrape.com/") |>
cr_options(max_depth = 3, max_requests = 200) |>
cr_on_html(function(ctx) {
ctx$push_data(list(url = ctx$request$url, depth = ctx$request$depth))
ctx$enqueue_links(
glob = "*/catalogue/*", # only follow catalogue pages
exclude = "*/category/*"
)
}) |>
cr_run() |>
cr_collect()Most sites have a few kinds of page — listings vs. detail pages, say.
Give a label when enqueuing and register a handler for that
label. Listing pages enqueue detail pages; detail pages extract the
data.
books <- crawler("https://books.toscrape.com/") |>
# listing pages: enqueue book detail pages, labelled "book"
cr_on_html(function(ctx) {
ctx$enqueue_links(glob = "*/catalogue/*index.html", label = "book")
ctx$enqueue_links(glob = "*/page-*.html") # pagination, default handler
}) |>
# detail pages
cr_on_html(label = "book", function(ctx) {
ctx$push_data(list(
title = ctx$page |> rvest::html_element("h1") |> rvest::html_text2(),
price = ctx$page |> rvest::html_element(".price_color") |> rvest::html_text2()
))
}) |>
cr_run() |>
cr_collect()
booksA request’s label always wins over the content-kind
default, so labelled routing and
cr_on_html()/cr_on_pdf() defaults compose
cleanly.
When a site publishes a sitemap.xml, you can seed the
queue directly from it instead of discovering links page by page —
cr_from_sitemap() handles sitemap indexes and gzipped
sitemaps, and can filter by glob or by <lastmod>
date.
crawler() |>
cr_from_sitemap("https://books.toscrape.com/sitemap.xml", label = "book") |>
cr_on_html(label = "book", function(ctx) {
ctx$push_data(list(url = ctx$request$url))
}) |>
cr_run() |>
cr_collect()The companion cr_from_rss() does the same for RSS and
Atom feeds.
If a page builds its content with JavaScript, the plain HTTP backend
sees an empty shell. Switch to the headless-browser backend with
cr_use_browser() (requires the package and a
Chrome/Chromium install). Handlers are unchanged; you additionally get
ctx$screenshot().
robots.txt is
respected by default; cr_options(delay = ) rate-limits, and
cr_parallel() fetches concurrently.cr_on_pdf() extracts text
from PDFs; ctx$save_body() stores raw files in a key-value
store.cr_persist(dir) checkpoints the queue and persists the
dataset, so an interrupted crawl continues where it left off.cr_chunk(),
cr_embed() and cr_export() turn crawled text
into a retrieval-ready table.These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.