
crawlee

Lifecycle: experimental | R-CMD-check | pkgdown | Project Status: Active | License: MIT | R >= 4.1

A tidy R interface for reproducible web crawling — inspired by the architecture of Crawlee, implemented in pure R.

crawlee brings the unified-crawler idea to R: a deduplicating, resumable request queue, content-type aware handlers, structured storage and rich console logging via cli. It can crawl HTML pages, sitemaps, RSS and Atom feeds and PDF documents — with reproducibility as a first-class concern.
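
As a rough illustration of what "content-type aware handlers" can mean, the base R sketch below picks a handler from a response's Content-Type header. The `handlers` registry and the `dispatch()` function are hypothetical names invented for this example; they are not part of crawlee's API, and the package may route responses differently.

```r
# Hypothetical handler registry keyed by media type (not crawlee's API).
handlers <- list(
  "text/html"       = function(body) paste("HTML handler got", nchar(body), "characters"),
  "application/pdf" = function(body) paste("PDF handler got", nchar(body), "characters")
)

dispatch <- function(content_type, body, handlers) {
  # Drop parameters such as "; charset=UTF-8" and normalise case.
  media_type <- tolower(trimws(strsplit(content_type, ";", fixed = TRUE)[[1]][1]))
  handler <- handlers[[media_type]]
  if (is.null(handler)) return(NULL)  # no handler registered: skip the response
  handler(body)
}

dispatch("text/html; charset=UTF-8", "<h1>Hi</h1>", handlers)
dispatch("image/png", "", handlers)  # NULL: nothing registered for images
```

Keying the registry on the bare media type keeps charset and boundary parameters from splitting one kind of content across several handlers.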

It is built entirely on the R web-scraping ecosystem (httr2, rvest, xml2, chromote) — no Node.js runtime required.

How it works

A crawl is a loop: requests flow through a deduplicating queue to a fetch engine; each response is dispatched to a handler that extracts data (push_data()) and discovers more links (enqueue_links()), which flow back into the queue until it drains.
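
The loop described above can be sketched in a few lines of base R. This is a minimal illustration under stated assumptions, not crawlee's implementation: `crawl_sketch()` and the in-memory `site` are hypothetical, the "fetch engine" is a plain lookup instead of an HTTP request, and deduplication is a simple vector of URLs already seen.

```r
# Toy "site": each URL maps to the links found on that page (made-up data).
site <- list(
  "https://example.com/"       = c("https://example.com/blog/a",
                                   "https://example.com/blog/b"),
  "https://example.com/blog/a" = c("https://example.com/",
                                   "https://example.com/blog/b"),
  "https://example.com/blog/b" = character(0)
)

crawl_sketch <- function(start, fetch, max_depth = 2) {
  queue   <- list(list(url = start, depth = 0))  # pending requests (FIFO)
  seen    <- start                               # every URL ever enqueued
  dataset <- list()                              # rows collected by the handler

  while (length(queue) > 0) {
    request <- queue[[1]]
    queue   <- queue[-1]

    links <- fetch(request$url)                  # stand-in for the fetch engine

    # Handler, step 1: extract data (the role push_data() plays).
    dataset[[length(dataset) + 1]] <- data.frame(
      url = request$url, depth = request$depth, n_links = length(links)
    )

    # Handler, step 2: discover links (the role enqueue_links() plays).
    if (request$depth < max_depth) {
      for (link in setdiff(links, seen)) {       # skip anything already queued
        seen <- c(seen, link)
        queue[[length(queue) + 1]] <- list(url = link, depth = request$depth + 1)
      }
    }
  }
  do.call(rbind, dataset)                        # queue drained: return the rows
}

result <- crawl_sketch("https://example.com/", function(url) site[[url]])
result
```

Although the pages link to each other in a cycle, each URL is visited once, because a link is only enqueued the first time it is seen.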

[Figure: crawlee request lifecycle]

Architecture

[Figure: crawlee architecture]

Installation

# install.packages("pak")
pak::pak("StrategicProjects/crawlee")

Usage

library(crawlee)

resultado <- crawler("https://example.com") |>
  cr_options(delay = 0.5, max_depth = 2, respect_robots = TRUE) |>
  cr_use_http() |>
  cr_on_html(function(ctx) {
    ctx$push_data(list(
      url    = ctx$request$url,
      titulo = ctx$page |> rvest::html_element("h1") |> rvest::html_text2()
    ))
    ctx$enqueue_links(glob = "*/blog/*")
  }) |>
  cr_run() |>
  cr_collect()

resultado
#> # A tibble: 1 × 2
#>   url                 titulo
#>   <chr>               <chr>
#> 1 https://example.com Example Domain
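
The `glob = "*/blog/*"` argument in the example restricts which discovered links are enqueued. How crawlee matches globs internally is not stated here; assuming ordinary shell-style wildcard semantics, base R's `utils::glob2rx()` gives a feel for which URLs such a pattern would select. The URLs below are made up for illustration.

```r
# Hypothetical links discovered on a page.
links <- c(
  "https://example.com/blog/first-post",
  "https://example.com/about",
  "https://example.com/blog/2024/second-post"
)

# utils::glob2rx() converts a wildcard pattern into a regular expression.
pattern <- utils::glob2rx("*/blog/*")

# Keep only the links whose full URL matches the wildcard pattern.
keep <- links[grepl(pattern, links)]
keep
```

Under this assumption the two `/blog/` URLs are kept and `/about` is dropped, so the crawl stays inside the blog section.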

Design principles

Roadmap

Milestone | Scope | Status
M1 | Core: queue, HTTP, HTML handlers, dataset, cli logs |
M2 | Sitemap & RSS discovery, robots.txt enforcement |
M3 | PDF / document handlers (pdftools) |
M4 | Headless browser backend (chromote) |
M5 | RAG helpers (chunking, embeddings, export) |
M6 | Persistent & resumable storage (jsonl/duckdb, cr_persist()) |
M7 | Parallel fetching (cr_parallel()) |
M8 | Autoscaling (cr_autoscale()) & streaming pool (cr_stream()) |
M9 | Adaptive streaming + per-host pacing |

License

MIT © crawlee authors.
