First release. A tidy, native-R, Crawlee-inspired toolkit for reproducible web crawling.
cr_stream(adaptive = TRUE, min, max) adapts the
streaming pool’s in-flight target at run time (AIMD on back-pressure),
like cr_autoscale() but for the continuous scheduler.

Per-host spacing (delay / robots.txt Crawl-delay):
a host is not hit again until its interval has elapsed, while different
hosts keep running in parallel.

cr_autoscale(min, max) adapts the parallel batch
concurrency at run time (Crawlee autoscaled-pool style):
additive-increase on clean batches, multiplicative-decrease on
back-pressure (a transport failure or HTTP 429/500/502/503/504), clamped
to [min, max].

cr_stream(concurrency) adds a continuous-pool scheduler
(via httr2::req_perform_promise()): it keeps
concurrency requests in flight at all times, dispatching
and refilling as each finishes, which avoids the batch engine’s “wait for
the slowest” stall.

cr_parallel(concurrency) enables concurrent fetching
for the HTTP backend (Crawlee’s autoscaled-pool equivalent): the queue
is drained in batches whose network I/O runs concurrently via
httr2::req_perform_parallel(), while handlers still run
sequentially in R (no shared-state hazard). robots.txt,
retries, depth/request limits and queue checkpointing all still apply;
delay/Crawl-delay are applied between
batches.

Shared dispatch/error steps are used by both the
sequential and parallel loops.

cr_persist() ties a crawl to a run directory: the
request queue is checkpointed (queue.rds) during the run
and restored on the next run, so a crawl resumes where
it left off without re-fetching seen URLs.

Dataset backends: cr_dataset(backend = "jsonl") (append-only,
schema-flexible) and "duckdb" (SQL-ready). The
RequestQueue gained
save()/restore()/set_path().

A manifest (manifest.rds /
manifest.json) records the start URLs, options snapshot and
run stats.

cr_close() releases the browser session and DuckDB
connection.

cr_chunk() splits text (a character vector or a
data-frame column) into overlapping chunks, by character or word,
carrying metadata per chunk.

cr_embed() attaches an embedding
list-column via a user-supplied, provider-agnostic embedding function,
applied in batches. crawlee never calls an external service itself.

cr_export() writes chunks (and embeddings) to Parquet,
JSONL, CSV or DuckDB for retrieval.

cr_use_browser() renders JavaScript-heavy pages with a
headless Chrome/Chromium, with wait and
wait_selector controls. Handlers are unchanged
(ctx$page, enqueue_links()); the context gains
ctx$screenshot(), saved to the KeyValueStore.

Both backends yield the same fetched object, so handlers behave identically regardless
of HTTP vs browser.

Responses are classified (html, pdf, other) and routed to
the matching default handler; explicit request labels still take
precedence.

cr_on_pdf() registers a PDF handler. Its context adds
pdf_text() (per-page text),
body_raw()/body_string() and
save_body().

KeyValueStore plus cr_store() and
ctx$save_body(): persist raw responses (PDFs, images,
snapshots) on disk alongside the structured dataset.

cr_from_sitemap() enqueues URLs from a
sitemap.xml, recursing into sitemap indexes, transparently
handling gzipped sitemaps, with glob filters and a since
filter on <lastmod> for incremental crawls.

cr_from_rss() enqueues items from RSS and Atom feeds,
carrying item title and date into the request’s
user_data.

robots.txt is now enforced when
respect_robots = TRUE (the default): a native
parser/matcher (User-agent grouping, */$
patterns, longest-match with Allow override, Crawl-delay), cached per
host. Disallowed URLs are skipped and reported; Crawl-delay
is honoured.

crawler() builds a stateful, pipe-friendly
crawler.

RequestQueue: deduplicating (normalised
unique_key), FIFO, resumable request queue with retry
rescheduling.

cr_options() configures concurrency, depth, delay,
retries, user agent and log verbosity.

cr_use_http() provides the HTTP fetch backend (httr2);
cr_use_browser() is reserved.

cr_on_html() registers content handlers; handler
context exposes push_data() and
enqueue_links() (with glob/include/exclude and same-domain
filtering).

Dataset append-only store; cr_run() drives
the crawl and cr_collect() returns a tibble.

… cli.
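
The cr_autoscale() entry describes additive-increase, multiplicative-decrease (AIMD) control of batch concurrency. As a rough illustration only, the base-R sketch below applies one AIMD update per finished batch. The function name aimd_step(), the arguments lo and hi (standing in for min and max), the +1 step and the halving factor are all assumptions made for this example; crawlee's own implementation and tuning may differ.

```r
# One AIMD update of a concurrency target (illustrative; not crawlee's code).
# `statuses` has one entry per request in the finished batch:
# an HTTP status code, or NA for a transport failure.
aimd_step <- function(current, statuses, lo = 1L, hi = 16L,
                      increase = 1L, decrease = 0.5) {
  # Status codes the changelog lists as back-pressure signals.
  backpressure_codes <- c(429L, 500L, 502L, 503L, 504L)
  backpressure <- any(is.na(statuses) | statuses %in% backpressure_codes)

  target <- if (backpressure) {
    floor(current * decrease)  # multiplicative decrease (assumed factor)
  } else {
    current + increase         # additive increase (assumed step)
  }

  # Clamp to [lo, hi].
  as.integer(min(max(target, lo), hi))
}

# Usage: a clean batch, then a 503, then a transport failure.
target <- 4L
for (batch in list(c(200L, 200L), c(200L, 503L), c(200L, NA))) {
  target <- aimd_step(target, batch, lo = 1L, hi = 8L)
}
target  # 4 -> 5 -> 2 -> 1
```

Growing slowly and backing off quickly is the usual reason to pick AIMD: a single overloaded batch cuts the target sharply, while recovery takes several clean batches.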
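
The cr_chunk() entry mentions splitting text into overlapping chunks by character. A minimal base-R sketch of that idea follows. The helper chunk_chars(), its size and overlap arguments, and the output columns (chunk_id, start, end, text) are hypothetical names chosen for the example and are not taken from the package.

```r
# Split one string into overlapping character chunks (illustrative sketch).
# Consecutive chunks share `overlap` characters; the last chunk may be shorter.
chunk_chars <- function(text, size = 200L, overlap = 50L) {
  stopifnot(is.character(text), length(text) == 1L,
            size >= 1L, overlap >= 0L, overlap < size)
  n <- nchar(text)
  if (n == 0L) {
    return(data.frame(chunk_id = integer(0), start = integer(0),
                      end = integer(0), text = character(0),
                      stringsAsFactors = FALSE))
  }
  step <- size - overlap
  # A new chunk is only needed while the previous one has not reached the end.
  last_start <- max(1L, n - overlap)
  starts <- seq.int(1L, last_start, by = step)
  ends <- pmin(starts + size - 1L, n)
  data.frame(chunk_id = seq_along(starts), start = starts, end = ends,
             text = substring(text, starts, ends),
             stringsAsFactors = FALSE)
}

# Usage: chunks of 4 characters that overlap by 2.
chunk_chars("abcdefghij", size = 4, overlap = 2)$text
# "abcd" "cdef" "efgh" "ghij"
```

Keeping the start and end offsets next to each chunk is one simple way to carry per-chunk metadata back to the source text.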
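
The robots.txt entry mentions */$ patterns and longest-match with Allow override. The base-R sketch below shows one way such matching could work for a single, already-selected User-agent group. The helpers robots_pattern_to_regex() and robots_allowed(), and the rules data frame with type and pattern columns, are assumptions for illustration; the package's native parser is not shown in this changelog and may behave differently in edge cases.

```r
# Translate one robots.txt path pattern into a PCRE regex.
# "*" matches any run of characters; a trailing "$" anchors the end of the path.
robots_pattern_to_regex <- function(pattern) {
  anchored <- endsWith(pattern, "$")
  if (anchored) pattern <- substr(pattern, 1L, nchar(pattern) - 1L)
  chars <- strsplit(pattern, "", fixed = TRUE)[[1]]
  pieces <- vapply(
    chars,
    function(ch) if (ch == "*") ".*" else paste0("\\Q", ch, "\\E"),
    character(1)
  )
  paste0("^", paste(pieces, collapse = ""), if (anchored) "$" else "")
}

# Decide whether `path` may be fetched under `rules`
# (a data frame with columns `type` = "allow"/"disallow" and `pattern`).
robots_allowed <- function(path, rules) {
  rules <- rules[nzchar(rules$pattern), , drop = FALSE]  # empty pattern = no rule
  hit <- vapply(
    rules$pattern,
    function(p) grepl(robots_pattern_to_regex(p), path, perl = TRUE),
    logical(1)
  )
  matched <- rules[hit, , drop = FALSE]
  if (nrow(matched) == 0L) return(TRUE)   # nothing matched: allowed
  len <- nchar(matched$pattern)
  best <- matched[len == max(len), , drop = FALSE]
  any(best$type == "allow")               # longest match wins; Allow wins ties
}

# Usage with a small hypothetical rule set.
rules <- data.frame(
  type = c("disallow", "allow", "disallow"),
  pattern = c("/private/", "/private/public/", "/*.pdf$"),
  stringsAsFactors = FALSE
)
robots_allowed("/private/public/a", rules)  # TRUE: the longer Allow rule wins
robots_allowed("/docs/a.pdf", rules)        # FALSE: matches "/*.pdf$"
```

User-agent grouping, Crawl-delay and per-host caching are left out of the sketch to keep the matching rule itself visible.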