
Type: Package
Title: Parse and Test Robots Exclusion Protocol Files and Rules
Version: 0.2.5
Date: 2023-02-07
Author: Bob Rudis (bob@rud.is) [aut, cre], SEOmoz, Inc [aut]
Maintainer: Bob Rudis <bob@rud.is>
Description: The 'Robots Exclusion Protocol' <https://www.robotstxt.org/orig.html> documents a set of standards for allowing or excluding robot/spider crawling of different areas of site content. Tools are provided which wrap the 'rep-cpp' <https://github.com/seomoz/rep-cpp> C++ library for processing these 'robots.txt' files.
NeedsCompilation: yes
URL: https://github.com/hrbrmstr/spiderbar
BugReports: https://github.com/hrbrmstr/spiderbar/issues
License: MIT + file LICENSE
Suggests: covr, robotstxt, tinytest
Depends: R (≥ 3.2.0)
Encoding: UTF-8
Imports: Rcpp
RoxygenNote: 7.2.3
LinkingTo: Rcpp
Packaged: 2023-02-09 16:08:55 UTC; hrbrmstr
Repository: CRAN
Date/Publication: 2023-02-11 10:20:02 UTC

Test URL paths against a robxp robots.txt object

Description

Provide a character vector of URL paths, plus an optional user agent, and this function returns a logical vector indicating whether you have permission to fetch the content at each path.

Usage

can_fetch(obj, path = "/", user_agent = "*")

Arguments

obj

robxp object

path

path to test

user_agent

user agent to test

Value

logical vector indicating whether you have permission to fetch the content

Examples

gh <- paste0(readLines(system.file("extdata", "github-robots.txt",
             package="spiderbar")), collapse="\n")
gh_rt <- robxp(gh)

can_fetch(gh_rt, "/humans.txt", "*") # TRUE
can_fetch(gh_rt, "/login", "*") # FALSE
can_fetch(gh_rt, "/oembed", "CCBot") # FALSE

can_fetch(gh_rt, c("/humans.txt", "/login", "/oembed"))
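
As an added illustration (not one of the shipped examples), the logical result can be used to subset a set of candidate paths before crawling:

paths <- c("/humans.txt", "/login", "/oembed")
paths[can_fetch(gh_rt, paths, "*")]  # keep only the paths a generic agent may fetch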

Retrieve all agent crawl delay values in a robxp robots.txt object

Description

Retrieve all agent crawl delay values in a robxp robots.txt object

Usage

crawl_delays(obj)

Arguments

obj

robxp object

Value

data frame of agents and their crawl delays

Note

-1 will be returned for any listed agent without a crawl delay setting

Examples

gh <- paste0(readLines(system.file("extdata", "github-robots.txt",
             package="spiderbar")), collapse="\n")
gh_rt <- robxp(gh)
crawl_delays(gh_rt)

imdb <- paste0(readLines(system.file("extdata", "imdb-robots.txt",
               package="spiderbar")), collapse="\n")
imdb_rt <- robxp(imdb)
crawl_delays(imdb_rt)
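
As a further, hedged illustration (assuming the returned data frame has columns named 'agent' and 'crawl_delay'; check the actual output before relying on this), a polite crawler could honor the delay for its agent string:

cds <- crawl_delays(imdb_rt)
delay <- cds$crawl_delay[cds$agent == "*"]             # delay for the generic agent, if listed
if (length(delay) == 1 && delay > 0) Sys.sleep(delay)  # skip the -1 "no delay" sentinel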

Custom printer for 'robxp' objects

Description

Custom printer for 'robxp' objects

Usage

## S3 method for class 'robxp'
print(x, ...)

Arguments

x

object to print

...

unused
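
A minimal illustration (not part of the package documentation): printing a parsed object simply dispatches to this method.

rt <- robxp("User-agent: *\nDisallow: /private/")
print(rt)   # equivalent to typing rt at the console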


Parse a 'robots.txt' file & create a 'robxp' object

Description

This function takes in a single element character vector and parses it into a 'robxp' object.

Usage

robxp(x)

Arguments

x

either a single-element character vector containing a complete 'robots.txt' file, _or_ a character vector of length > 1 that will be concatenated into a single string, _or_ a 'connection' object that will be passed to readLines(), the result of which will be concatenated into a single string and parsed, after which the connection will be closed.

Value

a classed object holding an external pointer to parsed robots.txt data

Examples

imdb <- paste0(readLines(system.file("extdata", "imdb-robots.txt",
               package="spiderbar")), collapse="\n")
rt <- robxp(imdb)
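
Since a 'connection' object is also accepted (see the description of 'x' above), the same file can be parsed directly from a connection; this variant is an added illustration rather than a shipped example:

con <- file(system.file("extdata", "imdb-robots.txt", package = "spiderbar"))
rt2 <- robxp(con)   # readLines() is called on the connection, which is then closed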

Retrieve a character vector of sitemaps from a parsed robots.txt object

Description

Retrieve a character vector of sitemaps from a parsed robots.txt object

Usage

sitemaps(xp)

Arguments

xp

A robxp object

Value

character vector of all sitemaps found in the parsed robots.txt file

Examples

imdb <- paste0(readLines(system.file("extdata", "imdb-robots.txt",
               package="rep")), collapse="\n")
rt <- robxp(imdb)
sitemaps(rt)
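
As a small added note, the result is an ordinary character vector, so the usual vector operations apply:

sm <- sitemaps(rt)
length(sm)   # how many sitemap entries the file declares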

Parse and Test Robots Exclusion Protocol Files and Rules

Description

The 'Robots Exclusion Protocol' (https://www.robotstxt.org/orig.html) documents a set of standards for allowing or excluding robot/spider crawling of different areas of site content. Tools are provided which wrap the 'rep-cpp' (https://github.com/seomoz/rep-cpp) C++ library for processing these 'robots.txt' files.

Author(s)

Bob Rudis (bob@rud.is)
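
Tying the pieces together, a minimal workflow with the bundled example file might look like the following sketch (illustrative only):

library(spiderbar)

rt <- robxp(paste0(readLines(system.file("extdata", "imdb-robots.txt",
                   package = "spiderbar")), collapse = "\n"))

can_fetch(rt, "/", "*")   # may a generic agent fetch the site root?
crawl_delays(rt)          # per-agent crawl delays (-1 = none specified)
sitemaps(rt)              # sitemap URLs declared in the file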
