Repository Mirror for your Cloud Server and Webhosting

Title:

Convert Html into Text

Version:

2.2.2

Author:

Sangchul Park [aut, cre]

Maintainer:

Sangchul Park <mail@sangchul.com>

Description:

Convert a html document to plain texts by stripping off all html tags.

License:

GPL-2 | GPL-3 [expanded from: GPL (≥ 2)]

URL:

https://github.com/replicable/htm2txt

BugReports:

https://github.com/replicable/htm2txt/issues

Encoding:

UTF-8

RoxygenNote:

7.2.0

NeedsCompilation:

Packaged:

2022-06-12 09:43:51 UTC; mail

Repository:

CRAN

Date/Publication:

2022-06-12 15:50:09 UTC

Display simple plain texts in a web page at a certain URL

Description

Display simple plain texts in a web page at a certain URL

Usage

browse(URL, ...)

Arguments

URL

A character indicating the URL of a web page.

...

Other gettxt arguments.

Value

None (invisible NULL).

Examples

browse("https://www.wikipedia.org/")

Extract simple plain texts from a web page at a certain URL

Description

Extract simple plain texts from a web page at a certain URL

Usage

gettxt(URL, encoding = "UTF-8", ...)

Arguments

URL

A character indicating the URL of a web page.

encoding

Encoding method (e.g., "UTF-8", "latin1", "bytes", "unknown", etc.).

...

Other htm2txt arguments.

Value

A character containing plain texts converted from the htm document at the URL.

Examples

text = gettxt("https://www.wikipedia.org/")

Convert a html document to plain texts by stripping off all html tags

Description

Convert a html document to plain texts by stripping off all html tags

Usage

htm2txt(htm, list = "\n&#8226; ", pagebreak = "\n\n----------\n\n")

Arguments

htm

A character vector, containing a html document, to be converted into plain texts (other objects are coerced into character vectors).

list

A character that replaces "li" tags (referring to a numbering or bullet for lists). The default is a line change followed by a bullet character and a space.

pagebreak

A character that replaces "hr" tags (referring to a thematic change in the content or a page break).

Value

A character vector containing plain texts converted from the html document.

Examples

text = htm2txt("<html><body>html texts</body></html>")
text = htm2txt(c("Hello<p>World", "Goodbye<br>Friends"))
text = htm2txt("<p>Menu:</p><ul></li>Coffee</li><li>Tea</li></ul>", list = "\n- ")
text = htm2txt("Page 1<hr>Page 2", pagebreak = "\n\n[NEW PAGE]\n\n")