This vignette for the unpivotr package demonstrates unpivoting HTML tables of various kinds.
The HTML files are in the package directory at `system.file("extdata", c("rowspan.html", "colspan.html", "nested.html"), package = "unpivotr")`.
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
If a table has cells merged across rows or columns (or both), then `as_cells()` does not attempt to fill the cell contents across the rows or columns. This differs from other packages, e.g. rvest. However, if merged cells cause a table not to be rectangular, then `as_cells()` pads the missing cells with blanks.
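As a minimal sketch of this difference, the following parses a small table containing a `rowspan` with both packages. The inline string `rowspan_html` is a hypothetical stand-in for the package's example file, and the variable names are illustrative only.

```r
library(rvest)     # read_html(), html_table()
library(unpivotr)  # as_cells()

# Hypothetical inline table: the first header cell spans rows 1 and 2.
rowspan_html <- paste0(
  "<table>",
  "<tr><th rowspan=\"2\">Header (1:2, 1)</th><th>Header (1, 2)</th></tr>",
  "<tr><td>cell (2, 2)</td></tr>",
  "</table>"
)

doc <- read_html(rowspan_html)

# rvest returns one data frame per table, with merged cells filled
# across the positions they span.
filled <- html_table(doc)[[1]]

# unpivotr returns one row per grid position. The position covered by the
# merged cell is present, but its html value is NA rather than a copy.
cells <- as_cells(doc)[[1]]
cells
```

Keeping the covered position as `NA` makes it possible to tell a merged cell apart from two cells that happen to share the same contents.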
Header (1:2, 1) | Header (1, 2) |
---|---|
cell (2, 2) |
## [[1]]
## # A tibble: 1 × 2
## `Header (1:2, 1)` `Header (1, 2)`
## <chr> <chr>
## 1 Header (1:2, 1) cell (2, 2)
## [[1]]
## # A tibble: 4 × 4
## row col data_type html
## <int> <int> <chr> <chr>
## 1 1 1 html "<th rowspan=\"2\">Header (1:2, 1)</th>"
## 2 2 1 html <NA>
## 3 1 2 html "<th>Header (1, 2)</th>"
## 4 2 2 html "<td>cell (2, 2)</td>"
Header (1, 1:2) | |
---|---|
cell (2, 1) | cell (2, 2) |
## [[1]]
## # A tibble: 1 × 2
## `Header (1, 1:2)` `Header (1, 1:2)`
## <chr> <chr>
## 1 cell (2, 1) cell (2, 2)
## [[1]]
## # A tibble: 4 × 4
## row col data_type html
## <int> <int> <chr> <chr>
## 1 1 1 html "<th colspan=\"2\">Header (1, 1:2)</th>"
## 2 2 1 html "<td>cell (2, 1)</td>"
## 3 1 2 html <NA>
## 4 2 2 html "<td>cell (2, 2)</td>"
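Because the `html` column holds the raw markup of each cell, a plain-text column can be derived from it when only the text is needed. The sketch below assumes an inline copy of a `colspan` table (`colspan_html`) and a hypothetical helper, `cell_plain_text()`. Neither is part of unpivotr.

```r
library(rvest)     # read_html(), html_text()
library(unpivotr)  # as_cells()

# Hypothetical inline table: the header cell spans columns 1 and 2.
colspan_html <- paste0(
  "<table>",
  "<tr><th colspan=\"2\">Header (1, 1:2)</th></tr>",
  "<tr><td>cell (2, 1)</td><td>cell (2, 2)</td></tr>",
  "</table>"
)

cells <- as_cells(read_html(colspan_html))[[1]]

# Hypothetical helper: parse one cell's HTML and return its text,
# leaving positions covered by a merged cell as NA.
cell_plain_text <- function(x) {
  if (is.na(x)) return(NA_character_)
  html_text(read_html(x))
}

cells$text <- vapply(cells$html, cell_plain_text, character(1),
                     USE.NAMES = FALSE)
cells
```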
rowandcolspan <- system.file("extdata",
                             "row-and-colspan.html",
                             package = "unpivotr")
includeHTML(rowandcolspan)
Header (1:2, 1:2) | Header (2, 3) | |
---|---|---|
cell (3, 1) | cell (3, 2) | cell (3, 3) |
## [[1]]
## # A tibble: 1 × 5
## `Header (1:2, 1:2)` `Header (1:2, 1:2)` `Header (2, 3)` `` ``
## <chr> <chr> <chr> <chr> <chr>
## 1 Header (1:2, 1:2) Header (1:2, 1:2) cell (3, 1) cell (3, 2) cell (3, …
## [[1]]
## # A tibble: 10 × 4
## row col data_type html
## <int> <int> <chr> <chr>
## 1 1 1 html "<th colspan=\"2\" rowspan=\"2\">Header (1:2, 1:2)</th…
## 2 2 1 html <NA>
## 3 1 2 html <NA>
## 4 2 2 html <NA>
## 5 1 3 html "<th>Header (2, 3)</th>"
## 6 2 3 html "<td>cell (3, 1)</td>"
## 7 1 4 html <NA>
## 8 2 4 html "<td>cell (3, 2)</td>"
## 9 1 5 html <NA>
## 10 2 5 html "<td>cell (3, 3)</td>"
`as_cells()` never descends into cells. If there is a table inside a cell, then to parse that table, read the cell's HTML and call `as_cells()` (or `html_table()`) on it again.
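A minimal sketch of that two-step approach, assuming a hypothetical inline table (`nested_html`) in place of the package's example file:

```r
library(rvest)     # read_html()
library(unpivotr)  # as_cells()

# Hypothetical inline table: position (2, 2) of the outer table
# contains a complete inner table.
nested_html <- paste0(
  "<table>",
  "<tr><th>Header (1, 1)</th><th>Header (1, 2)</th></tr>",
  "<tr><td>cell (2, 1)</td>",
  "<td><table>",
  "<tr><th>Inner header 1</th><th>Inner header 2</th></tr>",
  "<tr><td>inner cell 1</td><td>inner cell 2</td></tr>",
  "</table></td></tr>",
  "</table>"
)

# Step 1: parse the outer table. The inner table stays unparsed, as raw
# HTML inside a single cell.
outer <- as_cells(read_html(nested_html))[[1]]
inner_html <- outer$html[outer$row == 2 & outer$col == 2]

# Step 2: parse that cell's HTML again to get the cells of the inner table.
inner <- as_cells(read_html(inner_html))[[1]]
inner
```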
Header (1, 1) | Header (1, 2) | ||||
---|---|---|---|---|---|
cell (2, 1) |
|
## [[1]]
## # A tibble: 3 × 6
## `Header (1, 1)` `Header (1, 2)` `` `` `` ``
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 cell (2, 1) "Header (2, 2)(1, 1)\n … Head… Head… cell… cell…
## 2 Header (2, 2)(1, 1) "Header (2, 2)(1, 2)" <NA> <NA> <NA> <NA>
## 3 cell (2, 2)(2, 1) "cell (2, 2)(2, 1)" <NA> <NA> <NA> <NA>
##
## [[2]]
## # A tibble: 1 × 2
## `Header (2, 2)(1, 1)` `Header (2, 2)(1, 2)`
## <chr> <chr>
## 1 cell (2, 2)(2, 1) cell (2, 2)(2, 1)
## # A tibble: 4 × 4
## row col data_type html
## <int> <int> <chr> <chr>
## 1 1 1 html "<th>Header (1, 1)</th>"
## 2 2 1 html "<td>cell (2, 1)</td>"
## 3 1 2 html "<th>Header (1, 2)</th>"
## 4 2 2 html "<td>\n <table>\n<tr>\n<th>Header (2, 2)(1, 1)…
# The html of the table inside a cell
cell <-
x %>%
dplyr::filter(row == 2, col == 2) %>%
.$html
cell
## [1] "<td>\n <table>\n<tr>\n<th>Header (2, 2)(1, 1)</th>\n <th>Header (2, 2)(1, 2)</th>\n </tr>\n<tr>\n<td>cell (2, 2)(2, 1)</td>\n <td>cell (2, 2)(2, 1)</td>\n </tr>\n</table>\n</td>"
## [[1]]
## # A tibble: 4 × 4
## row col data_type html
## <int> <int> <chr> <chr>
## 1 1 1 html <th>Header (2, 2)(1, 1)</th>
## 2 2 1 html <td>cell (2, 2)(2, 1)</td>
## 3 1 2 html <th>Header (2, 2)(1, 2)</th>
## 4 2 2 html <td>cell (2, 2)(2, 1)</td>
A motivation for using `unpivotr::as_cells()` is that it extracts more than just text: it can extract whatever part of the HTML you need.
Here, we extract URLs.
Scraping HTML. | ||
Sweet | as? | Yeah, right. |
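# Return the href of every link inside one cell's HTML (NA for empty cells)
cell_url <- function(x) {
  if (is.na(x)) return(NA_character_)
  x %>%
    read_html() %>%
    html_nodes("a") %>%
    html_attr("href")
}
# Return the text of every link inside one cell's HTML (NA for empty cells)
cell_text <- function(x) {
  if (is.na(x)) return(NA_character_)
  x %>%
    read_html() %>%
    html_nodes("a") %>%
    html_text()
}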
urls %>%
  read_html() %>%
  as_cells() %>%
  .[[1]] %>%
  mutate(text = purrr::map(html, cell_text),
         url = purrr::map(html, cell_url)) %>%
  tidyr::unnest(c(text, url))
## # A tibble: 8 × 6
## row col data_type html text url
## <int> <int> <chr> <chr> <chr> <chr>
## 1 1 1 html "<td colspan=\"2\">\n<a href=\"https://www.… Scra… http…
## 2 1 1 html "<td colspan=\"2\">\n<a href=\"https://www.… HTML. http…
## 3 2 1 html "<td><a href=\"https://cran.r-project.org/\… Sweet http…
## 4 1 2 html <NA> <NA> <NA>
## 5 2 2 html "<td><a href=\"https://cran.r-project.org/p… as? http…
## 6 1 3 html <NA> <NA> <NA>
## 7 2 3 html "<td>\n<a href=\"https://cran.r-project.org… Yeah, http…
## 8 2 3 html "<td>\n<a href=\"https://cran.r-project.org… righ… http…