Main
Importing the data
Spreadsheet cells are imported with the tidy_xlsx()
function. By default, this returns all the sheets in the spreadsheet, as elements in a list, wrapped in another list of two elements: data and formats. Since there is only one sheet in this file, the data list has length 1, but it is still a list for the sake of consistency. Hence the sheet is accessed with the list subsetter [[1]]
.
cells <- tidy_xlsx(path)$data[[1]]
Cell formatting isn’t required for this vignette, but if it were, it would be imported via tidy_xlsx(path)$formats
.
formatting <- tidy_xlsx(path)$formats
Importing one of the multiples
The small multiples each have exactly one ‘Fixed Price’ header, so begin by selecting one of those.
fixed_price <- filter(cells, character == "Fixed Price")[1, ]
From that single cell, selct the three rows of the column headers of the first small multiple. The split()
function below separates each row from one another, wrapping them together in a list. They are separated so that they can individually be joined to the data cells later. Another way to do this is to select each row one by one, assigning them to different variables.
col_headers <-
fixed_price %>%
offset_N(cells, n = 1) %>% # Offset up one row to "IF NWPL Rocky Mountains"
extend_E(cells, 3) %>% # Extend to the right edge of the table
extend_S(cells, 2) %>% # Extend down to the third row of the headers
filter(!is.na(content)) %>% # Remove blanks
select(row, col, value = character) %>% # Prepare for joining to data cells
split(.$row) # Separate the row of headers into list elements
col_headers
## $`14`
## # A tibble: 1 x 3
## row col value
## <int> <int> <chr>
## 1 14 17 IF NWPL Rocky Mountains
##
## $`15`
## # A tibble: 2 x 3
## row col value
## <int> <int> <chr>
## 1 15 17 Fixed Price
## 2 15 19 Basis
##
## $`16`
## # A tibble: 4 x 3
## row col value
## <int> <int> <chr>
## 1 16 17 BID
## 2 16 18 OFFER
## 3 16 19 BID
## 4 16 20 OFFER
Now select the data cells, starting from the ‘Fixed Price’ header again.
datacells <-
fixed_price %>%
offset_S(cells, n = 2) %>%
extend_E(cells, 4) %>%
extend_S(cells, # Extend down to a blank row
boundary = ~ is.na(content), # The formula detects blank cells
edge = TRUE) %>% # Require the whole row to be blank
filter(!is.na(content)) %>% # Remove remaining blanks
mutate(value = as.double(content)) %>%# Convert the values to double
select(row, col, value) # Prepare for joining to headers
print(datacells, n = Inf)
## # A tibble: 24 x 3
## row col value
## <int> <int> <dbl>
## 1 18 17 2.0600000
## 2 18 18 2.0800000
## 3 19 17 2.3950000
## 4 19 18 2.4150000
## 5 19 19 -0.5650000
## 6 19 20 -0.5450000
## 7 20 17 2.5940000
## 8 20 18 2.6140000
## 9 20 19 -0.4937500
## 10 20 20 -0.4737500
## 11 21 17 2.5812857
## 12 21 18 2.6012857
## 13 21 19 -0.5850000
## 14 21 20 -0.5650000
## 15 22 17 3.3560000
## 16 22 18 3.3760000
## 17 22 19 -0.2950000
## 18 22 20 -0.2750000
## 19 23 17 2.6340833
## 20 23 18 2.6540833
## 21 23 19 -0.5304167
## 22 23 20 -0.5104167
## 23 17 18 1.9100000
## 24 17 17 1.8900000
Finally, bind the data cells to the column headers (this is the real magic). For more examples of how the compass directions work (the NNW()
and N()
functions below), see the vignette called Compass Directions
.
datacells %>%
NNW(col_headers[[1]]) %>% # This header isn't in every column
NNW(col_headers[[2]]) %>% # Nor is this header
N(col_headers[[3]]) # But this one is
## # A tibble: 24 x 6
## row col value.data i.value i.value.1 value.header
## <int> <int> <chr> <chr> <dbl> <chr>
## 1 18 17 Fixed Price IF NWPL Rocky Mountains 2.06000 BID
## 2 18 18 Fixed Price IF NWPL Rocky Mountains 2.08000 OFFER
## 3 19 17 Fixed Price IF NWPL Rocky Mountains 2.39500 BID
## 4 19 18 Fixed Price IF NWPL Rocky Mountains 2.41500 OFFER
## 5 19 19 Basis IF NWPL Rocky Mountains -0.56500 BID
## 6 19 20 Basis IF NWPL Rocky Mountains -0.54500 OFFER
## 7 20 17 Fixed Price IF NWPL Rocky Mountains 2.59400 BID
## 8 20 18 Fixed Price IF NWPL Rocky Mountains 2.61400 OFFER
## 9 20 19 Basis IF NWPL Rocky Mountains -0.49375 BID
## 10 20 20 Basis IF NWPL Rocky Mountains -0.47375 OFFER
## # ... with 14 more rows
Importing every small multiple at once
The code above, for a single multiple, can easily be adapted to import every one of the small multiples. Here this is done using the purrr()
package to apply the code to each element of a list of ‘Fixed Price’ header cells.
Get all ten ‘Fixed Price’ headers, and separate each into its own list element. Since each cell has a unique combination of row
and col
, that combination can be used to separate the cells into list elements.
fixed_price <-
cells %>%
filter(character == "Fixed Price") %>%
split(paste(.$row, .$col))
Adapt the code for a single multiple a ‘tidy’ function to tidy a general small multiple, starting from the ‘Fixed Price’ header. Here this is done by substituting x
for fixed_price
, and wrapping the three sections of code in a function. Everything else is the same.
tidy <- function(x) {
col_headers <-
x %>%
offset_N(cells, n = 1) %>%
extend_E(cells, 3) %>%
extend_S(cells, 2) %>%
filter(!is.na(content)) %>%
select(row, col, value = character) %>%
split(.$row)
datacells <-
x %>%
offset_S(cells, n = 2) %>%
extend_E(cells, 4) %>%
extend_S(cells,
boundary = ~ is.na(content),
edge = TRUE) %>%
filter(!is.na(content)) %>%
mutate(value = as.double(content)) %>%
select(row, col, value)
datacells %>%
NNW(col_headers[[1]]) %>%
NNW(col_headers[[2]]) %>%
N(col_headers[[3]])
}
Finally, map the ‘tidy’ function to each ‘Fixed Price’ header, and bind the results. into one data frame. The purrr()
package is used here, but this could also be done with the apply()
family of functions.
map_df(fixed_price, tidy) %>%
arrange(col, row) # See that, from row 39, the region changes, as it ought.
## # A tibble: 240 x 6
## row col value.data i.value i.value.1 value.header
## <int> <int> <chr> <chr> <dbl> <chr>
## 1 28 7 Fixed Price IF CIG Rocky Mountains 1.940000 BID
## 2 29 7 Fixed Price IF CIG Rocky Mountains 1.960000 BID
## 3 30 7 Fixed Price IF CIG Rocky Mountains 2.345000 BID
## 4 31 7 Fixed Price IF CIG Rocky Mountains 2.547750 BID
## 5 32 7 Fixed Price IF CIG Rocky Mountains 2.471286 BID
## 6 33 7 Fixed Price IF CIG Rocky Mountains 3.311000 BID
## 7 34 7 Fixed Price IF CIG Rocky Mountains 2.550750 BID
## 8 39 7 Fixed Price AECO / NIT 2.376377 BID
## 9 40 7 Fixed Price AECO / NIT 2.397635 BID
## 10 41 7 Fixed Price AECO / NIT 2.551891 BID
## # ... with 230 more rows
Joining the row headers
So far, only the column headers have been joined, but there are also row headers on the left-hand side of the spreadsheet. The following code incorporates these into the final dataset.
row_headers <-
cells %>%
filter(character == "Cash") %>%
split(paste(.$row, .$col))
row_headers <-
map_df(row_headers,
~ .x %>%
extend_S(cells, boundary = ~ is.na(content)) %>%
extend_E(cells, boundary = ~ is.na(content), edge = TRUE) %>%
filter(!is.na(content)) %>%
# This concatenates the "Dec-20 to Mar-20" cells into one column.
# First it converts Excel dates, via R dates, into text.
mutate(character = ifelse(!is.na(character),
character,
format(as.POSIXct(as.integer(content) * (60*60*24),
origin="1899-12-30",
tz="GMT"), "%b-%C"))) %>%
# Then it concatentates them by row.
select(row, col, character) %>%
spread(col, character, fill = "") %>%
mutate(col = 1, value = str_trim(paste(`2`, `3`, `4`))) %>%
select(row, col, value))
Since the single column of row headers applies to every row of every small multiple (unlike the column headers), a global row_headers
variable can be joined to each small multiple by using a simple W() join. This is incorporated into the definition of tidy()
below (see the bottom line).
tidy <- function(x) {
col_headers <-
x %>%
offset_N(cells, n = 1) %>%
extend_E(cells, 3) %>%
extend_S(cells, 2) %>%
filter(!is.na(content)) %>%
select(row, col, value = character) %>%
split(.$row)
datacells <-
x %>%
offset_S(cells, n = 2) %>%
extend_E(cells, 4) %>%
extend_S(cells,
boundary = ~ is.na(content),
edge = TRUE) %>%
filter(!is.na(content)) %>%
mutate(value = as.double(content)) %>%
select(row, col, value)
datacells %>%
NNW(col_headers[[1]]) %>%
NNW(col_headers[[2]]) %>%
N(col_headers[[3]]) %>%
W(row_headers) # This is the only new line
}
map_df(fixed_price, tidy) %>%
arrange(col, row) # See that, from row 39, the context loops, as it ought.
## # A tibble: 240 x 7
## row col value.data i.value i.value.1 value.header
## <int> <int> <chr> <chr> <dbl> <chr>
## 1 28 7 Fixed Price IF CIG Rocky Mountains 1.940000 BID
## 2 29 7 Fixed Price IF CIG Rocky Mountains 1.960000 BID
## 3 30 7 Fixed Price IF CIG Rocky Mountains 2.345000 BID
## 4 31 7 Fixed Price IF CIG Rocky Mountains 2.547750 BID
## 5 32 7 Fixed Price IF CIG Rocky Mountains 2.471286 BID
## 6 33 7 Fixed Price IF CIG Rocky Mountains 3.311000 BID
## 7 34 7 Fixed Price IF CIG Rocky Mountains 2.550750 BID
## 8 39 7 Fixed Price AECO / NIT 2.376377 BID
## 9 40 7 Fixed Price AECO / NIT 2.397635 BID
## 10 41 7 Fixed Price AECO / NIT 2.551891 BID
## # ... with 230 more rows, and 1 more variables: value <chr>