igapfill: A console-based interface for the R package gapfill

The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.

Inder Tecuapetla

2025-05-09

The functions and methods of igapfill provide an intuitive, user-friendly, console-based interface that allows us to fill missing values of time series of satellite images (TSSI); the main engine for filling these missing values is the spatio-temporal approach implemented in the R package gapfill.

Throughout this note, we will employ the Garnica dataset, also included in igapfill, to showcase the use of this package.

Garnica dataset

This dataset is comprised of two .tif files which store a TSSI recorded over the National Park Cerro de Garnica located in Michoacan, Mexico. The file garnica_250m_16_days_NDVI.tif contains 572 layers of the Normalized Differenced Vegetation Index (NDVI). The accompanying file garnica_250m_16_days_pixel_reliability.tif contains information about the quality of each pixel of the 572 NDVI layers. Both datasets are spatial subsets extracted from larger images belonging to the MOD13Q1 NASA’s product. Due to the temporal resolution of this product -16 days- the 572 layers correspond to a time series of images starting on the 49-th Julian date of 2000 and ending on the 353-rd Julian date of 2024.

We can access to this dataset as follows:

ndvi_path <- system.file("extdata", "garnica_250m_16_days_NDVI.tif", 
                         package = "igapfill")
reliability_path <- system.file("extdata", 
                                "garnica_250m_16_days_pixel_reliability.tif", 
                                package = "igapfill")

spatNDVI <- rast(ndvi_path)
spatRELIABILITY <- rast(reliability_path)

Although by definition the NDVI takes values on the interval (-1,1), NASA uses 1e4 as an scale factor and distributes the NDVI MOD13Q1 product through files that when read in the R environment, the stored information has the INT4S datatype. Thus spatNDVI has values ranging from -1e4 to 1e4.

mvSieve: A sieve with the amount of missing values in TSSI

How much of the information contained in the Garnica NDVI TSSI can be used with high reliability in a scientific study?

mvSieve counts the amount of missing values in a TSSI and hence it gives us a fair estimate of the overall quality of this type of datasets. Now, prior to use mvSieve, let’s see how to use the pixel realiability product to yield a highly reliable NDVI product.

For scientific purposes, it is mandatory to filter out those pixels with low reliability from a TSSSI dataset. Quality layers, such as those provided in spatRELIABILITY, allow us to pursuit this task.

Specifically, for the Garnica dataset, the quality flags are 0 (Good data), 1 (Marginal data), 2 (Snow/clouds) and 3 (No data). It is known that a reliable scientific dataset based on TSSI can be built upon pixels with Good and Marginal data quality flags. In the Appendix, there is code to generate the product garnica_250m_16_days_NDVI_QA that results from filtering out ( masking ) those NDVI pixels whose reliability is either equal to Snow/clouds or No data.¹

The function mvSieve() can now be applied to the NDVI_QA product. Observe that mvSieve’s argument dirs is equal to a character vector pointing to the sub-directories containing the TSSI files. The remaining arguments are straightforward: filesPerDir=23, startPeriod=2001, endPeriod=2024 and colNames is a character vector defined below:²

garnicaNDVIQAdir <- paste0( getwd(), "/garnica/250m_16_days_NDVI_QA" )
dir.create( path = garnicaNDVIQAdir, recursive = TRUE )

DIRS <- list.dirs(path = garnicaNDVIQAdir)[-1]

COLNAMES <- paste0( "DoY-", seq(1,365,by=16) )

missingValueSieve <- mvSieve(dirs=DIRS[-1],
                             filesPerDir = 23,
                             startPeriod = 2001,
                             endPeriod = 2024,
                             colNames = COLNAMES)

missingValueSieve can be construed as a missing values matrix and thanks to the R package heatmaply we can display this kind of matrices in a graphical form. Below is the missing value matrix of the Garnica dataset as a heatmap.

This heatmap allows us to spot a recurrent feature in NDVI TS, i.e. that during the period of vegetation growth, this kind of datasets usually exhibit a fair amount of information loss. In particular, for the Garnica dataset, there is a non-negligible amount of missing values in the interval between the 145-th day of the year (DoY) and the 289-th. That is, for this example, between May and October -arguably the season of vegetation growth- there is an important information loss.

The spatio-temporal approach of gapfill is applied to blocks of \(d \times d\) images. It is clear that in order to reduce significantly the amount of missing values in the entire dataset, gapfill has to be applied sequentially on a series of blocks whose dimensions can be of order \(d=2,3,4,\ldots\).

Based on a heatmap such as the one above, it may be not too difficult to select a block of \(2\times 2\), images from which to start reducing the amount of overall missing values in the dataset. For instance, we can define block0 as the \(2\times 2\) set comprised of the images acquired in the 113-th and 129-th DoY of 2011 and 2012. This block-selection task can become automatic due to the minmaxBlock() function, though.

minmaxBlock: establishing blocks with minimal (or maximal) missingness

For those studies in which the task of establishing a block of images with the smallest (or largest) amount of missing values requires something more than a quick visual inspection we have developed minmaxBlock(). This function has three arguments, sieve, type and rank. In order to set the first argument, we can use a missing values matrix, such as missingValueSieve. The argument type defines whether you seek for a block with minimal or a maximal amount of missing values. The last argument is a numeric controlling the size of the search grid on which the block with the minimal or maximal amount of missing values will be found. In what follows we provide details for the case type="min", the other case is analogous.

minmaxBlock searches for the minimal \(2\times 2\), non-zero, sub-matrix of a general missing values matrix with entries denoted as \(m_{i,j}\). For any given \(2\times 2\) block we have defined blockMissingness as \(\log( \texttt{cumsum} (m_{i,j}) )\) for \(1\leq i,j\leq 2\). The minimal block is defined as that block with the minimum block missingness. We preferred the cumsum function rather than others, such as cumprod or det, because a general missing values matrix could have a large amount of zeros.

The search for the minimal block runs over the values of a non-zero numeric vector of length equal to the value of the rank argument; for simplicity we call this vector as the rank vector. The entries of this vector are non-zero values of the missing values matrix provided by the sieve argument. Consider a given entry of the rank vector and notice that this number can occupy any of the four entries of a \(2 \times 2\) sub-matrix of the original missing values matrix sieve. Correspondingly, based on this value, four different \(2 \times 2\) matrices are defined and the block missingness of each of these matrices is calculated. This process is applied to each value of the rank vector. Having calculated all the possible blocks, and their corresponding missingness, we define the minimal block of the sieve as the \(2\times 2\) block with the smallest missingness.

Below it is established that for the Garnica’s missing values matrix, the \(2\times 2\) minimal block is comprised by the images acquired in the 113-rd and 129-th DoY of 2011 and 2012.

SIEVE <- (missingValueSieve/max(missingValueSieve)) * 100

minmaxBlock(sieve=SIEVE, rank=10)
#> $rows
#> [1] 11 12
#> 
#> $cols
#> [1] 8 9
#> 
#> $block
#>      DoY-113   DoY-129
#> 2011       0 0.0000000
#> 2012       0 0.6956522
#> 
#> $blockMissingness
#> [1] -0.3629055

In the next section we show how to fill the missing values of this block.

Using gapfill to fill missing values in TSSI

Now that we have found a block of images to apply the spatio-temporal approach introduced in the R package gapfill let’s make a sub-directory ( block0 ) in which we make safe copies of the images to be filled.

DIRS <- list.dirs(path = garnicaNDVIQAdir)[-1]

files2011 <- list.files(path = DIRS[12],
                        pattern = ".tif$",
                        full.names = TRUE)

files2012 <- list.files(path = DIRS[13],
                        pattern = ".tif$",
                        full.names = TRUE)

block0DIR <- paste0( getwd(), "/garnica/block0" )

if( !dir.exists(block0DIR) ){
  dir.create( block0DIR )
}

file.copy(from=files2011[8:9], to=block0DIR)
#> [1] TRUE TRUE

file.copy(from=files2012[8:9], to=block0DIR)
#> [1] TRUE TRUE

`create_dirs()`

Speaking of safety, we have found extremely useful to create a set of sub-directories in which we can store all the auxiliary files that are generated through the process of setting arguments for applying gapfill functions; we encourage to the users to avoid saving auxiliary files in hard-to-remember locations of their systems. With create_dirs(), as shown below, we can create this directory tree.

The directory tree with block0 as its root looks as follows

`dimsReport()`

In order to avoid memory overflow and/or to reduce execution time, it is recommended to know the spatial dimensions of the images to be processed with gapfill; obviously, having a fair understanding of your systems characteristics is highly recommended too. In addition to information about the spatial dimensions (number of rows and columns) of the images, dimsReport() returns all the divisors of these dimensions. A direct application of this function returns something as shown below:

`sort_split()`

Depending on system characteristics, the execution time of the gapfill’s functions tends to increase in direct relationship with the images dimension. We have found that an effective way of speeding up the gap-filling process consists of splitting the original images into smaller spatio-temporal chunks.

The sort_split() function takes care of the splitting process through a console-based application. That is, the user is asked to pass a series of arguments. By accepting the default arguments of sort_split(), we ensure that the general workflow of igapfill() is followed but the user has the flexibility of changing this workflow. When its arguments nrow_split and ncol_split are not provided, sort_split() integrates dimsReport() in its execution as it can be seen below:

As seen above, the last message of sort_split() returns the location where the recently generated chunks were saved. Once we have these chunks, we will continue our workflow by applying the gapfill’s functions to each chunk.

`applyGapfill()`

Now that we have our original images splitted in smaller chunks, we can explode parallel computing and apply Gapfill() -the main function of the R package gapfill()- and reduce the amount of missing values in our images.³

Since we have created a directory to host our original images, moreover, we have a directory tree to store all the auxiliary files of this process, now we can simply take advantage of this structure. That is, by setting a vector with absolute directory names we can set the inputDir, outputDir and progressDir arguments easily. The arguments lat and lon are vectors containing numeric values associated to the latitude and longitude coordinates of each pixel within the images to fill; the functions get_LAT() and get_LON() can be used to calculate appropriate values for these arguments.⁴ The arguments days and years are intended to register the days of the year and the years, respectively, involved in the spatio-temporal chunk defined by the images.⁵ Notice that these four parameters contribute to define a 4-dimensional array and, in effect, Gapfill() is applied to this 4-dimensional structure. More details about this 4-dimensional set are given in the documentation of the functions get_3Darray() and get_4Darray().

In spite of having splitted the original images, a large execution time could still be an issue without parallel computing. With the argument numCores we indicate how many processing cores are willing to employ in this gap-filling process. Our argument clipRange is equivalent to gapfill’s argument clip; this vector defines an interval for valid predicted (filling) values.

As it is well-known, the NDVI takes values on the interval \((-1,1)\), this interval is the default value for the clipRange argument. In the code below, the value for clipRange reflects the fact that the Garnica dataset contains NDVI values recorded in INT4S datatype, that is, these NDVI images take values on the interval (-1e4, 1e4). Accordingly, the scale argument is set equal to 1, reflecting that we are using the NDVI images without scaling them; when the default value for clipRange is used then scale must be set accordingly.

allDIRS <- list.dirs(path = block0DIR, full.names = TRUE)#[-1]

block0FILES <- list.files(path = block0DIR, pattern = ".tif$", 
                          full.names = TRUE)

LAT <- get_LAT(stack = stack(block0FILES))

LON <- get_LON(stack = stack(block0FILES))

applyGapfill(inputDir = allDIRS[7],
             outputDir = allDIRS[5],
             progressDir = allDIRS[6],
             lat = LAT,
             lon = LON,
             days = c(113,129),
             years = 2011:2012,
             numCores = 10,
             scale = 1e0,
             clipRange = c(-1e4,1e4))

The output of applyGapfill() is a set of .RData files store at the path given by the argument outputDir. Each file contains a 4-dimensional array object. That is, for all practical purposes, at this stage the process has produced a series of numeric matrices. Thus, rasterizing and mosaicking these matrices is in order.

parallel_mosaic()

In order to obtain the gap-filled images, we must rasterize the matrices contained in each array and then the resulting rasters must be glued together. It is straightforward to obtain values for the arguments inputDirImages, inputDirRData, inputDirMaster, outputDir and progressReportDir due to the auxiliary objects defined above. Below we use scaleFactor = 1e0 and dataType = "INT4S" reflecting that our original images contain NDVI values recorded in INT4S datatype. With the arguments utilized above in sort_split() we ended up with 5 splits, so below we use numCores = 3; in general we recommend that the value for numCores does not exceed the number of splits.

parallel_mosaic(inputDirImages = block0DIR,
                inputDirRData = allDIRS[5],
                inputDirMaster = allDIRS[4],
                outputDir = allDIRS[3],
                progressReportDir = allDIRS[6],
                scaleFactor = 1e0,
                dataType = "INT4S",
                numCores = 3) # numCores must not exceed number of splits

The output of parallel_mosaic() is a set of GeoTiff files stored at the path indicated by the argument outputDir. In order to verify that the images associated with these files have a reduced amount of missing values we can employ the code below.

First, we read the original images and plot them.

initFILES <- list.files(path = block0DIR,
                        pattern = ".tif$",
                        full.names = TRUE)

gapfilledFILES <- list.files(path = allDIRS[3],
                             pattern = ".tif$",
                             full.names = TRUE)

initIMAGES <- rast(initFILES)
gapfilledIMAGES <- rast(gapfilledFILES)

par(mfrow=c(2,2))
plot(initIMAGES, cex.main=0.75,
     main=sapply(c("Day113, 2011", "Day129, 2011", "Day113, 2012", "Day129, 2012"), 
                 function(s) paste("Original:", s)))

Original block0 images

Then, we compared them with their corresponding igapfilled-versions.

plot(gapfilledIMAGES, cex.main=0.75,
     main=sapply(c("Day113, 2011", "Day129, 2011", "Day113, 2012", "Day129, 2012"), 
                 function(s) paste("Gapfill:", s)))

block0 images after igapfill()

igapfill(): A fully console-based user interface

Most of the steps presented above lead to igapfill(), a console-based wrap-up of create_dirs(), dimsReport(), sort_split(), applyGapfill() and parallel_mosaic(). All that is left to the users is the selection of the images that are to be gap-filled, but as shown above this can be easily done with a combination of mvSieve(), minmaxBlock() and R base code.

Although in our application we applied igapfill’s functions to try and fill most of the missing values in the entire Garnica dataset, we conclude this vignette by showing igapfill()’s output when it is applied to a block of \(3 \times 3\) matrices that combined had a blockMissingness of 2.835768. This is block14 in our application, that is, the following results correspond to the 15-th application of igapfill()’s functions to distinct blocks of Garnica’s images.⁶

First, igapfill() takes care of setting up the directory tree and splitting the original images:

Then, the gap-filling process starts and some parameters are requested. Once this process has finished, another set of parameters are requested prior to embark ourselves into rasterizing and mosaicking the final output:

Finally, as done above, we read and plot the original block14´s images:

block14 images before igapfill()

And then, we can plot the final result of applying igapfill() to block14’s images.

block14 images after igapfill()

Appendix

Auxiliary code used in the mvSieve section

# --- Auxiliary code used in mvSieve section ---

garnicaNDVIQAdir <- paste0( getwd(), "/garnica/250m_16_days_NDVI_QA" )

dir.create( path = garnicaNDVIQAdir, recursive = TRUE )

YEARS <- 2000:2024
ndviNAMES <- names(spatNDVI)
reliabilityNAMES <- names(spatRELIABILITY)
ndviQANAMES <- sapply(ndviNAMES, 
                      function(s) paste0( strsplit(s, ".tif")[[1]][1], "_QA.tif" ))

for(i in 1:20){ #### Year 2000 has 20 images only
  
  if(!dir.exists( paste0( garnicaNDVIQAdir, "/", YEARS[1] ) )){
    dir.create( paste0( garnicaNDVIQAdir, "/", YEARS[1] ) )
  }
  
  TEMPndvi <- subset(spatNDVI,i)
  
  TEMPreliability <- subset(spatRELIABILITY,i)
  
  TEMPndvi[ TEMPreliability >= 2] <- NA
  writeRaster(TEMPndvi, 
              filename = paste0( garnicaNDVIQAdir, "/", YEARS[1], "/",
                                 ndviQANAMES[i] ),
              datatype = "INT2S", overwrite = TRUE)
}

for (i in 2:length(YEARS)) {
  
  if(!dir.exists( paste0( garnicaNDVIQAdir, "/", YEARS[i] ) )){
    dir.create( paste0( garnicaNDVIQAdir, "/", YEARS[i] ) )
  }
  
  for(j in 1:23){
    count <- 20 + ((i-2)*23) + j
    TEMPndvi <- subset(spatNDVI,  count)
    TEMPreliability <- subset(spatRELIABILITY,count)
    TEMPndvi[ TEMPreliability >= 2 ] <- NA
    
    writeRaster(TEMPndvi, 
                filename = paste0( garnicaNDVIQAdir, "/", YEARS[i], "/",
                                   ndviQANAMES[count] ),
                datatype = "INT2S", overwrite = TRUE)
  }
}

# --- Auxiliary code used in mSieve section ---
out <- heatmaply((missingValueSieve/max(missingValueSieve)) * 100,
          limits = c(0,100),
          colors = cool_warm,
          dendrogram = "none",
          xlab = "", ylab = "",
          width = 900,
          height = 650,
          main = "% of missing values in NDVI TS of Cerro de Garnica (2001-2024)",
          scale = "none",
          draw_cellnote=TRUE,
          cellnote_textposition = "middle center",
          cellnote_size = 6,
          margins = c(60, 100, 40, 20),
          titleX = FALSE,
          hide_colorbar = TRUE,
          labRow = rownames(missingValueSieve),
          labCol = colnames(missingValueSieve),
          heatmap_layers = theme(axis.line = element_blank()))

html_out <- as.tags(out)

scrollable_out <- tags$div(
  class = "scrollable",
  style = "width: 100%;",
  html_out
)

scrollable_out

We have saved the layers of this product in sub-directories organized by years, the first being 2000 and the last one being 2024.↩︎
Although the TSSI starts in 2000, there are only 20 images for that year. Starting in 2001, every year has 23 NDVI images. Hence in our example startPeriod=2001.↩︎
In general, depending on the amount of missing information in the chosen block the reduction can be total or marginal.↩︎
The length of these vectors must coincide with the number of rows and columns of the spatRaster generated by the images to fill.↩︎
The length of these vectors must coincide with the number of images to process by year and with the number of years to process.↩︎
Recall that we started with block0.↩︎

These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.