Masking is a way to tell sequence-processing layers that certain timesteps in an input are missing, and thus should be skipped when processing the data.
Padding is a special form of masking where the masked steps are at the start or the end of a sequence. Padding comes from the need to encode sequence data into contiguous batches: in order to make all sequences in a batch fit a given standard length, it is necessary to pad or truncate some sequences.
Let’s take a close look.
When processing sequence data, it is very common for individual samples to have different lengths. Consider the following example (text tokenized as words):
data <- list(
  c("Hello", "world", "!"),
  c("How", "are", "you", "doing", "today"),
  c("The", "weather", "will", "be", "nice", "tomorrow")
)
After vocabulary lookup, the data might be vectorized as integers, with each word replaced by its index in the vocabulary.
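As a rough illustration of such a lookup, here is a minimal base-R sketch. It assumes a toy vocabulary built from the samples themselves; the names `vocab` and `encoded` are illustrative and not part of Keras, and the resulting indices are not the ones used elsewhere in this guide.

```r
data <- list(
  c("Hello", "world", "!"),
  c("How", "are", "you", "doing", "today"),
  c("The", "weather", "will", "be", "nice", "tomorrow")
)

# Hypothetical vocabulary: every distinct token, in order of first appearance.
vocab <- unique(unlist(data))

# match() returns 1-based positions in `vocab`, so the value 0 is never
# produced and stays available as a padding placeholder.
encoded <- lapply(data, match, table = vocab)

lengths(encoded)
```

Keeping index 0 unused is a design choice that matters later, because a zero can then be read unambiguously as "no token here".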
The data is a nested list where individual samples have length 3, 5,
and 6, respectively. Since the input data for a deep learning model must
be a single tensor (of shape
e.g. (batch_size, 6, vocab_size) in this case), samples
that are shorter than the longest item need to be padded with some
placeholder value (alternatively, one might also truncate long samples
before padding short samples).
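To make the padding step concrete, the following base-R sketch pads (and, if needed, truncates) a ragged list into a matrix. `pad_post` is a hypothetical helper written for this illustration only, not a Keras function, and it truncates at the end of a sequence, which may differ from the defaults of the Keras utility.

```r
# Hypothetical helper (not part of Keras): pad each sequence at the end with
# `value`, truncating anything longer than `maxlen`.
pad_post <- function(seqs, maxlen = max(lengths(seqs)), value = 0) {
  t(vapply(seqs, function(s) {
    s <- head(s, maxlen)                    # truncate long sequences
    c(s, rep(value, maxlen - length(s)))    # pad short sequences
  }, numeric(maxlen)))
}

raw <- list(
  c(711, 632, 71),
  c(73, 8, 3215, 55, 927),
  c(83, 91, 1, 645, 1253, 927)
)

padded <- pad_post(raw)   # one row per sample, one column per timestep
dim(padded)
```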
Keras provides a utility function to truncate and pad lists of
sequences to a common length: pad_sequences.
raw_inputs <- list(
  c(711, 632, 71),
  c(73, 8, 3215, 55, 927),
  c(83, 91, 1, 645, 1253, 927)
)
# By default, this will pad using 0s; it is configurable via the
# "value" parameter.
# Note that you could use "pre" padding (at the beginning) or
# "post" padding (at the end).
# We recommend using "post" padding when working with RNN layers
# (in order to be able to use the
# CuDNN implementation of the layers).
padded_inputs <- pad_sequences(raw_inputs, padding="post")
padded_inputs
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]  711  632   71    0    0    0
## [2,]   73    8 3215   55  927    0
## [3,]   83   91    1  645 1253  927
Now that all samples have a uniform length, the model must be informed that some part of the data is actually padding and should be ignored. That mechanism is masking.
There are three ways to introduce input masks in Keras models:
- Add a layer_masking layer.
- Configure a layer_embedding layer with mask_zero=TRUE.
- Pass a mask argument manually when calling layers that support this argument (e.g. RNN layers).
Under the hood, the Embedding and Masking layers create a mask tensor
(a 2D tensor with shape (batch, sequence_length)) and attach it to the
tensor output they return.
embedding <- layer_embedding(input_dim=5000, output_dim=16, mask_zero=TRUE)
masked_output <- embedding(padded_inputs)
masked_output$`_keras_mask`
## tf.Tensor(
## [[ True  True  True False False False]
##  [ True  True  True  True  True False]
##  [ True  True  True  True  True  True]], shape=(3, 6), dtype=bool)
masking_layer <- layer_masking()
# Simulate the embedding lookup by expanding the 2D input to 3D,
# with embedding dimension of 10.
unmasked_embedding <- op_cast(
    op_tile(op_expand_dims(padded_inputs, axis=-1), c(1L, 1L, 10L)),
    dtype="float32"
)
masked_embedding <- masking_layer(unmasked_embedding)
masked_embedding$`_keras_mask`
## tf.Tensor(
## [[ True  True  True False False False]
##  [ True  True  True  True  True False]
##  [ True  True  True  True  True  True]], shape=(3, 6), dtype=bool)
As you can see from the printed result, the mask is a 2D boolean
tensor with shape (batch_size, sequence_length), where each
individual FALSE entry indicates that the corresponding
timestep should be ignored during processing.
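For zero-padded integer input, the mask shown above amounts to an elementwise comparison with the padding value. A minimal base-R sketch, using a plain matrix in place of a tensor (an assumption made for illustration; Keras computes the mask on tensors internally):

```r
padded <- rbind(
  c(711, 632,   71,   0,    0,   0),
  c( 73,   8, 3215,  55,  927,   0),
  c( 83,  91,    1, 645, 1253, 927)
)

# TRUE marks a real timestep, FALSE marks padding.
mask <- padded != 0

# Counting the TRUE entries per row recovers the original sequence lengths.
seq_lengths <- rowSums(mask)
seq_lengths
```

This only works when 0 never occurs as a genuine token index, which is why the padding value should be reserved.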
When using the Functional API or the Sequential API, a mask generated
by an Embedding or Masking layer will be
propagated through the network to any layer that is capable of using
it (for example, RNN layers). Keras will automatically fetch the mask
corresponding to an input and pass it to any layer that knows how to use
it.
For instance, in the following Sequential model, the
LSTM layer will automatically receive a mask, which means
it will ignore padded values:
model <- keras_model_sequential() %>%
  layer_embedding(input_dim=5000, output_dim=16, mask_zero=TRUE) %>%
  layer_lstm(units=32)
This is also the case for an equivalent model built with the Functional API.
Layers that can handle masks (such as the LSTM layer)
have a mask argument in their call method.
Meanwhile, layers that produce a mask (e.g. Embedding)
expose a compute_mask(input, previous_mask) method which
you can call.
Thus, you can pass the output of the compute_mask()
method of a mask-producing layer to the call method of a
mask-consuming layer, like this:
MyLayer <- new_layer_class(
  "MyLayer",
  initialize = function(...) {
    super$initialize(...)
    self$embedding <- layer_embedding(
      input_dim=5000, output_dim=16, mask_zero=TRUE
    )
    self$lstm <- layer_lstm(units=32)
  },
  call = function(inputs) {
    inputs %>%
      self$embedding() %>%
      # Note that you could also prepare a `mask` tensor manually.
      # It only needs to be a boolean tensor
      # with the right shape, i.e. (batch_size, timesteps).
      self$lstm(mask=self$embedding$compute_mask(inputs))
  }
)
layer <- MyLayer()
x <- random_integer(c(32, 10), 0, 100)
layer(x)
## tf.Tensor(
## [[ 0.00130048 -0.00113367 -0.00715671 ... -0.00107615 -0.00162071
##    0.00135018]
##  [-0.004185    0.00726349  0.00520932 ...  0.00119117  0.00230441
##    0.00174123]
##  [-0.00537032 -0.00164898 -0.00238435 ... -0.00154158 -0.0038603
##   -0.00105811]
##  ...
##  [ 0.00622133 -0.00905907 -0.00599518 ...  0.00025823 -0.00142478
##   -0.00125036]
##  [-0.00523904  0.00336683 -0.00299453 ...  0.00876719  0.00172074
##    0.00903089]
##  [-0.00393721  0.00058538  0.00503809 ... -0.00203075  0.00325885
##   -0.00299755]], shape=(32, 32), dtype=float32)
Sometimes, you may need to write layers that generate a mask (like
Embedding), or layers that need to modify the current
mask.
For instance, any layer that produces a tensor with a different time
dimension than its input, such as a Concatenate layer that
concatenates on the time dimension, will need to modify the current mask
so that downstream layers will be able to properly take masked timesteps
into account.
To do this, your layer should implement the
layer.compute_mask() method, which produces a new mask
given the input and the current mask.
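For the concatenation case mentioned above, the new mask is simply the input masks joined along the same axis as the data. The base-R sketch below uses logical matrices as stand-ins for mask tensors; the variable names are illustrative assumptions.

```r
# Masks for two batches of sequences with shapes (2, 3) and (2, 2).
mask_a <- rbind(c(TRUE, TRUE, FALSE),
                c(TRUE, FALSE, FALSE))
mask_b <- rbind(c(TRUE, FALSE),
                c(TRUE, TRUE))

# Concatenating the sequences along the time axis (columns here) means the
# masks must be concatenated along that axis as well.
mask_out <- cbind(mask_a, mask_b)

dim(mask_out)
mask_out[1, ]
```

Note that the masked steps of the first input now sit in the middle of the combined sequence, so the result is a general mask rather than pure padding.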
Here is an example of a TemporalSplit layer that needs
to modify the current mask.
TemporalSplit <- new_layer_class(
  "TemporalSplit",
  call = function(inputs) {
    # Expect the input to be 3D and the mask to be 2D. Split the input tensor
    # into 2 subtensors along the time axis (axis 2, since axes are 1-based here).
    op_split(inputs, 2, axis=2)
  },
  compute_mask = function(inputs, mask = NULL) {
    # Also split the mask into 2, if it is present.
    if (!is.null(mask)) {
      op_split(mask, 2, axis=2)
    } else {
      NULL
    }
  }
)
c(first_half, second_half) %<-% TemporalSplit(masked_embedding)
first_half$`_keras_mask`
## tf.Tensor(
## [[ True  True  True]
##  [ True  True  True]
##  [ True  True  True]], shape=(3, 3), dtype=bool)
second_half$`_keras_mask`
## tf.Tensor(
## [[False False False]
##  [ True  True False]
##  [ True  True  True]], shape=(3, 3), dtype=bool)
Here is another example of a CustomEmbedding layer that
is capable of generating a mask from input values:
CustomEmbedding <- new_layer_class(
  "CustomEmbedding",
  initialize = function(input_dim, output_dim, mask_zero=FALSE, ...) {
    super$initialize(...)
    self$input_dim <- as.integer(input_dim)
    self$output_dim <- as.integer(output_dim)
    self$mask_zero <- mask_zero
  },
  build = function(input_shape) {
    self$embeddings <- self$add_weight(
      shape=c(self$input_dim, self$output_dim),
      initializer="random_normal",
      dtype="float32"
    )
  },
  call = function(inputs) {
    inputs <- op_cast(inputs, "int32")
    op_take(self$embeddings, inputs)
  },
  compute_mask = function(inputs, mask=NULL) {
    if (!self$mask_zero) {
      NULL
    } else {
      op_not_equal(inputs, 0)
    }
  }
)
layer <- CustomEmbedding(input_dim = 10, output_dim = 32, mask_zero=TRUE)
x <- random_integer(c(3, 10), 0, 9)
y <- layer(x)
mask <- layer$compute_mask(x)
mask
## tf.Tensor(
## [[ True  True  True  True  True  True  True  True  True  True]
##  [ True  True  True  True  True  True  True  True  True  True]
##  [False False  True  True  True  True  True  True  True  True]], shape=(3, 10), dtype=bool)
Note: For more details about format limitations related to masking, see the serialization guide.
Most layers don’t modify the time dimension, so don’t need to modify the current mask. However, they may still want to be able to propagate the current mask, unchanged, to the next layer. This is an opt-in behavior. By default, a custom layer will destroy the current mask (since the framework has no way to tell whether propagating the mask is safe to do).
If you have a custom layer that does not modify the time dimension,
and you want it to be able to propagate the current input mask, you
should set self$supports_masking <- TRUE in the layer’s
initialize method. In this case, the default behavior of
compute_mask() is to just pass the current mask
through.
Here’s an example of a layer that is whitelisted for mask propagation:
MyActivation <- new_layer_class(
  "MyActivation",
  initialize = function(...) {
    super$initialize(...)
    self$supports_masking <- TRUE
  },
  call = function(inputs) {
    op_relu(inputs)
  }
)
You can now use this custom layer in-between a mask-generating layer
(like Embedding) and a mask-consuming layer (like
LSTM), and it will pass the mask along so that it reaches
the mask-consuming layer.
Some layers are mask consumers: they accept a
mask argument in call and use it to determine
whether to skip certain time steps.
To write such a layer, you can simply add a mask = NULL
argument in your call signature. The mask associated with
the inputs will be passed to your layer whenever it is available.
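Before looking at a Keras layer, it may help to see what consuming a mask means numerically. The base-R sketch below computes a softmax over the time axis of a (batch, timesteps) matrix while ignoring masked steps; `masked_softmax` is a hypothetical helper for illustration, and it assumes every row has at least one unmasked timestep.

```r
# Hypothetical helper (not part of Keras).
# scores: numeric matrix (batch, timesteps); mask: logical matrix, same shape.
masked_softmax <- function(scores, mask) {
  stopifnot(identical(dim(scores), dim(mask)))
  weights <- exp(scores) * mask   # masked timesteps contribute zero
  weights / rowSums(weights)      # normalise each row over the time axis
}

scores <- rbind(c(1, 2, 3, 0),
                c(0, 0, 0, 0))
mask <- rbind(c(TRUE, TRUE, TRUE, FALSE),
              c(TRUE, TRUE, FALSE, FALSE))

probs <- masked_softmax(scores, mask)
rowSums(probs)
```

Each row sums to 1 over its unmasked steps only, and masked positions receive a probability of exactly 0.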
Here’s a simple example: a layer that computes a softmax over the time dimension (axis 2 in R’s 1-based indexing) of an input sequence, while discarding masked timesteps.
TemporalSoftmax <- new_layer_class(
  "TemporalSoftmax",
  initialize = function(...) {
    super$initialize(...)
    self$supports_masking <- TRUE
  },
  call = function(inputs, mask=NULL) {
    if (is.null(mask)) {
      stop("`TemporalSoftmax` layer requires a previous layer to support masking.")
    }
    broadcast_float_mask <- op_expand_dims(op_cast(mask, "float32"), -1)
    # Zero out masked timesteps, then sum over the time axis (axis 2, 1-based).
    inputs_exp <- op_exp(inputs) * broadcast_float_mask
    inputs_sum <- op_sum(inputs_exp, axis=2, keepdims=TRUE)
    inputs_exp / inputs_sum
  }
)
inputs <- keras_input(shape = shape(NULL), dtype="int32")
outputs <- inputs %>%
  layer_embedding(input_dim=10, output_dim=32, mask_zero=TRUE) %>%
  layer_dense(1) %>%
  TemporalSoftmax()
model <- keras_model(inputs, outputs)
y <- model(random_integer(c(32, 100), 0, 10))
That is all you need to know about padding & masking in Keras. To recap:
- Some layers generate masks: Embedding can generate a mask from input values (if mask_zero=TRUE), and so can the Masking layer.
- Some layers consume masks: they expose a mask argument in their call method. This is the case for RNN layers.
- When using layers in a standalone way, you can pass mask arguments to layers manually.