Self-Attention in the Decoder

Charan H U
Feb 14, 2024

Similar functionality: As in the encoder, each position in the decoder’s self-attention layer attends to other positions in the same sequence; the restriction is that it may only attend to positions up to and including itself. This lets the model use the context of previously generated tokens when producing new ones.

[Figure: the Transformer, with the encoder on the left and the decoder on the right]

Preventing Leftward Information Flow: However, we cannot allow the decoder to “peek” ahead at future tokens during training. This would violate the autoregressive property, where each output token should depend only on previously generated tokens.

Masking Solution: The solution is masking. Specifically, during self-attention in the decoder, we use a mask to set the attention scores for future tokens to negative infinity (−∞) before the softmax. The softmax then assigns zero probability to these positions, effectively ignoring them. The decoder (the right half of the figure above) is where this masking takes place.
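To make this concrete, here is a minimal NumPy sketch of the idea (the function name `masked_softmax` and the toy shapes are illustrative, not from the original post): future positions get a score of −∞, so the softmax gives them exactly zero weight.

```python
import numpy as np

def masked_softmax(scores: np.ndarray) -> np.ndarray:
    """Causally mask raw attention scores (query-key products), then softmax.

    scores has shape (seq_len, seq_len); row i holds the scores for query
    position i against every key position j.
    """
    seq_len = scores.shape[-1]
    # True strictly above the diagonal means "key j is in the future of query i".
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    # Replace future scores with -inf so softmax assigns them zero probability.
    masked = np.where(future, -np.inf, scores)
    # Numerically stable softmax over the key dimension.
    masked = masked - masked.max(axis=-1, keepdims=True)
    weights = np.exp(masked)
    return weights / weights.sum(axis=-1, keepdims=True)

# Toy example: 4 decoder positions. Each row sums to 1, and every entry
# above the diagonal (a future token) is exactly 0.
rng = np.random.default_rng(0)
print(masked_softmax(rng.standard_normal((4, 4))).round(2))
```

In a real decoder these scores would be QKᵀ/√d_k computed per attention head, but the masking step is exactly the one described above.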

Types of Masking:
Causal Masking: This is the most common type, masking out future tokens. It ensures the autoregressive property and can be implemented as a triangular mask (see the sketch after this list).

Subsequent Masking: This is essentially another name for the same idea: it masks out all positions after the current one, so the model can attend to previously generated tokens but not to future ones.
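For reference, this is what the triangular (causal) mask looks like for a toy sequence of five positions; 1 means the query row may attend to that key column, 0 means it is masked out (a small illustrative sketch, not code from the post):

```python
import numpy as np

# Causal ("subsequent") mask for 5 positions: row i is the query at position i,
# and it may attend only to key positions 0..i (the lower triangle).
seq_len = 5
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=int))
print(causal_mask)
# [[1 0 0 0 0]
#  [1 1 0 0 0]
#  [1 1 1 0 0]
#  [1 1 1 1 0]
#  [1 1 1 1 1]]
```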

Impact of Masking:
Maintains Causality: Masking enforces the sequential nature of language generation, where each word depends on the previous ones.

Improved Training: By preventing information leakage, masking ensures the model learns to predict outputs based on past information, improving its generalizability.

Computational Efficiency: In a naive implementation the masked scores are still computed and only then zeroed out, so masking by itself does not save work; however, optimized attention kernels can exploit the causal structure (for example, by skipping fully masked blocks), speeding up training and inference.

Further Notes:
In practice, the mask is typically applied by adding a large negative constant to the scores (or filling them via a boolean mask) rather than computing with literal infinities, or by calling a fused attention kernel that handles the causal mask internally; see the sketch after this list.
Different masking strategies can be used depending on the specific task and model architecture.
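As a sketch of the first note above (assuming PyTorch 2.x; the tensor shapes and variable names are illustrative), here are two common ways to apply the causal mask without hand-writing infinities: an explicit mask that fills future positions with a large negative value, and the fused `scaled_dot_product_attention` kernel with `is_causal=True`.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, heads, seq_len, head_dim = 1, 2, 5, 8
q = torch.randn(batch, heads, seq_len, head_dim)
k = torch.randn(batch, heads, seq_len, head_dim)
v = torch.randn(batch, heads, seq_len, head_dim)

# 1) Explicit mask: a large negative value stands in for -inf.
scores = q @ k.transpose(-2, -1) / head_dim ** 0.5
causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
scores = scores.masked_fill(~causal, -1e9)  # future positions get ~zero weight
out_manual = torch.softmax(scores, dim=-1) @ v

# 2) Fused kernel: let the attention implementation handle the causal mask.
out_fused = F.scaled_dot_product_attention(q, k, v, is_causal=True)

print(torch.allclose(out_manual, out_fused, atol=1e-5))  # should print True
```

Both paths produce the same output here; the fused kernel is usually preferred in practice because it can avoid materializing the full score matrix.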

I hope this additional information clarifies the importance of masking in decoder self-attention and its role in preserving the autoregressive property. Feel free to ask if you have any further questions or want to explore specific aspects in more detail!

#attention #llm #decoder #masking #autoregressive #selfattention

GitHub: https://github.com/charanhu
Linkedin: https://www.linkedin.com/in/charanhu/
Twitter: twitter.com/charan_h_u


Charan H U

Applied AI Engineer | Internet Content Creator | Freelancer | Farmer | Student