The decoder problem that looks impossible at first
The transformer decoder has to satisfy two goals that seem to clash.
First, during inference, it must generate text autoregressively:
- predict one token
- append it to the sequence
- use that sequence to predict the next token
- repeat

That is mandatory for language generation, translation, summarization, and next-token prediction. When the model is about to predict the next word, the future words do not exist yet, so it cannot look ahead.
Second, during training, we want speed. Waiting for the model to generate one token at a time would be painfully slow. GPUs are good at parallel computation, so we want to process many positions at once.
This creates the central decoder question:
How can training be parallel while generation remains causally correct?

The answer is masked multi-head self-attention, often called causal self-attention.
It gives us the best of both worlds:
- the decoder still learns the left-to-right rule of language generation
- training can process all target positions in parallel
- future tokens are hidden, so there is no data leakage
First, what "autoregressive" really means
An autoregressive model predicts the next token using only the tokens that came before it.
In probability form, a target sequence:
y_1, y_2, y_3, ..., y_T

is modeled as:

P(y_1, y_2, ..., y_T) = P(y_1) P(y_2 | y_1) P(y_3 | y_1, y_2) ... P(y_T | y_1, ..., y_{T-1})

That long equation says something simple: each new token depends only on the past.

If the decoder is generating:

The weather is

then it can try to predict:

nice

But it cannot peek at the rest of the sentence first.
That left-to-right dependency is what makes the decoder autoregressive.
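The factorization is easy to check numerically. The conditional probabilities below are made-up toy values, not real model outputs:

```python
import math

# Toy check of the chain-rule factorization. These are hypothetical values
# for P(y_t | y_1..y_{t-1}), t = 1..4 -- not outputs of any real model.
conditionals = [0.4, 0.7, 0.9, 0.5]

joint = math.prod(conditionals)                      # P(y_1) * P(y_2|y_1) * ...
log_joint = sum(math.log(p) for p in conditionals)   # the same thing in log space

print(round(joint, 6))  # 0.126
```

Training objectives usually work with the log form, since a sum of log-probabilities is numerically better behaved than a long product.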
Where masked attention sits inside the decoder
A standard transformer decoder layer has three main parts:
- Masked multi-head self-attention
- Encoder-decoder cross-attention
- Feed-forward network
The first block is the one we care about here.
Why does the first block need masking?
Because it attends over the target sequence itself. If we let each target token look at all other target tokens during training, then position 2 could see position 3, position 4, and beyond. That would let the model cheat.
By contrast, the cross-attention block reads from the encoder output. That is allowed, because the source sentence is fully known.
So the mask belongs specifically to the decoder's self-attention over the target-side tokens.
Inference: why the decoder must generate one token at a time
Let us use a translation example:
English source: How are you?
Nepali target: तपाईं कस्तो हुनुहुन्छ

During inference, the encoder reads the entire source sentence:

How | are | you | ?

Then the decoder begins with a special start token:

<START>

Now the process looks like this:
Step 1
Decoder input: <START>
Prediction: तपाईं

Step 2
Decoder input: <START> तपाईं
Prediction: कस्तो

Step 3
Decoder input: <START> तपाईं कस्तो
Prediction: हुनुहुन्छ

Step 4
Decoder input: <START> तपाईं कस्तो हुनुहुन्छ
Prediction: <END>

That is true autoregressive generation.
At every step, the model uses only the already generated target prefix plus the encoder's source representation.
There is no way around this during inference, because the future target tokens do not exist yet.
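That loop can be sketched in a few lines. `predict_next` here is a hypothetical stand-in that simply replays the fixed translation; a real decoder would run a forward pass at each step:

```python
# Greedy autoregressive decoding, sketched with a stand-in predictor.
TARGET = ["तपाईं", "कस्तो", "हुनुहुन्छ", "<END>"]

def predict_next(prefix):
    # Hypothetical stand-in: a real decoder would score the whole vocabulary
    # given this prefix (plus the encoder output) and return the argmax token.
    return TARGET[len(prefix) - 1]  # prefix always starts with <START>

tokens = ["<START>"]
while tokens[-1] != "<END>":
    tokens.append(predict_next(tokens))

print(tokens)  # ['<START>', 'तपाईं', 'कस्तो', 'हुनुहुन्छ', '<END>']
```

The key property to notice: each call to `predict_next` sees only the prefix generated so far, never anything beyond it.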
Why naive autoregressive training would be too slow
Suppose we trained exactly the same way as inference:
- feed <START>
- predict तपाईं
- append it
- predict कस्तो
- append it
- predict हुनुहुन्छ
- repeat
That would work, but it would waste the transformer's biggest strength: parallel computation across sequence positions.

If we forced training to wait for one predicted token before computing the next, we would make the decoder feel much more like an old recurrent model during optimization.
That is why training usually uses teacher forcing.
Teacher forcing: the trick that makes training fast
Teacher forcing means we do not wait for the model to generate the previous correct token.
Instead, we feed the correct target tokens from the dataset.
For the target sequence:
तपाईं कस्तो हुनुहुन्छ <END>

we build a shifted decoder input:

<START> तपाईं कस्तो हुनुहुन्छ

and the expected outputs become:

तपाईं कस्तो हुनुहुन्छ <END>

So position by position:
| Decoder input token | Model should predict |
|---|---|
| <START> | तपाईं |
| तपाईं | कस्तो |
| कस्तो | हुनुहुन्छ |
| हुनुहुन्छ | <END> |
This is the key training trick.
We already know the full correct target sentence from the dataset, so we can prepare all decoder inputs at once. That means we can compute the logits for all positions in parallel.
But there is a subtle correction worth making here: training is parallelized, but the objective is still autoregressive.

The model is still learning to predict the next token from the past. Teacher forcing only changes how efficiently we compute the training pass.
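Building the shifted input and label sequences takes only a couple of lines; the token lists below mirror the example above:

```python
# Sketch of preparing one teacher-forcing training pair from a known target.
target = ["तपाईं", "कस्तो", "हुनुहुन्छ", "<END>"]

decoder_input = ["<START>"] + target[:-1]  # shift right: drop <END>, prepend <START>
labels = target                            # the next-token label at each position

for inp, lab in zip(decoder_input, labels):
    print(inp, "->", lab)
```

Real pipelines do the same shift on token ID tensors for a whole batch at once, but the idea is identical.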
The cheating problem if we use ordinary self-attention
Now the danger appears.
Suppose we send the full decoder input:
<START> तपाईं कस्तो हुनुहुन्छ

through a normal self-attention layer with no mask.
Self-attention compares every token with every other token.
That means:
- <START> could attend to तपाईं, कस्तो, and हुनुहुन्छ
- तपाईं could attend to कस्तो and हुनुहुन्छ
- कस्तो could attend to हुनुहुन्छ
But that violates the autoregressive rule.
When the model is trying to predict कस्तो, it should only have access to:
<START> तपाईं

not to:

हुनुहुन्छ

If it can see future tokens during training, the loss becomes artificially easy. The model is effectively reading the answer sheet.
That is exactly what people mean by future-token leakage or data leakage in the decoder.
The fix: masked self-attention
Masked self-attention says:
process all positions in parallel, but block attention to the future

So if the target length is 4, then:
- position 1 can attend to position 1
- position 2 can attend to positions 1 and 2
- position 3 can attend to positions 1, 2, and 3
- position 4 can attend to positions 1, 2, 3, and 4
but never the other way around.
This is why the mask is often drawn as a lower-triangular matrix.
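That lower-triangular structure is easy to build directly. A minimal sketch in plain Python, using the additive convention (0 = visible, -inf = blocked):

```python
# Causal mask for a length-4 target: position i may attend to j only when j <= i.
NEG_INF = float("-inf")
T = 4

mask = [[0.0 if j <= i else NEG_INF for j in range(T)] for i in range(T)]

for row in mask:
    print(row)
```

Tensor libraries build the same thing with a triangular helper (for example an upper-triangular matrix of -inf), but the shape and meaning are exactly this.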
A quick recap of attention scores
Inside self-attention, we compute:
Attention(Q, K, V) = softmax((QK^T) / sqrt(d_k)) V

The matrix QK^T contains one score for every query-key pair.

If the sequence length is 4, the score matrix has shape 4 x 4.

Before masking, it might look like this:
[
  [2.1, 1.3, 0.7, 1.8],
  [0.5, 2.4, 1.9, 0.8],
  [1.1, 0.9, 2.7, 1.6],
  [0.4, 1.0, 1.5, 2.2]
]

Interpret row 2 like this: when token 2 queries the sequence, how much should it attend to tokens 1, 2, 3, and 4?

Without masking, token 2 can see tokens 3 and 4, which is illegal for decoder self-attention.
How the mask is applied
We create a mask of the same shape:
[
  [0, -inf, -inf, -inf],
  [0, 0,    -inf, -inf],
  [0, 0,    0,    -inf],
  [0, 0,    0,    0   ]
]

Then we add it to the raw attention scores:

masked_scores = scores + mask

So the previous score matrix becomes:
[
  [2.1, -inf, -inf, -inf],
  [0.5, 2.4,  -inf, -inf],
  [1.1, 0.9,  2.7,  -inf],
  [0.4, 1.0,  1.5,  2.2 ]
]

Now apply softmax row by row.

Remember what softmax does:

softmax([a, b, c]) -> positive weights that sum to 1

But e^(-inf) = 0, so any masked position gets probability zero after softmax.

That means row 2 becomes something like:

softmax([0.5, 2.4, -inf, -inf]) ~= [0.13, 0.87, 0.00, 0.00]

The future has been hidden without breaking parallel computation.
That is the heart of masked self-attention.
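The whole mask-add-softmax pipeline fits in a few lines of plain Python; the score matrix is the same toy example used above:

```python
import math

# The toy score matrix from above, masked and normalized row by row.
NEG_INF = float("-inf")

scores = [
    [2.1, 1.3, 0.7, 1.8],
    [0.5, 2.4, 1.9, 0.8],
    [1.1, 0.9, 2.7, 1.6],
    [0.4, 1.0, 1.5, 2.2],
]
mask = [[0.0 if j <= i else NEG_INF for j in range(4)] for i in range(4)]

def softmax(row):
    m = max(row)                           # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]  # math.exp(-inf) is exactly 0.0
    total = sum(exps)
    return [e / total for e in exps]

weights = [softmax([s + v for s, v in zip(score_row, mask_row)])
           for score_row, mask_row in zip(scores, mask)]

print([round(w, 2) for w in weights[1]])  # row 2: [0.13, 0.87, 0.0, 0.0]
```

Every row still sums to 1, but the probability mass is redistributed entirely over the visible prefix.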
What each position is allowed to know
For the decoder input:
<START> तपाईं कस्तो हुनुहुन्छ

the mask enforces this visibility:
| Position | Current token | Allowed context |
|---|---|---|
| 1 | <START> | <START> |
| 2 | तपाईं | <START>, तपाईं |
| 3 | कस्तो | <START>, तपाईं, कस्तो |
| 4 | हुनुहुन्छ | <START>, तपाईं, कस्तो, हुनुहुन्छ |
So when the model produces logits at position 3, it can use:
<START> तपाईं कस्तो

but not:

हुनुहुन्छ

That keeps the training setup faithful to inference-time causality.
Why this still counts as parallel training
A common confusion is:
If the decoder is autoregressive, how can training possibly be parallel?

Because all positions are computed in one matrix operation.
We do not run four separate forward passes for the four target positions. We run one forward pass over the shifted target sequence:
<START> | तपाईं | कस्तो | हुनुहुन्छThe mask makes each row behave as if it only had past context, even though the GPU computes all rows together.
So the training loop is:
- build the full shifted target sequence
- run masked self-attention over all positions together
- produce logits for all positions together
- compare against all next-token labels together
This is why transformers train so much faster than an architecture that must step through the sequence one token at a time.
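One way to convince yourself of this equivalence: run the masked computation over the whole toy score matrix in one pass, then recompute each row using only its visible prefix, and compare:

```python
import math

# Sanity check of the parallelism claim: one masked pass over the full toy
# score matrix gives each position the same weights as a softmax over its
# visible prefix alone.
NEG_INF = float("-inf")

scores = [
    [2.1, 1.3, 0.7, 1.8],
    [0.5, 2.4, 1.9, 0.8],
    [1.1, 0.9, 2.7, 1.6],
    [0.4, 1.0, 1.5, 2.2],
]

def softmax(row):
    m = max(row)                       # stabilize before exponentiating
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

# Parallel: mask out the future in every row, then softmax all rows at once.
parallel = [softmax([s if j <= i else NEG_INF for j, s in enumerate(row)])
            for i, row in enumerate(scores)]

# Sequential: softmax only over the prefix each position is allowed to see.
sequential = [softmax(row[:i + 1]) + [0.0] * (len(row) - i - 1)
              for i, row in enumerate(scores)]

assert all(abs(a - b) < 1e-12
           for p, q in zip(parallel, sequential)
           for a, b in zip(p, q))
print("parallel masked pass matches per-prefix passes")
```

The GPU computes the "parallel" version, but each row behaves exactly as if only its prefix existed.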
Why it is called masked multi-head attention, not just masked attention
The decoder does not use one single attention map.
It uses multiple heads:
head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)

Each head gets its own learned projections and its own masked attention calculation. The causal mask is applied inside every head.
That means each head can learn a different pattern while still respecting the no-future rule.
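A small sketch of that per-head masking, using arbitrary toy score matrices in place of real learned Q/K projections:

```python
import math

NEG_INF = float("-inf")

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_weights(scores):
    # Apply the causal mask inside the head, then softmax each row.
    return [softmax([s if j <= i else NEG_INF for j, s in enumerate(row)])
            for i, row in enumerate(scores)]

# Two "heads", each with its own toy score matrix. Real heads would derive
# these scores from their own learned Q and K projections.
head_scores = [
    [[2.0, 0.1, 0.3], [0.2, 1.0, 0.5], [0.1, 0.4, 1.5]],
    [[0.5, 1.2, 0.9], [1.1, 0.3, 0.7], [0.6, 0.8, 0.2]],
]

for h, scores in enumerate(head_scores, start=1):
    weights = causal_weights(scores)
    # Every head independently assigns zero weight to future positions.
    assert all(w == 0.0 for i, row in enumerate(weights) for w in row[i + 1:])
    print("head", h, "blocks the future")
```

Different scores per head, different attention patterns per head, but the same triangular constraint in all of them.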
For example, different heads may specialize in:
- local phrase structure
- subject-verb agreement
- long-range dependencies
- punctuation or sentence boundaries
- alignment patterns useful for translation
Then the head outputs are concatenated and projected back:
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O

So the full idea is:

- multiple attention heads
- each head is causal
- all heads run in parallel

Masked self-attention vs padding masks
Another easy confusion is mixing up two different masks.
1. Causal mask
This blocks future positions.
It enforces:
token t cannot attend to tokens after t

2. Padding mask
This blocks padding tokens.
It enforces:
do not treat <PAD> as real content

In real implementations, decoder attention often uses both ideas together:
- a causal mask for left-to-right generation
- a padding mask for ignoring padded positions
They solve different problems.
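A sketch of how the two masks combine under the same additive convention (0 = keep, -inf = block), assuming a hypothetical length-5 row whose last two tokens are padding:

```python
# Combining the causal mask and the padding mask. The padding flags below
# are a made-up example: a length-5 row where the last two tokens are <PAD>.
NEG_INF = float("-inf")
T = 5
is_pad = [False, False, False, True, True]

# A position j is blocked if it lies in the future (j > i) OR if it is padding.
combined = [[NEG_INF if (j > i or is_pad[j]) else 0.0 for j in range(T)]
            for i in range(T)]

for row in combined:
    print(["." if v == 0.0 else "x" for v in row])  # "." = visible, "x" = blocked
```

Note that a `<PAD>` key is blocked even when it sits in the past, which is exactly why the two masks cannot be replaced by the causal mask alone.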
Why masking matches inference so well
During inference, future tokens are unavailable because they have not been generated yet.
During training, future tokens are physically present in the batch tensor because the full sentence is known.
The causal mask makes training behave as if those future tokens were not available.
That is the whole trick.
So the relationship is:
| Setting | Are future target tokens present in memory? | Can the model use them? |
|---|---|---|
| Inference | No | No |
| Training without mask | Yes | Yes, which is wrong |
| Training with causal mask | Yes | No, which is correct |
This is why masked self-attention is not a minor implementation detail. It is what makes the decoder training objective faithful to the generation process.
A very practical mental model
Imagine four students taking a sequential exam.
Each student is supposed to answer only using:
- the question in front of them
- the notes from earlier solved questions
Training without a mask is like letting student 2 glance at student 3's answer sheet.
Training with a mask is like seating them behind one-way partitions:
- student 1 sees only station 1
- student 2 sees stations 1 and 2
- student 3 sees stations 1, 2, and 3
- student 4 sees all previous stations
All students can still work at the same time, but no one can copy from the future.
That is exactly what the causal mask does inside the decoder.
The part many beginners miss
People sometimes say:
the decoder is autoregressive at inference but non-autoregressive at training

That sentence is useful informally, but it is slightly imprecise.
The better statement is:
the decoder learns an autoregressive objective,
but training computes all positions in parallel using teacher forcing and a causal mask

Why does this distinction matter?
Because the model is never allowed to break the left-to-right rule. The rule stays intact. Only the computation becomes more efficient.
Decoder self-attention and cross-attention are doing different jobs
In the decoder, the two attention blocks play different roles.
Masked self-attention
This answers:
What have I generated so far on the target side?

Cross-attention
This answers:
Which source-side information from the encoder is relevant right now?

For translation, that means:
- masked self-attention tracks the already generated Nepali prefix
- cross-attention looks back at the English source representation
Both are necessary, but only the first one needs the causal mask.
Why the decoder would fail without this design
Without masked self-attention, training would reward the model for using information it will never have during inference.
That creates a mismatch:
- training becomes artificially easy
- evaluation during generation becomes much harder
- the model learns dependencies it cannot rely on at test time
This kind of train-test mismatch is exactly what good architecture design tries to avoid.
Masked multi-head attention fixes it cleanly.
The shortest possible summary
If you want the entire idea in five lines, it is this:
- The decoder must generate text left to right.
- Inference is therefore inherently autoregressive.
- Training uses teacher forcing so all target positions can be computed together.
- A causal mask blocks every position from seeing future target tokens.
- After softmax, masked future positions get zero attention weight.
That is how transformers train fast without leaking the answer from the future.
Final takeaway
Masked multi-head attention is the mechanism that makes the transformer decoder practical.
It preserves the causal rule required for generation:
use only the past to predict the next token

while still letting modern hardware process whole sequences efficiently during training.
So whenever you see the triangular decoder mask, remember what it is really doing:
parallelizing the computation
without relaxing the logic of autoregressive generation

That is why the decoder can be fast in training, correct in inference, and faithful to the next-token prediction objective in both.