Transformers
Encoder
The encoder is responsible for processing the input sequence and compressing the information into a context or memory for the decoder.
Each encoder layer comprises three main elements:
Multi-Head Attention: lets each token attend to every other token in the sequence, capturing dependencies across several attention heads in parallel
Feed-Forward Neural Network: applying a nonlinear transformation to each position independently
Add & Norm:
stabilizing the activations by combining residual connections and layer normalization.
mitigating the vanishing gradient problem in both the encoder and the decoder
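As a minimal sketch of this post-norm arrangement (residual connection followed by layer normalization; the learnable gain and bias of layer normalization are omitted for brevity, and `sublayer_out` stands for the output of either the attention or feed-forward sub-layer):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # normalize each token's feature vector to zero mean and unit variance
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def add_and_norm(x, sublayer_out):
    # residual connection followed by layer normalization
    return layer_norm(x + sublayer_out)
```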
Decoder
The decoder takes the context from the encoder and generates the output sequence
It is also composed of multiple layers and shares most of its structure with the encoder, with a few changes:
Masked Multi-Head Attention: similar to multi-head attention, but with a masking mechanism that ensures the prediction for a given word does not depend on future words in the sequence
Encoder-Decoder Attention: this layer allows the decoder to focus on relevant parts of the input sequence, leveraging the context provided by the encoder
Feed-Forward Neural Network: refines the attention vectors in preparation for generating the output sequence
Tokenization and Representation
tokenization:
converts sentences into a machine-readable format
each word in the sentence is treated as a distinct token in word-level tokenization
tokens are then mapped to vector representations, such as word embeddings
subword-level approaches such as byte-pair encoding (BPE) or WordPiece often address the limitations of word-level tokenization.
e.g., "unhappiness" may be split into "un" and "happiness"
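As a rough illustration, here is a toy greedy longest-match subword tokenizer in the spirit of WordPiece; the vocabulary and the resulting split are invented for this example, since real BPE/WordPiece tokenizers learn their vocabularies from a training corpus.

```python
# Toy greedy longest-match subword tokenizer (WordPiece-style sketch).
# The vocabulary below is made up for illustration only.
VOCAB = {"un", "happiness", "happy", "ness", "[UNK]"}

def subword_tokenize(word, vocab=VOCAB):
    tokens, start = [], 0
    while start < len(word):
        # find the longest vocabulary entry that matches at this position
        end = len(word)
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:              # no match: fall back to the unknown token
            return ["[UNK]"]
        tokens.append(word[start:end])
        start = end
    return tokens

print(subword_tokenize("unhappiness"))  # ['un', 'happiness']
```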
Positional Encodings
The Transformer model is inherently order-agnostic; unlike RNNs or CNNs, it does not naturally capture the order of a sequence.
Since the Transformer model processes all tokens in the input sequence in parallel, it has no built-in mechanism to account for token positions or order.
Positional encodings provide the position of each token in the sequence.
They are usually added to the token embeddings before the sequence is fed into the Transformer model.
InputEmbedding = WordEmbedding + PositionalEncoding
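For reference, a minimal NumPy sketch of the sinusoidal positional encoding from the original Transformer paper, $PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d_{model}})$ and $PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d_{model}})$; the function name and shapes are illustrative, and an even d_model is assumed.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000**(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000**(2i / d_model))
    # assumes d_model is even
    positions = np.arange(seq_len)[:, None]            # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]            # (1, d_model // 2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# InputEmbedding = WordEmbedding + PositionalEncoding
# word_embeddings: (seq_len, d_model) array from the embedding lookup
# input_embeddings = word_embeddings + sinusoidal_positional_encoding(seq_len, d_model)
```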
Multi-Head Attention
Each head projects the input into query (Q), key (K), and value (V) matrices and computes scaled dot-product attention: $\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$
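Below is a compact NumPy sketch of multi-head attention under these definitions; the weight matrices Wq, Wk, Wv, Wo, the even split of d_model across heads, and the omission of biases and dropout are simplifying assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    # scores: (..., seq_len_q, seq_len_k)
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)
    if mask is not None:
        scores = scores + mask            # mask holds 0 or -inf entries
    weights = softmax(scores)
    return weights @ V

def multi_head_attention(x, num_heads, Wq, Wk, Wv, Wo, mask=None):
    # x: (seq_len, d_model); Wq/Wk/Wv/Wo: (d_model, d_model)
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    def split(z):  # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return z.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    Q, K, V = split(x @ Wq), split(x @ Wk), split(x @ Wv)
    heads = scaled_dot_product_attention(Q, K, V, mask)   # (heads, seq, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo
```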
Position-Wise Feed-Forward Neural Network
Following the attention mechanism, the next component in the architecture of the Transformer model is the feed-forward neural network.
The Position-Wise FFN performs a nonlinear feature transformation at each position, complementing the capabilities of the attention mechanism.
x -> Multi-Head Attention -> Add & Norm -> Position-Wise FFN -> Add & Norm
Why is it called "Position-Wise"?
Because the representation at each position is fed independently through the same feed-forward network, i.e. $\mathrm{FFN}(x_i) = \max(0, x_i W_1 + b_1)W_2 + b_2$ is applied to every position $i$ with shared weights.
Hence "Position-Wise": there is no interaction between positions, only a position-by-position nonlinear transformation.
An analogy
You can think of it as image processing that applies the same small neural network to each pixel individually for color adjustment: positions do not interfere with one another, but they all share the same network.
If attention is so powerful, why do we still need the FFN?
The core of attention is context modeling; it cannot replace a nonlinear mapping.
The FFN performs complex feature transformations (e.g., combining features, suppressing irrelevant information) and provides much of the network's expressive power.
Stacking more attention layers adds many parameters and much computation without necessarily improving quality; the Attention + FFN structure is simpler and converges faster.
How does it differ from an ordinary FFN?
An ordinary FFN processes the whole input at once; the position-wise FFN applies the same FFN, with shared parameters, independently to every position in the sequence.
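A minimal NumPy sketch of the position-wise FFN, assuming a ReLU activation and an inner width d_ff (conventionally 4 × d_model in the original Transformer); names and shapes are illustrative.

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    # x: (seq_len, d_model)
    # W1: (d_model, d_ff), W2: (d_ff, d_model)
    # The same weights are applied to every position independently,
    # so there is no mixing of information across positions.
    hidden = np.maximum(0, x @ W1 + b1)   # ReLU
    return hidden @ W2 + b2
```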
Layer Normalization
This normalization procedure ensures that each layer’s activations have a zero mean and a unit variance (put simply, it keeps the activations normalized).
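For reference, the standard layer-normalization formula, with learnable gain $\gamma$ and bias $\beta$; the mean and variance are computed over the $d$ feature dimensions of each token:

$$\mathrm{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta, \qquad \mu = \frac{1}{d}\sum_{i=1}^{d} x_i, \quad \sigma^2 = \frac{1}{d}\sum_{i=1}^{d} (x_i - \mu)^2$$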
Masked Multi-Head Attention
the decoder aims to predict the next token (word or character) in the sequence by considering both the encoder’s output and the tokens already seen in the target sequence.
The first sub-layer of each decoder block adopts a particular strategy: it only has access to the tokens that come before the token it is currently trying to predict.
The masking is implemented using a mask matrix M. In this matrix, entries corresponding to future tokens in the sequence are set to $-\infty$, and those for the current and previous tokens are set to 0.
This masking is applied after calculating the dot product of the Query ($Q$) and transposed Key ($K^T$) matrices but before applying the softmax function. As a result, the softmax output for future tokens becomes zero, effectively masking them from consideration. This ensures that the decoder cannot peek into future tokens in the sequence, thereby preserving the sequential integrity required for tasks such as language translation.
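A small NumPy sketch of how such a causal mask can be built and added to the attention scores before the softmax, reusing the attention sketch above (names are illustrative):

```python
import numpy as np

def causal_mask(seq_len):
    # 0 on and below the diagonal, -inf strictly above it (future positions)
    return np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

# scores = Q @ K.T / np.sqrt(d_k)
# scores = scores + causal_mask(seq_len)   # future positions -> -inf
# weights = softmax(scores)                # -> zero probability on future tokens
```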
Encoder-Decoder Attention
The encoder-decoder attention mechanism serves as the bridge that connects the encoder and the decoder, facilitating the transfer of contextual information from the source sequence to the target sequence.
the encoder-decoder attention layer works similarly to standard multi-head attention but with a critical difference: the Queries (Q) come from the current state of the decoder, while the Keys (K) and Values (V) are sourced from the output of the encoder.
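As a sketch of this wiring, reusing scaled_dot_product_attention from the multi-head attention example above (a single head is shown for brevity; all names and shapes are illustrative):

```python
# Cross-attention: queries come from the decoder, keys/values from the encoder.
# decoder_state: (tgt_len, d_model), encoder_output: (src_len, d_model)
# Wq, Wk, Wv are projection matrices as in the multi-head attention sketch.

def encoder_decoder_attention(decoder_state, encoder_output, Wq, Wk, Wv):
    Q = decoder_state @ Wq        # what the decoder is looking for
    K = encoder_output @ Wk       # what each source position offers
    V = encoder_output @ Wv       # the content to mix into the output
    return scaled_dot_product_attention(Q, K, V)   # no causal mask here
```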
Transformer Variants
Normalization Methods
Normalization Position
Activation Functions
Positional Embeddings
Attention Mechanism
Structural Modifications