
Transformers

Encoder

  • The encoder is responsible for processing the input sequence and compressing the information into a context or memory for the decoder.

  • Each encoder layer comprises three main elements (a minimal code sketch follows the list):

    • Multi-Head Attention: lets every token attend to every other token in the input, capturing dependencies across the sequence

    • Feed-Forward Neural Network: applying a position-wise nonlinear transformation

    • Add & Norm:

      • stabilizing the activations by combining residual connections and layer normalization.

      • mitigating the vanishing gradient problem in both the encoder and the decoder
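
A minimal PyTorch-style sketch of one encoder layer, assuming illustrative sizes (d_model=512, n_heads=8, d_ff=2048) and a post-norm layout; it mirrors the Multi-Head Attention → Add & Norm → FFN → Add & Norm structure described above rather than any particular library implementation.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One Transformer encoder layer: Multi-Head Attention + FFN, each wrapped in Add & Norm."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x):                            # x: (batch, seq_len, d_model)
        attn_out, _ = self.attn(x, x, x)             # self-attention: Q = K = V = x
        x = self.norm1(x + self.drop(attn_out))      # Add & Norm (residual + layer normalization)
        x = self.norm2(x + self.drop(self.ffn(x)))   # Add & Norm around the position-wise FFN
        return x

x = torch.randn(2, 10, 512)                          # (batch=2, seq_len=10, d_model=512)
print(EncoderLayer()(x).shape)                       # torch.Size([2, 10, 512])
```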

Decoder

  • The decoder takes the context from the encoder and generates the output sequence

  • It is also composed of multiple layers and has many commonalities with the encoder, but with minor changes (sketched in code after the list):

    • Masked Multi-Head Attention: similar to multi-head attention, but with a masking mechanism to ensure that the prediction for a given word doesn't depend on future words in the sequence

    • Encoder-Decoder Attention: this layer allows the decoder to focus on relevant parts of the input sequence, leveraging the context provided by the encoder

    • Feed-Forward Neural Network: refines the attention vectors in preparation for generating the output sequence
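
For comparison, a sketch of one decoder layer with the three sub-layers above (masked self-attention, encoder-decoder attention, FFN); the sizes and the boolean causal mask are illustrative assumptions, and the mask itself is explained in the Masked Multi-Head Attention section below.

```python
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    """Masked self-attention + encoder-decoder attention + FFN, each with Add & Norm."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(3)])

    def forward(self, tgt, memory, causal_mask):
        # 1) masked multi-head self-attention over the tokens generated so far
        x, _ = self.self_attn(tgt, tgt, tgt, attn_mask=causal_mask)
        tgt = self.norms[0](tgt + x)
        # 2) encoder-decoder attention: queries from the decoder, keys/values from the encoder
        x, _ = self.cross_attn(tgt, memory, memory)
        tgt = self.norms[1](tgt + x)
        # 3) position-wise feed-forward network
        return self.norms[2](tgt + self.ffn(tgt))

tgt = torch.randn(2, 7, 512)        # decoder input (shifted target sequence)
memory = torch.randn(2, 10, 512)    # encoder output
mask = torch.triu(torch.ones(7, 7, dtype=torch.bool), diagonal=1)   # True = position is blocked
print(DecoderLayer()(tgt, memory, mask).shape)       # torch.Size([2, 7, 512])
```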

Tokenization and Representation

tokenization:

  • converts sentences into a machine-readable format

  • each word in the sentence is treated as a distinct token in word-level tokenization

  • each token is then mapped to a vector representation, such as a word embedding

  • subword-level approaches such as byte-pair encoding (BPE) or WordPiece often address the limitations of word-level tokenization.

    • e.g., "unhappiness" is split into "un" and "happiness"
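
A toy greedy longest-match segmenter to make the subword idea concrete; the vocabulary below is hypothetical and far smaller than a real BPE/WordPiece vocabulary.

```python
def subword_tokenize(word, vocab):
    """Greedy longest-match segmentation, similar in spirit to WordPiece."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start and word[start:end] not in vocab:
            end -= 1                      # shrink until the longest known piece is found
        if end == start:                  # no piece matches: fall back to an unknown token
            return ["[UNK]"]
        tokens.append(word[start:end])
        start = end
    return tokens

# Hypothetical vocabulary for illustration only.
vocab = {"un", "happi", "happiness", "ness"}
print(subword_tokenize("unhappiness", vocab))   # ['un', 'happiness']
```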

Positional Encodings

The Transformer model itself is order-agnostic: unlike an RNN or CNN, it does not inherently process the order information of a sequence.

  • Since the Transformer model processes all tokens in the input sequence in parallel, it does not have a built-in mechanism to account for the token positions or order.

  • provide the relative position of the tokens in the sequence

  • usually added to the token embeddings before they are fed into the Transformer model

InputEmbedding = WordEmbedding + PositionalEncoding

$\text{PE}_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i / d_{\text{model}}}}\right)$

$\text{PE}_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i / d_{\text{model}}}}\right)$

  • $pos$ is the position of the token in the sequence,

  • $i$ is the dimension index,

  • $d_{\text{model}}$ is the dimension of the model.
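
A NumPy sketch of the sinusoidal encoding above; max_len=50 and d_model=512 are arbitrary example values.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))."""
    pos = np.arange(max_len)[:, None]                   # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]                # (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)     # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                        # even dimensions
    pe[:, 1::2] = np.cos(angles)                        # odd dimensions
    return pe

word_emb = np.random.randn(50, 512)                     # hypothetical word embeddings
input_emb = word_emb + positional_encoding(50, 512)     # InputEmbedding = WordEmbedding + PositionalEncoding
print(input_emb.shape)                                  # (50, 512)
```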

Multi-Head Attention

  • Employs $h$ parallel self-attention heads to enhance the model's representational capacity

    • In the original Transformer model, $h = 8$ heads were used, allowing the model to capture various aspects and dependencies within the input data, such as grammar and tense in machine translation tasks

  • each head works on its own learned query, key, and value projections (see the sketch below)
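
A from-scratch sketch of multi-head scaled dot-product attention, assuming the learned linear projections of Q, K, and V have already been applied; all shapes are illustrative.

```python
import torch

def multi_head_attention(Q, K, V, h=8):
    """Split d_model into h heads, apply scaled dot-product attention per head, then concatenate."""
    batch, seq_len, d_model = Q.shape
    d_k = d_model // h
    # reshape to (batch, h, seq_len, d_k) so each head attends independently
    q = Q.view(batch, seq_len, h, d_k).transpose(1, 2)
    k = K.view(batch, seq_len, h, d_k).transpose(1, 2)
    v = V.view(batch, seq_len, h, d_k).transpose(1, 2)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5        # (batch, h, seq_len, seq_len)
    weights = torch.softmax(scores, dim=-1)
    out = weights @ v                                    # (batch, h, seq_len, d_k)
    return out.transpose(1, 2).reshape(batch, seq_len, d_model)   # concatenate the heads

Q = K = V = torch.randn(2, 10, 512)   # in practice Q, K, V come from learned linear projections of x
print(multi_head_attention(Q, K, V).shape)   # torch.Size([2, 10, 512])
```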

Position-Wise Feed-Forward Neural Network

  • Following the attention mechanism, the next component in the architecture of the Transformer model is the feed-forward neural network.

  • The position-wise FFN applies a nonlinear feature transformation at each position, complementing the capability of the attention mechanism

  • x -> Multi-Head Attention -> Add & Norm -> Position-Wise FFN -> Add & Norm

Why is it called "Position-Wise"?

Because the representation at each position is fed independently into the same feed-forward network, that is:

  • Given a sequence (say of length $n$), each position is a vector $x_i \in \mathbb{R}^d$

  • Each $x_i$ is transformed by the same feed-forward network (shared parameters).

  • Hence "Position-Wise": there is no interaction between positions, only a position-by-position nonlinear transformation.

An analogy

Think of it as applying a small neural network to each pixel individually for color adjustment in image processing: positions do not interfere with each other, but they all use the same network.

If attention is so powerful, why is the FFN still needed?

  • The core of attention is context modeling; it cannot replace a nonlinear mapping.

  • The FFN performs complex feature transformations (e.g., combining features, suppressing irrelevant information) and is a key source of the network's expressive power.

  • Stacking many attention layers adds a large number of parameters and heavy computation without necessarily improving performance; the attention + FFN structure is simpler and converges faster.

How does it differ from an ordinary FFN?

  • An ordinary FFN processes the whole input at once; a position-wise FFN applies the same FFN (shared parameters) to each position in the sequence independently (a short sketch follows below).
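
A short sketch making the "position-wise" point concrete: applying the same two-layer FFN to each position vector separately gives the same result as applying it to the whole (batch, seq_len, d_model) tensor, because nn.Linear acts only on the last dimension. Sizes are example assumptions.

```python
import torch
import torch.nn as nn

d_model, d_ff = 512, 2048
ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

x = torch.randn(2, 10, d_model)                 # (batch, seq_len, d_model)
out_full = ffn(x)                               # nn.Linear acts on the last dim, i.e. per position
out_per_pos = torch.stack([ffn(x[:, t, :]) for t in range(x.shape[1])], dim=1)
print(torch.allclose(out_full, out_per_pos, atol=1e-6))   # True: same network applied at every position
```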

Layer Normalization

  • In a manner akin to ResNets, the Transformer model employs a residual connection in which the input $X$ is added to the output $Z$

  • This normalization procedure ensures that each layer's activations have zero mean and unit variance (in plain terms, it keeps the activations normalized)

  • For each hidden unit $h_i$, layer normalization is formulated as $h_i = \frac{g}{\sigma}(h_i - \mu)$, where $g$ is the gain variable (often set to 1), $\mu$ is the mean computed as $\mu = \frac{1}{H} \sum_{i=1}^{H} h_i$, and $\sigma$ is the standard deviation computed as $\sigma = \sqrt{\frac{1}{H} \sum_{i=1}^{H} (h_i - \mu)^2}$ (implemented in the sketch after this list)

  • The layer normalization technique minimizes covariate shift, i.e., the gradient dependencies between layers, thus accelerating convergence by reducing the required number of iterations
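
A direct translation of the formula above into code, with the gain $g$ set to 1 and a small eps added for numerical stability (an assumption the formula omits).

```python
import torch

def layer_norm(h, g=1.0, eps=1e-6):
    """h_i <- (g / sigma) * (h_i - mu), with mu and sigma computed over the H hidden units of each token."""
    mu = h.mean(dim=-1, keepdim=True)                           # mean over the hidden units
    sigma = h.var(dim=-1, keepdim=True, unbiased=False).sqrt()  # standard deviation over the hidden units
    return g / (sigma + eps) * (h - mu)

h = torch.randn(2, 10, 512)
out = layer_norm(h)
print(out.mean(-1)[0, 0].item(), out.std(-1)[0, 0].item())      # ~0.0 and ~1.0 per position
```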

Masked Multi-Head Attention

  • the decoder aims to predict the next token (word or character) in the sequence by considering both the encoder’s output and the tokens already seen in the target sequence.

  • The first layer of the decoder adopts a particular strategy: it only has access to the tokens that come before the token it is currently trying to predict.

  • The masking is implemented using a mask matrix $\mathbf{M}$. In this matrix, entries corresponding to future tokens in the sequence are set to $-\infty$, and those for previous tokens are set to 0.

  • This masking is applied after calculating the dot product of the Query ($\mathbf{Q}$) and transposed Key ($\mathbf{K}^\top$) matrices but before applying the softmax function. As a result, the softmax output for future tokens becomes zero, effectively masking them from consideration. This ensures that the decoder cannot peek into future tokens in the sequence, thereby preserving the sequential integrity required for tasks such as language translation.

$\textit{maskedAttention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^\top + \mathbf{M}}{\sqrt{d_k}}\right)\mathbf{V}$
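
A sketch of the masked attention formula above: the mask $\mathbf{M}$ places $-\infty$ on future positions so that their softmax weights become zero. Shapes are illustrative.

```python
import torch

def masked_attention(Q, K, V):
    """softmax((Q K^T + M) / sqrt(d_k)) V with a causal mask M."""
    seq_len, d_k = Q.shape[-2], Q.shape[-1]
    # M: 0 for current/previous tokens, -inf for future tokens (strictly above the diagonal)
    M = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
    scores = (Q @ K.transpose(-2, -1) + M) / d_k ** 0.5
    weights = torch.softmax(scores, dim=-1)     # masked (future) positions receive weight 0
    return weights @ V

Q = K = V = torch.randn(1, 5, 64)
print(masked_attention(Q, K, V).shape)          # torch.Size([1, 5, 64])
```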

Encoder-Decoder Attention

  • The encoder-decoder attention mechanism serves as the bridge that connects the encoder and the decoder, facilitating the transfer of contextual information from the source sequence to the target sequence.

  • the encoder-decoder attention layer works similarly to standard multi-head attention but with a critical difference: the Queries (Q) come from the current state of the decoder, while the Keys (K) and Values (V) are sourced from the output of the encoder (see the sketch below)
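
A minimal sketch of encoder-decoder (cross) attention using PyTorch's nn.MultiheadAttention; the only change from self-attention is where the queries, keys, and values come from. Shapes are illustrative.

```python
import torch
import torch.nn as nn

d_model, n_heads = 512, 8
cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

decoder_state = torch.randn(2, 7, d_model)     # Queries: current decoder states
encoder_output = torch.randn(2, 10, d_model)   # Keys and Values: encoder output (the "memory")
out, weights = cross_attn(query=decoder_state, key=encoder_output, value=encoder_output)
print(out.shape, weights.shape)                # torch.Size([2, 7, 512]) torch.Size([2, 7, 10])
```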

Transformer Variants

Normalization Methods

Normalization Position

Activation Functions

Positional Embeddings

Attention Mechanism

Structural Modifications



More details in

https://app.gitbook.com/o/8FWDZYZclCPDejXXnS5n/s/KLjYi16C9OsVCRdrTcrH/~/changes/235/deep-learning/attention-mechanisms-and-transformers