Attention Mechanism

Attention Mechanism (注意力机制)

The attention mechanism helps address the information bottleneck of the RNN-based encoder-decoder setup, where the entire input sequence must be compressed into a single fixed-length vector. As illustrated in Fig. 2.2, an attention mechanism acts like a memory bank: when queried, it produces an output based on the stored keys and values (Bahdanau et al., 2014).

注意力机制有助于解决 RNN 结构中编码器-解码器的相关问题。如图 2.2 所示,注意力机制类似于一个记忆库。当查询时,它根据存储的键(keys)和值(values)生成输出(Bahdanau et al., 2014)。

Attention Formulation (注意力机制公式)

Let us consider a memory unit consisting of $n$ key-value pairs $(\mathbf{k}_1, \mathbf{v}_1), \dots, (\mathbf{k}_n, \mathbf{v}_n)$ with $\mathbf{k}_i \in \mathbb{R}^{d_k}$ and $\mathbf{v}_i \in \mathbb{R}^{d_v}$. The attention layer receives a query $\mathbf{q} \in \mathbb{R}^{d_q}$ as input and returns an output $\mathbf{o} \in \mathbb{R}^{d_v}$ with the same shape as a value $\mathbf{v}$.

我们考虑一个存储单元,由 $n$ 组键值对 $(\mathbf{k}_1, \mathbf{v}_1), \dots, (\mathbf{k}_n, \mathbf{v}_n)$ 组成,其中 $\mathbf{k}_i \in \mathbb{R}^{d_k}$,$\mathbf{v}_i \in \mathbb{R}^{d_v}$。注意力层接受一个查询向量 $\mathbf{q} \in \mathbb{R}^{d_q}$,并返回一个输出 $\mathbf{o} \in \mathbb{R}^{d_v}$,其形状与值向量 $\mathbf{v}$ 相同。

The attention layer measures the similarity between the query and each key using a score function $\alpha$, which returns scores $a_1, \dots, a_n$ for the keys $\mathbf{k}_1, \dots, \mathbf{k}_n$ given by:

注意力层使用一个评分函数 $\alpha$ 来计算查询向量与键向量之间的相似性,返回键 $\mathbf{k}_1, \dots, \mathbf{k}_n$ 的评分 $a_1, \dots, a_n$:

$$a_i = \alpha(\mathbf{q}, \mathbf{k}_i)$$

Attention weights are computed by applying a softmax function to the scores:

注意力权重通过对评分进行 Softmax 计算得到:

$$\mathbf{b} = \text{softmax}(\mathbf{a})$$

Each element of b\mathbf{b}b is computed as follows:

向量 b\mathbf{b}b 的每个元素计算如下:

$$b_i = \frac{\exp(a_i)}{\sum_j \exp(a_j)}$$

The output is the weighted sum of the attention weights and the values:

最终输出是注意力权重与值的加权求和:

$$\mathbf{o} = \sum_{i=1}^{n} b_i \mathbf{v}_i$$
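
As a concrete illustration of this formulation, here is a minimal NumPy sketch of single-query attention over $n$ key-value pairs. It is not code from the original page: the array shapes, names, and the use of a plain dot product as the score function $\alpha$ are illustrative assumptions.

```python
import numpy as np

def attention(q, K, V, score_fn):
    """Single-query attention over n key-value pairs.

    q: query vector of shape (d_k,)
    K: key matrix of shape (n, d_k)
    V: value matrix of shape (n, d_v)
    score_fn: score function alpha(q, k_i) -> scalar
    """
    # a_i = alpha(q, k_i) for every key
    a = np.array([score_fn(q, k) for k in K])
    # b = softmax(a), numerically stabilized by subtracting the max score
    b = np.exp(a - a.max())
    b /= b.sum()
    # o = sum_i b_i * v_i, a weighted sum of the values
    return b @ V

# Example usage with a plain dot-product score (illustrative shapes)
rng = np.random.default_rng(0)
q = rng.normal(size=4)          # d_k = 4
K = rng.normal(size=(5, 4))     # n = 5 keys
V = rng.normal(size=(5, 3))     # n = 5 values, d_v = 3
o = attention(q, K, V, score_fn=np.dot)
print(o.shape)                  # (3,) -- same shape as a value vector
```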

Scaled Dot-Product Attention (缩放点积注意力)

The score function $\alpha(\mathbf{q}, \mathbf{k})$ exists in various forms, leading to multiple types of attention mechanisms. The dot-product-based scoring function is the simplest, requiring no tunable parameters. A variation, the scaled dot product, normalizes this by $\sqrt{d_k}$ to mitigate the impact of increasing dimensions (Luong et al., 2015; Vaswani et al., 2017).

评分函数 $\alpha(\mathbf{q}, \mathbf{k})$ 具有多种形式,导致了不同类型的注意力机制。基于点积的评分函数是最简单的,并不需要可调参数。一个变体是缩放点积注意力,它通过除以 $\sqrt{d_k}$ 进行归一化,以减小维度增加带来的影响(Luong et al., 2015; Vaswani et al., 2017)。

$$\alpha(\mathbf{q}, \mathbf{k}) = \frac{\mathbf{q} \cdot \mathbf{k}}{\sqrt{d_k}}$$
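
A minimal sketch of this score function in NumPy, assuming the `attention` helper sketched earlier; the function name is an illustrative choice, not from the original text.

```python
import numpy as np

def scaled_dot_score(q, k):
    """Scaled dot-product score: alpha(q, k) = (q . k) / sqrt(d_k)."""
    d_k = q.shape[-1]
    return np.dot(q, k) / np.sqrt(d_k)

# Plugging it into the earlier helper yields scaled dot-product attention:
# o = attention(q, K, V, score_fn=scaled_dot_score)
```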

Self-Attention (自注意力)

In self-attention, each input vector $\mathbf{x}_i$ is projected onto three distinct vectors: query $\mathbf{q}_i$, key $\mathbf{k}_i$, and value $\mathbf{v}_i$. These projections are performed via learnable weight matrices $\mathbf{W}_Q, \mathbf{W}_K, \mathbf{W}_V$, resulting in:

在自注意力机制中,每个输入向量 $\mathbf{x}_i$ 被投影到三个不同的向量:查询向量 $\mathbf{q}_i$、键向量 $\mathbf{k}_i$ 和值向量 $\mathbf{v}_i$。这些投影由可学习的权重矩阵 $\mathbf{W}_Q, \mathbf{W}_K, \mathbf{W}_V$ 进行变换:

$$\mathbf{q}_i = \mathbf{x}_i \mathbf{W}_Q, \quad \mathbf{k}_i = \mathbf{x}_i \mathbf{W}_K, \quad \mathbf{v}_i = \mathbf{x}_i \mathbf{W}_V$$

These weight matrices are initialized randomly and optimized during training.

这些权重矩阵在训练过程中随机初始化并进行优化。

Stacking the queries, keys, and values into matrices $\mathbf{Q}$, $\mathbf{K}$, and $\mathbf{V}$, the whole attention computation simplifies to a single matrix expression:

整个注意力机制的矩阵形式表达如下:

$$\text{attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\left( \frac{\mathbf{Q} \mathbf{K}^T}{\sqrt{d_k}} \right) \mathbf{V}$$
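
The matrix form maps directly to code. Below is a minimal NumPy sketch of single-head self-attention: the input $\mathbf{X}$ is projected with randomly initialized $\mathbf{W}_Q, \mathbf{W}_K, \mathbf{W}_V$ and the $\text{softmax}(\mathbf{Q}\mathbf{K}^T/\sqrt{d_k})\mathbf{V}$ formula is applied row-wise. The shapes and initialization are illustrative assumptions, not a definitive implementation.

```python
import numpy as np

def self_attention(X, W_Q, W_K, W_V):
    """Single-head self-attention: softmax(Q K^T / sqrt(d_k)) V."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V          # (n, d_k), (n, d_k), (n, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # (n, n) pairwise scaled dot products
    # Row-wise softmax turns each row of scores into attention weights
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                           # (n, d_v)

# Illustrative shapes: 6 tokens, input dim 8, d_k = d_v = 4
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))
W_Q, W_K, W_V = (rng.normal(size=(8, 4)) for _ in range(3))
out = self_attention(X, W_Q, W_K, W_V)
print(out.shape)  # (6, 4) -- one output vector per input token
```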
