Queries, Keys, and Values

1. What are Queries, Keys, and Values?

Queries, Keys, and Values (Q, K, V) are central concepts in the attention mechanism, particularly in self-attention and the Transformer model. They help a model determine which parts of the input sequence to focus on when generating an output. In tasks like machine translation, text summarization, or image captioning, attention mechanisms use these components to dynamically assign importance to different parts of the input sequence.

  • Query (Q): Represents the item for which the attention weights are calculated. It's the element in the sequence that is currently being processed.

  • Key (K): Represents all the potential items in the sequence that the query could attend to. Keys are used to compare with the query to determine relevance.

  • Value (V): Represents the actual content that will be output if the corresponding key is attended to. The value holds the information from the input that is passed to the next layer based on the attention scores.

In a nutshell, queries are matched against keys to calculate attention scores, which are then applied to the corresponding values to generate a weighted sum that forms the output.
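
How these vectors are produced is not covered above; in practice Q, K, and V are typically learned linear projections of token embeddings. The toy sketch below (NumPy, made-up shapes, random placeholder vectors) only illustrates the roles and shapes of the three matrices in a translation-style setting:

```python
import numpy as np

# Toy machine-translation-style setup (all numbers are made up for illustration):
# the decoder is generating 2 output words and can attend over 4 input words.
d = 8                          # dimensionality of every vector
rng = np.random.default_rng(0)

Q = rng.normal(size=(2, d))    # queries: one per element currently being processed
K = rng.normal(size=(4, d))    # keys:    one per input element the query can attend to
V = rng.normal(size=(4, d))    # values:  the content returned when the matching key is attended to

# Q is compared against K to decide relevance; the resulting weights combine the rows of V.
print(Q.shape, K.shape, V.shape)   # (2, 8) (4, 8) (4, 8)
```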

2. How Do Queries, Keys, and Values Work?

In the attention mechanism, queries, keys, and values are represented as vectors, and their relationship is calculated using the following steps:

  1. Calculate Similarity (Relevance) Between Queries and Keys: The first step is to measure how much attention the query should give to each key. This is done by computing the dot product of the query vector with each key vector. The result is a score that indicates the relevance or similarity between the query and each key: $\text{score}(Q, K) = Q \cdot K$

  2. Apply Softmax to Get Attention Weights: After calculating the scores, a softmax function is applied to transform these scores into probabilities, called attention weights. These weights determine how much attention the query should give to each key (and hence its corresponding value): $\alpha_{ij} = \frac{\exp(\text{score}(Q_i, K_j))}{\sum_j \exp(\text{score}(Q_i, K_j))}$

  3. Weighted Sum of Values: The attention weights are then applied to the corresponding values. The output for query $i$ is computed as a weighted sum of the values, where the weights are the attention scores: $\text{output}_i = \sum_j \alpha_{ij} V_j$

This process allows the model to dynamically focus on different parts of the input sequence and capture the most relevant information from the input.
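
The three steps can be written in a few lines of code. The sketch below is a minimal NumPy implementation of the plain dot-product attention described above; the function and variable names are illustrative, and Transformer implementations usually also scale the scores by $\sqrt{d_k}$ before the softmax:

```python
import numpy as np

def dot_product_attention(Q, K, V):
    """Steps 1-3: scores -> softmax attention weights -> weighted sum of values."""
    # Step 1: dot product of every query with every key gives the relevance scores.
    scores = Q @ K.T                                        # (num_queries, num_keys)

    # Step 2: softmax over the keys turns the scores into attention weights.
    scores = scores - scores.max(axis=-1, keepdims=True)    # for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)

    # Step 3: each output is a weighted sum of the value vectors.
    output = weights @ V                                    # (num_queries, d_v)
    return output, weights

# Toy example: 2 queries attending over 4 key/value pairs of dimension 8.
rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))

out, attn = dot_product_attention(Q, K, V)
print(out.shape)          # (2, 8)
print(attn.sum(axis=-1))  # each row of attention weights sums to 1
```

Each row of `attn` shows how much the corresponding query attends to each key, and each row of `out` is the matching weighted sum of the values.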

3. Applications of Queries, Keys, and Values

  • Machine Translation: In machine translation, queries, keys, and values allow the model to focus on the most relevant words in the input sentence when generating the translation.

  • Text Summarization: Seq2Seq models with attention use queries, keys, and values to focus on the important parts of the input text while generating a summary.

  • Speech Recognition: In speech recognition, the attention mechanism helps the model focus on the most relevant parts of the audio signal when transcribing speech into text.

  • Image Captioning: For image captioning tasks, the attention mechanism allows the model to focus on different regions of an image while generating descriptive captions.

4. Self-Attention and Transformer Models

Self-attention is a specific type of attention mechanism used in Transformer models where queries, keys, and values come from the same input sequence. Each element in the sequence can attend to other elements within the same sequence, allowing the model to capture relationships between different parts of the sequence.

In the Transformer architecture, self-attention is applied multiple times in parallel (multi-head self-attention) to capture different types of dependencies within the sequence. This approach allows for more efficient parallelization and has become a fundamental building block for models like BERT and GPT.
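
In self-attention the only change is that queries, keys, and values are all projected from the same sequence. The sketch below is a minimal, illustrative NumPy version (random placeholder projection matrices); real multi-head attention runs several such computations in parallel with different learned projections and concatenates the results:

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Q, K, and V all come from the same input sequence X."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v           # same input, three projections
    scores = Q @ K.T                              # every token scored against every token
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return weights @ V, weights

# A sequence of 5 tokens with embedding size 8; projection matrices are random placeholders.
rng = np.random.default_rng(1)
X = rng.normal(size=(5, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))

out, attn = self_attention(X, W_q, W_k, W_v)
print(out.shape)    # (5, 8): one output vector per token
print(attn.shape)   # (5, 5): how much each token attends to every other token in the sequence
```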

In summary, queries, keys, and values are the core components of the attention mechanism: they let the model decide which parts of the input sequence to focus on while processing it. The query represents the element currently being processed, the keys are the elements the query is compared against, and the values are the content passed to the next layer, weighted by the attention scores. Queries, keys, and values are widely used in tasks such as machine translation, text summarization, speech recognition, and image captioning. Self-attention, used in Transformer models, allows every element of a sequence to attend to every other element of the same sequence.
