Attention Mechanism
The attention mechanism helps address the limitations of the RNN-based encoder-decoder architecture. As illustrated in Fig. 2.2, an attention mechanism can be viewed as a memory bank: when queried, it produces an output based on the stored keys and values (Bahdanau et al., 2014).
Attention Formulation
Let us consider a memory unit consisting of $n$ key-value pairs $(\mathbf{k}_1, \mathbf{v}_1), \ldots, (\mathbf{k}_n, \mathbf{v}_n)$ with $\mathbf{k}_i \in \mathbb{R}^{d_k}$ and $\mathbf{v}_i \in \mathbb{R}^{d_v}$. The attention layer receives a query $\mathbf{q} \in \mathbb{R}^{d_q}$ and returns an output $\mathbf{o} \in \mathbb{R}^{d_v}$ with the same shape as a value $\mathbf{v}$.
The attention layer measures the similarity between the query and each key using a score function $\alpha$, which returns scores $a_1, \ldots, a_n$ for keys $\mathbf{k}_1, \ldots, \mathbf{k}_n$ given by:
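$$a_i = \alpha(\mathbf{q}, \mathbf{k}_i), \qquad i = 1, \ldots, n.$$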
Attention weights $\mathbf{b}$ are computed by applying a softmax function to the scores:
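$$\mathbf{b} = \mathrm{softmax}(\mathbf{a}).$$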
Each element of $\mathbf{b}$ is computed as follows:
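$$b_i = \frac{\exp(a_i)}{\sum_{j=1}^{n} \exp(a_j)}, \qquad i = 1, \ldots, n.$$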
The output $\mathbf{o}$ is the sum of the values weighted by the attention weights:
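$$\mathbf{o} = \sum_{i=1}^{n} b_i \mathbf{v}_i.$$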
Scaled Dot-Product Attention
The score function $\alpha(\mathbf{q}, \mathbf{k})$ exists in various forms, leading to different types of attention mechanisms. The dot-product score is the simplest, requiring no tunable parameters. A variation, the scaled dot product, divides the score by $\sqrt{d_k}$ to mitigate the effect of growing dimensionality (Luong et al., 2015; Vaswani et al., 2017):
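$$\alpha(\mathbf{q}, \mathbf{k}) = \mathbf{q}^\top \mathbf{k}, \qquad \alpha_{\text{scaled}}(\mathbf{q}, \mathbf{k}) = \frac{\mathbf{q}^\top \mathbf{k}}{\sqrt{d_k}}.$$

As a minimal sketch of the formulation above, the following NumPy snippet computes scaled dot-product attention for a single query over $n$ key-value pairs; the function name and shapes are illustrative choices, not taken from the original text.

```python
import numpy as np

def scaled_dot_product_attention(q, K, V):
    """Single-query scaled dot-product attention.

    q: query vector of shape (d_k,)
    K: key matrix of shape (n, d_k)
    V: value matrix of shape (n, d_v)
    Returns the output vector of shape (d_v,).
    """
    d_k = K.shape[-1]
    scores = K @ q / np.sqrt(d_k)             # a_i = q . k_i / sqrt(d_k)
    weights = np.exp(scores - scores.max())   # numerically stable softmax
    weights = weights / weights.sum()         # b = softmax(a)
    return weights @ V                        # o = sum_i b_i * v_i

# Example: n = 4 key-value pairs, d_k = 3, d_v = 2
rng = np.random.default_rng(0)
q = rng.normal(size=3)
K = rng.normal(size=(4, 3))
V = rng.normal(size=(4, 2))
print(scaled_dot_product_attention(q, K, V))  # output has shape (d_v,) = (2,)
```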
Self-Attention
In self-attention, each input vector $\mathbf{x}_i$ is projected into three distinct vectors: a query $\mathbf{q}_i$, a key $\mathbf{k}_i$, and a value $\mathbf{v}_i$. These projections are performed via learnable weight matrices $W_Q, W_K, W_V$, resulting in:
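$$\mathbf{q}_i = W_Q \mathbf{x}_i, \qquad \mathbf{k}_i = W_K \mathbf{x}_i, \qquad \mathbf{v}_i = W_V \mathbf{x}_i.$$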
These weight matrices are initialized randomly and optimized during training.
Stacking the queries, keys, and values into matrices $Q$, $K$, and $V$ allows the whole computation to be expressed in a single matrix form:
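$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V.$$

A minimal NumPy sketch of this matrix form follows, assuming a single input sequence X of shape (n, d_model); the function name, variable names, and dimensions are illustrative assumptions, not taken from the original text.

```python
import numpy as np

def self_attention(X, W_Q, W_K, W_V):
    """Single-head self-attention in matrix form.

    X:        input sequence, shape (n, d_model)
    W_Q, W_K: projection matrices, shape (d_model, d_k)
    W_V:      projection matrix, shape (d_model, d_v)
    Returns the attended outputs, shape (n, d_v).
    """
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V            # rows are q_i, k_i, v_i
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                # (n, n) scaled dot products
    scores -= scores.max(axis=-1, keepdims=True)   # stabilize the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                             # softmax(QK^T / sqrt(d_k)) V

# Example: n = 5 tokens, d_model = 8, d_k = d_v = 4
rng = np.random.default_rng(1)
X = rng.normal(size=(5, 8))
W_Q, W_K, W_V = (rng.normal(size=(8, 4)) for _ in range(3))
print(self_attention(X, W_Q, W_K, W_V).shape)  # (5, 4)
```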