Attention Mechanism

The attention mechanism helps address a key limitation of the RNN-based encoder-decoder setup: the entire input sequence must be squeezed into a single fixed-length context vector. As illustrated in Fig. 2.2, an attention mechanism acts like a memory bank. When queried, it produces an output based on stored keys and values (Bahdanau et al., 2014).

Attention Formulation

Let us consider a memory unit consisting of $n$ key-value pairs $(\mathbf{k}_1, \mathbf{v}_1), \dots, (\mathbf{k}_n, \mathbf{v}_n)$ with $\mathbf{k}_i \in \mathbb{R}^{d_k}$ and $\mathbf{v}_i \in \mathbb{R}^{d_v}$. The attention layer receives an input query $\mathbf{q} \in \mathbb{R}^{d_q}$ and returns an output $\mathbf{o} \in \mathbb{R}^{d_v}$ with the same shape as a value $\mathbf{v}$.

The attention layer measures the similarity between the query and the keys using a score function $\alpha$, which returns scores $a_1, \dots, a_n$ for keys $\mathbf{k}_1, \dots, \mathbf{k}_n$ given by:

$$a_i = \alpha(\mathbf{q}, \mathbf{k}_i)$$

Attention weights are computed as a softmax function on the scores:

$$\mathbf{b} = \text{softmax}(\mathbf{a})$$

Each element of $\mathbf{b}$ is computed as follows:

$$b_i = \frac{\exp(a_i)}{\sum_j \exp(a_j)}$$
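
As a quick numerical illustration with made-up scores $\mathbf{a} = (1.0,\ 2.0,\ 0.5)$:

$$\mathbf{b} \approx \frac{(2.72,\ 7.39,\ 1.65)}{2.72 + 7.39 + 1.65} \approx (0.23,\ 0.63,\ 0.14)$$

so the key with the highest score receives the largest weight, and the weights sum to one.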

The output is the sum of the values weighted by the attention weights:

$$\mathbf{o} = \sum_{i=1}^{n} b_i \mathbf{v}_i$$
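
As a concrete illustration, here is a minimal NumPy sketch of this score–softmax–weighted-sum pipeline. The function name `attend` and the abstract `score` callable are illustrative choices, not part of the original formulation:

```python
import numpy as np

def attend(q, keys, values, score):
    """Generic attention: score each key against q, softmax, then weight the values.

    q:      query vector, shape (d_q,)
    keys:   n key vectors, shape (n, d_k)
    values: n value vectors, shape (n, d_v)
    score:  callable mapping (q, k_i) to a scalar score a_i
    """
    a = np.array([score(q, k) for k in keys])   # scores a_1, ..., a_n
    a = a - a.max()                             # shift for numerical stability
    b = np.exp(a) / np.exp(a).sum()             # attention weights (softmax)
    return b @ values                           # o = sum_i b_i * v_i
```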

Scaled Dot-Product Attention

The score function $\alpha(\mathbf{q}, \mathbf{k})$ exists in various forms, leading to multiple types of attention mechanisms. The dot-product-based scoring function is the simplest, requiring no tunable parameters. A variation, the scaled dot product, divides by $\sqrt{d_k}$ so that the magnitude of the scores does not grow as the key dimension increases (Luong et al., 2015; Vaswani et al., 2017).

$$\alpha(\mathbf{q}, \mathbf{k}) = \frac{\mathbf{q} \cdot \mathbf{k}}{\sqrt{d_k}}$$
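
Under the same assumptions as the sketch above, the scaled dot-product score is a one-liner and can be passed as the `score` argument of `attend`:

```python
import numpy as np

def scaled_dot_product_score(q, k):
    # alpha(q, k) = (q . k) / sqrt(d_k)
    return (q @ k) / np.sqrt(k.shape[0])
```

For example, `o = attend(q, keys, values, score=scaled_dot_product_score)`.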

Self-Attention

In self-attention, each input vector $\mathbf{x}_i$ is projected into three distinct vectors: a query $\mathbf{q}_i$, a key $\mathbf{k}_i$, and a value $\mathbf{v}_i$. These projections are performed via learnable weight matrices $\mathbf{W}_Q, \mathbf{W}_K, \mathbf{W}_V$, resulting in:

$$\mathbf{q}_i = \mathbf{x}_i \mathbf{W}_Q, \quad \mathbf{k}_i = \mathbf{x}_i \mathbf{W}_K, \quad \mathbf{v}_i = \mathbf{x}_i \mathbf{W}_V$$

These weight matrices are initialized randomly and optimized during training.
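
A minimal NumPy sketch of these projections; the dimensions and the Gaussian random initialization are illustrative assumptions, not prescribed by the text:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model, d_k, d_v = 4, 8, 8, 8        # illustrative sizes

X = rng.normal(size=(n, d_model))        # input vectors x_i stacked as rows
W_Q = rng.normal(size=(d_model, d_k))    # learnable weight matrices (random init,
W_K = rng.normal(size=(d_model, d_k))    #  updated by the optimizer during training)
W_V = rng.normal(size=(d_model, d_v))

Q = X @ W_Q   # rows are q_i = x_i W_Q
K = X @ W_K   # rows are k_i = x_i W_K
V = X @ W_V   # rows are v_i = x_i W_V
```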

In the simplified matrix form, the queries, keys, and values are stacked row-wise into matrices $\mathbf{Q}$, $\mathbf{K}$, and $\mathbf{V}$, and the attention output is obtained in a single computation:

$$\text{attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax} \left( \frac{\mathbf{Q} \mathbf{K}^T}{\sqrt{d_k}} \right) \mathbf{V}$$
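
A self-contained NumPy sketch of this matrix form (the `softmax` helper is written out here for illustration); it can be applied to the `Q`, `K`, `V` computed in the previous sketch:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # (n, n) scaled dot-product scores
    weights = softmax(scores, axis=-1)        # each row sums to one
    return weights @ V                        # each output row is a weighted sum of value rows

# e.g. out = attention(Q, K, V) with the Q, K, V from the self-attention sketch; out has shape (n, d_v)
```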
