Encoder-Decoder Architecture


  1. The encoder-decoder architecture is a pivotal advancement in natural language processing, particularly in sequence-to-sequence tasks such as machine translation, abstractive summarization, and question answering.

  2. This framework is built upon two primary components: an encoder and a decoder.

Encoder

The input text is tokenized into units (words or sub-words), which are then embedded into feature vectors $x_1, x_2, \cdots, x_T$.

A unidirectional encoder updates its hidden state $h_t$ at each time step $t$ using $h_{t-1}$ and $x_t$, as given by:

$$h_t = f(h_{t-1}, x_t)$$

The encoder output, known as the context variable or context vector $c$, encodes the information of the entire input sequence and is given by:

$$c = m(h_1, \cdots, h_T)$$

where $m$ is the mapping function; in the simplest case, the context variable is simply the last hidden state:

$$c = m(h_1, \cdots, h_T) = h_T$$
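A minimal sketch of this recurrence, assuming PyTorch with a GRU as the function $f$ (the names `vocab_size`, `embed_dim`, and `hidden_dim` are hypothetical hyperparameters):

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Unidirectional RNN encoder: h_t = f(h_{t-1}, x_t), with c = h_T."""
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)  # tokens -> feature vectors x_t
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)

    def forward(self, tokens):              # tokens: (batch, T)
        x = self.embedding(tokens)          # (batch, T, embed_dim)
        outputs, h_T = self.rnn(x)          # outputs holds every h_t; h_T is the final state
        context = h_T                       # simplest mapping m: c = h_T
        return outputs, context
```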

The encoder can also be bidirectional; in that case the hidden state depends not only on the previous hidden state $h_{t-1}$ and the input $x_t$, but also on the following state $h_{t+1}$. Each hidden state then acts as a context vector informed by the entire input sequence.
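Under the same assumptions, a bidirectional encoder can be sketched by setting `bidirectional=True`; concatenating the final forward and backward states into the context is one common choice, not the only one:

```python
class BiEncoder(nn.Module):
    """Bidirectional variant: each hidden state also sees the following inputs."""
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, tokens):
        x = self.embedding(tokens)
        outputs, h_final = self.rnn(x)      # h_final: (2, batch, hidden_dim) -- forward and backward
        context = torch.cat([h_final[0], h_final[1]], dim=-1)  # (batch, 2 * hidden_dim)
        return outputs, context
```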

Decoder

Upon obtaining the context vector from the encoder, the decoder starts to generate the output sequence $y = (y_1, y_2, \cdots, y_U)$, where $U$ may differ from $T$. Similar to the encoder, the decoder's hidden state at any time step $t'$ is given by

$$s_{t'} = g(s_{t'-1}, y_{t'-1}, c)$$

The hidden state of the decoder flows to an output layer, and the conditional distribution of the next token at $t'$ is given by

$$P(y_{t'} \mid y_{t'-1}, \cdots, y_1, c) = \text{softmax}(s_{t'-1}, y_{t'-1}, c)$$
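A minimal decoder sketch under the same assumptions (conditioning on $c$ by concatenating it to every input embedding is one common design choice; the class and argument names are hypothetical):

```python
class Decoder(nn.Module):
    """RNN decoder: s_t' = g(s_{t'-1}, y_{t'-1}, c), followed by an output layer."""
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim + hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, prev_tokens, state, context):
        # prev_tokens: previous outputs y_{t'-1}, shape (batch, U)
        # state: decoder hidden state s_{t'-1}, shape (1, batch, hidden_dim)
        # context: encoder context c, shape (1, batch, hidden_dim)
        y = self.embedding(prev_tokens)                          # (batch, U, embed_dim)
        c = context.permute(1, 0, 2).expand(-1, y.size(1), -1)   # repeat c at every step
        s, state = self.rnn(torch.cat([y, c], dim=-1), state)    # s holds every s_t'
        return self.out(s), state   # logits; softmax is applied when computing P(y_t' | ...)
```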

Encoder-Decoder Model Training and Loss Function

The encoder-decoder model is trained end-to-end through supervised learning. The standard loss function employed is the categorical cross-entropy between the predicted output sequence and the actual output. This can be represented as:


$$\mathcal{L} = - \sum_{t=1}^{U} \log p(y_t \mid y_{t-1}, \dots, y_1, c)$$
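A sketch of this loss using the hypothetical `Encoder` and `Decoder` above, with teacher forcing (the ground-truth prefix $y_1, \dots, y_{t-1}$ is fed to the decoder) and assuming target sequences start with a begin-of-sequence token and are padded with `pad_id`:

```python
import torch.nn.functional as F

def sequence_loss(encoder, decoder, src_tokens, tgt_tokens, pad_id=0):
    """L = -sum_t log p(y_t | y_{t-1}, ..., y_1, c), summed over non-padding positions."""
    _, context = encoder(src_tokens)
    # Teacher forcing: feed y_0 .. y_{U-1}, predict y_1 .. y_U.
    logits, _ = decoder(tgt_tokens[:, :-1], context, context)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),   # (batch * (U-1), vocab_size)
        tgt_tokens[:, 1:].reshape(-1),         # (batch * (U-1),)
        ignore_index=pad_id,
        reduction="sum",                       # matches the sum over t in the formula
    )
```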

Optimization of the model parameters typically employs gradient descent variants, such as the Adam or RMSprop algorithms.
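For instance, a single training step with Adam might look like the following sketch (the hyperparameter values are placeholders, and `src_batch` / `tgt_batch` are hypothetical token tensors):

```python
encoder = Encoder(vocab_size=10_000, embed_dim=256, hidden_dim=512)
decoder = Decoder(vocab_size=10_000, embed_dim=256, hidden_dim=512)
optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3
)

optimizer.zero_grad()
loss = sequence_loss(encoder, decoder, src_batch, tgt_batch)
loss.backward()
optimizer.step()
```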


Issues

  • Recurrent Neural Networks (RNNs), the foundational architecture for many encoder-decoder models, have shortcomings, such as susceptibility to vanishing and exploding gradients (Hochreiter, 1998).

  • Additionally, the sequential dependency intrinsic to RNNs complicates parallelization, thereby imposing computational constraints.

Reference

  • Large Language Models: A Deep Dive, Chapter 2
