Encoder-Decoder Architecture
The encoder-decoder architecture is a pivotal advancement in natural language processing, particularly in sequence-to-sequence tasks such as machine translation, abstractive summarization, and question answering.
This framework is built upon two primary components: an encoder and a decoder.
The input text is tokenized into units (words or sub-words), which are then embedded into feature vectors $\mathbf{x}_1, \ldots, \mathbf{x}_T$.
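As a minimal sketch of this step, assuming PyTorch, a toy whitespace tokenizer, and a hypothetical hand-built vocabulary (a real system would use a trained sub-word tokenizer):

```python
import torch
import torch.nn as nn

# Hypothetical toy vocabulary; illustrative only.
vocab = {"<pad>": 0, "<bos>": 1, "<eos>": 2, "the": 3, "cat": 4, "sat": 5}

def tokenize(text: str) -> torch.Tensor:
    # Whitespace tokenization into known vocabulary ids (toy example).
    ids = [vocab[w] for w in text.lower().split()]
    return torch.tensor(ids, dtype=torch.long)

embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=8)

tokens = tokenize("the cat sat")   # shape: (T,)
features = embedding(tokens)       # shape: (T, 8), one feature vector per token
print(features.shape)              # torch.Size([3, 8])
```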
A unidirectional encoder updates its hidden state $\mathbf{h}_t$ at each time step $t$ using the input $\mathbf{x}_t$ and the previous hidden state $\mathbf{h}_{t-1}$, as given by:
$$\mathbf{h}_t = f(\mathbf{x}_t, \mathbf{h}_{t-1})$$
The final state of the encoder is known as the context variable or the context vector; it encodes the information of the entire input sequence and is given by:
$$\mathbf{c} = q(\mathbf{h}_1, \ldots, \mathbf{h}_T)$$
where $q$ is the mapping function and, in the simplest case, maps the context variable to the last hidden state:
$$\mathbf{c} = q(\mathbf{h}_1, \ldots, \mathbf{h}_T) = \mathbf{h}_T$$
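This recurrence can be sketched with a GRU-based encoder that returns its final hidden state as the context vector; the class name, dimensions, and choice of GRU are illustrative assumptions, not the text's prescribed setup:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    # Unidirectional RNN encoder: h_t = f(x_t, h_{t-1}); context c = h_T.
    def __init__(self, vocab_size: int, embed_dim: int, hidden_dim: int):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)

    def forward(self, src: torch.Tensor):
        x = self.embedding(src)        # (batch, T, embed_dim)
        outputs, h_T = self.rnn(x)     # outputs: all h_t; h_T: (1, batch, hidden_dim)
        context = h_T                  # simplest choice of q: c = h_T
        return outputs, context

encoder = Encoder(vocab_size=1000, embed_dim=32, hidden_dim=64)
src = torch.randint(0, 1000, (2, 7))   # batch of 2 sequences, length T = 7
outputs, context = encoder(src)
print(outputs.shape, context.shape)    # (2, 7, 64), (1, 2, 64)
```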
Adding more complexity to the architecture, the encoder can be bidirectional; thus the hidden state $\mathbf{h}_t$ would not only depend on the previous hidden state $\mathbf{h}_{t-1}$ and input $\mathbf{x}_t$, but also on the following state $\mathbf{h}_{t+1}$.
In this case, each hidden state acts as a context vector with access to the entire input sequence.
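A bidirectional variant can be sketched by setting `bidirectional=True`; concatenating the final forward and backward states, as done below, is one common way to form the context, not the only one, and all sizes are illustrative:

```python
import torch
import torch.nn as nn

# Bidirectional GRU: each position sees both past and future inputs.
rnn = nn.GRU(input_size=32, hidden_size=64, batch_first=True, bidirectional=True)

x = torch.randn(2, 7, 32)        # (batch, T, embed_dim)
outputs, h_n = rnn(x)            # outputs: (2, 7, 128); h_n: (2, 2, 64)

# One illustrative way to form a single context vector:
# concatenate the final forward and backward hidden states.
context = torch.cat([h_n[0], h_n[1]], dim=-1)   # (batch, 128)
print(outputs.shape, context.shape)
```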
Upon obtaining the context vector from the encoder, the decoder starts to generate the output sequence $\mathbf{y}_1, \ldots, \mathbf{y}_{T'}$, where $T'$ may differ from $T$. Similar to the encoder, the decoder's hidden state $\mathbf{s}_{t'}$ at any time step $t'$ is given by:
$$\mathbf{s}_{t'} = g(\mathbf{y}_{t'-1}, \mathbf{c}, \mathbf{s}_{t'-1})$$
The decoder's hidden state flows to an output layer, and the conditional distribution of the next token at step $t'$ is given by:
$$P(\mathbf{y}_{t'} \mid \mathbf{y}_1, \ldots, \mathbf{y}_{t'-1}, \mathbf{c}) = \operatorname{softmax}(\mathbf{W}_o \mathbf{s}_{t'} + \mathbf{b}_o)$$
where $\mathbf{W}_o$ and $\mathbf{b}_o$ are the parameters of the output layer.
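A matching decoder can be sketched as follows, with the context vector concatenated to each input embedding and also used as the initial hidden state; this particular conditioning scheme and all dimensions are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Decoder(nn.Module):
    # s_{t'} = g(y_{t'-1}, c, s_{t'-1}); P(y_{t'} | y_{<t'}, c) = softmax(W_o s_{t'} + b_o)
    def __init__(self, vocab_size: int, embed_dim: int, hidden_dim: int):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # The context is concatenated to every decoder input embedding.
        self.rnn = nn.GRU(embed_dim + hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tgt: torch.Tensor, context: torch.Tensor):
        # tgt: (batch, T'); context: (1, batch, hidden_dim) from the encoder
        y = self.embedding(tgt)                                   # (batch, T', embed_dim)
        c = context.permute(1, 0, 2).expand(-1, y.size(1), -1)    # (batch, T', hidden_dim)
        s, _ = self.rnn(torch.cat([y, c], dim=-1), context)       # decoder states s_{t'}
        logits = self.out(s)                                      # (batch, T', vocab_size)
        return logits

decoder = Decoder(vocab_size=1000, embed_dim=32, hidden_dim=64)
tgt = torch.randint(0, 1000, (2, 5))
context = torch.zeros(1, 2, 64)        # would come from the encoder in practice
probs = F.softmax(decoder(tgt, context), dim=-1)
print(probs.shape)                     # (2, 5, 1000)
```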
The encoder-decoder model is trained end-to-end through supervised learning. The standard loss function employed is the categorical cross-entropy between the predicted output sequence and the actual output. This can be represented as:
$$\mathcal{L} = -\sum_{t'=1}^{T'} \log P(\mathbf{y}_{t'} \mid \mathbf{y}_1, \ldots, \mathbf{y}_{t'-1}, \mathbf{c})$$
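Concretely, this loss sums the negative log-probability of each ground-truth target token under the predicted distribution. A minimal PyTorch sketch with illustrative shapes, assuming token id 0 is padding:

```python
import torch
import torch.nn.functional as F

# logits: decoder outputs, (batch, T', vocab_size); targets: ground-truth ids, (batch, T')
logits = torch.randn(2, 5, 1000)
targets = torch.randint(0, 1000, (2, 5))

# Categorical cross-entropy = - sum_t log P(y_t | y_<t, c), averaged here over tokens.
# ignore_index=0 assumes id 0 is the padding token (an illustrative convention).
loss = F.cross_entropy(logits.reshape(-1, 1000), targets.reshape(-1), ignore_index=0)
print(loss.item())
```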
Optimization of the model parameters typically employs gradient descent variants, such as the Adam or RMSprop algorithms.
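A single end-to-end training step with Adam might then look like the following sketch; `TinySeq2Seq` is an illustrative stand-in that fuses a GRU encoder and decoder, not the text's exact architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinySeq2Seq(nn.Module):
    # Minimal stand-in combining an encoder and decoder GRU (shared embedding),
    # used only to illustrate the optimization step.
    def __init__(self, vocab_size=1000, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, src, tgt):
        _, context = self.encoder(self.embed(src))      # context = final encoder state
        s, _ = self.decoder(self.embed(tgt), context)   # decoder conditioned via initial state
        return self.out(s)

model = TinySeq2Seq()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

src = torch.randint(0, 1000, (2, 7))
tgt_in = torch.randint(0, 1000, (2, 5))    # decoder inputs (shifted right in practice)
tgt_out = torch.randint(0, 1000, (2, 5))   # prediction targets

optimizer.zero_grad()
logits = model(src, tgt_in)
loss = F.cross_entropy(logits.reshape(-1, 1000), tgt_out.reshape(-1))
loss.backward()      # backpropagate through decoder and encoder
optimizer.step()     # Adam parameter update
```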
Recurrent Neural Networks (RNNs), the foundational architecture for many encoder-decoder models, have shortcomings, such as susceptibility to vanishing and exploding gradients (Hochreiter, 1998). Additionally, the sequential dependency intrinsic to RNNs complicates parallelization, thereby imposing computational constraints.