# Encoder-Decoder Architecture


1. The encoder-decoder architecture is a pivotal advancement in natural language processing, particularly for sequence-to-sequence tasks such as machine translation, abstractive summarization, and question answering.
2. This framework is built upon two primary components: an encoder and a decoder.

### Encoder

The input text is tokenized into units (words or sub-words), which are then embedded into feature vectors $$x\_1, x\_2, \cdots, x\_T$$.

A unidirectional encoder updates its hidden state $$h\_t$$ at each time step $$t$$ using the previous state $$h\_{t-1}$$ and the input $$x\_t$$:

$$h\_t = f(h\_{t-1}, x\_t)$$

The encoder's hidden states are transformed into the context variable, also known as the context vector, $$c$$, which encodes the information of the entire input sequence:

$$c= m(h\_1, \cdots, h\_T)$$

where $$m$$ is a mapping function; in the simplest case, it simply selects the last hidden state:

$$c = m(h\_1, \cdots, h\_T) = h\_T$$

Adding more complexity to the architecture, the encoder can be bidirectional; the hidden state then depends not only on the previous hidden state $$h\_{t-1}$$ and input $$x\_t$$, but also on the following state $$h\_{t+1}$$.
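
As a concrete illustration, the following is a minimal sketch of such an encoder in PyTorch (the framework choice, the GRU cell, and all layer sizes are illustrative assumptions, not part of the formulation above; any recurrent update $$f$$ fits the same pattern):

```python
import torch
from torch import nn

class Encoder(nn.Module):
    """Sketch of an RNN encoder: h_t = f(h_{t-1}, x_t), with c = h_T."""

    def __init__(self, vocab_size, embed_dim, hidden_dim, bidirectional=False):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)  # tokens -> x_t
        self.rnn = nn.GRU(embed_dim, hidden_dim,
                          batch_first=True, bidirectional=bidirectional)

    def forward(self, tokens):
        # tokens: (batch, T) integer ids; embedded: (batch, T, embed_dim)
        embedded = self.embedding(tokens)
        # outputs holds every hidden state h_1..h_T; state holds the final one(s)
        outputs, state = self.rnn(embedded)
        # Simplest mapping m: the context vector c is the last hidden state h_T
        context = state[-1]  # (batch, hidden_dim)
        return outputs, context
```

With `bidirectional=True`, the final states of the forward and backward passes would typically be combined (e.g. concatenated) to form $$c$$; the sketch keeps only the simplest unidirectional mapping.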

<figure><img src="/files/avQFO8VdfQfXIeZhWSJJ" alt=""><figcaption></figcaption></figure>

In this case, each hidden state serves as a context vector that encodes all of the known input.

### Decoder

Upon obtaining the context vector from the encoder, the decoder starts to generate the output sequence $$y = (y\_1, y\_2, \cdots, y\_U)$$, where $$U$$ may differ from $$T$$. Similar to the encoder, the decoder's hidden state at any time step $$t'$$ is given by

$$s\_{t'} = g(s\_{t'-1}, y\_{t'-1}, c)$$

The decoder's hidden state flows to an output layer, and the conditional distribution of the next token at $$t'$$ is given by

$$P(y\_{t'} | y\_{t'-1}, \cdots, y\_1, c) = \text{softmax}(o\_{t'})$$

where $$o\_{t'}$$ is the output layer's transformation of the hidden state $$s\_{t'}$$.
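
Continuing the sketch above under the same assumptions, a matching decoder might look as follows; feeding the context $$c$$ into every step by concatenation is one common design choice, not the only one:

```python
class Decoder(nn.Module):
    """Sketch of an RNN decoder: s_t' = g(s_{t'-1}, y_{t'-1}, c)."""

    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # Condition on c by concatenating it with each input embedding
        self.rnn = nn.GRU(embed_dim + hidden_dim, hidden_dim, batch_first=True)
        self.output_layer = nn.Linear(hidden_dim, vocab_size)  # s_t' -> o_t'

    def forward(self, prev_tokens, state, context):
        # prev_tokens: (batch, U) ids of y_0..y_{U-1} (teacher forcing)
        embedded = self.embedding(prev_tokens)
        # Repeat the context vector c at every decoding step
        ctx = context.unsqueeze(1).expand(-1, embedded.size(1), -1)
        s, state = self.rnn(torch.cat([embedded, ctx], dim=-1), state)
        logits = self.output_layer(s)  # softmax over logits gives P(y_t' | ..., c)
        return logits, state
```

During training, the ground-truth previous tokens are fed in (teacher forcing); at inference time, each sampled $$y\_{t'}$$ is fed back as the next input.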


### Encoder-Decoder Model Training and Loss Function

The encoder-decoder model is trained end-to-end through supervised learning. The standard loss function employed is the categorical cross-entropy between the predicted output sequence and the actual output. This can be represented as:


$$\mathcal{L} = - \sum\_{t=1}^{U} \log P(y\_t | y\_{t-1}, \cdots, y\_1, c)$$

Optimization of the model parameters typically employs gradient descent variants, such as the Adam or RMSprop algorithms.

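Putting the pieces together, here is a hedged sketch of a single training step that reuses the `Encoder` and `Decoder` classes from the sketches above; the padding id, the shifted-target convention, and the sizes in the commented example are assumptions:

```python
def train_step(encoder, decoder, optimizer, src, tgt_in, tgt_out, pad_id=0):
    """One supervised step on a batch of (source, target) pairs.

    src: (batch, T) source ids; tgt_in / tgt_out: target ids shifted by one,
    so that position t' of tgt_out is the label for decoder input y_{t'-1}.
    """
    optimizer.zero_grad()
    _, context = encoder(src)
    state = context.unsqueeze(0)  # initialize s_0 from the context vector c
    logits, _ = decoder(tgt_in, state, context)
    # Cross-entropy computes -log P(y_t | y_<t, c), averaged over non-pad tokens
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), tgt_out.reshape(-1),
        ignore_index=pad_id)
    loss.backward()
    optimizer.step()
    return loss.item()

# Example wiring (all sizes are placeholders):
# enc = Encoder(10000, 256, 512); dec = Decoder(10000, 256, 512)
# opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)
```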

## Issue

* Recurrent Neural Networks (RNNs), the foundational architecture for many encoder-decoder models, have shortcomings, such as susceptibility to <mark style="color:blue;">vanishing and exploding gradients</mark> (Hochreiter, 1998).
* Additionally, the sequential dependency intrinsic to RNNs complicates parallelization, thereby imposing <mark style="color:blue;">computational constraints</mark>.

## Reference

* *Large Language Models: A Deep Dive*, Chapter 2.

