Encoder-Decoder Architecture


  1. The encoder-decoder architecture is a pivotal advancement in natural language processing, particularly in sequence-to-sequence tasks such as machine translation, abstractive summarization, and question answering.

  2. This framework is built upon two primary components: an encoder and a decoder.

Encoder

The input text is tokenized into units (words or sub-words), which are then embedded into feature vectors $x_1, x_2, \cdots, x_T$.

A unidirectional encoder updates its hidden state $h_t$ at each time step $t$ using $h_{t-1}$ and $x_t$, as given by:

$$h_t = f(h_{t-1}, x_t)$$

The hidden states of the encoder are mapped to a context variable (also called the context vector) $c$, which encodes the information of the entire input sequence and is given by:

$$c = m(h_1, \cdots, h_T)$$

where $m$ is the mapping function and, in the simplest case, maps the context variable to the last hidden state:

$$c = m(h_1, \cdots, h_T) = h_T$$

Adding more complexity to the architecture, the encoder can be bidirectional; the hidden state then depends not only on the previous hidden state $h_{t-1}$ and the input $x_t$, but also on the following state $h_{t+1}$.

In other words, each hidden state then acts as a context vector over all of the known input.
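As a concrete sketch of this recurrence, the minimal encoder below embeds the tokens, runs them through a GRU playing the role of $f$, and returns the final state as the context vector $c = h_T$. PyTorch and the names `Seq2SeqEncoder`, `vocab_size`, `embed_size`, and `num_hiddens` are assumptions for illustration, not part of the source.

```python
import torch
from torch import nn

class Seq2SeqEncoder(nn.Module):
    """Minimal RNN encoder: embeds tokens, runs a GRU, and returns all hidden
    states plus the final state, which serves as the context vector c = h_T."""
    def __init__(self, vocab_size, embed_size, num_hiddens, num_layers=1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        # Pass bidirectional=True here for a bidirectional encoder, where h_t
        # also depends on the following time step.
        self.rnn = nn.GRU(embed_size, num_hiddens, num_layers)

    def forward(self, X):
        # X: (batch_size, num_steps) of token indices
        X = self.embedding(X)          # (batch_size, num_steps, embed_size)
        X = X.permute(1, 0, 2)         # GRU expects (num_steps, batch, embed)
        outputs, state = self.rnn(X)   # outputs: h_1..h_T; state: final state
        return outputs, state          # state acts as the context vector c
```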

Decoder

Upon obtaining the context vector from the encoder, the decoder starts to generate the output sequence $y = (y_1, y_2, \cdots, y_U)$, where $U$ may differ from $T$. Similar to the encoder, the decoder's hidden state at any time step $t'$ is given by

$$s_{t'} = g(s_{t'-1}, y_{t'-1}, c)$$

The decoder's hidden state flows to an output layer, and the conditional distribution of the next token at step $t'$ is given by

$$P(y_{t'} \mid y_{t'-1}, \cdots, y_1, c) = \mathrm{softmax}(s_{t'-1}, y_{t'-1}, c)$$
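Continuing the sketch above (same assumptions: PyTorch, illustrative class and parameter names), a minimal decoder concatenates the context vector $c$ with the embedded previous token at every step, updates its hidden state with a GRU playing the role of $g$, and maps the state to token logits; applying softmax to these logits gives the conditional distribution over the next token.

```python
class Seq2SeqDecoder(nn.Module):
    """Minimal RNN decoder: conditions every step on the context vector c and
    the previous token, then maps the hidden state to token logits."""
    def __init__(self, vocab_size, embed_size, num_hiddens, num_layers=1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        # Input at each step: embedded previous token concatenated with c.
        self.rnn = nn.GRU(embed_size + num_hiddens, num_hiddens, num_layers)
        self.dense = nn.Linear(num_hiddens, vocab_size)

    def forward(self, Y, state):
        # Y: (batch_size, num_steps) of previous tokens; state: encoder's h_T
        Y = self.embedding(Y).permute(1, 0, 2)     # (num_steps, batch, embed)
        c = state[-1].repeat(Y.shape[0], 1, 1)     # broadcast c to every step
        outputs, state = self.rnn(torch.cat((Y, c), dim=2), state)
        logits = self.dense(outputs)               # softmax is applied in the loss
        return logits.permute(1, 0, 2), state      # (batch, num_steps, vocab)
```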

Encoder-Decoder Model Training and Loss Function

The encoder-decoder model is trained end-to-end through supervised learning. The standard loss function employed is the categorical cross-entropy between the predicted output sequence and the actual output. This can be represented as:


$$\mathcal{L} = - \sum_{t=1}^{U} \log p(y_t \mid y_{t-1}, \dots, y_1, \mathbf{c})$$

Optimization of the model parameters typically employs gradient descent variants, such as the Adam or RMSprop algorithms.
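As a minimal sketch of this training setup, assuming PyTorch and the illustrative encoder/decoder classes above (teacher forcing, no padding mask, and a hypothetical `bos_id` for the beginning-of-sequence token), one training step with Adam and the summed cross-entropy loss might look like:

```python
encoder = Seq2SeqEncoder(vocab_size=10000, embed_size=32, num_hiddens=32)
decoder = Seq2SeqDecoder(vocab_size=10000, embed_size=32, num_hiddens=32)
params = list(encoder.parameters()) + list(decoder.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)
loss_fn = nn.CrossEntropyLoss(reduction='sum')   # -sum_t log p(y_t | ..., c)

def train_step(src, tgt, bos_id):
    # src: (batch, T) source tokens; tgt: (batch, U) target tokens
    optimizer.zero_grad()
    _, state = encoder(src)                        # context from the encoder
    # Teacher forcing: feed <bos> followed by the gold tokens shifted right.
    bos = torch.full((tgt.shape[0], 1), bos_id, dtype=torch.long)
    dec_input = torch.cat([bos, tgt[:, :-1]], dim=1)
    logits, _ = decoder(dec_input, state)          # (batch, U, vocab_size)
    loss = loss_fn(logits.reshape(-1, logits.shape[-1]), tgt.reshape(-1))
    loss.backward()
    optimizer.step()
    return loss.item()
```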


Issues

  • Recurrent Neural Networks (RNNs), the foundational architecture for many encoder-decoder models, have shortcomings, such as susceptibility to vanishing and exploding gradients (Hochreiter, 1998).

  • Additionally, the sequential dependency intrinsic to RNNs complicates parallelization, thereby imposing computational constraints.

Reference

  • Large Language Models: A Deep Dive, Chapter 2