
Recurrent Neural Networks (RNN)


Review

  • We introduced $n$-gram models, in which the conditional probability of word $x_t$ at time step $t$ depends only on the $n-1$ preceding words. If we want to incorporate the possible influence of words earlier than time step $t-(n-1)$ on $x_t$, we need to increase $n$. However, the number of model parameters then grows exponentially with $n$, since the vocabulary $\mathcal{V}$ requires storing $|\mathcal{V}|^n$ numbers.

  • Hence, rather than modeling $P(x_t \mid x_{t-1}, \ldots, x_1)$ directly, it is preferable to use a latent variable model: $P(x_t \mid x_{t-1}, \ldots, x_1) \approx P(x_t \mid h_{t-1})$.

  • Here $h_{t-1}$ is the hidden state (also called a hidden variable), which stores the sequence information up to time step $t-1$. In general, the hidden state at any time step $t$ can be computed from the current input $x_t$ and the previous hidden state $h_{t-1}$: $h_t = f(x_t, h_{t-1})$.

  • For a sufficiently powerful function $f$, the latent variable model is not an approximation: after all, $h_t$ could simply store all of the data observed so far. However, doing so could make both computation and storage expensive.
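As a rough illustration of this latent-variable view, here is a toy sketch (not from the text; `rnn_scan` and the running-sum update are made up for illustration) that folds an update function over a sequence while carrying a hidden state, so each step only sees $(x_t, h_{t-1})$ rather than the full history:

```python
def rnn_scan(f, xs, h0):
    """Fold the update h_t = f(x_t, h_{t-1}) over a sequence of inputs."""
    h, states = h0, []
    for x in xs:
        h = f(x, h)          # the state summarizes everything seen so far
        states.append(h)
    return states

# Toy update: a running sum stands in for a learned transition function.
print(rnn_scan(lambda x, h: h + x, [1, 2, 3, 4], 0))  # [1, 3, 6, 10]
```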

Neural Networks without Hidden States

  • Let’s take a look at an MLP with a single hidden layer.

  • Let the hidden layer's activation function be $\phi$. Given a minibatch of examples $X \in \mathbb{R}^{n \times d}$ with batch size $n$ and $d$ inputs, the hidden layer output $H \in \mathbb{R}^{n \times h}$ is calculated as

$$H = \phi(X W_{xh} + b_h)$$

  • In the previous equation, we have the weight parameter $W_{xh} \in \mathbb{R}^{d \times h}$, the bias parameter $b_h \in \mathbb{R}^{1 \times h}$, and the number of hidden units $h$ for the hidden layer. So armed, we apply broadcasting during the summation. Next, the hidden layer output $H$ is used as the input to the output layer, which is given by

$$O = H W_{hq} + b_q$$

  • where $O \in \mathbb{R}^{n \times q}$ is the output variable, $W_{hq} \in \mathbb{R}^{h \times q}$ is the weight parameter, and $b_q \in \mathbb{R}^{1 \times q}$ is the bias parameter of the output layer. For a classification problem, we can apply $\mathrm{softmax}(O)$ to compute the probability distribution over the output categories.

  • This is entirely analogous to the regression problem we solved in an earlier section, so we omit the details. Suffice it to say that we can pick feature-label pairs at random and learn the parameters of our network via automatic differentiation and stochastic gradient descent.
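A minimal sketch of the two equations above, assuming PyTorch and $\phi = \tanh$ (the shapes and parameter names follow the text; the random data is only for illustration):

```python
import torch

n, d, h, q = 3, 5, 4, 2                          # batch size, inputs, hidden units, outputs
X = torch.randn(n, d)                            # minibatch of examples

W_xh, b_h = torch.randn(d, h), torch.zeros(h)    # hidden layer parameters
W_hq, b_q = torch.randn(h, q), torch.zeros(q)    # output layer parameters

H = torch.tanh(X @ W_xh + b_h)   # hidden layer output; b_h broadcasts over the rows
O = H @ W_hq + b_q               # output logits; softmax(O) for classification
print(H.shape, O.shape)          # torch.Size([3, 4]) torch.Size([3, 2])
```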

Recurrent Neural Networks with Hidden States

  • Assume that we have a minibatch of inputs $X_t \in \mathbb{R}^{n \times d}$ at time step $t$. In other words, for a minibatch of $n$ sequence examples, each row of $X_t$ corresponds to one example at time step $t$ of the sequence.

  • Next, denote by $H_t \in \mathbb{R}^{n \times h}$ the hidden layer output at time step $t$. Unlike the MLP, here we save the hidden layer output $H_{t-1}$ from the previous time step and introduce a new weight parameter $W_{hh} \in \mathbb{R}^{h \times h}$ that describes how to use the previous time step's hidden layer output in the current time step.

  • Specifically, the calculation of the hidden layer output of the current time step is determined by the input of the current time step together with the hidden layer output of the previous time step:

    • $H_t = \phi(X_t W_{xh} + H_{t-1} W_{hh} + b_h)$

    • From the relationship between the hidden layer outputs $H_t$ and $H_{t-1}$ of adjacent time steps, we know that these variables capture and retain the sequence's historical information up to the current time step, just like the state or memory of the neural network at that time step.

    • Therefore, such a hidden layer output is called a hidden state. Since the hidden state uses the same definition at the current time step as at the previous one, the computation is recurrent.

  • In this example, the model parameters are $W_{xh}$ and $W_{hh}$ (which can be viewed as a concatenation), together with the bias $b_h$, all taken from the equations above. The hidden state $H_t$ of the current time step $t$ participates in computing the hidden state $H_{t+1}$ of the next time step $t+1$. Moreover, $H_t$ is also fed into a fully connected output layer to compute the output of the current time step: $O_t = H_t W_{hq} + b_q$.
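A minimal sketch of the recurrent update above, again assuming PyTorch and $\phi = \tanh$; the helper name `rnn_step` and the toy sequence length are made up for illustration:

```python
import torch

n, d, h, q = 3, 5, 4, 2
W_xh, W_hh, b_h = torch.randn(d, h), torch.randn(h, h), torch.zeros(h)
W_hq, b_q = torch.randn(h, q), torch.zeros(q)

def rnn_step(X_t, H_prev):
    """One time step: mix the current input with the previous hidden state."""
    H_t = torch.tanh(X_t @ W_xh + H_prev @ W_hh + b_h)
    O_t = H_t @ W_hq + b_q
    return H_t, O_t

H = torch.zeros(n, h)              # initial hidden state H_0
for t in range(6):                 # unroll over a toy sequence of 6 time steps
    X_t = torch.randn(n, d)        # input minibatch at time step t
    H, O_t = rnn_step(X_t, H)      # H is carried forward to the next step
```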

Character-Level Language Models Based on RNNs

  • Let the minibatch size be 1, and let the text sequence in the batch be "machine". To simplify training in the following sections, we use a character-level language model, tokenizing the text into characters rather than words. For the sequence "machine", the input is "machin" and the label (the sequence shifted by one token) is "achine".
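A small sketch of how this input/label pair is built by shifting the sequence one character to the left (plain Python, no tokenizer; the variable names are illustrative):

```python
text = "machine"
inputs, labels = text[:-1], text[1:]     # "machin" -> "achine"
print(inputs, "->", labels)

# At each time step the model sees one character (plus the hidden state)
# and is trained to predict the next character.
for x, y in zip(inputs, labels):
    print(f"given '{x}', predict '{y}'")
```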

Backpropagation Through Time

  • It requires us to expand the computational graph of the RNN one time step at a time to obtain the dependencies among model variables and parameters. Then, based on the chain rule, backpropagation is applied to compute and store the gradients. Since sequences can be rather long, these dependencies can be rather long as well.

Analysis of Gradients in RNNs

  • Full Computation

  • Truncating Time Steps

  • Randomized Truncation

  • Comparing Strategies

Backpropagation Through Time in Detail

Our goal is to predict the next token based on the past and current tokens, so we shift the original sequence by one token to form the labels. Bengio et al. first proposed using neural networks for language modeling (Bengio et al., 2003).

Forward propagation in an RNN is relatively straightforward. Backpropagation through time (BPTT) (Werbos, 1990) is in fact a specific application of backpropagation to recurrent neural networks.
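As a concrete illustration of BPTT and of the "Truncating Time Steps" strategy listed above, here is a hedged sketch (assuming PyTorch; the chunk length, toy loss, and parameter shapes are made up): detaching the hidden state between chunks cuts the computational graph, so gradients flow back only through the most recent chunk of time steps.

```python
import torch

d, h = 5, 4
W_xh = torch.randn(d, h, requires_grad=True)
W_hh = torch.randn(h, h, requires_grad=True)

H = torch.zeros(1, h)
for chunk in range(3):                   # process the sequence in chunks of 10 steps
    H = H.detach()                       # truncation: stop gradients at the chunk boundary
    for t in range(10):
        X_t = torch.randn(1, d)
        H = torch.tanh(X_t @ W_xh + H @ W_hh)
    loss = H.pow(2).sum()                # toy loss on the last state of the chunk
    loss.backward()                      # gradients flow only within this chunk
    W_xh.grad, W_hh.grad = None, None    # in training, an optimizer step would go here
```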

References

  • Bengio, Y., Ducharme, R., Vincent, P., & Janvin, C. (2003). A neural probabilistic language model. Journal of Machine Learning Research, 3, 1137–1155.
  • Werbos, P. J. (1990). Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78(10), 1550–1560.