Recurrent Neural Networks (RNN)

Review

  • We introduced $n$-gram models, in which the conditional probability of word $x_t$ at time step $t$ depends only on the $n-1$ preceding words. If we want to incorporate the possible effect of words earlier than time step $t-(n-1)$ on $x_t$, we need to increase $n$. However, the number of model parameters then grows exponentially with $n$, since we need to store $|\mathcal{V}|^n$ numbers for a vocabulary set $\mathcal{V}$.

  • Hence, rather than modeling $P(x_t \mid x_{t-1}, \cdots, x_1)$ directly, it is preferable to use a latent variable model: $P(x_t \mid x_{t-1}, \cdots, x_1) \approx P(x_t \mid h_{t-1})$

  • Here $h_{t-1}$ is a hidden state (also called a hidden variable), which stores the sequence information up to time step $t-1$. In general, the hidden state at any time step $t$ can be computed from the current input $x_t$ and the previous hidden state $h_{t-1}$: $h_t = f(x_t, h_{t-1})$

  • For an appropriate function $f$, the latent variable model is not an approximation: after all, $h_t$ could simply store all the data observed so far. However, such an operation could make both computation and storage expensive.
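A minimal sketch of this latent-state idea (PyTorch; the tanh update, the random weights, and the dimensions are illustrative assumptions, not the model defined in the following sections):

```python
import torch

d, h = 4, 3                           # input dimension and hidden-state size (assumed)
W_x = torch.randn(d, h) * 0.1
W_h = torch.randn(h, h) * 0.1

def f(x_t, h_prev):
    """One latent-state update h_t = f(x_t, h_{t-1}); tanh is only an example choice."""
    return torch.tanh(x_t @ W_x + h_prev @ W_h)

h_state = torch.zeros(1, h)           # fixed-size summary of the history so far
for x_t in torch.randn(5, 1, d):      # a toy sequence of 5 time steps
    h_state = f(x_t, h_state)         # carry the state forward, not the full history
print(h_state.shape)                  # torch.Size([1, 3]) regardless of sequence length
```

The point is that the state has a fixed size, so memory does not grow with the length of the observed prefix, unlike storing the entire history.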

Neural Networks without Hidden States

  • Let’s take a look at an MLP with a single hidden layer.

  • Let the hidden layer’s activation function be $\phi$. Given a minibatch of examples $X \in \mathbb{R}^{n \times d}$ with batch size $n$ and $d$ inputs, the hidden layer output $H \in \mathbb{R}^{n \times h}$ is calculated as

$$H = \phi(X W_{xh} + b_h)$$

  • In the previous equation, we have the weight parameter $W_{xh} \in \mathbb{R}^{d \times h}$, the bias parameter $b_h \in \mathbb{R}^{1 \times h}$, and the number of hidden units $h$ for the hidden layer. Broadcasting is applied during the summation. Next, the hidden layer output $H$ is used as the input of the output layer, which is given by

$$O = H W_{hq} + b_q$$

  • where $O \in \mathbb{R}^{n \times q}$ is the output variable, $W_{hq} \in \mathbb{R}^{h \times q}$ is the weight parameter, and $b_q \in \mathbb{R}^{1 \times q}$ is the bias parameter of the output layer. If it is a classification problem, we can use $\mathrm{softmax}(O)$ to compute the probability distribution of the output categories.

  • This is entirely analogous to the regression problem we solved in the previous section, so we omit the details. Suffice it to say that we can pick feature-label pairs at random and learn the parameters of our network via automatic differentiation and stochastic gradient descent.
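A minimal sketch of these two equations (PyTorch; the sizes $n$, $d$, $h$, $q$ and the tanh activation are arbitrary illustrative choices):

```python
import torch

n, d, h, q = 2, 5, 4, 3                              # batch size, inputs, hidden units, outputs (assumed)
X = torch.randn(n, d)                                # minibatch of examples

W_xh, b_h = torch.randn(d, h) * 0.1, torch.zeros(h)  # hidden layer parameters; b_h broadcasts over the batch
W_hq, b_q = torch.randn(h, q) * 0.1, torch.zeros(q)  # output layer parameters

H = torch.tanh(X @ W_xh + b_h)                       # hidden layer output, shape (n, h)
O = H @ W_hq + b_q                                   # output layer, shape (n, q)
probs = torch.softmax(O, dim=1)                      # class probabilities for a classification problem
print(O.shape, probs.sum(dim=1))                     # torch.Size([2, 3]); each row of probs sums to 1
```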

Recurrent Neural Networks with Hidden States

  • Assume that we have a minibatch of inputs $X_t \in \mathbb{R}^{n \times d}$ at time step $t$. In other words, for a minibatch of $n$ sequence examples, each row of $X_t$ corresponds to one example at time step $t$ from the sequence.

  • Next, denote by $H_t \in \mathbb{R}^{n \times h}$ the hidden layer output of time step $t$. Unlike with the MLP, here we save the hidden layer output $H_{t-1}$ from the previous time step and introduce a new weight parameter $W_{hh} \in \mathbb{R}^{h \times h}$ to describe how to use the hidden layer output of the previous time step in the current time step.

  • Specifically, the calculation of the hidden layer output of the current time step is determined by the input of the current time step together with the hidden layer output of the previous time step:

    • $H_t = \phi(X_t W_{xh} + H_{t-1} W_{hh} + b_h)$

    • From the relationship between the hidden layer outputs $H_{t-1}$ and $H_t$ of adjacent time steps, we know that these variables capture and retain the sequence’s historical information up to their current time step, just like the state or memory of the neural network’s current time step.

    • Therefore, such a hidden layer output is called a hidden state. Since the hidden state uses the same definition from the previous time step in the current time step, the computation of this equation is recurrent.

  • In this example, the model parameters are the concatenation of $W_{xh}$ and $W_{hh}$, together with the bias $b_h$, all from the equation above. The hidden state $H_t$ of the current time step $t$ will participate in computing the hidden state $H_{t+1}$ of the next time step $t+1$. Moreover, $H_t$ will also be fed into a fully connected output layer to compute the output $O_t$ of the current time step $t$; the sketch below illustrates this recurrent step and the concatenation view.
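A minimal sketch of the recurrent update above, also checking the concatenation view: multiplying the concatenation of $X_t$ and $H_{t-1}$ (along columns) by the concatenation of $W_{xh}$ and $W_{hh}$ (along rows) gives the same result as the sum of the two separate products (PyTorch; the shapes are illustrative assumptions):

```python
import torch

n, d, h = 3, 1, 4                                    # batch size, inputs, hidden units (assumed)
X_t, H_prev = torch.randn(n, d), torch.randn(n, h)   # current input and previous hidden state
W_xh, W_hh, b_h = torch.randn(d, h), torch.randn(h, h), torch.zeros(h)

# Recurrent update: H_t = phi(X_t W_xh + H_{t-1} W_hh + b_h), with phi = tanh here
H_t = torch.tanh(X_t @ W_xh + H_prev @ W_hh + b_h)

# Equivalent concatenated form
H_t_cat = torch.tanh(torch.cat((X_t, H_prev), dim=1) @ torch.cat((W_xh, W_hh), dim=0) + b_h)
print(torch.allclose(H_t, H_t_cat))                  # True
```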

Character-Level Language Models Based on RNNs

  • Our goal is to predict the next token based on the past and current tokens, so we shift the original sequence by one token to serve as the labels. Bengio et al. first proposed using neural networks for language modeling (Bengio et al., 2003).

  • Let the minibatch size be 1, and let the text sequence in the batch be “machine”. To simplify training in subsequent sections, we use a character-level language model, tokenizing the text into characters rather than words.

  • The input is “machin” and the label (output) is “achine”.
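A minimal sketch of how the input and label sequences are obtained by shifting the character sequence by one token (plain Python; the variable names are illustrative):

```python
text = "machine"                      # the single text sequence in the minibatch
tokens = list(text)                   # character-level tokenization

inputs = tokens[:-1]                  # ['m', 'a', 'c', 'h', 'i', 'n']  -> "machin"
labels = tokens[1:]                   # ['a', 'c', 'h', 'i', 'n', 'e']  -> "achine"

# At each time step the model reads one input character and is trained
# to predict the corresponding label, i.e. the next character.
for x, y in zip(inputs, labels):
    print(f"input: {x!r} -> target: {y!r}")
```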

Backpropagation Through Time

  • Forward propagation in an RNN is relatively straightforward. Backpropagation through time (BPTT) (Werbos, 1990) is in fact a specific application of backpropagation within RNNs.

  • It requires us to expand the computational graph of the RNN one time step at a time to obtain the dependencies among model variables and parameters. Then, based on the chain rule, backpropagation is applied to compute and store the gradients. Since sequences can be rather long, the dependencies can be rather long as well.
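To make the chain-rule dependency concrete, here is a sketch of the standard BPTT expansion for a generic recurrence $h_t = f(x_t, h_{t-1}, w_h)$ with a recurrent parameter $w_h$ (scalar notation, a generic form rather than the specific weights above):

$$\frac{\partial h_t}{\partial w_h} = \frac{\partial f(x_t, h_{t-1}, w_h)}{\partial w_h} + \sum_{i=1}^{t-1} \left( \prod_{j=i+1}^{t} \frac{\partial f(x_j, h_{j-1}, w_h)}{\partial h_{j-1}} \right) \frac{\partial f(x_i, h_{i-1}, w_h)}{\partial w_h}$$

The product of up to $t-1$ Jacobian-like factors is what makes the full gradient expensive for long sequences and prone to vanishing or exploding, which motivates the truncation strategies listed below.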

Analysis of Gradients in RNNs

  • Full Computation

  • Truncating Time Steps

  • Randomized Truncation

  • Comparing Strategies
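As an illustration of the truncation idea, here is a sketch of truncated backpropagation by detaching the hidden state every few steps (PyTorch; the step count and the update rule are illustrative assumptions, not a specific strategy prescribed here):

```python
import torch

h, steps = 4, 6
W_hh = torch.randn(h, h, requires_grad=True)
state = torch.zeros(1, h)

for t in range(steps):
    x_t = torch.randn(1, h)
    if t % 3 == 0:                     # truncate: cut the graph every 3 steps (arbitrary choice)
        state = state.detach()         # gradients stop flowing past this point
    state = torch.tanh(x_t + state @ W_hh)

state.sum().backward()                 # backpropagates only through the most recent steps
print(W_hh.grad.shape)                 # torch.Size([4, 4])
```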

Backpropagation Through Time in Detail
