Recurrent Neural Networks (RNN)
Review
We introduced n-gram models, in which the conditional probability of word $x_t$ at time step $t$ depends only on the $n-1$ preceding words. If we want to incorporate the possible effect of words earlier than time step $t-(n-1)$ on $x_t$, we have to increase $n$; however, the number of model parameters then grows exponentially, since we need to store $|\mathcal{V}|^n$ numbers for a vocabulary set $\mathcal{V}$.
Hence, rather than modeling $P(x_t \mid x_{t-1}, \ldots, x_1)$ directly, it is preferable to use a latent variable model:

$$P(x_t \mid x_{t-1}, \ldots, x_1) \approx P(x_t \mid h_{t-1}),$$
where $h_{t-1}$ is a hidden state (also known as a hidden variable) that stores the sequence information up to time step $t-1$. In general, the hidden state at any time step $t$ can be computed from the current input $x_t$ and the previous hidden state $h_{t-1}$:

$$h_t = f(x_t, h_{t-1}).$$
For the function $f$, the latent variable model is not an approximation. After all, $h_t$ could simply store all the data observed so far; however, such an approach could make both computation and storage expensive.
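As a toy illustration, the sketch below folds a generic update $h_t = f(x_t, h_{t-1})$ over a sequence of scalars; the particular choice of $f$ (a running average) and the inputs are hypothetical stand-ins, not a learned model.

```python
# A minimal sketch of the latent-variable view: the state h_t summarizes the
# history x_1, ..., x_t through a generic update function f.

def f(x_t, h_prev):
    # Hypothetical update: blend the new observation into the running state.
    return 0.5 * h_prev + 0.5 * x_t

def encode_sequence(xs, h0=0.0):
    """Fold the update h_t = f(x_t, h_{t-1}) over a sequence."""
    h = h0
    for x_t in xs:
        h = f(x_t, h)
    return h  # a fixed-size summary of the whole prefix

print(encode_sequence([1.0, 2.0, 3.0]))
```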
Neural Networks without Hidden States
Let’s take a look at an MLP with a single hidden layer.
Let the hidden layer’s activation function be $\phi$. Given a minibatch of examples $X \in \mathbb{R}^{n \times d}$ with batch size $n$ and $d$ inputs, the hidden layer output $H \in \mathbb{R}^{n \times h}$ is calculated as
$$H = \phi(X W_{xh} + b_h).$$
In the equation above, we have the weight parameter $W_{xh} \in \mathbb{R}^{d \times h}$, the bias parameter $b_h \in \mathbb{R}^{1 \times h}$, and the number of hidden units $h$ for the hidden layer. Thus, broadcasting is applied during the summation. Next, the hidden layer output $H$ is used as input to the output layer, which is given by
$$O = H W_{hq} + b_q,$$
where $O \in \mathbb{R}^{n \times q}$ is the output variable, $W_{hq} \in \mathbb{R}^{h \times q}$ is the weight parameter, and $b_q \in \mathbb{R}^{1 \times q}$ is the bias parameter of the output layer. For a classification problem, we can use $\mathrm{softmax}(O)$ to compute the probability distribution over the output categories.
This is entirely analogous to the regression problem we solved in the previous section, so we omit the details. Suffice it to say that we can pick feature-label pairs at random and learn the parameters of our network via automatic differentiation and stochastic gradient descent.
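As a rough sketch of the two equations above, the following snippet computes the hidden layer and the output layer for one minibatch; the sizes (n, d, h, q), the random data, and the choice of tanh for $\phi$ are illustrative assumptions.

```python
import torch

n, d, h, q = 4, 8, 16, 3          # batch size, inputs, hidden units, outputs

X = torch.randn(n, d)             # minibatch of examples
W_xh = torch.randn(d, h) * 0.01   # hidden-layer weights
b_h = torch.zeros(1, h)           # hidden-layer bias (broadcast over the batch)
W_hq = torch.randn(h, q) * 0.01   # output-layer weights
b_q = torch.zeros(1, q)           # output-layer bias

H = torch.tanh(X @ W_xh + b_h)    # H = phi(X W_xh + b_h), with phi = tanh here
O = H @ W_hq + b_q                # O = H W_hq + b_q
probs = torch.softmax(O, dim=1)   # class probabilities for a classification task
print(probs.shape)                # torch.Size([4, 3])
```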
Recurrent Neural Networks with Hidden States
Assume that we have a minibatch of inputs $X_t \in \mathbb{R}^{n \times d}$ at time step $t$. In other words, for a minibatch of $n$ sequence examples, each row of $X_t$ corresponds to one example at time step $t$ from the sequence.
Next, denote by $H_t \in \mathbb{R}^{n \times h}$ the hidden layer output of time step $t$. Unlike with the MLP, here we save the hidden layer output $H_{t-1}$ from the previous time step and introduce a new weight parameter $W_{hh} \in \mathbb{R}^{h \times h}$ to describe how to use the hidden layer output of the previous time step in the current time step.
Specifically, the calculation of the hidden layer output of the current time step is determined by the input of the current time step together with the hidden layer output of the previous time step:
$$H_t = \phi(X_t W_{xh} + H_{t-1} W_{hh} + b_h).$$
From the relationship between the hidden layer outputs $H_{t-1}$ and $H_t$ of adjacent time steps, we know that these variables capture and retain the sequence’s historical information up to their current time step, just like the state or memory of the neural network at the current time step.
Therefore, such a hidden layer output is called a hidden state. Since the hidden state uses the same definition from the previous time step in the current time step, the computation of this equation is recurrent.

In this example, the model parameters are the concatenation of $W_{xh}$ and $W_{hh}$, together with the bias $b_h$, all of which come from the equations above. The hidden state $H_t$ of the current time step $t$ participates in computing the hidden state $H_{t+1}$ of the next time step $t+1$. Moreover, $H_t$ is also fed into the fully connected output layer to compute the output $O_t$ of the current time step $t$.
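A minimal sketch of this recurrence is given below: it reuses the hidden state from the previous time step when computing $H_t$ and then applies the output layer to obtain $O_t$. The sizes, the number of time steps, and the random inputs are illustrative assumptions.

```python
import torch

n, d, h, q, T = 2, 8, 16, 5, 5    # batch, inputs, hidden units, outputs, time steps

W_xh = torch.randn(d, h) * 0.01
W_hh = torch.randn(h, h) * 0.01   # new parameter linking adjacent time steps
b_h = torch.zeros(1, h)
W_hq = torch.randn(h, q) * 0.01
b_q = torch.zeros(1, q)

H = torch.zeros(n, h)             # initial hidden state H_0
outputs = []
for t in range(T):
    X_t = torch.randn(n, d)       # stand-in minibatch input at time step t
    H = torch.tanh(X_t @ W_xh + H @ W_hh + b_h)   # reuse H from the previous step
    outputs.append(H @ W_hq + b_q)                # output O_t for the current step

print(len(outputs), outputs[0].shape)             # 5 torch.Size([2, 5])
```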
Character-Level Language Models Based on Recurrent Neural Networks
Our goal is to predict the next token based on the past and current tokens, so we shift the original sequence by one token to serve as the labels. Bengio et al. first proposed using neural networks for language modeling (Bengio et al., 2003).
Let the minibatch size be 1, and let the text sequence in the batch be “machine”. To simplify training in later parts, we consider a character-level language model, tokenizing the text into characters rather than words.

The input is “machin” and the output (label) is “achine”.
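Below is a small sketch of how such input-label pairs can be built by shifting the sequence one character; the toy character vocabulary is a hypothetical stand-in for a real vocabulary object.

```python
seq = "machine"
inputs, labels = seq[:-1], seq[1:]          # "machin" -> "achine"

vocab = {ch: i for i, ch in enumerate(sorted(set(seq)))}   # toy character vocabulary
x = [vocab[ch] for ch in inputs]            # index fed to the model at each step
y = [vocab[ch] for ch in labels]            # target index at each step
print(inputs, labels)
print(x, y)
```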
Backpropagation Through Time
Forward propagation in a recurrent neural network is relatively straightforward. Backpropagation through time (BPTT) (Werbos, 1990) is in fact a specific application of backpropagation within recurrent neural networks.
It requires us to expand the computational graph of the RNN one time step at a time to obtain the dependencies among model variables and parameters. Then, based on the chain rule, backpropagation is applied to compute and store the gradients. Since sequences can be rather long, the dependencies can be rather long as well.
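To make the unrolling concrete, the sketch below builds the computational graph step by step with an untrained recurrence and lets automatic differentiation apply the chain rule back through every time step; the sizes, the random data, and the toy loss are illustrative assumptions.

```python
import torch

n, d, h, T = 2, 4, 8, 6                  # batch, inputs, hidden units, time steps

W_xh = torch.randn(d, h, requires_grad=True)
W_hh = torch.randn(h, h, requires_grad=True)

H = torch.zeros(n, h)
for t in range(T):                       # unroll the recurrence over T time steps
    X_t = torch.randn(n, d)
    H = torch.tanh(X_t @ W_xh + H @ W_hh)

loss = H.sum()                           # toy loss on the final hidden state
loss.backward()                          # gradients flow back through all T steps
print(W_hh.grad.shape)                   # torch.Size([8, 8])
```

Note that the longer the sequence, the deeper this unrolled graph becomes, which motivates the truncation strategies discussed below.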
Analysis of Gradients in RNNs
Full Computation
Truncating Time Steps
Randomized Truncation
Comparing Strategies
Backpropagation Through Time in Detail