Recurrent Neural Networks (RNN)
We have introduced $n$-gram models, in which the conditional probability of word $x_t$ at time step $t$ depends only on the $n-1$ preceding words. If we want to incorporate the possible influence of words earlier than time step $t-(n-1)$ on $x_t$, we need to increase $n$. However, the number of model parameters would then grow exponentially with it, since we would need to store $|\mathcal{V}|^n$ numbers for a vocabulary set $\mathcal{V}$. Hence, rather than modeling $P(x_t \mid x_{t-1}, \ldots, x_{t-n+1})$, it is preferable to use a latent variable model:

$$P(x_t \mid x_{t-1}, \ldots, x_1) \approx P(x_t \mid h_{t-1}),$$

where $h_{t-1}$ is a hidden state, also known as a hidden variable, which stores the sequence information up to time step $t-1$. In general, the hidden state at any time step $t$ can be computed based on the current input $x_t$ and the previous hidden state $h_{t-1}$:

$$h_t = f(x_t, h_{t-1}).$$

For a sufficiently powerful function $f$ in the equation above, the latent variable model is not an approximation. After all, $h_t$ could simply store all the data observed so far. However, this could make both computation and storage expensive.
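To make the recurrence concrete, below is a minimal sketch of such a latent variable model, assuming PyTorch; the update function f, the tanh nonlinearity, and all shapes are illustrative assumptions rather than anything specified in the text.

```python
import torch

# Sketch of h_t = f(x_t, h_{t-1}): the hidden state summarizes everything observed so far.
def f(x_t, h_prev):
    # Any parameterized update works; this toy version just mixes input and state.
    return torch.tanh(x_t + h_prev)

T, d = 5, 3                 # sequence length and feature dimension (arbitrary)
xs = torch.randn(T, d)      # a toy input sequence
h = torch.zeros(d)          # initial hidden state
for t in range(T):
    h = f(xs[t], h)         # the hidden state is updated recurrently
print(h.shape)              # torch.Size([3])
```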
Let’s take a look at an MLP with a single hidden layer.
Let the hidden layer's activation function be $\phi$. Given a minibatch of examples $\mathbf{X} \in \mathbb{R}^{n \times d}$ with batch size $n$ and $d$ inputs, the hidden layer output $\mathbf{H} \in \mathbb{R}^{n \times h}$ is calculated as

$$\mathbf{H} = \phi(\mathbf{X} \mathbf{W}_{xh} + \mathbf{b}_h).$$
In the previous equation, we have the weight parameter $\mathbf{W}_{xh} \in \mathbb{R}^{d \times h}$, the bias parameter $\mathbf{b}_h \in \mathbb{R}^{1 \times h}$, and the number of hidden units $h$ for the hidden layer. So armed, we apply broadcasting during the summation. Next, the hidden layer output $\mathbf{H}$ is used as the input of the output layer, which is given by

$$\mathbf{O} = \mathbf{H} \mathbf{W}_{hq} + \mathbf{b}_q,$$
where $\mathbf{O} \in \mathbb{R}^{n \times q}$ is the output variable, $\mathbf{W}_{hq} \in \mathbb{R}^{h \times q}$ is the weight parameter, and $\mathbf{b}_q \in \mathbb{R}^{1 \times q}$ is the bias parameter of the output layer. If it is a classification problem, we can use $\mathrm{softmax}(\mathbf{O})$ to compute the probability distribution of the output categories.
This is entirely analogous to the regression problem we solved previously, hence we omit the details. Suffice it to say that we can pick feature-label pairs at random and learn the parameters of our network via automatic differentiation and stochastic gradient descent.
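For concreteness, here is a minimal sketch of this single-hidden-layer MLP, assuming PyTorch; the batch size, layer widths, random parameters, and the choice of tanh as the activation $\phi$ are illustrative assumptions, not values from the text.

```python
import torch

# Illustrative sizes: batch size n, inputs d, hidden units h, outputs q.
n, d, h, q = 4, 8, 16, 10
X = torch.randn(n, d)                          # minibatch of examples
W_xh, b_h = torch.randn(d, h), torch.zeros(h)  # hidden layer parameters
W_hq, b_q = torch.randn(h, q), torch.zeros(q)  # output layer parameters

H = torch.tanh(X @ W_xh + b_h)       # hidden layer output; b_h is broadcast over rows
O = H @ W_hq + b_q                   # output layer
probs = torch.softmax(O, dim=1)      # class probabilities for a classification problem
print(H.shape, O.shape, probs.sum(dim=1))  # rows of probs sum to 1
```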
Specifically, the calculation of the hidden layer output of the current time step is determined by the input of the current time step together with the hidden layer output of the previous time step:

$$\mathbf{H}_t = \phi(\mathbf{X}_t \mathbf{W}_{xh} + \mathbf{H}_{t-1} \mathbf{W}_{hh} + \mathbf{b}_h).$$
Therefore, such a hidden layer output is called a hidden state. Since the hidden state reuses the same definition from the previous time step at the current time step, the computation of this equation is recurrent.
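A minimal sketch of this recurrent computation, assuming PyTorch with illustrative shapes and randomly initialized parameters:

```python
import torch

n, d, h = 4, 8, 16                      # batch size, inputs, hidden units (illustrative)
W_xh = torch.randn(d, h) * 0.01
W_hh = torch.randn(h, h) * 0.01
b_h = torch.zeros(h)

H = torch.zeros(n, h)                   # H_0: initial hidden state
for t in range(3):                      # a few time steps
    X_t = torch.randn(n, d)             # minibatch input at time step t
    H = torch.tanh(X_t @ W_xh + H @ W_hh + b_h)  # H_t depends on X_t and H_{t-1}
print(H.shape)                          # torch.Size([4, 16])
```

Note how the same parameters W_xh, W_hh, and b_h are reused at every time step; only the input and the hidden state change.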
Let the minibatch size be 1, and let the text sequence in the batch be "machine". To simplify training in subsequent sections, we consider using a character-level language model, tokenizing the text into characters rather than words.
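As a small illustration, the following sketch tokenizes "machine" at the character level; the way the vocabulary is built here is only an assumption for demonstration purposes.

```python
# Character-level tokenization of the sequence "machine" (batch size 1).
text = "machine"
vocab = {ch: i for i, ch in enumerate(sorted(set(text)))}  # character vocabulary
tokens = [vocab[ch] for ch in text]
print(tokens)  # [5, 0, 1, 3, 4, 6, 2] with this sorted vocabulary
```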
It requires us to expand the computational graph of the recurrent neural network one time step at a time, in order to obtain the dependencies among model variables and parameters. Then, based on the chain rule, backpropagation is applied to compute and store the gradients. Since sequences can be rather long, the dependencies can be rather long, too.
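The following is a minimal sketch of this idea, assuming PyTorch: the recurrence is unrolled over a few time steps and autograd propagates gradients back through every step via the chain rule. The shapes and the toy per-step loss are illustrative assumptions.

```python
import torch

n, d, h = 2, 4, 8
W_xh = torch.randn(d, h, requires_grad=True)
W_hh = torch.randn(h, h, requires_grad=True)
b_h = torch.zeros(h, requires_grad=True)

H = torch.zeros(n, h)
loss = 0.0
for t in range(5):                                 # unroll 5 time steps
    X_t = torch.randn(n, d)
    H = torch.tanh(X_t @ W_xh + H @ W_hh + b_h)
    loss = loss + H.pow(2).mean()                  # toy per-step loss
loss.backward()                                    # gradients flow back through all 5 steps
print(W_hh.grad.shape)                             # torch.Size([8, 8])
```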
Full Computation
Truncating Time Steps
Randomized Truncation
Comparing Strategies
Assume that we have a minibatch of inputs $\mathbf{X}_t \in \mathbb{R}^{n \times d}$ at time step $t$. In other words, for a minibatch of $n$ sequence examples, each row of $\mathbf{X}_t$ corresponds to one example at time step $t$ from the sequence.
Next, denote by $\mathbf{H}_t \in \mathbb{R}^{n \times h}$ the hidden layer output of time step $t$. Unlike with the MLP, here we save the hidden layer output $\mathbf{H}_{t-1}$ from the previous time step and introduce a new weight parameter $\mathbf{W}_{hh} \in \mathbb{R}^{h \times h}$ to describe how to use the hidden layer output of the previous time step in the current time step.
From the relationship between the hidden layer outputs $\mathbf{H}_t$ and $\mathbf{H}_{t-1}$ of adjacent time steps, we know that these variables captured and retained the sequence's historical information up to their current time step, just like the state or memory of the neural network's current time step.
In this example, the model parameters are the concatenation of $\mathbf{W}_{xh}$ and $\mathbf{W}_{hh}$, together with the bias $\mathbf{b}_h$, all of which come from the preceding formula. The hidden state $\mathbf{H}_t$ of the current time step $t$ will participate in computing the hidden state $\mathbf{H}_{t+1}$ of the next time step $t+1$. Moreover, $\mathbf{H}_t$ will also be fed into the fully connected output layer to compute the output of the current time step $t$.
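The following sketch, assuming PyTorch with illustrative shapes, checks numerically that concatenating X_t with H_{t-1} along columns and W_xh with W_hh along rows yields the same hidden state, and then computes the output of the current time step.

```python
import torch

n, d, h, q = 4, 8, 16, 10
X_t, H_prev = torch.randn(n, d), torch.randn(n, h)
W_xh, W_hh, b_h = torch.randn(d, h), torch.randn(h, h), torch.zeros(h)
W_hq, b_q = torch.randn(h, q), torch.zeros(q)

# Separate matrix multiplications vs. a single multiplication of concatenated blocks.
H_t = torch.tanh(X_t @ W_xh + H_prev @ W_hh + b_h)
H_t_cat = torch.tanh(torch.cat((X_t, H_prev), dim=1) @ torch.cat((W_xh, W_hh), dim=0) + b_h)
print(torch.allclose(H_t, H_t_cat, atol=1e-6))  # True: the two formulations agree

O_t = H_t @ W_hq + b_q        # fully connected output layer for the current time step
print(O_t.shape)              # torch.Size([4, 10])
```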
Our goal is to predict the next token based on the past and current tokens, so we shift the original sequence by one token to serve as the labels. Bengio et al. first proposed using neural networks for language modeling.
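As a tiny illustration of this shift-by-one labeling for the character sequence "machine":

```python
# Inputs are all tokens but the last; labels are the same sequence shifted by one token.
text = "machine"
inputs, labels = text[:-1], text[1:]
print(list(zip(inputs, labels)))
# [('m', 'a'), ('a', 'c'), ('c', 'h'), ('h', 'i'), ('i', 'n'), ('n', 'e')]
```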
Forward propagation in a recurrent neural network is relatively straightforward. Backpropagation through time (BPTT) is in fact a specific application of backpropagation in recurrent neural networks.
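As a hedged sketch related to the truncation strategies named above, one common way to limit how far gradients flow is to detach the hidden state between segments; the segment lengths and the toy loss below are illustrative assumptions, not the book's implementation.

```python
import torch

n, d, h = 2, 4, 8
W_xh = torch.randn(d, h, requires_grad=True)
W_hh = torch.randn(h, h, requires_grad=True)
b_h = torch.zeros(h, requires_grad=True)

H = torch.zeros(n, h)
for segment in range(4):                  # 4 truncated segments
    H = H.detach()                        # cut the graph: gradients stop at the segment boundary
    loss = 0.0
    for t in range(3):                    # each segment unrolls 3 time steps
        X_t = torch.randn(n, d)
        H = torch.tanh(X_t @ W_xh + H @ W_hh + b_h)
        loss = loss + H.pow(2).mean()     # toy loss
    loss.backward()                       # backpropagation through time within the segment only
    # In practice one would take an optimizer step and zero the gradients here.
print(W_hh.grad.shape)                    # torch.Size([8, 8])
```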