Understanding Layer Normalization in LSTM, GRU and BiLSTM

Layer normalization can stabilize and speed up model training, and it may also help reduce over-fitting. In this tutorial, we will introduce how it is applied to the outputs of LSTM, GRU and BiLSTM.

Layer Normalization

Layer Normalization is widely used in LSTM, GRU and BiLSTM. Here is a tutorial:

An Explain to Layer Normalization in Neural Networks – Machine Learning Tutorial

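In TensorFlow 1.x, we can use the tf.contrib.layers.layer_norm() function to normalize a layer. The tf.contrib module was removed in TensorFlow 2.x, where tf.keras.layers.LayerNormalization serves the same purpose.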

Layer Normalization Explained for Beginners – Deep Learning Tutorial

We will use the output of BiLSTM as an example.

Suppose the output of a BiLSTM at time step \(t\) is \(h_t\). Its layer normalization can be computed as follows:

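\[\mu_t = \frac{1}{H}\sum_{i=1}^{H} h_t^i, \qquad \sigma_t = \sqrt{\frac{1}{H}\sum_{i=1}^{H}\left(h_t^i - \mu_t\right)^2}\]

\[LN(h_t) = \frac{g}{\sigma_t} \odot (h_t - \mu_t) + b\]

Here \(H\) is the number of hidden units in \(h_t\), \(\mu_t\) and \(\sigma_t\) are the mean and standard deviation of \(h_t\) over those units, \(g\) and \(b\) are learned scale and offset parameters, and \(\odot\) is element-wise multiplication.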
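To make the computation concrete, here is a minimal NumPy sketch. It assumes the BiLSTM output is stored as an array of shape (batch, time, features) and divides by the standard deviation (the square root of the variance). The function name layer_norm, the small constant eps added for numerical stability, and the example shapes are assumptions made for illustration only.

```python
import numpy as np

def layer_norm(h, g, b, eps=1e-5):
    """Normalize each hidden vector h_t over its feature (last) axis."""
    mu = h.mean(axis=-1, keepdims=True)                   # mean of each h_t
    sigma = np.sqrt(h.var(axis=-1, keepdims=True) + eps)  # std of each h_t
    return g * (h - mu) / sigma + b                       # scale and offset

# Hypothetical BiLSTM output: 2 sequences, 5 time steps,
# 8 features (4 forward + 4 backward hidden units).
rng = np.random.default_rng(0)
h = rng.normal(loc=3.0, scale=2.0, size=(2, 5, 8))
g = np.ones(8)   # scale, initialized to 1
b = np.zeros(8)  # offset, initialized to 0

out = layer_norm(h, g, b)
print(out.shape)
```

With g set to ones and b set to zeros, every normalized \(h_t\) has a mean close to 0 and a standard deviation close to 1 over its features. The statistics are computed separately for each time step and each example, so they do not depend on the batch size.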

As for LSTM and GRU, we can compute their layer normalization in the same way, because they are also sequence models that output a hidden state \(h_t\) at each time step.
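As a rough illustration, the same normalization can be applied to the outputs of all three models. The sketch below assumes TensorFlow 2.x and uses tf.keras.layers.LayerNormalization rather than the TensorFlow 1.x function; the input shape, the number of units and the variable names are hypothetical.

```python
import numpy as np
import tensorflow as tf

# Hypothetical input: 2 sequences, 5 time steps, 3 input features.
data = np.random.default_rng(0).normal(size=(2, 5, 3))
x = tf.constant(data, dtype=tf.float32)

# return_sequences=True keeps one hidden vector h_t per time step.
recurrent_layers = {
    "lstm": tf.keras.layers.LSTM(4, return_sequences=True),
    "gru": tf.keras.layers.GRU(4, return_sequences=True),
    "bilstm": tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(4, return_sequences=True)),
}

normalized = {}
for name, layer in recurrent_layers.items():
    h = layer(x)
    # axis=-1 normalizes over the features of each h_t. A separate
    # normalization layer is created for each model because g and b
    # are sized to that model's feature dimension.
    normalized[name] = tf.keras.layers.LayerNormalization(axis=-1)(h)
    print(name, normalized[name].shape)
```

The BiLSTM output has twice as many features as the LSTM and GRU outputs, because the forward and backward hidden states are concatenated before normalization.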

