Gated recurrent unit
Gated recurrent units (GRUs) are a gating mechanism in recurrent neural networks, introduced in 2014 by Kyunghyun Cho et al.[1] The GRU is like a long short-term memory (LSTM) with a gating mechanism to input or forget certain features,[2] but it lacks a context vector and an output gate, resulting in fewer parameters than the LSTM.[3] The GRU's performance on certain tasks of polyphonic music modeling, speech signal modeling and natural language processing was found to be similar to that of the LSTM.[4][5] These experiments showed that gating is indeed helpful in general, but Bengio's team reached no concrete conclusion on which of the two gating units was better.[6]
Architecture
There are several variations on the full gated unit, with gating done using the previous hidden state and the bias in various combinations, as well as a simplified form called the minimal gated unit.[7]
The operator \odot denotes the Hadamard product in the following.
Fully gated unit
Initially, for t = 0, the output vector is h_0 = 0.

\begin{align}
z_t &= \sigma(W_z x_t + U_z h_{t-1} + b_z)\\
r_t &= \sigma(W_r x_t + U_r h_{t-1} + b_r)\\
\hat{h}_t &= \phi(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h)\\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \hat{h}_t
\end{align}

Variables (d denotes the number of input features and e the number of output features):

x_t \in \mathbb{R}^d: input vector
h_t \in \mathbb{R}^e: output vector
\hat{h}_t \in \mathbb{R}^e: candidate activation vector
z_t \in (0,1)^e: update gate vector
r_t \in (0,1)^e: reset gate vector
W \in \mathbb{R}^{e \times d}, U \in \mathbb{R}^{e \times e} and b \in \mathbb{R}^e: parameter matrices and vector which need to be learned during training

Activation functions:

\sigma: the original is a logistic function.
\phi: the original is a hyperbolic tangent.

Alternative activation functions are possible, provided that \sigma(x) \in [0, 1].
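For concreteness, the recurrence can be written out directly. The following is a minimal NumPy sketch of a single fully gated GRU step; the function and parameter names mirror the notation above and are illustrative rather than taken from any particular library:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W_z, U_z, b_z, W_r, U_r, b_r, W_h, U_h, b_h):
    """One fully gated GRU step: returns the new hidden state h_t."""
    z_t = sigmoid(W_z @ x_t + U_z @ h_prev + b_z)             # update gate
    r_t = sigmoid(W_r @ x_t + U_r @ h_prev + b_r)             # reset gate
    h_hat = np.tanh(W_h @ x_t + U_h @ (r_t * h_prev) + b_h)   # candidate activation
    return (1.0 - z_t) * h_prev + z_t * h_hat                 # new hidden state

# Example: d = 3 input features, e = 4 output features, a length-5 sequence.
d, e = 3, 4
rng = np.random.default_rng(0)
shapes = [(e, d), (e, e), (e,), (e, d), (e, e), (e,), (e, d), (e, e), (e,)]
params = [rng.standard_normal(s) for s in shapes]
h = np.zeros(e)                          # h_0 = 0
for x in rng.standard_normal((5, d)):
    h = gru_step(x, h, *params)

Here the elementwise product * plays the role of the Hadamard product \odot, and the loop carries h_t forward through the sequence.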
Alternate forms can be created by changing z_t and r_t:[8]
- Type 1, each gate depends only on the previous hidden state and the bias.
\begin{align}
z_t &= \sigma(U_z h_{t-1} + b_z)\\
r_t &= \sigma(U_r h_{t-1} + b_r)
\end{align}
- Type 2, each gate depends only on the previous hidden state.
\begin{align}
z_t &= \sigma(U_z h_{t-1})\\
r_t &= \sigma(U_r h_{t-1})
\end{align}
- Type 3, each gate is computed using only the bias.
\begin{align}
z_t &= \sigma(b_z)\\
r_t &= \sigma(b_r)
\end{align}
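Written as code, the three variants change only how the gates are computed; the candidate activation and output equations stay as in the fully gated unit. A small NumPy sketch (with illustrative names) of the update gate z_t for each type follows; the reset gate r_t is computed analogously:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def update_gate_type1(h_prev, U_z, b_z):
    # Type 1: previous hidden state and bias.
    return sigmoid(U_z @ h_prev + b_z)

def update_gate_type2(h_prev, U_z):
    # Type 2: previous hidden state only.
    return sigmoid(U_z @ h_prev)

def update_gate_type3(b_z):
    # Type 3: bias only, so the gate is constant over time.
    return sigmoid(b_z)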
Minimal gated unit
The minimal gated unit (MGU) is similar to the fully gated unit, except that the update and reset gate vectors are merged into a single forget gate. This also implies that the equation for the output vector must be changed:[9]
\begin{align}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f)\\
\hat{h}_t &= \phi(W_h x_t + U_h (f_t \odot h_{t-1}) + b_h)\\
h_t &= (1 - f_t) \odot h_{t-1} + f_t \odot \hat{h}_t
\end{align}
Variables:

x_t: input vector
h_t: output vector
\hat{h}_t: candidate activation vector
f_t: forget vector
W, U and b: parameter matrices and vector
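As with the fully gated unit, the MGU step can be sketched in NumPy; the names below mirror the notation above and are illustrative only:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mgu_step(x_t, h_prev, W_f, U_f, b_f, W_h, U_h, b_h):
    """One minimal gated unit (MGU) step: returns the new hidden state h_t."""
    f_t = sigmoid(W_f @ x_t + U_f @ h_prev + b_f)             # forget gate
    h_hat = np.tanh(W_h @ x_t + U_h @ (f_t * h_prev) + b_h)   # candidate activation
    return (1.0 - f_t) * h_prev + f_t * h_hat                 # new hidden state

Merging the two gates removes one full set of W, U and b parameters relative to the fully gated unit.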
Light gated recurrent unit
The light gated recurrent unit (LiGRU)[4] removes the reset gate altogether, replaces tanh with the ReLU activation, and applies batch normalization (BN):
\begin{align}
z_t &= \sigma(\operatorname{BN}(W_z x_t) + U_z h_{t-1})\\
\tilde{h}_t &= \operatorname{ReLU}(\operatorname{BN}(W_h x_t) + U_h h_{t-1})\\
h_t &= z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t
\end{align}
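A minimal NumPy sketch of one LiGRU step is given below. Batch normalization is shown in its inference form with precomputed statistics; the helper names and the (mean, variance, scale, shift) parameterization are assumptions for illustration, not the formulation of any specific implementation:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def batch_norm(a, mean, var, gamma, beta, eps=1e-5):
    # Inference-mode batch normalization with precomputed statistics (assumed form).
    return gamma * (a - mean) / np.sqrt(var + eps) + beta

def ligru_step(x_t, h_prev, W_z, U_z, W_h, U_h, bn_z, bn_h):
    """One LiGRU step: returns the new hidden state h_t."""
    z_t = sigmoid(batch_norm(W_z @ x_t, *bn_z) + U_z @ h_prev)               # update gate
    h_tilde = np.maximum(0.0, batch_norm(W_h @ x_t, *bn_h) + U_h @ h_prev)   # ReLU candidate
    return z_t * h_prev + (1.0 - z_t) * h_tilde                              # new hidden state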
LiGRU has been studied from a Bayesian perspective.[10] This analysis yielded a variant called light Bayesian recurrent unit (LiBRU), which showed slight improvements over the LiGRU on speech recognition tasks.
Notes and References
- Cho, Kyunghyun; van Merrienboer, Bart; Bahdanau, Dzmitry; Bougares, Fethi; Schwenk, Holger; Bengio, Yoshua (2014). "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation". Association for Computational Linguistics. arXiv:1406.1078.
- Gers, Felix; Schmidhuber, Jürgen; Cummins, Fred (1999). "Learning to forget: Continual prediction with LSTM". 9th International Conference on Artificial Neural Networks: ICANN '99. pp. 850–855. doi:10.1049/cp:19991218. ISBN 0-85296-721-7.
- "Recurrent Neural Network Tutorial, Part 4 – Implementing a GRU/LSTM RNN with Python and Theano – WildML". Wildml.com. 2015-10-27. Retrieved May 18, 2016. Archived from the original on 2021-11-10 at https://web.archive.org/web/20211110112626/http://www.wildml.com/2015/10/recurrent-neural-network-tutorial-part-4-implementing-a-grulstm-rnn-with-python-and-theano/.
- Ravanelli, Mirco; Brakel, Philemon; Omologo, Maurizio; Bengio, Yoshua (2018). "Light Gated Recurrent Units for Speech Recognition". IEEE Transactions on Emerging Topics in Computational Intelligence. 2 (2): 92–102. arXiv:1803.10225. doi:10.1109/TETCI.2017.2762739.
- Su, Yuanhang; Kuo, C.-C. Jay (2019). "On extended long short-term memory and dependent bidirectional recurrent neural network". Neurocomputing. 356: 151–161. arXiv:1803.01686. doi:10.1016/j.neucom.2019.04.044.
- Chung, Junyoung; Gulcehre, Caglar; Cho, KyungHyun; Bengio, Yoshua (2014). "Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling". arXiv:1412.3555 [cs.NE].
- Chung, Junyoung; Gulcehre, Caglar; Cho, KyungHyun; Bengio, Yoshua (2014). "Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling". arXiv:1412.3555 [cs.NE].
- Dey, Rahul; Salem, Fathi M. (2017). "Gate-Variants of Gated Recurrent Unit (GRU) Neural Networks". arXiv:1701.05923 [cs.NE].
- Heck, Joel; Salem, Fathi M. (2017). "Simplified Minimal Gated Unit Variations for Recurrent Neural Networks". arXiv:1701.03452 [cs.NE].
- Bittar, Alexandre; Garner, Philip N. (May 2021). "A Bayesian Interpretation of the Light Gated Recurrent Unit". 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Toronto, ON, Canada: IEEE. pp. 2965–2969. doi:10.1109/ICASSP39728.2021.9414259.