This post is dedicated to the Long Short-Term Memory (LSTM) model and to how the concept came about. The review is based on the paper “Long Short-Term Memory” by Hochreiter and Schmidhuber (1997).

Issues Regarding RNN

Although the RNN introduced a technique that is still in use today, the model has some inherent issues. An RNN can solve problems using information from earlier steps of a sequence, but that information may be lost by the time the end of the sequence is reached. With a simple training and test set the issue could be worked around by changing hyperparameters, but with more complex datasets it was unlikely to be resolved.

To be specific, according to the paper, there are two failure cases for the backpropagated error with the traditional methods:

  1. Blow up,
  2. Vanish.

In addition, two conflicts are mentioned. One is the “Input Weight Conflict”: if a unit receives input from all the other units at each time step, it is likely to collect useless information, overwriting the state it is supposed to preserve. The other is the “Output Weight Conflict”: if every unit reads from irrelevant units at each time step, a huge influx of useless information is likely to occur. These two conflicts suggest that the model should learn to store and read out only useful information.

Constant Error Carousel

The problem introduced above can be explained through the following equation:

\[\frac{\partial \varrho_v (t-q)}{\partial \varrho_u(t)} = \begin{cases} f'_v(net_v(t-1))\,w_{uv}, & q = 1 \\ f'_v(net_v(t-q))\sum^n_{l=1} \frac{\partial \varrho_l (t-q+1)}{\partial \varrho_u(t)}\,w_{lv}, & q > 1 \end{cases}\]

This is the factor by which an error occurring at unit $u$ at time $t$ is scaled when it is propagated back $q$ time steps to unit $v$. Here $net_v$ represents the net input to unit $v$, $f_v$ its activation function, and $w_{uv}$ the weight on the connection from unit $v$ to unit $u$.

With $l_0 = u$ and $l_q = v$, expanding the recursion gives:

\[\frac{\partial \varrho_v (t-q)}{\partial \varrho_u(t)} = \sum^n_{l_1 = 1} \ldots \sum^n_{l_{q-1} = 1} \prod^q_{m=1} f'_{l_m}(net_{l_m}(t-m))\,w_{l_m l_{m-1}}\]

This shows that the error back flow is determined by a sum of $n^{q-1}$ terms of the form $\prod^q_{m=1} f'_{l_m}(net_{l_m}(t-m))\,w_{l_m l_{m-1}}$. Since these terms may have different signs, the summation is not necessarily positive.

If $\left\vert f'_{l_m}(net_{l_m}(t-m))\,w_{l_m l_{m-1}} \right\vert > 1$ for all $m$, the product grows exponentially with $q$ and the error blows up. If this magnitude is $< 1$ for all $m$, the product decays exponentially and the error vanishes.
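As a rough numerical illustration (not taken from the paper; the step count and magnitudes are arbitrary), the sketch below multiplies together $q$ such factors of fixed magnitude:

```python
import numpy as np

# Rough illustration: over q steps the backpropagated error is scaled by a
# product of q factors of the form f'(net) * w (fixed magnitudes assumed here).
q = 50
for factor in (0.9, 1.1):
    scaling = np.prod(np.full(q, factor))
    print(f"|f'(net) * w| = {factor}: error scaled by {scaling:.2e} after {q} steps")
# 0.9 -> ~5.15e-03: the error vanishes
# 1.1 -> ~1.17e+02: the error blows up
```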

LSTM instead forces this scaling factor to be exactly 1 for a unit $j$ with a single self-connection, as shown in the equation below:

\[f'_{j}(net_j(t))w_{jj} = 1\]

This constraint is satisfied by an identity activation function $f_j$ together with a fixed self-connection weight $w_{jj} = 1.0$. A unit built this way is called the Constant Error Carousel (CEC), and it is one of the core ideas applied in the model.
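A minimal sketch of such a unit in isolation is given below (a toy setup, not the paper's full memory cell with its gates):

```python
# Constant Error Carousel sketch: a single self-connected unit with identity
# activation and self-weight w_jj = 1.0 (toy inputs assumed for illustration).
w_jj = 1.0
state = 0.0
inputs = [0.5, -0.2, 0.1, 0.0, 0.3]

for x in inputs:
    state = w_jj * state + x      # f_j is the identity, so f_j(net) = net
print("final state:", state)      # 0.7 -- the unit simply accumulates its inputs

# Because f'_j(net) = 1 and w_jj = 1, the factor f'_j(net_j) * w_jj that scales
# the backpropagated error is exactly 1 at every step, so the error neither
# vanishes nor blows up, no matter how many steps it travels back.
```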

Basic Structure of LSTM

The basic structure of LSTM is shown in the diagram below.

A Simple Diagram of LSTM, retrieved from Google

LSTM Insights

The LSTM cell described below can be divided into a cell state and three gates (Forget, Input, and Output), each with its own weights and bias.

Cell State

The first concept is the cell state.

Cell State

This is the pathway along which information flows through the model. Information is added or removed at the “gates”.

Forget Gate

The first gate is the Forget Gate. It decides whether the information from the previous cell state is kept or forgotten.

Forget Gate

The inputs are $h_{t-1}$, the hidden state from the previous cell, and $x_t$, the input at the current step. The gate value is calculated with a sigmoid function: a value of 1 passes all the information along, while a value of 0 discards all of it. The hidden state is what is considered the “short-term memory”.

In NLP terms, this can be thought of as forgetting all other information about a noun except, for example, whether it is singular or plural.
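A minimal NumPy sketch of the forget gate computation follows; the toy dimensions, random weights, and the single matrix $W_f$ applied to the concatenation $[h_{t-1}, x_t]$ are assumptions for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy dimensions and random weights, assumed purely for illustration.
hidden_size, input_size = 4, 3
rng = np.random.default_rng(0)
W_f = rng.standard_normal((hidden_size, hidden_size + input_size))
b_f = np.zeros(hidden_size)

h_prev = rng.standard_normal(hidden_size)    # h_{t-1}: previous hidden state
x_t    = rng.standard_normal(input_size)     # x_t: input at the current step

# Forget gate: f_t = sigmoid(W_f . [h_{t-1}, x_t] + b_f)
f_t = sigmoid(W_f @ np.concatenate([h_prev, x_t]) + b_f)
print(f_t)   # each entry is in (0, 1): 1 keeps the old cell state, 0 discards it
```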

Input Gate

The next gate is the Input Gate, which decides what new information from the current step is stored in the cell state.

Input Gate

In NLP terms, this can be thought of as adding the grammatical gender of the new noun to the cell state, to replace the information being forgotten.

  • $b$ with a subscript stands for the bias of each calculation (e.g. $b_i$ for the Input Gate).
  • $W_i * [h_{t-1}, x_t]$ is the same as $W_{xh_i}x_t + W_{hh_i}h_{t-1}$: the weight matrix applied to the concatenation of the hidden state $h_{t-1}$ and the input $x_t$.
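A minimal NumPy sketch of the Input Gate step is given below; the toy shapes and the candidate-value parameters $W_C$ and $b_C$ are assumptions for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy dimensions and random weights, assumed purely for illustration.
hidden_size, input_size = 4, 3
rng = np.random.default_rng(1)
W_i = rng.standard_normal((hidden_size, hidden_size + input_size))
W_C = rng.standard_normal((hidden_size, hidden_size + input_size))
b_i = np.zeros(hidden_size)
b_C = np.zeros(hidden_size)

h_prev = rng.standard_normal(hidden_size)    # h_{t-1}
x_t    = rng.standard_normal(input_size)     # x_t
hx     = np.concatenate([h_prev, x_t])       # [h_{t-1}, x_t]

# Input gate: i_t = sigmoid(W_i . [h_{t-1}, x_t] + b_i) decides how much to write.
i_t = sigmoid(W_i @ hx + b_i)
# Candidate values: C~_t = tanh(W_C . [h_{t-1}, x_t] + b_C), the new information itself.
C_tilde = np.tanh(W_C @ hx + b_C)
print(i_t)
print(C_tilde)
```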

Update

This is the process of updating the old cell state $C_{t-1}$ to the new cell state $C_t$.

Update

The new cell state $C_t$ is the sum of $f_t * C_{t-1}$, the old state scaled by the Forget Gate, and $i_t * \widetilde{C}_t$, the candidate values scaled by the Input Gate.

In NLP terms, this is where the information kept by the forget layer (e.g. whether the noun is singular) and the new information from the input layer (e.g. its grammatical gender) are combined into the cell state.
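Below is a small sketch of this update, with made-up gate outputs used purely for illustration:

```python
import numpy as np

# Toy values standing in for the gate outputs (assumed for illustration).
C_prev  = np.array([ 0.5, -1.0,  2.0])   # C_{t-1}: previous cell state
f_t     = np.array([ 0.9,  0.1,  1.0])   # Forget Gate output
i_t     = np.array([ 0.2,  0.8,  0.0])   # Input Gate output
C_tilde = np.array([ 1.0, -0.5,  0.3])   # candidate values C~_t

# Cell state update: C_t = f_t * C_{t-1} + i_t * C~_t
C_t = f_t * C_prev + i_t * C_tilde
print(C_t)   # [ 0.65 -0.5   2.  ]
```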

Output Gate

The last gate determines the output value of the cell.

Output Gate

The new hidden state $h_t$ is calculated by transforming $C_t$ with a $\tanh$ activation function and scaling it by the gate's sigmoid output.
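A minimal sketch of the Output Gate follows; the toy shapes, the parameters $W_o$ and $b_o$, and the cell state values are assumptions for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy dimensions and random values, assumed purely for illustration.
hidden_size, input_size = 4, 3
rng = np.random.default_rng(2)
W_o = rng.standard_normal((hidden_size, hidden_size + input_size))
b_o = np.zeros(hidden_size)

h_prev = rng.standard_normal(hidden_size)    # h_{t-1}
x_t    = rng.standard_normal(input_size)     # x_t
C_t    = rng.standard_normal(hidden_size)    # cell state from the update step

# Output gate: o_t = sigmoid(W_o . [h_{t-1}, x_t] + b_o)
o_t = sigmoid(W_o @ np.concatenate([h_prev, x_t]) + b_o)
# New hidden state: h_t = o_t * tanh(C_t)
h_t = o_t * np.tanh(C_t)
print(h_t)
```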

Further Improvements & Conclusion

Indeed, the paper itself suggests some room for improvement:

  1. An inability to solve certain problems, such as strongly delayed XOR problems; this occurs with the truncated backpropagation version of the algorithm,
  2. Additional memory is required for the input and output gates,
  3. The smaller the data, the more likely the model is to perform just like a backpropagation-trained feed-forward neural network.

The model was tested against several traditional approaches, including but not limited to the Real-Time Recurrent Learning (RTRL) algorithm. The results on several datasets, including but not limited to the “embedded Reber grammar” and the “2-sequence problem,” demonstrate how this algorithm improves on the previous models.

The actual paper can be found here.
