The Sequence-to-Sequence model is built from interconnected LSTM networks, which are described in this article. This article demonstrates an end-to-end approach to sequence learning that makes minimal assumptions about the structure of the sequences.

Problems with the Existing Networks

Before this paper appeared in 2014, Deep Neural Networks were already effective at solving complex problems, thanks to their flexibility in choosing parameters and supervised training with backpropagation. However, they came with an important limitation:

DNNs can only be applied to problems whose inputs and outputs are vectors of fixed dimensionality.

Because of this limitation, sequences whose lengths are not known in advance could not be handled easily by the existing models.

Basic structure of Seq2Seq

To address this issue, the LSTM architecture is applied. One LSTM reads the input sequence one timestep at a time to obtain a fixed-dimensional vector representation. Then, another LSTM extracts the output sequence from that vector.

The diagram for such a network is as follows:

[Figure: the Seq2Seq encoder-decoder architecture]

The model takes the sentence “ABC” as input and produces “WXYZ” as output; the last token is the end-of-sentence (<EOS>) token. The LSTM reads the input sentence in reverse, which introduces many short-term dependencies between the source and the target and makes the optimization problem easier.
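
To make this structure concrete, below is a minimal encoder-decoder sketch in PyTorch. It is not the paper's implementation (which used deep, multi-layer LSTMs); the class, dimensions, and variable names here are illustrative only.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Minimal sketch: one LSTM encodes the (reversed) source sentence into its
    final hidden state, and a second LSTM decodes the target from that state."""

    def __init__(self, src_vocab, tgt_vocab, emb_dim=256, hidden_dim=512):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, emb_dim)
        self.tgt_embed = nn.Embedding(tgt_vocab, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # Encode: the final (h, c) state is the fixed-dimensional summary of the input.
        _, state = self.encoder(self.src_embed(src_ids))
        # Decode with teacher forcing: the ground-truth previous target token is
        # fed at each step, conditioned on the encoder's summary state.
        dec_out, _ = self.decoder(self.tgt_embed(tgt_ids), state)
        return self.out(dec_out)  # logits over the target vocabulary at every step
```

At inference time the decoder would instead feed its own previous prediction back in at each step and stop once it emits the <EOS> token.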

Inside the Model

The simplest strategy for general sequence learning is to map the input sentence into a fixed-sized vector using an RNN and then map that vector to the target sequence with another RNN. However, plain RNNs suffer from the long-term dependency problem, in which information vanishes over long sequences. The LSTM, by contrast, is known to learn problems with long-range temporal dependencies, which makes it a natural choice here.

The goal of the LSTM is to estimate the conditional probability,

\[p(y_1, \ldots, y_{T'} \mid x_1, \ldots, x_T)\]

in which $(x_1, \ldots, x_T)$ is the input sequence and $(y_1, \ldots, y_{T'})$ is the output sequence, whose length $T'$ may differ from $T$.
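
For completeness, the paper computes this probability by first obtaining a fixed-dimensional representation $v$ of the input (the last hidden state of the encoder LSTM) and then factorizing the output distribution as in a standard LSTM language model:

\[p(y_1, \ldots, y_{T'} \mid x_1, \ldots, x_T) = \prod_{t=1}^{T'} p(y_t \mid v, y_1, \ldots, y_{t-1})\]

Each factor is a softmax over the words in the output vocabulary, and the <EOS> token at the end of every sequence lets the model define a distribution over sequences of all possible lengths.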

In the proposed model, two separate LSTMs are used, one for the input sequence and one for the output sequence. This increases the number of model parameters at negligible computational cost and makes it natural to train the model on multiple language pairs. In addition, the order of the words in the input sentence is reversed, which noticeably improves performance (see the short example below).
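
As a small illustration (using the paper's "ABC" to "WXYZ" example, with a hypothetical helper name), reversing the source side of a training pair is a one-line preprocessing step:

```python
def prepare_pair(src_tokens, tgt_tokens):
    """Reverse the source sentence and append <EOS> to the target,
    mirroring the preprocessing described above."""
    return list(reversed(src_tokens)), tgt_tokens + ["<EOS>"]

# The encoder actually sees "C B A", so the first source word ends up
# close to the first target word, creating short-term dependencies.
print(prepare_pair(["A", "B", "C"], ["W", "X", "Y", "Z"]))
# (['C', 'B', 'A'], ['W', 'X', 'Y', 'Z', '<EOS>'])
```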

Training

The following parameters / hyperparameters are used (a training-loop sketch follows the list):

  • LSTM parameters are initialized uniformly in (-0.08, 0.08),
  • SGD is used with a fixed learning rate of 0.7,
  • the batch size is 128,
  • the norm of the gradient is rescaled when it grows too large, to prevent exploding gradients,
  • all sentences in a minibatch are chosen to have roughly the same length, which speeds up training.
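
A minimal sketch of how these choices could translate into code, assuming the Seq2Seq module sketched earlier and using illustrative vocabulary sizes, sentence lengths, and random token ids in place of real data:

```python
import torch
import torch.nn as nn

model = Seq2Seq(src_vocab=10_000, tgt_vocab=10_000)  # illustrative vocabulary sizes

# Initialize all parameters uniformly in (-0.08, 0.08).
for p in model.parameters():
    nn.init.uniform_(p, -0.08, 0.08)

# Plain SGD with a fixed learning rate of 0.7.
optimizer = torch.optim.SGD(model.parameters(), lr=0.7)

# One dummy minibatch of 128 similar-length sentence pairs (random token ids).
src_ids = torch.randint(0, 10_000, (128, 20))  # reversed source sentences
tgt_in = torch.randint(0, 10_000, (128, 22))   # target inputs (shifted right)
tgt_out = torch.randint(0, 10_000, (128, 22))  # target outputs (end with <EOS>)

optimizer.zero_grad()
logits = model(src_ids, tgt_in)                         # (batch, time, vocab)
loss = nn.functional.cross_entropy(logits.transpose(1, 2), tgt_out)
loss.backward()
# Rescale the gradient whenever its norm exceeds a threshold (the paper uses 5),
# preventing the gradient from exploding.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
optimizer.step()
```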

Results

The results are evaluated with the BLEU score. The experiments were done on English-to-French translation, using the WMT’14 English-to-French test set.
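
BLEU measures n-gram overlap between the system translations and reference translations. As a rough illustration (the sentences below are made up), it can be computed with the sacrebleu package:

```python
import sacrebleu

# Hypothetical system outputs and reference translations, one string per sentence.
hypotheses = ["the cat sat on the mat"]
references = [["the cat is sitting on the mat"]]  # one inner list per reference set

score = sacrebleu.corpus_bleu(hypotheses, references)
print(score.score)  # corpus-level BLEU; higher is better
```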

[Table: BLEU scores on the WMT’14 English-to-French test set]

The tables summarize the results. Words outside the existing vocabulary list could not be handled. Even so, the results demonstrate that the LSTM is able to compete with the best existing results.

The actual paper, “Sequence to Sequence Learning with Neural Networks” (Sutskever et al., 2014), can be found here.