The introduction of the transformer architecture has opened up considerable room for improvement in NLP. This article explains how this mechanism has driven progress in the field by walking through the Generative Pre-trained Transformer (GPT) models.

GPT 1

Before GPT-1, the main difficulty came from the reliance on labeled data. Most available natural language data is unstructured, so annotating it requires an excessive amount of time and money. To deal with this issue, it was necessary to make use of unlabeled text data. However, doing so raised problems of its own, including:

  1. It is unclear which optimization objective is most effective for learning useful text representations,
  2. Even if a model can be trained on unlabeled data, there is no agreed-upon way to transfer what it has learned to the target (test) task.

The decision was made to combine unsupervised pre-training with supervised fine-tuning, i.e., a semi-supervised approach. The resulting model is known as GPT-1.

Training #1 - Unsupervised Pre-training

This stage uses an unlabeled corpus of tokens $u = \{u_1, u_2, \ldots, u_n\}$ to maximize the usual language-modeling objective, the log-likelihood $L_1(u)$:

\[L_1(u) = \sum_i \log P(u_i \mid u_{i-k}, \ldots, u_{i-1}; \theta)\]

Here, $k$ is the size of the context window, so the context for token $u_i$ is the preceding tokens $(u_{i-k}, \ldots, u_{i-1})$. $\theta$ denotes the model parameters, and $P$ is the conditional probability modeled by the neural network. In short, the objective sums the log-probability of each upcoming token given the words that precede it.
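
As a concrete illustration, the sketch below computes this objective over a token sequence. It assumes a hypothetical `model(context)` function that returns a probability distribution over the vocabulary for the next token; the function name and interface are made up for this example.

```python
import numpy as np

def log_likelihood_L1(tokens, model, k):
    """Sum of log P(u_i | u_{i-k}, ..., u_{i-1}) over a token sequence.

    `model(context)` is assumed to return a probability distribution
    (a 1-D array over the vocabulary) for the next token given `context`.
    """
    total = 0.0
    for i in range(1, len(tokens)):
        context = tokens[max(0, i - k):i]      # at most k preceding tokens
        probs = model(context)                 # P(. | context; theta)
        total += np.log(probs[tokens[i]])      # log-prob of the actual next token
    return total
```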

[Figure: GPT-1 decoder-only transformer architecture]

In GPT, the transformer consists only of decoder blocks, so no encoder-decoder attention is required.

Going through the process, first the input to the masked self-attention stack, $h_0$, is required. It is calculated as $h_0 = x W_e + W_p$, where $x$ is the input token matrix, $W_e$ is the token embedding matrix, and $W_p$ is the positional embedding matrix.

GPT is composed of a stack of decoder blocks; the paper stacks 12 of them, and the $i^{th}$ decoder block produces its hidden state $h_i$ from the hidden state of the previous block, $h_{i-1}$.

The output probabilities are obtained by taking the dot product of the last ($n^{th}$) decoder block's hidden state with the transposed embedding matrix $W_e^T$ and applying the softmax function, i.e., $P(u) = \mathrm{softmax}(h_n W_e^T)$.

With this conditional probability at the output, the likelihood $L_1(u)$ can be maximized.
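
To make the flow concrete, here is a minimal PyTorch sketch of this decoder-only stack. It is only an illustration under simplifying assumptions: PyTorch's `TransformerEncoderLayer` with a causal mask stands in for the paper's decoder block, and the layer sizes simply mirror the paper's defaults; it is not the original implementation.

```python
import torch
import torch.nn as nn

class MiniGPT(nn.Module):
    """A decoder-only stack in the spirit of GPT-1 (illustrative sketch only)."""
    def __init__(self, vocab_size, max_len, d_model=768, n_layers=12, n_heads=12):
        super().__init__()
        self.W_e = nn.Embedding(vocab_size, d_model)   # token embedding matrix W_e
        self.W_p = nn.Embedding(max_len, d_model)      # positional embedding matrix W_p
        # An encoder layer with a causal mask behaves like a decoder block
        # without encoder-decoder attention.
        block = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(block, num_layers=n_layers)

    def forward(self, x):                              # x: (batch, seq) of token ids
        positions = torch.arange(x.size(1), device=x.device)
        h0 = self.W_e(x) + self.W_p(positions)         # h_0 = x W_e + W_p
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1)).to(x.device)
        hn = self.blocks(h0, mask=mask)                # h_i = decoder_block(h_{i-1})
        logits = hn @ self.W_e.weight.T                # h_n W_e^T (weight tying)
        return torch.softmax(logits, dim=-1)           # P(u) = softmax(h_n W_e^T)
```

In practice the softmax would usually be folded into the loss function, but it is kept explicit here to match the equations above.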

Training #2 - Supervised Fine-Tuning

After pre-training, this step adjusts the parameters for the target task.

Each example in the labeled dataset $C$ has two components:

  1. an input token sequence $\{x^1, \ldots, x^m\}$,
  2. its label $y$.

First, the input tokens are fed through the pre-trained model to obtain the last decoder block's hidden state $h_l^m$. This hidden state is then passed through a linear layer with parameters $W_y$ and a softmax to compute the label probabilities:

\[P(y \mid x^1, \ldots, x^m) = \mathrm{softmax}(h_l^m W_y)\]

This yields a probability distribution over the labels, so the model can be trained to predict $y$ by maximizing the supervised objective $L_2(C) = \sum_{(x, y)} \log P(y \mid x^1, \ldots, x^m)$.

In the paper, the unsupervised pre-training objective $L_1$ is also added as an auxiliary objective during fine-tuning. Applied to the supervised corpus as $L_1(C)$ and weighted by $\lambda$, it acts as a regularizer. The overall objective is:

\[L_3(C) = L_2(C) + \lambda \cdot L_1(C)\]
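
A rough sketch of one fine-tuning step under this combined objective is shown below, expressed as losses to minimize (negative log-likelihoods). It assumes a hypothetical `model(x)` that returns the last decoder block's hidden states and exposes its embedding table as `model.W_e`; `W_y` is the task-specific linear head. These names are assumptions for the example, not the paper's code.

```python
import torch
import torch.nn.functional as F

def fine_tune_step(model, W_y, x, y, lam=0.5):
    """One fine-tuning step: supervised loss L2 plus auxiliary LM loss L1.

    `model(x)` is assumed to return the last decoder block's hidden states,
    shape (batch, seq, d_model); `W_y` has shape (d_model, n_classes).
    """
    h = model(x)                                   # (batch, seq, d_model)
    logits_y = h[:, -1, :] @ W_y                   # h_l^m W_y (use last position)
    L2 = F.cross_entropy(logits_y, y)              # -log P(y | x^1..x^m)

    # Auxiliary language-modeling objective L1 on the same inputs:
    lm_logits = h[:, :-1, :] @ model.W_e.weight.T  # predict token t+1 from position t
    L1 = F.cross_entropy(lm_logits.reshape(-1, lm_logits.size(-1)),
                         x[:, 1:].reshape(-1))
    return L2 + lam * L1                           # corresponds to L3 = L2 + lambda * L1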

Results / Notes

The paper describes the different kinds of tasks the model can be applied to:

Tasks

Different tasks include (the input construction for each is sketched after the list):

  • Classification, where no additional input transformation is needed for fine-tuning,
  • Entailment, which judges whether a hypothesis is true or false given a premise; the premise and hypothesis token sequences are concatenated with a delimiter,
  • Similarity, which compares two sentences; since there is no inherent ordering, both orderings are processed and combined for fine-tuning,
  • QA / Multiple Choice, where the context document $z$, the question $q$, and each candidate answer $a_k$ are concatenated.
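
The sketch below illustrates how such inputs might be assembled for each task; the special tokens (`<s>`, `<$>`, `<e>`) are placeholders standing in for the paper's start, delimiter, and extract tokens, not its exact vocabulary.

```python
# Illustrative input transformations for fine-tuning (token names are placeholders).
START, DELIM, EXTRACT = "<s>", "<$>", "<e>"

def classification_input(text):
    return [START, *text, EXTRACT]

def entailment_input(premise, hypothesis):
    return [START, *premise, DELIM, *hypothesis, EXTRACT]

def similarity_inputs(sent_a, sent_b):
    # No natural ordering, so both orders are fed through the model
    # and their final hidden states are combined before the linear head.
    return ([START, *sent_a, DELIM, *sent_b, EXTRACT],
            [START, *sent_b, DELIM, *sent_a, EXTRACT])

def multiple_choice_inputs(context_z, question_q, answers):
    # One sequence per candidate answer a_k; a softmax over the per-candidate
    # scores gives the answer distribution.
    return [[START, *context_z, *question_q, DELIM, *a_k, EXTRACT]
            for a_k in answers]
```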

The model was evaluated on these tasks, and the results are as follows:

Natural Language Inference

For the natural language inference tasks, the only dataset on which the GPT model fell short was RTE, which is small compared to the others.

QA

For the QA tasks, GPT outperformed all existing methods.

Others

On the remaining tasks, GPT also outperformed prior methods.

Tuning

The paper plots performance against the number of pre-trained transformer layers transferred: as the number of layers increases, performance increases as well.

Results with Ablation

The ablation study compares results under several variations:

  1. Transformer - using the transformer architecture increases overall performance,
  2. Auxiliary objective - adding the auxiliary LM objective increases overall performance, though the gains are small on tasks such as QQP / RTE,
  3. Pre-training - pre-training clearly improves the results.

The full paper, "Improving Language Understanding by Generative Pre-Training", can be found here.

GPT 2

The focus of GPT-2 was to train on a much larger dataset, WebText (about 40 GB of text), with no task-specific supervision.

Characteristics

The model has several notable characteristics.

  • Byte-Pair Encoding (BPE)

This is an algorithm for splitting words into subword units. Words are first broken down into individual characters, and the most frequent adjacent pairs are then merged step by step into larger units.
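
A toy version of the merge procedure is sketched below; it operates on characters rather than raw bytes, so it illustrates the idea rather than GPT-2's exact byte-level implementation.

```python
from collections import Counter

def byte_pair_encode(words, num_merges):
    """Toy BPE: start from characters and repeatedly merge the most frequent
    adjacent pair (a sketch of the idea, not GPT-2's byte-level implementation)."""
    # Each word is a tuple of symbols, starting with single characters.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs across the corpus.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)           # most frequent pair
        merges.append(best)
        # Replace every occurrence of the best pair with the merged symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus

# Example: byte_pair_encode(["low", "lower", "lowest", "low"], num_merges=3)
# learns merges like ('l', 'o') and ('lo', 'w') first, since "low" is most frequent.
```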

  • Input Encoding

This lets the model know both which word it is seeing and where that word sits in the sequence. It is the sum of the token embedding (from BPE) and the positional embedding.
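
As a small illustration (with made-up random tables; the shapes are only illustrative), the encoding fed to the first block is simply an element-wise sum:

```python
import numpy as np

vocab_size, max_len, d_model = 50257, 1024, 768
W_e = np.random.randn(vocab_size, d_model) * 0.02     # token embedding table
W_p = np.random.randn(max_len, d_model) * 0.02        # positional embedding table

token_ids = np.array([17, 942, 8041])                 # arbitrary BPE token ids
positions = np.arange(len(token_ids))
h0 = W_e[token_ids] + W_p[positions]                  # (3, d_model) input encoding
```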

  • Decoder / Self-Attention

Each decoder block in the stack applies masked self-attention, so every position can only attend to itself and earlier positions.
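
A single-head version of masked self-attention might look like the following sketch, where the causal mask prevents each position from attending to later positions (multi-head attention and the output projection are omitted for brevity).

```python
import torch
import torch.nn.functional as F

def masked_self_attention(x, W_q, W_k, W_v):
    """Single-head masked self-attention over x of shape (seq, d_model)."""
    q, k, v = x @ W_q, x @ W_k, x @ W_v                      # (seq, d_head) each
    scores = q @ k.T / (k.size(-1) ** 0.5)                   # scaled dot-product scores
    causal_mask = torch.triu(torch.ones_like(scores), diagonal=1).bool()
    scores = scores.masked_fill(causal_mask, float("-inf"))  # block future positions
    weights = F.softmax(scores, dim=-1)                      # attention weights
    return weights @ v                                       # weighted sum of values
```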

Test Results

Several tasks were used to measure the efficiency and accuracy of the model.

On seven of the eight language modeling datasets tested, the model reached state-of-the-art results in the zero-shot setting.

Summarization

However, on some tasks such as summarization, its performance did not surpass existing approaches.

Output

For the output, the model computes the probability distribution of the upcoming word, predicts the next word, appends it to the context, and repeats until the output text is produced.
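
In other words, generation is an autoregressive loop. A generic sketch is given below, assuming a `model` that returns next-token probabilities as in the earlier GPT-1 sketch; the interface is an assumption for this example.

```python
import torch

@torch.no_grad()
def generate(model, token_ids, max_new_tokens, temperature=1.0):
    """Autoregressive decoding: predict the next-token distribution, sample,
    append, and repeat. `model(ids)` is assumed to return probabilities of
    shape (batch, seq, vocab)."""
    ids = token_ids.clone()                                # (1, seq)
    for _ in range(max_new_tokens):
        probs = model(ids)[:, -1, :]                       # P(next token | context)
        if temperature != 1.0:
            probs = torch.softmax(torch.log(probs) / temperature, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)  # sample the next token
        ids = torch.cat([ids, next_id], dim=1)             # append and continue
    return ids
```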

The full paper, "Language Models are Unsupervised Multitask Learners", can be found here.

GPT 3

GPT-3 has the same overall architecture as GPT-2, but with more than 100 times as many parameters (175 billion, versus 1.5 billion for the largest GPT-2).

[Figure: GPT-3 accuracy vs. number of parameters under zero-, one-, and few-shot settings]

The graph demonstrates how GPT-3 benefits from an increased number of parameters under three different conditions:

  1. Zero-shot, in which only a natural-language description of the task is given, with no examples,

  2. One-shot, in which only one example is provided,

  3. Few-shot, in which a few examples are provided.

In all three settings, the larger GPT-3 models performed well compared to models with fewer parameters.
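
For intuition, a few-shot "prompt" in this setting is just text containing a task description and a handful of demonstrations; no weights are updated. The example below follows the translation format used in the paper (the `model.generate` call is hypothetical).

```python
# A few-shot prompt in the GPT-3 style: the "learning" happens entirely
# in the text fed to the model, with no gradient updates.
prompt = (
    "Translate English to French:\n"      # task description (this alone = zero-shot)
    "sea otter => loutre de mer\n"        # demonstration 1
    "cheese => fromage\n"                 # demonstration 2 (few-shot)
    "peppermint =>"                       # the model completes this line
)
# completion = model.generate(prompt)     # hypothetical call
```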

In addition to the modeling improvements, GPT-3 was made available through public APIs, which underpin applications such as the well-known chatbot, ChatGPT.

The full paper, "Language Models are Few-Shot Learners", can be found here.

Conclusion

GPT stands out in the world of NLP for both its performance and the methods it introduced to improve that performance. With these modern techniques, NLP has become one of the fastest-growing fields in machine learning, and it is set to keep growing.
