Bidirectional Encoder Representations from Transformers (BERT) is a pre-trained language model introduced by Google in 2018.

Background

The BERT model is built from Transformer encoders and is pre-trained on a corpus drawn from Wikipedia and BookCorpus. To achieve strong performance, it follows the approach of earlier work: the model is first pre-trained on unlabeled text, and its parameters are then further tuned (fine-tuned) on tasks that have labels.

There are two models with different sizes:

  1. BERT-Base: a stack of 12 Transformer encoders, $d_{model} = 768$, 12 self-attention heads,
  2. BERT-Large: a stack of 24 Transformer encoders, $d_{model} = 1024$, 16 self-attention heads.

The former is sized to have the same number of parameters as the GPT-1 model so that the two can be compared directly; the latter is the configuration used to reach the best performance.
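For reference, the two sizes can be written down with Hugging Face's BertConfig (an illustrative sketch; the library is not part of the original description):

```python
from transformers import BertConfig

# BERT-Base: 12 layers, hidden size 768, 12 attention heads
base_config = BertConfig(hidden_size=768, num_hidden_layers=12, num_attention_heads=12)

# BERT-Large: 24 layers, hidden size 1024, 16 attention heads
large_config = BertConfig(hidden_size=1024, num_hidden_layers=24, num_attention_heads=16)
```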

Components

This section describes the components of the BERT model, including the word embedding, the tokenizer, and the positional embedding.

Embeddings

Because of $d_{model}$, BERT uses embedding layers of that specific size. Through self-attention, each output embedding is computed by attending to all of the given input tokens. Each Transformer layer consists of multi-head self-attention and a position-wise feed-forward neural network.
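A rough sketch of one encoder layer with the BERT-Base sizes is shown below (an illustrative PyTorch version, not BERT's exact implementation; the feed-forward size of 4 * d_model follows the usual Transformer convention):

```python
import torch.nn as nn

# One encoder block: multi-head self-attention followed by a
# position-wise feed-forward network, each with a residual connection.
class EncoderBlock(nn.Module):
    def __init__(self, d_model=768, num_heads=12):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, pad_mask=None):
        attn_out, _ = self.attn(x, x, x, key_padding_mask=pad_mask)
        x = self.norm1(x + attn_out)      # residual connection + layer norm
        x = self.norm2(x + self.ffn(x))   # residual connection + layer norm
        return x
```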

Tokenizers

For tokenizing words, BERT uses a subword (WordPiece) vocabulary, and each word is handled in one of two ways:

  • When the token exists in the vocabulary, the word is not divided into subwords.
  • When the token does not exist in the vocabulary, the word is divided into subwords. If a subword is not the start of the original word, “##” is prepended to it.

For instance, the word “embeddings” is not in the vocabulary, so it is divided into the subwords “em,” “##bed,” “##ding,” and “##s.” The vocabulary is saved in a text file named vocabulary.txt. A short tokenization example follows.
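The illustration below uses the Hugging Face tokenizer for bert-base-uncased (the model name is assumed here); its vocabulary file plays the role of the vocabulary.txt mentioned above:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("embeddings"))  # ['em', '##bed', '##ding', '##s']
print(tokenizer.tokenize("here"))        # ['here'] -- a word in the vocabulary stays whole
```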

Positional Embedding

To assign positions to the words, the BERT model applies positional embeddings, which are learned during training. There are 512 positional embedding vectors, since 512 is the maximum input sequence length.

The positional embedding vector for each position is added to the corresponding word embedding vector.
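A minimal sketch of such a learned positional embedding, assuming the BERT-Base dimension of 768:

```python
import torch
import torch.nn as nn

# One trainable vector per position, up to the maximum input length of 512.
max_len, d_model = 512, 768
position_embedding = nn.Embedding(max_len, d_model)

token_embeddings = torch.randn(1, 10, d_model)        # e.g. a 10-token input
positions = torch.arange(10).unsqueeze(0)             # positions 0..9
x = token_embeddings + position_embedding(positions)  # added element-wise
```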

Segment Embedding

This embedding is used when the input consists of two sentences. To distinguish the sentences, a segment embedding layer is added.

Tokens of the first sentence receive segment id 0, and tokens of the second sentence receive segment id 1.
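A minimal sketch of the segment embedding, again assuming $d_{model} = 768$:

```python
import torch
import torch.nn as nn

# Segment id 0 for tokens of the first sentence, id 1 for the second sentence.
segment_embedding = nn.Embedding(2, 768)
segment_ids = torch.tensor([[0, 0, 0, 0, 1, 1, 1]])  # 4 tokens from A, 3 from B
segment_vectors = segment_embedding(segment_ids)     # added to the token embeddings
```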

BERT Model Embedding

The overall input representation combines the embeddings explained above: for each token, the word, positional, and segment embeddings are summed and given to the model as input.

Attention Mask

The attention mask is an additional sequence input that separates real words from padding tokens, so that the model does not attend to the padding. A position receives the value 1 when it holds a real word and 0 when it holds a padding token.
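As an illustration, the Hugging Face tokenizer returns exactly this kind of mask when an input is padded (the sentence and padding length are arbitrary):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("I love you", padding="max_length", max_length=8)
print(encoded["input_ids"])       # [CLS] i love you [SEP] followed by [PAD] ids
print(encoded["attention_mask"])  # [1, 1, 1, 1, 1, 0, 0, 0]
```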

Pre-training

Pre-training follows the same general mechanism used by other language models. The key difference between GPT and BERT is that BERT is trained bidirectionally, which is achieved with the masked language model objective.

Masked Language Model

During training, about 15% of the input tokens are selected for masking. The selected tokens are then processed with the following rule:

  • 80% of them are replaced with the [MASK] token
  • 10% of them are replaced with a random token
  • 10% of them remain unchanged

This is done to reduce the mismatch between pre-training and fine-tuning, since the [MASK] token never appears during fine-tuning. A sketch of this rule is given below.
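A small sketch of the 80/10/10 rule (illustrative only; the real implementation operates on token ids rather than strings):

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    masked, labels = [], []
    for tok in tokens:
        if random.random() < mask_prob:              # ~15% of tokens are selected
            labels.append(tok)                       # the model must recover the original
            r = random.random()
            if r < 0.8:
                masked.append("[MASK]")              # 80%: replace with [MASK]
            elif r < 0.9:
                masked.append(random.choice(vocab))  # 10%: replace with a random token
            else:
                masked.append(tok)                   # 10%: keep the original token
        else:
            masked.append(tok)
            labels.append(None)                      # not selected: no prediction target
    return masked, labels
```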

Next Sentence Prediction

Another pre-training task in BERT is next sentence prediction, which checks whether sentence B actually follows sentence A.

A [SEP] token is appended at the end of each sentence. On top of the output vector of the [CLS] token, the model solves a binary classification problem: is the second sentence the one that follows the first?
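The input format for this task can be illustrated with the Hugging Face tokenizer (the two example sentences are arbitrary):

```python
from transformers import BertTokenizer

# [CLS] sentence A [SEP] sentence B [SEP], with token_type_ids (segment ids)
# marking which sentence each token belongs to.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("The man went to the store.", "He bought a gallon of milk.")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
print(encoded["token_type_ids"])  # 0s for sentence A, 1s for sentence B
```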

Tasks

There are several downstream tasks that the BERT model can perform.

Single Text Classification

This task analyzes a single text; examples include sentiment analysis and news topic classification. For this task, the [CLS] token is added at the beginning of the text, and its output vector is used to perform the classification.
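A minimal sketch of such a classifier head, assuming `bert` is a Hugging Face BertModel and `num_labels` is task-specific:

```python
import torch.nn as nn

# A dense layer on top of the [CLS] output vector.
class BertClassifier(nn.Module):
    def __init__(self, bert, d_model=768, num_labels=2):
        super().__init__()
        self.bert = bert
        self.classifier = nn.Linear(d_model, num_labels)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls_vector = outputs.last_hidden_state[:, 0]  # vector at the [CLS] position
        return self.classifier(cls_vector)            # classification logits
```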

Tagging

Another task is tagging, which analyzes each word individually. A dense layer is applied to the output vector of every input word to predict its tag.
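A minimal sketch of a tagging head under the same assumptions as before; the same dense layer is applied at every token position:

```python
import torch.nn as nn

# `bert` is assumed to be a Hugging Face BertModel; num_tags is task-dependent.
class BertTagger(nn.Module):
    def __init__(self, bert, d_model=768, num_tags=9):
        super().__init__()
        self.bert = bert
        self.tag_head = nn.Linear(d_model, num_tags)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        return self.tag_head(outputs.last_hidden_state)  # one tag score vector per token
```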

Text Pair Classification / Regression

This task analyzes two sentences given as one input. As mentioned previously, a dense layer on top of the [CLS] token at the beginning of the input predicts the relationship between the first and the second sentence.

Question Answering

Question answering is another task that takes two sentences as input. The first is the question, and the second is a context passage related to the answer; the model then predicts the answer to the question from that passage.
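A minimal sketch of a SQuAD-style span-prediction head under the same assumptions (two logits per token, for the start and the end of the answer):

```python
import torch.nn as nn

# `bert` is assumed to be a Hugging Face BertModel.
class BertQAHead(nn.Module):
    def __init__(self, bert, d_model=768):
        super().__init__()
        self.bert = bert
        self.span_head = nn.Linear(d_model, 2)  # start and end logits

    def forward(self, input_ids, attention_mask, token_type_ids):
        outputs = self.bert(input_ids=input_ids,
                            attention_mask=attention_mask,
                            token_type_ids=token_type_ids)
        logits = self.span_head(outputs.last_hidden_state)
        start_logits, end_logits = logits.split(1, dim=-1)
        return start_logits.squeeze(-1), end_logits.squeeze(-1)
```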

Conclusion

The BERT model is another instance of a large pre-trained language model, introduced partly as a point of comparison with GPT. With its different structure, it is able to perform a variety of tasks at an acceptable level.

The original paper can be found here.
