This post introduces another attempt to create a large language model.

Basic Introduction

The Large Language Model Meta AI (LLaMA) model was publicly released by Meta in late February 2023. Following the research done at other “big” tech firms, including Google’s work on the Transformer, TensorFlow, and BERT, Meta is also attempting a breakthrough in this field.

The models range from 7B to 65B parameters, with performance comparable to some of the best models available at the time. According to the paper published alongside the release, the goal was to provide the best possible performance at various inference budgets.

Unlike other models, the data used for training is all publicly available. However, the weights were released only under a non-commercial research license, which prevents other users from utilizing the model for commercial purposes.

Data

The training data was gathered from several sources, in the following proportions:

Source                Percentage
English CommonCrawl   67%
C4                    15%
GitHub                4.5%
Wikipedia             4.5%
Gutenberg and Books3  4.5%
ArXiv                 2.5%
Stack Exchange        2%

Each source serves a specific purpose: GitHub and Stack Exchange cover programming-related problems, Gutenberg and Books3 provide book contents, and so on.

The data is tokenized using byte-pair encoding (BPE) with the SentencePiece implementation, similar in spirit to GPT-2’s tokenizer. After tokenization, the two smaller models (7B / 13B) are trained on roughly 1 trillion tokens, while the two larger models (33B / 65B) are trained on 1.4 trillion tokens.
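As a rough illustration, the sketch below trains a tiny BPE tokenizer with SentencePiece, the implementation family the paper uses. The toy corpus, file names, and vocabulary size of 60 are placeholders for this example only (LLaMA’s actual vocabulary has about 32k tokens).

```python
# Minimal BPE tokenization sketch with SentencePiece.
# The corpus, file names, and vocab_size below are illustrative placeholders.
import sentencepiece as spm

# Write a tiny corpus so the example is self-contained.
sentences = [
    "LLaMA is trained only on publicly available data.",
    "The models range from 7B to 65B parameters.",
    "Byte-pair encoding splits text into subword units.",
]
with open("toy_corpus.txt", "w") as f:
    f.write("\n".join(sentences * 100))

# Train a small BPE model (the real model uses a much larger vocabulary).
spm.SentencePieceTrainer.train(
    input="toy_corpus.txt",
    model_prefix="toy_bpe",
    vocab_size=60,
    model_type="bpe",
)

sp = spm.SentencePieceProcessor(model_file="toy_bpe.model")
print(sp.encode("publicly available data", out_type=str))
```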

Architecture

The model is based on the Transformer architecture, but it applies several modifications, including the following (a rough sketch of each is given after the list):

  1. RMSNorm, a normalization function applied to the input of each transformer sub-layer (pre-normalization) instead of normalizing the output.
  2. The SwiGLU activation function, replacing ReLU.
  3. Rotary Positional Embeddings (RoPE), replacing absolute positional embeddings.
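A minimal sketch of these three components is shown below, assuming PyTorch; the class names, dimensions, and the half-split rotary layout are illustrative choices rather than the paper’s exact implementation.

```python
# Sketches of RMSNorm, a SwiGLU feed-forward block, and rotary positional embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    """Rescale by the root-mean-square of the activations (no mean subtraction, no bias)."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight


class SwiGLUFeedForward(nn.Module):
    """Feed-forward block with a SwiGLU gate: silu(x W1) * (x W3), projected back by W2."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))


def rotary_embedding(x, base: float = 10000.0):
    """Rotate pairs of channels of a (batch, seq, heads, head_dim) tensor by a
    position-dependent angle, so relative positions show up in dot products."""
    _, seq_len, _, head_dim = x.shape
    half = head_dim // 2
    freqs = base ** (-torch.arange(0, half, dtype=torch.float32) / half)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs[None, :]
    cos = angles.cos()[None, :, None, :]  # (1, seq, 1, half)
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)


# Example usage on a dummy tensor shaped (batch, seq, heads, head_dim):
x = torch.randn(2, 16, 8, 64)
print(RMSNorm(64)(x).shape, SwiGLUFeedForward(64, 172)(x).shape, rotary_embedding(x).shape)
```

In the actual model, RMSNorm is applied to the input of each sub-layer, SwiGLU replaces the ReLU feed-forward block, and the rotary embedding is applied to the query and key vectors inside attention.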

The hyperparameters, including the number of layers, attention heads, and model dimension, differ by model size.

The model is trained with the AdamW optimizer, with $\beta_1 = 0.9$ and $\beta_2 = 0.95$. The learning rate follows a cosine schedule, with the final learning rate equal to 10% of the maximum learning rate. The weight decay is 0.1 and gradient clipping is set to 1.0.
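A minimal sketch of this optimizer setup in PyTorch follows; the stand-in model, peak learning rate, warmup length, and step count are placeholders, since the paper’s actual values vary with model size.

```python
# AdamW with beta1=0.9, beta2=0.95, weight decay 0.1, gradient clipping at 1.0,
# and a cosine learning-rate schedule that decays to 10% of the peak rate.
import math
import torch

model = torch.nn.Linear(512, 512)                    # stand-in for the transformer
max_lr, warmup_steps, total_steps = 3e-4, 100, 1000  # placeholder schedule values

optimizer = torch.optim.AdamW(
    model.parameters(), lr=max_lr, betas=(0.9, 0.95), weight_decay=0.1
)

def lr_at(step: int) -> float:
    """Linear warmup, then cosine decay down to 0.1 * max_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return max_lr * (0.1 + 0.9 * 0.5 * (1.0 + math.cos(math.pi * progress)))

for step in range(total_steps):
    for group in optimizer.param_groups:
        group["lr"] = lr_at(step)
    loss = model(torch.randn(8, 512)).pow(2).mean()  # dummy loss standing in for the LM loss
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
    optimizer.step()
    optimizer.zero_grad()
```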

For training efficiency, a memory-efficient implementation of causal multi-head attention is used that avoids storing the attention weights and computing the masked scores. In addition, the backward pass of the transformer layers is implemented manually instead of relying on autograd, so that expensive activations can be saved rather than recomputed. Model and sequence parallelism spread the computation and memory across GPUs, and the computation of activations is overlapped with the communication between GPUs.
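The paper relies on the xformers library for this memory-efficient attention; as a rough stand-in, the sketch below uses PyTorch’s built-in fused attention (available since PyTorch 2.0), which likewise avoids materializing the full attention-weight matrix.

```python
# Stand-in for memory-efficient causal multi-head attention.
import torch
import torch.nn.functional as F

batch, n_heads, seq_len, head_dim = 2, 8, 1024, 64
q = torch.randn(batch, n_heads, seq_len, head_dim)
k = torch.randn(batch, n_heads, seq_len, head_dim)
v = torch.randn(batch, n_heads, seq_len, head_dim)

# is_causal=True applies the causal mask without explicitly computing the masked scores.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([2, 8, 1024, 64])
```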

Results

The model was evaluated on a range of benchmark tasks.

Test Results

The figures in the paper plot accuracy against the number of training tokens. Chinchilla was among the best-performing language models at the time, and the results show that the largest LLaMA model outperforms it on most benchmarks.

In addition, separate tests were run to measure toxicity.

Toxicity

These toxicity scores are calculated with the Perspective API. The “respectful” prompts are those that explicitly ask the model to respond in a polite, respectful, and unbiased manner. The table indicates that toxicity increases with the size of the LLaMA model.

TruthfulQA

Another noteworthy benchmark is TruthfulQA. A known problem with existing language models is that they may hallucinate answers. In the paper’s table, the left column reports truthfulness and the right column reports answers that are both truthful and informative, calculated using the method from the TruthfulQA paper. The scores show that the model still produces information that may not be authentic.

Conclusion

The paper indicates that the researchers plan to train models at a larger scale, since both this paper and previous work show that performance improves with scale.

The full paper can be found here.
