Reading Notes — BERT

lzhangstat


The original paper is “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding” (Devlin et al., 2018). This reading note focuses specifically on the model architecture.

BERT, Bidirectional Encoder Representations from Transformers, is a new language representation model. BERT is designed to pretrain deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. The pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference.

Introduction

There are two existing strategies for applying pre-trained language representations to downstream tasks: feature-based and fine-tuning. Both approaches share the same objective function during pre-training, using unidirectional language models to learn general language representations.

  • The feature-based approach, such as ELMo, uses task-specific architectures that include the pre-trained representations as additional features.
  • The fine-tuning approach, such as OpenAI GPT, introduces minimal task-specific parameters and is trained on the downstream tasks by simply fine-tuning all pre-trained parameters.

These techniques restrict the power of the pre-trained representations, especially for the fine-tuning approach, because of the unidirectionality constraint; an example is the left-to-right architecture used in OpenAI GPT.

BERT alleviates the unidirectionality constraint by using a “masked language model” (MLM) pre-training objective. The masked language model randomly masks some of the tokens from the input, and the objective is to predict the original vocabulary id of the masked word based only on its context. In addition, BERT uses a “next sentence prediction” task that jointly pretrains text-pair representations.

Bidirectional Encoder Representations from Transformers

There are two steps in the framework: pre-training and fine-tuning. During pre-training, the model is trained on unlabeled data over different pre-training tasks. For fine-tuning, the BERT model is first initialized with the pre-trained parameters, and all of the parameters are fine-tuned using labeled data from the downstream tasks. Each downstream task has separate fine-tuned models, even though they are initialized with the same pre-trained parameters.

A distinctive feature of BERT is its unified architecture across different tasks. There is minimal difference between the pre-trained architecture and the final downstream architecture.

Model architecture: BERT’s model architecture is a multi-layer bidirectional Transformer encoder. The number of layers (i.e., Transformer blocks) is denoted as L, the hidden size as H, and the number of self-attention heads as A. The base model has L=12, H=768, A=12, Total Parameters = 110M. The larger model has L=24, H=1024, A=16, Total Parameters = 340M.
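As a concrete point of reference (not something the paper itself provides), the two configurations can be written down with the Hugging Face transformers library; BertConfig and the library usage below are my assumptions about tooling, not part of the original work.

```python
# A minimal sketch of the two published configurations, assuming the
# Hugging Face `transformers` library (not used in the original paper).
from transformers import BertConfig, BertModel

base_config = BertConfig(
    num_hidden_layers=12,    # L = 12 Transformer blocks
    hidden_size=768,         # H = 768
    num_attention_heads=12,  # A = 12
)
large_config = BertConfig(
    num_hidden_layers=24,    # L = 24
    hidden_size=1024,        # H = 1024
    num_attention_heads=16,  # A = 16
    intermediate_size=4096,  # feed-forward size, conventionally 4H
)

# Instantiating the base model and counting parameters gives roughly 110M.
model = BertModel(base_config)
print(sum(p.numel() for p in model.parameters()))
```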

Input/Output Representation: The authors use WordPiece embeddings with a 30,000 token vocabulary. The first token of every sequence is always a special classification token ([CLS]). The final hidden state corresponding to this token is used as the aggregate sequence representation for classification tasks.
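As a quick illustration of WordPiece tokenization and the special tokens, here is a sketch using the Hugging Face tokenizer for bert-base-uncased; that tokenizer and its 30,522-token vocabulary (close to the 30,000 cited) are assumptions on my part, not something the paper specifies.

```python
# Sketch of WordPiece tokenization with the [CLS] and [SEP] special tokens,
# assuming the Hugging Face tokenizer for bert-base-uncased.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("my dog is cute", "he likes playing")
tokens = tokenizer.convert_ids_to_tokens(encoded["input_ids"])
print(tokens)  # ['[CLS]', ...sentence A..., '[SEP]', ...sentence B..., '[SEP]']
# Rare words are split into subword pieces prefixed with '##'.
```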

For a given token, its input representation is constructed by summing the corresponding token, segment, and position embeddings. We denote the input embedding as E, the final hidden vector of the special [CLS] token as C, and the final hidden vector for the i-th input token as T_i.
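The sum of the three embeddings can be sketched directly; the shapes, example token ids, and variable names below are illustrative assumptions, not taken from the paper or a reference implementation.

```python
# Illustrative sketch of the input representation E as the sum of token,
# segment, and position embeddings (shapes and ids are assumptions).
import torch
import torch.nn as nn

vocab_size, max_len, hidden = 30000, 512, 768   # WordPiece vocab, H = 768

token_emb = nn.Embedding(vocab_size, hidden)
segment_emb = nn.Embedding(2, hidden)           # sentence A vs. sentence B
position_emb = nn.Embedding(max_len, hidden)    # learned position embeddings

token_ids = torch.tensor([[101, 7592, 2088, 102]])  # hypothetical id sequence
segment_ids = torch.zeros_like(token_ids)           # all from sentence A here
positions = torch.arange(token_ids.size(1)).unsqueeze(0)

E = token_emb(token_ids) + segment_emb(segment_ids) + position_emb(positions)
print(E.shape)  # torch.Size([1, 4, 768]): one H-dimensional vector per token
```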

Two tasks the model is trained on

● The masked language model (MLM) randomly masks some of the tokens from the input, and the objective is to predict the original vocabulary id of the masked word based only on its context. Unlike left-to-right language model pre-training, the MLM objective enables the representation to fuse the left and the right context, which allows us to pretrain a deep bidirectional Transformer.

  • In all the experiments, the authors mask 15% of all WordPiece tokens in each sequence at random. A downside is that this creates a mismatch between pre-training and fine-tuning, since the [MASK] token does not appear during fine-tuning. To mitigate this, the “masked” words are not always replaced with the actual [MASK] token. The training data generator chooses 15% of the token positions at random for prediction. If the i-th token is chosen, it is replaced with (1) the [MASK] token 80% of the time, (2) a random token 10% of the time, or (3) the unchanged i-th token 10% of the time (a code sketch of this rule follows the list below).
  • Then T_i, the final hidden vector for the i-th input token, is used to predict the original token with a cross-entropy loss.

● A “next sentence prediction” task that jointly pretrains text-pair representations. Specifically, when choosing the sentences A and B for each pretraining example, 50% of the time B is the actual next sentence that follows A (labeled as IsNext) and 50% of the time it is a random sentence from the corpus (labeled as NotNext). The final hidden vector of the special [CLS] token is used for next sentence prediction (NSP).
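Below is a minimal sketch of the 80/10/10 corruption rule described in the MLM bullet above. The special-token id, vocabulary size, and the use of -100 as an "ignore" label are assumptions, and positions are chosen independently with probability 0.15, a small simplification of picking exactly 15% of positions.

```python
# Sketch of the MLM corruption rule: roughly 15% of positions are chosen for
# prediction; each chosen position becomes [MASK] 80% of the time, a random
# token 10% of the time, and stays unchanged 10% of the time.
# MASK_ID, VOCAB_SIZE, and the -100 "ignore" label are assumptions.
import random

MASK_ID, VOCAB_SIZE = 103, 30000

def mask_tokens(token_ids, mask_prob=0.15):
    inputs, labels = list(token_ids), [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if random.random() < mask_prob:
            labels[i] = tok                   # predict the original token here
            r = random.random()
            if r < 0.8:
                inputs[i] = MASK_ID           # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = random.randrange(VOCAB_SIZE)  # 10%: random token
            # remaining 10%: keep the original token unchanged
    return inputs, labels
```

Only the chosen positions carry a label, so only their final hidden vectors T_i contribute to the cross-entropy loss, matching the description above.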

Fine-tuning BERT

For each task, we simply plug the task-specific inputs and outputs into BERT and fine-tune all the parameters end-to-end.

At the output, the token representations are fed into an output layer for token-level tasks, such as sequence tagging or question answering, and the [CLS] representation is fed into an output layer for classification, such as entailment or sentiment analysis.
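As a sketch of what "one additional output layer" can look like for a classification task: the snippet below puts a single linear layer on top of the [CLS] representation. The Hugging Face library calls and attribute names are assumptions; the paper does not prescribe an implementation.

```python
# Minimal sketch of a classification head on top of the final [CLS] vector C.
# The Hugging Face `transformers` usage is an assumption, not the paper's code.
import torch.nn as nn
from transformers import BertModel

class BertClassifier(nn.Module):
    def __init__(self, num_labels=2):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask=None, token_type_ids=None):
        out = self.bert(input_ids, attention_mask=attention_mask,
                        token_type_ids=token_type_ids)
        cls_repr = out.last_hidden_state[:, 0]   # C: final hidden state of [CLS]
        return self.classifier(cls_repr)         # logits for the downstream task
```

During fine-tuning, all parameters, both the pre-trained encoder and the new output layer, are updated end-to-end, as described above.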
