Understanding word embeddings — connecting words to numbers
Word embeddings represent words as vectors in a multi-dimensional space. The distance and direction between these vectors reflect the similarity and relationships among the corresponding words. This enables machine learning algorithms, which typically work with numerical inputs, to process text data. Pre-trained embeddings, derived from large corpora, can be fine-tuned for specific natural language processing (NLP) tasks.
In this article, I’ll introduce word embeddings and discuss popular word embedding models, including Word2Vec, GloVe, ELMo and WordPiece, the embedding model used by BERT.
What are embeddings?
Machine learning models only take numeric values as input, so we need a way to convert words to numbers when we want to process language with them. One naive encoding method is one-hot encoding, which creates a binary vector where each word corresponds to a unique dimension. But there are two problems with this approach. First, for a large vocabulary it results in high-dimensional sparse vectors, which are computationally expensive and memory-intensive. Second, one-hot encoding fails to capture any semantic relationships between words. For example, the cosine similarity between the one-hot encodings of "happy" and "sad" is exactly the same as between "happy" and "joy" (both are zero), which does not reflect how related the words actually are.
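To make the second problem concrete, here is a minimal sketch (with a toy three-word vocabulary assumed for illustration) showing that one-hot vectors give every pair of distinct words the same cosine similarity of zero:

```python
import numpy as np

# Toy vocabulary and its one-hot encodings (assumed for illustration).
vocab = ["happy", "sad", "joy"]
one_hot = {word: np.eye(len(vocab))[i] for i, word in enumerate(vocab)}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Both pairs score 0.0: one-hot vectors carry no notion of meaning.
print(cosine(one_hot["happy"], one_hot["sad"]))  # 0.0
print(cosine(one_hot["happy"], one_hot["joy"]))  # 0.0
```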
In NLP, it’s important to recognize that words are no longer the smallest units. Tokenization has become a fundamental process, breaking text down into smaller, meaningful units called tokens. These tokens can be words, parts of words, or even individual characters such as punctuation marks. So when we talk about word embeddings, we are often really referring to token embeddings.
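As a rough illustration only (a simple regular expression standing in for a real tokenizer such as WordPiece), splitting text into word and punctuation tokens might look like this:

```python
import re

text = "Word embeddings aren't magic, but they're useful!"
# Split into word-like tokens and standalone punctuation marks.
tokens = re.findall(r"\w+|[^\w\s]", text)
print(tokens)
# ['Word', 'embeddings', 'aren', "'", 't', 'magic', ',', 'but', ...]
```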
How can we get embeddings?
In this section, I introduce several embedding models, ranging from classic approaches to the ones used in large language models (LLMs) such as BERT and GPT.
Classic embeddings — Word2Vec
Word2vec is a group of related models that are used to produce word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words. Word2vec takes as its input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space.
Word2vec can utilize either of two model architectures to produce these distributed representations of words: Continuous Bag-Of-Words (CBOW) or continuously sliding skip-gram. In both architectures, word2vec considers both individual words and a sliding context window as it iterates over the corpus. According to the authors’ note, CBOW is faster while skip-gram does a better job for infrequent words.
Given the set of neighbor offsets N = {−4, −3, −2, −1, 1, 2, 3, 4} that defines the context window:
- CBOW: Predicts a target word based on its context (surrounding words). It learns to maximize the likelihood of the target word given its context words.
- Skip-gram: Predicts context words given a target word. It aims to maximize the likelihood of context words given the target word.
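To make the two objectives concrete, here is a minimal sketch (using a toy sentence and a smaller offset set {−2, −1, 1, 2} of my choosing) that generates the (target, context) training pairs skip-gram predicts; CBOW would instead group all the context words of one target into a single example:

```python
# Toy corpus and a window of +/- 2 (assumed for illustration).
sentence = "the cat sat on the mat".split()
offsets = [-2, -1, 1, 2]

skipgram_pairs = []
for i, target in enumerate(sentence):
    for off in offsets:
        j = i + off
        if 0 <= j < len(sentence):
            # Skip-gram: predict each context word from the target word.
            skipgram_pairs.append((target, sentence[j]))

print(skipgram_pairs[:6])
# [('the', 'cat'), ('the', 'sat'), ('cat', 'the'),
#  ('cat', 'sat'), ('cat', 'on'), ('sat', 'the')]
```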
To model the conditional probability in both CBOW and skip-gram, we use a softmax over dot products of word embeddings. For skip-gram, the probability of a context word $w_O$ given the target word $w_I$ is

$$P(w_O \mid w_I) = \frac{\exp\left({v'_{w_O}}^{\top} v_{w_I}\right)}{\sum_{w=1}^{V} \exp\left({v'_w}^{\top} v_{w_I}\right)}$$

where $v_w$ and $v'_w$ are the input and output vectors of word $w$, and $V$ is the vocabulary size.
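In practice these models are usually trained with a library rather than by hand. Below is a minimal sketch using Gensim's Word2Vec class on a toy corpus (so the resulting vectors won't be meaningful); the sg flag switches between CBOW (0) and skip-gram (1):

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences (a real corpus would be far larger).
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
]

# vector_size: embedding dimension, window: max context offset,
# sg=1 selects skip-gram (sg=0 would select CBOW).
model = Word2Vec(sentences, vector_size=50, window=4, min_count=1, sg=1)

vector = model.wv["cat"]                      # 50-dimensional embedding for "cat"
print(model.wv.most_similar("cat", topn=3))   # nearest neighbors in the toy space
```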
Classic embeddings — GloVe
“GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.”
- It constructs a word co-occurrence matrix X from the entire corpus, capturing how often words appear together. The entry X_{ij} is the number of times word j occurs in the context of word i.
- The optimization objective is to learn word vectors whose dot products approximate the logarithm of the words' co-occurrence counts:

$$J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$

where $w_i$ and $\tilde{w}_j$ are word embedding vectors, $b_i$ and $\tilde{b}_j$ are bias terms, and $f$ is a weighting function that limits the influence of very frequent co-occurrences.
- Unlike Word2Vec, GloVe doesn’t rely solely on local context windows but considers global statistics.
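Pre-trained GloVe vectors are distributed as plain text files, with one word per line followed by its vector components. A minimal loading sketch (assuming glove.6B.100d.txt has been downloaded from the GloVe project page listed in the references) could look like this:

```python
import numpy as np

def load_glove(path):
    """Parse a GloVe text file into a {word: vector} dictionary."""
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.rstrip().split(" ")
            embeddings[word] = np.asarray(values, dtype=np.float32)
    return embeddings

glove = load_glove("glove.6B.100d.txt")  # assumed local file name

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Unlike one-hot vectors, related words now score higher than unrelated ones.
print(cosine(glove["happy"], glove["joy"]))
print(cosine(glove["happy"], glove["banana"]))
```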
Contextualized embeddings — ELMo
ELMo (Embeddings from Language Models) is a word embedding method that provides contextualized word representations, meaning it captures the meaning of a word based on its context within a sentence. This means ELMo produces different representations for words that share the same spelling but have different meanings, such as “bank” in “river bank” and “bank balance”.
ELMo is trained using a two-layer bidirectional language model (biLM). Each layer consists of both a forward and a backward pass through the input text: the model learns to predict the next word in the forward direction and the previous word in the backward direction, so each token's representation is conditioned on the entire sentence. By considering a word’s entire context, bidirectional models capture a more comprehensive understanding of its meaning.
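ELMo itself is distributed as a pre-trained model (for example through AllenNLP or TensorFlow Hub), but the core idea of running a stacked bidirectional LSTM over a token sequence and treating its hidden states as contextual embeddings can be sketched in PyTorch. This is a toy illustration with assumed vocabulary size and dimensions, not the actual ELMo architecture, training objective, or weights:

```python
import torch
import torch.nn as nn

class ToyBiLM(nn.Module):
    """Toy two-layer bidirectional LSTM, illustrating contextual embeddings."""

    def __init__(self, vocab_size=1000, embed_dim=64, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.bilstm = nn.LSTM(embed_dim, hidden_dim, num_layers=2,
                              bidirectional=True, batch_first=True)

    def forward(self, token_ids):
        # Each output position concatenates forward and backward hidden states,
        # so a token's representation depends on the whole sentence around it.
        outputs, _ = self.bilstm(self.embed(token_ids))
        return outputs  # shape: (batch, seq_len, 2 * hidden_dim)

model = ToyBiLM()
token_ids = torch.randint(0, 1000, (1, 6))   # a fake 6-token sentence
contextual = model(token_ids)
print(contextual.shape)  # torch.Size([1, 6, 128])
```

In the real ELMo, the final representation of a token is a learned, task-specific weighted combination of the hidden states from all biLM layers together with a character-based input representation.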
Embeddings used in BERT — WordPiece
For a given token in BERT, its input representation (embedding) is constructed by summing the corresponding token, segment and position embeddings.
- Token embeddings: WordPiece is the subword tokenization method used by BERT. It splits words into smaller units (subword tokens) based on their frequency: the most frequent words remain whole, while less frequent words are split into subword tokens. These subword tokens are then mapped to embeddings via an embedding matrix that is initialized randomly and trained jointly with the rest of BERT during pre-training, so the learned embeddings capture semantic relationships between subword tokens.
- Segment embeddings: They are zeros for tokens in the 1st sentence and ones for tokens in the 2nd sentence. This is useful for the Next Sentence Prediction task BERT is trained on.
- Position embeddings: They encode the position of each token in the sequence. These embeddings allow BERT to understand the context and order of tokens.
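The Hugging Face transformers library exposes both the WordPiece tokenizer and these three embedding tables, so the construction above can be inspected directly. A brief sketch, assuming the bert-base-uncased checkpoint can be downloaded:

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# WordPiece keeps frequent words whole and splits rare words into pieces
# marked with "##" (the exact split depends on the learned vocabulary).
print(tokenizer.tokenize("embeddings are unintimidating"))

encoded = tokenizer("river bank", "bank balance", return_tensors="pt")
input_ids = encoded["input_ids"]           # token ids, incl. [CLS]/[SEP]
segment_ids = encoded["token_type_ids"]    # 0 for sentence A, 1 for sentence B
position_ids = torch.arange(input_ids.size(1)).unsqueeze(0)

emb = model.embeddings
# Input representation = token + segment + position embeddings
# (BERT then applies LayerNorm and dropout before the transformer layers).
input_repr = (emb.word_embeddings(input_ids)
              + emb.token_type_embeddings(segment_ids)
              + emb.position_embeddings(position_ids))
print(input_repr.shape)  # (1, sequence_length, 768)
```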
References
https://www.ibm.com/topics/word-embeddings
https://nlp.stanford.edu/projects/glove/