Natural Language Processing Glossary (Part I)

lzhangstat
5 min read · Jul 13, 2024


Text preprocessing

Tokenization: refers to the process of converting a sequence of text into smaller parts, known as tokens. These tokens can be as small as characters or as long as words.
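A minimal sketch of subword tokenization, assuming the Hugging Face transformers package and the "bert-base-uncased" checkpoint are available:

```python
# Minimal sketch: subword tokenization with a pretrained tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "Tokenization splits text into smaller units."
tokens = tokenizer.tokenize(text)          # subword pieces, e.g. ['token', '##ization', ...]
ids = tokenizer.convert_tokens_to_ids(tokens)

print(tokens)
print(ids)
```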

Embedding: Embeddings are dense vector representations of words or phrases in a continuous, fixed-dimensional space. They capture semantic relationships and are widely used in natural language processing (NLP) tasks. In traditional NLP models, embeddings (e.g., Word2Vec, GloVe) map individual words to dense vectors based on co-occurrence statistics; these embeddings are context-independent, meaning they don’t consider the surrounding words or sentence. Large language models (LLMs) like BERT, GPT, and RoBERTa generate contextual embeddings by considering the entire sentence or document, so the same word can receive different vectors depending on the surrounding context.

Word2vec: Word2vec is a group of related models that are used to produce word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words. In particular, words which appear in similar contexts are mapped to vectors which are nearby as measured by cosine similarity.
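A minimal sketch of training a tiny Word2Vec model with gensim and comparing two embeddings by cosine similarity; the toy corpus is illustrative only, and real models are trained on far larger text collections:

```python
# Minimal sketch: Word2Vec embeddings with gensim (toy corpus for illustration).
from gensim.models import Word2Vec

sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["cats", "and", "dogs", "are", "pets"],
]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=50)

vec = model.wv["cat"]                      # 50-dimensional embedding for "cat"
print(model.wv.similarity("cat", "dog"))   # cosine similarity between the two word vectors
```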

Padding and truncation: Sentences have different lengths, so they can’t be converted directly into fixed-size tensors. Padding and truncation are strategies for dealing with this problem. Padding appends special padding tokens to shorter sequences so they match the length of the longest sequence in a batch (or the model’s maximum accepted length). Truncation trims longer sequences down to the desired length.

Masking: After padding, we need to inform the model that certain parts of the input are actually padding and should be ignored during processing. This mechanism is called masking (often supplied as an attention mask). Masking helps the model focus on the real tokens while ignoring padded positions.
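A minimal sketch covering both of the previous entries, padding/truncation and the resulting attention mask, assuming the transformers package and the "bert-base-uncased" tokenizer:

```python
# Minimal sketch: padding, truncation, and the attention mask they produce.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

batch = [
    "A short sentence.",
    "A much longer sentence that will determine the padded length of the batch.",
]
encoded = tokenizer(
    batch,
    padding=True,        # pad to the longest sequence in the batch
    truncation=True,     # trim sequences that exceed max_length
    max_length=16,
    return_tensors="pt",
)

print(encoded["input_ids"].shape)     # (2, sequence_length)
print(encoded["attention_mask"])      # 1 for real tokens, 0 for padding
```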

General Model Concepts

Zero-shot, one-shot, and few-shot: Few-shot refers to the setting where the model is given a few demonstrations of the task at inference time as conditioning, but no weight updates are allowed. One-shot is the same as few-shot except that only one demonstration is allowed, in addition to a natural language description of the task. Zero-shot is the same as one-shot except that no demonstrations are allowed, only the task description.
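A minimal sketch of what the three prompt styles look like for a sentiment task; the wording is illustrative and no particular model or API is assumed:

```python
# Minimal sketch: zero-, one-, and few-shot prompts for a sentiment task.
task = "Classify the sentiment of the review as Positive or Negative."

zero_shot = f"""{task}
Review: "The plot dragged on forever."
Sentiment:"""

one_shot = f"""{task}
Review: "Absolutely loved it." -> Positive
Review: "The plot dragged on forever."
Sentiment:"""

few_shot = f"""{task}
Review: "Absolutely loved it." -> Positive
Review: "A complete waste of time." -> Negative
Review: "Best film I have seen this year." -> Positive
Review: "The plot dragged on forever."
Sentiment:"""

# Each prompt is sent to a language model as-is; no weights are updated.
```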

Seq2Seq, Encoder, Decoder: A sequence-to-sequence (Seq2Seq) model is a machine learning architecture designed for tasks involving sequential data, such as text. Most Seq2Seq models consist of two parts: an encoder and a decoder. The encoder processes the input sequence and captures its context in an internal representation. The decoder takes this internal state from the encoder and generates the output (target) sequence token by token.
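A minimal sketch of running a pretrained encoder-decoder model, assuming the transformers package and the "t5-small" checkpoint:

```python
# Minimal sketch: inference with a pretrained Seq2Seq (encoder-decoder) model.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# The encoder reads the source text; the decoder generates the target text.
inputs = tokenizer("translate English to German: The house is small.", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```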

Knowledge base: A knowledge base (KB) is a structured database containing a collection of facts represented as (subject, relation, object) triples. It serves as a resource for AI chatbots and conversational systems. These systems leverage the knowledge base to provide natural language responses to users, enabling them to understand human queries and generate relevant answers based on the stored information.
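A minimal sketch of a toy knowledge base of (subject, relation, object) triples with a simple lookup; the structure and helper are illustrative only:

```python
# Minimal sketch: a toy knowledge base of (subject, relation, object) triples.
knowledge_base = [
    ("Paris", "capital_of", "France"),
    ("Berlin", "capital_of", "Germany"),
    ("France", "located_in", "Europe"),
]

def query(subject, relation):
    """Return all objects linked to `subject` by `relation`."""
    return [o for s, r, o in knowledge_base if s == subject and r == relation]

print(query("Paris", "capital_of"))   # ['France']
```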

Knowledge Distillation: Knowledge distillation is a technique in which a large language model transfers its knowledge to a smaller model to achieve similar performance with reduced computational resources.
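A minimal sketch of one common distillation loss, written in PyTorch: the student is trained to match the teacher's temperature-softened output distribution alongside the usual hard-label loss. The teacher and student models themselves are assumed to exist elsewhere; here random logits stand in for their outputs.

```python
# Minimal sketch: a distillation loss combining soft (teacher) and hard (label) targets.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: KL divergence between temperature-scaled distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: standard cross-entropy with the true labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Example with random logits for a 4-class problem:
student = torch.randn(8, 4)
teacher = torch.randn(8, 4)
labels = torch.randint(0, 4, (8,))
print(distillation_loss(student, teacher, labels))
```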

Quantization: Quantization reduces the precision of numerical representations in large language models to make them more memory-efficient during deployment.
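A minimal sketch of post-training dynamic quantization in PyTorch, where the weights of Linear layers are stored in int8 instead of float32; the toy model here is only for illustration:

```python
# Minimal sketch: dynamic quantization of a small PyTorch model.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 2))

quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 128)
print(quantized(x))   # same interface, smaller memory footprint for the weights
```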

Model training

Training warm-up: Training warm-up steps refer to an initial phase of training in which the learning rate is gradually increased from a small value to the target learning rate over a specified number of steps or epochs. The primary purpose of warm-up is to let the model adapt gradually to the data. Without warm-up, the model can be unduly influenced by early training examples and may require additional epochs to converge.
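A minimal sketch of linear learning-rate warm-up implemented with PyTorch's LambdaLR scheduler; the rate is held constant after warm-up here, whereas real schedules usually decay afterwards, and the gradient computation is omitted for brevity:

```python
# Minimal sketch: linear learning-rate warm-up with a LambdaLR scheduler.
import torch

model = torch.nn.Linear(10, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

warmup_steps = 100

def lr_lambda(step):
    # Scale the target learning rate from ~0 up to 1.0 over `warmup_steps`.
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    return 1.0

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for step in range(300):
    optimizer.step()       # (forward/backward passes omitted for brevity)
    scheduler.step()
    if step % 100 == 0:
        print(step, scheduler.get_last_lr())
```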

Fine-tuning: Fine-tuning updates the weights of a pre-trained model by training it on a supervised dataset specific to the desired task. Typically thousands to hundreds of thousands of labeled examples are used. Fine-tuning allows us to leverage pre-trained models (such as BERT or GPT) that have already learned useful features from large-scale data; by fine-tuning on a specific task, we adapt these models to perform well in a narrower domain without starting from scratch.
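A minimal sketch of fine-tuning a pretrained classifier, assuming the transformers package, the "bert-base-uncased" checkpoint, and a tiny in-memory dataset; a real setup would use a proper DataLoader and far more labeled examples:

```python
# Minimal sketch: fine-tuning a pretrained classifier on a handful of labeled examples.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

texts = ["great movie", "terrible movie"]
labels = torch.tensor([1, 0])
batch = tokenizer(texts, padding=True, return_tensors="pt")

model.train()
for epoch in range(3):
    outputs = model(**batch, labels=labels)   # supervised loss from the task labels
    outputs.loss.backward()                   # gradients update the pretrained weights
    optimizer.step()
    optimizer.zero_grad()
    print(epoch, outputs.loss.item())
```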

Beam Search: Beam search is a heuristic search algorithm used as a decoding strategy for producing the next tokens. In contrast to greedy decoding, where the most likely next word is chosen at each step, beam search considers multiple possibilities simultaneously. We start with an initial sequence (usually just the start token) and initialize a set of candidate sequences (the “beam”). For each candidate sequence: 1. Generate the next token based on the current context and the model’s predictions. 2. Compute the probability of the extended sequence (including the new token). 3. Keep the top-k sequences (where k is the beam size) with the highest probabilities. This process repeats step by step until a predefined maximum length is reached or an end token is generated.
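A minimal sketch of beam search over a toy bigram "language model"; the hypothetical next_token_probs function stands in for a real model's next-token distribution:

```python
# Minimal sketch: beam search over a toy bigram model.
import math

BIGRAMS = {
    "<bos>": {"the": 0.8, "a": 0.2},
    "the":   {"cat": 0.6, "mat": 0.4},
    "a":     {"cat": 1.0},
    "cat":   {"sat": 0.9, "<eos>": 0.1},
    "sat":   {"<eos>": 1.0},
    "mat":   {"<eos>": 1.0},
}

def next_token_probs(sequence):
    # Stand-in for a real model's predicted distribution over the vocabulary.
    return BIGRAMS.get(sequence[-1], {"<eos>": 1.0})

def beam_search(beam_size=2, max_len=6):
    beams = [(["<bos>"], 0.0)]                 # (sequence, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == "<eos>":             # finished sequences stay in the beam
                candidates.append((seq, score))
                continue
            for tok, p in next_token_probs(seq).items():
                candidates.append((seq + [tok], score + math.log(p)))
        # Keep only the top-k (beam_size) candidates by total log-probability.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams

for seq, score in beam_search():
    print(" ".join(seq), round(score, 3))
```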

Model Evaluation

Perplexity: Perplexity (PPL) is one of the most common metrics for evaluating language models. The metric applies specifically to classical language models (sometimes called autoregressive or causal language models) and is not well defined for masked language models like BERT. Perplexity is defined as the exponentiated average negative log-likelihood of a sequence, equivalent to the exponentiation of the cross-entropy between the data and model predictions. A low perplexity indicates that the model is confident in its predictions. It suggests that the model assigns high probabilities to observed text. However, low perplexity doesn’t guarantee accuracy — it merely reflects confidence.
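A minimal sketch of the definition above in code: perplexity computed as the exponential of the average per-token negative log-likelihood (cross-entropy), using random logits as hypothetical model outputs:

```python
# Minimal sketch: perplexity as exponentiated average negative log-likelihood.
import torch
import torch.nn.functional as F

# Hypothetical model outputs: logits for a 5-token sequence over a 10-word vocabulary.
logits = torch.randn(5, 10)
targets = torch.randint(0, 10, (5,))     # the tokens that actually occurred

nll = F.cross_entropy(logits, targets)   # average negative log-likelihood per token
perplexity = torch.exp(nll)
print(perplexity.item())
```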

Ablation study: An ablation study investigates the performance of an AI system by removing certain components to understand the contribution of the component to the overall system.

Other

Temperature: It is a hyperparameter that controls the randomness of predictions by scaling the logits before applying softmax. A higher temperature produces more random outputs, while a lower temperature makes the model’s output more deterministic.
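A minimal sketch of temperature scaling: the same logits produce a peakier (more deterministic) distribution at low temperature and a flatter (more random) one at high temperature:

```python
# Minimal sketch: temperature scaling of logits before softmax.
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.5])

for T in (0.5, 1.0, 2.0):
    probs = F.softmax(logits / T, dim=-1)
    print(T, probs.tolist())   # lower T -> peakier distribution, higher T -> flatter
```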

Named-entity recognition (NER): is a subtask of information extraction in natural language processing (NLP). Its purpose is to locate and classify named entities (such as person names, organizations, locations, etc.) mentioned in unstructured text.
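A minimal sketch of NER with spaCy, assuming spaCy and its "en_core_web_sm" model are installed; the predicted labels shown in the comment are typical examples rather than guaranteed outputs:

```python
# Minimal sketch: named-entity recognition with spaCy.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple opened a new office in Paris in June 2024.")

for ent in doc.ents:
    print(ent.text, ent.label_)   # e.g. Apple ORG, Paris GPE, June 2024 DATE
```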
