Entropy and cross-entropy

lzhangstat
4 min read · Apr 29, 2024

In the realm of machine learning, entropy plays a pivotal role in understanding various loss functions. In this article, I will delve into the meaning of entropy and cross-entropy, exploring how they contribute to the creation of loss functions within the machine learning landscape.

Entropy

According to Wikipedia, entropy is a concept most commonly associated with a state of disorder, randomness, or uncertainty. It finds applications across various fields, but its significance to data scientists lies primarily in information theory. The fundamental idea behind information theory is that the occurrence of an unlikely event is more informative than the occurrence of a likely event.

Formally, the entropy H of a probability distribution P is defined as

H(P) = \mathbb{E}_{x \sim P}[-\log P(x)] = -\sum_{x} P(x) \log P(x)

In other words, the entropy of a distribution is the expected amount of information in an event drawn from that distribution. It gives a lower bound on the number of bits needed on average to encode symbols drawn from a distribution P. Distributions that are nearly deterministic (where the outcome is nearly certain) have low entropy; distributions that are closer to uniform have high entropy.
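
To make this concrete, here is a minimal sketch in NumPy (the distributions are made up purely for illustration) showing that a nearly deterministic distribution has much lower entropy than a uniform one:

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy in bits: H(P) = -sum_x P(x) * log2 P(x)."""
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log2(p + eps))  # eps guards against log(0)

# A nearly deterministic distribution vs. a uniform one over 4 outcomes
nearly_deterministic = [0.97, 0.01, 0.01, 0.01]
uniform = [0.25, 0.25, 0.25, 0.25]

print(entropy(nearly_deterministic))  # ~0.24 bits (low uncertainty)
print(entropy(uniform))               # 2.0 bits (maximum for 4 outcomes)
```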

Entropy is used broadly in the training of decision trees. Decision trees recursively split the data based on the feature that maximizes information gain, which represents the reduction in entropy achieved by splitting the data based on a particular feature. Essentially, information gain measures how much uncertainty is removed after the split.

As a side note, similar to entropy, the Gini index also quantifies impurity and can be used to choose decision tree splits. It measures the probability of misclassifying a randomly chosen element from the subgroup. A lower Gini index indicates higher purity.
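
As a rough sketch of how these impurity measures drive a split (using a tiny made-up set of labels, not any particular library's implementation), information gain and the Gini index can be computed directly:

```python
import numpy as np

def entropy(labels):
    """Entropy of a label array, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    """Gini impurity: probability of misclassifying a random element."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def information_gain(parent, left, right):
    """Reduction in entropy achieved by splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted_child_entropy = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted_child_entropy

# Toy binary labels before and after a candidate split
parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])
left   = np.array([0, 0, 0, 1])   # mostly class 0
right  = np.array([0, 1, 1, 1])   # mostly class 1

print(information_gain(parent, left, right))  # ~0.19 bits of uncertainty removed
print(gini(parent), gini(left), gini(right))  # 0.5, 0.375, 0.375
```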

Cross-entropy

If we have two separate probability distributions P(X) and Q(X) over the same random variable X, we can measure how different these two distributions are using the KL divergence:

D_{KL}(P \| Q) = \mathbb{E}_{x \sim P}\left[\log \frac{P(x)}{Q(x)}\right] = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}

Consider two probability distributions P and Q. Usually, P represents the data, the observations, or an empirical probability distribution determined by the data, while Q represents a theory, a model, a description, or an approximation of P. Cross-entropy measures the average number of bits needed to encode events drawn from P when the code is optimized for Q. Intuitively, to calculate the cross-entropy between P and Q, you simply evaluate the entropy formula on Q but use probability weights from P. Formally,

H(P, Q) = \mathbb{E}_{x \sim P}[-\log Q(x)]

For a discrete distribution,

H(P, Q) = -\sum_{x} P(x) \log Q(x)
The cross-entropy of P and Q is equal to the entropy of P plus the KL divergence between P and Q:

H(P, Q) = H(P) + D_{KL}(P \| Q)

The KL divergence can be interpreted as the expected number of extra bits required to encode samples from P using a code optimized for Q rather than one optimized for P. Note that the roles of P and Q can be reversed in some situations where that is easier to compute, such as with the expectation-maximization (EM) algorithm and evidence lower bound (ELBO) computations.

Note that KL divergence is not a proper distance metric. For one thing, it is not symmetric in P and Q, and it does not satisfy the triangle inequality.
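
The following is a small numerical sketch (with arbitrary made-up distributions) that checks the identity H(P, Q) = H(P) + D_KL(P || Q) and shows that swapping P and Q changes the divergence:

```python
import numpy as np

def entropy(p):
    return -np.sum(p * np.log(p))          # in nats

def cross_entropy(p, q):
    return -np.sum(p * np.log(q))

def kl_divergence(p, q):
    return np.sum(p * np.log(p / q))

# Two arbitrary distributions over three outcomes
p = np.array([0.7, 0.2, 0.1])
q = np.array([0.4, 0.4, 0.2])

print(cross_entropy(p, q))                        # H(P, Q)
print(entropy(p) + kl_divergence(p, q))           # same value: H(P) + D_KL(P || Q)
print(kl_divergence(p, q), kl_divergence(q, p))   # different values: KL is not symmetric
```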

Cross-entropy is commonly used as a loss function in machine learning, especially for classification tasks. During model training, the cross-entropy loss guides the optimization process: model weights are adjusted to minimize the discrepancy between predicted and actual class probabilities. (In decision tree construction, the analogous goal is to minimize entropy by splitting the data into more homogeneous subsets.) For classification problems, using cross-entropy as a loss function is equivalent to maximizing the log likelihood. I have shown the proof of equivalence in another article.
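
As an illustration, here is a minimal NumPy sketch of multi-class cross-entropy loss (not the implementation of any particular framework); the loss is simply the average of -log Q(true class) over the examples:

```python
import numpy as np

def cross_entropy_loss(y_true, y_pred_probs, eps=1e-12):
    """Average cross-entropy between one-hot labels and predicted probabilities.

    y_true:       (n_samples, n_classes) one-hot labels (the empirical P)
    y_pred_probs: (n_samples, n_classes) predicted class probabilities (the model Q)
    """
    y_pred_probs = np.clip(y_pred_probs, eps, 1.0)   # avoid log(0)
    return -np.mean(np.sum(y_true * np.log(y_pred_probs), axis=1))

# Three samples, three classes
y_true = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [0, 0, 1]])
confident_correct = np.array([[0.9, 0.05, 0.05],
                              [0.05, 0.9, 0.05],
                              [0.05, 0.05, 0.9]])
confident_wrong = np.array([[0.05, 0.9, 0.05],
                            [0.9, 0.05, 0.05],
                            [0.05, 0.9, 0.05]])

print(cross_entropy_loss(y_true, confident_correct))  # ~0.105 (low loss)
print(cross_entropy_loss(y_true, confident_wrong))    # ~3.0 (high loss)
```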

Why can cross-entropy loss be a better loss function than MSE?

Mean squared error and mean absolute error often lead to poor results when used with gradient-based optimization. In many cases this happens because the activation functions used to produce the output of the hidden units or the output units saturate. The negative log-likelihood helps to avoid this problem for many models. Many output units involve an exp function that can saturate when its argument is very negative, and the log in the negative log-likelihood cost function undoes that exp. For example, a sigmoid output unit saturates when its pre-activation is strongly negative, so the MSE gradient nearly vanishes there, while the cross-entropy gradient remains informative.
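
A rough sketch of this effect, assuming a single sigmoid output unit and comparing gradients with respect to the pre-activation z (the specific numbers are just for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mse_grad(z, y):
    """d/dz of 0.5 * (sigmoid(z) - y)^2: shrinks toward 0 when the sigmoid saturates."""
    s = sigmoid(z)
    return (s - y) * s * (1.0 - s)

def cross_entropy_grad(z, y):
    """d/dz of -[y*log(sigmoid(z)) + (1-y)*log(1-sigmoid(z))]: simplifies to sigmoid(z) - y."""
    return sigmoid(z) - y

# Target is 1, but the pre-activation is very negative (a badly wrong, saturated prediction)
z, y = -10.0, 1.0
print(mse_grad(z, y))            # ~ -4.5e-5: almost no learning signal
print(cross_entropy_grad(z, y))  # ~ -1.0: strong, informative gradient
```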

References:

Deep Learning (Ian J. Goodfellow, Yoshua Bengio and Aaron Courville), MIT Press, 2016.
