Losses and Metrics
This post will cover the most common and useful deep learning loss functions and metrics.
Losses are functions that guide a model's optimization. Metrics are human-interpretable measures that indicate how well a model is performing. Feel free to skip to specific sections:
Losses: Mean Squared Error (MSE), Cross-Entropy Loss (CE), KL Divergence, Hinge Loss, Poisson Loss
Metrics: Accuracy, Confusion Matrix, Precision/Recall/F1, ROC Curve, AUC
Loss Functions
Mean Squared Error (MSE)
Used for: Regression
Definition: Mean squared error is the mean of the squared error. That is, for every example, get the difference of the target and the prediction, square this difference, add these values together and divide by the number of examples: \[MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2\] MSE gives more weight to larger differences due to squaring.
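As a quick sketch, MSE can be computed directly with NumPy (the function name and example arrays here are illustrative):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: the average of the squared differences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

# Errors are 1, -1, and 2; squared: 1, 1, 4; mean: 2.0
print(mse([3.0, 5.0, 2.0], [2.0, 6.0, 0.0]))  # 2.0
```

Note how the example with an error of 2 contributes four times as much to the loss as the examples with errors of 1.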
Cross-Entropy Loss (CE)
Used for: Classification
Definition: Cross-entropy loss is the loss obtained by taking the negative log of the probability corresponding to the target label, summing this up for all examples, and then dividing by the number of examples: \[CE = -\frac{1}{n} \sum_{i=1}^{n} \sum_{c=1}^{C} y_{i,c} * log(\hat{p}_{i,c})\] Here \(C\) is the number of classes, \(\hat{p}_{i,c}\) is the predicted probability for class \(c\) of example \(i\), and \(y_{i,c}\) is the target probability (typically 0 or 1) that example \(i\) belongs to class \(c\). For a given \(i\), typically only a single \(y_{i,c}\) will be nonzero. This has the effect of ignoring all output probabilities aside from the one that corresponds to the target class. Note that there are some cases where the target probability is soft (i.e., not just 0 or 1). One example is when training a student model to match a teacher model.
Cross-entropy loss can be used for both binary and multiclass classification. When used for binary classification with a single output value, it's common to see it written as follows (often referred to as binary cross-entropy loss, BCE): \[BCE = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_{i} * log(\hat{p}_i) + (1 - y_{i}) * log(1 - \hat{p}_i) \right]\] This is saying, "When the label is 0, take the negative log of one minus the prediction. When the label is 1, take the negative log of the prediction." It can also be used in the case where multiple classes appear in a single example (multilabel classification). In this case, each potential class would have its own probability ranging from 0 to 1, and the BCE loss would be calculated for each class and averaged (over all classes and examples) to get a single loss value.
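The multiclass form can be sketched in NumPy as follows (the `eps` clipping is an assumption added here to avoid log(0); it is not part of the definition):

```python
import numpy as np

def cross_entropy(y_true, p_pred, eps=1e-12):
    """Multiclass cross-entropy.

    y_true: (n, C) target probabilities (one-hot or soft).
    p_pred: (n, C) predicted probabilities (each row sums to 1).
    """
    p = np.clip(p_pred, eps, 1.0)  # avoid log(0)
    return -np.mean(np.sum(y_true * np.log(p), axis=1))

# Two examples, three classes, one-hot targets.
y = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
p = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1]])

# Only the target-class probabilities matter: -(log 0.7 + log 0.8) / 2
print(cross_entropy(y, p))
```

Because the targets are one-hot, every term except the target class drops out of the inner sum, exactly as described above.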
Kullback-Leibler (KL) Divergence
Used for: Matching learned distribution to target distribution
Definition: Kullback-Leibler (KL) divergence is the amount that a probability distribution differs from a second, expected target distribution. Said differently, if probability distribution \(\hat{p}\) is used to model true distribution \(y\), KL divergence is the amount that the cross-entropy exceeds the entropy of \(y\): \[D_{KL}(y \ || \ \hat{p}) = CE(y, \hat{p}) - H(y)\] The entropy, \(H\), is defined as: \[H(y) = -\sum_{i} y_i * log(y_i)\] And we know from earlier that cross-entropy, \(CE\), is: \[CE(y, \hat{p}) = -\sum_{i} y_i * log(\hat{p}_i)\] Hence, with some algebra we can restate KL divergence as: \[D_{KL}(y \ || \ \hat{p}) = \sum_{i} y_i * log(\frac{y_i}{\hat{p}_i})\] If \(\hat{p}\) perfectly matches \(y\), then the KL divergence is 0. Otherwise, it is some positive value.
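A small sketch that checks the two equivalent forms against each other (the distributions are made up for illustration):

```python
import numpy as np

def kl_divergence(y, p_hat):
    """D_KL(y || p_hat); terms where y_i == 0 contribute nothing."""
    y, p_hat = np.asarray(y, dtype=float), np.asarray(p_hat, dtype=float)
    mask = y > 0
    return np.sum(y[mask] * np.log(y[mask] / p_hat[mask]))

y = np.array([0.5, 0.25, 0.25])   # true distribution
p = np.array([0.4, 0.4, 0.2])     # model distribution

# Same value via CE(y, p) - H(y), matching the derivation above.
ce = -np.sum(y * np.log(p))
h = -np.sum(y * np.log(y))
print(kl_divergence(y, p), ce - h)  # the two values agree
print(kl_divergence(y, y))          # 0.0 when the distributions match
```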
Hinge Loss
Used for: Classification (maximum-margin classifiers such as SVMs)
Definition: For labels \(y \in \{-1, 1\}\) and raw (unsquashed) model output \(\hat{y}\), hinge loss is zero when the prediction has the correct sign and a margin of at least 1, and grows linearly otherwise: \[Hinge = \frac{1}{n}\sum_{i=1}^{n} max(0, 1 - y_i * \hat{y}_i)\]
Poisson Loss
Used for: Regression on count data (e.g., the number of events in a fixed interval)
Definition: Poisson loss is derived from the negative log-likelihood of a Poisson distribution, dropping terms that don't depend on the prediction: \[Poisson = \frac{1}{n}\sum_{i=1}^{n} (\hat{y}_i - y_i * log(\hat{y}_i))\] Here \(\hat{y}_i\) must be positive, since it is interpreted as the predicted rate.
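Minimal sketches of the standard forms of both losses (the {-1, 1} label convention for hinge and strictly positive predictions for Poisson are assumptions of these snippets):

```python
import numpy as np

def hinge_loss(y_true, y_raw):
    """Mean hinge loss; y_true in {-1, 1}, y_raw is the raw model output."""
    y_true, y_raw = np.asarray(y_true, float), np.asarray(y_raw, float)
    return np.mean(np.maximum(0.0, 1.0 - y_true * y_raw))

def poisson_loss(y_true, y_pred):
    """Mean Poisson loss for count targets; y_pred must be positive."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.mean(y_pred - y_true * np.log(y_pred))

# Hinge: margins 1 - y*yhat are [-1, 1.5, 0.5]; clamped to [0, 1.5, 0.5]
print(hinge_loss([1, -1, 1], [2.0, 0.5, 0.5]))  # mean is 2/3
print(poisson_loss([2.0, 0.0], [2.0, 0.5]))
```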
Metrics
Accuracy
Used for: Classification
Definition: Accuracy is the number of correct predictions divided by the total number of predictions. It is defined as follows, where \(n\) is the number of instances (i.e., samples, examples, or data points depending on your preferred terminology): \[Accuracy = \frac{1}{n}\sum_{i=1}^{n}pred_i==label_i\]
It can also be defined in terms of true and false positives and negatives: \[Accuracy = \frac{TP + TN}{TP + TN + FP + FN}\]
Accuracy is not a good metric to use with a skewed dataset. For example, a classification task where 95% of labels are Class A can reach 95% accuracy by simply predicting Class A for every example.
For multiclass classification problems there is a variant called top-k accuracy (sometimes notated as accuracy@k). In top-k accuracy, a prediction is correct if the true label is one of the top k predictions.
For example, if the true label is giraffe, and the model predicts giraffe as the 4th most likely class, then this is considered a correct prediction for all top k accuracies where k >= 4.
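A sketch of both plain and top-k accuracy (the score matrix and labels are made up for illustration):

```python
import numpy as np

def accuracy(preds, labels):
    """Fraction of predictions that match their labels."""
    return np.mean(np.asarray(preds) == np.asarray(labels))

def top_k_accuracy(scores, labels, k):
    """A prediction counts as correct if the true label is among
    the k highest-scoring classes. `scores` has shape (n, C)."""
    scores = np.asarray(scores)
    top_k = np.argsort(scores, axis=1)[:, -k:]  # indices of the k largest scores
    return np.mean([label in row for label, row in zip(labels, top_k)])

# 3 examples, 3 classes
scores = np.array([[0.6, 0.3, 0.1],
                   [0.2, 0.5, 0.3],
                   [0.1, 0.2, 0.7]])
labels = [0, 2, 2]
print(accuracy(scores.argmax(axis=1), labels))  # 2/3: example 2's top guess is wrong
print(top_k_accuracy(scores, labels, k=2))      # 1.0: class 2 is in example 2's top 2
```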
Confusion Matrix
Used for: Classification
Definition: A confusion matrix is a matrix which provides a detailed breakdown of correct and incorrect classifications. Specifically, for each class it lists the number of correct predictions and incorrect predictions.
An example confusion matrix for a multiclass classifier with 3 classes is shown below:
|  | Predicted A | Predicted B | Predicted C |
|---|---|---|---|
| Actual A | 50 | 5 | 2 |
| Actual B | 3 | 70 | 4 |
| Actual C | 1 | 6 | 40 |
Each row represents an actual class and each column represents a predicted class. Thus, the cell at row i and column j tells you how many instances of the class in row i were classified as the class in column j. A perfect confusion matrix would have 0s everywhere but the main diagonal.
For each class, we can compute the true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). This is shown below for Class A:
- TP for A: These are the instances that are actually A and are also predicted as A. So, TP for A is 50.
- FP for A: These are the instances that are not A but are predicted as A. So, FP for A is the sum of all predicted A's for actual B's and C's, which is 3 + 1 = 4.
- TN for A: These are the instances that are neither actually A nor predicted as A. So, TN for A is the sum of all cells outside both row A and column A, which is (70 + 4) + (6 + 40) = 120.
- FN for A: These are the instances that are actually A but are not predicted as A. So, FN for A is the sum of all actual A's predicted as B's and C's, which is 5 + 2 = 7.
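The per-class counts above can be computed directly from the matrix; here is a sketch (the function name is illustrative):

```python
import numpy as np

# Confusion matrix from the table above: rows = actual, columns = predicted (A, B, C)
cm = np.array([[50, 5, 2],
               [3, 70, 4],
               [1, 6, 40]])

def per_class_counts(cm, c):
    """TP, FP, FN, TN for class index c of a confusion matrix."""
    tp = cm[c, c]
    fp = cm[:, c].sum() - tp   # predicted as c but actually another class
    fn = cm[c, :].sum() - tp   # actually c but predicted as another class
    tn = cm.sum() - tp - fp - fn
    return tp, fp, fn, tn

print(per_class_counts(cm, 0))  # (50, 4, 7, 120) for Class A
```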
Precision, Recall, and F1 Score
Precision used for: Classification
Precision Definition: Number of true positives for a class divided by the total number of instances predicted as that class: \[precision = \frac{TP}{TP + FP}\] Precision is a metric that answers the question, "When the model predicts positive, how often is it correct?" If the model never has any false positives it will have a precision of 1.0. This means that as the model's decision threshold increases, precision will tend to increase. For example, if the model only predicts positive when it's extremely certain it's correct, then it will have high precision. Note the extreme case: a model that makes only a single positive prediction and gets it right will have a precision of 1.0, no matter how many positives it misses.
Recall used for: Classification
Recall Definition: Number of true positives for a class divided by the total number of actual instances of that class: \[recall = \frac{TP}{TP + FN}\] Recall is a metric that answers the question, "Out of all the actual positive instances, how many did the model correctly identify?" If the model has zero false negatives, meaning it successfully identified every positive instance, then it will have a recall of 1.0. Thus, as the threshold for predicting a positive instance is lowered, the recall will increase.
Typically there is a trade-off between precision and recall: as one increases, the other decreases. Generally speaking, tuning a model for high precision produces more false negatives, while tuning it for high recall produces more false positives.
The application that the model is used for usually determines whether high precision or high recall is more important. For example, in medicine it's often important to not miss illness, hence a higher recall (i.e., a higher percentage of actual positive cases detected as positive) is preferred to higher precision. On the other hand, spam detection generally should have higher precision, since it's more important to avoid labeling legitimate email as spam than it is to let some spam get through.
F1 score used for: Classification
F1 Score Definition: A metric that combines precision and recall. Specifically, it is the harmonic mean of precision and recall:
\[F1 = 2* \frac{precision * recall}{precision + recall}\] \[= \frac{TP}{TP+\frac{FN + FP}{2}}\]
The harmonic mean gives more weight to lower values than a regular mean. Thus, in order for the F1 score to be high both precision and recall must be high.
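A sketch that plugs in the Class A counts derived in the confusion matrix section (TP=50, FP=4, FN=7):

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from per-class counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Class A: TP=50, FP=4, FN=7
p, r, f1 = precision_recall_f1(50, 4, 7)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.926 0.877 0.901
```

Note that the F1 of 0.901 sits below the arithmetic mean of precision and recall (0.902), as the harmonic mean always does; the gap widens as the two values diverge.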
ROC Curve
Used for: Classification
Definition: The ROC (receiver operating characteristic) curve shows the performance of the model at all classification thresholds. It accomplishes this by plotting the true positive rate (TPR, aka recall) vs. the false positive rate (FPR): \[TPR = \frac{TP}{TP + FN}\] \[FPR = \frac{FP}{FP + TN}\] Lowering the threshold raises recall but also increases the number of false positives, and thus the FPR. Hence, as with precision and recall, there is a trade-off. The shapes of the ROC curves for a typical classifier, a perfect classifier, and a random classifier are shown below.
The ROC curve provides a visualization of how well the classifier handles the TPR/FPR trade-off. To get a numeric value that indicates how good a specific ROC curve is (and hence how good a specific classifier is), the area under this curve (AUC) can be calculated. AUC is discussed next.
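The curve's points can be sketched by sweeping a threshold over the model's scores (the scores and labels below are made up for illustration):

```python
import numpy as np

def roc_points(scores, labels):
    """(FPR, TPR) at each candidate threshold, swept from high to low."""
    scores, labels = np.asarray(scores, float), np.asarray(labels)
    points = []
    for t in np.unique(scores)[::-1]:
        preds = scores >= t
        tp = np.sum(preds & (labels == 1))
        fp = np.sum(preds & (labels == 0))
        fn = np.sum(~preds & (labels == 1))
        tn = np.sum(~preds & (labels == 0))
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return points

scores = [0.9, 0.8, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0]
print(roc_points(scores, labels))  # FPR rises as TPR rises
```

Each lowering of the threshold moves the curve up (another true positive) or right (another false positive), which is the trade-off described above.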
Area Under the Curve (AUC)
Used for: Classification
Definition: AUC is the area under the ROC curve. A value of 1 indicates a perfect classifier, and a value of 0.5 indicates a completely random classifier. AUC can be thought of as the probability that a randomly chosen positive example is scored higher than a randomly chosen negative example. An example AUC is shown below:
One nice feature of AUC is that it provides a single value for comparing models regardless of classification threshold. That said, in a situation where minimizing or maximizing a specific type of classification error is important, AUC may not be the most relevant value.
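The rank-based interpretation gives a simple way to compute AUC directly; a sketch (ties counted as half, data illustrative):

```python
import numpy as np

def roc_auc(scores, labels):
    """AUC as the probability that a randomly chosen positive example
    is scored higher than a randomly chosen negative one."""
    scores, labels = np.asarray(scores, float), np.asarray(labels)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).sum()   # positive outranks negative
    ties = (pos[:, None] == neg[None, :]).sum()     # ties count as half
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

scores = [0.9, 0.8, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0]
# 5 of the 6 positive/negative pairs are ranked correctly
print(roc_auc(scores, labels))  # 5/6
```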