TensorFlow Metric Interfaces Explained

This entire post is dedicated to coming to terms with TensorFlow's metrics and the varying input formats associated with them. It is a direct result of me not using the correct version of a metric and leaving a model to train for 4 days (only realising afterwards that I should have used the non-sparse version of Accuracy).

This is not meant to expose the underpinnings and statistical wizardry behind these metrics (information theory, physics, etc.); it is rather meant to be my notes on how to correctly use them in deep-learning applications.

Accuracy-Derived

Accuracy

Probably not the one you should use: this expects a list where each item is a predicted label. The only place I've seen it used is in the CropNet example, where individual examples were evaluated separately. Most classification architectures output a vector of class probabilities rather than a hard prediction, so this is mainly useful after the fact, once you have thresholded or argmax-ed the output (see the small sketch at the end of this subsection).

import tensorflow as tf
import numpy as np
from tensorflow.keras import metrics
# binary example
y_actual = [1, 0, 0, 1]
y_pred = [1, 1, 0, 1] # we expect 75% accuracy (0.75)

m = metrics.Accuracy()
m(y_actual, y_pred)
<tf.Tensor: shape=(), dtype=float32, numpy=0.75>
# multinomial example
y_actual = [1, 2, 3, 4]
y_pred = [1, 2, 2, 4] # we expect 75% accuracy (0.75)

m = metrics.Accuracy()
m(y_actual, y_pred)
<tf.Tensor: shape=(), dtype=float32, numpy=0.75>
# OHE? Apparently not
y_actual = [
    [0, 1, 0],
    [1, 0, 0]
]

y_pred = [
    [0.2, 0.5, 0.3],
    [0.2, 0.5, 0.3],
]

m = metrics.Accuracy()
m(y_actual, y_pred)
<tf.Tensor: shape=(), dtype=float32, numpy=0.0>
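
That said, if your model outputs probability vectors, a quick way to still use Accuracy is to collapse them into hard labels first. A small sketch of my own using the vectors above (argmax picks the most probable class per row):

y_actual_labels = np.argmax(y_actual, axis=-1) # [1, 0]
y_pred_labels = np.argmax(y_pred, axis=-1)     # [1, 1]

m = metrics.Accuracy()
m(y_actual_labels, y_pred_labels) # expect 0.5: only the first sample matches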

Binary Accuracy

This is a specific version of the above, controlled via a threshold parameter. It might be useful if (e.g.) your network has a single output unit, Dense(1), which represents a positive/negative class.

This is simply: $$ \frac{\text{Number of correct predictions}}{\text{Total number of predictions}} $$ (exactly the same as Accuracy above, except each prediction is expected to be a class probability).

m = tf.keras.metrics.BinaryAccuracy(threshold=0.5)
m([1, 0, 0, 1], [0.501, 0, 1, 1]) # expect 0.75 when default threshold used
<tf.Tensor: shape=(), dtype=float32, numpy=0.75>
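
For example (toy numbers of my own), raising the threshold changes which predictions count as positive:

m = metrics.BinaryAccuracy(threshold=0.7)
m([1, 0, 0, 1], [0.501, 0, 1, 1]) # 0.501 no longer counts as positive, so expect 0.5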

Categorical Accuracy

This is one of my most-used accuracy measures. It calculates how often predictions match one-hot labels: y_true is one-hot encoded and y_pred is a vector of class probabilities of the same shape.

y_true = [
    [1, 0, 0],
    [0, 0, 1],
]
y_pred = [
    [0.6, 0.15, 0.25],
    [0.5, 0.3, 0.2], # expect 50% accuracy
]

m = metrics.CategoricalAccuracy()
m(y_true, y_pred)
<tf.Tensor: shape=(), dtype=float32, numpy=0.5>

SparseCategorical Accuracy

This expects a vector of class probabilities as y_pred and a list of actual class labels as y_true. It is taken as the ratio of correct predictions (argmax of the vector) to the total number of predictions.

y_true = [0, 2] # exactly the same as prior, now just as actual labels

m = metrics.SparseCategoricalAccuracy()
m(y_true, y_pred)
<tf.Tensor: shape=(), dtype=float32, numpy=0.5>

Crossentropy-Type

As opposed to raw true-vs-false predictions, entropy is an information-theoretic measure used to quantify uncertainty.

Binary Crossentropy

If from_logits is True, the predictions are not assumed to be bounded between 0 and 1, i.e. they are raw logits rather than probabilities (there is a quick sketch of this at the end of this subsection). The negative sign counteracts the fact that the log of a number less than 1 is negative.

The formula for calculating this is as follows:

$$ \frac{1}{N}\sum_{i=1}^N - \left[y_i \log(p_i) + (1-y_i)\log(1-p_i) \right] $$

y_true = [0, 1] # shape is (batch_size, d0, .., dN)
y_pred = [0.2, 0.9]
m = metrics.BinaryCrossentropy() 
m(y_true, y_pred)
<tf.Tensor: shape=(), dtype=float32, numpy=0.1642519>

Doing this manually

# manually; here N = 2 samples
log_loss = 0
for y_i, p_i in zip(y_true, y_pred):
    log_loss += -(y_i * np.log(p_i) + (1 - y_i) * np.log(1 - p_i))

# divide by N, same as above, so it makes sense!
log_loss /= 2

log_loss
0.164252033486018
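
On the from_logits flag mentioned earlier: a quick sketch (toy numbers of my own) showing that feeding raw, unbounded scores with from_logits=True should match feeding their sigmoid to the default metric, since the sigmoid is then applied internally:

logits = [-1.4, 2.2]        # unbounded network outputs
probs = tf.sigmoid(logits)  # what a sigmoid output layer would have produced

m_logits = metrics.BinaryCrossentropy(from_logits=True)
m_probs = metrics.BinaryCrossentropy()

m_logits([0, 1], logits) # these two should agree (up to floating-point error)
m_probs([0, 1], probs)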

Categorical Crossentropy

Essentially binary cross-entropy with an added dimension:

$$ -\frac{1}{N}\sum_{i=1}^N \sum_{j=1}^M y_{ij} \log(p_{ij}) $$

y_true = [[0, 0, 1]] # here, our sample belongs to class 2 (the 1 sits at index 2)
y_pred = [[0.1, 0.1, 0.8]] # our prediction gives class 2 the highest probability
m = metrics.CategoricalCrossentropy()

m(y_true, y_pred)
<tf.Tensor: shape=(), dtype=float32, numpy=0.22314353>
N = 1
M = 3
log_loss = 0
for i in range(N):
    for j in range(M):
        log_loss -= y_true[i][j]*np.log(y_pred[i][j])


log_loss /= N
log_loss
0.2231435513142097

Sparse Categorical Crossentropy

Exactly the same as above, except it accepts y_true as integer class labels instead of one-hot vectors.

y_true = [2]
m = metrics.SparseCategoricalCrossentropy()

m(y_true, y_pred) # the answer should be the same as above (up to floating-point error)
<tf.Tensor: shape=(), dtype=float32, numpy=0.22314355>

Checkpoint - What does "sparse" mean?

So the difference between accuracy/crossentropy and their respective sparse versions is the format of the labels. The sparse versions expect the labels as-is (class 2 implies the label is [2]), whilst the non-sparse versions expect one-hot encoded labels (so the same class 2 looks like [0, 0, 1, 0, ..., 0] if we have N classes).
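
Converting between the two formats is just standard TensorFlow (nothing specific to the metrics), e.g.:

sparse_labels = [2, 0]                           # "sparse": the class indices themselves
ohe_labels = tf.one_hot(sparse_labels, depth=3)  # [[0, 0, 1], [1, 0, 0]]
tf.argmax(ohe_labels, axis=-1)                   # and back again: [2, 0]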

Other Common Metrics

Mean Absolute Error

This is the absolute difference between predicted and actual values, summed and scaled by the number of samples (i.e. the mean of the absolute errors):

$$ \frac{\sum_{i=1}^{N}|\hat{y}_i-y_i|}{N} $$

y_true = [1, 1, 2]
y_pred = [1, 2, 2] # we expect an error of 0.33
m = metrics.MeanAbsoluteError()
m(y_true, y_pred)
<tf.Tensor: shape=(), dtype=float32, numpy=0.33333334>
# does it also work with OHE?
y_pred = [
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 1]
]
m(y_true, y_pred) # apparently not...
<tf.Tensor: shape=(), dtype=float32, numpy=0.8333333>
# if we define y_true as OHE
y_true = [
    [0, 1, 0],
    [0, 1, 0],
    [0, 0, 1],
]
m(y_true, y_pred) # also no
<tf.Tensor: shape=(), dtype=float32, numpy=0.5714286>

Mean Absolute Percentage Error

This not only considers how many predictions are wrong, but also scales each error by the true value (following the usual MAPE formula):

$$ \frac{100}{N}\sum_{i=1}^N \left|\frac{y_i-\hat{y}_i}{y_i} \right| $$

where $y$ is the actual value, and $\hat{y}$ is the forecast value

y_true = [1, 2, 4]
y_pred = [1, 2, 3]

m = metrics.MeanAbsolutePercentageError()
m(y_true, y_pred)
<tf.Tensor: shape=(), dtype=float32, numpy=8.333334>
y_true = [1, 2, 4]
y_pred = [1, 2, 2]

m = metrics.MeanAbsolutePercentageError()
m(y_true, y_pred) # although the same entry is wrong, the percentage error is different
<tf.Tensor: shape=(), dtype=float32, numpy=16.666668>
# doing it in numpy
error = 0
N = 3
for true, pred in zip(y_true, y_pred):
    error += np.abs((true-pred)/true)

error *= (100/N)

error # seems to line up with the above
16.666666666666668
# one-hot-encoding?
y_true = [[0, 1, 0]]
y_pred = [[0, 1, 0]]

m(y_true, y_pred) # guess not, because this error is supposed to be zero
<tf.Tensor: shape=(), dtype=float32, numpy=8.333334>

Mean Squared Error

This is given as the mean of the squared errors:

$$ \frac{1}{N}\sum_{i=1}^N\left(y_i-\hat{y}_i\right)^2 $$

y_true = [1, 2, 4]
y_pred = [1, 2, 3]

m = metrics.MeanSquaredError()

m(y_true, y_pred)
<tf.Tensor: shape=(), dtype=float32, numpy=0.33333334>
N = 3
error = 0
for true, pred in zip(y_true, y_pred):
    error += np.power(true-pred, 2)

error /= N

error # seems about right
0.3333333333333333
# OHE?
y_true = [
    [0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 1],
]

y_pred = [
    [0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0],
]

m(y_true, y_pred) # nope, I'm still not even sure how this is calculated....
<tf.Tensor: shape=(), dtype=float32, numpy=0.18333334>
# simpler example
y_true = [
    [0, 1],
]

y_pred = [
    [1, 0],
]

m(y_true, y_pred) # elementwise, this should be 1.0, but it isn't
<tf.Tensor: shape=(), dtype=float32, numpy=0.21666667>

In short: the MSE, MAPE and MAE metrics treat their inputs elementwise as regression values, and are not suitable for one-hot-encoded labels/predictions.
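
Another thing that tripped me up above: Keras metric objects are stateful. Each call to m(...) updates a running aggregate, and the state persists until reset_state() is called (reset_states() on older TensorFlow versions), so reusing the same m across examples mixes their results together. A minimal sketch:

m = metrics.MeanSquaredError()
m([1, 2, 4], [1, 2, 3])   # fresh state: 0.33, as before

m.reset_state()           # clear the running aggregate before reusing the object
m([[0, 1]], [[1, 0]])     # on its own, the elementwise MSE here should be 1.0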

Precision

This is taken as: $$ \frac{\text{Number of true positives}}{\text{Number of true+false positives}} $$

This implementation only supports binary targets (which makes sense in the context of true-vs-false samples)

# binary example
y_true = [0, 1, 0]
y_pred = [0, 0.6, 1]

m = metrics.Precision()
m(y_true, y_pred) # should be 1/2
<tf.Tensor: shape=(), dtype=float32, numpy=0.5>
# multinomial 
y_true = [1, 2, 3, 3]
y_pred = [1, 2, 3, 4]

m(y_true, y_pred) # doesn't work here
---------------------------------------------------------------------------

InvalidArgumentError                      Traceback (most recent call last)

<ipython-input-53-7c4421497ddf> in <module>
      3 y_pred = [1, 2, 3, 4]
      4 
----> 5 m(y_true, y_pred)

...

InvalidArgumentError: predictions must be <= 1
Condition x <= y did not hold.
First 3 elements of x:
[1. 2. 3.]
First 1 elements of y:
[1.]
# OHE?
y_true = [
    [0, 1, 0],
    [1, 0, 0]
]

y_pred = [
    [1, 0, 0],
    [1, 0, 0],
]

m(y_true, y_pred) # seems like it's able to handle OHE labels
<tf.Tensor: shape=(), dtype=float32, numpy=0.5>
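
If what you actually want is per-class precision from one-hot style outputs, Precision also accepts a class_id argument (as I understand it, this restricts the calculation to that single column):

m0 = metrics.Precision(class_id=0)
m0(y_true, y_pred) # for class 0: 1 true positive, 2 predicted positives, so I'd expect 0.5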

Recall

This is taken as: $$ \frac{\text{Number of true positives}}{\text{Number of true positives + false negatives}} $$

This implementation only supports binary targets (which makes sense in the context of true-vs-false samples).

# binary example
y_true = [0, 1, 0]
y_pred = [0, 1, 1]

m = metrics.Recall()
m(y_true, y_pred) # should be 1
<tf.Tensor: shape=(), dtype=float32, numpy=1.0>
# OHE?
y_true = [
    [1, 0, 0],
    [1, 0, 0]
]

y_pred = [
    [1, 0, 0],
    [1, 0, 0],
]

m(y_true, y_pred) # seems like it's not able to handle OHE labels....
<tf.Tensor: shape=(), dtype=float32, numpy=0.8888889>
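
Given the statefulness caveat from earlier, I suspect reusing m here is part of the odd number; with a fresh metric (or a reset_state() call) I'd expect an elementwise recall of 1.0 for this example, since every positive position in y_true is also predicted positive:

m = metrics.Recall()
m(y_true, y_pred) # I'd expect 1.0: 2 true positives, 0 false negatives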

KL Divergence

This is taken as the true-class probability multiplied by the log-ratio of true to predicted class probability, summed per sample (p is the true distribution, q the predicted one): $$ \sum_{k=1}^K p_k\log{\frac{p_k}{q_k}} $$

y_true = [[0, 1], [0, 0]] 
y_pred = [[0.6, 0.4], [0.4, 0.6]]

m = metrics.KLDivergence()
m(y_true, y_pred)
<tf.Tensor: shape=(), dtype=float32, numpy=0.45814306>
def KL(P, Q):
    epsilon = 1e-4

    P = np.array(P) + epsilon
    Q = np.array(Q) + epsilon

    return np.sum(P*np.log(P/Q))

KL(y_true, y_pred) # hmm
0.9136630059540092
t = np.clip(y_true, 1e-4, 1)
p = np.clip(y_pred, 1e-4, 1)

np.sum(t * np.log(t/p), axis=-1) # so then what exactly is Tensorflow's KL divergence doing?!
array([ 0.91542078, -0.00169936])
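
My current best guess (an assumption on my part; I haven't checked the Keras source) is that it clips both distributions to the backend epsilon (1e-7, much smaller than the 1e-4 I used above), sums per sample, and then averages over the batch, which does reproduce the number above:

eps = 1e-7  # assumed to match tf.keras.backend.epsilon()
t = np.clip(y_true, eps, 1)
p = np.clip(y_pred, eps, 1)

per_sample = np.sum(t * np.log(t / p), axis=-1) # KL per sample
per_sample.mean() # ~0.4581, matching the metric output above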

Summary

This table summarizes the expected input formats for the metrics listed above:

| Metric | True Format | Predicted Format |
|---|---|---|
| Accuracy | List of classes | List of classes |
| Binary Accuracy | List of classes | List of class probabilities |
| Categorical Accuracy | OHE vector of classes | Vector of class probabilities per sample |
| Sparse Categorical Accuracy | List of classes (not OHE) | Vector of class probabilities per sample |
| Binary Cross Entropy | List of classes | List of class probabilities |
| Categorical Cross Entropy | OHE vector of classes | Vector of class probabilities |
| Sparse Categorical Cross Entropy | List of classes | Vector of class probabilities |
| MAE | List of classes | List of classes |
| MSE | List of classes | List of classes |
| MAPE | List of classes | List of classes |
| Precision | List of binary labels | List of probabilities |
| Recall | List of binary labels | List of probabilities |
| KL Divergence | ??? | ??? |

The KL divergence in Keras is the one I'm least confident about; the calculation above is only my best guess, and hopefully I'll be able to confirm it and update this post in the future.