#### Performance Metrics

The precise metrics used for forecast evaluation are described below.

For clinical status, we will use similar metrics to those that proved effective in the CADDementia challenge:

1. The multiclass area under the receiver operating characteristic curve (mAUC).
2. The overall balanced classification accuracy (BCA).

mAUC is independent of the group sizes and gives an overall measure of classification ability that accounts for relative likelihoods assigned to each class. The simpler BCA on the other hand does not exploit the probabilistic nature of the forecasts, but simply considers the accuracy of the most likely classification. Both metrics normalise for group size, which is important because the smallest group (very likely AD at test time) is the most important.

For the other features we will use:

1. The mean absolute error (MAE).
2. The weighted error score (WES).
3. The coverage probability accuracy (CPA).

The MAE focuses purely on accuracy of prediction ignoring confidence, whereas the WES incorporates participants' confidence estimates into the error score. The CPA provides an assessment of the accuracy of the confidence estimates, irrespective of the prediction accuracy.

The rest of this page gives the mathematical details of each metric. Understanding these details is not essential, but it can help you make your forecasts as good as possible.

For transparency, we provide in the TADPOLE GitHub repository the code that will be used to compute these performance measures, both for the main submission and for the interim leaderboard. These metrics are implemented in the Python script `evalOneSubmission.py`.

#### Clinical Status Predictions

*Figure: a Receiver Operating Characteristic (ROC) curve. As the decision threshold ($\beta$) varies, so does the trade-off between the four classification outcomes: true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). The area under the curve (AUC) is an overall measure of the ability to discriminate positive and negative cases. Source: Wikipedia.*

###### 1. Multiclass area under the receiver operating characteristic curve (mAUC)

Classical ROC analysis considers only binary classification problems. The theory does extend to multi-class problems such as that posed in TADPOLE's clinical status prediction. The AUC $\hat{A}(c_i|c_j)$ for classification of a class $c_i$ against another class $c_j$, is:

$\hat{A}(c_i|c_j)=\frac{S_i-n_i(n_i+1)/2}{n_i n_j}$ [1]

where $n_i$ and $n_j$ are the number of points belonging to classes $i$ and $j$ respectively, while $S_i$ is the sum of the ranks of the class $i$ test points after ranking all the class $i$ and $j$ data points in increasing likelihood of belonging to class $i$. See (Hand & Till, 2001) for the complete derivation. For situations with three or more classes, $\hat{A}(c_{i}|c_{j})\neq\hat{A}(c_{j}|c_{i})$. Therefore, we use the average

$\hat{A}(c_{i},c_{j})=\frac{\hat{A}(c_{i}|c_{j})+\hat{A}(c_{j}|c_{i})}{2}.$ [2]

The overall mAUC is obtained by averaging equation [2] over all pairs of distinct classes. For $L$ classes, the number of such pairs is $L(L-1)/2$, so that

$\textrm{mAUC}=\frac{2}{L(L-1)}\sum_{i=2}^L\sum_{j=1}^{i-1}\hat{A}(c_i,c_j).$ [3]

The class probabilities that go into the calculation of $S_{i}$ in equation [1] are $p_{CN}$, $p_{MCI}$, and $p_{AD}$, which we derive from the likelihoods $L_{CN}$, $L_{MCI}$, and $L_{AD}$ provided by the participants by normalising by their sum so that, for example:

$p_{CN} = L_{CN}/(L_{CN} + L_{MCI} + L_{AD}).$ [4]
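As a concrete illustration, equations [1]–[4] can be sketched in a few lines of Python. The sketch below uses the equivalent pairwise-comparison form of equation [1] (counting ties as 1/2) instead of explicit ranking; the function names and the default class ordering are our own and are not taken from the official `evalOneSubmission.py` script.

```python
def auc_pair(p_i, p_j):
    """A-hat(c_i | c_j) from equation [1], in its equivalent
    pairwise-comparison form: the fraction of (class-i, class-j)
    point pairs in which the class-i point receives the higher
    class-i probability (ties count as 1/2)."""
    score = 0.0
    for a in p_i:
        for b in p_j:
            score += 1.0 if a > b else (0.5 if a == b else 0.0)
    return score / (len(p_i) * len(p_j))

def mauc(true_labels, likelihoods, classes=("CN", "MCI", "AD")):
    """Equations [2]-[4]: normalise likelihoods to probabilities,
    average A(c_i|c_j) and A(c_j|c_i) for each pair of classes,
    then average over all L(L-1)/2 pairs."""
    probs = [[l / sum(row) for l in row] for row in likelihoods]  # eq. [4]
    total, n_pairs = 0.0, 0
    for i in range(len(classes)):
        for j in range(i):
            # class-i probabilities, split by true label
            pi_i = [p[i] for p, y in zip(probs, true_labels) if y == classes[i]]
            pi_j = [p[i] for p, y in zip(probs, true_labels) if y == classes[j]]
            # class-j probabilities, split by true label
            pj_j = [p[j] for p, y in zip(probs, true_labels) if y == classes[j]]
            pj_i = [p[j] for p, y in zip(probs, true_labels) if y == classes[i]]
            total += (auc_pair(pi_i, pi_j) + auc_pair(pj_j, pj_i)) / 2  # eq. [2]
            n_pairs += 1
    return total / n_pairs  # eq. [3]
```

For perfectly separated forecasts (every subject's true class always receives the highest likelihood, with clear margins), `mauc` returns 1.0; chance-level forecasts score around 0.5.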

###### 2. Balanced classification accuracy (BCA)

First, the data points are assigned a hard classification to the class (CN, MCI, or AD) with the highest likelihood, i.e. the class with likelihood $\textrm{max}([L_{CN}, L_{MCI}, L_{AD}])$. The balanced accuracy for class $i$ is then:

$BCA_i = \frac{1}{2}\left[ \frac{TP}{TP+FN} + \frac{TN}{TN+FP} \right]$ [5]

where TP, FP, TN, and FN represent the number of true positives, false positives, true negatives, and false negatives for classification as class $i$. True positives are data points with true label $i$ correctly classified as such, while false negatives are data points with true label $i$ incorrectly classified to a different class $j \ne i$. True negatives and false positives are defined similarly.

The overall BCA is given by the mean of all the balanced accuracies for every class:

$BCA = \frac{1}{L} \sum_{i=1}^L BCA_i.$  [6]

If two or more classes tie for the highest likelihood, the hard classification is ambiguous; in that case, when the correct class is among the tied classes, we add a fractional count to TP: 1/2 for a two-way tie and 1/3 for a three-way tie.
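Equations [5] and [6] can be sketched directly in Python. This minimal version assumes ties have already been resolved, so each point carries a single hard label, and that every class appears at least once among the true labels; the function name is our own.

```python
def bca(true_labels, pred_labels, classes=("CN", "MCI", "AD")):
    """Balanced classification accuracy, equations [5]-[6].
    Assumes each point already has a hard predicted label and
    that every class occurs among the true labels."""
    per_class = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(true_labels, pred_labels))
        fn = sum(t == c and p != c for t, p in zip(true_labels, pred_labels))
        fp = sum(t != c and p == c for t, p in zip(true_labels, pred_labels))
        tn = sum(t != c and p != c for t, p in zip(true_labels, pred_labels))
        sensitivity = tp / (tp + fn)  # TP / (TP + FN)
        specificity = tn / (tn + fp)  # TN / (TN + FP)
        per_class.append(0.5 * (sensitivity + specificity))  # eq. [5]
    return sum(per_class) / len(per_class)  # eq. [6]
```

For example, if the true labels are CN, MCI, AD and the predictions are CN, MCI, MCI, the per-class balanced accuracies are 1.0 (CN), 0.75 (MCI), and 0.5 (AD), giving an overall BCA of 0.75.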

#### Continuous feature predictions

For ADAS-Cog13 and ventricle volume, we will use:

The mean absolute error

$MAE=\frac{1}{N}\sum_{i=1}^{N}\left |{\tilde{M}_i-M_i}\right |$ [7]

where $N$ is the number of data points acquired by the time the forecasts are evaluated, $M_i$ is the actual value for individual $i$ in the future data, and $\tilde{M}_i$ is the participant's best guess at $M_i$.

The weighted error score

$WES=\frac{\sum_{i=1}^{N}\tilde{C}_i\left | \tilde{M}_i-M_i\right |}{\sum_{i=1}^{N}\tilde{C}_i}$ [8]

where the weightings $\tilde{C}_i$ are the participant's relative confidences in their $\tilde{M}_i$. We estimate $\tilde{C}_i$ as the inverse of the width of the 50% confidence interval of their biomarker estimate, i.e. $\tilde{C}_i=\left ( C_+-C_- \right )^{-1}$, where $\left [C_-,C_+\right ]$ is the confidence interval provided by the participant.

The coverage probability accuracy

$CPA = \left |\textrm{actual coverage probability} - \textrm{nominal coverage probability}\right |$ [9]

where the nominal coverage probability is 0.5 – the target for the confidence intervals – and the actual coverage probability is the proportion of measurements that fall within the corresponding 50% confidence interval.
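The three continuous-feature metrics are short enough to state directly in Python. The sketch below implements equations [7]–[9] under the definitions above (nominal coverage probability fixed at 0.5); the function and argument names are ours, not those of the official evaluation script.

```python
def mae(actual, predicted):
    """Mean absolute error, equation [7]."""
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / len(actual)

def wes(actual, predicted, lower, upper):
    """Weighted error score, equation [8]. Each point's weight is the
    inverse width of its 50% confidence interval [C-, C+]."""
    weights = [1.0 / (u - lo) for lo, u in zip(lower, upper)]
    weighted = sum(w * abs(p - a) for w, a, p in zip(weights, actual, predicted))
    return weighted / sum(weights)

def cpa(actual, lower, upper):
    """Coverage probability accuracy, equation [9]: the actual coverage
    (fraction of measurements inside their 50% interval) compared with
    the nominal coverage probability of 0.5."""
    covered = sum(lo <= a <= u for a, lo, u in zip(actual, lower, upper))
    return abs(covered / len(actual) - 0.5)
```

Note that when every interval has the same width, the weights cancel and WES reduces to the MAE; narrower (more confident) intervals pull the score toward those points' errors.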

#### Deciding the winner

The key metrics for deciding the winners will be the multiclass AUC of clinical status classification (equation [3] above) and the mean absolute error (equation [7] above).

However, the other metrics are also important to determine the utility of a prediction algorithm and the published results will rank the performance of each entry on each metric.
