Contents
- General Observations
- Prize Winners
- Main Results
- Additional Entries
- Confidence Intervals
- Meta Analysis
- Demographics of training/test sets
- Description of forecasting algorithms
- Statistics of TADPOLE participants
Results (July 2019)
TADPOLE's first phase is complete and we have evaluated all the prize-eligible submissions.
Watch the live announcement on YouTube: https://www.youtube.com/watch?v=BFS9Sr0lhuM
Check below for a description of the evaluation dataset and the overall rankings.
General observations
- There was no clear "one-size-fits-all" winner.
- Data-driven approaches for both feature selection and prediction of target variables generally performed well.
- Many teams combined different types of algorithms to produce forecasts:
  - Most used statistical regression;
  - Some used generic machine learning techniques that are robust and can work well for other problems; and
  - Some used disease progression models that are specifically tailored to the current problem of disease prediction.
- Forecasts were very good for clinical diagnosis and ventricle volume. Predicting ADAS, on the other hand, turned out to be very difficult: no team was able to generate forecasts significantly better than random guessing.
- Meta-analysis results: the most important features for improving predictions were DTI and CSF for clinical diagnosis, and "augmented features" for ventricle volume prediction.
- Throughout this page we refer to ADAS-Cog 13 simply as ADAS.
TADPOLE Prize winners
Category | Team | Members | Institution | Country | Prize |
---|---|---|---|---|---|
Overall best | Frog | Keli Liu, Paul Manser, Christina Rabe | Genentech | USA | £5000 |
Clinical status | Frog | Keli Liu, Paul Manser, Christina Rabe | Genentech | USA | £5000 |
Ventricle volume | EMC1 | Vikram Venkatraghavan, Esther Bron, Stefan Klein | Erasmus MC | Netherlands | £5000 |
Best university team | Apocalypse | Manon Ansart | ICM, INRIA | France | £5000 |
High-School (best) | Chen-MCW | Gang Chen | Medical College of Wisconsin | USA | £5000 |
High-School (runner up) | CyberBrains | Ionut Buciuman, Alex Kelner, Raluca Pop, Denisa Rimocea, Kruk Zsolt | Vasile Lucaciu College | Romania | £2500 |
Overall best D3 prediction | GlassFrog | Steven Hill, Brian Tom, Anais Rouanet, Zhiyue Huang, James Howlett, Steven Kiddle, Simon R. White, Sach Mukherjee, Bernd Taschler | Cambridge University | UK | £2500 |
Overall Results
Legend:
- MAUC – Multiclass Area Under the Curve
- BCA – Balanced Classification Accuracy
- MAE – Mean Absolute Error
- WES – Weighted Error Score
- CPA – Coverage Probability Accuracy for 50% Confidence Interval
- ADAS – Alzheimer's Disease Assessment Scale Cognitive (13)
- VENTS – Ventricle Volume
- RANK (overall) – We first compute the sum of ranks from MAUC, ADAS MAE and VENTS MAE, then derive the final ranking from these sums of ranks. For example, the top entry has the lowest sum of ranks from these three categories.
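The sum-of-ranks scheme in the RANK definition above can be sketched as follows. This is an illustrative reimplementation, not the organisers' evaluation code, and the helper names (`avg_ranks`, `overall_rank`) are ours; ties are averaged, which is why half-ranks such as 11.5 appear in the tables.

```python
# Sketch of the overall ranking: rank each metric with ties averaged,
# sum the three per-metric ranks, then rank the sums.

def avg_ranks(values):
    """1-based ranks with ties averaged; lower value = better rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def overall_rank(neg_mauc, adas_mae, vents_mae):
    """Overall ranks from negated MAUC (higher MAUC is better, so the
    caller negates it) plus ADAS MAE and ventricle MAE (lower is better)."""
    sums = [a + b + c for a, b, c in zip(
        avg_ranks(neg_mauc), avg_ranks(adas_mae), avg_ranks(vents_mae))]
    return avg_ranks(sums)
```

Because MAUC is a higher-is-better score, it is negated before ranking so that all three inputs are lower-is-better.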
Note (14 June 2019): The rankings in each prize category can be found by ordering according to Diagnosis MAUC, and ADAS and Ventricle MAE. The overall rankings below require valid submissions for every target variable.
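The interval-based metrics in the legend (WES and CPA) admit a compact formulation. The sketch below follows the standard TADPOLE definitions (errors weighted by inverse 50%-CI width; actual coverage compared against the ideal 0.5), but it is our paraphrase rather than the official evaluation code:

```python
def wes(preds, lowers, uppers, truths):
    """Weighted Error Score: absolute errors weighted by the inverse
    width of each forecast's 50% confidence interval."""
    weights = [1.0 / (u - l) for l, u in zip(lowers, uppers)]
    num = sum(w * abs(p - t) for w, p, t in zip(weights, preds, truths))
    return num / sum(weights)

def cpa(lowers, uppers, truths):
    """Coverage Probability Accuracy: distance of the actual coverage of
    the 50% CIs from the ideal value 0.5 (lower is better)."""
    covered = sum(l <= t <= u for l, u, t in zip(lowers, uppers, truths))
    return abs(covered / len(truths) - 0.5)
```

Note that with equal CI widths, WES reduces to the plain MAE.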
Overall scores — longitudinal dataset D2
RANK | FILE NAME | MAUC RANK | MAUC | BCA | ADAS RANK | ADAS MAE | ADAS WES | ADAS CPA | VENTS RANK | VENTS MAE | VENTS WES | VENTS CPA |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1.0 | Frog | 1.0 | 0.931 | 0.849 | 4.0 | 4.85 | 4.74 | 0.44 | 10.0 | 0.45 | 0.33 | 0.47 |
2.0 | EMC1-Std | 8.0 | 0.898 | 0.811 | 23.5 | 6.05 | 5.40 | 0.45 | 1.5 | 0.41 | 0.29 | 0.43 |
3.0 | VikingAI-Sigmoid | 16.0 | 0.875 | 0.760 | 7.0 | 5.20 | 5.11 | 0.02 | 11.5 | 0.45 | 0.35 | 0.20 |
4.0 | EMC1-Custom | 11.0 | 0.892 | 0.798 | 23.5 | 6.05 | 5.40 | 0.45 | 1.5 | 0.41 | 0.29 | 0.43 |
5.0 | CBIL | 9.0 | 0.897 | 0.803 | 15.0 | 5.66 | 5.65 | 0.37 | 13.0 | 0.46 | 0.46 | 0.09 |
6.0 | Apocalypse | 7.0 | 0.902 | 0.827 | 14.0 | 5.57 | 5.57 | 0.50 | 20.0 | 0.52 | 0.52 | 0.50 |
7.0 | GlassFrog-Average | 5.0 | 0.902 | 0.825 | 8.0 | 5.26 | 5.27 | 0.26 | 29.0 | 0.68 | 0.60 | 0.33 |
8.0 | GlassFrog-SM | 5.0 | 0.902 | 0.825 | 17.0 | 5.77 | 5.92 | 0.20 | 21.0 | 0.52 | 0.33 | 0.20 |
9.0 | BORREGOTECMTY | 19.0 | 0.866 | 0.808 | 20.0 | 5.90 | 5.82 | 0.39 | 5.0 | 0.43 | 0.37 | 0.40 |
10.0 | EMC-EB | 3.0 | 0.907 | 0.805 | 39.0 | 6.75 | 6.66 | 0.50 | 9.0 | 0.45 | 0.40 | 0.48 |
11.5 | lmaUCL-Covariates | 22.0 | 0.852 | 0.760 | 27.0 | 6.28 | 6.29 | 0.28 | 3.0 | 0.42 | 0.41 | 0.11 |
11.5 | CN2L-Average | 27.0 | 0.843 | 0.792 | 9.0 | 5.31 | 5.31 | 0.35 | 16.0 | 0.49 | 0.49 | 0.33 |
13.0 | VikingAI-Logistic | 20.0 | 0.865 | 0.754 | 21.0 | 6.02 | 5.91 | 0.26 | 11.5 | 0.45 | 0.35 | 0.20 |
14.0 | lmaUCL-Std | 21.0 | 0.859 | 0.781 | 28.0 | 6.30 | 6.33 | 0.26 | 4.0 | 0.42 | 0.41 | 0.09 |
15.5 | CN2L-RandomForest | 10.0 | 0.896 | 0.792 | 16.0 | 5.73 | 5.73 | 0.42 | 31.0 | 0.71 | 0.71 | 0.41 |
15.5 | FortuneTellerFish-SuStaIn | 40.0 | 0.806 | 0.685 | 3.0 | 4.81 | 4.81 | 0.21 | 14.0 | 0.49 | 0.49 | 0.18 |
17.0 | CN2L-NeuralNetwork | 41.0 | 0.783 | 0.717 | 10.0 | 5.36 | 5.36 | 0.34 | 7.0 | 0.44 | 0.44 | 0.27 |
18.0 | BenchmarkMixedEffectsAPOE | 35.0 | 0.822 | 0.749 | 2.0 | 4.75 | 4.75 | 0.36 | 23.0 | 0.57 | 0.57 | 0.40 |
19.0 | Tohka-Ciszek-RandomForestLin | 17.0 | 0.875 | 0.796 | 22.0 | 6.03 | 6.03 | 0.15 | 22.0 | 0.56 | 0.56 | 0.37 |
20.0 | BGU-LSTM | 12.0 | 0.883 | 0.779 | 25.0 | 6.09 | 6.12 | 0.39 | 25.0 | 0.60 | 0.60 | 0.23 |
21.0 | DIKU-GeneralisedLog-Custom | 13.0 | 0.878 | 0.790 | 11.5 | 5.40 | 5.40 | 0.26 | 38.5 | 1.05 | 1.05 | 0.05 |
22.0 | DIKU-GeneralisedLog-Std | 14.0 | 0.877 | 0.790 | 11.5 | 5.40 | 5.40 | 0.26 | 38.5 | 1.05 | 1.05 | 0.05 |
23.0 | CyberBrains | 34.0 | 0.823 | 0.747 | 6.0 | 5.16 | 5.16 | 0.24 | 26.0 | 0.62 | 0.62 | 0.12 |
24.0 | AlgosForGood | 24.0 | 0.847 | 0.810 | 13.0 | 5.46 | 5.11 | 0.13 | 30.0 | 0.69 | 3.31 | 0.19 |
25.0 | lmaUCL-halfD1 | 26.0 | 0.845 | 0.753 | 38.0 | 6.53 | 6.51 | 0.31 | 6.0 | 0.44 | 0.42 | 0.13 |
26.0 | BGU-RF | 28.0 | 0.838 | 0.673 | 29.5 | 6.33 | 6.10 | 0.35 | 17.5 | 0.50 | 0.38 | 0.26 |
27.0 | Mayo-BAI-ASU | 52.0 | 0.691 | 0.624 | 5.0 | 4.98 | 4.98 | 0.32 | 19.0 | 0.52 | 0.52 | 0.40 |
28.0 | BGU-RFFIX | 32.0 | 0.831 | 0.673 | 29.5 | 6.33 | 6.10 | 0.35 | 17.5 | 0.50 | 0.38 | 0.26 |
29.0 | FortuneTellerFish-Control | 31.0 | 0.834 | 0.692 | 1.0 | 4.70 | 4.70 | 0.22 | 50.0 | 1.38 | 1.38 | 0.50 |
30.0 | GlassFrog-LCMEM-HDR | 5.0 | 0.902 | 0.825 | 31.0 | 6.34 | 6.21 | 0.47 | 51.0 | 1.66 | 1.59 | 0.41 |
31.0 | SBIA | 43.0 | 0.776 | 0.721 | 43.0 | 7.10 | 7.38 | 0.40 | 8.0 | 0.44 | 0.31 | 0.13 |
32.0 | Chen-MCW-Stratify | 23.0 | 0.848 | 0.783 | 36.5 | 6.48 | 6.24 | 0.23 | 36.5 | 1.01 | 1.00 | 0.11 |
33.0 | Rocket | 54.0 | 0.680 | 0.519 | 18.0 | 5.81 | 5.71 | 0.34 | 28.0 | 0.64 | 0.64 | 0.29 |
34.5 | Chen-MCW-Std | 29.0 | 0.836 | 0.778 | 36.5 | 6.48 | 6.24 | 0.23 | 36.5 | 1.01 | 1.00 | 0.11 |
34.5 | BenchmarkSVM | 30.0 | 0.836 | 0.764 | 40.0 | 6.82 | 6.82 | 0.42 | 32.0 | 0.86 | 0.84 | 0.50 |
36.0 | DIKU-ModifiedMri-Custom | 36.5 | 0.807 | 0.670 | 33.5 | 6.44 | 6.44 | 0.27 | 34.5 | 0.92 | 0.92 | 0.01 |
37.0 | DIKU-ModifiedMri-Std | 38.5 | 0.806 | 0.670 | 33.5 | 6.44 | 6.44 | 0.27 | 34.5 | 0.92 | 0.92 | 0.01 |
38.0 | DIVE | 51.0 | 0.708 | 0.568 | 42.0 | 7.10 | 7.10 | 0.34 | 15.0 | 0.49 | 0.49 | 0.13 |
39.0 | ITESMCEM | 53.0 | 0.680 | 0.657 | 26.0 | 6.26 | 6.26 | 0.35 | 33.0 | 0.92 | 0.92 | 0.43 |
40.0 | BenchmarkLastVisit | 44.5 | 0.774 | 0.792 | 41.0 | 7.05 | 7.05 | 0.45 | 27.0 | 0.63 | 0.61 | 0.47 |
41.0 | Sunshine-Conservative | 25.0 | 0.845 | 0.816 | 44.5 | 7.90 | 7.90 | 0.50 | 43.5 | 1.12 | 1.12 | 0.50 |
42.0 | BravoLab | 46.0 | 0.771 | 0.682 | 47.0 | 8.22 | 8.22 | 0.49 | 24.0 | 0.58 | 0.58 | 0.41 |
43.0 | DIKU-ModifiedLog-Custom | 36.5 | 0.807 | 0.670 | 33.5 | 6.44 | 6.44 | 0.27 | 47.5 | 1.17 | 1.17 | 0.06 |
44.0 | DIKU-ModifiedLog-Std | 38.5 | 0.806 | 0.670 | 33.5 | 6.44 | 6.44 | 0.27 | 47.5 | 1.17 | 1.17 | 0.06 |
45.0 | Sunshine-Std | 33.0 | 0.825 | 0.771 | 44.5 | 7.90 | 7.90 | 0.50 | 43.5 | 1.12 | 1.12 | 0.50 |
46.0 | Billabong-UniAV45 | 49.0 | 0.720 | 0.616 | 48.5 | 9.22 | 8.82 | 0.29 | 41.5 | 1.09 | 0.99 | 0.45 |
47.0 | Billabong-Uni | 50.0 | 0.718 | 0.622 | 48.5 | 9.22 | 8.82 | 0.29 | 41.5 | 1.09 | 0.99 | 0.45 |
48.0 | ATRI-Biostat-JMM | 42.0 | 0.779 | 0.710 | 51.0 | 12.88 | 69.62 | 0.35 | 54.0 | 1.95 | 5.12 | 0.33 |
49.0 | Billabong-Multi | 56.0 | 0.541 | 0.556 | 55.0 | 27.01 | 19.90 | 0.46 | 40.0 | 1.07 | 1.07 | 0.45 |
50.0 | ATRI-Biostat-MA | 47.0 | 0.741 | 0.671 | 52.0 | 12.88 | 11.32 | 0.19 | 53.0 | 1.84 | 5.27 | 0.23 |
51.0 | BIGS2 | 58.0 | 0.455 | 0.488 | 50.0 | 11.62 | 14.65 | 0.50 | 49.0 | 1.20 | 1.12 | 0.07 |
52.0 | Billabong-MultiAV45 | 57.0 | 0.527 | 0.530 | 56.0 | 28.45 | 21.22 | 0.47 | 45.0 | 1.13 | 1.07 | 0.47 |
53.0 | ATRI-Biostat-LTJMM | 55.0 | 0.636 | 0.563 | 54.0 | 16.07 | 74.65 | 0.33 | 52.0 | 1.80 | 5.01 | 0.26 |
- | Threedays | 2.0 | 0.921 | 0.823 | - | - | - | - | - | - | - | - |
- | ARAMIS-Pascal | 15.0 | 0.876 | 0.850 | - | - | - | - | - | - | - | - |
- | IBM-OZ-Res | 18.0 | 0.868 | 0.766 | - | - | - | - | 46.0 | 1.15 | 1.15 | 0.50 |
- | Orange | 44.5 | 0.774 | 0.792 | - | - | - | - | - | - | - | - |
- | SMALLHEADS-NeuralNet | 48.0 | 0.737 | 0.605 | 53.0 | 13.87 | 13.87 | 0.41 | - | - | - | - |
- | SMALLHEADS-LinMixedEffects | - | - | - | 46.0 | 8.09 | 7.94 | 0.04 | - | - | - | - |
- | Tohka-Ciszek-SMNSR | - | - | - | 19.0 | 5.87 | 5.87 | 0.14 | - | - | - | - |
The results on the D2 dataset suggest that there was no clear winner across all categories. While Frog had the best overall submission, with the lowest sum of ranks, each performance metric had a different winner: Frog (clinical diagnosis MAUC of 0.931), ARAMIS-Pascal (clinical diagnosis BCA of 0.850), FortuneTellerFish-Control (ADAS MAE and WES of 4.7), VikingAI-Sigmoid (ADAS CPA of 0.02), EMC1-Std/EMC1-Custom (ventricle MAE of 0.41 and ventricle WES of 0.29), and DIKU-ModifiedMri-Std/DIKU-ModifiedMri-Custom (ventricle CPA of 0.01).
Overall scores — cross-sectional dataset D3
RANK | FILE NAME | MAUC RANK | MAUC | BCA | ADAS RANK | ADAS MAE | ADAS WES | ADAS CPA | VENTS RANK | VENTS MAE | VENTS WES | VENTS CPA |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1.0 | GlassFrog-Average | 3.0 | 0.897 | 0.826 | 5.0 | 5.86 | 5.57 | 0.25 | 3.0 | 0.68 | 0.55 | 0.24 |
2.0 | GlassFrog-LCMEM-HDR | 3.0 | 0.897 | 0.826 | 9.0 | 6.57 | 6.56 | 0.34 | 1.0 | 0.48 | 0.38 | 0.24 |
3.0 | GlassFrog-SM | 3.0 | 0.897 | 0.826 | 4.0 | 5.77 | 5.77 | 0.19 | 9.0 | 0.82 | 0.55 | 0.07 |
4.0 | Tohka-Ciszek-RandomForestLin | 11.0 | 0.865 | 0.786 | 2.0 | 4.92 | 4.92 | 0.10 | 10.0 | 0.83 | 0.83 | 0.35 |
7.0 | VikingAI-Logistic | 8.0 | 0.876 | 0.768 | 6.0 | 5.94 | 5.91 | 0.22 | 22.0 | 1.04 | 1.01 | 0.18 |
7.0 | Rocket | 10.0 | 0.865 | 0.771 | 3.0 | 5.27 | 5.14 | 0.39 | 23.0 | 1.06 | 1.06 | 0.27 |
7.0 | lmaUCL-Std | 13.0 | 0.854 | 0.698 | 17.0 | 6.95 | 6.93 | 0.05 | 6.0 | 0.81 | 0.81 | 0.22 |
7.0 | lmaUCL-Covariates | 13.0 | 0.854 | 0.698 | 17.0 | 6.95 | 6.93 | 0.05 | 6.0 | 0.81 | 0.81 | 0.22 |
7.0 | lmaUCL-halfD1 | 13.0 | 0.854 | 0.698 | 17.0 | 6.95 | 6.93 | 0.05 | 6.0 | 0.81 | 0.81 | 0.22 |
10.0 | EMC1-Std | 30.0 | 0.705 | 0.567 | 7.0 | 6.29 | 6.19 | 0.47 | 4.0 | 0.80 | 0.62 | 0.48 |
11.0 | SBIA | 28.0 | 0.779 | 0.782 | 10.0 | 6.63 | 6.43 | 0.40 | 8.0 | 0.82 | 0.75 | 0.18 |
13.0 | BGU-LSTM | 6.0 | 0.877 | 0.776 | 14.0 | 6.75 | 6.17 | 0.39 | 27.0 | 1.11 | 0.79 | 0.17 |
13.0 | BGU-RFFIX | 6.0 | 0.877 | 0.776 | 14.0 | 6.75 | 6.17 | 0.39 | 27.0 | 1.11 | 0.79 | 0.17 |
13.0 | BGU-RF | 6.0 | 0.877 | 0.776 | 14.0 | 6.75 | 6.17 | 0.39 | 27.0 | 1.11 | 0.79 | 0.17 |
15.0 | BravoLab | 18.0 | 0.813 | 0.730 | 28.0 | 8.02 | 8.02 | 0.47 | 2.0 | 0.64 | 0.64 | 0.42 |
16.5 | BORREGOTECMTY | 15.0 | 0.852 | 0.748 | 8.0 | 6.44 | 5.86 | 0.46 | 30.0 | 1.14 | 1.02 | 0.49 |
16.5 | CyberBrains | 17.0 | 0.830 | 0.755 | 1.0 | 4.72 | 4.72 | 0.21 | 35.0 | 1.54 | 1.54 | 0.50 |
18.0 | ATRI-Biostat-MA | 19.0 | 0.799 | 0.772 | 26.0 | 7.39 | 6.63 | 0.04 | 11.0 | 0.93 | 0.97 | 0.10 |
19.5 | EMC-EB | 9.0 | 0.869 | 0.765 | 27.0 | 7.71 | 7.91 | 0.50 | 21.0 | 1.03 | 1.07 | 0.49 |
19.5 | DIKU-GeneralisedLog-Std | 20.0 | 0.798 | 0.684 | 20.5 | 6.99 | 6.99 | 0.17 | 16.5 | 0.95 | 0.95 | 0.05 |
21.0 | DIKU-GeneralisedLog-Custom | 21.0 | 0.798 | 0.681 | 20.5 | 6.99 | 6.99 | 0.17 | 16.5 | 0.95 | 0.95 | 0.05 |
22.5 | DIKU-ModifiedLog-Std | 22.5 | 0.798 | 0.688 | 23.5 | 7.10 | 7.10 | 0.17 | 13.5 | 0.95 | 0.95 | 0.05 |
22.5 | DIKU-ModifiedMri-Std | 22.5 | 0.798 | 0.688 | 23.5 | 7.10 | 7.10 | 0.17 | 13.5 | 0.95 | 0.95 | 0.05 |
24.5 | DIKU-ModifiedLog-Custom | 24.5 | 0.798 | 0.691 | 23.5 | 7.10 | 7.10 | 0.17 | 13.5 | 0.95 | 0.95 | 0.05 |
24.5 | DIKU-ModifiedMri-Custom | 24.5 | 0.798 | 0.691 | 23.5 | 7.10 | 7.10 | 0.17 | 13.5 | 0.95 | 0.95 | 0.05 |
26.0 | Billabong-Uni | 31.0 | 0.704 | 0.626 | 11.5 | 6.69 | 6.69 | 0.38 | 19.5 | 0.98 | 0.98 | 0.48 |
27.0 | Billabong-UniAV45 | 32.0 | 0.703 | 0.620 | 11.5 | 6.69 | 6.69 | 0.38 | 19.5 | 0.98 | 0.98 | 0.48 |
28.0 | ATRI-Biostat-JMM | 26.0 | 0.794 | 0.781 | 29.0 | 8.45 | 8.12 | 0.34 | 18.0 | 0.97 | 1.45 | 0.37 |
29.0 | CBIL | 16.0 | 0.847 | 0.780 | 33.0 | 10.99 | 11.65 | 0.49 | 29.0 | 1.12 | 1.12 | 0.39 |
30.0 | BenchmarkLastVisit | 27.0 | 0.785 | 0.771 | 19.0 | 6.97 | 7.07 | 0.42 | 33.0 | 1.17 | 0.64 | 0.11 |
31.0 | Billabong-MultiAV45 | 33.0 | 0.682 | 0.603 | 30.5 | 9.30 | 9.30 | 0.43 | 24.5 | 1.09 | 1.09 | 0.49 |
32.0 | Billabong-Multi | 34.0 | 0.681 | 0.605 | 30.5 | 9.30 | 9.30 | 0.43 | 24.5 | 1.09 | 1.09 | 0.49 |
33.0 | ATRI-Biostat-LTJMM | 29.0 | 0.732 | 0.675 | 34.0 | 12.74 | 63.98 | 0.37 | 32.0 | 1.17 | 1.07 | 0.40 |
34.0 | BenchmarkSVM | 36.0 | 0.494 | 0.490 | 32.0 | 10.01 | 10.01 | 0.42 | 31.0 | 1.15 | 1.18 | 0.50 |
35.0 | DIVE | 35.0 | 0.512 | 0.498 | 35.0 | 16.66 | 16.74 | 0.41 | 34.0 | 1.42 | 1.42 | 0.34 |
- | IBM-OZ-Res | 1.0 | 0.905 | 0.830 | - | - | - | - | 36.0 | 1.77 | 1.77 | 0.50 |
Here, most submissions performed worse than their equivalent predictions on the longitudinal D2 dataset, due to the lack of longitudinal, multimodal data. GlassFrog-Average had the best overall rank, with a diagnosis MAUC of 0.897, an ADAS MAE of 5.86 and a ventricle MAE of 0.68 (% ICV). For diagnosis prediction, IBM-OZ-Res obtained the highest scores: MAUC of 0.905 and BCA of 0.830. For ADAS prediction, CyberBrains had the best MAE and WES of 4.72, while ATRI-Biostat-MA obtained the best ADAS CPA of 0.04. For ventricle prediction, GlassFrog-LCMEM-HDR had an MAE of 0.48 (% ICV) and the best WES of 0.38, while the six DIKU submissions obtained the best CPA of 0.05.
Additional entries
In addition to the standard predictions and the benchmarks, we also included two consensus predictions, obtained by taking the mean (ConsensusMean) and median (ConsensusMedian) over all predictions from all participants. For D2, the ConsensusMedian submission obtained the best overall rank, with a diagnosis MAUC of 0.925 (second-best), an ADAS-Cog 13 MAE of 5.12 (ninth-best) and a Ventricles MAE of 0.38, the best result in this category for D2. ConsensusMean ranked third overall on D2, with a diagnosis MAUC of 0.920 (fourth-best), an ADAS-Cog 13 MAE of 3.75, the best prediction in this category, and a Ventricle MAE of 0.48 (rank 16). For ADAS-Cog 13 and Ventricle volume prediction, the best consensus methods reduced the error by 11% and 8% respectively compared to the best prediction from participants or benchmarks.
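The consensus forecasts are simple element-wise aggregates over all submissions. A minimal sketch, with toy numbers in place of the real submissions:

```python
# Illustrative sketch (not the organisers' code) of the two consensus
# forecasts: element-wise mean and median over participants' forecasts.
import numpy as np

# forecasts: shape (n_submissions, n_subjects, n_months), holding e.g.
# each submission's ventricle-volume forecast. Toy numbers only.
forecasts = np.array([
    [[0.40, 0.42]],   # submission 1
    [[0.50, 0.55]],   # submission 2
    [[0.44, 0.47]],   # submission 3
])

consensus_mean = forecasts.mean(axis=0)          # ConsensusMean
consensus_median = np.median(forecasts, axis=0)  # ConsensusMedian
```

The median is more robust to a single wildly wrong submission, which is consistent with ConsensusMedian ranking best overall on D2.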
To test whether the best results could have been obtained by chance due to randomness in the test set, we evaluated n=62 (as many as the number of entries) randomly perturbed predictions from the simplest benchmark, BenchmarkLastVisit, and recorded the best results obtained by any of these predictions. These are shown as RandomisedBest, and obtain high scores especially for ADAS-Cog 13, ranking third with a final MAE of 4.52. High performance scores are also obtained for Ventricles, ranking 14th with an MAE of 0.47, a 14% increase in error over the best forecast, while for diagnosis prediction a lower MAUC score of 0.797 is obtained, ranking 43rd. This suggests that entries with higher MAE than RandomisedBest should be interpreted with care, as their scores and ranks could be high due to randomness in the test set. This is particularly relevant for ADAS-Cog 13 predictions, where only BenchmarkMixedEffects and ConsensusMean obtained better results, suggesting that no other method predicted ADAS-Cog 13 better than random guessing around the last available measurement.
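The RandomisedBest control can be sketched as below. This is an assumed implementation: we take Gaussian perturbations of the benchmark forecast, while the organisers' actual perturbation scheme may differ, and `randomised_best` is a hypothetical helper name.

```python
# Sketch of the RandomisedBest control: perturb the simplest benchmark
# forecast n=62 times and keep the best MAE achieved purely by chance.
import random

def mae(pred, true):
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(pred)

def randomised_best(benchmark_pred, true_vals, n=62, scale=1.0, seed=0):
    rng = random.Random(seed)
    best = float("inf")
    for _ in range(n):
        perturbed = [p + rng.gauss(0, scale) for p in benchmark_pred]
        best = min(best, mae(perturbed, true_vals))
    return best
```

If many real entries score no better than this chance-level best, their apparent skill may just reflect test-set randomness.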
It is worth mentioning that, while drafting the manuscript, we discovered that dropping APOE as a covariate from the BenchmarkMixedEffectsAPOE model considerably decreases the error in ADAS prediction, so we included the resulting model (BenchmarkMixedEffects) as an additional entry for scientific interest.
Additional entries for D2
RANK | FILE NAME | MAUC RANK | MAUC | BCA | ADAS RANK | ADAS MAE | ADAS WES | ADAS CPA | VENTS RANK | VENTS MAE | VENTS WES | VENTS CPA |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1.5 | ConsensusMedian | 1.0 | 0.925 | 0.857 | 4.0 | 5.12 | 5.01 | 0.28 | 1.0 | 0.38 | 0.33 | 0.09 |
1.5 | ConsensusMean | 2.0 | 0.920 | 0.835 | 1.0 | 3.75 | 3.54 | 0.00 | 3.0 | 0.48 | 0.45 | 0.13 |
3.5 | BenchmarkMixedEffects | 3.0 | 0.846 | 0.706 | 2.0 | 4.19 | 4.19 | 0.31 | 4.0 | 0.56 | 0.56 | 0.50 |
3.5 | RandomisedBest | 4.0 | 0.797 | 0.803 | 3.0 | 4.52 | 4.52 | 0.27 | 2.0 | 0.47 | 0.45 | 0.33 |
Additional entries for D3
RANK | FILE NAME | MAUC RANK | MAUC | BCA | ADAS RANK | ADAS MAE | ADAS WES | ADAS CPA | VENTS RANK | VENTS MAE | VENTS WES | VENTS CPA |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1.0 | ConsensusMean | 1.0 | 0.917 | 0.821 | 2.0 | 4.58 | 4.34 | 0.12 | 2.0 | 0.73 | 0.72 | 0.09 |
2.0 | ConsensusMedian | 2.0 | 0.905 | 0.817 | 3.0 | 5.44 | 5.37 | 0.19 | 1.0 | 0.71 | 0.65 | 0.10 |
3.0 | BenchmarkMixedEffects | 3.0 | 0.839 | 0.728 | 1.0 | 4.23 | 4.23 | 0.34 | 3.0 | 1.13 | 1.13 | 0.50 |
Confidence Intervals
Below are confidence intervals (CIs) computed for every submission, based on 50 bootstraps of the test set D4. The first figure (Fig. 1) shows CIs based on forecasts from D2, while the second (Fig. 2) shows CIs for forecasts on D3.
Fig 1. Confidence intervals for forecasts based on the longitudinal D2 prediction set.
Fig 2. Confidence intervals for forecasts based on the cross-sectional D3 prediction set.
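In outline, the bootstrap CIs are produced by resampling the test set with replacement and recomputing each metric on every resample. The sketch below is an assumed implementation (a percentile bootstrap over resampled test subjects), not the organisers' exact code:

```python
# Percentile-bootstrap confidence interval for an arbitrary metric,
# resampling the test set with replacement n_boot times.
import random

def bootstrap_ci(metric, preds, truths, n_boot=50, alpha=0.05, seed=0):
    rng = random.Random(seed)
    n = len(preds)
    scores = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        scores.append(metric([preds[i] for i in idx],
                             [truths[i] for i in idx]))
    scores.sort()
    lo = scores[int(alpha / 2 * n_boot)]
    hi = scores[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi
```

With only 50 bootstraps, the percentile endpoints are coarse; this matches the n=50 resamples stated above rather than a recommendation.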
Meta-analysis
To understand which types of features and algorithms yielded higher performance, we show here associations between predictive performance and feature-selection methods, different types of features, methods for data imputation, and methods for forecasting the target variables (clinical diagnosis, ADAS and Ventricles). For each type of feature/method and each target variable, we show the distribution of estimated coefficients from a general linear model, derived from the approximate inverse Hessian matrix at the maximum-likelihood estimate. From this analysis we removed outliers, defined as submissions with ADAS MAE higher than 10 and Ventricle MAE higher than 0.15 (% ICV). For all plots, distributions to the right of the gray dashed vertical line are associated with better performance.
The results in Fig. 3 below show trends indicating which aspects of the methods could be associated with better performance. For feature selection, manual selection of features is associated with better predictive performance for ADAS and Ventricles. In terms of feature types, including features from many modalities was generally associated with better overall performance, with the exception of FDG (for all target variables). Moreover, augmented features correlate with improvements in overall performance, especially for ventricle prediction. For data-imputation methods, while some differences can be observed, no clear conclusions can currently be drawn. In terms of prediction models, neural networks are associated with increased performance in ventricle prediction, while disease progression models are associated with decreased performance in prediction of clinical diagnosis and ventricles. However, given the small number of methods tested (\<50) and the large number of degrees of freedom (n=21), these results should be interpreted with care.
Fig 3. Associations between the prediction of clinical diagnosis, ADAS and Ventricle volume and different strategies of (top) feature selection, (upper-middle) types of features, (lower-middle) data imputation strategies and (bottom) prediction methods for the target variables. For each type of feature/method (rows) and each target variable (columns), we show the distribution of estimated coefficients from a general linear model. Positive coefficients, where distributions lie to the right of the dashed vertical line, indicate better performance than baseline (vertical dashed line). For ADAS and Ventricle prediction, we flipped the sign of the coefficients, to consistently show better performance to the right of the vertical line.
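The coefficient distributions above come from a general linear model whose coefficient uncertainty is read off the approximate inverse Hessian; for an ordinary least-squares GLM this is sigma^2 (X^T X)^{-1}. Below is a minimal sketch under that Gaussian assumption; `glm_coef_and_se` is our hypothetical helper, and the real analysis regresses performance scores on indicator features describing each submission's methods.

```python
# Gaussian GLM: least-squares coefficients plus standard errors from the
# approximate inverse Hessian, cov(beta) = sigma^2 * (X^T X)^{-1}.
import numpy as np

def glm_coef_and_se(X, y):
    X = np.column_stack([np.ones(len(X)), X])   # prepend intercept column
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    dof = X.shape[0] - X.shape[1]               # residual degrees of freedom
    sigma2 = resid @ resid / dof                # noise variance estimate
    cov = sigma2 * np.linalg.inv(X.T @ X)       # approximate inverse Hessian
    return beta, np.sqrt(np.diag(cov))
```

Plotting a Gaussian centred at each coefficient with its standard error reproduces the kind of distributions shown in Fig. 3.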
Demographics of D1-D4 datasets
Summary of TADPOLE datasets D1-D4. Each subject has been allocated to either Control, MCI or AD group based on diagnosis at the first available visit within each dataset. The bottom table contains the number of visits with data available, by modality. For example, in D4 there were a total of 150 visits where an MRI scan was undertaken, which represented a total of 64% of all visits analysed across all subjects in D4.
Measure | D1 | D2 | D3 | D4 |
---|---|---|---|---|
Subjects | 1667 | 896 | 896 | 219 |
Cognitively Normal | | | | |
Number (%) | 508 (30.5%) | 369 (41.2%) | 299 (33.4%) | 94 (42.9%) |
Visits per subject | 8.3 (4.5) | 8.5 (4.9) | 1.0 (0.0) | 1.0 (0.2) |
Age | 74.3 (5.8) | 73.6 (5.7) | 72.3 (6.2) | 78.4 (7.0) |
Gender (% male) | 48.6% | 47.2% | 43.5% | 47.9% |
MMSE | 29.1 (1.1) | 29.0 (1.2) | 28.9 (1.4) | 29.1 (1.1) |
Converters | 18 (3.5%) | 9 (2.4%) | ||
Mild Cognitive Impairment | ||||
Number (%) | 841 (50.4%) | 458 (51.1%) | 269 (30.0%) | 90 (41.1%) |
Visits per subject | 8.2 (3.7) | 9.1 (3.6) | 1.0 (0.0) | 1.1 (0.3) |
Age | 73.0 (7.5) | 71.6 (7.2) | 71.9 (7.1) | 79.4 (7.0) |
Gender (% male) | 59.3% | 56.3% | 58.0% | 64.4% |
MMSE | 27.6 (1.8) | 28.0 (1.7) | 27.6 (2.2) | 28.1 (2.1) |
Converters | 117 (13.9%) | 37 (8.1%) | 9 (10.0%) | |
Alzheimer’s Disease | ||||
Number (%) | 318 (19.1%) | 69 (7.7%) | 136 (15.2%) | 29 (13.2%) |
Visits per subject | 4.9 (1.6) | 5.2 (2.6) | 1.0 (0.0) | 1.1 (0.3) |
Age | 74.8 (7.7) | 75.1 (8.4) | 72.8 (7.1) | 82.2 (7.6) |
Gender (% male) | 55.3% | 68.1% | 55.9% | 51.7% |
MMSE | 23.3 (2.0) | 23.1 (2.0) | 20.5 (5.9) | 19.4 (7.2) |
Converters | 9 (31.0%) | |||
Number of visits with available data (as % of total visits) | ||||
Cognitive | 8862 (69.9%) | 5218 (68.1%) | 753 (84.0%) | 223 (95.3%) |
MRI | 7884 (62.2%) | 4497 (58.7%) | 224 (25.0%) | 150 (64.1%) |
FDG | 2119 (16.7%) | 1544 (20.2%) | 0 (0.0%) | 0 (0.0%) |
AV45 | 2098 (16.6%) | 1758 (23.0%) | 0 (0.0%) | 0 (0.0%) |
AV1451 | 89 (0.7%) | 89 (1.2%) | 0 (0.0%) | 0 (0.0%) |
DTI | 779 (6.1%) | 636 (8.3%) | 0 (0.0%) | 0 (0.0%) |
CSF | 2347 (18.5%) | 1458 (19.0%) | 0 (0.0%) | 0 (0.0%) |
Description of Algorithms
Summary
We had a total of 33 participating teams, who submitted a total of 58 forecasts from D2, 34 forecasts from D3, and 6 forecasts from custom prediction sets. A total of 8 D2/D3 submissions from 6 teams did not have predictions for all three target variables, so we only computed the performance metrics for the available target variables. Another 3 submissions lacked confidence intervals for either ADAS or ventricle volume, which we imputed using default low-width confidence ranges of 2 for ADAS and 0.002 for Ventricles/ICV.
Table 1 below summarizes the methods used in the submissions in terms of feature selection, handling of missing data, predictive models for clinical diagnosis and ADAS/Ventricles biomarkers, as well as training and prediction times. Condensed descriptions of each submitted method can be found here, while even more detailed descriptions are here (original files submitted by participants).
Submission | Feature selection | Number of features | Missing data imputation | Diagnosis prediction | ADAS/Vent. Prediction | Training time | Prediction time (one subject) |
---|---|---|---|---|---|---|---|
AlgosForGood | Manual | 16+5* | forward-filling | Aalen model | linear regression | 1 minute | 1 second |
Apocalypse | Manual | 16 | population average | SVM | linear regression | 40 minutes | 3 minutes |
ARAMIS-Pascal | Manual | 20 | population average | Aalen model | - | 16 seconds | 0.02 seconds |
ATRI-Biostat-JMM | automatic | 15 | random forest | random forest | linear mixed effects model | 2 days | 1 second |
ATRI-Biostat-LTJMM | automatic | 15 | random forest | random forest | DPM | 2 days | 1 second |
ATRI-Biostat-MA | automatic | 15 | random forest | random forest | DPM + linear mixed effects model | 2 days | 1 second |
BGU-LSTM | automatic | 67 | none | feed-forward NN | LSTM | 1 day | milliseconds |
BGU-RF/ BGU-RFFIX | automatic | ~67+1340* | none | semi-temporal RF | semi-temporal RF | a few minutes | milliseconds |
BIGS2 | automatic | all | Iterative Soft-Thresholded SVD | RF | linear regression | 2.2 seconds | 0.001 seconds |
Billabong (all) | Manual | 15-16 | linear regression | linear scale | non-parametric SM | 7 hours | 0.13 seconds |
BORREGOTECMTY | automatic | ~100 + 400* | nearest-neighbour | regression ensemble | ensemble of regression + hazard models | 18 hours | 0.001 seconds |
BravoLab | automatic | 25 | hot deck | LSTM | LSTM | 1 hour | a few seconds |
CBIL | Manual | 21 | linear interpolation | LSTM | LSTM | 1 hour | one minute |
Chen-MCW | Manual | 9 | none | linear regression | DPM | 4 hours | \< 1 hour |
CN2L-NeuralNetwork | automatic | all | forward-filling | RNN | RNN | 24 hours | a few seconds |
CN2L-RandomForest | Manual | >200 | forward-filling | RF | RF | 15 minutes | \< 1 minute |
CN2L-Average | automatic | all | forward-filling | RNN/RF | RNN/RF | 24 hours | \< 1 minute |
CyberBrains | Manual | 5 | population average | linear regression | linear regression | 20 seconds | 20 seconds |
DIKU (all) | semi-automatic | 18 | none | Bayesian classifier/LDA + DPM | DPM | 290 seconds | 0.025 seconds |
DIVE | Manual | 13 | none | KDE+DPM | DPM | 20 minutes | 0.06 seconds |
EMC1 | automatic | 250 | nearest neighbour | DPM + 2D spline + SVM | DPM + 2D spline | 80 minutes | a few seconds |
EMC-EB | automatic | 200-338 | nearest-neighbour | SVM classifier | SVM regressor | 20 seconds | a few seconds |
FortuneTellerFish-Control | Manual | 19 | nearest neighbour | multiclass ECOC SVM | linear mixed effects model | 1 minute | \< 1 second |
FortuneTellerFish-SuStaIn | Manual | 19 | nearest neighbour | multiclass ECOC SVM + DPM | linear mixed effects model + DPM | 5 hours | \< 1 second |
Frog | automatic | ~70+420* | none | gradient boosting | gradient boosting | 1 hour | - |
GlassFrog-LCMEM-HDR | semi-automatic | all | forward-fill | multi-state model | DPM + regression | 15 minutes | 2 minutes |
GlassFrog-SM | Manual | 7 | linear model | multi-state model | parametric SM | 93 seconds | 0.1 seconds |
GlassFrog-Average | semi-automatic | all | forward-fill/linear | multi-state model | DPM + SM + regression | 15 minutes | 2 minutes |
IBM-OZ-Res | Manual | 10-15 | filled with zero | stochastic gradient boosting | stochastic gradient boosting | 20 minutes | 0.1 seconds |
ITESMCEM | Manual | 48 | mean of previous values | RF | LASSO + Bayesian ridge regression | 20 minutes | 0.3 seconds |
lmaUCL (all) | Manual | 5 | regression | multi-task learning | multi-task learning | 2 hours | milliseconds |
Mayo-BAI-ASU | Manual | 15 | population average | linear mixed effects model | linear mixed effects model | 20 minutes | 1.3 seconds |
Orange | Manual | 17 | none | clinician’s decision tree | clinician’s decision tree | none | 0.2 seconds |
Rocket | manual | 6 | median of diagnostic group | linear mixed effects model | DPM | 5 minutes | 0.3 seconds |
SBIA | Manual | 30-70 | dropped visits with missing data | SVM + density estimator | linear mixed effects model | 1 minute | a few seconds |
SPMC-Plymouth (all) | Automatic | 20 | none | ? | - | ? | 1 minute |
SmallHeads-NeuralNetwork | automatic | 376 | nearest neighbour | deep fully-connected NN | deep fully-connected NN | 40 minutes | 0.06 seconds |
SmallHeads-LinMixedEffects | automatic | ? | nearest neighbour | - | linear mixed effects model | 25 minutes | 0.13 seconds |
Sunshine (all) | semi-automatic | 6 | population average | SVM | linear model | 30 minutes | \< 1 minute |
Threedays | Manual | 16 | none | RF | - | 1 minute | 3 seconds |
Tohka-Ciszek-SMNSR | Manual | ~32 | nearest neighbour | - | SMNSR | several hours | a few seconds |
Tohka-Ciszek-RandomForestLin | Manual | ~32 | mean patient value | RF | linear model | a few minutes | a few seconds |
VikingAI (all) | Manual | 10 | none | DPM + ordered logit model | DPM | 10 hours | 8 seconds |
BenchmarkLastVisit | None | 3 | none | constant model | constant model | 7 seconds | milliseconds |
BenchmarkMixedEffects | None | 3 | none | Gaussian model | linear mixed effects model | 30 seconds | 0.003 seconds |
BenchmarkMixedEffectsAPOE | None | 4 | none | Gaussian model | linear mixed effects model | 30 seconds | 0.003 seconds |
BenchmarkSVM | Manual | 6 | mean of previous values | SVM | support vector regressor (SVR) | 20 seconds | 0.001 seconds |
Table 1. Summary of methods used in the TADPOLE submissions. Keywords: SVM – Support Vector Machine, RF – random forest, LSTM – long short-term memory network, NN – neural network, RNN – recurrent neural network, SMNSR – Sparse Multimodal Neighbourhood Search Regression, DPM – disease progression model, KDE – kernel density estimation, LDA – linear discriminant analysis, SM – slope model, ECOC – error-correcting output codes, SVD – singular value decomposition. (*) Augmented features.
Participant statistics
Locations of participating teams
Team categories
Prediction methods