Results - TADPOLE - Grand Challenge

Contents¶

General Observations
Prize Winners
Main Results
Additional Entries
Confidence Intervals
Meta Analysis
Demographics of training/test sets
Description of forecasting algorithms
Statistics of TADPOLE participants

Results (July 2019)¶

TADPOLE's first phase is complete and we have evaluated all the prize-eligible submissions.

Watch the live announcement on YouTube: https://www.youtube.com/watch?v=BFS9Sr0lhuM

Check below for a description of the evaluation dataset and the overall rankings.

General observations¶

There was no clear "one-size-fits-all" winner.
Data-driven approaches for both feature selection and prediction of target variables generally performed well.
Many teams combined different types of algorithms to produce forecasts:
1. Most used statistical regression;
2. Some used generic machine learning techniques that are robust and can work well for other problems; and
3. Some used disease progression models that are specifically tailored for the current problem of disease prediction.
Forecasts were very good for clinical diagnosis and ventricle volume. Predicting ADAS, on the other hand, turned out to be very difficult: no team was able to generate forecasts that were significantly better than random guessing.
Meta-analysis results: the most important features that helped improve predictions were DTI and CSF for clinical diagnosis, and "augmented features" for ventricle volume prediction.
Throughout this page we will refer to ADAS-Cog 13 as simply ADAS.

TADPOLE Prize winners¶


Category	Team	Members	Institution	Country	Prize
Overall best	Frog	Keli Liu, Paul Manser, Christina Rabe	Genentech	USA	£5000
Clinical status	Frog	Keli Liu, Paul Manser, Christina Rabe	Genentech	USA	£5000
Ventricle volume	EMC1	Vikram Venkatraghavan, Esther Bron, Stefan Klein	Erasmus MC	Netherlands	£5000
Best university team	Apocalypse	Manon Ansart	ICM, INRIA	France	£5000
High-School (best)	Chen-MCW	Gang Chen	Medical College of Wisconsin	USA	£5000
High-School (runner up)	CyberBrains	Ionut Buciuman, Alex Kelner, Raluca Pop, Denisa Rimocea, Kruk Zsolt	Vasile Lucaciu College	Romania	£2500
Overall best D3 prediction	GlassFrog	Steven Hill, Brian Tom, Anais Rouanet, Zhiyue Huang, James Howlett, Steven Kiddle, Simon R. White, Sach Mukherjee, Bernd Taschler	Cambridge University	UK	£2500

Overall Results¶

Legend:

MAUC – Multiclass Area Under the Curve
BCA – Balanced Classification Accuracy
MAE – Mean Absolute Error
WES – Weighted Error Score
CPA – Coverage Probability Accuracy for 50% Confidence Interval
ADAS – Alzheimer's Disease Assessment Scale Cognitive (13)
VENTS – Ventricle Volume
RANK (overall) – We first compute the sum of ranks from MAUC, ADAS MAE and VENTS MAE, then derive the final ranking from these sums of ranks. For example, the top entry has the lowest sum of ranks from these three categories.
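
The sum-of-ranks procedure described in the legend can be sketched as follows. This is only an illustration, not the organisers' evaluation code: the helper names `average_ranks` and `overall_ranks` are hypothetical, and the tie handling (tied entries share the mean of their positions) is an assumption suggested by the fractional ranks in the tables, not something stated on this page.

```python
def average_ranks(values, higher_is_better=False):
    """Rank values from 1 (best); tied values share the mean of their positions.

    Assumption: ties are resolved by average ranks (hypothetical helper).
    """
    order = sorted(range(len(values)), key=lambda i: values[i],
                   reverse=higher_is_better)
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the next value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the 1-based positions in the group
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def overall_ranks(per_target_ranks):
    """per_target_ranks: dict name -> (MAUC rank, ADAS MAE rank, VENTS MAE rank)."""
    names = list(per_target_ranks)
    rank_sums = [sum(per_target_ranks[name]) for name in names]
    # A lower sum of ranks means a better overall position.
    return dict(zip(names, average_ranks(rank_sums)))


# Per-target ranks of the first five rows of the D2 table on this page.
d2_ranks = {
    "Frog": (1.0, 4.0, 10.0),
    "EMC1-Std": (8.0, 23.5, 1.5),
    "VikingAI-Sigmoid": (16.0, 7.0, 11.5),
    "EMC1-Custom": (11.0, 23.5, 1.5),
    "CBIL": (9.0, 15.0, 13.0),
}
overall = overall_ranks(d2_ranks)
```

For these five rows the rank sums are 15, 33, 34.5, 36 and 37, which reproduces the order in which they appear in the D2 table.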

Note (14 June 2019): The rankings in each prize category can be found by ordering according to Diagnosis MAUC, and ADAS and Ventricle MAE. The overall rankings below require valid submissions for every target variable.

Overall scores — longitudinal dataset D2¶

RANK	FILE NAME	MAUC RANK	MAUC	BCA	ADAS RANK	ADAS MAE	ADAS WES	ADAS CPA	VENTS RANK	VENTS MAE	VENTS WES	VENTS CPA
1.0	Frog	1.0	0.931	0.849	4.0	4.85	4.74	0.44	10.0	0.45	0.33	0.47
2.0	EMC1-Std	8.0	0.898	0.811	23.5	6.05	5.40	0.45	1.5	0.41	0.29	0.43
3.0	VikingAI-Sigmoid	16.0	0.875	0.760	7.0	5.20	5.11	0.02	11.5	0.45	0.35	0.20
4.0	EMC1-Custom	11.0	0.892	0.798	23.5	6.05	5.40	0.45	1.5	0.41	0.29	0.43
5.0	CBIL	9.0	0.897	0.803	15.0	5.66	5.65	0.37	13.0	0.46	0.46	0.09
6.0	Apocalypse	7.0	0.902	0.827	14.0	5.57	5.57	0.50	20.0	0.52	0.52	0.50
7.0	GlassFrog-Average	5.0	0.902	0.825	8.0	5.26	5.27	0.26	29.0	0.68	0.60	0.33
8.0	GlassFrog-SM	5.0	0.902	0.825	17.0	5.77	5.92	0.20	21.0	0.52	0.33	0.20
9.0	BORREGOTECMTY	19.0	0.866	0.808	20.0	5.90	5.82	0.39	5.0	0.43	0.37	0.40
10.0	EMC-EB	3.0	0.907	0.805	39.0	6.75	6.66	0.50	9.0	0.45	0.40	0.48
11.5	lmaUCL-Covariates	22.0	0.852	0.760	27.0	6.28	6.29	0.28	3.0	0.42	0.41	0.11
11.5	CN2L-Average	27.0	0.843	0.792	9.0	5.31	5.31	0.35	16.0	0.49	0.49	0.33
13.0	VikingAI-Logistic	20.0	0.865	0.754	21.0	6.02	5.91	0.26	11.5	0.45	0.35	0.20
14.0	lmaUCL-Std	21.0	0.859	0.781	28.0	6.30	6.33	0.26	4.0	0.42	0.41	0.09
15.5	CN2L-RandomForest	10.0	0.896	0.792	16.0	5.73	5.73	0.42	31.0	0.71	0.71	0.41
15.5	FortuneTellerFish-SuStaIn	40.0	0.806	0.685	3.0	4.81	4.81	0.21	14.0	0.49	0.49	0.18
17.0	CN2L-NeuralNetwork	41.0	0.783	0.717	10.0	5.36	5.36	0.34	7.0	0.44	0.44	0.27
18.0	BenchmarkMixedEffectsAPOE	35.0	0.822	0.749	2.0	4.75	4.75	0.36	23.0	0.57	0.57	0.40
19.0	Tohka-Ciszek-RandomForestLin	17.0	0.875	0.796	22.0	6.03	6.03	0.15	22.0	0.56	0.56	0.37
20.0	BGU-LSTM	12.0	0.883	0.779	25.0	6.09	6.12	0.39	25.0	0.60	0.60	0.23
21.0	DIKU-GeneralisedLog-Custom	13.0	0.878	0.790	11.5	5.40	5.40	0.26	38.5	1.05	1.05	0.05
22.0	DIKU-GeneralisedLog-Std	14.0	0.877	0.790	11.5	5.40	5.40	0.26	38.5	1.05	1.05	0.05
23.0	CyberBrains	34.0	0.823	0.747	6.0	5.16	5.16	0.24	26.0	0.62	0.62	0.12
24.0	AlgosForGood	24.0	0.847	0.810	13.0	5.46	5.11	0.13	30.0	0.69	3.31	0.19
25.0	lmaUCL-halfD1	26.0	0.845	0.753	38.0	6.53	6.51	0.31	6.0	0.44	0.42	0.13
26.0	BGU-RF	28.0	0.838	0.673	29.5	6.33	6.10	0.35	17.5	0.50	0.38	0.26
27.0	Mayo-BAI-ASU	52.0	0.691	0.624	5.0	4.98	4.98	0.32	19.0	0.52	0.52	0.40
28.0	BGU-RFFIX	32.0	0.831	0.673	29.5	6.33	6.10	0.35	17.5	0.50	0.38	0.26
29.0	FortuneTellerFish-Control	31.0	0.834	0.692	1.0	4.70	4.70	0.22	50.0	1.38	1.38	0.50
30.0	GlassFrog-LCMEM-HDR	5.0	0.902	0.825	31.0	6.34	6.21	0.47	51.0	1.66	1.59	0.41
31.0	SBIA	43.0	0.776	0.721	43.0	7.10	7.38	0.40	8.0	0.44	0.31	0.13
32.0	Chen-MCW-Stratify	23.0	0.848	0.783	36.5	6.48	6.24	0.23	36.5	1.01	1.00	0.11
33.0	Rocket	54.0	0.680	0.519	18.0	5.81	5.71	0.34	28.0	0.64	0.64	0.29
34.5	Chen-MCW-Std	29.0	0.836	0.778	36.5	6.48	6.24	0.23	36.5	1.01	1.00	0.11
34.5	BenchmarkSVM	30.0	0.836	0.764	40.0	6.82	6.82	0.42	32.0	0.86	0.84	0.50
36.0	DIKU-ModifiedMri-Custom	36.5	0.807	0.670	33.5	6.44	6.44	0.27	34.5	0.92	0.92	0.01
37.0	DIKU-ModifiedMri-Std	38.5	0.806	0.670	33.5	6.44	6.44	0.27	34.5	0.92	0.92	0.01
38.0	DIVE	51.0	0.708	0.568	42.0	7.10	7.10	0.34	15.0	0.49	0.49	0.13
39.0	ITESMCEM	53.0	0.680	0.657	26.0	6.26	6.26	0.35	33.0	0.92	0.92	0.43
40.0	BenchmarkLastVisit	44.5	0.774	0.792	41.0	7.05	7.05	0.45	27.0	0.63	0.61	0.47
41.0	Sunshine-Conservative	25.0	0.845	0.816	44.5	7.90	7.90	0.50	43.5	1.12	1.12	0.50
42.0	BravoLab	46.0	0.771	0.682	47.0	8.22	8.22	0.49	24.0	0.58	0.58	0.41
43.0	DIKU-ModifiedLog-Custom	36.5	0.807	0.670	33.5	6.44	6.44	0.27	47.5	1.17	1.17	0.06
44.0	DIKU-ModifiedLog-Std	38.5	0.806	0.670	33.5	6.44	6.44	0.27	47.5	1.17	1.17	0.06
45.0	Sunshine-Std	33.0	0.825	0.771	44.5	7.90	7.90	0.50	43.5	1.12	1.12	0.50
46.0	Billabong-UniAV45	49.0	0.720	0.616	48.5	9.22	8.82	0.29	41.5	1.09	0.99	0.45
47.0	Billabong-Uni	50.0	0.718	0.622	48.5	9.22	8.82	0.29	41.5	1.09	0.99	0.45
48.0	ATRI-Biostat-JMM	42.0	0.779	0.710	51.0	12.88	69.62	0.35	54.0	1.95	5.12	0.33
49.0	Billabong-Multi	56.0	0.541	0.556	55.0	27.01	19.90	0.46	40.0	1.07	1.07	0.45
50.0	ATRI-Biostat-MA	47.0	0.741	0.671	52.0	12.88	11.32	0.19	53.0	1.84	5.27	0.23
51.0	BIGS2	58.0	0.455	0.488	50.0	11.62	14.65	0.50	49.0	1.20	1.12	0.07
52.0	Billabong-MultiAV45	57.0	0.527	0.530	56.0	28.45	21.22	0.47	45.0	1.13	1.07	0.47
53.0	ATRI-Biostat-LTJMM	55.0	0.636	0.563	54.0	16.07	74.65	0.33	52.0	1.80	5.01	0.26
-	Threedays	2.0	0.921	0.823	-	-	-	-	-	-	-	-
-	ARAMIS-Pascal	15.0	0.876	0.850	-	-	-	-	-	-	-	-
-	IBM-OZ-Res	18.0	0.868	0.766	-	-	-	-	46.0	1.15	1.15	0.50
-	Orange	44.5	0.774	0.792	-	-	-	-	-	-	-	-
-	SMALLHEADS-NeuralNet	48.0	0.737	0.605	53.0	13.87	13.87	0.41	-	-	-	-
-	SMALLHEADS-LinMixedEffects	-	-	-	46.0	8.09	7.94	0.04	-	-	-	-
-	Tohka-Ciszek-SMNSR	-	-	-	19.0	5.87	5.87	0.14	-	-	-	-

The results on the D2 dataset suggest that there is no clear winner across all categories. While Frog had the best overall submission with the lowest sum of ranks, for each performance metric individually we had different winners: Frog (clinical diagnosis MAUC of 0.931), ARAMIS-Pascal (clinical diagnosis BCA of 0.850), FortuneTellerFish-Control (ADAS MAE and WES of 4.70), VikingAI-Sigmoid (ADAS CPA of 0.02), EMC1-Std/EMC1-Custom (ventricle MAE of 0.41 and ventricle WES of 0.29), and DIKU-ModifiedMri-Std/DIKU-ModifiedMri-Custom (ventricle CPA of 0.01).
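
For readers unfamiliar with the regression metrics in the tables, here is a minimal sketch of MAE, WES and CPA. The formulas are assumptions based on the TADPOLE challenge design as commonly described (WES weights each absolute error by the inverse width of the submitted 50% confidence interval; CPA is the absolute difference between the observed coverage and the nominal 0.5); this page does not spell them out. All numbers are hypothetical.

```python
def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def wes(actual, predicted, lower, upper):
    """Weighted error score; weights are assumed to be 1 / (CI width)."""
    weights = [1.0 / (u - l) for l, u in zip(lower, upper)]
    weighted = sum(w * abs(a - p) for w, a, p in zip(weights, actual, predicted))
    return weighted / sum(weights)


def cpa(actual, lower, upper, nominal=0.5):
    """Coverage probability accuracy: |observed coverage - nominal coverage|."""
    covered = sum(l <= a <= u for a, l, u in zip(actual, lower, upper))
    return abs(covered / len(actual) - nominal)


# Hypothetical ADAS scores, forecasts and 50% confidence intervals.
actual = [20.0, 30.0, 10.0, 25.0]
predicted = [22.0, 27.0, 10.0, 30.0]
lower = [21.0, 25.0, 9.0, 28.0]
upper = [23.0, 29.0, 11.0, 32.0]

example_mae = mae(actual, predicted)
example_wes = wes(actual, predicted, lower, upper)
example_cpa = cpa(actual, lower, upper)
```

Under these definitions lower is better for all three metrics, and WES equals MAE whenever every interval has the same width, which would be consistent with the many rows where the two columns coincide.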

Overall scores — cross-sectional dataset D3¶

RANK	FILE NAME	MAUC RANK	MAUC	BCA	ADAS RANK	ADAS MAE	ADAS WES	ADAS CPA	VENTS RANK	VENTS MAE	VENTS WES	VENTS CPA
1.0	GlassFrog-Average	3.0	0.897	0.826	5.0	5.86	5.57	0.25	3.0	0.68	0.55	0.24
2.0	GlassFrog-LCMEM-HDR	3.0	0.897	0.826	9.0	6.57	6.56	0.34	1.0	0.48	0.38	0.24
3.0	GlassFrog-SM	3.0	0.897	0.826	4.0	5.77	5.77	0.19	9.0	0.82	0.55	0.07
4.0	Tohka-Ciszek-RandomForestLin	11.0	0.865	0.786	2.0	4.92	4.92	0.10	10.0	0.83	0.83	0.35
7.0	VikingAI-Logistic	8.0	0.876	0.768	6.0	5.94	5.91	0.22	22.0	1.04	1.01	0.18
7.0	Rocket	10.0	0.865	0.771	3.0	5.27	5.14	0.39	23.0	1.06	1.06	0.27
7.0	lmaUCL-Std	13.0	0.854	0.698	17.0	6.95	6.93	0.05	6.0	0.81	0.81	0.22
7.0	lmaUCL-Covariates	13.0	0.854	0.698	17.0	6.95	6.93	0.05	6.0	0.81	0.81	0.22
7.0	lmaUCL-halfD1	13.0	0.854	0.698	17.0	6.95	6.93	0.05	6.0	0.81	0.81	0.22
10.0	EMC1-Std	30.0	0.705	0.567	7.0	6.29	6.19	0.47	4.0	0.80	0.62	0.48
11.0	SBIA	28.0	0.779	0.782	10.0	6.63	6.43	0.40	8.0	0.82	0.75	0.18
13.0	BGU-LSTM	6.0	0.877	0.776	14.0	6.75	6.17	0.39	27.0	1.11	0.79	0.17
13.0	BGU-RFFIX	6.0	0.877	0.776	14.0	6.75	6.17	0.39	27.0	1.11	0.79	0.17
13.0	BGU-RF	6.0	0.877	0.776	14.0	6.75	6.17	0.39	27.0	1.11	0.79	0.17
15.0	BravoLab	18.0	0.813	0.730	28.0	8.02	8.02	0.47	2.0	0.64	0.64	0.42
16.5	BORREGOTECMTY	15.0	0.852	0.748	8.0	6.44	5.86	0.46	30.0	1.14	1.02	0.49
16.5	CyberBrains	17.0	0.830	0.755	1.0	4.72	4.72	0.21	35.0	1.54	1.54	0.50
18.0	ATRI-Biostat-MA	19.0	0.799	0.772	26.0	7.39	6.63	0.04	11.0	0.93	0.97	0.10
19.5	EMC-EB	9.0	0.869	0.765	27.0	7.71	7.91	0.50	21.0	1.03	1.07	0.49
19.5	DIKU-GeneralisedLog-Std	20.0	0.798	0.684	20.5	6.99	6.99	0.17	16.5	0.95	0.95	0.05
21.0	DIKU-GeneralisedLog-Custom	21.0	0.798	0.681	20.5	6.99	6.99	0.17	16.5	0.95	0.95	0.05
22.5	DIKU-ModifiedLog-Std	22.5	0.798	0.688	23.5	7.10	7.10	0.17	13.5	0.95	0.95	0.05
22.5	DIKU-ModifiedMri-Std	22.5	0.798	0.688	23.5	7.10	7.10	0.17	13.5	0.95	0.95	0.05
24.5	DIKU-ModifiedLog-Custom	24.5	0.798	0.691	23.5	7.10	7.10	0.17	13.5	0.95	0.95	0.05
24.5	DIKU-ModifiedMri-Custom	24.5	0.798	0.691	23.5	7.10	7.10	0.17	13.5	0.95	0.95	0.05
26.0	Billabong-Uni	31.0	0.704	0.626	11.5	6.69	6.69	0.38	19.5	0.98	0.98	0.48
27.0	Billabong-UniAV45	32.0	0.703	0.620	11.5	6.69	6.69	0.38	19.5	0.98	0.98	0.48
28.0	ATRI-Biostat-JMM	26.0	0.794	0.781	29.0	8.45	8.12	0.34	18.0	0.97	1.45	0.37
29.0	CBIL	16.0	0.847	0.780	33.0	10.99	11.65	0.49	29.0	1.12	1.12	0.39
30.0	BenchmarkLastVisit	27.0	0.785	0.771	19.0	6.97	7.07	0.42	33.0	1.17	0.64	0.11
31.0	Billabong-MultiAV45	33.0	0.682	0.603	30.5	9.30	9.30	0.43	24.5	1.09	1.09	0.49
32.0	Billabong-Multi	34.0	0.681	0.605	30.5	9.30	9.30	0.43	24.5	1.09	1.09	0.49
33.0	ATRI-Biostat-LTJMM	29.0	0.732	0.675	34.0	12.74	63.98	0.37	32.0	1.17	1.07	0.40
34.0	BenchmarkSVM	36.0	0.494	0.490	32.0	10.01	10.01	0.42	31.0	1.15	1.18	0.50
35.0	DIVE	35.0	0.512	0.498	35.0	16.66	16.74	0.41	34.0	1.42	1.42	0.34
-	IBM-OZ-Res	1.0	0.905	0.830	-	-	-	-	36.0	1.77	1.77	0.50

Here, most submissions perform worse than the equivalent predictions on the D2 longitudinal dataset, due to the lack of longitudinal, multimodal data. GlassFrog-Average had the best overall rank and obtained a diagnosis MAUC of 0.897, an ADAS MAE of 5.86 and a Ventricle MAE of 0.68 (% ICV). For diagnosis prediction, IBM-OZ-Res obtained the highest clinical diagnosis scores: MAUC of 0.905 and BCA of 0.830. For ADAS predictions, CyberBrains had the best MAE and WES of 4.72. ATRI-Biostat-MA obtained the best ADAS CPA of 0.04. For Ventricle prediction, GlassFrog-LCMEM-HDR had the best MAE of 0.48 (% ICV) and the best WES of 0.38, while the 6 DIKU submissions obtained the best CPA of 0.05.

Additional entries¶

In addition to the standard predictions and the benchmarks, we also included two consensus predictions by taking the mean (ConsensusMean) and median (ConsensusMedian) over all predictions from all participants. For D2 predictions, the ConsensusMedian submission obtained the best overall rank, with a diagnosis MAUC of 0.925 (second-best), an ADAS-Cog 13 MAE of 5.12 (ninth-best) and a Ventricle MAE of 0.38, the best result in this category for D2. On the other hand, ConsensusMean ranked 3rd overall on D2, with a diagnosis MAUC of 0.920 (fourth-best), an ADAS-Cog 13 MAE of 3.75, the best prediction in this category, and a Ventricle MAE of 0.48 (rank 16). For ADAS-Cog 13 and Ventricle volume prediction, the best consensus methods reduced the error by 11% and 8% respectively compared to the best prediction from participants or benchmarks.
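
A consensus forecast of this kind can be sketched as a per-subject mean or median across teams. The team names and values below are hypothetical, and the sketch only covers a continuous target such as ventricle volume; how the organisers aligned and combined the actual submissions is not described here.

```python
from statistics import mean, median


def consensus(forecasts, combine):
    """Combine forecasts across teams, subject by subject.

    forecasts: dict team -> list of predictions, one per subject, same order.
    combine: aggregation function such as statistics.mean or statistics.median.
    """
    per_subject = zip(*forecasts.values())
    return [combine(values) for values in per_subject]


# Hypothetical ventricle forecasts (% ICV) from three teams for three subjects.
forecasts = {
    "team_a": [2.0, 3.5, 1.0],
    "team_b": [2.4, 3.1, 1.2],
    "team_c": [4.0, 3.3, 1.1],
}
consensus_mean = consensus(forecasts, mean)
consensus_median = consensus(forecasts, median)
```

In this toy example the first subject has one outlying forecast (4.0), so the median (2.4) is pulled less than the mean (about 2.8), which illustrates why the two consensus entries can rank differently.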

To test whether the best results could have been obtained by chance due to randomness in the test set, we evaluated n=62 (as many as the number of entries) randomly perturbed predictions from the simplest benchmark, BenchmarkLastVisit, and computed the best results obtained by any of these predictions. These are shown as RandomisedBest, and obtain high scores especially for ADAS-Cog 13, ranking 3rd with a final MAE of 4.52. High performance scores are also obtained for Ventricles, ranking 14th with an MAE of 0.47, a 14% increase in error from the best forecast, while for diagnosis prediction a lower MAUC score of 0.797 is obtained, ranking 43rd. This suggests that entries with a higher MAE than RandomisedBest should be interpreted with care, as their scores and ranks could be high due to randomness in the test set. This is particularly relevant for ADAS-Cog 13 predictions, where only BenchmarkMixedEffects and ConsensusMean obtained better results, suggesting that none of the other methods predicts ADAS-Cog 13 any better than random guessing based on the last available measurement.
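
The idea behind RandomisedBest can be illustrated with a short sketch: perturb a baseline forecast many times and keep the best score any perturbed copy achieves. The Gaussian noise model, the `noise_sd` parameter and the data are assumptions made for illustration; the page does not state how the perturbations were generated.

```python
import random


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def randomised_best(actual, baseline, n_draws=62, noise_sd=1.0, seed=0):
    """Best (lowest) MAE over n_draws randomly perturbed copies of baseline.

    Assumption: perturbations are independent Gaussian noise per subject.
    """
    rng = random.Random(seed)
    best = float("inf")
    for _ in range(n_draws):
        perturbed = [b + rng.gauss(0.0, noise_sd) for b in baseline]
        best = min(best, mae(actual, perturbed))
    return best


# Hypothetical ADAS scores and last-visit forecasts for four subjects.
actual = [20.0, 31.0, 12.0, 25.0]
baseline = [18.0, 30.0, 15.0, 25.0]

best_of_62 = randomised_best(actual, baseline, n_draws=62)
best_of_1 = randomised_best(actual, baseline, n_draws=1)
```

Because the minimum is taken over more draws, `best_of_62` can never exceed `best_of_1` for the same seed. This is the selection effect the comparison is meant to expose: picking the best of many noisy forecasts yields an optimistic score.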

It is worth mentioning that, while drafting the manuscript, we discovered that dropping APOE as a covariate in the BenchmarkMixedEffectsAPOE model considerably decreases the error in ADAS prediction, so we included the resulting model (BenchmarkMixedEffects) as an additional entry for scientific interest.

Additional entries for D2¶

RANK	FILE NAME	MAUC RANK	MAUC	BCA	ADAS RANK	ADAS MAE	ADAS WES	ADAS CPA	VENTS RANK	VENTS MAE	VENTS WES	VENTS CPA
1.5	ConsensusMedian	1.0	0.925	0.857	4.0	5.12	5.01	0.28	1.0	0.38	0.33	0.09
1.5	ConsensusMean	2.0	0.920	0.835	1.0	3.75	3.54	0.00	3.0	0.48	0.45	0.13
3.5	BenchmarkMixedEffects	3.0	0.846	0.706	2.0	4.19	4.19	0.31	4.0	0.56	0.56	0.50
3.5	RandomisedBest	4.0	0.797	0.803	3.0	4.52	4.52	0.27	2.0	0.47	0.45	0.33

Additional entries for D3¶

RANK	FILE NAME	MAUC RANK	MAUC	BCA	ADAS RANK	ADAS MAE	ADAS WES	ADAS CPA	VENTS RANK	VENTS MAE	VENTS WES	VENTS CPA
1.0	ConsensusMean	1.0	0.917	0.821	2.0	4.58	4.34	0.12	2.0	0.73	0.72	0.09
2.0	ConsensusMedian	2.0	0.905	0.817	3.0	5.44	5.37	0.19	1.0	0.71	0.65	0.10
3.0	BenchmarkMixedEffects	3.0	0.839	0.728	1.0	4.23	4.23	0.34	3.0	1.13	1.13	0.50

Confidence Intervals¶

Below are confidence intervals (CIs) computed for every submission, based on 50 bootstraps of the test set D4. The first figure (Fig. 1) shows CIs based on forecasts from D2, while the second (Fig. 2) shows CIs for forecasts on D3.

Fig 1. Confidence intervals for forecasts based on the longitudinal D2 prediction set.

Fig 2. Confidence intervals for forecasts based on the cross-sectional D3 prediction set.
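
As a rough illustration of how such intervals can be obtained, the sketch below computes a percentile bootstrap interval for MAE by resampling test subjects with replacement. Only the number of bootstraps (50) comes from this page; the 95% level, the percentile method and resampling at subject level are assumptions, and the data are hypothetical.

```python
import random


def bootstrap_mae_interval(actual, predicted, n_boot=50, level=0.95, seed=0):
    """Percentile bootstrap interval for MAE (assumed method, see lead-in)."""
    rng = random.Random(seed)
    n = len(actual)
    scores = []
    for _ in range(n_boot):
        # Resample subject indices with replacement and recompute the MAE.
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(sum(abs(actual[i] - predicted[i]) for i in idx) / n)
    scores.sort()
    tail = (1.0 - level) / 2.0
    lo_idx = int(tail * (n_boot - 1))
    hi_idx = (n_boot - 1) - lo_idx  # symmetric index from the upper end
    return scores[lo_idx], scores[hi_idx]


# Hypothetical ventricle volumes (% ICV) and forecasts for six subjects.
actual = [2.1, 3.4, 1.8, 2.9, 4.0, 2.5]
predicted = [2.4, 3.0, 1.9, 3.5, 3.6, 2.5]
ci_low, ci_high = bootstrap_mae_interval(actual, predicted)
```

With only 50 resamples the interval endpoints are themselves noisy, so small differences between overlapping intervals should not be over-interpreted.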

Meta-analysis¶

To understand which types of features and algorithms yielded higher performance, we show here associations between predictive performance and feature selection methods, different types of features, methods for data imputation, and methods for forecasting of target variables (diagnosis, ADAS and ventricles). For each type of feature/method and each target variable (clinical diagnosis, ADAS and Ventricles), we show the distribution of estimated coefficients from a general linear model, derived from the approximated inverse Hessian matrix at the maximum likelihood estimator. From this analysis we removed outliers, defined as submissions with ADAS MAE higher than 10 and Ventricle MAE higher than 0.15 (%ICV). For all plots, distributions to the right of the gray dashed vertical line are associated with better performance.

The results in Fig. 3 below show trends that indicate which aspects of the methods could be associated with better performance. For feature selection, methods that perform manual selection of features are associated with better predictive performance in ADAS and Ventricles. In terms of feature types, including features from many modalities was generally associated with an increase in overall performance, except for FDG (for all target variables). Moreover, augmented features correlate with overall performance improvements, especially for ventricle prediction. In terms of data imputation methods, while some differences can be observed, no clear conclusions can be drawn currently. In terms of prediction models, we notice that neural networks are more significantly associated with increased performance in ventricle prediction, while disease progression models are associated with decreased performance in prediction of clinical diagnosis and ventricles. However, given the small number of methods tested (<50) and the large number of degrees of freedom (n=21), these results should be interpreted with care.

Fig 3. Associations between the prediction of clinical diagnosis, ADAS and Ventricle volume and different strategies of (top) feature selection, (upper-middle) types of features, (lower-middle) data imputation strategies and (bottom) prediction methods for the target variables. For each type of feature/method (rows) and each target variable (columns), we show the distribution of estimated coefficients from a general linear model. Positive coefficients, where distributions lie to the right of the dashed vertical line, indicate better performance than baseline (vertical dashed line). For ADAS and Ventricle prediction, we flipped the sign of the coefficients, to consistently show better performance to the right of the vertical line.

Demographics of D1-D4 datasets¶

Summary of TADPOLE datasets D1-D4. Each subject has been allocated to either Control, MCI or AD group based on diagnosis at the first available visit within each dataset. The bottom table contains the number of visits with data available, by modality. For example, in D4 there were a total of 150 visits where an MRI scan was undertaken, which represented a total of 64% of all visits analysed across all subjects in D4.


Measure	D1	D2	D3	D4
Subjects	1667	896	896	219
Cognitively Normal
Number (%)	508 (30.5%)	369 (41.2%)	299 (33.4%)	94 (42.9%)
Visits per subject	8.3 (4.5)	8.5 (4.9)	1.0 (0.0)	1.0 (0.2)
Age	74.3 (5.8)	73.6 (5.7)	72.3 (6.2)	78.4 (7.0)
Gender (% male)	48.6%	47.2%	43.5%	47.9%
MMSE	29.1 (1.1)	29.0 (1.2)	28.9 (1.4)	29.1 (1.1)
Converters	18 (3.5%)	9 (2.4%)
Mild Cognitive Impairment
Number (%)	841 (50.4%)	458 (51.1%)	269 (30.0%)	90 (41.1%)
Visits per subject	8.2 (3.7)	9.1 (3.6)	1.0 (0.0)	1.1 (0.3)
Age	73.0 (7.5)	71.6 (7.2)	71.9 (7.1)	79.4 (7.0)
Gender (% male)	59.3%	56.3%	58.0%	64.4%
MMSE	27.6 (1.8)	28.0 (1.7)	27.6 (2.2)	28.1 (2.1)
Converters	117 (13.9%)	37 (8.1%)		9 (10.0%)
Alzheimer’s Disease
Number (%)	318 (19.1%)	69 (7.7%)	136 (15.2%)	29 (13.2%)
Visits per subject	4.9 (1.6)	5.2 (2.6)	1.0 (0.0)	1.1 (0.3)
Age	74.8 (7.7)	75.1 (8.4)	72.8 (7.1)	82.2 (7.6)
Gender (% male)	55.3%	68.1%	55.9%	51.7%
MMSE	23.3 (2.0)	23.1 (2.0)	20.5 (5.9)	19.4 (7.2)
Converters				9 (31.0%)

Number of visits with available data (as % of total visits)
Cognitive	8862 (69.9%)	5218 (68.1%)	753 (84.0%)	223 (95.3%)
MRI	7884 (62.2%)	4497 (58.7%)	224 (25.0%)	150 (64.1%)
FDG	2119 (16.7%)	1544 (20.2%)	0 (0.0%)	0 (0.0%)
AV45	2098 (16.6%)	1758 (23.0%)	0 (0.0%)	0 (0.0%)
AV1451	89 (0.7%)	89 (1.2%)	0 (0.0%)	0 (0.0%)
DTI	779 (6.1%)	636 (8.3%)	0 (0.0%)	0 (0.0%)
CSF	2347 (18.5%)	1458 (19.0%)	0 (0.0%)	0 (0.0%)

Description of Algorithms¶

Summary¶

We had a total of 33 participating teams, who submitted a total of 58 forecasts from D2, 34 forecasts from D3, and 6 forecasts from custom prediction sets. A total of 8 D2/D3 submissions from 6 teams did not have predictions for all three target variables, so we only computed the performance metrics for the available target variables. Another 3 submissions lacked confidence intervals for either ADAS or ventricle volume, which we imputed using default low-width confidence ranges of 2 for ADAS and 0.002 for Ventricles/ICV.

Table 1 below summarizes the methods used in the submissions in terms of feature selection, handling of missing data, predictive models for clinical diagnosis and ADAS/Ventricles biomarkers, as well as training and prediction times. Condensed descriptions of each submitted method can be found here, while even more detailed descriptions are here (original files submitted by participants).

Submission	Feature selection	Number of features	Missing data imputation	Diagnosis prediction	ADAS/Vent. Prediction	Training time	Prediction time (one subject)
AlgosForGood	Manual	16+5*	forward-filling	Aalen model	linear regression	1 minute	1 second
Apocalypse	Manual	16	population average	SVM	linear regression	40 minutes	3 minutes
ARAMIS-Pascal	Manual	20	population average	Aalen model	-	16 seconds	0.02 seconds
ATRI-Biostat-JMM	automatic	15	random forest	random forest	linear mixed effects model	2 days	1 second
ATRI-Biostat-LTJMM	automatic	15	random forest	random forest	DPM	2 days	1 second
ATRI-Biostat-MA	automatic	15	random forest	random forest	DPM + linear mixed effects model	2 days	1 second
BGU-LSTM	automatic	67	none	feed-forward NN	LSTM	1 day	milliseconds
BGU-RF/ BGU-RFFIX	automatic	~67+1340*	none	semi-temporal RF	semi-temporal RF	a few minutes	milliseconds
BIGS2	automatic	all	Iterative Soft-Thresholded SVD	RF	linear regression	2.2 seconds	0.001 seconds
Billabong (all)	Manual	15-16	linear regression	linear scale	non-parametric SM	7 hours	0.13 seconds
BORREGOTECMTY	automatic	~100 + 400*	nearest-neighbour	regression ensemble	ensemble of regression + hazard models	18 hours	0.001 seconds
BravoLab	automatic	25	hot deck	LSTM	LSTM	1 hour	a few seconds
CBIL	Manual	21	linear interpolation	LSTM	LSTM	1 hour	one minute
Chen-MCW	Manual	9	none	linear regression	DPM	4 hours	< 1 hour
CN2L-NeuralNetwork	automatic	all	forward-filling	RNN	RNN	24 hours	a few seconds
CN2L-RandomForest	Manual	>200	forward-filling	RF	RF	15 minutes	< 1 minute
CN2L-Average	automatic	all	forward-filling	RNN/RF	RNN/RF	24 hours	< 1 minute
CyberBrains	Manual	5	population average	linear regression	linear regression	20 seconds	20 seconds
DIKU (all)	semi-automatic	18	none	Bayesian classifier/LDA + DPM	DPM	290 seconds	0.025 seconds
DIVE	Manual	13	none	KDE+DPM	DPM	20 minutes	0.06 seconds
EMC1	automatic	250	nearest neighbour	DPM + 2D spline + SVM	DPM + 2D spline	80 minutes	a few seconds
EMC-EB	automatic	200-338	nearest-neighbour	SVM classifier	SVM regressor	20 seconds	a few seconds
FortuneTellerFish-Control	Manual	19	nearest neighbour	multiclass ECOC SVM	linear mixed effects model	1 minute	< 1 second
FortuneTellerFish-SuStaIn	Manual	19	nearest neighbour	multiclass ECOC SVM + DPM	linear mixed effects model + DPM	5 hours	< 1 second
Frog	automatic	~70+420*	none	gradient boosting	gradient boosting	1 hour	-
GlassFrog-LCMEM-HDR	semi-automatic	all	forward-fill	multi-state model	DPM + regression	15 minutes	2 minutes
GlassFrog-SM	Manual	7	linear model	multi-state model	parametric SM	93 seconds	0.1 seconds
GlassFrog-Average	semi-automatic	all	forward-fill/linear	multi-state model	DPM + SM + regression	15 minutes	2 minutes
IBM-OZ-Res	Manual	10-15	filled with zero	stochastic gradient boosting	stochastic gradient boosting	20 minutes	0.1 seconds
ITESMCEM	Manual	48	mean of previous values	RF	LASSO + Bayesian ridge regression	20 minutes	0.3 seconds
lmaUCL (all)	Manual	5	regression	multi-task learning	multi-task learning	2 hours	milliseconds
Mayo-BAI-ASU	Manual	15	population average	linear mixed effects model	linear mixed effects model	20 minutes	1.3 seconds
Orange	Manual	17	none	clinician’s decision tree	clinician’s decision tree	none	0.2 seconds
Rocket	manual	6	median of diagnostic group	linear mixed effects model	DPM	5 minutes	0.3 seconds
SBIA	Manual	30-70	dropped visits with missing data	SVM + density estimator	linear mixed effects model	1 minute	a few seconds
SPMC-Plymouth (all)	Automatic	20	none	?	-	?	1 minute
SmallHeads-NeuralNetwork	automatic	376	nearest neighbour	deep fully-connected NN	deep fully-connected NN	40 minutes	0.06 seconds
SmallHeads-LinMixedEffects	automatic	?	nearest neighbour	-	linear mixed effects model	25 minutes	0.13 seconds
Sunshine (all)	semi-automatic	6	population average	SVM	linear model	30 minutes	< 1 minute
Threedays	Manual	16	none	RF	-	1 minute	3 seconds
Tohka-Ciszek-SMNSR	Manual	~32	nearest neighbour	-	SMNSR	several hours	a few seconds
Tohka-Ciszek-RandomForestLin	Manual	~32	mean patient value	RF	linear model	a few minutes	a few seconds
VikingAI (all)	Manual	10	none	DPM + ordered logit model	DPM	10 hours	8 seconds
BenchmarkLastVisit	None	3	none	constant model	constant model	7 seconds	milliseconds
BenchmarkMixedEffects	None	3	none	Gaussian model	linear mixed effects model	30 seconds	0.003 seconds
BenchmarkMixedEffectsAPOE	None	4	none	Gaussian model	linear mixed effects model	30 seconds	0.003 seconds
BenchmarkSVM	Manual	6	mean of previous values	SVM	support vector regressor (SVR)	20 seconds	0.001 seconds

Table 1. Summary of methods used in the TADPOLE submissions. Keywords: SVM – Support Vector Machine, RF – random forest, LSTM – long short-term memory network, NN – neural network, RNN – recurrent neural network, SMNSR – Sparse Multimodal Neighbourhood Search Regression, DPM – disease progression model, KDE – kernel density estimation, LDA – linear discriminant analysis, SM – slope model, ECOC – error-correcting output codes, SVD – singular value decomposition. (*) Augmented features.