Data

Diagram showing TADPOLE biomarkers. Source of individual images: Wikimedia Commons

TADPOLE challenge participants are free to use any data to inform their predictions (such as when building predictive models). For convenience, TADPOLE provides three "standard" data sets, derived from the ADNI study:

  • D1 - a comprehensive longitudinal data set for training;
  • D2 - a comprehensive longitudinal data set on rollover subjects for forecasting;
  • D3 - a limited forecasting data set on the same rollover subjects as D2.

We also refer to D4, the future test set. The archive containing all these standard data and associated files is available from the ADNI website (log in to ADNI, then follow Download -> Study Data -> Test Data -> Data for Challenges -> "Tadpole Challenge Data"). You will need to register with ADNI and apply for access to these data files. In the archive, the file TADPOLE_D1_D2.csv contains both data sets D1 and D2, with row-wise membership indicated by a 1 in the "D1" and "D2" columns, respectively. The file TADPOLE_D3.csv contains D3.

All queries on the structure and meaning of the standard data should be posted on the Google group. The rest of this page provides detail on the content and construction of the TADPOLE standard data sets.

List of biomarkers

For advancing the diagnosis of dementia, assessment of quantitative biomarkers (medical measurements that can indicate a disease) in addition to cognitive tests is of great value. The five most commonly investigated biomarkers were recently included in the revised diagnostic criteria for AD and MCI due to AD (Albert et al., 2011; McKhann et al., 2011). These five biomarkers can be divided into two categories: measures of the amyloid beta protein and measures of damage to nerve cells (Jack et al., 2012). For the first category, amyloid beta can be measured using either cerebrospinal fluid (CSF) puncture or amyloid positron emission tomography (PET). For the second category, damage to the nerve cells can be measured indirectly by quantifying the fraction of tau protein in the CSF or using tau-PET, or directly by quantifying brain metabolism using fluoro-deoxyglucose (FDG) PET or atrophy using magnetic resonance imaging (MRI).

TADPOLE standard datasets contain some or all of the following biomarkers:

  1. Main cognitive tests (excluding subtypes) - neuropsychological tests administered by a clinical expert
    1. CDR Sum of Boxes
    2. ADAS11
    3. ADAS13
    4. MMSE
    5. RAVLT
    6. MoCA
    7. ECog
  2. MRI ROIs (Freesurfer) - measures of brain structural integrity
    1. volumes
    2. cortical thicknesses
    3. surface areas
  3. FDG PET ROI averages - measures cell metabolism, where cells affected by AD show reduced metabolism
  4. AV45 PET ROI averages - measures amyloid-beta load in the brain, where amyloid-beta is a protein that mis-folds (i.e. its 3D structure is not properly constructed), which then leads to AD
  5. AV1451 PET ROI averages - measures tau load in the brain, where tau is another protein which, when abnormal, damages neurons and thus leads to AD
  6. DTI ROI measures - measures microstructural parameters related to cells and axons (radial diffusivity, axial diffusivity, etc.)
    1. Mean diffusivity
    2. Axial diffusivity
    3. Radial diffusivity
  7. CSF biomarkers - amyloid and tau levels in the cerebrospinal fluid (CSF), as opposed to the cerebral cortex
  8. Others:
    1. APOE status - a gene that is a risk factor for developing AD
    2. Demographic information: age, gender, education, etc.
    3. Diagnosis: either cognitively normal (CN), mild cognitive impairment (MCI) or Alzheimer's disease (AD).

Getting started

In TADPOLE_D1_D2.csv, each row represents data for one particular visit of a subject, and each column represents a feature or measurement (commonly called a biomarker) from the subject at that particular visit. The first columns in the spreadsheet contain unique identifiers: RID (roster ID) uniquely identifies every subject, VISCODE (visit code) is the timepoint at which the visit takes place (bl is baseline or month 0, m06 is month 6, etc.), and SITE is the ID of the site where the visit took place. Other important columns are EXAMDATE, the date of the clinical examination; AGE, the subject's age at the baseline visit; and PTEDUCAT, the subject's total years of education.
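
As a small illustration of the visit codes described above, the following Python sketch converts a VISCODE value into months since baseline. It only assumes the two forms mentioned here ("bl" and "m" followed by a month number); the function name is hypothetical and any other code is returned as None rather than guessed.

```python
import re


def viscode_to_months(viscode):
    """Map a visit code to months since baseline, or None for other forms."""
    code = str(viscode).strip().lower()
    if code == "bl":
        # "bl" is the baseline visit, i.e. month 0.
        return 0
    match = re.fullmatch(r"m(\d+)", code)
    # "m06" -> 6, "m12" -> 12; anything else is left unresolved.
    return int(match.group(1)) if match else None


months = [viscode_to_months(v) for v in ["bl", "m06", "m12", "m120"]]
print(months)
```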

The TADPOLE_D1_D2.csv spreadsheet contains many types of biomarkers (or measurements), some more important than others. Participants can start by using only a small subset of the biomarkers which are known to be informative. Here is a list of biomarkers that we suggest participants unfamiliar with ADNI data start with:

  • The main measures to be predicted: DX, ADAS13, Ventricles
  • Cognitive tests: CDRSB, ADAS11, MMSE, RAVLT_immediate
  • MRI measures: Hippocampus, WholeBrain, Entorhinal, MidTemp
  • PET measures: FDG, AV45
  • CSF measures: ABETA_UPENNBIOMK9_04_19_17 (amyloid-beta level in CSF), TAU_UPENNBIOMK9_04_19_17 (tau level), PTAU_UPENNBIOMK9_04_19_17 (phosphorylated tau level)
  • Risk factors: APOE4, AGE
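
One possible way to load just this starter subset with pandas is sketched below. The column names are copied from the list above together with the identifier and D1/D2 membership columns; the function names and the toy CSV (including its values and the EXTRA column) are made up for illustration, and in practice the source would be the path to TADPOLE_D1_D2.csv.

```python
import io

import pandas as pd

# Column names taken from the starter list above, plus identifiers and
# the D1/D2 membership indicators.
STARTER_COLUMNS = [
    "RID", "VISCODE", "EXAMDATE", "D1", "D2",
    "DX", "ADAS13", "Ventricles",
    "CDRSB", "ADAS11", "MMSE", "RAVLT_immediate",
    "Hippocampus", "WholeBrain", "Entorhinal", "MidTemp",
    "FDG", "AV45",
    "ABETA_UPENNBIOMK9_04_19_17",
    "TAU_UPENNBIOMK9_04_19_17",
    "PTAU_UPENNBIOMK9_04_19_17",
    "APOE4", "AGE",
]


def load_starter_table(source, columns=STARTER_COLUMNS):
    """Read only the requested columns that are present in the file."""
    wanted = set(columns)
    return pd.read_csv(source, usecols=lambda name: name in wanted,
                       low_memory=False)


def split_d1_d2(table):
    """Return the rows flagged as D1 and the rows flagged as D2."""
    return table[table["D1"] == 1], table[table["D2"] == 1]


# Toy stand-in for the real spreadsheet (values are invented).
toy_csv = io.StringIO(
    "RID,VISCODE,D1,D2,DX,ADAS13,Ventricles,EXTRA\n"
    "1,bl,1,0,CN,8.0,30000,x\n"
    "1,m06,1,0,CN,9.0,30500,y\n"
    "2,bl,1,1,MCI,15.0,41000,z\n"
)
table = load_starter_table(toy_csv)
d1, d2 = split_d1_d2(table)
print(len(d1), len(d2))
```

A row can belong to both D1 and D2, so the two subsets are selected independently rather than as complementary halves.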

Other important biomarkers that participants can consider are the various MRI, PET and DTI measures for the hippocampus, entorhinal cortex, temporal and parietal lobe structures. Use the dictionary file (TADPOLE_D1_D2_Dict.csv) and search for keywords such as "hippocampus" or "hippocampal" to find the necessary columns. For example, column ST44CV_UCSFFSL_02_01_16_UCSFFSL51ALL_08_01_16 represents the volume of the left hippocampus. If desired, the measures for the left and right structures can be averaged together.
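
The keyword search and the left/right averaging described above could look roughly like the sketch below. The dictionary layout (a FLDNAME and a TEXT column), the right-hemisphere column name and all values are assumptions for the toy example only; the search deliberately scans every column of the dictionary so that it does not depend on that layout.

```python
import re

import pandas as pd


def find_columns(dictionary, keywords):
    """Return dictionary rows in which any cell mentions any keyword."""
    pattern = "|".join(re.escape(k) for k in keywords)
    # Join all cells of a row into one string, then search it.
    text = dictionary.astype(str).agg(" ".join, axis=1)
    return dictionary[text.str.contains(pattern, case=False, regex=True)]


def average_left_right(data, left_col, right_col):
    """Average two hemisphere columns; missing on either side gives NaN."""
    numeric = data[[left_col, right_col]].apply(pd.to_numeric, errors="coerce")
    return numeric.mean(axis=1, skipna=False)


LEFT = "ST44CV_UCSFFSL_02_01_16_UCSFFSL51ALL_08_01_16"  # from the text above
RIGHT = "RIGHT_HIPPOCAMPUS_PLACEHOLDER"  # hypothetical name, look it up

# Toy dictionary with an assumed layout.
toy_dict = pd.DataFrame({
    "FLDNAME": [LEFT, RIGHT, "MMSE"],
    "TEXT": ["Volume of left hippocampus",
             "Volume of right hippocampus",
             "Mini-Mental State Examination"],
})
hits = find_columns(toy_dict, ["hippocampus", "hippocampal"])

toy_data = pd.DataFrame({
    LEFT: [3000.0, 3100.0, None],
    RIGHT: [3200.0, 3300.0, 3400.0],
})
mean_hippocampus = average_left_right(toy_data, LEFT, RIGHT)
print(list(hits["FLDNAME"]))
```

Returning NaN when one hemisphere is missing is a design choice of this sketch; averaging whatever is available would be an equally reasonable alternative.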

Cognitive tests

Cognitive tests can help in the diagnosis of AD. In the tests, people are instructed to copy drawings similar to the one shown in the picture, remember words, read and subtract numbers. These intersecting pentagons are used in the Mini-Mental State Examination (MMSE), an extensively used cognitive test. Image source: Wikipedia

Cognitive tests are neuropsychological tests administered by a clinical expert which assess several skills: general cognition, memory, language, vision, etc. These cognitive tests give an overall sense of whether a person is aware of their symptoms, is aware of the surrounding environment (i.e. they know where they are and know the date and time) and whether they can remember a short list of words, follow instructions and do simple calculations.

Cognitive tests are important in Alzheimer's disease because they measure cognitive decline in a direct and quantifiable manner. In the cascade of pathological events that lead to Alzheimer's disease, cognitive decline is one of the last to become abnormal. This is because the first abnormalities appear at the microscopic scale, through the misfolding of a protein called amyloid beta. These are followed by changes at larger scales: loss of the neurons' myelin sheath, neuron death, visible atrophy in MRI scans and finally cognitive decline (Jack et al., 2013, 2010b).

These tests have several limitations: 1. they suffer from practice effects, i.e. patients who undertake the same test several times can learn or remember how to do it, and thus score higher at a follow-up visit, which limits the usefulness of the test in assessing dementia; 2. they have floor or ceiling effects, which means that many subjects might obtain the lowest or highest possible score; and 3. they can be biased, as they are administered by a human expert who might be influenced by prior knowledge of the subject's cognitive abilities.

MRI measures

Left: MRI scan of a subject before the onset of atrophy. Right: MRI scan of the subject with severe atrophy due to AD, which is visible throughout the brain. The coloured regions represent deep gray matter structures, affected early in the disease process (hippocampus = red; entorhinal cortex = blue; perirhinal cortex = green). MRI is a widely used technology for measuring the extent of atrophy and tracking the progression of Alzheimer's disease (AD). Source: Neurology.org

Magnetic resonance imaging (MRI) is a technique used to image the anatomy and the physiological processes of the brain and other body parts. With MRI, atrophy can be quantified by measuring the volume of gray matter (GM) and white matter (WM) of the brain. The GM is the brain tissue that consists of nerve cells and the WM consists of fibres connecting these nerve cells. GM can be found in the cortex of the brain and in sub-cortical areas. As a structural MRI scan shows contrast (i.e. differences in pixel intensities) between these tissues, it can be used for volume measurement. Atrophy is indicated by the loss of volume in a particular brain region between two scans, one initial scan and one follow-up scan. Atrophy is caused by the death of neurons in the affected regions.

TADPOLE datasets include three main types of structural MRI markers of atrophy: 1. ROI volumes, 2. ROI cortical thicknesses and 3. ROI surface areas, where an ROI (region of interest) is a 3D sub-region of the brain such as the inferior temporal lobe. Obtaining these structural MRI markers from the images is a long and complicated process. It involves registering (i.e. aligning) the MRI images with each other and performing a segmentation of the main brain structures using an atlas-based technique. More information can be found on the Freesurfer website: https://surfer.nmr.mgh.harvard.edu/fswiki/LongitudinalProcessing

Quantification of atrophy with MRI is a very important biomarker as it is widely available and non-invasive. Also, it is a good indicator of progression of MCI to dementia in an individual subject because it becomes abnormal in close temporal proximity to the onset of the cognitive impairment (Jack et al., 2013, 2010b).

These measures are computed with image analysis software called Freesurfer using two pipelines: cross-sectional (each subject visit is processed independently) or longitudinal (uses information from all the visits of a subject). The longitudinal measures are "more robust", but the downside is that there are more missing values in our TADPOLE spreadsheet. The MRI biomarkers in TADPOLE can be found in the columns containing UCSFFSX (cross-sectional) and UCSFFSL (longitudinal).
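
To see the trade-off between the two pipelines in a given copy of the spreadsheet, one could compare the fraction of missing cells in the UCSFFSX and UCSFFSL columns. The sketch below is one way to do that; the function name, the toy column names and the toy values are invented, and non-numeric cells (such as blanks) are treated as missing, which is an assumption about how gaps are encoded.

```python
import pandas as pd


def missing_fraction(data, marker):
    """Fraction of missing cells over all columns whose name contains marker."""
    cols = [c for c in data.columns if marker in c]
    if not cols:
        return float("nan")
    # Coerce to numbers so that blank or textual cells count as missing.
    numeric = data[cols].apply(pd.to_numeric, errors="coerce")
    return float(numeric.isna().to_numpy().mean())


# Toy table with one cross-sectional and one longitudinal column.
toy = pd.DataFrame({
    "RID": [1, 1, 2, 2],
    "TOY_VOLUME_UCSFFSX_EXAMPLE": [1.0, 2.0, 3.0, 4.0],
    "TOY_VOLUME_UCSFFSL_EXAMPLE": [1.0, None, None, 4.0],
})
cross_missing = missing_fraction(toy, "UCSFFSX")
long_missing = missing_fraction(toy, "UCSFFSL")
print(cross_missing, long_missing)
```

In real data the matched columns may also include non-measurement fields (for example image IDs or status strings), so the result is best read as a rough indicator.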

PET measures

Fluorodeoxyglucose (FDG) PET images for a cognitively normal subject (left), a subject with mild cognitive impairment (middle) and Alzheimer's disease (right). FDG PET measures cellular metabolism, which is known to decrease during the development of AD. There is decreased metabolism in the parietal region (white arrow) in the Alzheimer's subject compared to the cognitively normal subject. Images courtesy of Suzanne Baker, PhD; William Jagust, MD; and Susan Landau, PhD.

Positron Emission Tomography (PET) detects pairs of gamma rays emitted indirectly by a radioactive tracer, which is introduced into the body on a biologically active molecule. Three-dimensional images of tracer concentration within the body are then constructed by computer analysis. Before a PET scan, the patient is injected with a contrast agent (containing the tracer) which spreads throughout the brain and, in the case of amyloid and tau tracers, binds to the abnormal proteins. This enables researchers to track the concentration of these proteins. PET scans can be of several types, depending on the cellular and molecular processes that are being measured:

  • cell metabolism using Fluorodeoxyglucose (FDG) PET: Neuronal cell metabolism refers to the activity going on inside neuronal cells, such as the processing of food and elimination of waste. Neurons that are about to die will show reduced metabolism, so FDG PET is an indicator of neurodegeneration.
  • levels of abnormal proteins such as amyloid-beta through AV45 PET: Amyloid-beta misfolding (i.e. errors in the construction of its 3D structure) is thought to be one of the causes of Alzheimer's disease. High levels of misfolded amyloid-beta in the brain are thought to eventually lead to neurodegeneration and cognitive decline. AV45 PET can be used to measure the levels of amyloid in the brain.
  • levels of abnormal tau proteins through AV1451 PET: Abnormal phosphorylated tau (i.e. tau protein with added phosphate groups) that aggregates in an insoluble form eventually damages the neuron's cytoskeleton, causing the neuron's transport system to collapse and thus leading to the neuron's death.

The PET measures are important because they give information about molecular processes that happen in the brain. These are usually the first to become abnormal in the cascade of events that lead to Alzheimer's disease, and are therefore important early markers of the disease that is about to unfold. In TADPOLE, these PET measures might be indicative of whether a healthy control will eventually progress to mild cognitive impairment (MCI) status or not.

While PET scans are non-invasive, they have some limitations. One main limitation is that the patient is exposed to ionizing radiation, which limits the number of scans they can have in a specific time interval. PET scans also have a much lower spatial resolution compared to MRI scans. One other caveat with AV1451 PET (tau imaging) is that it is a very new imaging technology and still under research, and very few subjects in the TADPOLE dataset have undergone these scans. PET measures can be found in columns containing "BAIPETNMRC" (FDG PET), "UCBERKELEYAV45" (AV45) and "UCBERKELEYAV1451" (AV1451).
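
Because tracer coverage differs so much, it can be useful to count how many subjects have at least one value for each PET type before building a model. The sketch below uses the three column-name markers given above; the function name, the toy column names and the toy values are invented for illustration.

```python
import pandas as pd

# Column-name markers quoted in the text above.
PET_MARKERS = {
    "FDG": "BAIPETNMRC",
    "AV45": "UCBERKELEYAV45",
    "AV1451": "UCBERKELEYAV1451",
}


def subjects_with_data(data, marker, id_col="RID"):
    """Count subjects with at least one numeric value in the marker's columns."""
    cols = [c for c in data.columns if marker in c]
    if not cols:
        return 0
    numeric = data[cols].apply(pd.to_numeric, errors="coerce")
    has_value = numeric.notna().any(axis=1)
    return int(data.loc[has_value, id_col].nunique())


# Toy table: four visits from three subjects (values are invented).
nan = float("nan")
toy = pd.DataFrame({
    "RID": [1, 1, 2, 3],
    "TOY_ROI_BAIPETNMRC_EXAMPLE": [1.2, nan, 1.1, nan],
    "TOY_ROI_UCBERKELEYAV45_EXAMPLE": [nan, 1.0, nan, nan],
    "TOY_ROI_UCBERKELEYAV1451_EXAMPLE": [nan, nan, nan, nan],
})
coverage = {name: subjects_with_data(toy, marker)
            for name, marker in PET_MARKERS.items()}
print(coverage)
```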

DTI measures

(Left) Diffusion tensor image of a brain showing white matter fibre connections. The colors represent the direction of the connection (red for left-right, blue for superior-inferior, and green for anterior-posterior). (Middle) Zoomed image into the small region of interest (ROI), showing the diffusion tensor ellipses. Each ellipse indicates the direction where water molecules diffused (i.e. moved). (Right) Diagram showing the difference between isotropic diffusion (i.e. equal in all directions) versus anisotropic diffusion, along with the diffusivity measures that can be computed. Image sources: [1] [2] [3]    

While structural MRI measures brain atrophy, MRI can also be used to measure other markers of neurodegeneration that provide complementary information for dementia diagnosis. One such marker is diffusion tensor imaging (DTI). DTI can measure the degeneration of white matter (connections between neurons) in the brain. This is done by analysing the diffusion of water molecules along the neuron fibre connections. Molecular diffusion in tissues is not free, but reflects interactions with many obstacles, such as macromolecules, fibers, and membranes. When a fiber connection degrades, the diffusion becomes more isotropic (i.e. equal in every direction), which can be quantified using a measure called fractional anisotropy.  
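
For readers unfamiliar with these indices, the sketch below computes the standard diffusivity measures from the three eigenvalues of a single diffusion tensor: mean, axial and radial diffusivity, and fractional anisotropy (FA). It illustrates the usual textbook definitions only; the TADPOLE spreadsheet already contains regional summaries, and the function name and example eigenvalues are made up.

```python
import math


def diffusivity_measures(l1, l2, l3):
    """Standard DTI indices from the three tensor eigenvalues."""
    l1, l2, l3 = sorted((l1, l2, l3), reverse=True)
    md = (l1 + l2 + l3) / 3.0   # mean diffusivity
    ad = l1                     # axial diffusivity (largest eigenvalue)
    rd = (l2 + l3) / 2.0        # radial diffusivity
    spread = (l1 - md) ** 2 + (l2 - md) ** 2 + (l3 - md) ** 2
    energy = l1 ** 2 + l2 ** 2 + l3 ** 2
    # FA is 0 for isotropic diffusion and approaches 1 for a single direction.
    fa = math.sqrt(1.5 * spread / energy) if energy > 0 else 0.0
    return {"MD": md, "AD": ad, "RD": rd, "FA": fa}


isotropic = diffusivity_measures(1.0, 1.0, 1.0)
single_direction = diffusivity_measures(1.0, 0.0, 0.0)
print(isotropic["FA"], single_direction["FA"])
```

Equal eigenvalues give FA = 0, which corresponds to the isotropic case described above, while diffusion along a single direction gives FA = 1.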

DTI is important for analysing the progression of Alzheimer's disease. It has been shown that dementia affects white matter bundles (Sachdev et al., 2013). DTI has also shown great potential for aiding the diagnosis of dementia (Bozzali et al., 2002; Lu et al., 2014; Zhang et al., 2009).

DTI measures have some limitations. In ADNI, it is a relatively recent imaging modality, and thus many subjects will not have any DTI scans. Another common problem with diffusion tensor imaging and structural MRI is the partial volume effect, which means that measures at each voxel (3D pixel) are biased due to averaging across many different cells that are contained in that voxel. In the TADPOLE spreadsheet, DTI measures can be found in columns containing "DTIROI".

CSF measures

Diagram showing the cerebro-spinal fluid (CSF) coloured in blue, which is found in the subarachnoid space around the brain and spinal cord. Source: Wikipedia

The cerebrospinal fluid (CSF) is a clear, colourless body fluid found in and around the brain and spinal cord. It acts as a cushion or buffer for the brain, providing basic mechanical and immunological protection to the brain inside the skull. A sample of the CSF can be taken from patients invasively, by inserting a needle into the spinal canal in the lower back, a procedure called lumbar puncture.

Measures of CSF are very important for dementia research. In the CSF, the concentration of abnormal proteins such as amyloid-beta and tau is a strong indicator of AD. Abnormal concentrations of these proteins are some of the earliest signs of Alzheimer's disease and can indicate abnormalities many years before symptom onset.

The CSF measures have some limitations. One key limitation is that the lumbar puncture is highly invasive and thus not performed in many studies, although a fair number of ADNI subjects agreed to undergo the procedure. The CSF measures are also not specific to any particular part of the brain.

Risk factors

Diagram showing different risk factors related to lifestyle and the associated level of evidence. Source: Baumgart et al., 2015

There are several important risk factors that are known to increase the likelihood of developing dementia. The apolipoprotein E4 variant (APOE E4) is a variant of the APOE gene and is the largest known genetic risk factor for AD. Subjects with APOE E4 have a risk 10 to 30 times higher of developing AD compared to non-carriers (i.e. subjects without the variant). The exact mechanism through which the presence of APOE E4 leads to AD is not known. The presence of APOE E4 in a particular subject is denoted by a non-zero value in the APOE4 column in TADPOLE_D1_D2.csv.

Another known and important risk factor for AD is age: the older subjects are, the more likely they are to develop AD. Above the age of 65, the risk of developing dementia doubles every 5 years. Gender is another known risk factor, where women seem more likely to develop AD than men. The reasons for this are still unclear.

Finally, there exist many other risk factors related to existing medical conditions and lifestyle. Medical conditions such as type 2 diabetes, high blood pressure, high cholesterol, obesity or depression are known to increase the risk of developing dementia. Lifestyle factors known to increase the risk of developing dementia include physical inactivity, smoking, unhealthy diet, excessive alcohol consumption and head injuries.

While some of these risk factors (APOE, age and gender) are found in the TADPOLE_D1_D2.csv spreadsheet, the other factors are not present. The information does, however, exist in the ADNI database (one spreadsheet is under Study Data -> Medical History -> Medical History [ADNI1,GO,2]), and TADPOLE participants are welcome to use the information from these spreadsheets if desired.

TADPOLE standard data sets

The TADPOLE standard data sets can be downloaded from LONI. After logging in, go to Download -> Study Data -> Test Data -> Data for Challenges and download "Tadpole Challenge Data".

Here is a description of the TADPOLE standard data sets.

D1. TADPOLE Standard training set

The Standard training set (D1) was created from the ADNIMERGE spreadsheet, to which we added regional MRI (volumes, cortical thickness, surface area), PET (FDG, AV45 and AV1451), DTI (regional means of standard indices) and CSF measurements.

The MRI measurements included are FreeSurfer processed ROI volumes, cortical thicknesses, and cortical surface areas from the UCSFFSL (longitudinal pipeline) and UCSFFSX (cross-sectional pipeline) tables. Explicitly, we used the spreadsheets UCSFFSL_02_01_16.csv, UCSFFSL51ALL_08_01_16.csv, UCSFFSX_11_02_15.csv, and UCSFFSX51_08_01_16.csv. Duplicate rows were removed by retaining the row with the most recent RUNDATE and IMAGEUID.

The PET measurements included are ROI SUVR values: FDG, AV45 and AV1451. The spreadsheets used were: BAIPETNMRC_09_12_16.csv, UCBERKELEYAV45_10_17_16.csv, and UCBERKELEYAV1451_10_17_16.csv.

The DTI biomarkers included are ROI summary measures, for example mean diffusivity (MD) and axial diffusivity (AD), taken from the spreadsheet DTIROI_04_30_14.csv.

We also included three CSF biomarkers: Amyloid-beta, Tau and P-Tau. These values were taken from the Elecsys analysis, which can be found in the UPENNBIOMK9_04_19_17.csv spreadsheet.

In all cases, we matched rows between ADNIMERGE and these spreadsheets using the subject ID and visit code. Duplicate rows were removed, with the most recent preferred. For each modality we also included the ID of the image that was used to derive these summary measures.
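
The matching and de-duplication described above can be sketched in pandas as follows. This is an illustration under stated assumptions, not the organisers' script: the function names are hypothetical, the key columns are taken to be RID and VISCODE, a left join onto ADNIMERGE is assumed, and the toy measurement column TOY_VOLUME and all values are invented.

```python
import pandas as pd


def keep_most_recent(table, keys=("RID", "VISCODE"),
                     order=("RUNDATE", "IMAGEUID")):
    """Keep one row per subject and visit, preferring the most recent."""
    ordered = table.sort_values(list(order))
    return ordered.drop_duplicates(subset=list(keys), keep="last")


def attach(adnimerge, table, keys=("RID", "VISCODE")):
    """Left-join a de-duplicated modality table onto ADNIMERGE (assumed)."""
    return adnimerge.merge(keep_most_recent(table, keys=keys),
                           on=list(keys), how="left")


# Toy tables standing in for ADNIMERGE and one MRI spreadsheet.
adnimerge = pd.DataFrame({
    "RID": [1, 1, 2],
    "VISCODE": ["bl", "m06", "bl"],
    "MMSE": [29, 28, 25],
})
mri = pd.DataFrame({
    "RID": [1, 1, 2],
    "VISCODE": ["bl", "bl", "bl"],
    "RUNDATE": pd.to_datetime(["2015-01-01", "2016-01-01", "2015-06-01"]),
    "IMAGEUID": [100, 200, 300],
    "TOY_VOLUME": [3000.0, 3050.0, 2800.0],
})
merged = attach(adnimerge, mri)
print(merged)
```

Keeping IMAGEUID in the merged table mirrors the statement above that the ID of the source image is included for each modality.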

D2. TADPOLE Standard prediction set

The set of D2 entries contains all currently available longitudinal data for prospective ADNI-3 subjects that are rollovers from earlier ADNI studies. Such subjects are active (PTSTATUS=='1'), with ADNI-2 visits (Phase=='ADNI2'), and screening was performed (RGSTATUS=='1'). These subjects were identified as follows:

  1. REGISTRY_ADNI2 = select all from REGISTRY, where Phase=='ADNI2' and RGSTATUS=='1'
  2. DXARM = inner-join of DXSUM and ARM on {RID,Phase}
    Note: ARM is required for baseline diagnosis. It can also be used to identify ADNI-1 subgroups such as PET+1.5T; 3T+1.5T; etc. (see p23 of ADNI_data_training_slides_part2.pdf)
  3. DXARMREG = left-outer-join DXARM and REGISTRY_ADNI2 on {RID,Phase,VISCODE}
  4. D2_RID = select RID from DXARMREG, where DXCHANGE is not missing and Phase is not missing and PTSTATUS=='1'
  5. D2 = historical ADNI data for D2_RID individuals
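
The five steps above translate fairly directly into pandas. The sketch below follows them one by one on toy tables; the function name and all toy values are invented, status codes are compared numerically so that both 1 and '1' match, and the real ADNI tables contain many more columns than shown here.

```python
import pandas as pd


def select_d2_rids(registry, dxsum, arm):
    """Follow steps 1 to 4 above and return the sorted list of D2 RIDs."""
    # Step 1: ADNI-2 registry rows where screening was performed.
    rgstatus = pd.to_numeric(registry["RGSTATUS"], errors="coerce")
    registry_adni2 = registry[(registry["Phase"] == "ADNI2") & (rgstatus == 1)]
    # Step 2: inner join of DXSUM and ARM on {RID, Phase}.
    dxarm = dxsum.merge(arm, on=["RID", "Phase"], how="inner")
    # Step 3: left outer join with the filtered registry.
    dxarmreg = dxarm.merge(registry_adni2, on=["RID", "Phase", "VISCODE"],
                           how="left")
    # Step 4: diagnosis and phase present, subject active.
    ptstatus = pd.to_numeric(dxarmreg["PTSTATUS"], errors="coerce")
    keep = (dxarmreg["DXCHANGE"].notna() & dxarmreg["Phase"].notna()
            & (ptstatus == 1))
    return sorted(dxarmreg.loc[keep, "RID"].unique().tolist())


# Toy tables (values are invented).
phases = ["ADNI2", "ADNI2", "ADNI1", "ADNI2", "ADNI2"]
visits = ["v01", "v01", "bl", "v01", "v01"]
registry = pd.DataFrame({"RID": [1, 2, 3, 4, 5], "Phase": phases,
                         "VISCODE": visits,
                         "RGSTATUS": [1, 1, 1, 0, 1],
                         "PTSTATUS": [1, 2, 1, 1, 1]})
dxsum = pd.DataFrame({"RID": [1, 2, 3, 4, 5], "Phase": phases,
                      "VISCODE": visits,
                      "DXCHANGE": [1, 2, 1, 1, None]})
arm = pd.DataFrame({"RID": [1, 2, 3, 4, 5], "Phase": phases,
                    "ARM": [1, 1, 1, 1, 1]})
history = pd.DataFrame({"RID": [1, 1, 2, 5],
                        "VISCODE": ["bl", "m06", "bl", "bl"]})

d2_rid = select_d2_rids(registry, dxsum, arm)
# Step 5: all historical rows for the selected subjects.
d2 = history[history["RID"].isin(d2_rid)]
print(d2_rid, len(d2))
```

In the toy data only subject 1 passes every condition: subject 2 is not active, subject 3 has no ADNI-2 visit, subject 4 was not screened and subject 5 has no diagnosis.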

D3. TADPOLE Cross-sectional prediction set

D3 uses the same set of participants as D2, but includes only the final visit and a limited number of data columns. The aim is to mimic screening data for a clinical trial in which the available information is typically limited to demographics, cognitive test scores, and structural MRI (derived brain volumes).
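
To give a feel for what "only the final visit and a limited number of data columns" means in practice, the sketch below builds a D3-like table from a longitudinal one. It is only an illustration: the actual D3 file was produced by the organisers' own scripts, the function name is hypothetical, the choice of EXAMDATE to define the final visit is an assumption, and the toy values are invented.

```python
import pandas as pd


def last_visit_per_subject(data, columns, id_col="RID", date_col="EXAMDATE"):
    """Keep each subject's latest visit and only the requested columns."""
    dated = data.assign(**{date_col: pd.to_datetime(data[date_col])})
    last = dated.sort_values(date_col).groupby(id_col).tail(1)
    selected = last[[id_col, date_col, *columns]]
    return selected.sort_values(id_col).reset_index(drop=True)


# Toy longitudinal table (values are invented).
toy = pd.DataFrame({
    "RID": [1, 1, 2],
    "EXAMDATE": ["2010-01-01", "2011-01-01", "2012-05-05"],
    "MMSE": [29, 27, 24],
    "FDG": [1.2, 1.1, 1.0],
})
d3_like = last_visit_per_subject(toy, columns=["MMSE"])
print(d3_like)
```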

D4. TADPOLE Test set

The test set will contain ADNI-3 data from rollover individuals, acquired after the challenge submission deadline, and used for evaluating the forecasts according to the challenge metrics.

Further Details

The TADPOLE standard data sets are downloadable as spreadsheets from the ADNI website. For anyone interested in the details of how the spreadsheets were generated, we have made our scripts available on GitHub. Our repository also contains scripts for generating the leaderboard datasets, and sanity checking and evaluating a submission file.
