The escalating global burden of neurodegenerative diseases (NDDs) such as Alzheimer’s disease (AD) and Parkinson’s disease (PD) is one of the most pressing public health challenges of our time. In 2021, over 3 billion people worldwide lived with a neurological condition, and these conditions accounted for 443 million years of healthy life lost due to illness, disability, and premature death. While aging is the primary nonmodifiable risk factor, the sheer diversity in how these diseases manifest, from initial symptoms to progression and response to treatment, has complicated efforts to diagnose early, predict trajectories, and develop effective therapies. How can we meaningfully categorize this heterogeneity, especially in the crucial years before a formal diagnosis, to enable more precise and effective interventions?
A study led by Jie Lian and Kazem Rahimi from the University of Oxford, published recently in Nature Aging, offers one answer by combining large electronic health record (EHR) datasets with machine learning. Instead of observing disease only after its onset, the research examines the long prediagnostic clinical histories of patients and identifies distinct subtypes for both AD and PD. This approach offers a window into early disease drivers and shared systemic vulnerabilities that could reshape how these diseases are understood and treated.
See the original nature.com story for the full account.
Unmasking Hidden Patterns Before Diagnosis
This study's core innovation lies in its methodology: applying transformer-based deep learning models to large-scale, longitudinal EHR data to identify and validate disease subtypes based on prediagnostic clinical information. The researchers utilized CPRD Aurum, a comprehensive database of UK general practice records covering approximately 20% of the UK population, alongside the UK Biobank as an external validation set. This enabled an unprecedented scale of analysis, encompassing 113,545 AD patients and 45,825 PD patients from CPRD alone, with median prediagnostic observation periods stretching an impressive 18.9 years for AD and 19.1 years for PD. This extensive historical data allowed the machine learning model to "learn" the subtle, time-stamped patterns of health events preceding diagnosis.
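To make the idea of a time-stamped prediagnostic history concrete, here is a minimal Python sketch of how one patient's coded events might be restricted to the period before diagnosis and ordered in time. It is an illustration only, not the authors' pipeline: the event labels, the dates, and the helper `prediagnostic_sequence` are all hypothetical, and the labels are not real SNOMED-CT, Read, or ICD-10 codes.

```python
from datetime import date

# Hypothetical coded events for one patient: (event date, clinical label).
# The labels are illustrative placeholders, not real clinical codes.
events = [
    (date(2012, 3, 1), "HYPERTENSION"),
    (date(2005, 6, 15), "DEPRESSION"),
    (date(2019, 1, 10), "TYPE2_DIABETES"),
    (date(2021, 9, 2), "FALL"),  # recorded after diagnosis, so excluded below
]
diagnosis_date = date(2020, 5, 20)  # hypothetical diagnosis date


def prediagnostic_sequence(events, diagnosis_date):
    """Return events recorded strictly before diagnosis, oldest first.

    Each item pairs the label with how many years before diagnosis
    it was recorded (rounded to one decimal place).
    """
    prior = sorted(e for e in events if e[0] < diagnosis_date)
    return [
        (code, round((diagnosis_date - when).days / 365.25, 1))
        for when, code in prior
    ]


sequence = prediagnostic_sequence(events, diagnosis_date)
for code, years_before in sequence:
    print(f"{code}: {years_before} years before diagnosis")
```

A sequence model would consume an ordered list of this kind, one per patient; how the study actually tokenized and encoded its records is not described here.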
What the study found is not a new diagnostic test, as simplified headlines might suggest, but a stratification of patient populations. Rather than showing that "AI can now predict Alzheimer's," the research identified five distinct clinical-genetic subtypes for each disease. These subtypes were replicated across internal and external datasets, which supports their validity. This isn't about a crystal ball, but about understanding the different pathways people take towards a diagnosis, which could serve as a blueprint for future personalized medicine. The deep learning model transformed complex, sequential EHR data into numerical representations, which were then clustered into groups that were cohesive and distinct in their clinical and genetic profiles.
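The step from numerical representations to groups can be sketched with a toy clustering example. The Python below uses plain k-means as a stand-in, since the clustering algorithm used in the study is not specified here. The two-dimensional `embeddings` are invented values, whereas a real model would produce much higher-dimensional vectors, and the function `kmeans` is a simplified illustration rather than the authors' code.

```python
import math


def kmeans(points, k, iters=20):
    """Plain k-means on a list of equal-length numeric vectors.

    Centroids start at the first k points so the sketch is deterministic.
    Returns (labels, centroids).
    """
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


# Hypothetical 2-D patient embeddings forming two well-separated groups.
embeddings = [
    (0.1, 0.2), (5.0, 5.1), (0.0, 0.1),
    (5.2, 4.9), (0.2, 0.0), (4.9, 5.0),
]
labels, centroids = kmeans(embeddings, k=2)
print(labels)
```

In this toy case the patients alternate between the two groups, so the printed labels alternate as well. With real data the number of clusters would itself need to be chosen and validated, which is where replication in an external dataset such as UK Biobank becomes important.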
Convergent Pathways: Shared Vulnerabilities Across Neurodegeneration
The identified subtypes for AD were a classic late-onset presentation (cluster 1), vascular-related patterns (cluster 2), neuropsychiatric dominance (cluster 3), metabolic–inflammatory profiles (cluster 4), and a sensorimotor pattern (cluster 5). Similarly, the PD clusters were classic genetic PD (cluster 1), vascular-associated types (cluster 2), severe neuropsychiatric forms (cluster 3), metabolic–inflammatory phenotypes (cluster 4), and cardiovascular–motor subtypes (cluster 5). Notably, the largest AD subtype, cluster 1 (27.7% of CPRD patients), displayed a classic late-onset profile, while the most prevalent PD subtype, also cluster 1 (28.8% of CPRD patients), suggested a classic genetic predisposition.
A particularly striking insight from the study is the emergence of convergent clinical patterns across both diseases. Subgroups characterized by vascular, metabolic, and mental health comorbidities appeared in both AD and PD. This suggests that shared systemic risk factors, such as vascular dysfunction, metabolic dysregulation (e.g., diabetes), or chronic inflammation, might influence how neurodegenerative processes unfold years before symptoms become clear. For instance, the metabolic–inflammatory subtype (cluster 4) in both AD and PD often showed lower polygenic risk scores (PRS) for the respective disease, yet exhibited aggressive disease trajectories, including earlier symptom onset and higher mortality. This finding is consistent with the "type 3 diabetes" hypothesis in AD and its analog in PD, and it points to the importance of systemic metabolic health. The study also reported specific genetic associations within these subtypes, such as APOE4 depletion and APOE2 enrichment in the AD metabolic–inflammatory cluster, and LRRK2 enrichment in the PD vascular-associated cluster.
Bridging Clinical Records with Precision Medicine
The subtypes also differ in prognosis. For example, AD cluster 5, characterized by cardiovascular and motor system dysfunction, showed significantly higher 5-year hospitalization rates and poorer survival. In PD, cluster 3, representing severe neuropsychiatric forms, exhibited the most severe symptoms, including elevated anxiety and depression, and faster progression of motor-related symptoms like falls and freezing of gait (FOG) both before and after diagnosis. These distinct prognostic outcomes underscore the clinical utility of identifying these subtypes early.
By leveraging routinely collected EHR data, this framework provides a scalable, non-invasive strategy for patient stratification. It offers a crucial step towards precision medicine, allowing clinicians to potentially identify individuals at higher risk for certain disease trajectories or comorbidities before diagnosis. This could enable targeted preventive strategies or earlier, more tailored interventions, moving beyond a one-size-fits-all approach to diseases notorious for their variability.
Limitations to Consider
While these findings are exciting, it is important to place them within the study's limitations. First, the diagnoses of AD and PD in the EHR data, primarily from primary and secondary care, were not systematically biomarker-confirmed. This means the study captured clinically diagnosed cases, which may differ from those confirmed through advanced imaging or cerebrospinal fluid analyses. Second, symptom and disease definitions relied on coded data (SNOMED-CT, Read, ICD-10), which might not capture all relevant symptoms or disease occurrences because of underreporting in routine clinical practice. For instance, specific motor signs in PD might be coded less frequently than other conditions.
Furthermore, critical cognitive testing data, such as MMSE scores, were only available for approximately 30% of participants, limiting the granularity of cognitive phenotype mapping. The authors also acknowledge the potential for detection bias in EHR-based NDD studies, where individuals with greater healthcare utilization might be more likely to have their conditions recorded. While adjustments were made for factors like visit frequency, some residual bias cannot be entirely ruled out. Finally, this analysis is exploratory and hypothesis-generating, reflecting population-level variation rather than definitive biological mechanisms.
The Road Ahead: Towards Personalized Neurodegenerative Care
This study lays critical groundwork, but the journey towards personalized neurodegenerative care is ongoing. The next research steps will involve integrating more granular clinical and imaging data, standardized cognitive testing, and temporally aligned longitudinal biomarker profiles. This multimodal integration will be crucial to bridge the gap between routine clinical information and emerging biological models of NDDs.
Future studies should also consider a case-control design that incorporates biomarker-defined AD and PD alongside non-dementia comparators. This would help to further validate the specificity of these identified clusters and clarify whether the observed vascular or metabolic profiles are truly disease-specific or reflect broader aging-related processes. Ultimately, the goal is to develop integrative risk frameworks that capture both genetic predispositions and comorbidity-related contributions to disease heterogeneity. For patients and clinicians alike, the question remains: how quickly can these data-driven insights translate into actionable tools that prevent, delay, or better manage these complex and devastating conditions? The potential for a future where neurodegenerative care is truly tailored to the individual patient’s unique biological and clinical profile is now more tangible than ever.







