Curating Precision Cohorts: From Long COVID to Unexplained Chronic Conditions

AIME 2025 Conference Workshop, Monday, June 23rd, 2025, 13:30 – 17:00

Abstract

Post-acute sequelae of COVID-19 (PASC), also called Long COVID, remains a medical mystery. Many patients experience persistent symptoms such as fatigue, brain fog, and shortness of breath long after recovering from COVID-19, yet these conditions often go unrecognized because they are subtle and variable. Existing diagnostic tools, including the U09.9 ICD-10 code, have been applied inconsistently, reinforcing biases and limiting the scope of research. The need for a more precise, data-driven approach to identifying PASC has never been more urgent.

Azhir & Hügel et al. [1] developed an algorithm that identifies unexplained, patient-specific conditions following a COVID-19 infection and thereby creates a PASC precision cohort based on unfragmented diagnosis and treatment data. Enabling other researchers to create, curate, and analyze their own precision cohorts will significantly enhance PASC research. The algorithm is therefore open source and available as a Docker container.

During the tutorial, attendees will gain hands-on experience using synthetic data to test and apply the algorithm. Additionally, we will explore the algorithm’s expanded capabilities in identifying other unexplained chronic conditions. This interactive session will equip participants with the knowledge and tools to implement AI-driven diagnostic approaches in their own research, fostering a more accurate, equitable, and efficient healthcare landscape.

Schedule: Monday, June 23rd, 2025, 13:30 – 17:00

Topic | Duration | Time
Welcome and general introduction | 10 minutes | 13:30
Temporal phenotyping in (Post-) COVID | 20 minutes | 13:40
Detailed presentation of the PASC algorithm | 30 minutes | 14:00
Questions/discussion about the algorithm | 15 minutes | 14:30
Setup/start of Docker containers with the algorithm and synthetic data | 15 minutes | 14:45
Break and troubleshooting of Docker container start | 30 minutes | 15:00
Guided application of the algorithm on data | 60 minutes | 15:30
Questions/discussion about the application | 15 minutes | 16:30
How to apply the algorithm to identify other unexplained chronic conditions not directly linked to PASC | 10 minutes | 16:45
Feedback and wrap-up | 5 minutes | 16:55
End |  | 17:00

Preparation

We ask participants to install Docker before the workshop (link to the official Docker installation guide: ) and to download the Docker image (link to the Docker image: TBA), which includes the RStudio instance with all required packages and the synthetic data. Please make sure that you can start containers by running the hello-world example from the installation guide mentioned above.

Topics

Longitudinal EHR Data

Longitudinal Electronic Health Record (EHR) data offers an opportunity to track patient health trajectories over time, enabling the analysis of disease progression, treatment outcomes, and patient responses to interventions. Longitudinal data can be used to identify patterns, trends, and correlations in diseases and patient journeys. Researchers can leverage longitudinal data to address complex research questions and to develop predictive models that enhance patient care and outcomes.

Precision phenotyping

Precision phenotyping refers to the detailed characterization of diseases at the level of the individual patient. These approaches rely on large datasets that allow researchers and clinicians to better understand diseases and their heterogeneity. Precision phenotyping has the potential to enhance personalized medicine by providing nuanced insights and contributing to more precise, effective, and individualized healthcare.

Synthetic Data

In this tutorial we will use synthetic data based on the Synthea data sets, which resemble the population of Massachusetts. Synthetic data lets us present an algorithm that normally requires a large amount of sensitive data, share the data with the participants, and let the participants apply the algorithm themselves during the tutorial.

Attention Mechanism

The algorithm presented in this tutorial uses an attention mechanism to identify patient-specific PASC symptoms. For each possible PASC symptom, the attention mechanism checks the patient's history for another entry that is associated with the symptom, based on the temporal distance between the two entries. If such an association is found, the symptom might not be a PASC symptom for this patient but might instead be related to another of the patient's conditions.
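The check described above can be illustrated with a minimal Python sketch. This is not the published implementation: the `Entry` schema, the `known_links` set of associated code pairs, and the fixed `max_gap` window are hypothetical simplifications chosen for illustration, and the way associations are actually derived in [1] is not reproduced here.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Entry:
    """One record in a patient's history (hypothetical, simplified schema)."""
    code: str
    day: date


def unexplained_candidates(history, infection_day, candidate_codes,
                           known_links, max_gap=timedelta(days=365)):
    """Return candidate symptoms after infection that no earlier entry explains.

    known_links is a set of (earlier_code, symptom_code) pairs that are treated
    as associated; both this set and max_gap are assumptions of this sketch.
    """
    unexplained = []
    for entry in history:
        # Only consider candidate symptoms recorded after the infection.
        if entry.code not in candidate_codes or entry.day <= infection_day:
            continue
        # A symptom counts as explained if an associated entry precedes it
        # within the allowed temporal distance.
        explained = any(
            (other.code, entry.code) in known_links
            and timedelta(0) <= entry.day - other.day <= max_gap
            for other in history
            if other is not entry
        )
        if not explained:
            unexplained.append(entry)
    return unexplained


history = [
    Entry("asthma", date(2020, 11, 1)),
    Entry("covid", date(2021, 1, 10)),
    Entry("dyspnea", date(2021, 5, 1)),
    Entry("fatigue", date(2021, 5, 20)),
]
result = unexplained_candidates(
    history,
    infection_day=date(2021, 1, 10),
    candidate_codes={"dyspnea", "fatigue"},
    known_links={("asthma", "dyspnea")},
)
print([e.code for e in result])  # ['fatigue']
```

In this toy history, shortness of breath is attributed to the earlier asthma record, so only fatigue remains as a candidate PASC symptom for this patient.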

Post-acute sequelae of COVID-19 (PASC)

PASC, also called Long COVID or Post-COVID, is a complex new disease that describes chronic conditions after a COVID-19 infection. The World Health Organization defines PASC as: “Post COVID-19 condition occurs in individuals with a history of probable or confirmed SARS CoV-2 infection, usually 3 months from the onset of COVID-19 with symptoms and that last for at least 2 months and cannot be explained by an alternative diagnosis.[...]” [2]. This definition by exclusion is challenging to implement. Nevertheless, automatic approaches are needed to identify patient-specific PASC symptoms and affected patients in large real-world data warehouses, so that symptom-specific cohorts can be built and retrospective studies can be run.
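To make the timing part of this definition concrete, the following Python sketch encodes one simplified reading of it: a symptom is still recorded roughly 3 months after the infection and its records span at least 2 months. The function name, the thresholds of 90 and 60 days, and the use of record dates as a proxy for symptom duration are assumptions of this sketch. The exclusion criterion (no alternative diagnosis) is deliberately not covered here.

```python
from datetime import date, timedelta


def meets_timing_criteria(infection_day, symptom_days,
                          present_after=timedelta(days=90),
                          min_duration=timedelta(days=60)):
    """Check a simplified reading of the WHO timing criteria for one symptom.

    symptom_days are the dates on which the symptom was recorded; both
    thresholds are illustrative assumptions, not part of the definition text.
    """
    post = sorted(d for d in symptom_days if d > infection_day)
    if not post:
        return False
    # Symptom is still recorded about 3 months after the infection.
    still_present = post[-1] - infection_day >= present_after
    # Records after the infection span at least 2 months.
    persists = post[-1] - post[0] >= min_duration
    return still_present and persists


infection = date(2021, 1, 10)
print(meets_timing_criteria(infection, [date(2021, 3, 1), date(2021, 5, 15)]))   # True
print(meets_timing_criteria(infection, [date(2021, 1, 20), date(2021, 2, 5)]))   # False
```

A symptom that passes this timing check would still need the exclusion step before it could be counted as a PASC symptom.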

References

1. Azhir A, Hügel J, et al. Precision phenotyping for curating research cohorts of patients with unexplained post-acute sequelae of COVID-19. Med (N Y). 2024 Nov 2. DOI: 10.1016/j.medj.2024.10.009
2. Soriano JB, Murthy S, et al. A clinical case definition of post-COVID-19 condition by a Delphi consensus. Lancet Infect Dis. 2022 Apr;22(4):e102–7. DOI: 10.1016/S1473-3099(21)00703-9

Motivation

Our proposed tutorial will showcase a novel algorithm designed to enhance the diagnosis of Long COVID by leveraging artificial intelligence and large-scale medical record analysis. Traditional diagnostic methods rely on a process of elimination, requiring clinicians to systematically rule out all other conditions before identifying Long COVID—a time-consuming and often biased approach that disproportionately affects marginalized communities. Our computational tool streamlines this process by analyzing extensive patient data, identifying subtle temporal patterns that link COVID-19 infections to lingering symptoms, and systematically excluding other potential causes. By utilizing AI-driven temporal association mining, the algorithm detects complex, often overlooked connections between symptoms and prior infections, enabling a more precise and individualized diagnosis while reducing biases inherent in conventional diagnostic coding systems.

The tutorial will enable participants to build and analyze precision cohorts from their own EHR data. Implementing the algorithm can also create synergies with other ongoing PASC-related projects. By demonstrating at the end how the algorithm can be modified to identify unexplained chronic conditions in general, we provide participants with a tool to build their own precision cohorts of patients with unexplained chronic conditions. These cohorts can then be used as a basis for further analysis.

The tutorial covers aspects from the following areas, which are also scopes of the conference: machine learning and big data analytics, clinical decision support systems, and precision medicine.

Chairs

Jonas Hügel (1,2), Arianna Dagliati (3), Spiros Denaxas (4,5), Ulrich Sax (1,2), Shawn Murphy (6,7), Hossein Estiri (6,7)

1 University Medical Center Göttingen, Department of Medical Informatics, Göttingen, Germany

2 University of Göttingen, Campus Institute Data Science, Section of Medical Data Science, Göttingen, Germany

3 University of Pavia, Department of Electrical, Computer and Biomedical Engineering, Pavia, Italy

4 Institute of Health Informatics, University College London, London, UK

5 British Heart Foundation Data Science Centre, HDR UK

6 Harvard Medical School, Boston, US

7 Massachusetts General Hospital, Boston, US

Chairs

Dr. Jonas Hügel

Dr. Hügel is a postdoctoral researcher at the University Medical Center Göttingen, Germany, and a member of the Campus Institute Data Science, Section of Medical Data Science, at the University of Göttingen. His work focuses on temporal phenotyping in complex diseases such as cancer or Post-COVID. He collaborates closely with the Clinical Augmented Intelligence (CLAI) Group at Harvard Medical School, which he joined as a visiting researcher during his doctoral studies. Together with the CLAI group, he developed an algorithm to identify patient-specific unexplained Post-COVID symptoms and a method for using the algorithm to curate precision cohorts.

Email: jonas.huegel(at)med.uni-goettingen.de
Homepage: https://medizininformatik.umg.eu/en/about-us/scientific-research-groups/translational-research-informatics/
ORCiD: https://orcid.org/0000-0002-4183-1287

Prof. Dr. Arianna Dagliati

Arianna Dagliati is an Assistant Professor (RTDb) at the Department of Industrial and Information Engineering at the University of Pavia (Italy). Her research is dedicated to developing mining approaches that enable the recognition of temporal patterns and electronic phenotypes in longitudinal clinical data. She is committed to aligning her work with global translational medicine research priorities and to embracing the key steps needed for machine learning and artificial intelligence to have a direct impact on clinical practice: from software tools that link clinical knowledge with machine-learning-based evidence from longitudinal data, to modeling approaches that identify critical transitions in patients’ histories. She has an extensive history of working in multidisciplinary teams and has collaborated with a broad range of administrative, specialist, and generalist clinicians and public health professionals to deliver scientific findings and to integrate algorithms into clinical decision support systems. She is part of the 4CE international consortium for electronic health record data-driven studies of the COVID-19 pandemic and is an appointed Board Member of AIME (society of AI in medicine).

Email: arianna.dagliati(at)unipv.it
Homepage: http://www.labmedinfo.org/it/people/arianna-dagliati-phd/
ORCiD: https://orcid.org/0000-0002-5041-0409 

Prof. Dr. Spiros Denaxas

Spiros Denaxas is Professor of Biomedical Informatics and Deputy Institute Director for Research at the Institute of Health Informatics at University College London. His lab's research focuses on novel computational methods for creating and evaluating disease phenotypes in structured electronic health records and in clinical and genomic data.

Email: s.denaxas(at)ucl.ac.uk
Homepage: https://denaxaslab.org
ORCiD: https://orcid.org/0000-0001-9612-7791 

Prof. Dr. Ulrich Sax

Biomedical informaticist Ulrich Sax has designed and operated clinical and research informatics infrastructure since 1995. After several leading positions in operational health IT, he now leads the translational research informatics group in the Department of Medical Informatics at the University Medical Center Göttingen (UMG). Heavily influenced by his postdoctoral time in Boston, he focuses his research on the integration and analysis of biomedical data in clinical research and on processes for sharing these data. He co-leads the section ELSA in the National Research Data Infrastructure (NFDI) and heads the task area “Data use and Privacy in concert” in NFDI4health.

Email: ulrich.sax(at)med.uni-goettingen.de
Homepage: https://medizininformatik.umg.eu/en/about-us/scientific-research-groups/translational-research-informatics/
ORCID: https://orcid.org/0000-0002-8188-3495 

Prof. Dr. Shawn Murphy

Dr. Murphy is a Professor of Neurology and Biomedical Informatics at Harvard Medical School, Chief Research Information Officer at Mass General Brigham, and the Associate Director of the Lab of Computer Science at MGH. His current projects include Informatics for Integrating Biology and the Bedside (i2b2) and the Research Patient Data Registry (RPDR). This application, which serves over 5,000 investigators performing research using the hospital medical record, was the test bed for his work with Zak Kohane in developing the open source i2b2 software platform, which now operates at over 120 hospitals worldwide. Murphy’s contribution as chief architect of the i2b2 platform has strengthened the understanding of the metabolic and genetic underpinnings of complex diseases by providing an informatics framework that integrates data from electronic health records for clinical research.

Email: snmurphy(at)mgb.org
Homepage: https://researchers.mgh.harvard.edu/profile/94359/Shawn-Murphy

Prof. Dr. Hossein Estiri

Dr. Hossein Estiri is a computational scientist and associate professor at Harvard Medical School and the Massachusetts General Hospital’s Department of Medicine. His research focuses on leveraging machine learning and artificial intelligence to analyze large-scale healthcare data, including real-world clinical data, to advance precision medicine. Dr. Estiri’s work emphasizes data-driven approaches to improving patient outcomes, with a particular interest in phenotyping, healthcare disparities, and the application of temporal AI models in clinical settings. He has published extensively in biomedical informatics, machine learning, and health data science, with applications to various complex diseases including (Post-) COVID-19.

Email: HESTIRI(at)mgh.harvard.edu
Homepage: https://clai.mgh.harvard.edu/team/he/
ORCiD: https://orcid.org/0000-0002-0204-8978 

Please register for the workshop as part of the conference registration, which is already open. We look forward to seeing you in Pavia.

https://aime25.aimedicine.info/registration/  
