Past events
Information about past events organised by the CDT.
Nima Hejazi, PhD, is an Assistant Professor of Biostatistics at the Harvard T.H. Chan School of Public Health.
The widespread availability of high-dimensional data has catalyzed biological pattern discovery. Today, the simultaneous screening of anywhere from thousands to millions of biological characteristics (in, e.g., genomics, metabolomics, proteomics) is commonplace in many experimental settings, making the analysis of such large numbers of characteristics a central problem in computational biology and allied sciences. The information gleaned from such studies promises substantial progress, not only for the basic sciences but also for medicine and public health.
While the tools of modern chemical biology and biophysics allow for great precision in probing biological systems at the molecular and cellular level, population biomedical and public health sciences must operate without access to such a fine level of control. Instead, statistical innovations bridge the gap, being used to dissect mechanistic processes and to mitigate the inferential obstacles imposed by the confounding of key relationships in observational (non-randomized) studies. Unfortunately, most off-the-shelf statistical techniques rely on restrictive assumptions, born of mathematical convenience, that invite bias due to model misspecification (when the biological process under study fails to obey the assumed conveniences). Fortunately, model-agnostic inference, which draws on causal inference and semiparametric efficiency theory, provides an avenue for avoiding restrictive modeling assumptions while obtaining robust statistical inference about scientifically relevant parameters.
We outline this framework briefly and introduce a model-agnostic technique for biomarker discovery that targets causal inference-based parameters, leverages state-of-the-art machine learning for flexible estimation (mitigating the potential for model misspecification), and incorporates variance moderation (curbing Type-I error) to deliver stable inference in high-dimensional settings, even when sample sizes are limited. When paired with domain expertise, this approach reliably identifies biomarkers linked to disease or exposure patterns, yielding insights for future investigations into therapeutics and policy interventions.
The approach is implemented in the open-source biotmle R/Bioconductor package (https://bioconductor.org/biotmle). This talk is based on joint work with Alan Hubbard, Mark van der Laan, and Philippe Boileau, described in the pre-print:
https://arxiv.org/abs/1710.05451
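For readers curious about the variance-moderation step mentioned above, here is a minimal Python sketch of an empirical-Bayes moderated t-statistic in the spirit of limma. It is an illustration only, not the biotmle API; the prior degrees of freedom and the crude median-based prior variance are placeholders for quantities that would normally be estimated from the data.

```python
# Illustrative sketch (not the biotmle API): shrink each biomarker's squared
# standard error toward a common prior value before forming a t-statistic,
# which stabilizes inference when sample sizes are small.
import numpy as np

def moderated_t(effects, se2, df, d0=4.0, s02=None):
    """effects: per-biomarker effect estimates; se2: their squared standard
    errors; df: residual degrees of freedom; d0/s02: prior df and variance."""
    if s02 is None:
        s02 = np.median(se2)                      # crude prior; limma fits this
    se2_mod = (d0 * s02 + df * se2) / (d0 + df)   # empirical-Bayes shrinkage
    return effects / np.sqrt(se2_mod)             # moderated t-statistic

rng = np.random.default_rng(0)
t = moderated_t(rng.normal(size=5000), rng.chisquare(8, 5000) / 8, df=8.0)
```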
About the speaker: Nima Hejazi, PhD, is an Assistant Professor of Biostatistics at the Harvard T.H. Chan School of Public Health. He completed an NSF Mathematical Sciences Postdoctoral Research Fellowship in 2022 and, prior to that, obtained his PhD in biostatistics from UC Berkeley. He was part of the founding core development team of the tlverse project (https://github.com/tlverse), an open-source software ecosystem for targeted learning in the R programming language. Since 2020, he has collaborated closely with the Vaccine and Infectious Disease Division of the Fred Hutchinson Cancer Center, as a core member of the US Government Biostatistics Response Team and the COVID-19 Prevention Network, and, more broadly, in studies of vaccine safety and efficacy for infectious diseases including HIV-1, malaria, and COVID-19.
Nima's research interests combine causal inference and machine learning, aiming to develop robust, efficient, and assumption-lean statistical procedures in a problem-first approach tailored to questions arising in scientific collaborations. He is motivated by methodological topics related to distribution-free (nonparametric) inference, semiparametric-efficient inference, high-dimensional inference, targeted minimum loss estimation, and modern issues in the design of experiments (outcome-dependent sampling, sequentially adaptive treatments). His recent statistical research has been strongly informed by collaborative science in clinical trials and computational biology, especially as related to treatment and preventive vaccine efficacy trials and infectious disease epidemiology. Nima is also deeply interested in high-performance statistical computing and the development and design of open-source software for reproducible applied statistical data science.
Pekka Marttinen is an associate professor in machine learning in the Department of Computer Science at Aalto University, Finland.
Accurately predicting the need for healthcare services is important for allocating limited resources fairly and efficiently. In this presentation I will introduce the problem of predicting future diagnoses and hospital visits using individual-level trajectories of diagnosis and medical procedure codes available in electronic medical records. I will then present our recent work on developing neural networks for nationwide healthcare registers, with the goal of predicting the usage of healthcare services by the elderly population in Finland. We show that by leveraging individual patient trajectories and modern neural network architectures, prediction accuracy can be significantly improved compared to multiple strong baselines. The presentation is based on the following articles:
https://proceedings.mlr.press/v116/kumar20a.html
https://arxiv.org/abs/2108.13672
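For intuition about what a model over code trajectories can look like, here is a minimal PyTorch sketch: an embedding layer plus a GRU over sequences of diagnosis/procedure code IDs, predicting a single "visit next year" logit. The dimensions, the single-label target, and the architecture itself are assumptions for illustration; the models in the papers above differ.

```python
# Toy sketch of a sequence model over EHR code trajectories; illustrative only.
import torch
import torch.nn as nn

class VisitPredictor(nn.Module):
    def __init__(self, n_codes: int, emb_dim: int = 64, hidden: int = 128):
        super().__init__()
        self.embed = nn.Embedding(n_codes, emb_dim, padding_idx=0)
        self.gru = nn.GRU(emb_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, codes: torch.Tensor) -> torch.Tensor:
        # codes: (batch, seq_len) integer code IDs, 0 reserved for padding
        x = self.embed(codes)
        _, h = self.gru(x)              # h: (1, batch, hidden), final state
        return self.head(h.squeeze(0))  # logit for "hospital visit next year"

model = VisitPredictor(n_codes=1000)
logits = model(torch.randint(1, 1000, (8, 50)))  # 8 toy patients, 50 codes each
```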
About the speaker: Pekka Marttinen is an associate professor in machine learning in the Department of Computer Science at Aalto University, Finland. He received his PhD in statistics from the University of Helsinki in 2008 and has been employed at Aalto since 2009, interleaved by periods as a visiting researcher at the Center for Communicable Disease Dynamics at Harvard and at the Sanger Institute in Cambridge. He has received a Research Fellowship from the Academy of Finland. His research focuses on method development and modeling in machine learning, with an emphasis on biomedical applications, and he has published 65 articles on these topics. He is best known for his work on scalable computational methods for detecting structure in massive genomic data sets, as well as for the development of Bayesian methodology for efficient and flexible modeling of complex and noisy data, including techniques such as likelihood-free inference and Bayesian neural networks.
Brian Arnold, Senior Data Scientist, Princeton University.
As tumors expand, they evolve via the accumulation of copy-number aberrations (CNAs; amplifications or deletions of DNA) and point mutations, creating a mixture of distinct subclones that contain unique mutations. Quantifying CNAs and the frequency of subclones is critical for understanding how tumors arise and continually evolve, but it is challenging to do so from bulk tumor samples. We have developed an algorithm, HATCHet2 (Holistic Allele-specific Tumor Copy-number Heterogeneity), that uses a variety of machine learning techniques to infer subclone-specific copy-number changes and the presence of whole-genome duplication. HATCHet2 has several features that contribute to its superior performance, including its ability to jointly analyze multiple samples from the same tumor. We show that HATCHet2 identifies subclonal CNAs in prostate cancer samples and detects hypertriploidy and KRAS amplifications in testicular germ cell tumors.
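To illustrate why joint analysis of multiple samples helps, here is a toy sketch (not HATCHet2 itself; the bin counts, the cluster number, and the use of plain k-means are assumptions) in which genomic bins are clustered on read-depth ratio (RDR) and B-allele frequency (BAF) concatenated across samples, so that bins sharing a copy-number state across the tumor group together.

```python
# Toy sketch: joint clustering of genomic bins across multiple tumor samples.
import numpy as np
from sklearn.cluster import KMeans

n_bins, n_samples = 2000, 3
rng = np.random.default_rng(1)
rdr = rng.normal(1.0, 0.3, size=(n_bins, n_samples))   # read-depth ratio per bin/sample
baf = rng.uniform(0.0, 0.5, size=(n_bins, n_samples))  # B-allele frequency per bin/sample
features = np.hstack([rdr, baf])                       # joint features across samples
clusters = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(features)
```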
About the speaker: Brian is a senior data scientist in the Department of Computer Science at Princeton University, where he collaborates with faculty on a variety of projects involving genomics. Currently, he works with Ben Raphael to develop new methods to study cancer, and he also works with Shane Campbell-Staton to study how human activity shapes the evolution of other species. Brian received his PhD from Harvard University, where he studied evolutionary genetics in the Department of Organismic and Evolutionary Biology, and he later did postdoctoral studies in the Department of Epidemiology at the Harvard School of Public Health. Brian has worked on diverse topics in genomics involving plants, bacteria, elephants, and cancer.
Carlo Tacchetti, Vita-Salute San Raffaele University
A correct assessment of a patient's diagnosis, prognosis, and response to standard-of-care treatments must consider individual variability, including the prediction of adverse therapy side effects and the likelihood of emerging co-morbidities. The complexity of the anamnestic, imaging, laboratory, pathology, and omics data available for each patient is a major obstacle to reaching this goal in day-by-day clinical practice.
The development and validation of safe, evidence-based Artificial Intelligence platforms is a possible solution to this need. Precisely defining the most clinically relevant data, obtained from different sources, and the outcomes of interest, together with understanding the complexity of the patients', doctors' and data's journeys within the hospital, plays a pivotal role in creating comprehensive data lakes and in establishing trustworthy standard operating procedures. Such procedures must be of real practical usefulness to both the scientist and the clinician, and easily deployable in large multi-specialty hospitals and in smaller community hospitals alike.
In this presentation I will describe two use cases (COVID-19 and lung cancer) as a blueprint of the successful strategic partnership between San Raffaele Hospital and Microsoft that led to the design of a cloud-based, multilayered AI platform. In addition, I will touch upon the advantages of adopting a federated machine learning model to allow collaboration between different hospitals, providing trustworthy, GDPR-compliant tools for exchanging machine learning models while avoiding the export of data from one institution to another.
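As a rough illustration of the federated idea, here is a minimal federated-averaging (FedAvg) sketch for a linear model, in which only model weights, never patient data, leave each "hospital". All names, the linear model, and the averaging schedule are assumptions for illustration, not the San Raffaele/Microsoft platform.

```python
# Toy FedAvg sketch: local gradient steps at each site, weight averaging at a server.
import numpy as np

def local_update(weights, local_data, lr=0.1):
    X, y = local_data                         # raw data stays inside the hospital
    grad = X.T @ (X @ weights - y) / len(y)   # gradient for a linear least-squares model
    return weights - lr * grad

def fedavg(hospital_datasets, dim, rounds=20):
    w = np.zeros(dim)
    for _ in range(rounds):
        updates = [local_update(w.copy(), d) for d in hospital_datasets]
        w = np.mean(updates, axis=0)          # server sees and averages weights only
    return w

rng = np.random.default_rng(2)
data = [(rng.normal(size=(100, 5)), rng.normal(size=100)) for _ in range(3)]
w = fedavg(data, dim=5)
```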
- Video: Carlo Tacchetti seminar
- Biomedical AI seminar, 6 April 2022. A model for academic-industry partnerships to deliver clinical AI solutions that are fit for purpose. Carlo Tacchetti, Vita-Salute San Raffaele University.
Eleonora Harwich, Head of Collaborations, NHS AI Lab | NHSX
In this session, the speaker will explore the progress made in the UK in the adoption of artificial intelligence in health and care. The NHS AI Lab has been set up to support the development and deployment of safe, ethical and effective AI. This session will explore the progress it has made to date in supporting innovation, increasing adoption, and clarifying regulation and routes to market, as well as in increasing the workforce's and the public's confidence in the technology.
- Video: Eleonora Harwich seminar
Michael Herrera, School of Chemistry, University of Edinburgh
Since the advent of structural biology, the atomic coordinates of only a fraction of the billions of known protein sequences have been solved experimentally, a fact highlighted in 2021 by the celebration of 50 years of the Protein Data Bank (PDB). Advances in bioinformatics and molecular modelling have helped compensate for the dearth of experimentally determined structures, offering reasonable predictions in cases where close structural homologues are already known. The application of deep-learning algorithms in recent years has greatly improved the predictive power of de novo computational methods, with contenders such as AlphaFold and trRosetta nearing experimental atomic accuracy. This seminar provides a concise introduction to the recently released AlphaFold 2 algorithm for highly accurate protein structure prediction, which opens an exciting new chapter in structural biology and rational/semi-rational protein engineering. Following its open-source release, features of both AlphaFold 2 and RoseTTAFold have been assimilated into a freely accessible online notebook (ColabFold) capable of multimeric prediction, including both homo- and hetero-assemblies. The seminar will cover the application of such neural network algorithms to several protein targets including: an elusive mycobacterial enzyme; an unusual transaminase-reductase fusion involved in natural product biosynthesis; and a selection of machine-generated sequence homologues of medically relevant human enzymes.
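One practical detail worth knowing: AlphaFold 2 writes its per-residue confidence score (pLDDT, on a 0-100 scale) into the B-factor column of its output PDB files. Below is a minimal sketch for summarising that confidence per chain; the file path at the end is a placeholder, not a guaranteed output name.

```python
# Minimal sketch: read mean pLDDT per chain from an AlphaFold 2 PDB output,
# using the convention that the B-factor column holds the pLDDT score.
from collections import defaultdict

def mean_plddt(pdb_path):
    totals, counts = defaultdict(float), defaultdict(int)
    with open(pdb_path) as fh:
        for line in fh:
            if line.startswith("ATOM") and line[12:16].strip() == "CA":
                chain = line[21]                      # chain identifier column
                totals[chain] += float(line[60:66])   # B-factor column = pLDDT
                counts[chain] += 1
    return {c: totals[c] / counts[c] for c in totals}

# print(mean_plddt("ranked_0.pdb"))  # placeholder path to a predicted model
```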
- Video: Michael Herrera seminar
- Biomedical AI CDT seminar. Application of Deep-learning Architectures for Accurate Protein Structure Prediction.
Thanasis Tsanas, Usher Institute, University of Edinburgh
Biomedical speech signal analysis has gained increasing momentum over the last 10-15 years. In this talk, I will focus on the key principles and state-of-the-art signal processing algorithms used to characterize sustained vowels and voice fillers. My aim is to demonstrate how characteristics extracted from speech signals can be combined with machine learning techniques to develop robust, automated decision support tools that assist experts in their day-to-day praxis in medical and forensic applications. I will highlight contemporary challenges and areas for further development, in particular around large speech databases that we have recently reported on, such as the Parkinson's Voice Initiative.
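For a flavour of such extracted characteristics, here is a toy sketch run on a synthetic sustained vowel: a deliberately crude autocorrelation pitch tracker plus a frame-level jitter proxy (mean absolute change in the fundamental period, normalised by the mean period). Production-grade voice analysis is considerably more careful; every constant here is an assumption.

```python
# Toy sketch: crude F0 estimation and a jitter-like perturbation measure.
import numpy as np

def estimate_f0(frame, sr, fmin=75.0, fmax=500.0):
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)   # plausible pitch-period lags
    lag = lo + np.argmax(ac[lo:hi])           # strongest periodicity in range
    return sr / lag

def jitter_proxy(signal, sr, frame_len=0.04, hop=0.02):
    n, h = int(frame_len * sr), int(hop * sr)
    periods = np.array([1.0 / estimate_f0(signal[i:i + n], sr)
                        for i in range(0, len(signal) - n, h)])
    return np.mean(np.abs(np.diff(periods))) / np.mean(periods)

sr = 16000
t = np.arange(sr) / sr
vowel = np.sin(2 * np.pi * 120 * t * (1 + 0.002 * np.sin(2 * np.pi * 3 * t)))
print(jitter_proxy(vowel, sr))  # small value: the synthetic vowel is nearly periodic
```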
- Video: Thanasis Tsanas seminar
- Biomedical AI CDT seminar. Developing new speech signal processing algorithms for biomedical and life sciences applications: principles, findings, challenges, and a view to the future.
Alexandra Boussommier-Calleja, CEO & co-founder of ImVitro
This talk will first describe Dr Alexandra Boussommier's path from researcher to entrepreneur. We will then discuss ImVitro, the company she founded in 2019, which aims to use AI to tackle infertility. The talk will describe her team's research, the application of AI in embryology, the clinical needs it addresses, and the challenges of bringing AI into clinics.
ImVitro is an early-stage deep tech start-up based in Paris, with pre-seed investment from Entrepreneur First. Its aim is to facilitate in vitro fertilization (IVF) using artificial intelligence, to help the increasing number of people with fertility issues.
- Video: Artificial Intelligence to Tackle Infertility
- Biomedical AI CDT seminar. Artificial Intelligence to Tackle Infertility. Alexandra Boussommier-Calleja, CEO & co-founder of ImVitro.
Kathy Harrison, DataLoch programme manager, Usher Institute.
Atul Anand, DataLoch clinical lead, Centre for Cardiovascular Science.
The NHS in Scotland faces the prospect of an ageing population with more people living with long-term conditions, combined with reducing resources and value-for-money challenges such as the increasing cost of medicines and delayed discharge. DataLoch's ambition is to create a health and social care data resource that enables a data-driven approach to this challenge, supporting service managers, innovators and researchers to understand and deliver the change that is needed. Routine data, processed to create a single longitudinal data set, will present a complete story of a patient's health, diagnoses, treatments, medical procedures and outcomes. In this way, DataLoch will enable high-quality research through streamlined and robust data access. The DataLoch service has been tested with an initial COVID-19 dataset and is now moving to a condition-agnostic database structure to support wider research interests.
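As a toy illustration of the "single longitudinal data set" idea, separate routine extracts can be folded into one per-patient, time-ordered event table. The table and column names below are invented for illustration and are not the DataLoch schema.

```python
# Toy sketch: merge routine-data extracts into one longitudinal patient timeline.
import pandas as pd

diagnoses = pd.DataFrame({
    "patient_id": [1, 1, 2],
    "date": ["2020-01-05", "2021-03-02", "2020-07-19"],
    "event": ["dx:asthma", "dx:copd", "dx:t2dm"],
})
prescriptions = pd.DataFrame({
    "patient_id": [1, 2],
    "date": ["2020-01-06", "2020-08-01"],
    "event": ["rx:salbutamol", "rx:metformin"],
})

timeline = (pd.concat([diagnoses, prescriptions], ignore_index=True)
              .assign(date=lambda d: pd.to_datetime(d["date"]))
              .sort_values(["patient_id", "date"])   # one ordered story per patient
              .reset_index(drop=True))
print(timeline)
```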
https://www.ed.ac.uk/usher/dataloch
- Video: DataLoch: using routine data to help drive improvements in healthcare
- Biomedical AI CDT Seminar series. 2 June 2021. DataLoch: using routine data to help drive improvements in healthcare.
Kevin Wiggert, Department of Sociology of Technology and Innovation, Technical University of Berlin.
Newly developed Clinical Decision Support Systems (CDSSs), which are supposed to provide information or support decisions, are increasingly opaque and thus less comprehensible to the physician. It is becoming ever less clear, to the clinical user and to the developers themselves, which data sources and which information these technologies rest on and how a technology computes its reasoning from those data. For the actors involved to better comprehend the reasoning of these technologies, it is necessary not only to better understand the technology's inner processes but also to gain more insight into the assumptions it is built on. This, in turn, requires a better understanding of the imaginations and ideas about its intended use that are inscribed into the technology during development. To exemplify this, I use a case study of the development of a CDSS meant to support treatment decisions in cardiology, one that is at least partly not based on expert knowledge (that is, knowledge from medical experts and/or clinical guidelines) but instead uses techniques of machine learning or simulation. It derives its reasoning from recognizing patterns within the data it operates on, which can lead to new findings and discoveries grounded not in pre-existing knowledge or clinical guidelines but in (potential) correlations between data points in a data corpus. Combining and extending the theoretical concepts of situational scenarios in technology development and the notion of scripts written into technology, I will show that the building of the CDSS prototype and the negotiation over the components of future situations of use co-evolved, producing specific scripts that steer the envisioned user to use the technology in particular ways and to receive recommendations based on a so-called virtual patient, a data corpus "representing" the real patient under treatment. In the case study it was predominantly the engineers' perspective on the application context, rather than the clinicians', that was implemented, resulting in less recognition of the patient's role in the consultation as well as an assumed passivity on the part of the clinician as a recipient of information delivered by the technology.
- Video: Scripting the Use of Medical Technology – The Case of Data-based Clinical Decision Support Systems
- Biomedical AI CDT seminar series. Scripting the Use of Medical Technology – The Case of Data-based Clinical Decision Support Systems. Kevin Wiggert, Department of Sociology of Technology and Innovation, Technical University of Berlin.
Giovanni Stracquadanio, Senior Lecturer in Synthetic Biology, School of Biological Sciences
The availability of high-throughput technologies enables the characterization of cells at unprecedented resolution, ranging from the identification of single-nucleotide mutations to the quantification of protein abundance. The availability of high-resolution molecular data has been particularly transformative for cancer biology and has shed light on the mechanisms controlling tumour formation and response to treatment. These findings have been made possible by the development of analytical methods that can generate testable hypotheses from cancer omics data. Here I will present how we use Bayesian learning and deep learning to take advantage of population-scale biobanks to identify the molecular mechanisms responsible for cancer heritability and tumorigenesis.
Chris Williams, School of Informatics, University of Edinburgh
The practical work of deploying a machine learning system is dominated by issues outside of training a model: data preparation, data cleaning, understanding the data set, debugging models, and so on. The goal of the Artificial Intelligence for Data Analytics project at the Alan Turing Institute is to help to automate the whole data analytics process by drawing on advances in AI and machine learning. We will describe tools to address such tasks, including identifying syntactic and semantic data types, data integration, and identifying and repairing missing and anomalous data. Joint work with the AIDA team: Taha Ceritli, James Geddes, Ernesto Jimenez-Ruiz, Ian Horrocks, Alfredo Nazabal, Tomas Petricek, Charles Sutton, Gerrit Van Den Burg.
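Two of these tasks admit compact illustrations. Below is a toy sketch, my own simplification rather than the AIDA tools themselves, of pattern-based column type inference (in the spirit of syntactic type detection) and robust z-score anomaly flagging; the patterns and threshold are assumptions.

```python
# Toy sketch: infer a column's syntactic type from value patterns, and flag
# numeric anomalies using a median/MAD-based robust z-score.
import re
import numpy as np
import pandas as pd

def infer_type(values):
    patterns = {"int": r"-?\d+", "float": r"-?\d*\.\d+", "date": r"\d{4}-\d{2}-\d{2}"}
    for name, pat in patterns.items():
        if all(re.fullmatch(pat, str(v)) for v in values):
            return name
    return "string"

def flag_anomalies(x, threshold=3.5):
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    mad = np.median(np.abs(x - med)) or 1.0           # guard against zero MAD
    return np.abs(0.6745 * (x - med) / mad) > threshold

col = pd.Series(["2021-01-01", "2021-02-03"])
print(infer_type(col), flag_anomalies([1, 2, 2, 3, 100]))  # 'date', flags the 100
```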
- Video: Artificial Intelligence for Data Analytics
- Biomedical AI CDT seminar. Artificial Intelligence for Data Analytics. Chris Williams, School of Informatics, University of Edinburgh.
Jennifer Quint, National Heart and Lung Institute, Imperial College London & BREATHE Hub.
This talk will consider what clinicians think about data and what they should think about data. I'll talk about how and where data are collected and used, and how we can make them even better. Whilst clinicians use and report data in day-to-day clinical practice, entry tools and reporting are often inefficient; and whilst clinicians are skilled at observation and medical decision-making, they often struggle to accurately record and measure activities. This in turn leads to variation in the data and issues with data quality.
- Video: Data: A Clinical Perspective
- Biomedical AI CDT seminar. Data: A Clinical Perspective. Jennifer Quint, National Heart and Lung Institute, Imperial College London & BREATHE Hub.
The CDT held a virtual Industry Event on 19 March 2021. The aim of this event was to bring together the industry partners, students and academics to discuss interests, challenges and research developments in Biomedical AI. The event featured talks from the industry partners and CDT students:
Industrial Partner Talks
- Dr Kenji Takeda (Academic Health and AI Partnerships Director): "Microsoft Research AI in Healthcare"
- Dr Flaviu Cipcigan (Research Scientist in Impact Science): "IBM's Science & Technology Outlook"
- Dr Nichola Richmond (Director of AI/ML at GlaxoSmithKline): "AI@GSK & the AI Fellowship"
Student Talks
- Michael Stam - "Machine Learning for Reliable Protein Design"
- Rayna Andreeva - "Topological Data Analysis for Biomedical Images and 3D Surfaces"
- Matúš Falis - "Addressing Concept Sparsity in Medical Text with Medical Ontologies"
Andy Smout, Vice President, Research at Canon Medical Research Europe
Canon Medical is a global supplier of radiology equipment and IT systems. Our Centre of Excellence for medical AI is based in Edinburgh, where we develop cutting-edge machine learning algorithms for detecting and characterising disease. Increasingly we find that access to data is the rate-limiting step in machine learning. This is driving a change in how we do research, and has given birth to centres such as iCAIRD where Canon and other technology companies are able to work with clinical data in situ. We expect these centres to lower the barriers to entry for new medical AI companies, which in turn will create a diverse and potentially confusing marketplace with many overlapping and competing solutions. To make sense of this, we see the need for an overarching infrastructure that brings AI into the clinical workflow, breaking down information silos and creating an integrated diagnostic solution focussed around the clinical pathway.