Past MSc projects
Abstracts of MSc projects completed by CDT students in previous cohorts.
Student: Alessandro Fontanella
Early screening and detection of pathologies are important to improve health outcomes of the general population. For automatic classification of eye diseases such as age-related macular degeneration (AMD) and diabetic retinopathy (DR), convolutional neural networks are often employed. Since retinal datasets are usually small, the standard approach is to perform transfer learning, i.e. use the weights of a model trained on ImageNet as initialisation before fine-tuning the model on the available retinal dataset. Recent papers have shown the limitations of transfer learning from ImageNet. In particular, it has been shown that in medical imaging, transfer learning may not always improve performance over a model trained from a random initialisation. In this work, we propose SelfAdapt, a method that exploits unlabelled data from different datasets to learn features in a self-supervised way while projecting all the data onto the same space to achieve better transfer. Through a series of experiments on AMD and DR grading, we show how our method improves on the common approach of transfer learning from ImageNet followed by fine-tuning in terms of classification accuracy, AUC and clinical interpretability. Finally, we also test our approach on natural images and verify its effectiveness on Office-31 data.
Student: Domas Linkevicius
Antibiotic-resistant bacteria are an outstanding biomedical problem, associated with a significant financial burden and a large number of deaths. Nevertheless, antibiotic resistance is not fully understood. A widely used class of antibiotics – the quinolones – induce bacterial cell death by causing DNA damage. However, bacteria are equipped with biological machinery – particularly the RecBCD protein complex – that repairs the bacterial DNA and permits the evolution of superbugs. Our main aim in this thesis is to find optimal experimental designs for parameter inference in a recBCD gene expression model.
The field of Bayesian experimental design (BED) is concerned with finding optimal experimental designs in a principled way. In particular, we use MINEBED – a novel, likelihood-free Bayesian experimental design methodology – to find optimal experimental designs. It offers a significant improvement over a number of existing BED methodologies because it allows mutual information to be used when searching for optimal designs.
In this thesis we validate MINEBED on a gene expression model and investigate multiple design optimization scenarios of practical relevance. Particularly, we show that MINEBED provides experimental designs that are reasonably robust to unknown sources of variability. Moreover, we show that at least two measurements are necessary to get a good parameter estimate in the recBCD gene expression model. Based on our results, we discuss how BED in general and MINEBED in particular can provide invaluable help in designing more efficient laboratory experiments in terms of cost, labour and quality of measurement.
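The mutual-information criterion at the heart of this kind of design search can be illustrated with a toy example. The sketch below is purely illustrative: the simulator, the plug-in histogram estimate of mutual information, and all numbers are stand-ins, not MINEBED (which uses a neural lower bound on MI) or the recBCD model. It picks the candidate design under which simulated data are most informative about a parameter drawn from its prior:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(theta, d, n):
    # toy measurement model: signal grows with the design d, saturating
    return theta * d / (1.0 + d) + rng.normal(0.0, 0.1, size=n)

def mutual_information(theta, y, bins=20):
    # plug-in MI estimate from a 2-D histogram (illustrative only;
    # MINEBED instead maximises a neural lower bound on MI)
    joint, _, _ = np.histogram2d(theta, y, bins=bins)
    joint = joint / joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / (px @ py)[nz])).sum())

n = 5000
theta = rng.uniform(0.0, 2.0, size=n)   # prior draws of the parameter
designs = np.linspace(0.1, 5.0, 20)     # candidate measurement settings
mi = [mutual_information(theta, simulate(theta, d, n)) for d in designs]
best = designs[int(np.argmax(mi))]
print(f"most informative design: {best:.2f}")
```

In this toy model larger designs amplify the signal relative to the fixed noise, so the search selects a large design; in a realistic gene expression model the MI landscape is far less obvious, which is what motivates the machinery above.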
Student: Evgenii Lobzaev
Single-cell RNA sequencing makes it possible to identify and study cell-to-cell heterogeneity that cannot be captured with more conventional bulk sequencing methods. Because cells are destroyed during sequencing, it is impossible to obtain gene expression time-series trajectories for any given cell; only a snapshot of the expression profile associated with each cell can be obtained. Pseudotime inference methods allow these snapshots to be aligned into a longitudinal trajectory. This alignment can capture dynamic cellular processes, such as the cell cycle or differentiation, and acts as a proxy for time series.
In this work, we model gene expression trajectories along pseudotime, both in terms of mean expression and cell-to-cell variability patterns. This extends the work that we previously carried out in relation to this topic. We will conduct thorough testing of our existing model and its Bayesian extensions, identifying both advantages and disadvantages of each model and comparing their performance to the performance of other, publicly available tools.
Moreover, we will address the problem of mean-variability confounding that is often observed in single-cell RNA sequencing data. Finally, we will try to perform inference on the gene expression trajectories while simultaneously clustering them. For this task we apply Dirichlet process mixture models using various probabilistic programming languages.
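As a minimal illustration of the Dirichlet-process clustering idea (using scikit-learn's truncated variational approximation rather than a probabilistic programming language, and entirely synthetic "trajectories"): a truncated DP mixture is given more components than needed and lets the data decide how many carry weight.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(1)
# toy "gene trajectories": two groups with distinct pseudotime profiles
t = np.linspace(0, 1, 30)
up = np.stack([2 * t + rng.normal(0, 0.1, t.size) for _ in range(40)])
down = np.stack([2 - 2 * t + rng.normal(0, 0.1, t.size) for _ in range(40)])
X = np.vstack([up, down])

# truncated Dirichlet process mixture: surplus components get ~zero weight
dpgmm = BayesianGaussianMixture(
    n_components=10,
    weight_concentration_prior_type="dirichlet_process",
    covariance_type="diag",
    random_state=0,
).fit(X)
labels = dpgmm.predict(X)
n_used = int(np.sum(dpgmm.weights_ > 0.05))  # effective number of clusters
print(f"effective clusters: {n_used}")
```

The appeal for this project is that the number of trajectory clusters need not be fixed in advance, which matters when gene programmes along pseudotime are not known a priori.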
Student: Katarzyna Szymaniak
In vivo deep-brain calcium imaging is a powerful technique for monitoring the activity of populations of neurons in the brains of freely moving animals. However, extracting individual neuronal signals from the data requires multi-stage pre-processing before analysis can be performed. In collaboration with the Centre for Discovery Brain Sciences, an experiment was conducted on freely behaving mice engaged in an exploratory and navigational task, with neural activity recorded using one-photon calcium imaging. The goal of the experiments is to predict the animal's location from images acquired from the mouse hippocampus. To minimise the required pre-processing, we explore interpretable deep-learning-based image processing algorithms.
A ResNet10 model with background correction was applied as a behaviour predictor, and neuron activation mapping, inspired by class activation mapping for object localisation, was computed. The impact of the data was analysed using two approaches: statistical frame merging and background correction. In addition, the temporal dynamics of neural activity were explored, and the generalisation ability of the model was evaluated.
Student: Matúš Falis
This study aims to recreate and improve upon the state-of-the-art method for the identification of relations between drugs and drug-related information in the context of adverse drug events. We analyse and pre-process the data for the 2018 n2c2 Track 2 shared task, consisting of 505 discharge summaries of patients with adverse drug events from the intensive care units of the Beth Israel Deaconess Medical Center in Boston. Using the gold standard of drug-related entities and relations provided for the challenge, and following the current state-of-the-art method for this task, we fine-tune a BERT classifier as our baseline model. We then investigate improvements to the baseline model through fine-tuning a domain-specific BERT classifier, Clinical BERT, enhancing the input text by including additional left and right context, and drug-name normalisation. We find that a version of BERT pre-trained on text from the biomedical domain outperforms the more generic base BERT; that additional input context more uniquely defines the input text, resulting in improved performance; that a domain-specific version of BERT can be combined with additional left and right context for further improvement; and that providing medical background knowledge via drug-name normalisation, while leading to lower performance when used in isolation, does not hinder the performance of models when combined with either domain-specific BERT or additional context. Our best model achieves an overall micro F1 score of 0.964, surpassing the state of the art.
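The input-enhancement idea (entity markers plus extra left/right context around the candidate drug-attribute pair) can be sketched in a few lines. The marker tokens, span convention and window size below are illustrative choices, not the exact format used in the task:

```python
def build_input(tokens, e1_span, e2_span, extra_context=10):
    """Mark two entity spans and keep extra_context tokens of left and
    right context around the pair, as input for a relation classifier."""
    (s1, t1), (s2, t2) = sorted([e1_span, e2_span])
    left = max(0, s1 - extra_context)
    right = min(len(tokens), t2 + extra_context)
    out = (
        tokens[left:s1] + ["[E1]"] + tokens[s1:t1] + ["[/E1]"]
        + tokens[t1:s2] + ["[E2]"] + tokens[s2:t2] + ["[/E2]"]
        + tokens[t2:right]
    )
    return " ".join(out)

tokens = ("patient was started on warfarin 5 mg daily for atrial "
          "fibrillation and developed epistaxis").split()
marked = build_input(tokens, (4, 5), (13, 14), extra_context=3)
print(marked)
```

A string built this way would then be tokenised and fed to the fine-tuned BERT model; widening extra_context is what "additional left and right context" refers to above.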
Student: Michael Stam
Supervisors: Ajitha Rajan and Javier Alfaro
Immunopeptides are peptides that are presented to the immune system of an organism and can be used in treatments for cancer and viral infections. Motif discovery techniques can be applied to immunopeptides to discover conserved regions in the peptide sequences, known as motifs. These motifs can then be studied for various properties and their suitability for vaccine design. This research project builds on previous work by the Alfaro lab. The main outputs of this project are a new motif discovery pipeline that can be applied to samples of immunopeptide sequences, and a strategy to compare the performance of different techniques. Firstly, this pipeline was applied to a sample of 7,760 immunopeptide sequences to discover motifs. These motifs were then compared to those found by an in-house pipeline developed by the Alfaro lab, and to those found by an existing motif discovery technique called GibbsCluster. The quality of these motifs was assessed by a biologically motivated quality score defined in this project, and by the proportion of peptides that they covered. The results show that the motifs found in this project covered more peptides in the sample than was achieved in previous work; however, these motifs did not achieve higher quality scores. The pipeline developed in this project also found motifs with better quality scores, and covered more peptides, than those found by the existing motif discovery technique GibbsCluster on the same sample of immunopeptides. Further research still needs to be conducted to apply these methods across more samples of immunopeptides, and to determine which technique to use in future work.
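As a rough sketch of what "motif" and "peptide coverage" can mean in practice: a motif can be summarised as per-position residue frequencies, and coverage as the fraction of peptides consistent with it. The frequency matrix, the match rule and the threshold below are illustrative simplifications, not the project's actual quality score:

```python
from collections import Counter

def position_frequencies(peptides):
    """Per-position residue frequencies for equal-length peptides,
    the usual starting point for sequence-motif scoring."""
    length = len(peptides[0])
    n = len(peptides)
    cols = [Counter(p[i] for p in peptides) for i in range(length)]
    return [{aa: c / n for aa, c in col.items()} for col in cols]

def matches(peptide, motif, threshold=0.5):
    """A peptide 'matches' if each position's residue is common enough
    in the motif (the threshold is an illustrative choice)."""
    return all(motif[i].get(aa, 0.0) >= threshold
               for i, aa in enumerate(peptide))

peptides = ["SIINFEKL", "SIINYEKL", "SIINFEKL", "SLINFEKL"]
motif = position_frequencies(peptides)
coverage = sum(matches(p, motif) for p in peptides) / len(peptides)
print(f"coverage: {coverage:.2f}")
```

Real pipelines such as GibbsCluster additionally align peptides of unequal length and discover several motifs per sample, but the coverage-versus-quality trade-off discussed above already appears in this simple setting.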
Student: Natalia Szlachetka
Con-ikot-ikot (CII), a small peptide toxin, is a promising candidate for developing a fluorescent probe for single-particle tracking of the dynamics of AMPA receptors (AMPARs) - a subtype of ionotropic glutamate receptors involved in synaptic plasticity, which is the underlying mechanism of the processes of memory and learning. Its structure could be used as a skeleton for genetic modification to alter its function, as well as computational protein design to develop smaller con-ikot-ikot-inspired probes; however, due to the laboratory production of CII being time-consuming and financially expensive, and because the yield of the protein is very low, it would not be possible to test all proposed mutants and designs in vitro. Molecular dynamics (MD) simulations have been carried out in this project to create the basis for a data set for use in the further stages of computational protein design and engineering. The simulations in this project were used to investigate con-ikot-ikot monomer and dimer in the context of their interactions with isolated AMPAR ligand-binding domains. Gaining an understanding of the interactions between CII and AMPARs will guide the design of mutants with altered binding sites or function, as well as small peptides with shapes mimicking the shape of the AMPAR-interacting interface of CII. Proposed structures would first be tested in silico and tweaked to optimise their characteristics, and a selected subset of designs that performed best in simulations could be produced for in vitro experiments.
Student: Nikitas Angeletos Chrysaitis
Supervisors: Peggy Series and Renaud Jardri
Recent cognitive science theories view the human brain as constantly making probabilistic calculations akin to Bayesian inference, with priors corresponding to top-down influences and likelihoods to bottom-up ones. Within this framework, mental disorders are usually understood as impairments that overweight one stream of information relative to the other. Circular inference is a novel extension of this approach, which mirrors the excitatory-to-inhibitory imbalances often found in mental illnesses. It predicts that, in individuals with such imbalances, priors or likelihoods would get reverberated in the brain’s hierarchical models of the environment, overwhelming the inferential process. This framework has been used to explain the mechanisms behind the symptoms of schizophrenia, and has been subsequently verified in such patients.
Autism has a complex relationship with schizophrenia, which is based both on their similarities and their differences. Therefore, in this project, we explore the use of circular inference in modelling the behaviour of individuals with autistic traits. In order to do that, we implement a decision-making task and collect data from participants online. Surprisingly, we find that circular inference better models the behaviour of all the participants. However, despite the known inhibitory impairments in autism spectrum disorders, we find no evidence of any relationship between autistic traits and information reverberation or any other impairment. We proceed to analyse the project’s limitations, and propose directions for future research, to further investigate and verify the circular inference framework and its explanatory power in mental illness and human behaviour in general.
Student: Rayna Andreeva
Optical coherence tomography angiography (OCTA) imaging is a relatively new modality for the discovery of retinal biomarkers linked to chronic diseases. Several studies have already shown the potential of using hand-crafted features on OCTA images for identifying patient status. However, they depend on the manual segmentation and pre-processing steps, which are prone to errors. In contrast to previous work, we suggest the usage of topological information which does not depend on geometric calculations and is more robust to noise.
We aim to investigate whether features based on topological invariants can help in identifying patient status from a relatively small dataset. We further test the robustness of the model on a dataset obtained from a different device. Additionally, we explore whether a construction based on multiple topological invariants, which captures more in-depth topological information by bi-filtering the images with a function defined on them, can help us detect particular regions where the differences between diseased and healthy images are most pronounced. Thirdly, we aim to link topological features with established biomarkers.
Our contributions are the innovative analysis of the topology of the OCTA images and the development of a pipeline for the discovery of topological biomarkers. Our results demonstrate how constructing and using topological invariants as features enables fairly accurate OCTA scan classification. As an additional benefit, our approach does not require pre-processing steps and enables better interpretability. We hope this will help doctors in their diagnosis through better human understanding of the detection. Finally, we show that our model maintains consistent performance across OCTA imaging devices without any re-training, which allows our method to be used as an automatic screening tool.
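A minimal flavour of the topological features involved: counting connected components of a thresholded image across a filtration (the 0-dimensional Betti number) requires no segmentation and is robust to small intensity perturbations. The synthetic image and thresholds below are illustrative; real persistence computations use dedicated topological data analysis libraries:

```python
import numpy as np
from scipy import ndimage

def betti0_curve(image, thresholds):
    """Number of connected components of {pixel >= t} for each t.
    Tracking components across a filtration is the simplest flavour
    of the persistence-style features described above."""
    return [ndimage.label(image >= t)[1] for t in thresholds]

# toy "vessel map": two bright blobs of different intensity
img = np.zeros((32, 32))
img[4:10, 4:10] = 1.0
img[20:26, 20:26] = 0.6

curve = betti0_curve(img, thresholds=[0.2, 0.8])
print(curve)
```

Here the dimmer blob disappears at the higher threshold, so the component count drops from 2 to 1; feeding such curves (and their 1-dimensional analogues for loops) to a classifier is the basic recipe for topology-based features.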
Student: Bryan Li
Understanding how cortical responses reshape over the course of learning has been a central theme of computational neuroscience. Thanks to recent advances in neural imaging technologies, experimentalists are able to obtain high-quality recordings from hundreds of neurons over multiple days or even weeks. However, the complexity and dimensionality of population responses pose significant challenges for analysis. Existing methods of studying neuronal adaptation and learning often impose strong assumptions on the data or model, resulting in biased descriptions that do not generalize.
In this work, we explore the use of a deep generative model, the cycle-consistent adversarial network (CycleGAN), to learn the unknown mapping between pre-learning and post-learning in vivo cellular activities. To do so, we develop a framework to preprocess, train on and evaluate calcium signals. We first test our framework on a synthetic dataset with a ground-truth transformation. Subsequently, we apply it to neuronal activity from rodent visual cortex, recorded across the days during which mice transitioned from novice to expert-level performance on the experimental task. We compare our model's performance on both the generated calcium imaging signals and their inferred spike trains. To maximize the performance of our model, we derive a novel approach to pre-sort neurons so that convolution-based deep neural networks can take advantage of the spatial information present in neuronal activities. In addition, we incorporate a number of model visualization methods to improve the explainability of our work and to gain insight into the learning process as manifested in the cellular activities.
Together, our results demonstrate that analyzing neuronal learning processes with the data-driven deep unsupervised method holds tremendous potential.
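One simple way to realise the neuron pre-sorting idea described above is to order neurons by hierarchical clustering of their activity correlations, so that correlated neurons end up adjacent in the model input; the project's own pre-sorting method may differ, and the traces below are synthetic:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list

rng = np.random.default_rng(6)
# toy population: 20 neurons x 500 timepoints, two correlated groups
# deliberately interleaved so neighbouring rows are uncorrelated
base_a, base_b = rng.normal(size=(2, 500))
traces = np.stack([(base_a if i % 2 == 0 else base_b)
                   + rng.normal(0, 0.5, 500) for i in range(20)])

# reorder neurons so correlated ones become neighbours, giving a
# convolutional model meaningful local structure along the neuron axis
dist = 1.0 - np.corrcoef(traces)
condensed = dist[np.triu_indices(20, 1)]
order = leaves_list(linkage(condensed, method="average"))
sorted_traces = traces[order]
print(order)
```

After this reordering, a 1-D or 2-D convolution over the neuron axis sees locally coherent activity instead of an arbitrary recording order.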
Student: Craig Nicolson
Objective: Several distinct molecular endotypes of Acute Pancreatitis (AP) have recently been identified using multiomic time series data. Here we aim to use the same multiomic dataset to identify the minimum set(s) of variables that can discriminate between these endotypes, with the potential for this to form the basis of a translational endotyping tool.
Methods: We evaluate the multiomics, which comprise clinical data, whole blood transcriptomics, serum proteomics, and serum metabolomics collected following the presentation of patients to hospital (n = 34). Given the limited dataset size, we use nested cross-validation with PLS-DA classification and permutation testing to identify target subsets of discriminatory multiomics. This includes the exploration of several novel methods of time series offsetting. These target subsets were then evaluated using pathway enrichment analysis and simple classifiers.
Results: Of the data types, transcriptomics demonstrated the highest discriminatory ability. Patient-reported time of symptom onset was used to align patients in their disease process. Group-wise and summative gene selection across the endotypes showed strong discriminatory ability. In combination, this offsetting and selection allow us to achieve an AUROC of 0.83 using less than 1% of the original variable count used to identify the endotypes.
Conclusions: These results demonstrate that a significant degree of endotype discrimination at discrete timepoints is achievable using a small subset of transcriptomic targets.
Student: Ella Davyson
Major Depressive Disorder (MDD) is a complex condition, primarily characterised by a persistent low mood. MDD is currently one of the leading causes of disability worldwide, and a large proportion of those diagnosed with MDD are unresponsive to antidepressant treatment. However, the lack of understanding of the precise biological mechanisms of MDD has limited the identification of new drug targets and/or biomarkers for patient stratification. This study aimed to analyse the association between genetic risk for MDD and plasma protein levels, to further elucidate the biological basis of MDD. Polygenic Risk Scores (PRSs), which represent an aggregate score of genetic risk for MDD, were tested for associations with 4325 proteins in 1065 individuals, using a linear mixed effects model. Significant proteins were then tested for a causal influence on MDD using Mendelian Randomisation (MR) analysis. The PRS constructed from genome-wide significant variants (P < 5e-08) was significantly associated with five proteins (BTN3A3, MICA, MICB, C4 and NOE2, PFDR < 0.05), which have roles in immunoregulation and inflammatory processes. Covarying for a single SNP (rs200949) genotype effect in the model attenuated these associations. MR analysis provided evidence for BTN3A3, MICA and MICB having significant causal roles in MDD (P < 0.05). These results suggest that more stringent PRSs may be more effective for the assessment of biological profiles in MDD and that inflammatory processes are key markers of MDD risk. This analysis pipeline provides a view into protein differences in those with increased genetic risk of MDD, and may be used to establish the biological functional differences between subtypes of MDD in the future.
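At its core, the PRS construction referred to above is a thresholded, effect-size-weighted sum of risk-allele counts. The sketch below is a generic illustration with simulated genotypes and summary statistics, not the study's actual scoring pipeline:

```python
import numpy as np

def polygenic_risk_score(genotypes, effect_sizes, p_values, threshold=5e-8):
    """PRS as the effect-size-weighted sum of risk-allele counts,
    restricted to variants passing a significance threshold (here the
    conventional genome-wide one; all numbers are illustrative)."""
    keep = p_values < threshold
    return genotypes[:, keep] @ effect_sizes[keep]

rng = np.random.default_rng(2)
genotypes = rng.integers(0, 3, size=(5, 100))  # 5 people x 100 SNPs (0/1/2)
effects = rng.normal(0, 0.1, size=100)         # per-SNP weights from a GWAS
pvals = rng.uniform(0, 1, size=100)
pvals[:10] = 1e-9                              # pretend 10 SNPs are GW-significant

scores = polygenic_risk_score(genotypes, effects, pvals)
print(scores.shape)
```

Relaxing the threshold includes more, weaker variants; the study's observation that the stringent genome-wide threshold gave the clearest protein associations corresponds to varying `threshold` here.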
Student: Filippo Corponi
Background: Bipolar disorder (BD) is a complex psychiatric disease, better understood as a spectrum rather than a unique nosographic construct. Dissecting BD into clinical subgroups underpinning distinctive pathophysiological pathways is key for precision medicine. Converging evidence suggests that a subset of BD may develop from neurodevelopmental disruption and may have an early age at onset (AAO) and/or a history of psychotic symptoms as clinical trademarks. To this day, the definition of early AAO remains problematic, as it is mainly functional to the identification of homogeneous clinical subgroups.
Aims: To test to what degree neurodevelopmental factors predict AAO and/or the occurrence of psychotic symptoms, and to assess which clinical phenotype induces the best patient separation along neurodevelopmental pathways.
Methods: Data from a cross-sectional, naturalistic, France-based cohort comprising 4421 patients was used for this post-hoc study. A supervised learning framework was applied in binary classification experiments to predict 1) early AAO defined with either Gaussian Mixture Models (GMM) or age cut-offs in the range [14-25], 2) psychotic symptoms, and 3) both GMM-defined early AAO and psychotic symptoms from neurodevelopmental antecedents. Secondarily, an unsupervised learning approach was used to assess the overlap between data-driven labels and the clinical annotations adopted for supervised learning.
Results: The highest area under the ROC curve (AUROC) was attained for early AAO defined with low cut-offs, i.e. 14 up to 16 years (mean = 0.7327, sd = 0.0169 with XGBOOST at AAO ≤ 16); performance tapered off across all classification algorithms for higher cut-offs. Performance was moderate (mean AUROC test < 0.65) for all other target value definitions. The highest degree of overlap with data-driven clusters was for early AAO defined with low cut-offs, 14 up to 17 years (Normalized Mutual Information (NMI) = 0.41 for AAO ≤ 17) while the lowest was recorded for psychotic features (NMI = 0.29).
Conclusions: The use of very low cut-offs (i.e. below 17 years) is recommended as a means to map BD patients to distinctive neurodevelopmental pathways.
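The GMM-based definition of early AAO used above can be sketched on synthetic data: fit a two-component mixture to the age-at-onset distribution and let the component boundary define "early". The component count, parameters and boundary rule below are illustrative, not the study's exact procedure:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
# synthetic age-at-onset: an early-onset and a later-onset subgroup
aao = np.concatenate([rng.normal(18, 2, 300), rng.normal(32, 5, 700)])

gmm = GaussianMixture(n_components=2, random_state=0).fit(aao.reshape(-1, 1))
early = int(np.argmin(gmm.means_.ravel()))
labels = gmm.predict(aao.reshape(-1, 1))
cutoff = aao[labels == early].max()  # oldest patient in the early component
print(f"data-driven early-onset boundary: ~{cutoff:.0f} years")
```

Comparing such a data-driven boundary against the fixed cut-offs in the 14 to 25 year range is what the supervised experiments above evaluate.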
Student: Leonardo Castorina
Proteins are essential to life on earth, performing most of the essential chemical reactions that make up life: from converting solar energy into chemical energy, to DNA replication, to producing a wide range of materials with various properties. Nature, however, has only explored a small portion of the protein universe, meaning there is a vast range of physically possible protein shapes that have never been produced. De novo protein design aims to use computational tools to explore this protein space. With deep learning methods becoming more efficient and lightweight, novel approaches have been proposed for the design of new proteins. Prediction with deep learning methods is much faster than with current protein design methods, which makes them competitive alternatives and makes the field more accessible.
Here, we produced Convolutional Neural Network and Graph Neural Network models for protein design, beating the current state-of-the-art. We also show how robust benchmarking is essential for presenting and comparing the performance metrics of different design methods and that accuracy, by itself, is meaningless. Finally, we test our best models with AlphaFold2, a state-of-the-art protein folding model, to determine whether the sequences produced by our models would theoretically fold into the predetermined 3-dimensional shape, obtaining the majority of the predicted structures within 2 Å of the target shape.
Student: Olivier Labayle Pabet
Since their introduction in 2005, genome-wide association studies have become widespread in population genetics. While their contribution to past discoveries is undeniable, they are poorly suited to understanding complex relationships between causal genetic variants and traits. Indeed, they often rely on overly simplistic statistical assumptions such as linearity. Inference is then carried out, with no mathematical guarantees, by interpretation of the model’s parameters. This misspecification inevitably leads to wrong inferences, in which statistical confidence grows with the dataset’s size. In this project we aim to leverage the framework of Targeted Minimum Loss-Based Estimation (TMLE) for estimating the effects of interacting variants on traits using the UK-Biobank. This framework relies on three pillars. First, the quantity of scientific interest must be clearly stated from the beginning of the study in a model-independent way. Second, only realistic assumptions should be made about the data-generating distribution; the statistical model will thus be non/semi-parametric. Finally, the estimation procedure is targeted towards the quantity of interest to minimize bias and variance of the estimator.
While TMLE has been widely used for the estimation of the Average Treatment Effect (ATE), in this work I provide, for the first time, an implementation of the method for interaction estimation. This is based on recent theoretical developments that are reviewed in the first part of this report. The properties of the estimator are then presented in simulation studies that illustrate the benefits of the method. The software was implemented in the Julia programming language using state-of-the-art software development practices. It is now ready to be released to the community and to be used on the UK-Biobank for the next phase of the project.
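For the canonical ATE case mentioned above, the TMLE recipe (initial outcome and treatment fits, then a targeted fluctuation using a "clever covariate") can be sketched on simulated data. This is the textbook linear-fluctuation version for a continuous outcome in Python, not the project's Julia interaction estimator:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(7)
n = 5000
W = rng.normal(size=(n, 2))                     # confounders
A = rng.binomial(1, 1 / (1 + np.exp(-W[:, 0])))  # confounded treatment
Y = 1.0 * A + W[:, 0] + 0.5 * W[:, 1] + rng.normal(0, 1, n)  # true ATE = 1

# step 1: initial outcome regression Q(A, W) and propensity g(W)
Q = LinearRegression().fit(np.column_stack([A, W]), Y)
g = LogisticRegression().fit(W, A).predict_proba(W)[:, 1]
Q1 = Q.predict(np.column_stack([np.ones(n), W]))
Q0 = Q.predict(np.column_stack([np.zeros(n), W]))
QA = np.where(A == 1, Q1, Q0)

# step 2: targeting step, least-squares fluctuation along the
# clever covariate H(A, W) = A/g - (1-A)/(1-g)
H = A / g - (1 - A) / (1 - g)
eps = np.sum(H * (Y - QA)) / np.sum(H * H)
ate = np.mean((Q1 + eps / g) - (Q0 - eps / (1 - g)))
print(round(ate, 2))
```

The targeting step is what distinguishes TMLE from a plain plug-in estimate: it nudges the outcome fit exactly in the direction that removes first-order bias for the chosen quantity of interest.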
Student: Rohan Gorantla
Supervisors: Oisin Mac Aodha and Pearse Keane
Medical imaging is an important aid for clinicians when diagnosing and treating patients. Diagnostic errors in medical image analysis can be frequent, and error rates as high as 33% have been reported in some cases. This can result in severe patient harm. With recent advances in Artificial Intelligence (AI) based medical image analysis tools, the role of clinicians could evolve into being human authorities assessing non-trivial or borderline cases. Thus, training clinical students to develop their medical imaging knowledge using a variety of image patterns is imperative if they are to curtail errors in the future. In this work, our goal is to build an adaptive computer-aided teaching system that can personalize a student’s learning experience by understanding their current knowledge and assisting them in rapidly gaining expertise by providing informative training examples. The majority of existing teaching systems focus on finding the optimal set of teaching images to minimize learning time without modelling the learner’s mental model. Those that do model the learner’s mental model make unrealistic assumptions, e.g., that all learners have the same mental model. Our work proposes an approach for computer-aided teaching that attempts to overcome these limitations by estimating the learner’s mental model. We further study the impact of incorporating the learner’s confidence about a particular response while teaching. We demonstrate that our proposed method is more robust than existing methods and is thus potentially more applicable to real-world settings involving the teaching of fine-grained visual concepts to human learners.
Student: Marcin Kedziera
Supervisors: Till Bachmann, Hakan Bilen, Peter Bankhead, Bob Fisher, Lukas Engelmann
This work explores the utility of deep learning segmentation and classification methods in rapid antimicrobial susceptibility testing (AST). Using the serial broth macrodilution methodology for AST, a dataset containing 5,000 phase-contrast microscopy images of nonpathogenic E. coli and S. carnosus strains, grown in varying concentrations of ciprofloxacin or amoxicillin, and imaged at regular time intervals was created. This dataset was then labelled and used to train a semantic segmentation model, capable of reliable cell segmentation after post-processing.
Two convolutional neural network binary classification models were also trained to distinguish between inhibited and uninhibited cells. The first was trained on whole images and did not achieve adequate performance on previously unseen data due to a source of bias present in the training set. The second classifier was trained on cropped images of individual cells, extracted from whole images using the aforementioned segmentation model. This classifier correctly classified 76 out of 110 previously unseen inhibited cells and 91 out of 110 previously unseen uninhibited cells. This research confirms the applicability of deep-learning-based image analysis methods to AST.
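The segmentation-to-crops hand-off described above (cutting one patch per segmented cell to feed the second classifier) can be sketched with scipy's connected-component tools; the padding policy and the toy mask are illustrative:

```python
import numpy as np
from scipy import ndimage

def crop_cells(image, mask, pad=2):
    """Cut one padded patch per connected component of a binary
    segmentation mask, one crop per detected cell."""
    labelled, _ = ndimage.label(mask)
    crops = []
    for rows, cols in ndimage.find_objects(labelled):
        crops.append(image[max(0, rows.start - pad):rows.stop + pad,
                           max(0, cols.start - pad):cols.stop + pad])
    return crops

image = np.random.default_rng(5).random((64, 64))
mask = np.zeros((64, 64), dtype=bool)
mask[10:20, 10:14] = True   # one rod-shaped "cell"
mask[40:50, 30:34] = True   # another
crops = crop_cells(image, mask)
print([c.shape for c in crops])
```

Training on such per-cell crops, rather than whole images, is what removed the whole-image bias described above, since each example then contains a single cell rather than plate-level context.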
Student: Aleksandra Sobieska
Non-alcoholic fatty liver disease (NAFLD) is a serious ailment affecting around 25% of adults worldwide, and no pharmacological treatments have been officially approved for it. NAFLD in vitro models are a promising route to tackle this issue, but they suffer from low reproducibility. Understanding the cell type composition of the liver across the entire NAFLD spectrum could improve the predictability of model outcomes. To design an in vitro model with biologically relevant cell type proportions, three deconvolution algorithms (Scaden, MuSiC, BayesPrism) are evaluated using synthetic data. Scaden achieved the best deconvolution accuracy. It was then used to infer cell type proportions in bulk RNA-Seq samples spanning control and all NAFLD stages. The deconvolution results showed that hepatocyte proportions decrease and other cell type fractions increase as fibrosis becomes more severe; in the last fibrotic stage, fraction variability is the largest. Furthermore, in the control group, Scaden underestimated endothelial and immune cell types, but it inferred hepatocyte and mesenchymal fractions reasonably well. Additionally, novel recommended cell type fractions across the full disease spectrum are reported and evaluated for an in vitro model consisting of mesenchyma, hepatocytes and endothelia.
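The core deconvolution problem the three tools tackle, recovering cell type fractions from a bulk expression profile given cell type signatures, can be illustrated with plain non-negative least squares (far simpler than Scaden, MuSiC or BayesPrism; all data here are synthetic):

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(4)
# signature matrix: mean expression of 50 genes in 4 cell types
signature = rng.gamma(2.0, 1.0, size=(50, 4))
true_props = np.array([0.6, 0.2, 0.15, 0.05])  # hepatocyte-dominated mix
bulk = signature @ true_props + rng.normal(0, 0.01, 50)

# non-negative least squares as a baseline deconvolution:
# find x >= 0 minimising ||signature @ x - bulk||
raw, _ = nnls(signature, bulk)
props = raw / raw.sum()  # renormalise to fractions
print(np.round(props, 2))
```

Real bulk data violate the clean linear-mixing assumption (platform effects, correlated signatures, unseen cell types), which is why learned deconvolution methods such as Scaden trained on single-cell references outperform this baseline.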
Student: Barry Ryan
Parkinson’s Disease (PD) is the second most common neurological disorder; it affects ∼4% of the population over 80 years old and is estimated to impact ∼80 million people by 2040. The goal of this thesis is to uncover novel pathologies associated with PD by comparing sporadic incidence of PD to rare single-gene mutations in familial cases. Two longitudinal statistical experiments were conducted: the first compares a genetic PD cohort carrying one of three rare gene mutations to healthy controls; the second compares idiopathic (sporadic) PD individuals to a healthy cohort. Gene expression profiles of genetic and idiopathic PD individuals were compared in two analyses: the first identified consistently dysregulated genes across all time points; the second clusters co-expressed genes relating to PD at the 12-month time point. Consistently dysregulated genes in the genetic PD cohort were enriched in molecular pathways relating to oxidative stress and mitochondrial dysfunction. This corroborates known pathologies of PD resulting from mutations in the three genetic risk loci SNCA, LRRK2 and GBA, and shows that a biological signal relating to a neurological disorder can be identified from peripheral blood RNA-seq samples. Consistently dysregulated genes were subsequently used to train a logistic regression classifier, which achieved a classification accuracy of 89% between genetic PD individuals and healthy controls; a similar classifier trained on idiopathic PD achieved an accuracy of 78%. Homogeneity within a cohort is, however, required to identify a meaningful biological signal, as very few differentially expressed genes were identified in the idiopathic PD cohort before 12 months. Convergence in dysregulated genes between the two cohorts was found after 12 months. These genes were enriched in inflammatory markers, which is to be expected and reflects common high-level symptoms of PD such as loss of motor function.
The second analysis showed that clustering similarly expressed genes can overcome heterogeneity. It identified a build-up of α-synuclein in the brain as being involved in both genetic and idiopathic PD, a known feature of PD pathology. It also revealed a medical history of hepatitis as a potential cause of oxidative stress in sporadic PD individuals, the first time this finding has been reported in a gene expression dataset.
Student: Dominic Phillips
Understanding the conformational dynamics of large biomolecules is a fundamental problem in biology and of central importance in drug discovery. Molecular dynamics simulations are invaluable to tackling this problem, but struggle to scale to the timescales of biological interest. So-called enhanced sampling techniques ameliorate this by accelerating molecular kinetics along a system’s collective variables (CVs) - the degrees of freedom that govern conformational change. Traditionally, identifying CVs has relied on expert, system-specific knowledge. More recently, numerous machine learning models have been developed for the same purpose. However, designing evaluation methods to systematically compare these models has proved challenging.
In this work, we design and test a novel evaluation pipeline for assessing the quality of learnt CVs based on a diffusion-rate criterion. We validate this pipeline on test systems before extending it to alanine dipeptide. We then run standardised enhanced sampling simulations of alanine dipeptide to compare these same CVs. We observe that several algorithms learn almost identical CVs, yet still suffer from large systematic errors in the quality of the recovered free energy surface, indicating that the choices of trajectory featurisation and free-energy re-weighting are at least as significant as the choice of CV-learning algorithm.
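One basic ingredient of a diffusion-based criterion (a simplified stand-in, not the pipeline itself) is estimating the diffusion coefficient of a trajectory projected onto a candidate CV, e.g. from the short-time slope of the mean squared displacement. A minimal NumPy sketch using synthetic Brownian motion in place of a real projected trajectory:

```python
import numpy as np

rng = np.random.default_rng(1)

# Brownian motion as a stand-in for a trajectory projected onto one CV.
dt, n_steps, D_true = 0.01, 100_000, 1.0
x = np.cumsum(rng.normal(scale=np.sqrt(2 * D_true * dt), size=n_steps))

def diffusion_coefficient(traj, dt, max_lag=50):
    """Estimate D from the short-time slope of the mean squared
    displacement: MSD(tau) ~ 2 * D * tau in one dimension."""
    lags = np.arange(1, max_lag + 1)
    msd = np.array([np.mean((traj[k:] - traj[:-k]) ** 2) for k in lags])
    slope = np.polyfit(lags * dt, msd, 1)[0]   # linear fit through the MSD
    return slope / 2.0

D_est = diffusion_coefficient(x, dt)
print(f"estimated D = {D_est:.2f} (true D = {D_true})")
```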
Student: Ben Philps
Supervisors: Maria Valdes Hernandez, Chen Qin
I perform a comparative analysis of multiple state-of-the-art uncertainty estimation techniques applied to the task of segmenting white matter hyperintensities in FLAIR and T1 MRI. I find that many methods in the literature that do not model the joint distribution over labels fail silently for small lesions and are unsuitable for providing uncertainty maps to clinicians.
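A common baseline among the techniques compared is the per-pixel predictive entropy of the segmentation probabilities. A minimal sketch, assuming a hypothetical model's foreground probabilities on a tiny synthetic "image":

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy per-pixel foreground probabilities from a hypothetical WMH
# segmentation model (4x4 "image"); real maps would come from a
# network's softmax output over a FLAIR/T1 volume.
p_fg = rng.uniform(size=(4, 4))

def entropy_map(p, eps=1e-12):
    """Per-pixel predictive entropy for a binary segmentation, in nats."""
    p = np.clip(p, eps, 1 - eps)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

u = entropy_map(p_fg)
# Entropy peaks at p = 0.5 (maximally uncertain) and vanishes at p = 0 or 1.
print(u.round(2))
```

Note this treats every pixel independently; the abstract's point is precisely that such marginal, pixel-wise maps can fail silently for small lesions, which motivates methods that model the joint label distribution.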
Student: Hans-Christof Gasser
Cytotoxic T-lymphocytes (CTL) are responsible for fighting maladies hiding within the body’s own cells, such as viruses and cancer. Understanding how CTL work can aid the development of new vaccines and therapies against these diseases. CTL decide whether or not to terminate a surveilled cell based on the peptides presented by the MHC Class I (MHC-I) pathway on the cell’s surface. Because of this, understanding which peptides get presented and then go on to cause an immune reaction is of high importance.
This thesis consists of two parts. The first deals with the peptide presentation stage. Here we continue our work from last year, using a more efficient network and carrying out an ablation study that examines the usefulness of major histocompatibility complex (MHC) information, fixed amino acid (AA) encodings and structural MHC information for the peptide presentation task. We find that information about the MHC protein is highly valuable for the predictive model, although surprisingly good models can be trained even without it. We also find that providing the model with an extended MHC-I AA sequence, rather than the MHC-I pseudo sequence used in most current prediction models, did not have a beneficial impact. The same held true for providing structural information on the MHC-I protein or AA encodings based on biochemical properties. We conclude that the MHC-I pseudo sequence holds sufficient information for the prediction task and that there is sufficient data available for the network to learn meaningful AA embeddings on its own.
The second part of the thesis deals with the elicitation of an immune response by a peptide presented by the MHC-I pathway. Here, one of the main challenges is the scarcity of training data. Because of this, we trained a Variational Autoencoder (VAE) to generate new peptides and assessed their validity using three approaches.
First, we compared the lower-dimensional representations of the generated sequences to those of known immunogenic and non-immunogenic ones. Then we assessed them using a third-party immunogenicity predictor. Finally, we retrained a model on a dataset augmented with our newly generated sequences. We find that our generated sequences tend to perform better than those generated by two simple baselines, and on many measures come close to the actually observed immunogenic ones.
Student: Xiao Yang
This dissertation offers a detailed account of the early history of mind and intelligence studies at the University of Edinburgh. In contrast to the mainstream narrative focusing on the research tradition associated with the term Artificial Intelligence, this paper explores the terminology and boundaries of the discipline at a time when they were not yet established. I begin with an analysis of the informal natural philosophy discussion group formed in Edinburgh in the late 1950s. Next, I explore the subjects that different research groups focused on, including robotics in the Department of Machine Intelligence and Perception, experimental psychological theory in the School of Epistemics, and logic in the Metamathematics Unit. The archives of the University of Edinburgh Special Collections allow me to investigate the changes in personnel, funding, and efforts to form innovative cross-disciplinary research groups when both machine intelligence and cognitive science were present in one institution.
By employing previously unused archival material combined with bibliometric methods, I investigate how these disputes over terminology unfolded in research. I show that the divergence in both terminology and organisation influenced and confined the conduct of research under different terms. Furthermore, I provide a methodological reflection on traditional archive analysis and bibliometrics-based quantification: guided by specific historical questions, it may be more effective to place the two approaches in dialogue so that each supplies evidence that complements the other.
Student: Raman Dutt
Strong generalisation is a highly desirable property in neural networks, especially when they are employed in high-risk domains such as healthcare. A key determinant of generalisation is whether the network has the right invariances to relevant nuisance factors. Data augmentation has emerged as a practical way of imbuing invariances, leading to a few popular suites of augmentation operators that are widely successful on natural image benchmarks in computer vision.
However, a similar exploration is lacking for the medical image analysis domain. In this work, we first question whether the invariances learned on natural images are optimal for medical imaging tasks. Second, we propose two different strategies to find the right set of invariances for a given medical imaging dataset and ask whether these invariances differ on a per-dataset basis. Third, we investigate whether a common augmentation policy can benefit all medical datasets. We also extend our experiments to the low-data regime and study how our results change when working with a limited number of images. Our findings show that medical imaging tasks benefit from augmentation schemes different from those successful on common natural imaging benchmarks. Secondly, different medical imaging datasets benefit from different augmentation policies.
Finally, using a shared augmentation policy common to all medical datasets comes close to optimal performance and is better than adopting natural image augmentation schemes in most cases. This suggests that practitioners might want to deviate from general augmentation policies for medical tasks, and provides additional motivation for future work on efficient and reliable algorithms for task-specific augmentation learning. The corresponding code for this work is publicly available.
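The search for a dataset-specific policy can be pictured as scoring candidate combinations of augmentation operators by validation performance. A minimal sketch with entirely hypothetical operator names and a mock scoring function standing in for training a model under each policy:

```python
import itertools

# Candidate augmentation operators (names only; in practice each would be
# an image transform) -- all hypothetical, for illustration.
OPS = ["rotate", "flip", "contrast", "blur", "elastic", "gamma"]

def mock_validation_score(policy):
    # Mock scorer: pretend "rotate" + "gamma" suits this dataset, and that
    # stacking too many operators hurts. A real search would train and
    # validate a model per candidate policy.
    base = 0.70
    bonus = 0.05 * ("rotate" in policy) + 0.08 * ("gamma" in policy)
    penalty = 0.02 * max(0, len(policy) - 3)
    return base + bonus - penalty

def search_policy(n_ops=2):
    """Exhaustively score all n_ops-sized operator subsets."""
    best = max(itertools.combinations(OPS, n_ops), key=mock_validation_score)
    return best, mock_validation_score(best)

policy, score = search_policy()
print(policy, round(score, 2))   # -> ('rotate', 'gamma') 0.83
```

Exhaustive search is only feasible for tiny operator sets; the thesis's point about per-dataset policies is that whatever search one runs, its result differs across medical datasets.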
Student: Thibaut Goldsborough
Supervisors: Roly Megaw, Miguel O. Bernabeu
We show that the morphology of single cells dissociated from retina, bone marrow, choroid and colon tissues is predictive of molecular phenotypes such as DNA content and protein expression. We demonstrate this by using deep convolutional networks to predict a range of fluorescent markers indicative of molecular content. We find that morphological markers predictive of DNA content are partially shared across datasets, allowing generalist algorithms to predict markers on novel tissues. Finally, we introduce the use of self-supervised learning to extract meaningful morphological features from Imaging Flow Cytometry (IFC) datasets without requiring fluorescent stain labels during training. We show that trained deep learning models can be used to build morphological cell atlases that are indicative of cell molecular phenotypes.
Student: Yongshuo Zong
With machine learning now widely applied to the automatic diagnosis of medical images, fairness issues have been raised: machine learning models may be biased towards certain subgroups of people, e.g. giving lower diagnostic accuracy for female patients than for male patients. While many fairness-aware algorithms have been proposed to address the issue, there is no framework for benchmarking their performance on medical imaging. In this work, we evaluate a wide range of bias mitigation algorithms that aim to rebalance the subgroups or remove spurious correlations on medical imaging datasets.
Surprisingly, we find that most of the algorithms do not effectively address the fairness issue, and we deduce that confounding factors lead to different data distributions for different subgroups. In contrast, we find that methods that seek flat local minima during optimization, in order to generalize better to different distributions, can effectively improve both worst-case and overall performance; we measure the flatness to verify this conjecture. We perform both in-distribution and cross-domain evaluations to validate fairness in different settings, and perform statistical testing for rigorous comparisons. Notably, we trained more than 5000 models spanning 11 state-of-the-art fairness-promoting algorithms across 8 datasets with different sensitive attributes. Code will be made publicly available as a user-friendly framework for benchmarking fairness, which we hope will benefit both the machine learning and clinical communities.
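The worst-case metric referred to above is typically the accuracy of the worst-performing subgroup. A minimal sketch (with made-up toy data, not the benchmark's) of computing per-group, overall, and worst-group accuracy plus the subgroup gap:

```python
import numpy as np

# Per-sample correctness (1 = correct) and a sensitive attribute per
# sample, mimicking how subgroup fairness is scored. Toy values only.
correct = np.array([1, 1, 0, 1, 1, 0, 0, 1, 1, 1])
group   = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])  # e.g. 0 = male, 1 = female

def group_accuracies(correct, group):
    """Accuracy restricted to each subgroup of the sensitive attribute."""
    return {int(g): correct[group == g].mean() for g in np.unique(group)}

accs = group_accuracies(correct, group)
overall = correct.mean()          # 0.7
worst = min(accs.values())        # worst-case (worst-group) accuracy: 0.6
gap = max(accs.values()) - worst  # subgroup disparity: 0.2
print(accs, overall, worst, gap)
```

A bias mitigation method "improves the worst case" when it raises `worst` (and usually shrinks `gap`) without sacrificing `overall`.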