ANC Workshop - Antoine Lain
Tuesday, 27th September 2022
Dynamic biomedical corpus generation and machine annotation: a case study of Autism Spectrum Disorder
Abstract: I will present a dynamic computational pipeline for generating biomedical text corpora and then identifying the biomedical concepts embedded in them using named entity recognition. I will first introduce Cadmus, a Python package that performs customisable full-text document retrieval to generate biomedical corpora from the published research literature. I will then introduce ParallelPyMetaMap, a Python wrapper that enables flexible and parallelisable use of the NIH-NLM MetaMap tool. I will illustrate the utility of the pipeline by generating a full-text corpus of the published literature on Autism Spectrum Disorder (ASD). Cadmus successfully retrieved full text for 57,635 of the 69,590 publications (82.8%) related to ASD and indexed in NCBI PubMed as of March 2022. When applied to the ASD corpus, ParallelPyMetaMap identified 200,205 unique biomedical entities from 127 distinct semantic groups. I will discuss the resulting annotated ASD corpus and the utility of a literature-driven approach to structuring biomedical domain knowledge for ASD.
Event type: Workshop
Date: Tuesday, 27th September 2022
Time: 11:00
Location: Online (please see email for Collaborate link)
Speaker(s): Antoine Lain
Chair/Host: Angus Chadwick