ANC Workshop - Antoine Lain

Tuesday, 27th September 2022

Dynamic biomedical corpus generation and machine annotation: a case study of Autism Spectrum Disorder

Abstract: I will present a dynamic computational pipeline for the generation of biomedical text corpora and the subsequent identification of embedded biomedical concepts using named entity recognition. I will first introduce Cadmus, a Python package that can perform full-text document retrieval in a customisable manner to generate biomedical corpora from the published research literature. I will then introduce ParallelPyMetaMap, a Python wrapper that enables flexible and parallelisable use of the NIH-NLM MetaMap tool. I will illustrate the utility of the pipeline by generating a full-text corpus describing the published literature for Autism Spectrum Disorders (ASD); Cadmus successfully retrieved 57,635/69,590 full-text publications (82.8%) related to ASD and indexed in NCBI PubMed as of March 2022. ParallelPyMetaMap identified 200,205 unique biomedical entities from 127 distinct semantic groups when applied to the ASD corpus. I will discuss the resulting annotated ASD corpus and the utility of a literature-driven approach to structuring biomedical domain knowledge for ASD.

Event type: Workshop

Date: Tuesday, 27th September 2022

Time: 11:00

Location: Online (please see email for Collaborate link)

Speaker(s): Antoine Lain

Chair/Host: Angus Chadwick