Expert-curated knowledge resources are helping to power an AI revolution in biocatalysis, with the emergence of new methods to predict enzyme structure and function from sequence and to design entirely new enzymes and pathways never seen in nature. In the first part of this talk we will explore some of the knowledge resources for biocatalysis developed in our group. These include the UniProt Knowledgebase (UniProtKB, at www.uniprot.org), a reference resource of protein sequences and functional annotation covering over 240 million protein sequences from all branches of the tree of life, and Rhea (www.rhea-db.org), an expert-curated knowledgebase of biochemical reactions based on the chemical ontology ChEBI (www.ebi.ac.uk/chebi/). AI methods have enormous potential for biocatalysis research, and indeed for almost all fields of scientific endeavour, but they ultimately rely on expert-curated knowledgebases like UniProtKB, Rhea, and others to provide a reliable ground truth for training and benchmarking. Biocuration resources are scarce, however, and we cannot cover the whole scientific literature. Large Language Models (LLMs) such as GPT-4 may provide one route to better scaling expert curation by automatically extracting structured knowledge from the scientific literature, but they are themselves prone to errors or “hallucinations”. In the second part of this talk we will look at the development of EnzChemRED, a new curated domain-specific literature dataset for LLMs and other NLP methods, which can boost the ability of LLMs to extract knowledge of enzyme functions from publications (https://arxiv.org/abs/2404.14209). These approaches may help us realize our ultimate goal, which is to capture the entire literature on enzyme functions in FAIR open knowledgebases like UniProtKB and Rhea.
Dr. Alan Bridge is director of the Swiss-Prot group at the SIB Swiss Institute of Bioinformatics. A biologist by training, he joined SIB in 2004 as a biocurator following post-doctoral studies at the Swiss Institute for Experimental Cancer Research (ISREC). He is a co-principal investigator of the UniProt resource of protein sequences and functional annotation, and principal investigator for the Rhea knowledgebase of biochemical reactions, the ENZYME resource for enzyme nomenclature, the SwissLipids knowledgebase for lipids and lipidomics, and the PROSITE and HAMAP resources for protein classification and annotation.