Huge amounts of biological data generated from plant genomes can now be more rapidly assessed alongside literature describing plant genes thanks to new tools developed by CSB and CS researchers.
Prof. Nicholas Provart founded the globally recognized Bio-Analytic Resource (BAR) server 20 years ago after seeing the need to collate plant molecular data and provide clear visual summaries in response to researchers’ queries. Until now, users had to apply their own interpretations to the data, but Provart and his team have added a new layer to the BAR, called Gaia.
Gaia collates the interpretations of those who generated the data, summarizing the derived gene models and gene annotation using the latest techniques in machine learning and generative AI.
This curated and processed collection was judged important enough to be included in the Web Server issue of Nucleic Acids Research as “20 years of the Bio-Analytic Resource for Plant Biology”.
Alex Sullivan built the Gaia module for BAR based on data processed by Michael Lombardo, Emma Zhuang, and Ashley Christendat.
Provart acknowledges, “We were dependent on the open science movement, which freely provides all data in a clearly organized manner. Because of this initiative, Gaia can produce high quality, user-friendly output.”
Lombardo developed the GeneNet machine learning method to identify genetic model figures in 67,291 Arabidopsis papers from the PubMed Central database at the NIH.
GeneNet scans journal graphics with a classification network to determine which figures are likely genetic models, with a success rate of around 95%; some of the figures it flags as positive are protein-protein interaction networks rather than genetic models.
Optical character recognition (OCR) was then used to extract all the genes mentioned in those figures, and the extracted genes were aggregated and linked in the Gaia database.
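The aggregation step described above can be illustrated with a minimal Python sketch. It is not the team's code: it assumes, purely for illustration, that genes appear in the OCR text as Arabidopsis AGI locus codes (for example AT1G01010), whereas real figures often use gene symbols that would need a separate lookup. The function names, paper identifiers and OCR strings are all hypothetical.

```python
import re
from collections import defaultdict

# Assumption for this sketch: genes are written as AGI locus codes,
# e.g. AT1G01010 (chromosomes 1-5, plus C and M for organelle genomes).
AGI_PATTERN = re.compile(r"\bAT[1-5CM]G\d{5}\b", re.IGNORECASE)


def extract_genes(ocr_text):
    """Return the sorted, upper-cased, de-duplicated AGI codes in OCR text."""
    return sorted({match.upper() for match in AGI_PATTERN.findall(ocr_text)})


def build_gene_index(figures):
    """Map each gene to the set of (paper_id, figure_id) pairs mentioning it."""
    index = defaultdict(set)
    for paper_id, figure_id, ocr_text in figures:
        for gene in extract_genes(ocr_text):
            index[gene].add((paper_id, figure_id))
    return dict(index)


# Hypothetical OCR output from two figures (placeholder paper identifiers).
figures = [
    ("PAPER_A", "Fig. 3", "At1g01010  AT1G01020  at1g01010"),
    ("PAPER_B", "Fig. 1", "AT1G01020 and AT1G01030"),
]
gene_index = build_gene_index(figures)
```

Here `gene_index["AT1G01020"]` holds both figures, which is the kind of gene-to-figure link a database layer could then serve to users.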

As part of their undergraduate projects, Christendat and Zhuang used generative AI to analyze entries in the Singapore-based PlantConnectome, a GPT-3.5-based summary of more than 100,000 Arabidopsis abstracts linked on a gene-by-gene basis across publications.
Crucially, PlantConnectome maintained the association between each piece of extracted information and the paper it was derived from by submitting abstracts one by one for summarization.
Christendat and Zhuang took this machine-readable database and used skillful prompt engineering with the Llama3 LLM to format the data into a human-readable form.
Llama3 was only usable with modifications made by the team. One of their challenges was to ensure the correct journal references were included in each summary.
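One way to picture the reference-checking challenge is the Python sketch below. It is an assumption-laden illustration, not the team's method: each statement is paired with a source tag in the prompt, and the returned summary is checked for any bracketed citation that was never supplied. The prompt wording, the `[REF…]` tag format and both function names are hypothetical, and no actual LLM call is shown.

```python
import re

# Matches a bracketed tag such as [REF1]; the tag format is an assumption.
CITATION = re.compile(r"\[([^\[\]]+)\]")


def build_prompt(gene, facts):
    """Build a prompt from (statement, source_id) pairs, keeping each
    statement attached to the source it came from."""
    lines = [f"- {statement} [{source_id}]" for statement, source_id in facts]
    return (
        f"Summarize what is known about {gene} in plain English.\n"
        "Cite the bracketed source after every claim, "
        "and use only these sources:\n" + "\n".join(lines)
    )


def unknown_citations(summary, facts):
    """Return citations in the summary that were not among the inputs."""
    allowed = {source_id for _, source_id in facts}
    cited = set(CITATION.findall(summary))
    return sorted(cited - allowed)


# Placeholder facts; gene names and references are invented for illustration.
facts = [
    ("GENE_X interacts with GENE_Y", "REF1"),
    ("GENE_X is expressed in roots", "REF2"),
]
prompt = build_prompt("GENE_X", facts)
```

A summary that cites `[REF9]` would be flagged by `unknown_citations`, so it could be regenerated or reviewed before being shown to users.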
“It was fascinating to apply my skills to extract meaningful biological data using the latest technology,” enthuses Christendat, a third-year student in Computer Science.
BAR is recognized as a Global Core Biodata Resource with two decades of reliable operation through continuous improvements and additions. This new addition shows that BAR will continue to thrive and advance plant science for years to come.