Abstract Detail

Comparative Genomics/Transcriptomics

Webster, Cynthia [1], Shrestha, Bikash [1], Zaman, Sumaira [1], Vuruputoor, Vidya [2], Bennett, Jeremy [1], Monyak, Daniel [3], Richter, Peter [1], Bhattarai, Akriti [4], Fetter, Karl [1], Wegrzyn, Jill [5].

EASEL (Efficient, Accurate, Scalable Eukaryotic modeLs), a tool for improvement of eukaryotic genome annotation.

The emergence of affordable high-throughput sequencing technologies has increased both the number and quality of eukaryotic genomes. Although contiguous reference genomes are increasingly accessible, efficient and accurate structural annotation of protein-coding genes remains a challenge. Predicting gene boundaries, and the associated translation initiation site (TIS), is difficult, especially for non-model species with minimal genomic resources. Existing programs struggle to predict less common gene structures (long introns, micro-exons), to find the preferred TIS location, and to distinguish pseudogenes. Plant genomes are especially difficult to annotate because of their larger size, which often reflects abundant repeats, pseudogenes, and polyploidy. As a result of this complexity, benchmarked approaches have yielded insufficient sensitivity and precision. We present EASEL (Efficient, Accurate, Scalable Eukaryotic modeLs), a genome annotation tool that leverages machine learning, RNA folding, and functional annotations to enhance gene prediction accuracy (https://gitlab.com/PlantGenomicsLab/easel-augustus-training). EASEL uses AUGUSTUS with parameters optimized for gene model prediction, incorporating extrinsic evidence from transcript and protein alignments. Specifically, EASEL aligns high-throughput short-read data (RNA-Seq) and assembles putative transcripts with StringTie2 and PsiCLASS. Reading frames are subsequently predicted with TransDecoder, using a gene family database (EggNOG) for refinement. Expressed Sequence Tag (EST) and protein hints are then generated by aligning the refined transcripts and protein sequences to the genome. The prepared hints are independently used to train AUGUSTUS, and the resulting predictions are combined into a single gene set with AGAT.
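As an illustration only, the hint-generation step can be pictured as turning the exon blocks of a spliced alignment into intron records for the gene predictor. The sketch below is not EASEL's code: the function name `intron_hints`, the `sketch` source label, the group name, and the priority value are hypothetical, and the nine-column layout with `grp`, `pri`, and `src` attributes follows the GFF-style hints format of AUGUSTUS as commonly documented, which should be checked against the AUGUSTUS documentation before use.

```python
from typing import List, Tuple


def intron_hints(
    seqid: str,
    strand: str,
    exon_blocks: List[Tuple[int, int]],
    source: str = "E",      # assumed source code: "E" for EST/transcript evidence
    priority: int = 4,      # hypothetical priority value
    group: str = "tx1",     # hypothetical group (transcript) name
) -> List[str]:
    """Derive intron hint lines from aligned exon blocks.

    Coordinates are 1-based and inclusive. Each gap between two
    consecutive exon blocks is reported as one intron record.
    """
    blocks = sorted(exon_blocks)
    lines = []
    for (_, left_end), (right_start, _) in zip(blocks, blocks[1:]):
        start, end = left_end + 1, right_start - 1
        if start > end:
            # Adjacent or overlapping blocks leave no intron between them.
            continue
        attributes = f"grp={group};pri={priority};src={source}"
        fields = [seqid, "sketch", "intron", str(start), str(end),
                  "0", strand, ".", attributes]
        lines.append("\t".join(fields))
    return lines


if __name__ == "__main__":
    for line in intron_hints("chr1", "+", [(100, 200), (301, 400), (501, 600)]):
        print(line)
```

Grouping records by transcript lets the predictor treat introns from the same alignment as related evidence; how EASEL itself weights or filters its hints is not stated in this abstract.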
The predicted gene structures are further refined by start site prediction via machine learning and by functional annotation via EnTAP. The machine learning step uses a random forest classifier with RNA folding structure, free energy, consensus regulatory elements, and primary sequence as features, with training data carefully derived from alignments of conserved orthologs (BUSCO). The result is a full-scale workflow that balances efficiency and accuracy to generate high-quality genome annotations.
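To make the feature description concrete, the following minimal sketch enumerates candidate start codons in a transcript and derives a few sequence-based features for each. It is a sketch under stated assumptions, not EASEL's implementation: the function name, window sizes, and feature names are hypothetical; a Kozak-like context (purine at -3, G at +4) stands in for "consensus regulatory elements", which the abstract does not specify; and RNA folding structure, free energy, and the random forest classifier itself are left out because they require external tools.

```python
from typing import Dict, List


def candidate_start_features(
    transcript: str,
    upstream: int = 6,     # hypothetical window size before the ATG
    downstream: int = 3,   # hypothetical window size after the ATG
) -> List[Dict[str, object]]:
    """Collect simple features around every ATG with a complete flanking window."""
    seq = transcript.upper()
    features = []
    for i in range(len(seq) - 2):
        if seq[i:i + 3] != "ATG":
            continue
        # Skip candidates whose flanking window runs off either end.
        if i < upstream or i + 3 + downstream > len(seq):
            continue
        window = seq[i - upstream:i + 3 + downstream]
        gc = (window.count("G") + window.count("C")) / len(window)
        features.append({
            "position": i,                       # 0-based index of the A in ATG
            "window": window,                    # primary-sequence context
            "gc": gc,                            # GC fraction of the window
            "purine_minus3": seq[i - 3] in "AG",  # assumed Kozak-like signal
            "g_plus4": seq[i + 3] == "G",         # assumed Kozak-like signal
        })
    return features


if __name__ == "__main__":
    for row in candidate_start_features("GGCGCCACCATGGCGTAAATGC"):
        print(row)
```

Rows like these, labeled from alignments to conserved orthologs, are the kind of table a random forest classifier could be trained on; the actual feature encoding used by EASEL may differ.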

Related Links:
EASEL Git

1 - University of Connecticut, EEB, 75 N Eagleville Rd, Storrs, CT, 06269, USA
2 - University of Connecticut, Department of Ecology and Evolutionary Biology, 75 N. Eagleville Road, Unit 3043, Storrs, CT, 06269-3043, United States
3 - University of Connecticut, EEB, 75 N Eagleville Rd, Storrs, CT, 06269, United States
4 - University of Connecticut, Ecology and Evolutionary Biology, 2152 Hillside Road, Unit 3046, Gant 401W, Storrs, CT, 06269, United States
5 - University of Connecticut, EEB, 67 N. Eagleville Road, Unit 3124, Storrs, CT, 06269, United States

Keywords:
none specified

Presentation Type: Oral Paper
Number: CGT3003
Abstract ID: 721
Candidate for Awards: Margaret Menzel Award