| Abstract Detail
Molecular Ecology

Rosen, Austin Marshall [1].

Paralogy or reality?: Exploring gene assembly errors in a target enrichment dataset.

De novo gene assembly of short-read data is inherently difficult and is often compared to assembling a jigsaw puzzle. These difficulties can be exacerbated when the DNA is isolated from herbarium tissue, which is more degraded than fresh tissue. This case study describes four errors that occurred during the assembly of herbariomic target enrichment data in the Cirsium mohavense species complex (Asteraceae): inconsistent contig selection, artificial recombination, over-alignment, and inconsistent intron determination. These errors occurred in a significant portion of the dataset and were often a by-product of “undetected paralogs”: loci that likely contained paralogous sequences but did not trigger the default paralog warnings of the assembly program, HybPiper. Default thresholds for identifying paralogy during the assembly process were insufficient for filtering such loci. The resulting gene assemblies were highly problematic, often including likely non-orthologous loci and loci that potentially represent no true ortholog. Phylogenetic analysis and species tree inference of this dataset showed two fundamental problems: the ingroup did not form a clade, and redundant samples in the dataset did not resolve as sister. Following a potential solution for dealing with paralogs outlined in the HybPiper documentation, a custom target file was created in which putative paralogs were included as unique loci. The utility of this custom target file for reducing the proportion of assembly errors in the dataset is explored. Next, a final iteration of quality control was performed to create a dataset likely free of assembly errors. Phylogenetic analyses and species tree inferences were compared between the original dataset and the final dataset using several metrics. Finally, the new dataset is assessed for its ability to resolve the two fundamental problems identified in the original phylogenetic analysis.
1 - Colorado State University, Biology, 251 W Pitkin St., Fort Collins, CO, 80521, USA
Keywords: herbariomics; molecular ecology; gene assembly; bioinformatics; systematics; phylogenetics; Cirsium; Target Enrichment; Hyb-Seq; short read data; paralogy; artificial recombination; cryptic paralogs.
Presentation Type: Oral Paper; Number: ME3002; Abstract ID: 255; Candidate for Awards: None |