Building pangene sets from plant genome alignments confirms presence-absence variation

bioRxiv (2023)

Abstract
Consistent gene annotation in crops is becoming harder as genomes for new cultivars are frequently published. Gene sets from recently sequenced accessions have different gene identifiers to those on the reference accession, and might be of higher quality due to technical advances. For these reasons there is a need to define pangenes, which represent all known syntenic orthologues for a gene model and can be linked back to the original sources. A pangene set effectively summarizes our current understanding of the coding potential of a crop and can be used to inform gene model annotation in new cultivars. Here we present an approach (get_pangenes) to identify and analyze pangenes that is not biased towards the reference annotation. The method involves computing Whole Genome Alignments (WGA), which are used to estimate gene model overlaps. After a benchmark on Arabidopsis, rice, wheat and barley datasets, we find that two different WGA algorithms (minimap2 and GSAlign) produce similar pangene sets. Our results show that pangenes recapitulate known phylogeny-based orthologies while adding extra core gene models in rice. More importantly, get_pangenes can also produce clusters of genome segments (gDNA) that overlap with gene models annotated in other cultivars. By lifting-over CDS sequences, gDNA clusters can help refine gene models across individuals and confirm or reject observed gene Presence-Absence Variation. Documentation and source code are available at .

Core ideas

1. Whole Genome Alignments capture overlapping gene models and genome segments.
2. A pangene represents homologous collinear gene models from different gene sets.
3. Lift-over can be used to refine gene models and to confirm gene Presence-Absence Variation.

### Competing Interest Statement

Paul Flicek is a member of the Scientific Advisory Boards of Fabric Genomics, Inc. and Eagle Genomics, Ltd.
* ACK: Ancestral Crucifer Karyotype
* ANI: Average Nucleotide Identity
* PAV: Presence Absence Variation
* WGA: Whole Genome Alignment
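
The abstract describes grouping gene models by their overlap in aligned genomes and then checking which cultivars contribute to each group. The following is a minimal Python sketch of that general idea, not the get_pangenes implementation. It assumes gene coordinates have already been projected onto a shared coordinate frame via a WGA, uses half-open intervals, and picks an arbitrary 0.5 overlap threshold; the names `overlap_fraction`, `cluster_pangenes`, `occupancy` and all gene identifiers are hypothetical.

```python
from itertools import combinations


def overlap_fraction(a, b):
    """Fraction of the shorter interval covered by the intersection of a and b.

    Intervals are (chrom, start, end), half-open, in a shared coordinate frame.
    """
    (chrom_a, start_a, end_a), (chrom_b, start_b, end_b) = a, b
    if chrom_a != chrom_b:
        return 0.0
    shared = min(end_a, end_b) - max(start_a, start_b)
    if shared <= 0:
        return 0.0
    return shared / min(end_a - start_a, end_b - start_b)


def cluster_pangenes(models, min_overlap=0.5):
    """Group gene models from different cultivars that overlap (union-find).

    models: dict gene_id -> (cultivar, (chrom, start, end)).
    min_overlap is an illustrative threshold, not a value from the paper.
    """
    parent = {gene: gene for gene in models}

    def find(gene):
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]  # path halving
            gene = parent[gene]
        return gene

    for g1, g2 in combinations(models, 2):
        (cult1, iv1), (cult2, iv2) = models[g1], models[g2]
        if cult1 != cult2 and overlap_fraction(iv1, iv2) >= min_overlap:
            parent[find(g1)] = find(g2)

    clusters = {}
    for gene in models:
        clusters.setdefault(find(gene), set()).add(gene)
    return list(clusters.values())


def occupancy(cluster, models):
    """Set of cultivars that contribute a gene model to this cluster."""
    return {models[gene][0] for gene in cluster}


# Toy input: three hypothetical cultivars A, B, C on one chromosome.
models = {
    "A.g1": ("A", ("chr1", 100, 500)),
    "B.g7": ("B", ("chr1", 120, 520)),
    "C.g3": ("C", ("chr1", 90, 480)),
    "A.g2": ("A", ("chr1", 1000, 1400)),
    "C.g9": ("C", ("chr1", 1010, 1390)),
}
all_cultivars = {cultivar for cultivar, _ in models.values()}
clusters = cluster_pangenes(models)
core = [c for c in clusters if occupancy(c, models) == all_cultivars]
candidate_pav = [c for c in clusters if occupancy(c, models) != all_cultivars]
```

In this toy input, the cluster found in all three cultivars counts as core, while the cluster missing from cultivar B is only a candidate for Presence-Absence Variation. As the abstract notes, such a candidate would still need checking against the genome segment itself (for example by lifting over CDS sequences) before the absence is confirmed or rejected.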