Hybracter: Enabling Scalable, Automated, Complete and Accurate Bacterial Genome Assemblies

Microbial Genomics (2024)

Abstract
Improvements in the accuracy and availability of long-read sequencing mean that complete bacterial genomes are now routinely reconstructed using hybrid (i.e. short- and long-read) assembly approaches. Complete genomes allow a deeper understanding of bacterial evolution and genomic variation beyond single nucleotide variants (SNVs). They are also crucial for identifying plasmids, which often carry medically significant antimicrobial resistance (AMR) genes. However, small plasmids are often missed or misassembled by long-read assembly algorithms. Here, we present Hybracter, which allows for the fast, automatic, and scalable recovery of near-perfect complete bacterial genomes using a long-read first assembly approach. Hybracter can be run either as a hybrid assembler or as a long-read only assembler. We compared Hybracter to existing automated hybrid and long-read only assembly tools using a diverse panel of samples of varying levels of long-read accuracy with manually curated ground truth reference genomes. We demonstrate that Hybracter as a hybrid assembler is more accurate and faster than the existing gold standard automated hybrid assembler Unicycler. We also show that Hybracter with long-reads only is the most accurate long-read only assembler and is comparable to hybrid methods in accurately recovering small plasmids.

Data Summary

1. Hybracter is developed using Python and Snakemake as a command-line software tool for Linux and macOS systems.
2. Hybracter is freely available under an MIT License on GitHub () and the documentation is available at Read the Docs ().
3. Hybracter is available to install via PyPI () and Bioconda (). A Docker/Singularity container is also available at .
4. All code used to benchmark Hybracter, including the reference genomes, is publicly available on GitHub () with released DOI () available at Zenodo.
5. The subsampled FASTQ files used for benchmarking are publicly available at Zenodo with DOI ().
6. All super accuracy simplex ATCC FASTQ reads sequenced as a part of this study can be found under BioProject PRJNA1042815.
7. All Hall et al. fast accuracy simplex and super accuracy duplex ATCC FASTQ read files (prior to subsampling) can be found in the SRA under BioProject PRJNA1087001.
8. All raw Lerminiaux et al. FASTQ read files and genomes (prior to subsampling) can be found in the SRA under BioProject PRJNA1020811.
9. All Staphylococcus aureus JKD6159 FASTQ read files and genomes can be found under BioProject PRJNA50759.
10. All Mycobacterium tuberculosis H37R2 FASTQ read files and genomes can be found under BioProject PRJNA836783.
11. The complete list of BioSample accession numbers for each benchmarked sample can be found in Supplementary Table 1.
12. The benchmarking assembly output files are publicly available on Zenodo with DOI ().
13. All Pypolca benchmarking outputs and code are publicly available on Zenodo with DOI ().

Impact Statement

Complete bacterial genome assembly using hybrid sequencing is a routine and vital part of bacterial genomics, especially for identification of mobile genetic elements and plasmids. As sequencing becomes cheaper, easier to access and more accurate, automated assembly methods are crucial. With Hybracter, we present a new long-read first automated assembly tool that is faster and more accurate than the widely used Unicycler. Hybracter can be used both as a hybrid assembler and with long-reads only. Additionally, it solves the problem of long-read assemblers struggling with small plasmids, with plasmid recovery from long-reads only performing on par with hybrid methods. Hybracter can natively exploit the parallelisation of high-performance computing (HPC) clusters and cloud-based environments, enabling users to assemble hundreds or thousands of genomes with one line of code. Hybracter is freely available as source code on GitHub and via Bioconda or PyPI.

Competing Interest Statement

The authors have declared no competing interest.