BUSCO completeness myths debunked, part 5

26 Nov 2025 · 24 min read

Martin Kollmar

The BUSCO completeness check assesses the completeness and quality of a genome, transcriptome or proteome assembly by searching for highly conserved single-copy orthologues that should be present in a given lineage. By providing datasets for specific taxonomic lineages, BUSCO helps assess evolutionary completeness. In previous parts I discussed some differences between the BUSCO completeness check of a genome assembly and the genome annotation of that assembly, and showed examples of how the “reference” genes in the BUSCO data inflate completeness. Here, I will give you some insights into the frequently observed higher rate of duplicates in annotations compared to assemblies.

While the terms ‘complete’, ‘duplicated’, ‘fragmented’ and ‘missing’ imply clear, separate categories, there is no definition of what constitutes a complete, fragmented or missing gene other than the cut-offs applied. In fact, there is a large grey area between these categories. The cut-offs have changed between software versions, and they differ in part between BUSCO and compleasm. The latest versions of BUSCO and compleasm base their final analysis and categorisation on exactly the same raw data, and yet they report different results. Although the cut-offs used to distinguish, for example, complete from fragmented genes are identical in assembly mode and annotation mode, each mode uses a different reference to define what constitutes a full-length gene. As a result, comparing the proportions of complete and fragmented BUSCOs between the two modes is essentially comparing apples to oranges.

What about the category ‘complete’ versus ‘duplicated’? The difference follows from the concept behind the BUSCO reference data, the groups of orthologous genes. A BUSCO orthologue group is expected to have exactly one copy per genome (based on evolutionary conservation across the lineage). If only one complete copy of a BUSCO gene is found, it is counted as ‘complete - single copy’, and if more than one complete copy is found, the gene is categorised as ‘complete - duplicated’. The ‘duplicated’ category therefore covers every gene found in more than one copy, whether there are two copies or twenty. The general procedure for determining the presence/absence and completeness/fragmentation of a gene is identical in the two modes, assembly and annotation. In assembly mode, a fast genome annotation is performed and a protein FASTA dataset is generated. In annotation mode, the protein FASTA dataset is usually provided by an advanced annotation pipeline. In both modes, the protein FASTA dataset is then compared to the same HMM profiles, generated from the BUSCO orthologue groups, using the hmmsearch tool from the HMMER software suite.
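The counting rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the BUSCO or compleasm code: the function name `classify_busco` and its two inputs are hypothetical, and the sketch assumes that the score and length cut-offs have already decided which hits count as complete and which as fragmented.

```python
def classify_busco(n_complete: int, n_fragmented: int) -> str:
    """Assign one BUSCO orthologue group to a category.

    n_complete and n_fragmented are the numbers of hits that an upstream
    step (assumed here, not shown) has already labelled as complete or
    fragmented after applying the cut-offs.
    """
    if n_complete == 1:
        return "complete - single copy"
    if n_complete > 1:
        # Any copy number above one lands in the same category.
        return "complete - duplicated"
    if n_fragmented > 0:
        return "fragmented"
    return "missing"


if __name__ == "__main__":
    for counts in [(1, 0), (2, 0), (7, 1), (0, 3), (0, 0)]:
        print(counts, "->", classify_busco(*counts))
```

Note that a single additional hit that passes the cut-offs is enough to move a gene from ‘complete - single copy’ to ‘complete - duplicated’.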

However, the way the hmmsearch tool is executed differs between the two modes, and this distinction applies to both BUSCO and compleasm. In assembly mode, the protein FASTA dataset is passed to hmmsearch as a continuous data stream. As a result, hmmsearch has no prior knowledge of how many sequences it will receive or when the stream will end. This streaming approach is designed to handle extremely large datasets efficiently and to minimise resource use. Consequently, when hmmsearch identifies a highly significant match, it adds it to the list of reported hits, while weaker matches are discarded to conserve computational resources and avoid redundant output. The selection is guided by internal heuristics. In annotation mode (also called protein mode), the full protein FASTA dataset is provided to hmmsearch at once, so the full scope of the search space is known from the start. As a consequence, all hits, including the weaker ones, are reported in the output.

Because hmmsearch is executed differently in the two modes, the HMMER output files from assembly mode include only the most significant hits, whereas the output files from protein mode also retain weaker matches. Many of these weaker hits still exceed the BUSCO score thresholds (used in both BUSCO and compleasm), which is why analyses run in protein mode systematically report a higher number of duplicates than those run in assembly mode.
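To see how retained weaker hits translate into extra duplicates, the following Python sketch counts, per BUSCO profile, the hits whose score reaches the cut-off. It rests on several assumptions: the input is whitespace-separated text in the style of HMMER's `--tblout` table (sequence name in the first column, profile name in the third, full-sequence score in the sixth), the example rows are truncated after the bias column, and all names, scores and the cut-off are invented for illustration. The output format and the additional length filters actually used by BUSCO or compleasm may differ.

```python
from collections import defaultdict


def parse_tblout(text):
    """Yield (sequence, profile, score) from tblout-style text."""
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split()
        yield fields[0], fields[2], float(fields[5])


def count_passing_hits(text, score_cutoffs):
    """Count hits per profile whose score reaches that profile's cut-off."""
    counts = defaultdict(int)
    for _sequence, profile, score in parse_tblout(text):
        # Profiles without a known cut-off never pass.
        if score >= score_cutoffs.get(profile, float("inf")):
            counts[profile] += 1
    return dict(counts)


# Hypothetical cut-off and hit tables (values invented for illustration).
CUTOFFS = {"BUSCO_A": 100.0}

ASSEMBLY_LIKE = """\
# target  acc  query  acc  E-value  score  bias
prot1 - BUSCO_A - 1e-100 350.2 0.1
"""

PROTEIN_LIKE = """\
# target  acc  query  acc  E-value  score  bias
prot1 - BUSCO_A - 1e-100 350.2 0.1
prot2 - BUSCO_A - 1e-30 120.5 0.3
prot3 - BUSCO_A - 1e-05 40.0 0.2
"""

if __name__ == "__main__":
    print("assembly-like:", count_passing_hits(ASSEMBLY_LIKE, CUTOFFS))
    print("protein-like: ", count_passing_hits(PROTEIN_LIKE, CUTOFFS))
```

In this toy example the profile has one passing hit in the first table and two in the second, because the weaker hit at 120.5 still clears the cut-off of 100.0 while the hit at 40.0 does not. With the same cut-off, the first table would yield a single-copy call and the second a duplicated call.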

Conclusion

In summary, the substantially higher rate of duplicates observed in protein mode compared to assembly mode arises from technical differences in how hmmsearch is executed, rather than from any underlying biological cause.
