
BUSCO completeness myths debunked, part 5

26 Nov 2025 · 24 min read
Martin Kollmar

The BUSCO completeness check assesses the completeness and quality of a genome, transcriptome or proteome assembly by searching for highly conserved single-copy orthologues that should be present in a given lineage. By providing datasets for specific taxonomic lineages, BUSCO helps assess evolutionary completeness. In previous parts I discussed some differences between the BUSCO completeness check of a genome assembly and the genome annotation of that assembly, and showed examples of how the “reference” genes in the BUSCO data inflate completeness. Here, I will give you some insights into the frequently observed higher rate of duplicates in annotations compared to assemblies.

While the terms ‘complete’, ‘duplicated’, ‘fragmented’ and ‘missing’ imply clear, separate categories, nothing defines what constitutes a complete, fragmented or missing gene apart from the cut-offs applied. In fact, there is a large grey area between these categories: the cut-offs have changed between software versions and differ in part between BUSCO and compleasm. The latest versions of BUSCO and compleasm base their final analysis and categorisation on exactly the same raw data, and yet their results differ. Although the cut-offs used to distinguish, for example, complete from fragmented genes are identical in assembly and annotation mode, each mode uses a different reference to define what constitutes a full-length gene. As a result, comparing the proportions of complete and fragmented BUSCOs between the two modes is essentially comparing apples to oranges.

What about the categories ‘complete - single copy’ and ‘complete - duplicated’? The distinction follows from the concept behind the BUSCO reference data, the groups of orthologous genes. A BUSCO ortholog group is expected to have exactly one copy per genome (based on evolutionary conservation across the lineage). If exactly one complete copy of a BUSCO gene is found, it is counted as ‘complete - single copy’; if more than one complete copy is found, the gene is categorised as ‘complete - duplicated’. Duplicates therefore encompass all genes present in multiple copies, regardless of copy number. The general procedure for determining the presence/absence and completeness/fragmentation of a gene is identical in the two modes, assembly and annotation. In assembly mode, a fast genome annotation is performed and a protein FASTA dataset is generated from it. In annotation mode, the protein FASTA dataset is usually provided by an advanced annotation pipeline. In both modes, the protein FASTA dataset is then compared to the same HMM profiles, generated from the BUSCO ortholog groups, using the hmmsearch tool from the HMMER software suite.
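The counting rule described above can be written down as a minimal Python sketch. It assumes that each hit has already been judged complete or fragmented by the cut-offs, so only the copy numbers matter here. The function names `classify_busco` and `summarise`, and the example counts, are hypothetical and are not taken from BUSCO or compleasm.

```python
from collections import Counter


def classify_busco(n_complete: int, n_fragmented: int = 0) -> str:
    """Assign one BUSCO ortholog group to a category from its copy numbers.

    Assumption: the complete/fragmented decision for each hit was made
    beforehand by the score and length cut-offs, which are not modelled here.
    """
    if n_complete == 1:
        return "complete - single copy"
    if n_complete > 1:
        # Any copy number above one counts as duplicated.
        return "complete - duplicated"
    if n_fragmented > 0:
        return "fragmented"
    return "missing"


def summarise(hit_counts: dict) -> Counter:
    """Tally categories over a mapping busco_id -> (n_complete, n_fragmented)."""
    return Counter(classify_busco(c, f) for c, f in hit_counts.values())


# Made-up counts for four hypothetical BUSCO groups.
example = {
    "busco_A": (1, 0),  # one complete copy
    "busco_B": (3, 0),  # three complete copies, still one 'duplicated' BUSCO
    "busco_C": (0, 1),  # only a fragmented hit
    "busco_D": (0, 0),  # no hit at all
}
summary = summarise(example)
for category, count in summary.items():
    print(f"{category}: {count}")
```

Note that a group with three complete copies contributes a single ‘complete - duplicated’ BUSCO, exactly like a group with two copies.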

However, the way hmmsearch is executed differs between the two modes, and this applies to both BUSCO and compleasm. In assembly mode, the protein FASTA dataset is passed to hmmsearch as a continuous data stream, so hmmsearch has no prior knowledge of how many sequences it will receive or when the stream will end. This streaming approach is designed to handle extremely large datasets efficiently and to minimise resource use. Consequently, when hmmsearch identifies a highly significant match, it adds it to the list of reported hits, while weaker matches are discarded according to internal heuristics to conserve computational resources and avoid redundant output. In annotation mode (also referred to as protein mode), the full protein FASTA dataset is provided to hmmsearch as a whole, so the complete search space is known from the start. As a consequence, all hits, including weaker ones, are reported in the output.

Because hmmsearch is executed differently in the two modes, the HMMER output files from assembly mode include only the most significant hits, whereas those from protein mode also retain weaker matches. Many of these weaker hits still exceed the BUSCO score thresholds (used by both BUSCO and compleasm), which is why analyses run in protein mode systematically report more duplicates than those run in assembly mode.
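To make this reasoning concrete, here is a toy Python sketch. It assumes made-up scores and cut-offs, and it models the assembly-mode output in the simplest possible way, by keeping only the best-scoring hit per BUSCO. This is an illustration of the argument, not a description of what hmmsearch, BUSCO or compleasm do internally. All identifiers (`count_duplicated`, `busco_A`, `prot1`, ...) are hypothetical.

```python
from collections import defaultdict


def count_duplicated(hits, cutoffs):
    """Count BUSCOs with more than one distinct protein scoring above the cut-off.

    hits:    iterable of (busco_id, protein_id, score) tuples
    cutoffs: dict mapping busco_id -> minimum score (made-up values here)
    """
    passing = defaultdict(set)
    for busco_id, protein_id, score in hits:
        if score >= cutoffs[busco_id]:
            passing[busco_id].add(protein_id)
    return sum(1 for proteins in passing.values() if len(proteins) > 1)


# Hypothetical score cut-offs for two BUSCO groups.
cutoffs = {"busco_A": 100.0, "busco_B": 150.0}

# 'Protein mode' in this toy model: every hit is reported.
all_hits = [
    ("busco_A", "prot1", 480.0),  # strong hit
    ("busco_A", "prot2", 130.0),  # weaker hit, still above the cut-off
    ("busco_B", "prot3", 620.0),  # strong hit
    ("busco_B", "prot4", 90.0),   # weak hit, below the cut-off
]

# 'Assembly mode' in this toy model: only the best hit per BUSCO survives.
best = {}
for hit in all_hits:
    busco_id, _, score = hit
    if busco_id not in best or score > best[busco_id][2]:
        best[busco_id] = hit
top_hits_only = list(best.values())

print("duplicated, all hits reported:", count_duplicated(all_hits, cutoffs))
print("duplicated, top hits only:   ", count_duplicated(top_hits_only, cutoffs))
```

With the same proteins and the same cut-offs, `busco_A` is counted as duplicated only when the weaker hit `prot2` is present in the output, which mirrors the difference between the two modes.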

Conclusion

In summary, the substantially higher rate of duplicates observed in protein mode compared to assembly mode arises from technical differences in how hmmsearch is executed, rather than from any underlying biological cause.
