BUSCO completeness myths debunked, part 2

7 May 2025 · 36 min read

Martin Kollmar

The BUSCO completeness check assesses the completeness and quality of a genome, transcriptome or proteome assembly by searching for highly conserved single-copy orthologues that should be present in a given lineage. By providing datasets for specific taxonomic lineages, BUSCO helps assess evolutionary completeness. In Part 1, I showed that the gene set of a more general taxon is only partly contained in the gene set of the next, more specific taxon. For example, only part of the gene set of ‘vertebrates’ is also part of the gene set of ‘mammals’. In this Part 2, I take a closer look at the differences between the BUSCO completeness check of a genome assembly and the BUSCO completeness check of the genome annotation of that assembly.

Example gene set

The gene sets contain so-called refseq_db data: for each gene, a small number of sequence variants. These variants are intended to represent the divergence of the respective gene across all subtaxa of the selected lineage. The sequences of the variants are collected from GenBank. However, GenBank contains mainly predicted sequences, and therefore most sequences are missing exonic regions and contain false sequence regions (regions predicted as exonic that are actually intronic or intergenic). The example shows the alignment of the ten refseq_db members of a randomly selected gene from the laurasiatherian data set. One gene/protein is missing half of the 5'/N-terminal sequence, three genes/proteins contain the wrong (false-positive) sequence of a wrongly predicted 5'/N-terminal exon, two genes/proteins contain the wrong (false-positive) sequence of a wrongly predicted 3'/C-terminal exon, and two genes/proteins contain wrongly predicted sequence somewhere in the middle, probably due to the wrong identification of intron splice sites. To summarise, six of the ten sequence variants contain errors.

Genome assembly evaluation

To assess the completeness of a genome assembly, the refseq_db members of each gene set are mapped to the genome sequence. The mapping of true sequences (true positives) is specific, while the false sequence regions (false positives) are non-specific and can be mapped anywhere. The figure shows the gene structure reconstruction of the refseq_db members of the example gene set. For comparison, the true gene structure supported by RNA-Seq data is shown above. This example illustrates several features of the BUSCO approach. All genes from the refseq_db data are mapped to the genome, regardless of whether a variant is missing significant parts of the true sequence or contains long stretches of incorrect sequence. The ‘best’ hit according to the BUSCO evaluation of the mapping scores is the hit based on the shortest sequence in the gene set, the one in which the N-terminal half of the sequence is missing. A completeness of xx% therefore does not mean that xx% of the genes were found completely, but only that for xx% of the genes one of the variants was mapped. Whether this variant misses part of the correct sequence and/or contains an incorrect (false-positive) sequence region is irrelevant. Whether the most closely related variant was mapped completely or partially is also irrelevant. In the example, the 5' half of the gene could be missing in a fragmented assembly, and BUSCO would still classify this gene as completely present (‘complete single-copy gene’).

Figure: Mapping of the example BUSCO gene set against the genome assembly.
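The point that a completeness percentage counts genes with at least one mapped variant, and not how much of each gene was recovered, can be made concrete with a toy calculation. The sketch below is not BUSCO code: the class `VariantHit`, the scores and the `true_fraction` values are all invented for illustration, and the selection rule (highest score wins) is a simplifying assumption standing in for BUSCO's actual evaluation of the mapping scores.

```python
from dataclasses import dataclass


@dataclass
class VariantHit:
    gene: str             # BUSCO gene the variant belongs to
    variant: str          # label of the refseq_db member (made up here)
    score: float          # mapping score (made-up number)
    true_fraction: float  # share of the true gene sequence the hit covers


def best_hits(hits):
    """Keep the highest-scoring hit per gene (simplified selection rule)."""
    best = {}
    for h in hits:
        if h.gene not in best or h.score > best[h.gene].score:
            best[h.gene] = h
    return best


def completeness(hits, n_genes):
    """Percentage of genes for which at least one variant was mapped."""
    return 100.0 * len(best_hits(hits)) / n_genes


def mean_true_fraction(hits):
    """Average share of the true sequence covered by the selected hits."""
    best = best_hits(hits)
    return sum(h.true_fraction for h in best.values()) / len(best)


# Hypothetical data: for geneA the truncated variant scores best,
# although a full-length variant was mapped as well.
hits = [
    VariantHit("geneA", "truncated", 900.0, 0.5),
    VariantHit("geneA", "full_length", 850.0, 1.0),
    VariantHit("geneB", "full_length", 700.0, 1.0),
]

print(completeness(hits, n_genes=2))   # 100.0
print(mean_true_fraction(hits))        # 0.75
```

With these invented numbers the completeness is 100%, although the selected hits cover on average only 75% of the true gene sequences.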

Genome annotation evaluation

To assess the completeness of a genome annotation, the annotated protein sequences are compared to protein profiles created for each of the genes in a BUSCO lineage dataset. Let's take the example gene and check which parts of the sequence would be categorised as ‘complete’, ‘fragmented’ or ‘missing’. For simplicity, I have separated the complete sequence into its CDS parts and combined the CDS regions as if only the respective CDS regions had been identified during gene prediction. This eliminates additional effects due to misidentified intron splice sites. The analysis illustrates further features of the BUSCO approach. The protein profiles represent only the smallest consensus portion of the refseq_db members. A protein profile may not even represent one third of the actual sequence (see example). Accordingly, the annotated protein may contain more than two thirds of the true sequence, and yet the corresponding gene is classified as ‘missing’. Conversely, the annotated protein might contain only a short fragment of the true sequence, with most of it missing and/or with additional random sequence due to a wrong prediction, and the corresponding gene is still categorised as ‘complete’. The categorisation as ‘fragmented’ as opposed to ‘complete’ only means that a shorter part of the region contained in the protein profile is present in the annotated protein. A gene/protein categorised as ‘fragmented’ could in fact contain a much larger part of the true sequence than an annotated protein categorised as ‘complete’.

Figure: Completeness of parts of the example gene according to BUSCO scores.
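The decoupling between the category and the amount of true sequence can be sketched with simple interval arithmetic. This is a deliberately simplified model and not how BUSCO scores proteins: the coordinates, the helper names `overlap`, `classify` and `report`, and the thresholds `complete_min` and `fragment_min` are assumptions made up for illustration. The only thing taken from the text is the idea that the category depends on the profile region alone, which here spans the last third of a 900-residue protein.

```python
def overlap(a, b):
    """Length of the overlap between two half-open intervals (start, end)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def classify(annotated, profile, complete_min=0.9, fragment_min=0.3):
    """Categorise by coverage of the profile region only.

    The thresholds are hypothetical and chosen for illustration.
    """
    covered = overlap(annotated, profile) / (profile[1] - profile[0])
    if covered >= complete_min:
        return "complete"
    if covered >= fragment_min:
        return "fragmented"
    return "missing"


# Hypothetical coordinates in residues of the true protein.
TRUE_PROTEIN = (0, 900)
PROFILE = (600, 900)  # the profile represents only one third of the protein

annotations = {
    "long_but_off_profile": (0, 630),
    "short_on_profile": (600, 900),
    "long_partial_profile": (0, 780),
}


def report(annotations):
    """Map each annotation to (category, share of the true sequence)."""
    true_len = TRUE_PROTEIN[1] - TRUE_PROTEIN[0]
    return {
        name: (
            classify(span, PROFILE),
            round(overlap(span, TRUE_PROTEIN) / true_len, 2),
        )
        for name, span in annotations.items()
    }


for name, (category, true_share) in report(annotations).items():
    print(f"{name}: {category}, true sequence present: {true_share:.0%}")
```

In this toy setting the annotation with 70% of the true sequence comes out as ‘missing’, the one with 33% as ‘complete’, and the one with 87% as ‘fragmented’.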

Conclusion

BUSCO shows some value when used to compare different genome assemblies. Lower completeness indicates more fragmented assemblies. However, comparing assemblies with completeness differences in the single-digit percentage range requires more effort and more sophisticated methods. Comparing genome annotations with each other, or a genome annotation with a genome assembly, on the basis of BUSCO completeness is definitely insufficient. Certainly, most researchers would favour annotations where 70-80% of the gene sequences are present, even if the corresponding genes are classified as ‘fragmented’ or ‘missing’, over annotations where only 30-40% of the gene sequences are present but these are classified as ‘complete’. Using the best BUSCO hits as a target function for gene prediction is definitely a bad idea, as it will lead to more fragmented gene annotations.
