BUSCO assesses the completeness and quality of a genome, transcriptome, or proteome assembly by searching for highly conserved single-copy orthologues that should be present in a given lineage. By providing datasets for specific taxonomic lineages, BUSCO helps assess evolutionary completeness. In a previous post I discussed the much higher number of fragments in protein mode compared to assembly mode, which results from the two modes using different references to define a full-length gene.
The intuitive understanding of the term “complete” when applied to a gene is that the identified gene candidate is biologically intact, structurally whole, and functionally plausible. Minor truncations at the termini may be acceptable, but overall the gene is expected to represent the full functional unit. BUSCO is widely used to assess genome assembly completeness based on the presence of lineage-specific genes, and a “complete” BUSCO is therefore commonly interpreted as evidence that the corresponding gene locus is present and largely intact in the assembly.
However, this interpretation does not reflect how “complete” is defined in BUSCO’s assembly mode. In this mode, the reference is not the gene locus in the genome but the BUSCO reference sequence itself. All sequences in the BUSCO reference datasets are, by definition, treated as complete, regardless of whether they represent only a fragment of the true full-length gene, lack essential regions, or include unrelated sequence due to misannotation. As a result, “completeness” in assembly mode refers only to how well the reference sequence can be mapped, not to the biological integrity of the gene. I will discuss the substantial variability and limitations of the BUSCO reference data in more detail in a future newsletter.
Beyond these large-scale differences between BUSCO “complete” genes and true gene loci, numerous smaller discrepancies are also observed. Even in high-quality, high-coverage genome assemblies, approximately one to eight percent of BUSCO genes classified as “complete” in assembly mode contain frame shifts, indicating missing or additional nucleotides that alter the reading frame of the inferred protein sequence. One to two percent of the “complete” genes contain in-frame stop codons, which are typically interpreted as signs of mutations or local regions of low coverage or sequence quality. However, RNA-Seq mapping and high-quality genome annotations demonstrate that these issues usually do not originate from the genome assemblies themselves, but from errors in the BUSCO reference sequences. When BUSCO reference sequences contain incorrect regions, these are still forcibly matched during the mapping process. For example, if a BUSCO reference sequence was derived from a gene prediction that missed an intron, the resulting sequence includes an intron read-through. When such a sequence is mapped to a closely related genome, it may introduce a frame shift or an in-frame stop codon at the position corresponding to the intron. Taken together, these observations point to limitations in the BUSCO reference data rather than deficiencies in the genome assemblies.
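To make the intron read-through case concrete, here is a minimal Python sketch using toy sequences. The exon and intron strings and the helper name `internal_stop_positions` are invented for illustration; none of this is BUSCO or miniprot code, and the standard genetic code is assumed.

```python
# Toy illustration (hypothetical sequences, standard genetic code assumed):
# what happens to a coding sequence when an intron is not spliced out.

STOP_CODONS = {"TAA", "TAG", "TGA"}


def internal_stop_positions(cds: str) -> list[int]:
    """Return 0-based nucleotide offsets of in-frame stop codons.

    The final complete codon is skipped, because a stop codon there is
    the expected end of the coding sequence.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    return [i * 3 for i, codon in enumerate(codons[:-1]) if codon in STOP_CODONS]


exon1 = "ATGGCTGAA"
exon2 = "CCTTAA"
intron_9nt = "GTAAGTTAG"  # length divisible by 3, contains TAG in frame
intron_7nt = "GTAAGTA"    # length not divisible by 3

spliced = exon1 + exon2                    # correct gene model
read_through = exon1 + intron_9nt + exon2  # intron missed, frame preserved
shifted = exon1 + intron_7nt + exon2       # intron missed, frame shifted

print(internal_stop_positions(spliced))       # no internal stop codons
print(internal_stop_positions(read_through))  # in-frame stop inside the intron
print(len(shifted) % 3)                       # non-zero: exon2 is read out of frame
```

In the toy data, the retained 9-nucleotide intron introduces an in-frame stop codon, while the retained 7-nucleotide intron pushes the second exon out of frame. A reference sequence built from either faulty model would carry that defect into every genome it is mapped to.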
Analysis of the mapped BUSCO sequences further shows that 20 to 70 percent of genes classified as “complete” lack either a start codon, a stop codon, or both. This indicates that BUSCO reference sequences are far less conserved than implied by the concept of lineage-specific conserved single-copy genes. The proportion of missing start and stop codons is highest when the analyzed genome belongs to a lineage that is underrepresented or absent in the BUSCO reference datasets. Notably, even for genomes that are part of the BUSCO reference data, 20 to 40 percent of “complete” BUSCO genes still lack proper start and stop codons, highlighting intrinsic limitations of the miniprot-based mapping approach used in assembly mode.
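A check of this kind can be sketched in a few lines of Python. This is only a rough outline under stated assumptions: the mapped coding sequences are already available as plain nucleotide strings that include the stop codon, ATG is the only accepted start codon, and the standard genetic code applies. The function names and the toy input are hypothetical.

```python
from collections import Counter

# Assumptions: ATG is the only start codon, standard stop codons,
# and each sequence is expected to include its stop codon.
START_CODON = "ATG"
STOP_CODONS = {"TAA", "TAG", "TGA"}


def terminal_codon_status(cds: str) -> str:
    """Classify a coding sequence by the presence of start and stop codons."""
    cds = cds.upper()
    has_start = cds[:3] == START_CODON
    has_stop = cds[-3:] in STOP_CODONS
    if has_start and has_stop:
        return "start+stop"
    if has_stop:
        return "no start"
    if has_start:
        return "no stop"
    return "neither"


def tally_terminal_codons(cds_by_id: dict[str, str]) -> Counter:
    """Count how many sequences fall into each start/stop category."""
    return Counter(terminal_codon_status(seq) for seq in cds_by_id.values())


# Toy input standing in for the sequences of BUSCOs reported as "complete".
toy = {
    "busco_a": "ATGGCTTAA",  # start and stop present
    "busco_b": "GCTGCTTAA",  # start missing
    "busco_c": "ATGGCTGCT",  # stop missing
    "busco_d": "GCTGCT",     # both missing
}

counts = tally_terminal_codons(toy)
fraction_lacking = 1 - counts["start+stop"] / len(toy)
print(counts)
print(f"{fraction_lacking:.0%} lack a start codon, a stop codon, or both")
```

On real data, the input would be the coding sequences extracted for each “complete” BUSCO. Alternative start codons and non-standard genetic codes would need to be handled before drawing conclusions from the counts.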