BUSCO completeness myths debunked

5 May 2025 · 11 min read

Martin Kollmar

The BUSCO completeness check assesses the completeness and quality of a genome, transcriptome, or proteome assembly by searching for highly conserved single-copy orthologs that are expected to be present in a given lineage. Completeness is assessed by analysing presence versus absence (‘missing’); quality is assessed by analysing complete versus partial matches (‘fragmented’). By providing datasets for specific taxonomic lineages, BUSCO helps to assess evolutionary completeness. From an evolutionary perspective, you would expect a ‘single-copy gene’ in a more general taxon to also be a single-copy gene in a more specific taxon.

[Figure: overlap of BUSCO lineages]

Accordingly, the datasets of the more general taxa should have (and do have) fewer genes, and you would expect the datasets of the more specific taxa to contain all the genes of the more general taxa. However, this is not the case. The datasets overlap substantially, but the datasets of the more general taxa also contain genes that are absent from the datasets of the more specific taxa.

[Figure: overlap of BUSCO lineages]

The example shows mammalian lineages, but the same is true for any comparison of more general and more specific lineages.
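The subset expectation discussed above can be checked with plain set operations. The following is a minimal sketch, not BUSCO functionality: the gene identifiers are invented toy values, and it assumes that the genes of both datasets have already been mapped to a common identifier namespace, which is a step you would have to do yourself.

```python
def compare_lineage_sets(general, specific):
    """Compare the gene sets of a general and a more specific lineage dataset."""
    general, specific = set(general), set(specific)
    return {
        # genes found in both datasets
        "shared": len(general & specific),
        # genes of the general dataset that the specific dataset lacks
        "only_in_general": sorted(general - specific),
        # True only if every general gene is also in the specific dataset
        "general_is_subset": general <= specific,
    }


# Toy identifiers, invented for illustration only
general_genes = {"geneA", "geneB", "geneC", "geneD"}
sublineage_genes = {"geneA", "geneB", "geneC", "geneE", "geneF", "geneG"}

report = compare_lineage_sets(general_genes, sublineage_genes)
print(report)
```

In this toy case the specific dataset is larger, yet `general_is_subset` is false because `geneD` occurs only in the general dataset, which mirrors the situation described for the real lineage datasets.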

Does this matter at all? It does, because it is important not to compare apples with oranges. That is what happens when the dataset used and the version of the dataset are not specified. A ‘duplicated’, ‘fragmented’ or ‘missing’ gene in an analysis with, for example, the mammalian dataset is not the same ‘duplicated’, ‘fragmented’ or ‘missing’ gene in an analysis with a dataset of a sublineage. The completeness values obtained with major lineages and with more specific lineages are therefore not comparable, nor do they build on one another.
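One way to avoid such comparisons in your own scripts is to store the dataset name and version together with every score and to refuse any comparison across datasets. The sketch below assumes the one-line summary notation that BUSCO reports (`C:..%[S:..%,D:..%],F:..%,M:..%,n:..`). The class, the function names, the version string and all numbers are hypothetical and only serve as illustration.

```python
import re
from dataclasses import dataclass

# Pattern for the one-line summary, e.g. "C:95.0%[S:93.0%,D:2.0%],F:3.0%,M:2.0%,n:100"
SUMMARY_RE = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


@dataclass(frozen=True)
class BuscoResult:
    dataset: str      # lineage dataset name, e.g. "mammalia_odb10"
    version: str      # dataset version label (placeholder values below)
    complete: float   # percentage of complete genes
    fragmented: float
    missing: float
    n: int            # number of genes in the dataset


def parse_summary(line, dataset, version):
    """Parse a summary line and attach the dataset it was computed with."""
    m = SUMMARY_RE.search(line)
    if m is None:
        raise ValueError(f"not a recognised summary line: {line!r}")
    return BuscoResult(dataset, version, float(m["C"]),
                       float(m["F"]), float(m["M"]), int(m["n"]))


def completeness_difference(a, b):
    """Difference in completeness, allowed only for identical dataset and version."""
    if (a.dataset, a.version) != (b.dataset, b.version):
        raise ValueError(
            f"not comparable: {a.dataset} ({a.version}) vs {b.dataset} ({b.version})"
        )
    return a.complete - b.complete


# Invented numbers for three hypothetical runs
asm1 = parse_summary("C:95.0%[S:93.0%,D:2.0%],F:3.0%,M:2.0%,n:100",
                     "mammalia_odb10", "example-version")
asm2 = parse_summary("C:92.5%[S:91.0%,D:1.5%],F:4.0%,M:3.5%,n:100",
                     "mammalia_odb10", "example-version")
asm3 = parse_summary("C:90.0%[S:89.0%,D:1.0%],F:5.0%,M:5.0%,n:200",
                     "primates_odb10", "example-version")

print(completeness_difference(asm1, asm2))
try:
    completeness_difference(asm1, asm3)
except ValueError as err:
    print("refused:", err)
```

Keeping the dataset and version inside the result object means the check cannot be skipped by accident when scores from different runs end up in the same table.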
