Functional genome annotation, dos and don’ts

5 May 2025 · 19 min read

Martin Kollmar

Functional annotation is the process of assigning biological meaning to identified gene sequences. It involves predicting the roles of genes, transcripts, and proteins by comparing them to known databases, identifying protein-coding regions, functional domains, motifs, and gene ontology (GO) terms. Functional annotation also includes the identification of regulatory elements, pathways, and interactions, providing insights into the biological processes, molecular functions, and cellular components associated with each gene.

Protein and gene names are usually assigned on the basis of homology to named proteins/genes. Homology is determined by comparing the predicted genes, using tools such as BLAST, with data from comprehensive databases such as GenBank, SwissProt and UniProt or species databases such as FlyBase, TAIR, Xenbase, ZFIN and others. Of course, the result of the naming process depends on the parameters of the tool (e.g. BLAST) and the status of the database. Here we discuss some notes on the databases; problems with the tools will be discussed in a future newsletter.

The naming obviously depends heavily on the content of the database and its version (date of publication). The best hit in one database, which is normally used to assign the new name, is not necessarily the best hit in another database.
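The database dependence can be made visible by running the same search against more than one database and comparing the names of the best hits. The following is a minimal sketch, not a complete pipeline: it assumes the hits have already been parsed into (query, hit name, bit score) tuples, and the function names, database labels and example hits are all made up for illustration.

```python
def best_hit_names(hits):
    """Map each query to the name of its highest-scoring hit.

    `hits` is an iterable of (query_id, hit_name, bit_score) tuples,
    an assumed intermediate format, e.g. parsed from search tool output.
    """
    best = {}
    for query_id, hit_name, bit_score in hits:
        if query_id not in best or bit_score > best[query_id][1]:
            best[query_id] = (hit_name, bit_score)
    return {query_id: name for query_id, (name, _) in best.items()}


def conflicting_names(names_by_database):
    """Return queries whose best-hit name differs between databases.

    `names_by_database` maps a database label to the output of
    best_hit_names(). A query without any hit in one database is not
    counted as a conflict; only differing names are reported.
    """
    all_queries = set()
    for names in names_by_database.values():
        all_queries.update(names)

    conflicts = {}
    for query_id in sorted(all_queries):
        per_db = {db: names.get(query_id) for db, names in names_by_database.items()}
        distinct = {name.lower() for name in per_db.values() if name is not None}
        if len(distinct) > 1:
            conflicts[query_id] = per_db
    return conflicts


# Made-up hits for two hypothetical databases.
db_a = [
    ("gene1", "myosin-17", 950.0),
    ("gene1", "chitin synthase V", 610.0),
    ("gene2", "actin", 700.0),
]
db_b = [
    ("gene1", "chitin synthase V", 980.0),
    ("gene2", "actin", 690.0),
]

conflicts = conflicting_names({
    "db_a": best_hit_names(db_a),
    "db_b": best_hit_names(db_b),
})
print(conflicts)
```

Queries reported here are candidates for manual inspection rather than automatic renaming, because a simple string comparison cannot tell a true conflict from two synonyms for the same protein.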

The naming of proteins and genes is inconsistent in the literature. Orthologous genes can be named differently in humans and mice or Arabidopsis and rice, for example. Eukaryotic proteins often consist of multiple domains, and their naming may depend on just one of these domains. For example, the motor protein myosin is ubiquitous in eukaryotes and is characterised by the so-called myosin motor domain, which binds to actin, hydrolyses ATP and generates force to move along actin filaments. There are over 80 different classes of myosins, each with a unique domain architecture and function. One fungal-specific class contains a chitin synthase domain and is therefore often classified and referred to in the fungal world as chitin synthase V/VII. From the perspective of the myosin community, the two fungal homologues are both referred to as myosin-17, while from the chitin synthase perspective they are classified as chitin synthase V and chitin synthase VII.

Including subfamily, class, subclass or group classifiers in names leads to enormous confusion and misnomers, as the classification of protein families requires a thorough phylogenetic analysis. Even if the scientific community agrees on a classification scheme after many years, different subcommunities or individual research groups may already be using several different naming schemes and may already have introduced them into the public databases by submitting the corresponding sequences.

Apart from the misleading naming of proteins by specific classes/subfamilies without proper verification of the orthology/paralogy, the most annoying thing is the aberrant use of useless additions such as “hypothetical”, “potential”, “probable”, “related” and others.
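Names carrying such additions are easy to find before they propagate into a new annotation. The sketch below flags them with a regular expression; the qualifier list contains only the four words quoted above, and the function name and example protein names are hypothetical.

```python
import re

# Qualifiers named in the text; extend the tuple for other additions.
QUALIFIERS = ("hypothetical", "potential", "probable", "related")
QUALIFIER_PATTERN = re.compile(
    r"\b(" + "|".join(QUALIFIERS) + r")\b", re.IGNORECASE
)


def find_qualifiers(protein_name):
    """Return the lower-cased qualifiers found in a name, in order of appearance."""
    return [m.group(1).lower() for m in QUALIFIER_PATTERN.finditer(protein_name)]


# Made-up example names.
names = [
    "Probable myosin-17",
    "hypothetical protein",
    "Myosin-related protein, potential",
    "Chitin synthase V",
]

flagged = {}
for name in names:
    found = find_qualifiers(name)
    if found:
        flagged[name] = found

for name, found in flagged.items():
    print(f"{name}: {', '.join(found)}")
```

The sketch only flags names and does not rewrite them, since deciding what the name should be still requires checking the underlying homology.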

To summarise, the use of standard software parameters and a large database such as GenBank or UniProt without extensive post-processing leads to extremely misleading functional annotation. Entire protein names are inappropriate, as are most assignments to classes and subfamilies, which ultimately leads to misdirection in subsequent analyses, both experimental and in silico.
