Today, I will share some surprising numbers on recent human genome annotations. The accuracy of human genome annotation plays a critical role in whole-genome, transcriptome, and exome sequencing studies, directly influencing the biological and clinical interpretation of experimental data. Generally, the more complete and precise the annotation, the more reliable downstream analyses become, for example spliced read alignment in RNA-seq data or variant calling in genome sequencing projects. Conversely, inflated estimates of exon or transcript numbers can reduce the robustness of gene expression and abundance measurements.
Gene annotations also guide the selection of genomic regions in exome and targeted sequencing, as well as in genome editing strategies using tools like CRISPR/Cas9. As genomic technologies continue to enter clinical settings, the precision of annotations becomes increasingly important for accurate diagnosis and effective patient treatment. Moreover, high-quality gene annotation is essential for drug development. It enables the design of gene therapies, facilitates transcript targeting with antisense oligonucleotides, and supports the modulation of exon splicing through small molecules.
Total numbers
The total number of human protein-coding genes is estimated to be around 20,500, based on data from leading annotation projects such as Ensembl, GenBank (RefSeq), and GENCODE. However, the CCDS (Consensus Coding Sequence) dataset, which represents a consensus among these groups, includes only about 19,100 genes. This indicates ongoing uncertainty about the protein-coding status of approximately 2,000 genes. Notably, the biochemical evidence for these genes in resources like neXtProt and the Human Protein Atlas is even more limited.
Shortest gene
The shortest gene in the human genome is TRDD1 (T Cell Receptor Delta Diversity 1), with eight nucleotides and a translated peptide of two amino acids: EI (glutamate-isoleucine). It is closely followed by TRDD2 (T Cell Receptor Delta Diversity 2), with nine nucleotides translated into PSY (proline-serine-tyrosine), and IGHD7-27 (Immunoglobulin Heavy Diversity 7-27), with eleven nucleotides translated into LTG (leucine-threonine-glycine). In fact, 26 genes in the GENCODE annotation code for ten amino acids or fewer, and 191 annotated genes code for 50 amino acids or fewer. Guess how complicated it is to find ‘homologues’ for these genes in other mammalian genomes, or genes of this length in any eukaryote.
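To see why eight nucleotides give only two amino acids, it helps to count complete codons. The snippet below is a minimal sketch, not part of any annotation pipeline: it assumes translation starts at the first nucleotide and that every complete triplet codes for an amino acid (no stop codon). The helper name `complete_codons` is made up for illustration; the lengths are the ones quoted above.

```python
def complete_codons(n_nucleotides):
    """Return (complete codons, leftover nucleotides) for a sequence length.

    Assumption: reading starts at the first nucleotide and no triplet
    is a stop codon, so each complete codon yields one amino acid.
    """
    return divmod(n_nucleotides, 3)


# Segment lengths in nucleotides, as quoted in the text.
segment_lengths = {"TRDD1": 8, "TRDD2": 9, "IGHD7-27": 11}

for name, length in segment_lengths.items():
    codons, leftover = complete_codons(length)
    print(f"{name}: {length} nt = {codons} codons, {leftover} nt left over")
```

Under these assumptions, TRDD1 has two complete codons with two nucleotides left over, and IGHD7-27 has three complete codons with two left over, which matches the two- and three-residue peptides listed above.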
