The human genome encodes approximately 21,000 protein-coding genes, yet the actual number of unique protein species, or proteoforms, arising from these genes is vastly larger. A proteoform is a specific molecular form of a protein, encompassing all variation due to alternative splicing, post-translational modifications (PTMs), single nucleotide polymorphisms (SNPs), and proteolytic processing. Theoretical estimates suggest that upwards of 10²⁷ different human proteoforms could exist (Smith and Kelleher, 2013), largely driven by combinatorial PTMs. However, this astronomical figure is unlikely to reflect biological reality. In practice, far fewer proteoforms are observed, and the rate of discovery has slowed, indicating a possible plateau in proteoform identification using current technologies. This essay explores why the theoretical number is biologically unrealistic and why current databases may soon reach a limit in cataloguing new proteoforms.
1. Theoretical Estimates and Their Assumptions
The massive estimate of 10²⁷ proteoforms is based on a combinatorial model:
- Assume each protein can be modified at numerous potential sites.
- Each site can carry one of several different PTMs (e.g., phosphorylation, acetylation, methylation).
- If, for instance, a protein has 10 modifiable sites, each of which can be in one of 5 states (unmodified or one of four PTMs), the number of combinations is 5¹⁰ = 9,765,625, for just one protein.
- When this is summed over thousands of proteins, many of which have far more modifiable sites, the total skyrockets.
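The arithmetic behind this model can be sketched in a few lines of Python. This is a minimal illustration, not the calculation used in the cited estimate: the function names are hypothetical, and the uniform proteome of 21,000 proteins with exactly 10 sites each is an assumption made only to show the scale.

```python
def combinations_per_protein(n_sites: int, states_per_site: int) -> int:
    """Count site-state assignments when every site varies independently."""
    return states_per_site ** n_sites


def total_across_proteins(site_counts, states_per_site: int) -> int:
    """Proteoforms of different proteins add up; they do not multiply."""
    return sum(combinations_per_protein(n, states_per_site) for n in site_counts)


def sites_needed(target: int, states_per_site: int) -> int:
    """Smallest number of sites at which one protein alone reaches `target`."""
    n_sites, count = 0, 1
    while count < target:
        count *= states_per_site
        n_sites += 1
    return n_sites


one_protein = combinations_per_protein(10, 5)
print(f"{one_protein:,}")  # 9,765,625

# Assumption for illustration only: 21,000 proteins, each with exactly 10 sites.
uniform_total = total_across_proteins([10] * 21_000, 5)
print(f"{uniform_total:.2e}")  # 2.05e+11

# How many 5-state sites would a single protein need to reach 10**27 on its own?
print(sites_needed(10**27, 5))  # 39
```

Under these assumptions the uniform case stays near 2 × 10¹¹, so totals approaching 10²⁷ arise mainly from proteins with dozens of independently modifiable sites rather than from the number of proteins.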
However, this assumes:
- All modifications are independent and equally likely.
- All possible combinations are biochemically feasible and biologically relevant.
- Each proteoform exists in detectable quantities.
These assumptions are problematic. Biological systems are not random PTM generators—they are highly regulated, energy-constrained, and context-dependent.
2. Biological Constraints on Proteoform Diversity
2.1 Functional Redundancy and Selection Pressure
Nature selects functional utility, not theoretical possibility. Many PTMs are:
- Mutually exclusive (e.g., phosphorylation vs. O-GlcNAc glycosylation competing for the same serine or threonine residue).
- Dependent on specific cellular states or compartments.
- Dynamically regulated, appearing transiently.
Moreover, not all proteoforms provide a fitness advantage. Those without functional significance are unlikely to be retained or even produced in detectable quantities. Proteins are also under evolutionary pressure to maintain specific functions; hence, random modifications are often deleterious and removed via natural selection.
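A toy enumeration shows how quickly such constraints shrink the combinatorial space. The sketch below is purely illustrative: the three sites, their states, and both rules are invented for the example and do not describe any real protein. Mutual exclusivity at a single residue is captured by letting each site hold only one state at a time.

```python
from itertools import product

# Hypothetical toy protein: each site holds exactly one state at a time,
# so competing PTMs on the same residue are mutually exclusive by design.
SITE_STATES = {
    "Ser_A": ["-", "phospho", "O-GlcNAc"],
    "Lys_B": ["-", "acetyl", "methyl"],
    "Ser_C": ["-", "phospho"],
}


def is_allowed(form: dict) -> bool:
    """Made-up rules standing in for enzymatic and regulatory constraints."""
    # Rule 1 (assumed): Lys_B is acetylated only when Ser_A is phosphorylated.
    if form["Lys_B"] == "acetyl" and form["Ser_A"] != "phospho":
        return False
    # Rule 2 (assumed): Ser_C phosphorylation never co-occurs with Ser_A O-GlcNAc.
    if form["Ser_C"] == "phospho" and form["Ser_A"] == "O-GlcNAc":
        return False
    return True


sites = list(SITE_STATES)
all_forms = [dict(zip(sites, combo)) for combo in product(*SITE_STATES.values())]
allowed_forms = [form for form in all_forms if is_allowed(form)]

print(len(all_forms))      # 18 unconstrained combinations (3 * 3 * 2)
print(len(allowed_forms))  # 12 remain after the two rules
```

Two simple rules already remove a third of the combinations for a three-site protein. With more sites and more dependencies, the gap between the unconstrained count and the permitted count widens further.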
2.2 Limited Enzymatic Machinery
Each PTM requires specific enzymes (e.g., kinases, acetyltransferases, methyltransferases). The number of such enzymes is limited, and their activity is restricted in time and space and by substrate specificity. This biochemical infrastructure cannot support an unlimited number of combinations.
2.3 Tissue and Context Specificity
Proteoform diversity is heavily dependent on cell type, developmental stage, external stimuli, and disease state. A PTM common in neurons may be absent in liver cells. Thus, while the genome is shared, the proteomic output is highly contextual. This restricts the real-world variety of proteoforms at any given time and place.
3. Technological Limitations in Proteoform Detection
Current proteomic technologies, primarily mass spectrometry (MS), are incredibly powerful, but still fall short in fully resolving proteoform complexity.
3.1 Bottom-Up Proteomics
Most proteomics is performed using bottom-up MS, where proteins are digested into peptides before analysis. This approach:
- Loses information about full-length proteoforms.
- Cannot determine co-occurrence of PTMs on the same protein molecule.
- Struggles with isoform resolution, especially for closely related splice variants.
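The loss of co-occurrence information can be made concrete with a small sketch. It assumes a hypothetical protein with two phosphorylation sites that fall on different peptides after digestion, and it models each measurement as a simple count. The sample compositions and function names are invented for illustration.

```python
from collections import Counter

# Each molecule is a tuple of per-site states: (site 0, site 1).
# Sample A: half the molecules are doubly phosphorylated, half unmodified.
sample_a = [("phospho", "phospho")] * 50 + [("-", "-")] * 50
# Sample B: every molecule carries exactly one phosphorylation.
sample_b = [("phospho", "-")] * 50 + [("-", "phospho")] * 50


def bottom_up_view(sample) -> Counter:
    """Count states per site separately, as if each site were on its own peptide."""
    counts = Counter()
    for proteoform in sample:
        for site_index, state in enumerate(proteoform):
            counts[(site_index, state)] += 1
    return counts


def top_down_view(sample) -> Counter:
    """Count intact molecules, keeping the combination of states together."""
    return Counter(sample)


print(bottom_up_view(sample_a) == bottom_up_view(sample_b))  # True
print(top_down_view(sample_a) == top_down_view(sample_b))    # False
```

Both samples show 50% occupancy at each site, so the peptide-level view cannot tell them apart, even though they contain entirely different proteoforms.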
3.2 Top-Down Proteomics
Top-down MS analyzes intact proteins and is more suitable for proteoform characterization. However, it:
- Is technically challenging.
- Has limited throughput and sensitivity.
- Is better suited for abundant, small proteins than for large or membrane-bound proteins.
Due to these limitations, many low-abundance or transient proteoforms remain undetectable.
3.3 Bias Toward Abundant and Stable Proteins
Current databases are biased toward:
- Stable, abundant, and easily extractable proteins.
- PTMs that are chemically stable or enriched (e.g., phosphorylation).
Rare, unstable, or cell-state-specific proteoforms are often missed.
4. Database Saturation and Plateauing Discovery
Proteomic databases such as UniProt, the Human Proteoform Atlas, and PeptideAtlas have grown rapidly over the past decade. However, the rate of new proteoform discovery appears to have begun to slow.
Reasons include:
- Redundancy in identified proteoforms.
- Re-discovery of already catalogued variants.
- Increasing difficulty in detecting novel, rare, or low-abundance species.
- Limited biological samples from diverse conditions or tissues.
This suggests that proteoform discovery is reaching an asymptote, at least with current methodologies.
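One way to see why discovery flattens is a simple sampling model. The sketch below is an assumption-laden illustration, not a fit to any database: it treats each identification as an independent draw weighted by abundance, and it uses a hypothetical Zipf-like abundance distribution over 10,000 proteoforms.

```python
def expected_unique(abundances, n_draws: int) -> float:
    """Expected number of distinct species seen after n_draws independent draws."""
    total = sum(abundances)
    return sum(1 - (1 - a / total) ** n_draws for a in abundances)


# Hypothetical abundances: the proteoform of rank r has relative weight 1/r.
abundances = [1 / rank for rank in range(1, 10_001)]

# Doubling the sampling depth at each step.
depths = [1_000, 2_000, 4_000, 8_000, 16_000]
curve = [expected_unique(abundances, n) for n in depths]

# New proteoforms gained per additional draw over each interval.
gains_per_draw = [
    (curve[i + 1] - curve[i]) / (depths[i + 1] - depths[i])
    for i in range(len(depths) - 1)
]

for depth, unique in zip(depths, curve):
    print(f"{depth:>6} draws -> {unique:8.1f} expected unique proteoforms")
print([round(gain, 4) for gain in gains_per_draw])
```

In this model the yield per additional draw falls at every step, because abundant proteoforms are re-discovered repeatedly while rare ones are seldom sampled. The curve keeps rising but ever more slowly, which is the behaviour described above.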
5. Defining What “Counts” as a Unique Proteoform
Another critical point is the definition of a proteoform itself. Should a change at a single PTM site qualify, or only a change that alters function or localization?
Some researchers propose focusing on functional proteoforms, which affect:
- Protein interactions
- Stability
- Localization
- Activity
By this stricter definition, many theoretical combinations may be biologically irrelevant and should not be counted.
6. Moving Forward: Beyond the Plateau
To overcome current limitations, future advances may include:
- Improved top-down proteomics with higher resolution and throughput.
- Single-cell proteomics to detect context-dependent proteoforms.
- Integration of multi-omics data (genomics, transcriptomics, PTM profiling).
- Machine learning to predict biologically plausible proteoforms.
- Wider sampling across non-model organisms, rare tissues, and disease states.
However, even with better tools, it is unlikely that we will uncover anywhere near 10²⁷ proteoforms. Instead, we may refine our understanding of a functionally relevant subset, perhaps numbering in the millions—not trillions or beyond.
While the theoretical number of human proteoforms, driven by combinatorial PTMs, can reach astronomical figures, the actual number observed in vivo is far lower. Biological constraints such as enzyme specificity, cellular context, and evolutionary selection limit the diversity of proteoforms that are functionally expressed. Technological limitations, particularly in mass spectrometry, further constrain our ability to detect and catalogue these molecules. Consequently, the number of proteoforms found in current databases is likely to plateau, reflecting a convergence between biological relevance and technological capability. Understanding and defining the biologically functional proteoforms, rather than chasing all combinatorial possibilities, may offer a more productive direction for proteomics research.
References (Harvard Style)
Smith, L.M. and Kelleher, N.L., 2013. Proteoform: a single term describing protein complexity. Nature Methods, 10(3), pp.186–187.
Aebersold, R. and Mann, M., 2016. Mass-spectrometric exploration of proteome structure and function. Nature, 537(7620), pp.347–355.
Liu, X., de Vreede, L., Kuster, B. and van Breukelen, B., 2021. On the number of proteoforms in the human proteome. Journal of Proteome Research, 20(7), pp.3425–3433.
Zolg, D.P., Wilhelm, M., Schnatbaum, K., Zerweck, J., Knaute, T., Delanghe, B., Bailey, D.J., Gessulat, S., Ehrlich, H.C., Weininger, M. and Yu, P., 2017. Building ProteomeTools based on a complete synthetic human proteome. Nature Methods, 14(3), pp.259–262.

