The International Quest to Complete the Human Proteome
Two decades after sequencing the human genome, scientists continue their search for the elusive proteins that complete our molecular blueprint.
When the Human Genome Project revealed our genetic blueprint in 2003, scientists were surprised to find that protein-coding DNA occupies less than 2% of our genome [3]. From this small portion, approximately 20,000 protein-coding genes were identified. These genes provide the instructions for building proteins, the fundamental workhorses of all biological processes.
However, having the instructions doesn't mean we know all the products. The Human Proteome Organization (HUPO) launched the Human Proteome Project (HPP) in 2010 to systematically map and characterize all human proteins [3]. The global initiative is organized into two complementary programs: the Biology/Disease-driven HPP (B/D-HPP), which studies proteins in the context of health and disease, and the Chromosome-Centric HPP (C-HPP), which works through the proteome gene by gene, one chromosome at a time.
Biology/Disease-driven HPP (B/D-HPP): Focuses on proteins in the context of health and disease, studying how proteins function in biological systems and pathological conditions.
Chromosome-Centric HPP (C-HPP): Takes a systematic gene-by-gene approach, dividing responsibility for human chromosomes among international research teams.
The term "missing proteins" refers to proteins that are predicted to exist based on genetic evidence but lack sufficient experimental confirmation at the protein level [2, 3]. To classify protein evidence, researchers use a five-tier Protein Evidence (PE) system:
PE1: Experimental evidence confirms protein existence
PE2: Evidence only at RNA transcript level
PE3: Predicted based on similarity to proteins in other species
PE4: Predicted from gene models and computational analysis
PE5: Questionable protein-coding genes with little supporting evidence
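The classification can be expressed as a small lookup, which also makes the definition of a "missing" protein concrete. The sketch below assumes the five descriptions above correspond to levels PE1 through PE5 in that order, and it treats PE2 to PE4 as "missing", in line with the HPP's usage. The gene names are hypothetical placeholders, not real annotations.

```python
from enum import IntEnum


class ProteinEvidence(IntEnum):
    """Protein Evidence levels (1 = strongest evidence)."""
    PE1 = 1  # experimental evidence at the protein level
    PE2 = 2  # evidence only at the RNA transcript level
    PE3 = 3  # predicted from similarity to proteins in other species
    PE4 = 4  # predicted from gene models / computational analysis
    PE5 = 5  # questionable protein-coding gene


def is_missing(level: ProteinEvidence) -> bool:
    """A 'missing protein' is predicted to exist but is not yet PE1 (PE2-PE4)."""
    return ProteinEvidence.PE2 <= level <= ProteinEvidence.PE4


# Hypothetical entries for illustration only.
entries = {
    "GENE_A": ProteinEvidence.PE1,
    "GENE_B": ProteinEvidence.PE2,
    "GENE_C": ProteinEvidence.PE4,
    "GENE_D": ProteinEvidence.PE5,
}

missing = sorted(gene for gene, level in entries.items() if is_missing(level))
print(missing)
```

PE5 entries are deliberately excluded: they are considered doubtful as protein-coding genes, so they are not counted among the proteins still to be found.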
The hunt for missing proteins has yielded impressive results. According to the 2024 HPP report, protein expression has now been credibly detected (PE1) for 18,138 of the 19,411 GENCODE protein-coding genes, representing 93% of the predicted human proteome [6]. The number of missing proteins has been reduced to just 1,273 [6].
This is a substantial advance over earlier years. As recently as 2016, researchers had confidently identified only 16,518 proteins, with 2,663 still missing [8].
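The completion percentages follow directly from these counts. Here is a minimal sketch, assuming the denominator is simply the confirmed (PE1) entries plus the missing ones; that assumption reproduces the 19,411-gene total quoted for 2024, but it is not stated in the source for 2016.

```python
# (PE1 count, missing count) as quoted for each reporting year.
reports = {
    2016: (16_518, 2_663),
    2023: (18_397, 1_381),
    2024: (18_138, 1_273),
}


def percent_complete(pe1: int, missing: int) -> float:
    """Share of PE1 proteins among PE1 + missing (PE2-PE4) entries."""
    return 100 * pe1 / (pe1 + missing)


for year, (pe1, missing) in reports.items():
    total = pe1 + missing
    print(f"{year}: {pe1:,} of {total:,} = {percent_complete(pe1, missing):.1f}%")
```

Because the reference gene list changed between reports, the percentages are a more stable measure of progress than the raw PE1 counts.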
| Year | PE1 Proteins (Confirmed) | Missing Proteins | Percentage Complete |
|---|---|---|---|
| 2016 | 16,518 | 2,663 | 86% |
| 2023 | 18,397 | 1,381 | 93% |
| 2024 | 18,138* | 1,273 | 93% |
*Note: The 2024 number uses the updated GENCODE protein-coding gene list of 19,411 (previously 19,778) [6], so the PE1 count is not directly comparable with 2023.
"Achieving the unambiguous identification of 93% of predicted proteins encoded from across all chromosomes represents remarkable experimental progress on the Human Proteome parts list," noted the 2023 HPP report [7].
Finding the last remaining missing proteins has required increasingly sophisticated approaches. Scientists face multiple challenges: these proteins may be expressed only in specific tissues, at low abundance, during brief developmental windows, or under unique physiological conditions [3].
One of the most productive approaches has been to look closely at tissues with unusual gene-expression profiles. Researchers discovered that testis tissue contains by far the largest number of tissue-specific transcripts, nearly 50 times more than any other tissue [8]. This makes it a veritable goldmine for finding proteins that are elusive elsewhere in the body.
A typical search proceeds in six steps:
1. Survey public databases to identify missing proteins with transcript evidence in specific tissues
2. Choose promising tissues (like testis) that show high expression of target genes
3. Use enzymes like trypsin to break proteins into measurable peptides
4. Identify peptides by mass spectrometry, based on their mass-to-charge ratios
5. Apply HPP guidelines requiring at least two unique peptides of ≥9 amino acids each
6. Contribute findings to central resources like PeptideAtlas and UniProtKB
This method has proven highly successful. In one year alone, researchers identified 166 previously missing proteins in testis and 89 more in spermatozoa [8].
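The digestion and validation steps can be illustrated in a few lines. This is a simplified sketch, not the HPP's actual pipeline: it applies the common rule that trypsin cleaves after lysine (K) or arginine (R) unless the next residue is proline (P), and it checks only the "two unique peptides of at least nine residues" criterion described above. The function names and the protein sequences are hypothetical, and the real guidelines include further criteria that are not modelled here.

```python
import re


def trypsin_digest(sequence: str) -> list[str]:
    """In-silico digest: cleave after K or R, except when followed by P."""
    return [pep for pep in re.split(r"(?<=[KR])(?!P)", sequence) if pep]


def meets_two_peptide_rule(observed: set[str], target: str,
                           other_proteins: list[str], min_len: int = 9) -> bool:
    """True if at least two observed peptides are long enough and map only to target."""
    target_peptides = set(trypsin_digest(target))
    shared = {pep for seq in other_proteins for pep in trypsin_digest(seq)}
    qualifying = {
        pep for pep in observed
        if pep in target_peptides and pep not in shared and len(pep) >= min_len
    }
    return len(qualifying) >= 2


# Hypothetical sequences for illustration only.
candidate = "MAGKPSEDLTRWVNAYEQLGHKDE"
observed = {"MAGKPSEDLTR", "WVNAYEQLGHK"}

print(trypsin_digest(candidate))
print(meets_two_peptide_rule(observed, candidate, other_proteins=[]))
# A second protein sharing one peptide leaves only one unique match.
print(meets_two_peptide_rule(observed, candidate, other_proteins=["GGRWVNAYEQLGHKTT"]))
```

The second call shows why uniqueness matters: a peptide that also occurs in another protein cannot count as evidence for the candidate.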
| Tool/Reagent | Function in Protein Discovery |
|---|---|
| Mass Spectrometers | High-sensitivity instruments that identify proteins by measuring peptide masses |
| Trypsin | Enzyme that digests proteins into smaller, analyzable peptides |
| Alternative Proteases | Specialized enzymes used when trypsin cannot digest certain proteins |
| Chromatography Systems | Separate complex peptide mixtures before mass analysis |
| Specific Tissues (e.g., testis) | Biological sources rich in rarely expressed proteins |
| Custom Protein Databases | Reference databases built from RNA-seq data to aid peptide identification |
| Antibody Reagents | Used in orthogonal methods to confirm protein presence independently |
| Cell Line Models | Engineered systems expressing target proteins for controlled study |
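To make "measuring peptide masses" concrete, the sketch below computes the mass-to-charge ratio (m/z) a mass spectrometer would be expected to observe for an unmodified peptide. It uses standard monoisotopic residue masses and assumes protonation is the only source of charge; the function names and the example peptide are hypothetical, and real search software also handles modifications, isotopes, and fragment ions.

```python
# Monoisotopic residue masses in daltons (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # added once per peptide for the free N- and C-termini
PROTON = 1.007276  # mass of each charge-carrying proton


def peptide_mass(peptide: str) -> float:
    """Monoisotopic mass of a neutral, unmodified peptide."""
    return sum(RESIDUE_MASS[residue] for residue in peptide) + WATER


def mass_to_charge(peptide: str, charge: int) -> float:
    """m/z of the peptide carrying `charge` protons."""
    return (peptide_mass(peptide) + charge * PROTON) / charge


# Hypothetical peptide for illustration only.
for z in (1, 2, 3):
    print(z, round(mass_to_charge("WVNAYEQLGHK", z), 4))
```

Note that leucine (L) and isoleucine (I) have identical masses, one reason why mass alone cannot always distinguish closely related peptides.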
Finding the missing proteins is only the beginning of the story. Scientists now recognize that an even greater challenge lies in what they term the "dark proteome": proteins for which we have insufficient information on expression, structure, or function [1].
To address this challenge, the C-HPP launched the neXt-CP50 initiative in 2018, aiming to characterize the function of 50 poorly understood proteins (uPE1 proteins) over three years [1]. This pilot project brought together 15 teams from 12 countries to tackle "dark proteins" that have been detected but whose functions remain mysterious [1].
"As researchers try to detect the missing proteins, there are still other regions of the dark proteome to explore. When you account for those variations, there could be millions of varieties of proteins inside each of us."
As the low-hanging fruit disappears, finding the remaining missing proteins requires increasingly innovative approaches. Researchers are developing new technologies that may finally illuminate the darkest corners of the proteome.
Traditional mass spectrometry has limitations: it is destructive and can miss subtle variations. New platforms, like one being commercialized by Nautilus Biotechnology, take a different approach, immobilizing proteins on a surface and probing them repeatedly with fluorescent reagents to determine what is there. This allows for more comprehensive detection of protein variations.
Other groups are exploring nanopore protein sequencing, which pulls proteins through small openings and reads amino acid sequences by changes in electrical conductivity. This method can potentially keep track of chemical modifications that would be missed by conventional approaches.
With 93% of the protein parts list confirmed, the HPP is increasingly focusing on the Grand Challenge Project: "A Function for Every Protein" [6]. In 2024, researchers introduced a new Function Evidence (FE) scoring system that ranks our understanding of each protein's molecular functions [6]. This system, which parallels the PE evidence codes, will help prioritize proteins needing functional characterization.
"The recent development of a Function Evidence (FE) score represents a key step in the pursuit of the HPP Grand Challenge Project, 'A Function for Every Protein,'" states the 2024 HPP report [6].
| Category | Number of Proteins |
|---|---|
| Total Protein-Coding Genes | 19,411 |
| PE1 Proteins (Confirmed) | 18,138 |
| PE1 with MS Evidence | 17,407 |
| Missing Proteins (PE2+3+4) | 1,273 |
| Dubious Proteins (PE5) | 600 |
The Chromosome-Centric Human Proteome Project represents one of the most ambitious scientific collaborations in biology. From dividing the human chromosomes among international teams to establishing rigorous standards for protein identification, the C-HPP has created a framework that has dramatically accelerated our understanding of the human molecular machinery.
"For me, darkness is a metaphor for what we don't understand."
While the number of truly missing proteins dwindles, the work is far from over. The scientific community now faces the more complex challenge of understanding the dance of proteins in our cells—how they interact, when they're modified, and what roles they play in health and disease. Each missing protein found and characterized represents another light turned on in the vast darkness of biological uncertainty, bringing us closer to a complete understanding of what it means to be human at the molecular level.