Hunting the Missing Proteins

The International Quest to Complete the Human Proteome

Two decades after sequencing the human genome, scientists continue their search for the elusive proteins that complete our molecular blueprint.

The Blueprint and the Building Blocks: From Genes to Proteins

When the Human Genome Project revealed our genetic blueprint in 2003, scientists were surprised to find that protein-coding DNA occupies less than 2% of our genome 3 . From this small portion, approximately 20,000 protein-coding genes were identified. These genes provide the instructions for building proteins, the fundamental workhorses of all biological processes.

However, having the instructions doesn't mean we know all the products. The Human Proteome Organization (HUPO) launched the Human Proteome Project (HPP) in 2010 to systematically map and characterize all human proteins 3 . This global initiative split into complementary approaches: the Biology/Disease-driven HPP (B/D-HPP), which studies proteins in the context of health and disease, and the Chromosome-Centric HPP (C-HPP), which takes a unique gene-by-gene approach.

Biology/Disease-driven HPP

Focuses on proteins in the context of health and disease, studying how proteins function in biological systems and pathological conditions.

Chromosome-Centric HPP

Takes a systematic gene-by-gene approach, dividing responsibility for human chromosomes among international research teams.

The Mystery of the Missing Proteins

The term "missing proteins" refers to those proteins that are predicted to exist based on genetic evidence but lack sufficient experimental confirmation at the protein level 2 3 . To classify protein evidence, researchers use a Protein Evidence (PE) level system:

PE1: Conclusive Evidence

Experimental evidence confirms protein existence

PE2: Transcript Evidence

Evidence only at RNA transcript level

PE3: Inferred from Homology

Predicted based on similarity to proteins in other species

PE4: Predicted from Models

Predicted from gene models and computational analysis

PE5: Uncertain or Dubious

Questionable protein-coding genes with little supporting evidence

Missing proteins are those in the PE2, PE3, and PE4 categories: they are predicted to exist, but have not yet been reliably detected at the protein level.
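The classification above can be written down as a small lookup. The following is a minimal sketch, assuming the five PE levels exactly as listed here; the enum member names and the gene names in `entries` are hypothetical and chosen only for illustration.

```python
from enum import IntEnum


class PE(IntEnum):
    """Protein Evidence levels as listed in the text (member names are illustrative)."""
    PROTEIN = 1     # conclusive experimental evidence at the protein level
    TRANSCRIPT = 2  # evidence only at the RNA transcript level
    HOMOLOGY = 3    # inferred from similarity to proteins in other species
    PREDICTED = 4   # predicted from gene models and computational analysis
    UNCERTAIN = 5   # questionable protein-coding gene


# "Missing" covers PE2, PE3 and PE4; PE5 entries are dubious, not missing.
MISSING_LEVELS = {PE.TRANSCRIPT, PE.HOMOLOGY, PE.PREDICTED}


def is_missing(level: PE) -> bool:
    """Return True if a protein at this evidence level counts as missing."""
    return level in MISSING_LEVELS


# Hypothetical entries, not real gene annotations.
entries = {
    "GENE_A": PE.PROTEIN,
    "GENE_B": PE.TRANSCRIPT,
    "GENE_C": PE.HOMOLOGY,
    "GENE_D": PE.UNCERTAIN,
}

missing = sorted(name for name, level in entries.items() if is_missing(level))
print(missing)
```

Note that PE5 is deliberately excluded: those entries are doubted as protein-coding genes at all, so they are tracked separately from the missing proteins.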

Lighting Up the Dark Proteome: Remarkable Progress

The hunt for missing proteins has yielded impressive results. According to the 2024 HPP report, protein expression has now been credibly detected (PE1) for 18,138 of the 19,411 GENCODE protein-coding genes, representing 93% of the predicted human proteome 6 . The number of missing proteins has been reduced to just 1,273 6 .

This is a marked advance on earlier counts. As recently as 2016, researchers had confidently identified only 16,518 proteins, with 2,663 still missing 8 .

Progress in Human Proteome Mapping

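| Year | PE1 Proteins (Confirmed) | Missing Proteins | Percentage Complete |
|------|--------------------------|------------------|---------------------|
| 2016 | 16,518                   | 2,663            | 86%                 |
| 2023 | 18,397                   | 1,381            | 93%                 |
| 2024 | 18,138*                  | 1,273            | 93%                 |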

*Note: The 2024 number uses the updated GENCODE protein-coding gene list of 19,411 (previously 19,778) 6 .
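The percentages in the table can be reproduced with simple arithmetic. This sketch assumes the denominator for each year is the sum of PE1 and missing proteins (PE5 entries excluded); for 2023 and 2024 that sum equals the gene-list sizes quoted in the note (19,778 and 19,411). The function and variable names are illustrative.

```python
def percent_complete(pe1: int, total_genes: int) -> float:
    """Share of predicted proteins with PE1 evidence, as a percentage."""
    return 100.0 * pe1 / total_genes


# (PE1 proteins, missing proteins) per report year, taken from the table above.
reports = {
    2016: (16_518, 2_663),
    2023: (18_397, 1_381),
    2024: (18_138, 1_273),
}

completion = {}
for year, (pe1, missing) in reports.items():
    total = pe1 + missing  # assumed denominator: PE1 + PE2-4
    completion[year] = percent_complete(pe1, total)
    print(year, total, round(completion[year]))
```

The drop in the PE1 count from 2023 to 2024 comes with a smaller denominator, which is why the rounded completion figure stays at 93% even though fewer proteins are listed as confirmed.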


"Achieving the unambiguous identification of 93% of predicted proteins encoded from across all chromosomes represents remarkable experimental progress on the Human Proteome parts list," noted the 2023 HPP report 7 .

The Search Strategy: How Scientists Hunt Missing Proteins

Finding the last remaining missing proteins has required increasingly sophisticated approaches. Scientists face multiple challenges: these proteins may be expressed only in specific tissues, at low abundance, during brief developmental windows, or under unique physiological conditions 3 .

The Testis Strategy: Mining a Biological Goldmine

One of the most productive approaches has involved deep-diving into tissues with unusual genetic activity. Researchers discovered that testis tissue contains by far the largest number of tissue-specific transcripts—nearly 50 times more than any other tissue 8 . This makes it a veritable goldmine for finding proteins that are elusive elsewhere in the body.

Missing Protein Discovery Workflow

  1. Data Mining: Survey public databases to identify missing proteins with transcript evidence in specific tissues
  2. Sample Selection: Choose promising tissues (like testis) that show high expression of target genes
  3. Protein Digestion: Use enzymes like trypsin to break proteins into measurable peptides
  4. Mass Spectrometry Analysis: Identify peptides based on their mass-to-charge ratio
  5. Stringent Validation: Apply HPP guidelines requiring at least two unique peptides of ≥9 amino acids each
  6. Data Sharing: Contribute findings to central resources like PeptideAtlas and UniProtKB

This method has proven highly successful. In one year alone, researchers identified 166 previously missing proteins in testis and 89 more in spermatozoa 8 .
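The digestion and validation steps of this workflow can be illustrated in a few lines. This is a simplified sketch, not the HPP pipeline: it assumes the common trypsin rule (cleave after K or R unless the next residue is P), allows no missed cleavages, and treats a peptide as "unique" if it occurs in the digest of exactly one protein in a toy database. The protein names, sequences, and observed peptides are invented for the example, and real validation also involves spectrum quality and false-discovery controls that are not modeled here.

```python
import re
from collections import defaultdict

MIN_LENGTH = 9    # minimum peptide length from the guideline quoted above
MIN_PEPTIDES = 2  # minimum number of unique peptides per protein


def tryptic_digest(sequence: str) -> list:
    """Split after K or R, except before P (simplified; no missed cleavages)."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", sequence) if p]


def qualifying_peptides(observed, database) -> dict:
    """Group observed peptides that are long enough and map to one protein only."""
    owners = defaultdict(set)  # peptide -> proteins whose digest contains it
    for name, sequence in database.items():
        for peptide in tryptic_digest(sequence):
            owners[peptide].add(name)

    evidence = defaultdict(list)
    for peptide in set(observed):
        proteins = owners.get(peptide, set())
        if len(peptide) >= MIN_LENGTH and len(proteins) == 1:
            (protein,) = proteins
            evidence[protein].append(peptide)
    return evidence


def passes_rule(evidence: dict, protein: str) -> bool:
    """True if the protein has at least MIN_PEPTIDES qualifying peptides."""
    return len(evidence.get(protein, [])) >= MIN_PEPTIDES


# Toy database: both hypothetical proteins share the peptide SHAEDPEPTIDEK.
proteins = {
    "PROT_X": "MASTLLGEVAQKSHAEDPEPTIDEKGLWDTYNVESPR",
    "PROT_Y": "MDEKSHAEDPEPTIDEKVVNQGTLYAAFSRACDK",
}
observed = ["MASTLLGEVAQK", "SHAEDPEPTIDEK", "GLWDTYNVESPR", "VVNQGTLYAAFSR"]

evidence = qualifying_peptides(observed, proteins)
for name in proteins:
    print(name, passes_rule(evidence, name))
```

In this toy case PROT_X is supported by two unique peptides of sufficient length, while PROT_Y has only one because its other long peptide is shared with PROT_X. A shared peptide cannot tell the two proteins apart, which is why it does not count toward either.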

The Toolkit for Protein Hunters

| Tool/Reagent | Function in Protein Discovery |
|--------------|-------------------------------|
| Mass Spectrometers | High-sensitivity instruments that identify proteins by measuring peptide masses |
| Trypsin | Enzyme that digests proteins into smaller, analyzable peptides |
| Alternative Proteases | Specialized enzymes used when trypsin cannot digest certain proteins |
| Chromatography Systems | Separate complex peptide mixtures before mass analysis |
| Specific Tissues (e.g., testis) | Biological sources rich in rarely expressed proteins |
| Custom Protein Databases | Reference databases built from RNA-seq data to aid peptide identification |
| Antibody Reagents | Used in orthogonal methods to confirm protein presence independently |
| Cell Line Models | Engineered systems expressing target proteins for controlled study |
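To make "measuring peptide masses" concrete, the sketch below computes the mass-to-charge ratio a mass spectrometer would be expected to report for an unmodified peptide. It assumes standard monoisotopic residue masses (rounded to five decimals) and protonation as the only source of charge; the function names are illustrative, and modifications, isotopes, and fragmentation are ignored.

```python
# Monoisotopic residue masses in daltons (standard values, rounded).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565  # one water per peptide accounts for the free termini
PROTON = 1.007276  # mass added per positive charge


def peptide_mass(sequence: str) -> float:
    """Monoisotopic mass of an unmodified, neutral peptide."""
    return sum(RESIDUE_MASS[residue] for residue in sequence) + WATER


def mass_to_charge(sequence: str, charge: int) -> float:
    """m/z of the peptide carrying `charge` extra protons."""
    return (peptide_mass(sequence) + charge * PROTON) / charge


for z in (1, 2, 3):
    print(z, round(mass_to_charge("PEPTIDE", z), 4))
```

Leucine and isoleucine have identical masses, so a measurement of this kind cannot tell them apart on its own.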

Beyond Detection: The Deeper Challenges of the Dark Proteome

Finding the missing proteins is only the beginning of the story. Scientists now recognize that an even greater challenge lies in what they term the "dark proteome"—those proteins for which we have insufficient information on expression, structure, or function 1 .

The neXt-CP50 Initiative: Shedding Light on Function

To address this challenge, the C-HPP launched the neXt-CP50 initiative in 2018, aiming to characterize the function of 50 poorly understood proteins (uPE1 proteins) over three years 1 . This pilot project brought together 15 teams from 12 countries to tackle "dark proteins" that have been detected but whose functions remain mysterious 1 .

Layers of the Dark Proteome
  • Protein Isoforms: Alternative splicing produces multiple variations from a single gene
  • Post-Translational Modifications: Chemical additions that alter protein function
  • Uncharacterized PE1 Proteins: Detected proteins with unknown functions
  • Proteoforms: All different molecular forms a protein can take

"As researchers try to detect the missing proteins, there are still other regions of the dark proteome to explore. When you account for those variations, there could be millions of varieties of proteins inside each of us."

2022 Report
Dark Proteome Challenge: While we've detected 93% of proteins, understanding their functions, modifications, and interactions remains a monumental task.

The Future of Protein Discovery: New Technologies and Paradigms

As the low-hanging fruit disappears, finding the remaining missing proteins requires increasingly innovative approaches. Researchers are developing new technologies that may finally illuminate the darkest corners of the proteome.

Next-Generation Protein Analysis

Traditional mass spectrometry has limitations: it is destructive and can miss subtle variations. New platforms, like one being commercialized by Nautilus Biotechnology, take a different approach, immobilizing proteins onto a surface and using fluorescent reagents to repeatedly probe what is there. This allows for more comprehensive detection of protein variations.

Other groups are exploring nanopore protein sequencing, which pulls proteins through small openings and reads amino acid sequences by changes in electrical conductivity. This method can potentially keep track of chemical modifications that would be missed by conventional approaches.

The Shift to Functional Annotation

With 93% of the protein parts list confirmed, the HPP is increasingly focusing on the Grand Challenge Project: "A Function for Every Protein" 6 . In 2024, researchers introduced a new Function Evidence (FE) scoring system that ranks our understanding of each protein's molecular functions 6 . This parallel system to the PE evidence codes will help prioritize proteins needing functional characterization.

"The recent development of a Function Evidence (FE) score represents a key step in the pursuit of the HPP Grand Challenge Project, 'A Function for Every Protein,'" states the 2024 HPP report 6 .

Current Status of the Human Proteome (2024)

| Category | Number of Proteins |
|----------|--------------------|
| Total Protein-Coding Genes | 19,411 |
| PE1 Proteins (Confirmed) | 18,138 |
| PE1 with MS Evidence | 17,407 |
| Missing Proteins (PE2+3+4) | 1,273 |
| Dubious Proteins (PE5) | 600 |

The End of the Beginning

The Chromosome-Centric Human Proteome Project represents one of the most ambitious scientific collaborations in biology. From dividing the human chromosomes among international teams to establishing rigorous standards for protein identification, the C-HPP has created a framework that has dramatically accelerated our understanding of the human molecular machinery.

"For me, darkness is a metaphor for what we don't understand."

Sean O'Donoghue of the Garvan Institute of Medical Research

While the number of truly missing proteins dwindles, the work is far from over. The scientific community now faces the more complex challenge of understanding the dance of proteins in our cells—how they interact, when they're modified, and what roles they play in health and disease. Each missing protein found and characterized represents another light turned on in the vast darkness of biological uncertainty, bringing us closer to a complete understanding of what it means to be human at the molecular level.

References