Discover how protein expression patterns are revolutionizing gene function prediction and revealing the secrets of previously uncharacterized proteins
Imagine exploring a vast library where most books are written in a language you can't read—this is the challenge scientists face when studying microbial communities like the human gut microbiome. Despite tremendous advances in genetic sequencing, up to 70% of proteins in these communities remain complete mysteries, their functions unknown.
This vast landscape of uncharacterized genes has been called biology's "dark matter"—an invisible force that undoubtedly shapes health and disease but evades our understanding.
The traditional approach to identifying gene functions, studying proteins in isolation, has proven inadequate for addressing this challenge at scale. A newer strategy is now making headway: using protein expression patterns to predict what these unknown genes do. By observing how proteins work together in living systems, scientists are developing a "guilt-by-association" method that is steadily illuminating biology's dark matter [3].
Up to 70%: the share of proteins in microbial communities with unknown functions, representing biology's "dark matter".
Guilt by association: the principle that proteins with similar expression patterns likely have related functions.
Dynamic protein expression data provides clues to function that sequence alone cannot reveal.
For decades, scientists have primarily relied on sequence similarity to predict protein function. If a newly discovered gene looked similar to a well-characterized one, researchers would infer it had a similar function. This approach has successfully annotated millions of genes, but it hits a wall when encountering completely novel sequences with no recognizable similarities to known proteins [3].
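The logic of similarity-based annotation, and the way it fails on novel sequences, can be sketched in a few lines of Python. This is only an illustration: the reference sequences, the function labels, and the 0.8 cutoff are made up, and `difflib.SequenceMatcher` is a crude stand-in for the alignment tools that real annotation pipelines use.

```python
from difflib import SequenceMatcher

# Hypothetical toy reference set: sequence -> annotated function.
REFERENCE = {
    "MKTAYIAKQRQISFVKSHFSRQ": "transcription regulation",
    "MSDNGELEDKPPAPPVRMSSTI": "signal transduction",
}


def similarity(a: str, b: str) -> float:
    """Crude similarity in [0, 1]; a stand-in for a real alignment score."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def annotate_by_homology(query: str, min_similarity: float = 0.8):
    """Return the function of the most similar reference, or None if no hit passes the cutoff."""
    best_ref = max(REFERENCE, key=lambda ref: similarity(query, ref))
    if similarity(query, best_ref) >= min_similarity:
        return REFERENCE[best_ref]
    return None  # no recognizable homolog: the sequence stays uncharacterized


# One substitution away from the first reference sequence.
close_variant = "MKTAYIAKQRQISFVKSHFSRK"
# Shares almost nothing with either reference.
novel = "WWCCHHWWCCHHWWCCHHWWCC"

print(annotate_by_homology(close_variant))
print(annotate_by_homology(novel))
```

The close variant inherits its neighbor's annotation, while the novel sequence comes back empty. That empty result is the gap that expression-based methods aim to fill.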
The challenge is particularly acute in microbial communities, where genetic novelty is the rule rather than the exception. As one researcher notes, "Even in the well-characterized Escherichia coli pangenome, only 37.6% of protein families were annotated with biological process terms" that describe their functions [3]. In less studied species, the annotation gap is even more dramatic.
The new approach to function prediction embraces a simple but powerful principle: proteins that work together tend to be expressed together. Just as friends who attend the same events likely share common interests, proteins with similar expression patterns across different conditions likely participate in related biological processes [3].
This method doesn't require prior knowledge of a protein's structure or evolutionary relationships. Instead, it uses high-dimensional biological data to build association networks. Scientists can then infer functions for unknown proteins based on their "neighbors" in these networks: proteins with established functions that show similar expression patterns [3].
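A minimal sketch of guilt-by-association, assuming Pearson correlation as the coexpression measure: an unknown protein family borrows the annotation of its best-correlated annotated neighbor. The family names, expression values, function labels, and the 0.7 cutoff are all hypothetical, and real methods combine many neighbors and many samples rather than a single best match.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    # A flat profile has no defined correlation; treat it as uninformative.
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0


# Hypothetical expression levels across five conditions.
EXPRESSION = {
    "famA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "famB": [5.0, 4.0, 3.0, 2.0, 1.0],
    "famC": [2.0, 2.1, 1.9, 2.0, 2.2],
    "unknown1": [1.1, 2.3, 2.9, 4.2, 4.8],
}

# Hypothetical annotations for the characterized families.
KNOWN = {
    "famA": "butyrate metabolism",
    "famB": "flagellar assembly",
    "famC": "ribosome biogenesis",
}


def predict_function(target, min_r=0.7):
    """Transfer the annotation of the best-correlated annotated neighbor, if strong enough."""
    scores = {
        name: pearson(EXPRESSION[target], EXPRESSION[name])
        for name in KNOWN
        if name != target
    }
    best = max(scores, key=scores.get)
    if scores[best] >= min_r:
        return KNOWN[best], scores[best]
    return None, scores[best]


function, r = predict_function("unknown1")
print(function, round(r, 3))
```

Here `unknown1` rises and falls with `famA` across the conditions, so it inherits that family's label. Nothing about its sequence was needed to make the call.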
The shift from static sequence analysis to dynamic expression patterns represents a fundamental change in how we predict protein function, moving from "what it looks like" to "what it does with others."
In 2025, a study published in Nature Biotechnology introduced FUGAsseM (Function predictor of Uncharacterized Gene products by Assessing high-dimensional community data in Microbiomes), a method specifically designed to predict protein functions in microbial communities [3]. The research team applied their approach to data from the Integrative Human Microbiome Project (HMP2/iHMP), which included 1,595 gut metagenomes and 800 metatranscriptomes from 109 participants with and without inflammatory bowel disease.
The researchers' innovation was a two-layered machine learning system that integrated multiple types of evidence. The first layer consisted of individual random forest classifiers trained on different data types: coexpression patterns, genomic proximity, sequence similarity, and domain-domain interactions. The second layer combined these predictions into a single confidence score, weighting each evidence type according to its reliability for predicting specific functions [3].
1. Researchers gathered metatranscriptomes (community-wide RNA data) and metagenomes (DNA data) from the same microbial communities [3].
2. They clustered similar protein sequences into families across 336 microbial species with at least 500 protein families each [3].
3. For each protein family, the system computed association scores based on coexpression patterns, genomic proximity, sequence similarity, and domain interactions [3].
4. The two-layer random forest system integrated these evidence types to generate function predictions with confidence scores [3].
5. Predictions were benchmarked against known protein functions and compared to existing state-of-the-art methods [3].
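The idea of weighting each evidence type by its reliability can be illustrated with a deliberately simplified sketch. Note the substitution: the published method uses random forest classifiers in both layers, whereas the code below assumes the first-layer scores are already given and replaces the second layer with a plain accuracy-weighted average. Every score, family name, and label is hypothetical.

```python
# Hypothetical layer-1 outputs for a single function term: each evidence type's
# estimated probability that a protein family carries that function.
LAYER1 = {
    "coexpression":      {"f1": 0.9, "f2": 0.2, "f3": 0.8, "f4": 0.1, "u1": 0.85},
    "genomic_proximity": {"f1": 0.6, "f2": 0.7, "f3": 0.4, "f4": 0.3, "u1": 0.40},
    "sequence":          {"f1": 0.8, "f2": 0.1, "f3": 0.3, "f4": 0.2, "u1": 0.10},
}

# Hypothetical ground truth for the characterized families (1 = has the function).
TRUTH = {"f1": 1, "f2": 0, "f3": 1, "f4": 0}


def reliability(scores):
    """Fraction of labeled families an evidence type classifies correctly at a 0.5 cutoff."""
    correct = sum((scores[fam] >= 0.5) == bool(label) for fam, label in TRUTH.items())
    return correct / len(TRUTH)


# Simplified "layer 2": one weight per evidence type, learned from the labeled families.
WEIGHTS = {evidence: reliability(scores) for evidence, scores in LAYER1.items()}


def combined_confidence(family):
    """Reliability-weighted average of the layer-1 scores for one family."""
    total_weight = sum(WEIGHTS.values())
    weighted_sum = sum(WEIGHTS[ev] * LAYER1[ev][family] for ev in LAYER1)
    return weighted_sum / total_weight


def unweighted_confidence(family):
    """Plain average, for comparison with the weighted score."""
    return sum(LAYER1[ev][family] for ev in LAYER1) / len(LAYER1)


print(WEIGHTS)
print(combined_confidence("u1"), unweighted_confidence("u1"))
```

In this toy data, coexpression classifies every labeled family correctly, so its strong vote for the uncharacterized family `u1` counts for more than the weak sequence signal. That is the situation the article describes for proteins with little or no homology to known sequences.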
The results were striking: FUGAsseM predicted high-confidence functions for more than 443,000 protein families that were previously uncharacterized. These included over 33,000 protein families with weak or no homology to known proteins, precisely the cases where traditional methods fail completely [3].
Protein family categories by homology to known proteins:

| Category | Description | % of Total |
|---|---|---|
| SC | Strong homology to characterized proteins with informative terms | 14.3% |
| SNI | Strong homology to characterized proteins with noninformative terms | ~11.9% |
| SU | Strong homology to uncharacterized UniProtKB proteins | ~60.5% |
| RH | Remote homology to UniProt proteins | ~8.0% |
| NH | No homology to UniProt proteins | ~1.7% |

How FUGAsseM compares with other function prediction approaches:

| Method | Approach | Best For | Limitations |
|---|---|---|---|
| Traditional Sequence Similarity | Compares protein sequences to databases | Proteins with clear evolutionary relationships | Fails for novel sequences without homologs |
| FUGAsseM | Integrates multiple evidence types including coexpression | Microbial communities, novel proteins | Requires multiple data types from the same community |
| Single-Organism Methods | Optimized for isolated microorganisms | Well-studied model organisms | Poor performance on community-derived proteins |
Behind these advances in function prediction lies an array of sophisticated tools and technologies that enable researchers to generate and analyze protein expression data.
| Tool or Technology | Function | Application in Function Prediction |
|---|---|---|
| Metatranscriptomics | Profiles all RNA molecules in microbial communities | Reveals coexpression patterns for guilt-by-association prediction [3] |
| STRING Database | Compiles protein-protein association networks | Provides known interactions for validation and integration [1] |
| Genetically Encoded Affinity Reagents (GEARs) | Tags and manipulates proteins in living cells | Enables visualization and perturbation of protein function [6] |
| Mass Spectrometry | Identifies and quantifies proteins | Provides direct protein expression data complementary to transcriptomics |
| exvar R Package | Analyzes gene expression and genetic variation | User-friendly tool for processing expression data [5] |
| InMoose Python Library | Implements differential expression analysis | Enables identification of condition-specific protein expression |
The STRING database aggregates known and predicted protein-protein associations, including direct (physical) and indirect (functional) interactions. This resource is invaluable for validating predictions made by methods like FUGAsseM [1].
Genetically Encoded Affinity Reagents (GEARs) use short epitope tags and specialized binders to visualize and manipulate endogenous proteins in living organisms, providing crucial insights into protein localization and interaction partners [6].
While methods like FUGAsseM were developed for microbial communities, the approach of using expression data for function prediction has far broader implications. The same principles are being applied to human cells, model organisms, and environmental samples where uncharacterized proteins abound.
The integration of protein expression data with other emerging technologies is particularly promising. Spatial proteomics, for instance, allows researchers to map protein expression within intact tissues while maintaining structural context. As one expert notes, "This spatial information is key to understanding cellular functions and disease processes" [4].
The practical implications of these advances extend directly to human health. Large-scale proteomics projects are already linking protein expression patterns to disease states and treatment responses. For example, the U.K. Biobank Pharma Proteomics Project is analyzing 600,000 samples to uncover associations between protein levels, genetics, and disease phenotypes [4].
The potential for drug development is equally significant, provided expression data is paired with genetics. As one researcher explains, "With proteomics, you cannot get to causality. There can be many reasons why proteins are moving in the same or opposite direction. But if you have genetics, you can also get to causality" [4]. This combination of proteomic and genetic data is accelerating the identification of novel drug targets and biomarkers for precision medicine.
The revolution in protein function prediction represents more than just a technical advance—it signifies a fundamental shift in how we explore the biological world. By moving beyond sequence alone to embrace the dynamic patterns of protein expression, scientists have developed a powerful lens for examining biology's dark matter.
As these methods continue to evolve, integrating ever more diverse data types and leveraging increasingly sophisticated machine learning approaches, we stand on the threshold of a new era of discovery. The thousands of proteins once relegated to biology's shadows are now stepping into the light, revealing their roles in health, disease, and the fundamental processes of life.
The message is clear: to understand what genes do, we must watch them in action—observing how their protein products work together in the complex dance of biological systems. This shift from static sequence to dynamic function is illuminating the dark corners of biology and expanding our understanding of life's molecular machinery.