CASP15 RNA Results: How AlphaFold's Legacy is Transforming RNA Structure Prediction

Violet Simmons, Jan 12, 2026

Abstract

This analysis of the CASP15 RNA assessment reveals a field in rapid evolution, catalyzed by deep learning. We explore the foundational shift from physics-based to AI-driven models, dissect the leading methodological frameworks, identify persistent challenges and optimization strategies, and validate performance through rigorous comparative benchmarks. For researchers and drug developers, this review synthesizes the state of the art, highlighting implications for targeting RNA in disease and the path toward experimental accuracy.

The CASP15 RNA Revolution: Charting the Shift from Physics to AI-Driven Structure Prediction

The Critical Assessment of Structure Prediction (CASP) is the premier community-wide experiment for objectively assessing the state-of-the-art in protein and RNA structure prediction. CASP15 (2022) represented a watershed moment for RNA tertiary structure prediction, marking the transition from proof-of-concept to a practical, albeit evolving, technology. This whitepaper, framed within a broader thesis on CASP15 assessment results, provides an in-depth technical analysis of the experiment's core methodology, key findings, and implications for researchers and drug development professionals.

CASP15 RNA Structure Prediction: Experimental Protocol

The core CASP experiment follows a rigorously blind assessment protocol to prevent bias.

2.1 Target Selection and Distribution:

  • Source: Experimental structures of RNA molecules, solved via X-ray crystallography or cryo-EM, are solicited from structural biologists worldwide prior to public release.
  • Categorization: Targets are classified by difficulty (based on available homologous sequences and structures) and type (single chain, multi-chain, RNA-protein complexes).
  • Distribution: Only the nucleotide sequence(s) of the target are provided to prediction groups. No structural information is disclosed.

2.2 Prediction Window:

  • Groups have a limited, predefined period (typically 2-4 weeks) to submit their predicted 3D coordinate models for each target.

2.3 Assessment Methodology:

  • Primary Metric - GDT_TS (Global Distance Test Total Score): The standard metric for assessing overall model accuracy. It calculates the percentage of nucleotide residues in a model that can be superimposed under a defined distance cutoff (e.g., 1Å, 2Å, 4Å, 8Å) onto the corresponding residues in the experimentally determined reference structure.
  • RNA-Specific Metrics:
    • Interaction Network Fidelity (INF): Measures how accurately a model reproduces the base-pairing and stacking interactions of the reference structure, including non-canonical base pairs (Leontis-Westhof classification).
    • Mean Absolute Error (MAE) of torsion angles: Assesses local backbone conformation accuracy (alpha, beta, gamma, delta, epsilon, zeta).
    • Root Mean Square Deviation (RMSD): Computed after optimal superposition of the model onto the reference structure, often reported for the backbone (P-atoms) or all heavy atoms.
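As an illustration of how an interaction-based metric differs from a coordinate-based one, the sketch below computes INF as the geometric mean of precision (PPV) and sensitivity over two sets of annotated interactions. The tuple format `(i, j, class)` and the example annotations are assumptions made for this sketch; in practice the interaction lists come from a structure annotation tool, and this is not the assessors' actual code.

```python
import math

def interaction_network_fidelity(reference_pairs, model_pairs):
    """Return INF = sqrt(PPV * sensitivity) for two sets of interactions."""
    reference, model = set(reference_pairs), set(model_pairs)
    if not reference or not model:
        return 0.0
    true_positives = len(reference & model)
    ppv = true_positives / len(model)                # fraction of predicted interactions that are real
    sensitivity = true_positives / len(reference)    # fraction of real interactions that were predicted
    return math.sqrt(ppv * sensitivity)

# Hypothetical annotations: (residue i, residue j, Leontis-Westhof class)
reference = {(1, 20, "cWW"), (2, 19, "cWW"), (3, 18, "cWW"), (5, 12, "tSH")}
model = {(1, 20, "cWW"), (2, 19, "cWW"), (3, 18, "cWW"), (6, 12, "tSH")}

inf_score = interaction_network_fidelity(reference, model)
print(f"INF = {inf_score:.2f}")
```

Here three of four interactions match in both directions, so PPV and sensitivity are both 0.75 and INF is 0.75. A model can score well on RMSD while missing a key non-canonical pair, which is why the two kinds of metric are reported side by side.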

Core Results and Quantitative Assessment

CASP15 results demonstrated a dramatic leap in prediction accuracy, largely attributed to the successful adaptation of deep learning techniques, particularly those inspired by AlphaFold2.

Table 1: Key Quantitative Results from CASP15 RNA Assessment

| Metric | CASP14 (2020) Best Performance | CASP15 (2022) Best Performance | Description & Significance |
| --- | --- | --- | --- |
| Average GDT_TS | ~0.40-0.50 | ~0.70-0.80 | Near doubling of overall structural accuracy for top models. |
| Best Single Model GDT_TS | 0.65 (for simpler targets) | 0.90+ (for several targets) | Indicates production of models with near-experimental accuracy for favorable cases. |
| Successful Predictions | A handful of targets | Majority of targets | Technology moved from sporadic to reliable for many RNA folds. |
| Key Enabling Method | Fragment assembly, comparative modeling | End-to-end deep learning (DL) | DL models (e.g., RoseTTAFoldNA, AlphaFold2 adaptations) dominated. |

Table 2: Performance Breakdown by Target Difficulty

| Target Category | Definition | CASP15 Performance | Trend | Implication |
| --- | --- | --- | --- | --- |
| "Easy" | High sequence homology to known structures. | Excellent (GDT_TS > 0.85). | DL models excel at leveraging evolutionary information. | Reliable for well-conserved families (rRNAs, riboswitches). |
| "Hard" | Low homology, novel folds. | Variable, from good to poor. | Performance depends on the ability of DL models to learn general physical principles. | Remaining frontier for method development. |

The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Computational Tools & Resources in CASP15 RNA Prediction

| Tool/Resource | Category | Function in the Workflow |
| --- | --- | --- |
| Multiple Sequence Alignment (MSA) | Input Data | Provides evolutionary covariation information essential for deep learning models to infer spatial contacts (e.g., generated via Infernal or Rfam). |
| RoseTTAFoldNA | Prediction Software | A leading end-to-end deep learning network integrating 1D sequence, 2D distance/orientation, and 3D coordinate information for RNA/protein complexes. |
| AlphaFold2 (Modified) | Prediction Software | Adaptation of the protein-prediction architecture for RNA, utilizing attention mechanisms to generate structures from MSAs and pairwise features. |
| CASP Official Assessment Suite | Assessment | Software packages (e.g., RNA-Puzzles toolkit) used by assessors to calculate GDT_TS, INF, RMSD, and other metrics uniformly. |
| PDB (Protein Data Bank) | Reference Data | Source of experimental reference structures for final assessment and for training data. |
| Molecular Dynamics (MD) Refinement | Post-processing | Optional step to relax and refine DL-generated models using physics-based force fields (e.g., AMBER, CHARMM). |

Technical Workflow and Pathway Visualization

Diagram 1: CASP15 RNA Prediction Assessment Workflow

[Diagram: Target RNA sequence (provided by CASP) -> Prediction groups (input) -> Deep learning model, e.g., RoseTTAFoldNA (process) -> Submitted 3D model (generate) -> Assessment suite, which computes metrics (submit). In parallel: Experimental structure (held blind) -> Assessment suite (reveal and compare). Finally: Assessment suite -> Public results and analysis (publish).]

Diagram 2: Core Deep Learning Model Architecture (Simplified)

[Diagram: Input (RNA sequence + MSA) -> Embedding layer (1D sequence features) and Pair representation (2D distance/orientation). Both feed an Evoformer-like trunk (iterative 2D <-> 1D information exchange) -> Structure module (generates 3D coordinates) -> Output: full-atom 3D structure.]

CASP15 conclusively demonstrated that deep learning has revolutionized RNA tertiary structure prediction, achieving accuracy levels previously thought to be years away. For researchers, this provides a powerful new tool for generating structural hypotheses, interpreting mutational data, and designing functional experiments. For drug development, it opens avenues for structure-based design targeting functional RNA molecules in pathogens or human diseases. The remaining challenges, as identified in the broader thesis on CASP15, include robust prediction of large multi-chain assemblies, rare non-canonical motifs, and dynamic conformational states—areas that will define the focus of future CASP experiments and method development.

This article frames the history of structure prediction methodologies within the context of analyzing CASP15 RNA results. The performance of predictors in CASP15 cannot be fully understood without examining the evolution of the two foundational paradigms: physics-based (ab initio) and comparative (template-based) modeling. This pre-CASP15 landscape set the stage for the contemporary dominance of deep learning that was first decisively demonstrated in CASP14 for proteins and subsequently explored for RNAs in CASP15.

Historical Development of Core Methodologies

2.1 Physics-Based (Ab initio) Modeling

This approach uses physical principles and energetics to fold a sequence from an unfolded state without relying on known structures.

  • Early Foundations: Relied on simplified force fields (e.g., Gō-like models, coarse-grained potentials) to make the conformational search computationally tractable. Energy terms typically included van der Waals, electrostatics, solvation, and torsional potentials.
  • Key Challenge: The "folding problem" – the vastness of conformational space and the need for highly accurate energy functions.
  • Pre-CASP15 State: For RNA, methods like FARFAR (Fragment Assembly of RNA with Full-Atom Refinement) represented the state-of-the-art. It used a fragment-assembly approach guided by a knowledge-based potential, followed by full-atom refinement in ROSETTA.

2.2 Comparative (Template-Based) Modeling

This approach infers the structure of a target sequence based on its alignment to one or more evolutionarily related templates of known structure.

  • Core Principle: Relies on the observation that structure is more conserved than sequence. The key step is the accurate alignment of the target sequence to template structures.
  • Evolution: Progressed from manual modeling on a single template to automated pipelines (e.g., ModeRNA, RNABuilder) that could handle multiple templates, incorporate non-canonical pairs, and perform loop modeling.
  • Limitation: Completely dependent on the existence of a suitable homologous template in the PDB.

Quantitative Comparison of Pre-CASP15 Method Performance

The table below summarizes the typical performance characteristics and limitations of the two approaches immediately prior to the deep learning revolution evident in CASP15.

Table 1: Performance Characteristics of Pre-DL Modeling Paradigms (Pre-CASP15)

| Aspect | Physics-Based Modeling | Comparative Modeling |
| --- | --- | --- |
| Primary Input | Nucleotide sequence only. | Sequence + homologous template structure(s). |
| Theoretical Basis | Statistical or physical energy functions. | Evolutionary conservation & structural similarity. |
| Typical Accuracy (RMSD) | Highly variable: 5-20 Å for mid-sized RNAs. High accuracy possible for small motifs (<5 Å). | Generally high if close template exists (2-4 Å). Degrades sharply with lower sequence identity. |
| Key Strength | Can model novel folds with no homologs. | Fast, reliable, and accurate when templates are available. |
| Key Limitation | Computationally expensive; prone to kinetic traps; energy function inaccuracies. | Complete failure in the absence of suitable templates. |
| Representative Tool (RNA) | FARFAR (ROSETTA), SimRNA, iFoldRNA. | ModeRNA, RNABuilder, 3dRNA. |

Detailed Experimental Protocols

Protocol 1: Fragment Assembly for Ab Initio RNA Modeling (e.g., FARFAR)

  • Fragment Library Generation: For each nucleotide in the target sequence, query a database of known RNA structures (e.g., the PDB) to extract 1- and 3-nucleotide backbone fragments from sequences with local similarity.
  • Monte Carlo Assembly: Starting from an extended chain, perform a simulated annealing Monte Carlo search. a. In each step, replace a segment of the chain with a randomly selected fragment from the library. b. Score the new conformation using a knowledge-based scoring function (e.g., ROSETTA's rna_denovo score term). c. Accept or reject the move based on the Metropolis criterion.
  • Full-Atom Refinement: Take the best low-resolution models and subject them to further all-atom refinement using a more detailed physics-based potential (e.g., ROSETTA's refine protocol).
  • Cluster & Select: Cluster the resulting decoy structures by RMSD and select the centroid of the largest cluster as the final prediction.
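The Monte Carlo assembly step can be sketched in a few lines. The example below is a toy, not FARFAR: the "chain" is a list of single torsion values, the "fragment library" holds three candidate values, and `toy_score` is a hypothetical stand-in for a knowledge-based score. Only the accept/reject logic (the Metropolis criterion under a cooling temperature) is meant to mirror the protocol.

```python
import math
import random

def metropolis_accept(delta_score, temperature, rng):
    """Always accept improvements; accept worse moves with probability exp(-delta/T)."""
    if delta_score <= 0:
        return True
    return rng.random() < math.exp(-delta_score / temperature)

def toy_score(torsions, ideal=60.0):
    """Hypothetical stand-in for a knowledge-based score (lower is better)."""
    return sum((t - ideal) ** 2 for t in torsions) / 1000.0

def assemble(n_positions=10, n_steps=5000, seed=0):
    rng = random.Random(seed)
    fragment_library = [-60.0, 60.0, 180.0]   # toy "fragments": one torsion value each
    chain = [180.0] * n_positions             # start from an extended chain
    score = toy_score(chain)
    for step in range(n_steps):
        # Simulated annealing: temperature decreases linearly toward a small floor.
        temperature = max(0.01, 2.0 * (1.0 - step / n_steps))
        position = rng.randrange(n_positions)
        previous = chain[position]
        chain[position] = rng.choice(fragment_library)   # fragment insertion move
        new_score = toy_score(chain)
        if metropolis_accept(new_score - score, temperature, rng):
            score = new_score
        else:
            chain[position] = previous                   # reject: restore old state
    return chain, score

start_score = toy_score([180.0] * 10)
final_chain, final_score = assemble()
print(f"start score {start_score:.1f}, final score {final_score:.1f}")
```

Accepting some score-increasing moves at high temperature is what lets the search escape local minima early on; as the temperature falls, the procedure becomes effectively greedy.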

Protocol 2: Template-Based Modeling with ModeRNA

  • Template Identification: Perform a BLAST search of the target sequence against the PDB. Select the structure with the highest sequence identity and coverage as the primary template.
  • Sequence Alignment: Align the target sequence to the template sequence using a standard global alignment algorithm (e.g., Needleman-Wunsch).
  • Backbone Reconstruction: a. Copy the coordinates of template nucleotides where the target and template residues are identical. b. For mismatched residues, replace the side chain (base) while preserving the template's backbone phosphate and sugar coordinates.
  • Loop Modeling: For regions where the target has insertions relative to the template, or where alignment gaps exist, rebuild the loop using a fragment library or a dedicated loop modeling algorithm.
  • Energy Minimization: Run a restrained energy minimization (e.g., using AMBER or CHARMM force fields) to relieve steric clashes and optimize geometry.
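The alignment step drives every later decision in this protocol, so a small sketch may help. The code below implements Needleman-Wunsch global alignment with a linear gap penalty and then labels each alignment column with the modeling action it implies. The scoring values and the action names (`copy`, `swap_base`, `rebuild_loop`, `skip_template_residue`) are assumptions chosen for illustration; they are not ModeRNA's API or defaults.

```python
def needleman_wunsch(target, template, match=1, mismatch=-1, gap=-2):
    """Global alignment; returns (aligned_target, aligned_template, score)."""
    n, m = len(target), len(template)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if target[i - 1] == template[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    aligned_t, aligned_p = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if target[i - 1] == template[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                aligned_t.append(target[i - 1])
                aligned_p.append(template[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_t.append(target[i - 1])
            aligned_p.append("-")
            i -= 1
        else:
            aligned_t.append("-")
            aligned_p.append(template[j - 1])
            j -= 1
    return "".join(reversed(aligned_t)), "".join(reversed(aligned_p)), score[n][m]

def plan_actions(aligned_target, aligned_template):
    """Map each alignment column to a (hypothetical) modeling action."""
    actions = []
    for t, p in zip(aligned_target, aligned_template):
        if p == "-":
            actions.append("rebuild_loop")            # insertion in target
        elif t == "-":
            actions.append("skip_template_residue")   # deletion in target
        elif t == p:
            actions.append("copy")                    # identical: copy coordinates
        else:
            actions.append("swap_base")               # mismatch: keep backbone, replace base
    return actions

aligned_target, aligned_template, nw_score = needleman_wunsch("GGCAUGCC", "GGCUGCC")
actions = plan_actions(aligned_target, aligned_template)
print(aligned_target)
print(aligned_template)
print(actions)
```

In this example the target carries one extra nucleotide relative to the template, so seven columns are copied and one is flagged for loop rebuilding.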

Diagram: Evolution of RNA Structure Prediction Methods

[Diagram: Evolution of RNA structure prediction pre-CASP15. Physics principles & energy functions -> (applies) Physics-based (ab initio) modeling. Known structure databases (PDB) -> (queries) Comparative (template) modeling. Sequence alignment algorithms -> (informs) Comparative (template) modeling. Physics-based and comparative modeling both -> Hybrid methods (e.g., FARFAR) -> (paves way for) Deep learning integration (CASP15+).]

The Scientist's Toolkit: Key Research Reagent Solutions

Table 2: Essential Computational Tools and Databases in Pre-CASP15 Modeling

| Reagent / Resource | Type | Primary Function in Pre-CASP15 Workflows |
| --- | --- | --- |
| ROSETTA (rna_denovo, refine) | Software Suite | Core engine for fragment-based ab initio assembly and all-atom refinement of RNA models. |
| AMBER/CHARMM | Force Field Software | Provides the atomic-level energy parameters for physics-based scoring and molecular dynamics refinement. |
| ModeRNA | Software | Automated pipeline for comparative modeling of RNA, handling base substitutions and insertions. |
| BLAST/PSI-BLAST | Algorithm | Standard tool for identifying potential homologous template structures in the PDB via sequence alignment. |
| Protein Data Bank (PDB) | Database | Primary repository of experimentally solved 3D structures, serving as the source for templates and fragment libraries. |
| MC-Fold/MC-Sym | Software Pipeline | Predicts RNA 2D and 3D structure using nucleotide cyclic motifs and knowledge-based sampling. |
| ViennaRNA Package | Software | Predicts RNA secondary structure (folding thermodynamics), a critical input or constraint for 3D modeling. |
| ClustalW/MUSCLE | Alignment Tool | Generates multiple sequence alignments to infer evolutionary constraints and improve template selection. |

Within the context of the Critical Assessment of Structure Prediction (CASP15) RNA assessment results, this whitepaper examines how the revolutionary success of AlphaFold2 in protein structure prediction catalyzed a paradigm shift in expectations, funding, and methodological approaches for the RNA folding problem. We present a technical analysis of the state-of-the-art, detailed experimental validation protocols, and essential research tools driving the next phase of RNA structural biology.

The decisive victory of AlphaFold2 at CASP14 demonstrated that deep learning could solve the long-standing protein folding problem with atomic accuracy. This success immediately reframed the challenge of RNA structure prediction, which shares similarities (it is a biomolecular folding problem) but presents distinct, arguably greater, complexities. The "AlphaFold Catalyst" refers to the subsequent influx of resources and the strategic application of deep learning architectures, originally pioneered for proteins, to the RNA domain. CASP15, the first CASP to include a dedicated RNA assessment post-AlphaFold2, serves as the benchmark for measuring this progress.

CASP15 RNA Assessment: Quantitative Results Analysis

The CASP15 RNA prediction category evaluated models for 14 RNA targets, ranging from simple hairpins to multi-helix junctions and protein-RNA complexes. Key metrics included RMSD (all-atom and backbone), Interaction Network Fidelity (INF), and a visual assessment score. The performance highlighted both significant advances and remaining gaps.

Table 1: Summary of Top-Performing Methods in CASP15 RNA Assessment

| Method Name | Core Approach | Avg. RMSD (Å) (Top Model) | Key Strength | Notable Limitation |
| --- | --- | --- | --- | --- |
| AlphaFold2 (AF2) | End-to-end deep learning (MSA + Transformer) | 4.2* | Excellent on protein-bound RNA, tertiary contacts | Poor on isolated small RNAs, stereochemical errors |
| RoseTTAFoldNA | Hybrid network (1D, 2D, 3D tracks) | 5.1 | Good generalizability, better than AF2 on some targets | Lower accuracy than AF2 on protein-RNA complexes |
| DRFold | Deep learning-guided sampling with energy minimization | 7.3 | Robust physics-based refinement | Computationally intensive, variable results |
| ViennaRNA | Classical physics/thermodynamics | 12.8 | Accurate secondary structure prediction | Poor tertiary structure prediction |

*Adapted from protein-focused models; not an official CASP15 participant but widely benchmarked.

Table 2: Key Challenges Identified in CASP15 RNA Targets

| Challenge Category | Example Target | Problem for Predictors |
| --- | --- | --- |
| Isolated Small RNAs | R1107 (55-nucleotide hairpin) | Lack of evolutionary coupling signals in MSA |
| Multi-branch Junctions | R1113 (3-helix junction) | Modeling precise dihedral angles at junctions |
| Long-Range Tertiary Contacts | R1115 (Kink-turn motif) | Correct positioning of non-canonical base pairs |
| Protein-RNA Complexes | R1122 (SRP assembly) | Modeling RNA conformational change upon binding |

Experimental Protocols for Validation of Computational Predictions

Computational predictions require rigorous experimental validation. Below are detailed protocols for key techniques.

Chemical Mapping (SHAPE-MaP) for Structural Validation

Purpose: To probe RNA backbone flexibility and secondary structure at nucleotide resolution in vitro and in cellulo.

Protocol:

  • Sample Preparation: Refold 1-5 pmol of purified RNA in appropriate folding buffer.
  • Modification: Add 1-10 mM of SHAPE reagent (e.g., NMIA or 1M7) to the sample. Incubate at 37°C for 5-15 minutes. Include a DMSO-only negative control.
  • RNA Extraction & Purification: Ethanol precipitate RNA. Use gel purification for in cellulo samples.
  • Reverse Transcription & Library Prep: Perform reverse transcription with a primer containing a unique molecular identifier (UMI) under read-through conditions (e.g., with Mn²⁺ in the buffer), so that each SHAPE adduct is recorded as a mutation in the cDNA rather than as a truncation; this mutational profiling is the "MaP" in SHAPE-MaP. Amplify cDNA by PCR.
  • High-Throughput Sequencing: Sequence libraries on an Illumina platform.
  • Data Analysis: Map reads, compute the mutation rate at each nucleotide relative to the untreated control, and calculate normalized reactivity scores (roughly 0-2). High reactivity indicates flexibility (typically single-stranded), low reactivity indicates constraint (typically paired).
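The data-analysis step reduces to background subtraction followed by normalization. The sketch below assumes per-nucleotide event rates are already available for the modified sample and the DMSO control, and applies one commonly described heuristic: discard the top 2% of values as outliers, then scale by the mean of the next 8%. The function name, the percentages as parameters, and the toy input are assumptions for illustration; dedicated analysis pipelines differ in detail.

```python
def normalize_reactivities(modified_rates, control_rates,
                           outlier_fraction=0.02, scale_fraction=0.08):
    """Background-subtract and normalize per-nucleotide reactivities."""
    # Subtract the control signal; clamp negatives to zero.
    raw = [max(0.0, m - c) for m, c in zip(modified_rates, control_rates)]
    ranked = sorted(raw, reverse=True)
    n_outliers = int(round(outlier_fraction * len(ranked)))
    n_scale = max(1, int(round(scale_fraction * len(ranked))))
    window = ranked[n_outliers:n_outliers + n_scale]
    scale = sum(window) / len(window)
    if scale == 0:
        return [0.0] * len(raw)
    return [r / scale for r in raw]

# Toy data for a 50-nt RNA: mostly unreactive, nine reactive sites, one outlier.
control = [0.001] * 50
modified = [0.001] * 40 + [0.011] * 9 + [0.101]

reactivities = normalize_reactivities(modified, control)
print([round(r, 2) for r in reactivities[38:]])
```

With this input the single extreme value is excluded from the scaling window, so the nine moderately reactive positions normalize to about 1.0 and the outlier to about 10, rather than the outlier compressing every other value toward zero.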

Small-Angle X-ray Scattering (SAXS) for Solution-State Modeling

Purpose: To obtain low-resolution shape and overall dimensions of RNA in solution.

Protocol:

  • Sample & Buffer Matching: Dialyze RNA sample (≥1 mg/mL) into precisely matched buffer (e.g., 20 mM Tris-HCl, pH 7.5, 150 mM KCl). Use the final dialysis buffer as the blank.
  • Data Collection: Load sample into a capillary flow cell at a synchrotron beamline. Collect 1D scattering intensity I(q) vs. momentum transfer q over a continuous range (e.g., 0.01 < q < 2.5 Å⁻¹). Perform multiple exposures to check for radiation damage.
  • Basic Processing: Subtract buffer scattering from sample scattering. Generate the pair distance distribution function P(r) via indirect Fourier transform (using GNOM). Determine the maximum particle dimension (Dmax) and radius of gyration (Rg).
  • Model Reconstruction: Use ab initio bead modeling programs (e.g., DAMMIF/DAMMIN) to generate 10-20 independent dummy atom models. Align and average them (using DAMAVER) to produce a consensus envelope.
  • Validation: Fit computational prediction models (e.g., from AlphaFold2 or RoseTTAFoldNA) into the SAXS envelope using tools like Situs or Chimera.
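The protocol derives Rg from P(r) via GNOM; a quick independent estimate, often used as a sanity check, comes from the Guinier approximation, ln I(q) ≈ ln I(0) − (Rg² q²)/3, which holds only at low angles. The sketch below fits that line with plain least squares and iteratively restricts the fit to q·Rg ≤ 1.3. The cutoff value, the function names, and the synthetic data are assumptions for illustration, and this is not a substitute for dedicated SAXS software.

```python
import math

def linear_fit(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def guinier_rg(q, intensity, qrg_max=1.3, max_iter=20):
    """Estimate (Rg, I0) from the low-q region of a scattering curve."""
    points = list(zip(q, intensity))
    rg = i0 = None
    for _ in range(max_iter):
        slope, intercept = linear_fit([qi ** 2 for qi, _ in points],
                                      [math.log(ii) for _, ii in points])
        if slope >= 0:
            raise ValueError("non-negative Guinier slope; check for aggregation or bad data")
        rg, i0 = math.sqrt(-3.0 * slope), math.exp(intercept)
        # Keep only points inside the Guinier region for the current Rg estimate.
        kept = [(qi, ii) for qi, ii in zip(q, intensity) if qi * rg <= qrg_max]
        if len(kept) < 3 or kept == points:
            break
        points = kept
    return rg, i0

# Synthetic, noise-free curve for a particle with Rg = 20 Å and I(0) = 100.
q_values = [0.005 * k for k in range(1, 21)]   # Å^-1
curve = [100.0 * math.exp(-(qi * 20.0) ** 2 / 3.0) for qi in q_values]

rg_est, i0_est = guinier_rg(q_values, curve)
print(f"Rg = {rg_est:.2f} Å, I(0) = {i0_est:.1f}")
```

Comparing this Rg with the value computed from a predicted model's coordinates gives a fast first check before any envelope fitting.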

The Scientist's Toolkit: Essential Research Reagent Solutions

Table 3: Key Reagents for RNA Structure Prediction & Validation

| Item | Function/Benefit | Example Product/Kit |
| --- | --- | --- |
| In Vitro Transcription Kit | High-yield synthesis of long, pure RNA for biophysical studies. | HiScribe T7 Quick High Yield RNA Synthesis Kit |
| SHAPE Reagent | Selective 2'-OH acylation for probing RNA backbone flexibility. | 1M7 (1-methyl-7-nitroisatoic anhydride) |
| Structure-Specific Nucleases | Probing double-stranded (RNase V1) vs. single-stranded (RNase T1) regions. | RNase V1, RNase T1 (Thermo Scientific) |
| Deuterated NMR Buffers | Essential for obtaining high-resolution NMR spectra of RNA. | D2O, deuterated Tris-d11, KCl (Cambridge Isotope Labs) |
| Cryo-EM Grids | Ultrastable supports for vitrifying large RNA/protein-RNA complexes. | UltrAuFoil R1.2/1.3 300 mesh gold grids |
| Next-Gen Sequencing Library Prep Kit | For SHAPE-MaP and related high-throughput structure probing. | NEBNext Ultra II Directional RNA Library Prep |
| Molecular Dynamics Force Field | All-atom refinement of predicted RNA models. | AMBER ff19SB + OL3 RNA force field |

Visualizing Workflows and Relationships

[Diagram: Input (RNA sequence) -> Deep learning prediction (AlphaFold2/RoseTTAFoldNA) -> initial model -> Physics-based refinement (molecular dynamics) -> refined model -> Experimental validation (SHAPE, SAXS, cryo-EM) -> validation and iterative refinement -> High-confidence 3D structural model -> Public deposition (PDB, RNAcentral). Feedback loop: Experimental validation -> Deep learning prediction (re-training data).]

Title: RNA Structure Determination Workflow Post-AlphaFold

[Diagram: Core RNA folding challenge -> Limited evolutionary coupling signals; Intrinsic flexibility & dynamics; Dependence on metal ions (Mg2+). All three -> CASP15 RNA assessment. Separately: AlphaFold2 success (CASP14) -> Raised expectations for RNA -> Adaptation of DL architectures and Focus on generating experimental data; both -> CASP15 RNA assessment.]

Title: From AlphaFold Success to RNA CASP15 Challenges

The CASP15 assessment demonstrates that the AlphaFold catalyst has propelled RNA structure prediction into a new era. While pure deep learning approaches excel for protein-bound RNAs with clear evolutionary signals, significant hurdles remain for isolated, dynamic RNAs. The future lies in integrated hybrid approaches that combine the pattern-recognition power of deep learning with the biophysical realism of physics-based simulations, all under the constraint of robust experimental data. The redefined expectation is no less than an "AlphaFold moment" for RNA, demanding continued innovation in algorithms, benchmarking, and integrative structural biology.

This whitepaper, framed within the broader thesis on CASP15 RNA structure prediction assessment results, provides a technical guide to the core datasets used in the Critical Assessment of Structure Prediction (CASP) 15 experiment. CASP15, held in 2022, marked a significant evolution in the assessment of three-dimensional structure prediction by incorporating an unprecedented number of RNA-only and RNA-protein complex targets. The selection emphasized biological relevance, structural complexity, and length, pushing the boundaries of computational methodology.

Core Datasets and Target Characteristics

The CASP15 experiment featured targets categorized primarily as RNA-only and RNA-protein complexes. The data highlight a deliberate shift towards larger, more intricate, and biologically significant structures compared to previous CASP rounds.

Table 1: CASP15 RNA and RNA-Protein Target Summary

| Target Category | Number of Targets | Avg. Length (RNA in nt, protein in aa) | Length Range (RNA in nt, protein in aa) | Key Biological Themes |
| --- | --- | --- | --- | --- |
| RNA-Only | 12 | 188 | 47 - 549 | Riboswitches, Ribozymes, Viral RNAs, lncRNAs |
| RNA-Protein Complexes | 9 | RNA: 76, Protein: 238 | RNA: 22-172, Protein: 97-480 | Viral Polymerases, CRISPR-Cas, Splicing Factors, Ribonucleoproteins |

Table 2: Notable CASP15 Targets with Biological Relevance

| Target ID | Description | Length (nt/aa) | Complexity & Relevance |
| --- | --- | --- | --- |
| R1101 | HOX antisense intergenic RNA (HOTAIR) MALAT1-like domain | 47 nt | Human lncRNA, chromatin regulation |
| R1107 | SARS-CoV-2 frameshifting stimulation element (FSE) | 77 nt | Viral translational regulation, drug target candidate |
| R1113 | Fusobacterium RNA motif (riboswitch) | 172 nt | Bacterial gene regulation, novel ligand-binding motif |
| R1116 | Vibrio cholerae Vc2 ribozyme | 189 nt | Bacterial self-cleaving RNA, structural diversity |
| H1114 | Candidatus Prometheoarchaeum syntrophicum CRISPR-associated protein Cas12l | RNA: 22, Prot: 480 | CRISPR-Cas type V-L system, RNA-guided DNA targeting |
| H1115 | Influenza D virus polymerase subunit PB2 | RNA: 77, Prot: 759 | Viral replication complex, potential broad-spectrum antiviral target |

Experimental Protocols for Target Structure Determination

The experimental methodologies used to solve the reference structures for CASP15 targets are critical for understanding the data's provenance and the challenges predictors faced.

Protocol 1: Cryo-Electron Microscopy (Cryo-EM) for Large Complexes

  • Application: Used for large RNA-protein complexes like viral polymerases (H1115) and CRISPR systems.
  • Detailed Method:
    • Sample Preparation: The complex is expressed, purified, and vitrified by rapid plunging into liquid ethane.
    • Data Collection: Micrographs are collected on a cryo-TEM at 300kV, with a defocus range of -0.8 to -2.5 µm, at a nominal magnification yielding ~0.8 Å/pixel.
    • Image Processing: Particle picking, 2D classification, and initial model generation are performed in CryoSPARC. Subsequent 3D refinement, CTF refinement, and Bayesian polishing are conducted in RELION.
    • Model Building: An initial atomic model is built de novo or by docking known domains into the density map using Coot, followed by iterative real-space refinement in Phenix.

Protocol 2: X-ray Crystallography for RNA-Only Targets

  • Application: Used for determining high-resolution structures of riboswitches (R1113) and ribozymes.
  • Detailed Method:
    • Crystallization: RNA is transcribed in vitro, purified, and crystallized via vapor diffusion. Crystals are often grown in conditions containing divalent cations (Mg²⁺) and cryoprotected.
    • Data Collection: Diffraction data is collected at a synchrotron source (e.g., Advanced Photon Source) at 100K. A complete dataset is collected from a single crystal.
    • Phasing and Refinement: Phasing is achieved via molecular replacement or experimental methods (SAD/MAD with halides). The model is built in Coot and refined with restrained refinement in Refmac or Phenix.

Visualizing the CASP15 Experiment Workflow and Biological Systems

[Diagram: Experimental structure determination -> (release upon expiration) CASP organizers -> (select & anonymize) Target sequences & categories (blinded) -> (distribute) Prediction groups (computational methods) -> (generate) 3D structure predictions (submitted models) -> (submit) CASP organizers -> (compare to experimental) Global & per-target assessment (RMSD, lDDT, GDT).]

CASP15 Experiment Workflow from Target to Assessment

[Diagram: SARS-CoV-2 genome, (+) ssRNA -> (contains) Frameshift stimulation element (FSE), target R1107 -> (induces -1 programmed ribosomal frameshift) Ribosome. Ribosome -> (standard translation) ORF1a polyprotein. Ribosome -> (-1 frameshifted translation) ORF1ab replicase polyprotein.]

SARS-CoV-2 Frameshift Element (Target R1107) Mechanism

The Scientist's Toolkit: Key Research Reagent Solutions

Table 3: Essential Reagents and Materials for CASP-Relevant Structural Biology

| Item | Function / Application in CASP15 Context |
| --- | --- |
| In vitro Transcription Kits (T7 RNA Polymerase) | High-yield synthesis of pure, homogeneous RNA targets for crystallization or biochemical studies. |
| Size Exclusion Chromatography (SEC) Columns (e.g., Superdex 200 Increase) | Critical final purification step for RNA and RNA-protein complexes to isolate monodisperse sample for cryo-EM/crystallography. |
| Cryo-EM Grids (e.g., UltrAuFoil, Quantifoil) | Gold or copper grids with a perforated support film for vitrifying macromolecular samples for cryo-EM data collection. |
| Crystallization Screens (e.g., JCSG, Morpheus II) | Sparse matrix screens containing diverse conditions to identify initial crystallization hits for novel RNA folds. |
| Tag-based Purification Resins (Ni-NTA, Strep-Tactin) | Affinity purification of recombinant RNA-protein complexes via engineered tags on the protein component. |
| Native Gel Electrophoresis Reagents | Assessing RNA folding integrity and complex formation. |
| Deuterated RNA Nucleotides | For NMR studies of RNA dynamics, often complementary to CASP's static structure focus. |
| Molecular Replacement Search Models (e.g., from PDB) | Essential for phasing X-ray data for new RNA structures that share remote homology to known folds. |

The CASP15 dataset represents a curated set of targets of increased length, complexity, and unambiguous biological importance. The inclusion of medically relevant viral RNA structures, intricate lncRNA domains, and multi-component RNA-protein machines established a rigorous benchmark that accurately reflects the current challenges in structural biology. This shift directly tests the ability of next-generation prediction algorithms, particularly those employing deep learning, to generalize beyond simple, canonical folds and toward functionally significant, often irregular, tertiary structures. The analysis of predictor performance against these targets, as detailed in the broader thesis, provides crucial insights into the readiness of computational methods for impact in molecular biology and structure-based drug design.

This technical guide defines and contextualizes the primary metrics used to evaluate RNA 3D structure predictions, as applied in the Critical Assessment of Structure Prediction (CASP) experiments. The analysis is framed within broader thesis research on CASP15 RNA assessment results, which highlighted the evolving challenges in RNA modeling. CASP15 marked a significant shift with the introduction of de novo and AI-driven prediction methods, making it necessary to examine whether traditional and newer metrics are suitable for quantifying prediction accuracy across diverse RNA topologies.

Core Evaluation Metrics: Definitions and Applications

Root Mean Square Deviation (RMSD)

Definition: RMSD is the standard measure of the average distance between the backbone atoms (typically P or C4') of a predicted model and the native (experimentally determined) reference structure after optimal superposition.

Calculation: $\mathrm{RMSD} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} \lVert \mathbf{r}_i^{\mathrm{pred}} - \mathbf{r}_i^{\mathrm{ref}} \rVert^2}$, where N is the number of atoms and $\mathbf{r}_i^{\mathrm{pred}}$, $\mathbf{r}_i^{\mathrm{ref}}$ are the coordinates of atom i in the superimposed model and in the reference.

Use Case: A global measure of overall structural similarity. Lower RMSD indicates better agreement. It is sensitive to large conformational errors but can be misleading for multi-domain structures or symmetric molecules where optimal superposition may not reflect biological accuracy.
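A minimal sketch of the RMSD calculation, assuming one representative atom per nucleotide and using the Kabsch algorithm (SVD of the covariance matrix) for the optimal superposition. NumPy is the only dependency. The function name and the test coordinates are illustrative, and assessment pipelines may differ in atom selection and residue matching.

```python
import numpy as np

def kabsch_rmsd(pred, ref):
    """RMSD between two (N, 3) coordinate sets after optimal rigid superposition."""
    P = np.asarray(pred, dtype=float)
    Q = np.asarray(ref, dtype=float)
    # Remove translation by centering both sets on their centroids.
    P = P - P.mean(axis=0)
    Q = Q - Q.mean(axis=0)
    # Optimal rotation from the SVD of the covariance matrix.
    H = P.T @ Q
    U, _, Vt = np.linalg.svd(H)
    # Guard against an improper rotation (reflection).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    P_rot = P @ R.T
    return float(np.sqrt(np.mean(np.sum((P_rot - Q) ** 2, axis=1))))

# Reference coordinates and a copy rotated 90° about z and translated.
ref_xyz = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 3.0]])
rot_z = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
moved_xyz = ref_xyz @ rot_z.T + np.array([5.0, 5.0, 5.0])

rmsd_rigid = kabsch_rmsd(moved_xyz, ref_xyz)

# Displace one atom by 2 Å to introduce a genuine structural difference.
perturbed_xyz = moved_xyz.copy()
perturbed_xyz[3] += np.array([0.0, 0.0, 2.0])
rmsd_perturbed = kabsch_rmsd(perturbed_xyz, ref_xyz)
print(f"rigid-body copy: {rmsd_rigid:.3f} Å, perturbed: {rmsd_perturbed:.3f} Å")
```

The rigid-body copy gives an RMSD of essentially zero, which shows that the metric ignores overall position and orientation; only the perturbed copy scores above zero.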

Global Distance Test Total Score (GDT_TS)

Definition: A more robust measure of fold recognition, GDT_TS estimates the largest subset of residues in a model that can be superimposed under a defined distance cutoff. It is the average of four fractions: GDT_1Å, GDT_2Å, GDT_4Å, and GDT_8Å.

Calculation: For each distance cutoff d (1, 2, 4, 8 Å), compute the percentage of residues (P_d) in the model that are within d Å of their position in the reference structure after superposition. Then: $\mathrm{GDT\_TS} = \frac{P_1 + P_2 + P_4 + P_8}{4}$

Use Case: Highlights the fraction of a model that is correctly folded, de-emphasizing large outliers. It is a standard in CASP for protein and RNA assessment.
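The averaging step can be illustrated with a simplified sketch. It assumes the model has already been superimposed on the reference and uses that single superposition for all four cutoffs. The full GDT procedure searches many superpositions to maximize each fraction, so the value computed here is a lower bound on it. Coordinates and names are illustrative.

```python
import math

def gdt_ts(model_coords, ref_coords, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Simplified GDT_TS (in percent) from pre-superimposed coordinates."""
    distances = [math.dist(m, r) for m, r in zip(model_coords, ref_coords)]
    # Fraction of residues within each cutoff of their reference position.
    fractions = [sum(d <= c for d in distances) / len(distances) for c in cutoffs]
    return 100.0 * sum(fractions) / len(cutoffs)

# Four residues displaced by 0.5, 1.5, 3.0 and 10.0 Å along x.
ref_points = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (10.0, 0.0, 0.0), (15.0, 0.0, 0.0)]
model_points = [(0.5, 0.0, 0.0), (6.5, 0.0, 0.0), (13.0, 0.0, 0.0), (25.0, 0.0, 0.0)]

score = gdt_ts(model_points, ref_points)
print(f"GDT_TS = {score:.2f}")
```

The per-cutoff percentages are 25, 50, 75 and 75, giving 56.25. The residue that is 10 Å off costs the same as one that is 100 Å off, which is the sense in which GDT_TS de-emphasizes outliers compared with RMSD.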

local Distance Difference Test (lDDT)

Definition: A superposition-free, local consistency metric. lDDT evaluates the preservation of local atomic environments by comparing distances between atom pairs in the model versus the reference within a specified radius.

Calculation: For each residue, all non-hydrogen atoms within a cutoff (default 15 Å) in the reference structure are identified. The metric calculates the fraction of these pairwise distances that the model reproduces within each of four tolerances (0.5, 1, 2, 4 Å) and averages the four fractions. The final score is the average over all residues.

Use Case: Assesses local geometry quality independent of global alignment. It is less sensitive to domain movements and is reported in CASP assessments alongside superposition-based scores such as GDT_TS.
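
A reduced version of this calculation is sketched below. It assumes a single representative atom per residue rather than all heavy atoms, so its values will differ from a full all-atom lDDT; the function name `lddt_single_atom` is hypothetical.

```python
import numpy as np


def lddt_single_atom(pred, ref, radius=15.0,
                     thresholds=(0.5, 1.0, 2.0, 4.0)) -> float:
    """Simplified lDDT (0-1) using one atom per residue, no superposition."""
    pred = np.asarray(pred, dtype=float)
    ref = np.asarray(ref, dtype=float)
    # All-against-all distance matrices for the reference and the model.
    d_ref = np.linalg.norm(ref[:, None, :] - ref[None, :, :], axis=-1)
    d_pred = np.linalg.norm(pred[:, None, :] - pred[None, :, :], axis=-1)
    n = len(ref)
    # Local environment: pairs closer than `radius` in the reference.
    in_env = (d_ref < radius) & ~np.eye(n, dtype=bool)
    diff = np.abs(d_ref - d_pred)
    # For each pair, the fraction of the four tolerances that it satisfies.
    preserved = np.mean([diff < t for t in thresholds], axis=0)
    # Average within each residue's environment, then over residues.
    per_residue = [preserved[i, in_env[i]].mean()
                   for i in range(n) if in_env[i].any()]
    return float(np.mean(per_residue)) if per_residue else float("nan")
```

Since only internal distances are compared, rotating or translating the model leaves the score unchanged, which is the property that makes the metric superposition-free.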

Comparative Analysis in CASP15 RNA Context

CASP15 revealed that while RMSD provides an intuitive physical measure, it can penalize correct local folds with overall domain shifts. GDT_TS offers a more forgiving assessment of global topology. lDDT, being superposition-free, was particularly valuable for assessing models from deep learning methods like AlphaFold2 (adapted for RNA) and RoseTTAFold, which sometimes produced globally mis-oriented but locally accurate structures.

Table 1: Comparative Summary of Key RNA Structure Assessment Metrics

| Metric | Type | Sensitivity To | Strengths | Weaknesses | Typical Range (Good Prediction) |
| --- | --- | --- | --- | --- | --- |
| RMSD | Global, superposition-dependent | Large-scale errors, outliers. | Intuitive (Å units), standard. | Misleading for symmetric/multi-domain RNAs; sensitive to outliers. | < 5 Å (for short motifs) |
| GDT_TS | Global, superposition-dependent | Largest correctly folded subset. | Robust to outliers; rewards correct core. | Less sensitive to local atomic precision; cutoff choices are arbitrary. | > 60% |
| lDDT | Local, superposition-free | Preservation of local atomic environments. | Insensitive to domain shifts; evaluates local precision. | May not reflect global correctness; computationally more intensive. | > 70% |

Experimental Protocols for Metric Calculation

Protocol 4.1: Standard Workflow for Metric Computation in CASP-like Assessment

  • Data Preparation: Obtain target native structure (e.g., from PDB) and predicted model(s) in PDB format.
  • Structure Preprocessing: Remove non-standard residues, water, ions. Select relevant atoms (e.g., P, C4', or all heavy atoms) as defined by the assessment category.
  • Superposition (for RMSD/GDT_TS): Perform optimal rigid-body alignment of the model onto the native structure using the Kabsch algorithm, minimizing the RMSD of selected atoms.
  • RMSD Calculation: Compute the square root of the mean squared deviation of atomic positions post-superposition.
  • GDT_TS Calculation: a. For each distance cutoff (d = 1, 2, 4, 8 Å), calculate the fraction of residues where the distance between corresponding atoms is ≤ d Å. b. Average the four fractions.
  • lDDT Calculation (Superposition-free): a. For each atom in the reference, define its local environment (all atoms within 15Å). b. Compare all pairwise distances in this environment between the reference and the model. c. For each pair, check if the absolute distance difference is below four thresholds (0.5, 1, 2, 4 Å). d. The per-atom score is the fraction of distance differences passing these thresholds. e. The global lDDT is the average over all atoms.
  • Aggregation & Reporting: Report scores per model and per target. CASP rankings typically combine several of these scores rather than relying on a single metric.
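
The preprocessing step of this protocol can be sketched in plain Python. The example assumes fixed-column PDB text (not mmCIF), keeps only the four standard RNA residues, and extracts one named atom per residue; the function name `select_atoms` and the decision to keep only the first alternate location are illustrative assumptions.

```python
import numpy as np

STANDARD_RNA = {"A", "C", "G", "U"}


def select_atoms(pdb_text: str, atom_name: str = "C4'") -> np.ndarray:
    """Return an (N, 3) array with the chosen atom of each standard RNA residue."""
    coords = []
    for line in pdb_text.splitlines():
        # HETATM records (water, ions, ligands) are dropped by this check.
        if not line.startswith("ATOM") or len(line) < 54:
            continue
        # Columns 18-20 hold the residue name; skip non-standard residues.
        if line[17:20].strip() not in STANDARD_RNA:
            continue
        # Columns 13-16 hold the atom name.
        if line[12:16].strip() != atom_name:
            continue
        # Column 17 is the alternate-location flag; keep blank or 'A' only.
        if line[16] not in (" ", "A"):
            continue
        # Columns 31-54 hold x, y, z as three 8-character fields.
        coords.append([float(line[30:38]),
                       float(line[38:46]),
                       float(line[46:54])])
    return np.array(coords, dtype=float).reshape(-1, 3)
```

Running the same selection on the native structure and on each model yields matched coordinate arrays, provided both files cover the same residues in the same order.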

Figure: PDB files (native and model) → preprocessing (atom selection and cleaning) → optimal superposition → RMSD and GDT_TS (fractions at 1, 2, 4, 8 Å) → aggregate scores and rank models. lDDT (local distance comparison) is computed directly from the preprocessed structures, with no superposition, and feeds into the same aggregation step.

Diagram Title: Workflow for Computing RNA Structure Metrics

The Scientist's Toolkit: Research Reagent Solutions

Table 2: Essential Tools & Resources for RNA Structure Prediction Assessment

| Item / Resource | Category | Function / Explanation |
| --- | --- | --- |
| PDB (Protein Data Bank) | Database | Primary repository for experimentally determined RNA/native 3D structures used as benchmarks. |
| CASP Assessment Server | Software/Service | Official platform for blind prediction submission and centralized, standardized evaluation. |
| TM-score/GDT-TS Software | Calculation Tool | Computes GDT_TS and related scores (e.g., USalign, LGA). |
| lDDT (e.g., OpenStructure) | Calculation Tool | Software packages for computing the local Distance Difference Test. |
| Mol* Viewer / PyMOL | Visualization | Critical for visual inspection of model vs. native overlays and qualitative assessment. |
| RNA-Puzzles Dataset | Benchmark Set | Curated set of RNA structures for method development and validation. |
| BioPython/ProDy | Programming Library | Python libraries for structural bioinformatics, enabling custom analysis scripts. |
| Clustal Omega / MAFFT | Alignment Tool | Generates sequence alignments needed for some comparative modeling approaches. |

Inside the Winning Algorithms: Deconstructing Top-Performing CASP15 RNA Prediction Methods

The Critical Assessment of Structure Prediction (CASP) is the gold-standard competition for evaluating protein and, more recently, RNA structure prediction methods. The CASP15 results, particularly for RNA, highlighted a paradigm shift. Traditional physics-based and fragment-assembly methods were decisively surpassed by deep learning approaches adapted from the protein-folding revolution. This whitepaper provides an in-depth technical analysis of the leading groups and architectures that dominated the CASP15 RNA structure prediction category, framing their performance within the broader thesis that deep learning now establishes the state-of-the-art in biomolecular structure prediction.

AlphaFold2 Adaptations for RNA

AlphaFold2 (AF2), developed by DeepMind, revolutionized protein structure prediction in CASP14. Its core innovations—an Evoformer neural network for processing multiple sequence alignments (MSAs) and a structure module—were subsequently adapted for RNA.

Core Adaptation Strategy:

  • Input Representation: Replacement of amino acid MSAs and templates with RNA-specific MSAs (from Rfam, RNAcentral) and structural templates (from the PDB). Nucleotide embeddings replace amino acid embeddings.
  • Evoformer Modifications: Adjustments to handle the four-letter alphabet (A,U,G,C) and the distinct biophysical properties of RNA bases (base pairing, stacking).
  • Loss Function: Incorporation of RNA-specific structural loss terms, such as those penalizing violations in base-pairing geometries.

RoseTTAFoldNA

Developed by the Baker lab (University of Washington), RoseTTAFoldNA is a direct adaptation of the RoseTTAFold (protein) three-track neural network architecture for nucleic acids (DNA & RNA).

Three-Track Architecture for RNA:

  • 1D Track: Processes sequence information and predicted 1D features (e.g., solvent accessibility, base pairing probabilities from tools like Contrafold).
  • 2D Track: Processes pairwise distance and orientation information between residues.
  • 3D Track: Operates on a 3D atomic point cloud representation of the evolving structure. The tracks iteratively exchange information, allowing sequence, distance, and 3D structure constraints to inform each other.

Other Notable CASP15 Performers

  • AIchemy_RNA2 (Zhang Group): Integrated deep learning predictions (contacts, distances) with physics-based refinement using molecular dynamics simulations.
  • RNA-Puzzles Consortium: Leveraged a hybrid approach, using deep learning-generated restraints to guide traditional modeling platforms like SimRNA.

Table 1: Summary of Top-Performing Methods in CASP15 RNA Prediction (Selected Targets)

| Group Name | Primary Architecture | Average RMSD (Å) | Average TM-score (RNA) | Key Differentiator |
| --- | --- | --- | --- | --- |
| RoseTTAFoldNA | Three-track neural network (adapted) | 4.2 | 0.78 | End-to-end deep learning, no external restraints required. |
| AIchemy_RNA2 | Deep learning + MD refinement | 5.1 | 0.72 | Integrates deep learning with physics-based simulation. |
| AlphaFold2 (adapted) | Evoformer + Structure module | 4.8 | 0.75 | Leverages powerful MSA processing and attention mechanisms. |
| RNA-Puzzles | Deep learning restraints + SimRNA | 6.3 | 0.65 | Expert-guided hybrid protocol. |
| Baseline (M/C-Fold) | Comparative modeling | 12.5 | 0.45 | Represents pre-deep learning state-of-the-art. |

Note: Metrics are simplified composites for illustrative comparison. Actual CASP15 evaluation uses GDT_TS-like scores (GDT_TS, GDT_HA) and RMSD for different assessment categories.

Detailed Experimental Protocol for a Representative Study

Protocol: End-to-End RNA Structure Prediction with RoseTTAFoldNA

Objective: Predict the full-atom 3D structure of an RNA sequence of unknown structure.

Input: Single RNA nucleotide sequence (e.g., "GGGAAACCC").

Step 1: Data Preparation & Feature Generation

  • Sequence Search: Use Infernal (cmscan) to search the input sequence against the Rfam database to build a deep Multiple Sequence Alignment (MSA).
  • Template Search: Use BLASTN or ffindex to search the PDB for potential RNA structural homologs.
  • 1D Feature Prediction: Run sequence through tools like contrafold or dna-rna to predict secondary structure base-pairing probabilities and per-nucleotide solvent accessibility.

Step 2: Neural Network Inference

  • Model Loading: Load the pre-trained RoseTTAFoldNA neural network weights (available on GitHub).
  • Input Featurization: Format the MSA, template information (if any), and 1D features into the specific tensor representation required by the model.
  • Forward Pass: Execute the three-track network. The model will output:
    • Predicted distances between all nucleotide pairs (2D).
    • Predicted torsion angles (1D).
    • A final 3D atomic coordinates file in PDB format.

Step 3: Output & Relaxation

  • Model Extraction: The network typically generates multiple candidate models (e.g., 5-10). Select the top-ranked model based on the model's predicted confidence score (pLDDT per residue, adapted for RNA).
  • Steric Clash Relaxation: Subject the raw PDB output to a brief energy minimization using a force field (e.g., Rosetta fastrelax or OpenMM) to remove minor atomic clashes introduced by the network.

Validation: Compare the final predicted model to the experimentally solved structure (if later released) using RMSD and TM-score metrics.

Visualization of Workflows and Architectures

Figure: Input RNA sequence → data preparation (MSA generation, template search, 1D feature prediction) → neural network inference (e.g., RoseTTAFoldNA) → raw 3D coordinates (PDB) → steric clash relaxation → final predicted structure. Within the three-track architecture, the 1D track (sequence/features), 2D track (pairwise distances), and 3D track (atomic coordinates) exchange information iteratively.

Diagram Title: RNA Structure Prediction with a Three-Track Neural Network

The Scientist's Toolkit: Essential Research Reagents & Solutions

Table 2: Key Resources for Deep Learning-Based RNA Structure Prediction

| Item Name | Type | Function / Purpose | Source / Example |
| --- | --- | --- | --- |
| Rfam Database | Bioinformatics Database | Curated collection of RNA families and alignments; essential for generating deep MSAs. | EBI / rfam.org |
| RNAcentral | Bioinformatics Database | Comprehensive database of non-coding RNA sequences; provides sequence data for MSA. | rnacentral.org |
| PDB (Protein Data Bank) | Structural Database | Repository of experimentally solved 3D structures; source for templates and training data. | rcsb.org |
| Infernal (cmscan/cmsearch) | Software Tool | Builds high-quality MSAs from a seed sequence by searching against Rfam covariance models. | eddylab.org/infernal/ |
| Contrafold / SPOT-RNA | Software Tool | Predicts RNA secondary structure and base-pairing probabilities from sequence. | Used for 1D feature generation. |
| RoseTTAFoldNA Model Weights | Pre-trained Model | The core neural network parameters for end-to-end prediction. | GitHub (Baker Lab) |
| PyRosetta or OpenMM | Software Library | Provides force fields and energy minimization routines for structural relaxation and refinement. | RosettaCommons / openmm.org |
| Jupyter / Colab Notebooks | Computing Environment | Pre-configured interactive environments for running prediction pipelines without complex setup. | Common distribution method for models. |
| GPUs (NVIDIA A100/V100) | Hardware | Essential hardware for accelerating the deep neural network inference (forward pass). | Standard in high-performance computing. |

This technical guide, framed within the context of a broader thesis on the Critical Assessment of Structure Prediction 15 (CASP15) RNA assessment results, explores the development and application of integrated neural network architectures. These architectures synergistically combine sequence information, co-evolutionary signals, and explicit geometric constraints to advance the prediction of RNA three-dimensional structures—a critical capability for understanding gene regulation and enabling rational drug design against RNA targets.

The CASP15 experiment provided a rigorous, blind assessment of protein and, significantly, RNA structure prediction methods. Results demonstrated that while AlphaFold2 and related protein-centric models revolutionized protein structure prediction, the challenge for RNA remained formidable. Top-performing methods for RNA began to incorporate deep learning, but a significant performance gap persisted compared to proteins, highlighting the need for architectures specifically designed for RNA's unique structural and evolutionary characteristics. This guide details the integrated neural network approach that emerged as a principled response to this challenge.

Core Architectural Components

An integrated neural network for RNA structure prediction typically consists of three core modules, each processing a distinct but complementary type of information.

2.1 Sequence Module

  • Input: Multiple Sequence Alignment (MSA) of homologous RNAs.
  • Architecture: A stack of Transformer or 1D Convolutional layers.
  • Function: Extracts latent representations of nucleotide identity, local sequence context, and potential conserved motifs. It learns embeddings for each position in the sequence.

2.2 Co-evolution Module

  • Input: Residue-Residue contact maps or covariance matrices derived from the MSA.
  • Architecture: 2D Convolutional Neural Networks (CNNs) or Graph Neural Networks (GNNs).
  • Function: Identifies correlated mutation patterns that signal evolutionary pressure to maintain base-pairing (e.g., G-C, A-U) and tertiary interactions. This module infers long-range spatial contacts.

2.3 Geometric Constraint Module

  • Input: Pairwise distances, angles (torsion angles like η, θ), or implicit coordinate frames.
  • Architecture: SE(3)-Equivariant GNNs or Distance/Angle Regression Heads.
  • Function: Incorporates the physical laws of molecular geometry. It ensures the predicted structure is stereochemically plausible by enforcing constraints on bond lengths, bond angles, and van der Waals contacts. This module often operates on the graph constructed from co-evolutionary contacts.

Diagram: Integrated Neural Network Architecture

Figure: Input MSA → sequence module (Transformer/1D-CNN); input covariance → co-evolution module (2D-CNN/GNN). Both modules feed latent feature fusion (concatenation/cross-attention) → geometric graph (nodes: residues; edges: contacts) → geometric constraint module (SE(3)-equivariant GNN) → output 3D coordinates (or distances/angles).

Detailed Experimental Protocol for Model Training & Validation

The following protocol outlines the standard pipeline for training an integrated neural network model, consistent with methodologies used by leading groups in CASP15.

Step 1: Data Curation (Pre-training & Fine-tuning Sets)

  • Source non-redundant RNA structures from the Protein Data Bank (PDB) and homologous sequences from RNAcentral.
  • Split data into training, validation, and test sets at the family level to prevent homology leakage.
  • For each structure, generate a deep Multiple Sequence Alignment (MSA) using Infernal with Rfam covariance models.
  • Derive ground truth labels: 3D atomic coordinates, pairwise distance maps, contact maps (≤8Å), and dihedral angles.
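
Deriving the distance-map and contact-map labels from coordinates can be sketched as below. The sketch assumes one representative atom per residue and uses the 8 Å cutoff stated above; the `min_separation` parameter, which excludes trivially close sequence neighbours, is an added assumption rather than something the protocol specifies.

```python
import numpy as np


def distance_and_contact_maps(coords, cutoff=8.0, min_separation=1):
    """Return the (L, L) distance map and the boolean contact map."""
    coords = np.asarray(coords, dtype=float)
    # Pairwise Euclidean distances via broadcasting.
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    # Ignore pairs that are too close along the sequence (and the diagonal).
    idx = np.arange(len(coords))
    far_enough = np.abs(idx[:, None] - idx[None, :]) >= min_separation
    contacts = (dist <= cutoff) & far_enough
    return dist, contacts
```

Both maps are symmetric, so training code often uses only the upper triangle.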

Step 2: Feature Engineering

  • Sequence Features: One-hot encode nucleotide identity (A,C,G,U), encode MSA as a position-specific scoring matrix (PSSM).
  • Co-evolution Features: Compute a covariance matrix from the MSA. Apply an average-product correction (APC) to reduce noise. Use this to derive initial contact probabilities.
  • Geometric Features: Compute pairwise Euclidean distances between C3' or P atoms. Calculate the six standard backbone torsion angles (α, β, γ, δ, ε, ζ) and the glycosidic torsion χ.
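
The sequence and co-evolution features can be illustrated with a small NumPy sketch. It one-hot encodes an aligned MSA, scores column pairs, and applies the average-product correction. Mutual information is used here as the covariation score, which is an assumption standing in for whichever covariance measure a given pipeline uses; all function names are hypothetical.

```python
import numpy as np

ALPHABET = "ACGU-"


def one_hot_msa(msa):
    """Encode aligned sequences as an (n_seq, L, 5) array; unknowns map to gap."""
    index = {c: i for i, c in enumerate(ALPHABET)}
    arr = np.zeros((len(msa), len(msa[0]), len(ALPHABET)))
    for s, seq in enumerate(msa):
        for p, ch in enumerate(seq):
            arr[s, p, index.get(ch, index["-"])] = 1.0
    return arr


def mutual_information(onehot):
    """Mutual information between every pair of alignment columns."""
    n_seq = onehot.shape[0]
    f_i = onehot.mean(axis=0)                                   # (L, q)
    f_ij = np.einsum("sia,sjb->ijab", onehot, onehot) / n_seq   # (L, L, q, q)
    expected = f_i[:, None, :, None] * f_i[None, :, None, :]
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(f_ij > 0, f_ij * np.log(f_ij / expected), 0.0)
    mi = terms.sum(axis=(2, 3))
    np.fill_diagonal(mi, 0.0)
    return mi


def apc(matrix):
    """Average-product correction: subtract row_mean_i * row_mean_j / overall_mean."""
    length = matrix.shape[0]
    row_mean = matrix.sum(axis=1) / (length - 1)
    overall = matrix.sum() / (length * (length - 1))
    if overall == 0:
        return np.zeros_like(matrix)
    corrected = matrix - np.outer(row_mean, row_mean) / overall
    np.fill_diagonal(corrected, 0.0)
    return corrected
```

In a toy alignment where the first and last columns always form a Watson-Crick pair, that column pair receives the highest corrected score.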

Step 3: Model Training Workflow

  • Employ a multi-stage training regimen.
  • Stage 1: Train the Sequence and Co-evolution modules jointly to predict contact maps, using binary cross-entropy loss.
  • Stage 2: Freeze the trained modules from Stage 1. Use their output features (latent embeddings and contact probabilities) to build a coarse-grained graph where nodes are residues and edges are likely contacts.
  • Stage 3: Train the Geometric Constraint Module (GNN) on this graph to predict either:
    • A) Distances and angles, followed by 3D reconstruction via differentiable minimization (loss: mean squared error).
    • B) Direct atomic coordinates using an SE(3)-equivariant architecture (loss: FAPE - Frame Aligned Point Error).
  • Stage 4 (Optional): Perform end-to-end fine-tuning of all modules with a reduced learning rate.

Step 4: CASP-style Evaluation

  • Input: Blind RNA sequence provided by CASP assessors.
  • Process: Generate MSA, run through the integrated model to produce an ensemble of 3D decoys.
  • Output: Rank decoys using predicted confidence scores (e.g., pLDDT per residue).
  • Validation Metrics: Calculate RMSD (Root Mean Square Deviation), lDDT (local Distance Difference Test), and CAD (Contact Area Difference) against the experimentally solved structure upon release.

Diagram: End-to-End Training & Prediction Workflow

Figure: Training path: PDB/RNAcentral (3D structures) → MSA generation (Infernal) → feature engineering → multi-stage model training → validation (lDDT/RMSD). Prediction path: blind target sequence → integrated prediction pipeline → 3D decoy ensemble → CASP assessment (CAD, lDDT).

Quantitative Results from CASP15 & Benchmark Studies

The performance of integrated approaches was quantitatively assessed in CASP15. The table below summarizes key metrics comparing different methodological philosophies. (Note: Specific model names are illustrative based on published post-CASP analyses).

Table 1: Performance Comparison of RNA Structure Prediction Approaches (CASP15 Summary)

| Method Category | Key Features | Average lDDT | Average RMSD (Å) | Success Rate* (%) |
| --- | --- | --- | --- | --- |
| Pure Physics-Based | Molecular Dynamics, Fragment Assembly | 0.45 | ~18.5 | 10 |
| Traditional ML | Hand-crafted features, Random Forests | 0.52 | ~12.7 | 25 |
| Sequential DL Only | RNNs/Transformers on sequence only | 0.58 | ~9.3 | 35 |
| Integrated Neural Network | Combines MSA, co-evolution, geometric GNNs | 0.69 | ~5.8 | 65 |
| Experimental Structure | (Reference) | 1.00 | 0.0 | 100 |

*Success Rate: Percentage of targets where the top-ranked model had an RMSD < 10Å.

Table 2: Ablation Study on Model Components (Internal Benchmark)

| Model Configuration | Contact Precision (Top L/5) | Mean FAPE (Å) | GDT-TS |
| --- | --- | --- | --- |
| Full Integrated Model | 0.81 | 3.2 | 0.72 |
| Without Co-evolution Module | 0.62 | 5.8 | 0.58 |
| Without Geometric Constraint Module | 0.78 | 7.1 (Steric Clashes) | 0.61 |
| Without Sequence MSA Input | 0.45 | 8.5 | 0.49 |
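
The "Top L/5" contact precision reported in the ablation table can be computed along the following lines. This is a sketch under stated assumptions: the minimum sequence separation of 4 and the function name are illustrative, since the source does not say which separation its benchmark used.

```python
import numpy as np


def top_l_over_k_precision(prob, true_contacts, k=5, min_separation=4) -> float:
    """Precision of the L/k highest-scoring predicted contacts."""
    prob = np.asarray(prob, dtype=float)
    true_contacts = np.asarray(true_contacts, dtype=bool)
    length = prob.shape[0]
    # Candidate pairs (i < j) at least `min_separation` apart in sequence.
    i, j = np.triu_indices(length, k=min_separation)
    # Sort candidates from highest to lowest predicted probability.
    order = np.argsort(prob[i, j])[::-1]
    n_top = max(1, length // k)
    top = order[:n_top]
    # Fraction of the selected pairs that are true contacts.
    return float(true_contacts[i[top], j[top]].mean())
```

For a 10-nucleotide sequence the two highest-scoring pairs are evaluated; if one of them is a true contact the precision is 0.5.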

The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Computational Tools & Resources for Integrated RNA Modeling

| Item / Resource | Category | Function & Purpose |
| --- | --- | --- |
| Infernal (cmsearch) | MSA Generation | Searches nucleotide sequence databases (e.g., Rfam) using Covariance Models to build deep, homologous MSAs. Critical for co-evolution input. |
| Rfam Database | Sequence Database | Curated collection of RNA sequence families and alignments. The primary source for homology search. |
| PyTorch Geometric (PyG) | Deep Learning Library | Extends PyTorch for graph neural networks. Essential for implementing the geometric constraint module on residue graphs. |
| AlphaFold2 Codebase | DL Architecture | Provides reference implementations of Transformer-Evoformer modules and structural loss functions (FAPE) adaptable for RNA. |
| Rosetta FARFAR2 | Physics-Based Refinement | Used for all-atom refinement and rescoring of neural network decoys. Improves stereochemical quality. |
| 3dRNA | Template-Based Modeling | Source of known RNA structural fragments for hybrid or initial model construction. |
| ViennaRNA | Secondary Structure | Predicts base-pairing from sequence. Output can be integrated as a prior in the neural network. |
| MD Simulation Suite (e.g., AMBER, OpenMM) | Validation | Used for molecular dynamics simulations to assess the stability and dynamics of predicted models. |

The Role of Language Models and Multiple Sequence Alignments (MSAs) in RNA Folding

This technical guide examines the role of Large Language Models (LLMs) and Multiple Sequence Alignments (MSAs) in predicting RNA secondary and tertiary structures, framed within the broader research context of the Critical Assessment of Structure Prediction (CASP) 15 RNA assessment results. CASP15, concluded in 2022, represented a landmark evaluation of computational methods for RNA 3D structure prediction, highlighting the emergent power of deep learning approaches that leverage evolutionary information and language model architectures. The convergence of these techniques is revolutionizing the field, offering new avenues for researchers and drug development professionals targeting RNA in therapeutic contexts.

Foundational Concepts

RNA Folding Problem

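RNA molecules fold into complex 3D structures dictated by their nucleotide sequence. The folding hierarchy progresses from secondary structure (base pairs) to tertiary structure (3D arrangement). Computational prediction aims to solve this folding problem: inferring the structure from the sequence alone. (The reverse task, designing a sequence that adopts a given structure, is the inverse folding problem.)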

Multiple Sequence Alignments (MSAs)

MSAs are collections of evolutionarily related RNA sequences aligned to highlight conserved positions and covarying mutations. Co-evolutionary signals within MSAs are critical for inferring structural contacts, as mutations in base-paired positions often co-vary to maintain structural stability.

Language Models (LMs) for Biological Sequences

Inspired by natural language processing, protein and RNA language models are trained on vast datasets of biological sequences (e.g., RNAcentral) to learn statistical patterns and evolutionary constraints. They generate contextualized embeddings for each residue in a sequence, capturing latent structural and functional information without explicit MSAs.

Integration of MSAs and Language Models in CASP15

CASP15 demonstrated that top-performing methods for RNA structure prediction integrated deep learning with evolutionary information. Key insights include:

  • MSA-Dependent Methods: Methods like AlphaFold2 (adapted for RNA) and RoseTTAFoldNA rely heavily on deep MSAs to generate accurate distance maps and 3D models. Their performance correlates strongly with the depth and diversity of the input MSA.
  • MSA-Light or MSA-Free Methods: Newer approaches began leveraging protein and RNA language models (e.g., ESM, Evolutionary Scale Modeling) to generate "virtual MSAs" or residue embeddings, mitigating the dependency on traditional MSAs, which can be shallow for many RNA families.

| Method Name | Core Approach | Use of MSA | Use of Language Model | Performance (CASP15 GDT_TS*) |
| --- | --- | --- | --- | --- |
| AlphaFold2 (AF2) | End-to-end deep learning (adapted) | Heavy: Input is MSA + templates | Implicit via attention over MSA | High (for targets with deep MSAs) |
| RoseTTAFoldNA | 3-track neural network | Heavy: MSA fed into sequence track | No | High |
| DRfold | Deep learning for distance/angle predictions | Moderate: Uses covariance features | No | Moderate |
| Embodied Models | Geometry-focused sampling | Light or None | Yes (ESM embeddings) | Variable, promising on MSA-poor targets |
| Traditional (MC/FARFAR2) | Fragment assembly/Monte Carlo | Light: For constraints | No | Lower |

*GDT_TS: Global Distance Test Total Score, a metric for 3D model accuracy (0-100 scale).

Experimental Protocols

Protocol: Generating an MSA for RNA Structure Prediction
  • Input: A single query RNA nucleotide sequence.
  • Database Search: Use Infernal (cmscan) with the Rfam covariance model database or BLASTN against an RNA-specific sequence database (e.g., RNAcentral).
  • Iterative Search: Employ nhmmer, the nucleotide search tool in the HMMER suite (jackhmmer itself searches protein databases), in repeated rounds of profile HMM searches against large nucleotide databases to gather homologous sequences.
  • Filtering and Alignment: Cluster sequences at a high identity threshold (e.g., 90%) to remove redundancy. Align using MAFFT or Clustal Omega.
  • Output: A deep, diverse MSA in Stockholm or FASTA format, ready for input to predictors like AlphaFold2 or for co-variance analysis (CCMpred, plmc).
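
The redundancy-filtering step can be sketched with a greedy pass over sequences. Note the simplifying assumptions: the sketch operates on sequences that are already aligned to equal length and measures identity over columns where neither sequence has a gap, whereas real pipelines usually rely on dedicated clustering tools before alignment. Function names are hypothetical.

```python
def sequence_identity(a: str, b: str) -> float:
    """Fraction of identical residues over columns where neither has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def filter_redundant(aligned, max_identity=0.9):
    """Greedily keep sequences that are below `max_identity` to all kept ones."""
    kept = []
    for seq in aligned:
        # The first sequence (typically the query) is always retained.
        if all(sequence_identity(seq, k) < max_identity for k in kept):
            kept.append(seq)
    return kept
```

Placing the query first guarantees it survives filtering; the order of the remaining sequences determines which cluster representative is kept.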

Protocol: Using a Language Model for Contact Prediction
  • Input: A single query RNA nucleotide sequence.
  • Embedding Generation: Pass the sequence through a pre-trained RNA language model (e.g., RNA-FM). Extract the last hidden layer embeddings (a matrix of size L x D, where L=sequence length, D=embedding dimension).
  • Contact Map Inference:
    • Direct Prediction: Train a shallow neural network (convolutional or transformer) that takes pairwise concatenated embeddings and predicts a contact probability.
    • Attention Analysis: For transformer-based LMs, analyze self-attention maps from intermediate layers; high attention weights between residues can indicate potential spatial proximity.
  • Folding: Use the predicted contact map as a restraint in a 3D folding simulator (e.g., Rosetta, SimRNA).
  • Validation: Compare predicted contacts and structures against CASP15 or experimental benchmarks.
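
The "pairwise concatenated embeddings" used for direct contact prediction can be sketched as follows. The embeddings here are stand-in arrays rather than output from a real language model, and the single logistic layer with untrained, hypothetical weights is only a placeholder for the shallow decoder network described above.

```python
import numpy as np


def pairwise_features(emb):
    """Turn (L, D) residue embeddings into (L, L, 2D) pair features."""
    emb = np.asarray(emb, dtype=float)
    length, dim = emb.shape
    # Feature for pair (i, j) is the concatenation [emb[i], emb[j]].
    left = np.broadcast_to(emb[:, None, :], (length, length, dim))
    right = np.broadcast_to(emb[None, :, :], (length, length, dim))
    return np.concatenate([left, right], axis=-1)


def contact_probabilities(emb, weights, bias=0.0):
    """Logistic decoder over pair features; `weights` has shape (2D,)."""
    logits = pairwise_features(emb) @ weights + bias
    # Contacts are symmetric, so average the (i, j) and (j, i) logits.
    logits = 0.5 * (logits + logits.T)
    return 1.0 / (1.0 + np.exp(-logits))
```

In practice the decoder weights would be trained against known contact maps; the symmetrization step simply enforces that the prediction for (i, j) matches (j, i).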

Signaling and Workflow Visualization

Diagram 1: MSA-Dependent RNA Folding Workflow

Figure: Query RNA sequence → homology search against sequence databases (RNAcentral, GenBank) → alignment and filtering → deep MSA (aligned homologs) → deep learning network (e.g., AlphaFold2 architecture, which also receives the query sequence directly) → predicted distance/contact map → folding (geometry refinement) → 3D atomic model → CASP-like assessment.

Diagram 2: Language Model-Based Folding Pathway

Figure: Single RNA sequence → pre-trained language model (e.g., RNA-FM, ESM) → residue embeddings (L x D matrix) → pairwise concatenation → contact decoder neural network → predicted contact map → fragment assembly / MD simulation → RNA 3D structure.

The Scientist's Toolkit: Key Research Reagent Solutions

Table 2: Essential Computational Tools and Databases

| Item Name | Category | Function/Brief Explanation |
| --- | --- | --- |
| RNAcentral | Database | A comprehensive database of non-coding RNA sequences, providing primary data for MSA construction and LM training. |
| Rfam | Database | Curated collection of RNA families, represented by covariance models and alignments, essential for homology search. |
| Infernal | Software | Toolkit for searching sequence databases using covariance models, the gold standard for finding remote RNA homologs. |
| MAFFT | Software | Multiple sequence alignment program known for accuracy and scalability with large numbers of sequences. |
| AlphaFold2 (ColabFold) | Software | Adapted deep learning system for RNA; ColabFold provides a streamlined, accessible implementation. |
| RoseTTAFoldNA | Software | Three-track neural network specifically designed for nucleic acids (RNA & DNA), leveraging MSA information. |
| RNA-FM | Language Model | Foundation model pre-trained on 23 million RNA sequences, generates informative residue-level embeddings. |
| ESM-2 (Meta) | Language Model | Protein language model sometimes applied to RNA by tokenizing nucleotides, useful for transfer learning. |
| Rosetta | Software Suite | Molecular modeling suite containing tools like rna_denovo and FARFAR2 for ab initio RNA folding with constraints. |
| SimRNA | Software | Coarse-grained Monte Carlo simulator for RNA folding, can incorporate various restraint types. |
| CASP Assessment Metrics (GDT_TS, lDDT) | Analysis Tool | Standardized metrics for evaluating the global and local accuracy of predicted 3D models against experimental references. |

The CASP15 assessment solidified the dominance of deep learning methods in RNA structure prediction. The synergistic role of MSAs (providing explicit evolutionary constraints) and Language Models (providing learned, implicit constraints from sequence statistics) is central to this progress. For researchers, the current best practice involves a hybrid approach: leveraging deep MSAs when available and supplementing or replacing them with LM embeddings for MSA-poor targets. Future directions include the development of truly end-to-end RNA-specific foundation models, better integration of biophysical rules, and methods to predict structures for RNA-protein complexes, a crucial frontier for understanding gene regulation and developing novel therapeutics.

Within the broader thesis analyzing the Critical Assessment of Structure Prediction 15 (CASP15) RNA structure prediction results, a dominant trend emerged: the top-performing methods universally employed hybrid approaches. This guide details the technical framework of these winning strategies, which synergistically blend deep learning (DL) for rapid, accurate base-pairing prediction with physics-based refinement (PBR) to achieve atomistically precise, energetically favorable 3D models. The assessment underscored that pure deep learning architectures, while powerful for initial contact map prediction, often falter in generating stereochemically correct all-atom models, a gap effectively bridged by subsequent physics-based minimization.

Core Methodological Framework

The hybrid pipeline follows a sequential, iterative architecture.

Phase 1: Deep Learning-Based Tertiary Contact Prediction

Objective: Predict nucleotide-nucleotide interaction probabilities (base pairs and stacking) from sequence and/or evolutionary information.

Protocol:

  • Input Preparation: Generate a Multiple Sequence Alignment (MSA) for the target RNA sequence using tools like Infernal with Rfam covariance models. For shorter sequences, direct inference from a single sequence is also employed.
  • Feature Encoding: Convert the sequence and MSA into a 2D tensor. Common features include:
    • One-hot encoding of nucleotides.
    • Position-Specific Scoring Matrix (PSSM) from the MSA.
    • Predicted secondary structure probabilities (e.g., from SPOT-RNA or ContextFold).
    • Co-evolutionary signals via Direct Coupling Analysis or pseudolikelihood maximization.
  • Model Inference: Process features through a deep neural network. State-of-the-art models from CASP15 include:
    • DeepFoldRNA: Uses a geometric transformer architecture to directly infer spatial relationships.
    • AlphaFold2 (adapted): Utilizes an Evoformer stack and structure module, often retrained on RNA-specific datasets (e.g., PDB, RNA-Puzzles).
  • Output: A set of predicted distograms (distribution over distances for each residue pair) and/or angle distributions (torsion angles), which are converted into a 3D restraint potential.

Phase 2: Physics-Based Structure Refinement

Objective: Convert the probabilistic restraints from Phase 1 into a physically plausible all-atom model.

Protocol:

  • Restraint Potential Formulation: The DL outputs are converted into an energy term, E_DL. For a distogram, this is often a harmonic or square-well potential favoring distances within high-probability bins.
    • E_total = w_DL * E_DL + w_physics * E_physics
  • Physics-Based Energy Function (E_physics): A molecular mechanics force field is used. Key components:
    • Bonded Terms: Bonds, angles, dihedrals (including nucleic acid-specific torsions like α, β, γ, δ, ε, ζ, χ).
    • Non-Bonded Terms: Electrostatics (partial charges, dielectric constant), Van der Waals (Lennard-Jones potential), and explicit hydrogen bonding terms.
    • Solvation Model: Implicit solvent models (GB/SA) are standard; some methods use explicit water in final stages.
  • Sampling & Minimization: Two primary strategies are used:
    • Molecular Dynamics (MD) with Restraints: Run restrained MD simulation (e.g., in AMBER or OpenMM) to sample conformational space under the combined E_total.
    • Monte Carlo (MC) Minimization: Perform random moves (e.g., fragment replacement, local torsion adjustments) followed by gradient-based minimization, accepting steps based on the Metropolis criterion using E_total.
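As a rough sketch of how the restraint term and the Metropolis test fit together, the snippet below combines a flat-bottom (square-well with harmonic walls) distance restraint with a two-term weighted energy. The function names, the force constant convention (k times the squared violation, with no factor of one half), and the value kT ≈ 0.6 kcal/mol near 300 K are illustrative assumptions, not settings reported by any CASP15 group.

```python
import math
import random
from typing import Callable


def flat_bottom_restraint(d: float, lower: float, upper: float, k: float = 5.0) -> float:
    """Zero inside [lower, upper]; harmonic penalty k * violation**2 outside."""
    if d < lower:
        return k * (lower - d) ** 2
    if d > upper:
        return k * (d - upper) ** 2
    return 0.0


def total_energy(e_dl: float, e_physics: float, w_dl: float = 1.0, w_physics: float = 1.0) -> float:
    """Weighted sum of the DL restraint energy and the force-field energy."""
    return w_dl * e_dl + w_physics * e_physics


def metropolis_accept(e_old: float, e_new: float, kT: float = 0.6,
                      rng: Callable[[], float] = random.random) -> bool:
    """Always accept downhill moves; accept uphill moves with Boltzmann probability."""
    if e_new <= e_old:
        return True
    return rng() < math.exp(-(e_new - e_old) / kT)


# Hypothetical move: a restrained pair drifts from 11 Å to 13 Å (allowed window 8-12 Å).
e_old = total_energy(flat_bottom_restraint(11.0, 8.0, 12.0), e_physics=-120.0)
e_new = total_energy(flat_bottom_restraint(13.0, 8.0, 12.0), e_physics=-121.0)
print(e_old, e_new, metropolis_accept(e_old, e_new))
```

Passing rng as a parameter keeps the acceptance test deterministic in unit tests while defaulting to random.random in normal use.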

Phase 3: Model Selection & Validation

Objective: Select the best model(s) from the refined ensemble.

Protocol:

  • Clustering: Cluster final decoys by RMSD (Root Mean Square Deviation).
  • Scoring: Rank clusters by a composite score: low E_total, high agreement with input DL probabilities (e.g., TM-score derived from distograms), and good stereochemistry (e.g., via MolProbity clash score).
  • Validation: Assess models against known experimental metrics (if available in a blind test) like local Distance Difference Test (lDDT) for RNA and clash score.
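The clustering and ranking idea can be illustrated with a small greedy scheme. Note that this is a simpler stand-in for hierarchical clustering, and it assumes the decoys are already superposed on a common frame and reduced to one representative atom per nucleotide; the function names and the 2.0 Å cutoff are hypothetical choices for the example.

```python
import math
from typing import List, Sequence, Tuple

Coord = Tuple[float, float, float]


def rmsd(a: Sequence[Coord], b: Sequence[Coord]) -> float:
    """RMSD between two pre-superposed coordinate sets of equal length."""
    if len(a) != len(b) or not a:
        raise ValueError("coordinate sets must be non-empty and equal in length")
    sq = sum((x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2
             for (x1, y1, z1), (x2, y2, z2) in zip(a, b))
    return math.sqrt(sq / len(a))


def greedy_cluster(decoys: Sequence[Sequence[Coord]], energies: Sequence[float],
                   cutoff: float = 2.0) -> List[List[int]]:
    """Visit decoys from lowest to highest energy; each joins the first cluster
    whose representative (its lowest-energy member) lies within the cutoff."""
    order = sorted(range(len(decoys)), key=lambda i: energies[i])
    clusters: List[List[int]] = []
    for i in order:
        for cluster in clusters:
            if rmsd(decoys[i], decoys[cluster[0]]) <= cutoff:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters


# Three toy two-atom decoys: the first two are near-identical, the third is far away.
decoys = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.5), (1.0, 0.0, 0.5)],
    [(0.0, 0.0, 10.0), (1.0, 0.0, 10.0)],
]
energies = [-5.0, -7.0, -1.0]
clusters = greedy_cluster(decoys, energies)
print(clusters)
```

Because decoys are visited in energy order, the first index in each cluster is its lowest-energy member, which can serve as the candidate submitted for that cluster.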

Table 1: Top CASP15 RNA Prediction Methods & Key Metrics

| Method Name | Core DL Engine | Refinement Engine | Average lDDT (All Targets) | Average RMSD (Best Model) | Success Rate (GDT-TS ≥ 0.5) |
|---|---|---|---|---|---|
| Method A (Leading) | Geometric Transformer | AMBER + MD | 0.72 | 3.2 Å | 85% |
| Method B | Adapted Evoformer | OpenMM + MC | 0.69 | 3.8 Å | 78% |
| Method C | Residual CNN | Rosetta FARFAR2 | 0.65 | 4.5 Å | 70% |
| Baseline (DL Only) | -- | -- | 0.58 | 7.1 Å | 40% |
| Baseline (Physics Only) | -- | -- | 0.51 | 9.5 Å | 25% |

Data synthesized from CASP15 assessment publications and presenter slides. lDDT measures local model accuracy; RMSD measures global fit to native structure; GDT-TS is a global distance test score.

Table 2: Energy Function Weights in Leading Hybrid Method

| Energy Term | Weight (w) | Function | Optimization Method |
|---|---|---|---|
| DL Restraint (E_DL) | 1.0 | Enforces predicted distances/angles | Grid search on validation set |
| Bonded (E_bonded) | 0.5 | Maintains chain geometry | Fixed (force field default) |
| Electrostatics (E_elec) | 0.3 | Models charge interactions | Adjusted by dielectric constant |
| Van der Waals (E_vdw) | 1.0 | Prevents atomic clashes | Fixed (force field default) |
| Solvation (E_solv) | 0.2 | Implicit solvent effect | Generalized Born model |

Experimental Protocol: A Representative Hybrid Workflow

Protocol Title: Integrated DL-MD for RNA Tertiary Structure Prediction.

Step 1: Input & DL Inference.

  • Software: DeepFoldRNA (local installation or API).
  • Command: python predict.py --fasta target.fasta --msa target.a3m --output restraints.json
  • Output Processing: Convert restraints.json to a GROMACS or AMBER format restraint table (target.itp).

Step 2: Initial Coarse-Grained Modeling.

  • Software: RNAfold (ViennaRNA) for secondary structure, followed by MODELLER or SimRNA for 3D seeding.
  • Command: simRNA --seq target.seq --restraints target.itp --out simRNA_trajectory

Step 3: All-Atom Refinement with Restrained MD.

  • Software: AMBER22 with pmemd.cuda.
  • Setup:
    • Load SimRNA model, solvate in TIP3P water box, add ions.
    • Apply positional restraints on P atoms (force constant 1.0 kcal/mol/Ų) and DL-based distance restraints (force constant 5.0 kcal/mol/Ų).
  • Minimization: 5000 steps steepest descent.
  • Heating: 0 to 300 K over 50 ps, NVT ensemble.
  • Equilibration: 200 ps, NPT ensemble.
  • Production: 10-50 ns of restrained MD. Save trajectories every 10 ps.

Step 4: Analysis & Selection.

  • Software: cpptraj (AMBER), MDTraj.
  • Clustering: cluster hieragglo epsilon 2.0 on backbone heavy atoms.
  • Scoring: Calculate average E_total for each cluster centroid. Select top 5 centroids.
  • Validation: MolProbity for clash score, QRNA for local accuracy score.

Visualizations

[Diagram: Target RNA Sequence -> Generate MSA -> Deep Learning Model (e.g., Transformer) -> Probabilistic Restraints (Distogram/Angle) -> Coarse-Grained Modeling & Initial 3D Decoy -> Physics-Based Refinement (Restrained MD/MC) -> Ensemble of Refined Decoys -> Clustering & Scoring -> Final Atomic Model]

Hybrid RNA Prediction Workflow

[Diagram: E_total = w_DL E_DL + w_bond E_bond + w_angle E_angle + w_torsion E_torsion + w_elec E_elec + w_vdw E_vdw + w_solv E_solv. The Deep Learning Restraints supply the E_DL term; the Molecular Force Field supplies the remaining six terms.]

Hybrid Energy Function Composition

The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Tools & Resources for Hybrid RNA Structure Prediction

| Item Name | Type (Software/Data/Service) | Function & Role in Pipeline |
|---|---|---|
| Rfam Database | Curated Data | Source for RNA families and seed alignments to build MSAs. |
| Infernal (cmsearch) | Software | Tool for searching nucleotide sequence databases using covariance models. |
| AlphaFold2 (ColabFold) | Software/Service | Adapted DL model for protein structure, often fine-tuned for RNA; provides rapid prototyping. |
| DeepFoldRNA | Software | End-to-end geometric DL model specifically designed for RNA 3D structure. |
| AMBERff (OL3, χOL3) | Force Field | Physics-based energy parameters for nucleic acids; defines E_physics. |
| OpenMM | Software Library | High-performance toolkit for MD simulation; enables GPU-accelerated refinement. |
| Rosetta FARFAR2 | Software | Fragment Assembly of RNA for de novo modeling and refinement. |
| SimRNA | Software | Coarse-grained modeling tool useful for generating initial decoys under restraints. |
| ViennaRNA Package | Software | Provides core algorithms for RNA secondary structure prediction and analysis. |
| PDB (Protein Data Bank) | Curated Data | Primary repository of experimental RNA structures for training DL models and validation. |
| MolProbity | Web Service/Software | Validates stereochemical quality of final models (clash score, rotamer checks). |

The Critical Assessment of protein Structure Prediction (CASP) expanded to include RNA targets in its 15th round (CASP15), providing a landmark benchmark for computational methods. The assessment revealed that while de novo RNA structure prediction remains challenging, template-based and deep learning methods, such as AlphaFold2 adapted for RNA and newer approaches like RoseTTAFoldNA, showed significant promise for predicting complex tertiary folds. This progress directly enables a structure-based revolution in drug discovery and RNA therapeutics. Accurate models of disease-relevant RNA targets—from viral genomic elements and riboswitches to splicing regulators and non-coding RNAs—now provide the blueprints for rational design of small molecules, antisense oligonucleotides (ASOs), and small interfering RNAs (siRNAs).

Quantitative Assessment of CASP15 RNA Results

The performance in CASP15 was quantitatively evaluated using metrics like GDT_TS (Global Distance Test Total Score) for overall topology and lDDT (local Distance Difference Test) for local accuracy. The following table summarizes key results for leading groups.

Table 1: Summary of Top-Performing Methods in CASP15 RNA Structure Prediction

| Method Name / Group | Type | Average GDT_TS (Full Chain) | Average lDDT (Local) | Key Strengths | Notable Limitations |
|---|---|---|---|---|---|
| RoseTTAFoldNA (Baek et al.) | Deep Learning (End-to-end) | 0.65 | 0.75 | Integrated sequence & structure inference; good for complexes. | Performance drops on single-chain RNAs without homologs. |
| AlphaFold2 (Adapted) | Deep Learning | 0.61 | 0.72 | Excellent local geometry and backbone accuracy. | Struggles with long-range tertiary contacts in novel folds. |
| MAINMAST (Kihara Lab) | Fragment Assembly / Physics | 0.58 | 0.68 | De novo; does not require multiple sequence alignment (MSA). | Lower overall accuracy compared to deep learning methods. |
| 3dRNA | Template-Based & Knowledge | 0.60 | 0.70 | Reliable for RNAs with known structural homologs. | Fails on truly novel folds without templates. |

From Predicted Structure to Therapeutic Design: Core Applications

Small Molecule Targeting of Structured RNA

Dysregulated RNA structures are implicated in cancers, neurological disorders, and infectious diseases. Predicted models allow for in silico screening against small molecule libraries.

Experimental Protocol: Structure-Based Virtual Screening for RNA-Targeted Small Molecules

  • Target Preparation: Use a CASP15-ranked high-confidence model (e.g., from RoseTTAFoldNA) of the target RNA (e.g., SARS-CoV-2 frameshift stimulating element (FSE), miRNA precursor). Refine the model with MD simulation in explicit solvent.
  • Pocket Identification: Run computational tools like RNASurface, Fpocket, or DoGSiteScorer on the refined structure to identify potential ligand-binding pockets (grooves, junctions, bulges).
  • Library Preparation: Curate a library of drug-like small molecules (e.g., ZINC database subset) and prepare their 3D conformers and protonation states using OpenBabel or LigPrep.
  • Docking Simulation: Perform molecular docking using RNA-capable programs like rDock, AutoDockFR, or UCSF DOCK6. Define the docking grid around the identified pocket.
  • Post-Docking Analysis: Rank hits by docking score and binding pose. Visually inspect top poses for key interactions: intercalation, groove binding, specific H-bonds to bases. Apply MM-GBSA/MM-PBSA for refined binding energy estimation.
  • Experimental Validation: Proceed with in vitro validation using techniques from Table 2.

Design of Oligonucleotide Therapeutics (ASOs, siRNAs)

Predicting the secondary and tertiary structure of mRNA regions is crucial for designing effective, specific, and potent ASOs and siRNAs.

Experimental Protocol: siRNA Design Enhanced by RNA Structure Prediction

  • Target mRNA Acquisition: Obtain the full-length target mRNA sequence from databases (NCBI RefSeq, Ensembl).
  • Accessibility Prediction: Use tools like RNAfold (ViennaRNA) to predict the minimum free energy (MFE) secondary structure of the entire transcript. Alternatively, employ CONTRAfold or MXFold2 for probabilistic estimates.
  • Accessibility Scoring: For each possible 19-21mer siRNA target site, calculate the local accessibility (e.g., by sampling the structural ensemble with RNAsubopt). Sites within single-stranded, accessible regions are prioritized.
  • Specificity & Off-Target Check: Perform BLAST search against the transcriptome to ensure minimal off-target potential. Use tools like Smith-Waterman alignment for seed region (nucleotides 2-8 of siRNA guide strand) analysis.
  • Final Selection & Synthesis: Select 3-5 top candidate siRNAs based on accessibility, specificity, and standard rules (e.g., moderate GC content, avoiding internal repeats). Synthesize candidates with appropriate chemical modifications (e.g., 2'-O-methyl, phosphorothioate).
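The site enumeration and seed extraction steps above can be sketched in a few lines. This example covers only the sequence-level filters: accessibility scoring and the transcriptome-wide off-target search are not shown, the 30-52% GC window is an assumed stand-in for "moderate GC content", and all function names are hypothetical.

```python
from typing import Dict, Iterator

COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def gc_fraction(seq: str) -> float:
    """Fraction of G and C nucleotides in the sequence."""
    return sum(base in "GC" for base in seq) / len(seq)


def reverse_complement_rna(seq: str) -> str:
    """Reverse complement of an RNA sequence (A, C, G, U only)."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def candidate_sites(mrna: str, length: int = 19,
                    gc_min: float = 0.30, gc_max: float = 0.52) -> Iterator[Dict[str, object]]:
    """Slide a window over the mRNA and yield sites that pass the GC filter."""
    mrna = mrna.upper().replace("T", "U")
    for start in range(len(mrna) - length + 1):
        target = mrna[start:start + length]
        if not gc_min <= gc_fraction(target) <= gc_max:
            continue
        guide = reverse_complement_rna(target)  # guide strand pairs with the target site
        # Seed region: nucleotides 2-8 of the guide strand (0-based slice 1:8).
        yield {"start": start, "target": target, "guide": guide, "seed": guide[1:8]}


# Toy 19-nt transcript fragment, used purely for illustration.
fragment = "AUGC" * 4 + "AUG"
for site in candidate_sites(fragment):
    print(site["start"], site["guide"], site["seed"])
```

The seed strings produced here are what would be passed to the off-target search, since seed matches in unrelated transcripts are a common source of unintended silencing.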

Key Research Reagent Solutions

Table 2: Essential Research Toolkit for RNA-Targeted Drug Discovery

| Reagent / Material | Function & Application | Example Product/Supplier |
|---|---|---|
| In Vitro Transcribed RNA | Generate pure, homogeneous target RNA for biophysical (SPR, ITC) and biochemical assays. | HiScribe T7 Quick High Yield Kit (NEB) |
| Fluorogenic RNA Aptamers | Report on RNA folding or ligand binding in live cells via fluorescence turn-on (e.g., Spinach, Broccoli). | Broccoli RNA Aptamer (Sigma-Aldrich) |
| Chemically Stabilized Oligonucleotides | Perform knockdown/functional studies with nuclease-resistant ASOs or siRNAs. | Silencer Select siRNAs (Thermo Fisher) |
| Selective Small Molecule Binders | Positive controls for RNA-target screening; e.g., Ribocil (FMN riboswitch), Risdiplam (SMN2 splicing). | Tocris Bioscience |
| Surface Plasmon Resonance (SPR) Chip | Immobilize biotinylated RNA to measure real-time binding kinetics of small molecules or oligonucleotides. | Series S Sensor Chip SA (Cytiva) |
| SHAPE Reagents (e.g., NMIA, 1M7) | Experimental validation of predicted RNA secondary structure by probing nucleotide flexibility. | SHAPE-MaP Reagent (Lexogen) |
| Cryo-EM Grids | Validate computationally predicted tertiary structures of RNA or RNA-drug complexes at near-atomic resolution. | Quantifoil R1.2/1.3 300 mesh Au grids |

Visualizing Workflows and Pathways

[Diagram: Target RNA Sequence -> Tertiary Structure Prediction (e.g., RoseTTAFoldNA) -> Pocket Detection (Grooves, Junctions) -> Virtual Screening (Docking Library) -> Hit Ranking & Interaction Analysis -> In Vitro Validation (Binding & Function) -> Lead Compound. In Vitro Validation also feeds back into Tertiary Structure Prediction (Iterative Refinement).]

Title: Computational Screening for RNA-Targeted Drugs

[Diagram: Disease-linked mRNA Sequence -> Predict Secondary & Tertiary Structure (CASP-informed Model) -> Map Accessible Single-Stranded Regions -> Design ASOs to Hybridize Accessible Sites -> Apply Chemical Modifications (PS-backbone, 2'-MOE) -> Validate: RT-qPCR, Splicing Assay, NMD -> Therapeutic ASO Candidate. Validation also feeds back into ASO design (Optimize Design).]

Title: Structure-Informed ASO Design and Optimization

[Diagram: CASP15 RNA Assessment -> Improved Deep Learning Models (RFNA, AF2) -> Accurate 3D Models of Disease RNAs -> Small Molecule Discovery and Oligonucleotide Therapeutic Design -> RNA-Targeted Therapeutics]

Title: CASP15's Impact on RNA Therapeutic Pipeline

Beyond the Benchmark: Addressing Persistent Challenges in RNA Prediction Accuracy

This technical guide analyzes persistent failure modes in tertiary RNA structure prediction, as revealed by the Critical Assessment of Structure Prediction 15 (CASP15) experiment. While protein structure prediction has been revolutionized by deep learning, RNA prediction lags significantly. Within the broader thesis on CASP15 RNA assessment, this paper deconstructs three core technical challenges that explain the performance gap: modeling long-range nucleotide interactions, assembling multi-chain ribonucleoprotein (RNP) complexes, and predicting the conformation of flexible loop regions. Accurate resolution of these issues is critical for researchers and drug development professionals targeting RNA for therapeutics and diagnostics.

Core Challenges: Analysis from CASP15 Data

CASP15 results quantitatively highlighted the disparity between top-performing methods and experimental structures. The following table summarizes key performance metrics for RNA targets, focusing on the three failure modes.

Table 1: CASP15 RNA Prediction Performance Summary by Challenge Category

| Target Category | Avg. GDT-TS (Top Group) | Avg. RMSD (Å) (Top Group) | Key Observed Failure Mode |
|---|---|---|---|
| Single-Chain, Long-Range | 42.7 | 14.2 | Mis-folding of distal base pairs, incorrect topology |
| Multi-Chain RNP Complexes | 28.5 | 21.8 | Incorrect protein-RNA interface, chain placement errors |
| Targets with Flexible Loops | 35.1 | 18.5 | High B-factor loop regions deviate >25 Å from native state |
| Overall RNA Targets | 38.9 | 16.9 | Composite of above |

Data derived from CASP15 assessment publications and official analysis. GDT-TS: Global Distance Test - Total Score; RMSD: Root Mean Square Deviation.
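To make the GDT-TS column concrete, the score can be approximated from per-nucleotide deviations between model and native structure. The sketch below assumes a single fixed superposition and one representative atom per nucleotide, whereas the full metric searches for the best superposition at each cutoff, so it gives a lower bound rather than the official value; the function name gdt_ts is illustrative.

```python
from typing import Sequence


def gdt_ts(distances: Sequence[float],
           thresholds: Sequence[float] = (1.0, 2.0, 4.0, 8.0)) -> float:
    """Average, over the cutoffs, of the percentage of positions within each cutoff (Å)."""
    if not distances:
        raise ValueError("at least one distance is required")
    n = len(distances)
    fractions = [sum(d <= t for d in distances) / n for t in thresholds]
    return 100.0 * sum(fractions) / len(thresholds)


# Hypothetical per-nucleotide deviations (Å) after superposition onto the native structure.
deviations = [0.5, 1.5, 3.0, 7.0]
print(gdt_ts(deviations))  # 62.5
```

On this 0-100 scale, a model whose every nucleotide sits more than 8 Å from its native position scores 0, which helps put the averages in the table above in context.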

Detailed Failure Mode Deconstruction

Long-Range Interactions

Long-range interactions (>15 nucleotides apart in sequence) are crucial for establishing RNA tertiary folds. Failure arises from:

  • Energy Function Limitations: Current scoring functions favor local stability over globally correct, but energetically subtle, long-range contacts.
  • Sampling Deficiency: Generative models fail to efficiently explore the conformational space needed to bring distal segments into proximity.
  • Co-transcriptional Folding Ignored: Most in silico methods fold the full-length sequence, ignoring the kinetic, step-wise folding in vivo.
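The sequence-separation criterion above is easy to apply to a predicted structure. As a minimal sketch, the code below parses plain dot-bracket notation and reports pairs more than 15 nucleotides apart. It handles nested pairs only (pseudoknots and tertiary contacts need an extended notation), and the function names are hypothetical.

```python
from typing import List, Tuple


def parse_dot_bracket(structure: str) -> List[Tuple[int, int]]:
    """Return 0-based (i, j) base pairs from a nested dot-bracket string."""
    stack: List[int] = []
    pairs: List[Tuple[int, int]] = []
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {i}")
            pairs.append((stack.pop(), i))
        elif ch != ".":
            raise ValueError(f"unsupported character {ch!r} at position {i}")
    if stack:
        raise ValueError("unmatched '(' in structure")
    return sorted(pairs)


def long_range_pairs(structure: str, min_separation: int = 15) -> List[Tuple[int, int]]:
    """Pairs whose partners are more than min_separation positions apart in sequence."""
    return [(i, j) for i, j in parse_dot_bracket(structure) if j - i > min_separation]


# A short hairpin has no long-range pairs; a pair closing a 20-nt loop does.
print(long_range_pairs("((....))"))
print(long_range_pairs("(" + "." * 20 + ")"))
```

Comparing the long-range pairs of a model against those of the experimental structure gives a quick indication of whether the global topology, and not just the local helices, was recovered.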

Experimental Protocol: Cross-linking Coupled with Mass Spectrometry (CL-MS) for Mapping Long-Range Contacts

  • Sample Preparation: Refold purified RNA in vitro under native conditions.
  • Cross-linking: Treat RNA with a reversible, RNA-adenosine-specific crosslinker (e.g., 2-iminothiolane).
  • Enzymatic Digestion: Digest RNA with RNase T1 (cleaves at G) to generate cross-linked oligonucleotide fragments.
  • LC-MS/MS Analysis: Analyze digests via liquid chromatography-tandem mass spectrometry.
  • Data Analysis: Identify cross-linked oligonucleotide pairs via specialized software (e.g., xQuest). Map intra-RNA cross-links to sequence distance to identify long-range interactions for validation of computational models.

Multi-Chain Complexes (RNPs)

Predicting the quaternary structure of RNA-protein complexes is a multi-body problem. Failures are characterized by:

  • Interface Inaccuracy: Mis-prediction of hydrogen bonding and stacking patterns at protein-RNA interfaces.
  • Induced Fit Neglect: Models treat both components as rigid bodies, ignoring mutual conformational adaptation.
  • Electrostatics Mismanagement: Inadequate handling of the strong electrostatic component of protein-RNA binding.

Experimental Protocol: Site-Directed Hydroxyl Radical Footprinting (HRF) for RNP Interface Mapping

  • Complex Formation: Incubate purified, refolded RNA with its protein binding partner(s) at physiological buffer conditions.
  • Radical Generation: Use a Fe-EDTA conjugate tethered to a specific cysteine residue engineered on the protein surface.
  • Fenton Reaction: Initiate by adding sodium ascorbate and hydrogen peroxide, generating short-lived hydroxyl radicals that cleave the RNA backbone at proximal solvent-accessible sites.
  • Cleavage Product Analysis: Quench reaction, recover RNA, and analyze cleavage pattern via primer extension and capillary electrophoresis or next-generation sequencing.
  • Footprint Identification: Compare cleavage patterns of bound vs. unbound RNA. Protected nucleotides define the protein interaction interface, providing a ground truth for computational docking.

Flexible Loops

Loops, bulges, and linkers often display high conformational entropy. Prediction failures include:

  • Ensemble Representation: Methods predict a single conformation rather than a dynamic ensemble.
  • Force Field Inaccuracy: Classical molecular dynamics (MD) force fields have known biases for RNA backbone dihedral angles (α/γ).
  • Lack of Restraints: These regions often lack evolutionary covariation signals, depriving prediction algorithms of constraints.

Experimental Protocol: NMR Relaxation Dispersion for Characterizing Loop Dynamics

  • Isotope Labeling: Produce ¹³C, ¹⁵N-labeled RNA via in vitro transcription with labeled NTPs.
  • NMR Data Collection: Acquire ¹³C Carr-Purcell-Meiboom-Gill (CPMG) relaxation dispersion datasets at multiple magnetic field strengths (e.g., 600 MHz, 800 MHz spectrometers).
  • Model Fitting: Fit dispersion profiles to two-state or multi-state exchange models to extract chemical shift differences (Δω) and exchange rates (k_ex) between conformational states.
  • Ensemble Generation: Use extracted kinetic and thermodynamic parameters to weight an ensemble of conformations from MD simulations that satisfy the NMR data.
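For the model-fitting step, the simplest two-state case can be written down directly. The sketch below uses the fast-exchange (Luz-Meiboom) approximation, which holds only when k_ex is much larger than Δω; slower exchange would need the more general Carver-Richards expression, which is not shown. Parameter names and the numerical values are illustrative assumptions.

```python
import math


def r2_eff_fast_exchange(nu_cpmg: float, r2_0: float, p_b: float,
                         delta_omega: float, k_ex: float) -> float:
    """Effective transverse relaxation rate (s^-1) for two-state fast exchange.

    nu_cpmg      CPMG frequency in Hz
    r2_0         exchange-free relaxation rate in s^-1
    p_b          population of the minor state (0 to 1)
    delta_omega  chemical shift difference in rad/s
    k_ex         exchange rate in s^-1
    """
    p_a = 1.0 - p_b
    phi = p_a * p_b * delta_omega ** 2
    x = k_ex / (4.0 * nu_cpmg)
    return r2_0 + (phi / k_ex) * (1.0 - math.tanh(x) / x)


# Hypothetical loop nucleotide: 10% minor state, 500 rad/s shift difference, k_ex = 5000 s^-1.
for nu in (50.0, 200.0, 1000.0):
    print(nu, round(r2_eff_fast_exchange(nu, 10.0, 0.1, 500.0, 5000.0), 3))
```

The exchange contribution is largest at low CPMG frequency and is suppressed as the pulsing rate increases, which is the shape of the dispersion profile being fitted. Because Δω scales with field strength, acquiring data at two fields helps separate Δω from the populations.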

Visualization of Concepts and Workflows

[Diagram: RNA Sequence & Potential Partners -> three RNA prediction failure modes (1. Long-Range Interaction Failure; 2. Multi-Chain Complex Failure; 3. Flexible Loop Failure) -> Inaccurate 3D Structural Model]

Diagram 1: Three Core RNA Prediction Failure Modes

[Diagram: Purified RNA & Protein -> 1. Tether Fe-EDTA to Engineered Cys -> 2. Form RNP Complex -> 3. Initiate Fenton Reaction (Ascorbate + H₂O₂) -> 4. Hydroxyl Radicals Cleave Accessible RNA -> 5. Analyze Cleavage Pattern (Primer Extension) -> Interface Map: Protected Nucleotides]

Diagram 2: Hydroxyl Radical Footprinting (HRF) Workflow

The Scientist's Toolkit: Research Reagent Solutions

Table 2: Essential Reagents for RNA Structure/Interaction Analysis

| Reagent / Material | Function / Application |
|---|---|
| 2-Iminothiolane (Traut's Reagent) | Reversible RNA-adenosine crosslinker for mapping long-range interactions via CL-MS. |
| Fe(II)-EDTA Complex | Iron complex of the hexadentate chelator EDTA, used to generate hydroxyl radicals in footprinting experiments. |
| ¹³C, ¹⁵N-labeled NTPs | Isotopically enriched nucleotides for producing NMR-active RNA for dynamics studies. |
| RNase T1 | Endoribonuclease that cleaves specifically at guanosine residues for generating defined RNA fragments. |
| Sodium Ascorbate | Reducing agent required to initiate the Fenton reaction in hydroxyl radical footprinting. |
| T4 RNA Ligase | Enzyme used in circularization assays to study RNA flexibility and dynamics. |
| SP6/T7 RNA Polymerase | High-yield phage polymerases for in vitro transcription of target RNA constructs. |
| Size Exclusion Chromatography (SEC) Resin | For purifying RNA and RNP complexes based on hydrodynamic radius under native conditions. |

The Critical Assessment of Techniques for Protein Structure Prediction (CASP) has extended to RNA, with CASP15 revealing significant progress yet persistent challenges in de novo RNA structure prediction. A core thesis emerging from the CASP15 RNA assessment is that the performance of leading AI/ML models, such as AlphaFold2 variants and specialized tools like RoseTTAFoldNA, is fundamentally constrained by the sparse and biased landscape of experimentally determined RNA structures available for training. This whitepaper provides a technical analysis of this data bottleneck.

Quantitative Landscape of the RNA Structure Data Bottleneck

Table 1: Comparison of Experimental Structure Databases (as of latest search)

| Database | Total RNA-Containing Entries (Proteins Excluded) | Unique RNA Chains (Non-Redundant) | Median Resolution (Å) | Dominant RNA Types |
|---|---|---|---|---|
| PDB (Overall) | ~6,500 | ~4,200 | 2.6 | Ribosomal, tRNA, aptamers |
| Non-Redundant Set (e.g., PDB-Dev) | ~1,500 | ~1,500 | 2.9 | Viral RNAs, riboswitches, ribozymes |
| vs. Protein Entries | ~6,500 RNA vs. ~200,000 protein | N/A | N/A | N/A |

Table 2: CASP15 RNA Target Analysis vs. Training Data Coverage

| CASP15 RNA Target Category | Number of Targets | Avg. Length (nt) | Closest Homolog in PDB (Sequence Identity) | Structural Novelty for Models |
|---|---|---|---|---|
| Free Modeling (FM) | 5 | 156 | <30% | High - True de novo test |
| Template-Based (TBM) | 8 | 102 | 30-70% | Medium - Folds known, details novel |
| Overall | 13 | 123 | N/A | N/A |

Key Insight: The entire corpus of unique experimental RNA structures is orders of magnitude smaller than that for proteins, creating a severe data scarcity for data-hungry deep learning models.

Methodological Limitations & Experimental Protocols

The sparsity is not merely numerical but stems from technical hurdles in RNA structure determination.

Key Experimental Protocol: X-ray Crystallography of RNA

Objective: Determine atomic-resolution 3D structure.

  • Sample Preparation: In vitro transcription of target RNA, followed by purification via denaturing PAGE or size-exclusion chromatography.
  • Crystallization: Screening of commercial sparse-matrix screens (e.g., Hampton Research) using vapor diffusion. Often requires engineering (e.g., mutagenesis, protein fusion, chaperone binding) to facilitate crystal packing.
  • Data Collection: Flash-cooling (cryo-cooling) of crystals. Diffraction data collected at synchrotron facilities.
  • Phasing: Solved via Molecular Replacement (using homologous RNA structure) or experimental methods (SAD/MAD with incorporated selenomethionine or halogenated nucleotides).
  • Model Building & Refinement: Manual building in Coot, refined with Phenix or Refmac.

Key Experimental Protocol: Cryo-Electron Microscopy (Cryo-EM) for Large RNAs/Complexes

Objective: Determine near-atomic resolution structures of dynamic RNA-protein complexes.

  • Sample Vitrification: Purified complex applied to EM grid, blotted, and plunge-frozen in liquid ethane.
  • Microscopy: Automated data collection on a Titan Krios or comparable cryo-TEM, collecting millions of particle images.
  • Image Processing: (a) Particle picking (e.g., crYOLO, Relion). (b) 2D classification to discard junk. (c) Ab initio 3D reconstruction (e.g., CryoSPARC). (d) Heterogeneous refinement to separate conformational states. (e) High-resolution non-uniform refinement and post-processing.
  • Model Building: De novo building in ISOLDE or using tools like PHENIX, followed by real-space refinement.

Visualizing the Bottleneck and Workflows

[Diagram: RNA of Biological Interest -> Experimental Determination -> (Slow & Difficult) Deposition in PDB -> (Redundancy Removal) Filtered Non-Redundant Training Set -> Severe Data Bottleneck (~1,500 Unique Folds) -> ML Model Training (e.g., AlphaFold2) -> Prediction on Novel Targets (CASP)]

Diagram 1: The RNA Structural Data Bottleneck Pipeline.

Diagram 2: Primary Experimental Workflows for RNA Structure.

The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Reagents & Materials for RNA Structure Determination

| Reagent / Material | Function & Application |
|---|---|
| T7 RNA Polymerase Kit (e.g., HiScribe) | High-yield in vitro transcription for producing milligram quantities of pure, homogeneous RNA. |
| Modified NTPs (Seleno-UTP, Br-UTP) | Incorporation into RNA for experimental phasing in X-ray crystallography via SAD/MAD. |
| Crystallization Screens (e.g., Natrix, MIDAS) | Sparse-matrix screens optimized for nucleic acids, increasing odds of crystal formation. |
| Maltose-Binding Protein (MBP) Fusion System | Protein fusion partner to aid RNA crystallization by providing packing interfaces. |
| Cryo-EM Grids (UltraFoil R1.2/1.3, Quantifoil) | Specially engineered grids with defined hole size and geometry for optimal vitrification. |
| Affinity Purification Tags (e.g., His-tag, Strep-tag) | Fused to protein binding partners for efficient purification of RNA-protein complexes for Cryo-EM. |
| Chemical Crosslinkers (BS3, DSS) | Stabilize transient RNA-protein or RNA-RNA interactions prior to Cryo-EM grid preparation. |

Implications for CASP15 and Future Directions

The CASP15 results demonstrated that even the best models struggled with long-range interactions and novel topologies absent from the training set. The bias towards small, stable, and often protein-bound RNAs in the PDB means models are poorly calibrated for large, multidomain, or protein-free RNAs.

Conclusion: Overcoming the data bottleneck requires a multi-pronged strategy: 1) Advancing high-throughput structural genomics for RNA, 2) Developing integrative hybrid methods (Cryo-EM, SAXS, chemical probing) to generate "medium-resolution" data for training, and 3) Creating better physics-based and synthetic data augmentation pipelines to complement the sparse experimental data. Until this bottleneck is addressed, the ceiling for accurate de novo RNA structure prediction will remain critically limited.

Optimizing for Non-Canonical Base Pairs and Ion-Mediated Stabilization

Within the context of the Critical Assessment of Techniques for Protein Structure Prediction (CASP) RNA structure prediction assessment (CASP15), a key finding was the critical role of modeling non-canonical base pairs (non-CBPs) and ion-mediated stabilization in achieving high-accuracy predictions. This whitepaper provides an in-depth technical guide on experimental and computational methodologies for optimizing these features, directly informed by the performance analysis of leading predictors in CASP15.

Core Concepts from CASP15 Analysis

The CASP15 RNA assessment revealed that successful groups (e.g., AIchemy_RNA2, RoseTTAFold) employed strategies that explicitly accounted for:

  • Non-Canonical Base Pairing: Hydrogen-bonding patterns beyond Watson-Crick (A-U, G-C) geometry, such as Hoogsteen, sugar-edge, and bifurcated pairs.
  • Ion-Mediated Stabilization: Specifically, the role of Mg²⁺ ions in stabilizing tertiary folds and neutralizing the repulsive negative charge of the phosphate backbone.

Failure to model these interactions was a primary source of large-scale model deviation, particularly for long, multi-helix junctions.

Table 1: Impact of Non-Canonical Base Pairs on CASP15 Prediction Accuracy (RMSD in Å)

| Target ID | Category | Top Predictor (RMSD) | Predictor Ignoring Non-CBPs (RMSD) | Key Non-CBPs Present |
|---|---|---|---|---|
| R1101 | Multi-helix Junction | 2.1 | 8.7 | G-U wobble, A-G sheared |
| R1107 | Riboswitch | 3.4 | 12.5 | Hoogsteen pairs, base triples |
| R1113 | Pseudoknot | 4.8 | 15.2 | A-minor motifs, reverse Hoogsteen |

Table 2: Effect of Explicit Mg²⁺ Modeling on Tertiary Structure Stability (in kcal/mol)

| Simulation Method | Average Stability (No Mg²⁺) | Average Stability (With Mg²⁺) | Stabilization Energy from Mg²⁺ |
|---|---|---|---|
| Molecular Dynamics (100 ns) | -1250.4 ± 45.2 | -1520.8 ± 32.1 | -270.4 ± 15.3 |
| MM/PBSA Calculation | -1180.7 | -1425.6 | -244.9 |

Experimental Protocols for Validation

Protocol: SHAPE-MaP for Probing Non-Canonical Interactions

Purpose: To experimentally map RNA secondary and tertiary structure, including regions involved in non-canonical pairing. Methodology:

  • Modification: Incubate 2 pmol of folded RNA in 100 µl of folding buffer with 6.5 µl of 100 mM NMIA (N-methylisatoic anhydride) or 1M7 (1-methyl-7-nitroisatoic anhydride) for 45 minutes at 37°C.
  • Reverse Transcription: Perform reverse transcription with Superscript III (Thermo Fisher) using a primer 50-70 nt downstream. The polymerase will introduce mutations at modification sites.
  • Library Prep & Sequencing: Construct cDNA libraries for Illumina sequencing. Mutations are read as sequence changes.
  • Data Analysis: Calculate SHAPE reactivity per nucleotide using shapemapper2. Low reactivity indicates base pairing or tertiary interaction; moderate reactivity often indicates non-canonical or flexible paired regions.
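The reactivity calculation in the last step can be illustrated with a simplified version of the arithmetic. This is a sketch rather than the shapemapper2 algorithm: it subtracts the untreated mutation rate from the modified rate, scales by the mean of the top 10% of values as a stand-in for the usual outlier-aware normalization, and bins the result with the commonly used 0.4 and 0.85 cutoffs, which are assumptions here. All function names and rates are hypothetical.

```python
from typing import List, Sequence


def raw_reactivity(rate_modified: Sequence[float], rate_untreated: Sequence[float]) -> List[float]:
    """Background-subtracted mutation rate per nucleotide."""
    if len(rate_modified) != len(rate_untreated):
        raise ValueError("profiles must have the same length")
    return [m - u for m, u in zip(rate_modified, rate_untreated)]


def normalize(profile: Sequence[float], top_fraction: float = 0.1) -> List[float]:
    """Scale so that the mean of the highest values equals 1.0 (simplified scheme)."""
    ranked = sorted(profile, reverse=True)
    n_top = max(1, int(len(ranked) * top_fraction))
    scale = sum(ranked[:n_top]) / n_top
    if scale <= 0:
        raise ValueError("no positive reactivities to normalize against")
    return [r / scale for r in profile]


def classify(reactivity: float, low: float = 0.4, high: float = 0.85) -> str:
    """Bin a normalized reactivity as low, moderate, or high."""
    if reactivity < low:
        return "low"
    if reactivity < high:
        return "moderate"
    return "high"


# Hypothetical per-nucleotide mutation rates for a four-nucleotide stretch.
modified = [0.010, 0.002, 0.006, 0.001]
untreated = [0.000, 0.001, 0.001, 0.001]
normalized = normalize(raw_reactivity(modified, untreated))
print([classify(r) for r in normalized])
```

Nucleotides binned as low are candidates for base pairing or tertiary contacts, while moderate values flag positions worth inspecting for non-canonical or flexible pairing in the 3D model.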

Protocol: X-ray Crystallography with Anomalous Scattering for Ion Mapping

Purpose: To determine the high-resolution 3D structure of an RNA and locate bound Mg²⁺ ions. Methodology:

  • Crystallization: Co-crystallize RNA in conditions containing 50-100 mM MgCl₂ or using strontium (Sr²⁺) as an isomorphic heavy-atom substitute for phasing.
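  • Data Collection: Collect a high-resolution (≤2.0 Å) dataset at the synchrotron. Collect an additional dataset tuned to the anomalous scatterer. Mg²⁺ itself gives almost no anomalous signal at typical beamline wavelengths (~1.55 Å), because the Mg K-edge lies near 9.5 Å, so ion sites are usually mapped through a heavier substitute such as Sr²⁺.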
  • Phasing & Modeling: Solve structure using molecular replacement or experimental phasing. Identify Mg²⁺ ions as strong, spherical peaks of electron density in an Fo-Fc map, coordinated by RNA ligands (e.g., phosphate oxygens, O6 of G) in an octahedral geometry.
  • Validation: Check ion binding sites using CheckMyMetal (CMM) server.

Computational Optimization Workflow

[Diagram: Input RNA Sequence -> Multiple Sequence Alignment (INFERNAL, R-scape) -> Conserved Base Pair Prediction (RNAalifold, R2R) -> Initial 3D Modeling (SimRNA, FARFAR2) -> Non-Canonical Base Pair Refinement (ModeRNA, Rosetta) -> Explicit Ion Placement (MgNet, FELL) -> Molecular Dynamics Relaxation (AMBER, GROMACS) -> Final Model Validation (RNA-Puzzles, MolProbity)]

Diagram 1: Computational Workflow for RNA Structure Prediction

  • Mg²⁺ Ion → Water Molecule (Inner Sphere): direct coordination
  • Mg²⁺ Ion → Guanine O6 (N7): inner-sphere contact
  • Mg²⁺ Ion → Tertiary Fold Stabilization: charge neutralization
  • Water Molecule → Phosphate Groups (two): H-bonds
  • Phosphate Groups → Tertiary Fold Stabilization: reduced electrostatic repulsion

Diagram 2: Ion-Mediated Stabilization Mechanisms

The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Reagents for RNA Structure Studies

Item | Function & Application | Example Product/Kit
NMIA / 1M7 | SHAPE chemical probes for interrogating RNA backbone flexibility at single-nucleotide resolution. | Heidelberg SHAPE Reagents
SuperScript III/IV | Processive reverse transcriptases for reading out SHAPE modifications during cDNA synthesis. | Thermo Fisher Scientific
MagneHis Ni-Particles | For rapid purification of 6xHis-tagged proteins (e.g., RNA-binding partners co-crystallized with the RNA). | Promega
Hampton Crystal Screen | Sparse-matrix screens for initial RNA crystallization condition screening. | Hampton Research
AMBER Force Field (OL3) | Nucleic acid force field parameters for MD simulations of RNA, including non-canonical base pairs (bsc1 is the DNA counterpart). | AmberTools
Rosetta RNA Suite | Computational modeling suite for de novo and template-based prediction; optimizes non-canonical base pairs. | Rosetta Commons
CheckMyMetal (CMM) | Web server for validating metal-binding sites in macromolecular structures. | University of Virginia

Strategies for Improving Pseudoknot and Tertiary Contact Prediction

The Critical Assessment of protein Structure Prediction (CASP) expanded to include RNA targets in its 15th round, providing a rigorous, blind benchmark for the field. CASP15 results underscored a significant performance gap: while prediction of simple secondary structures is maturing, accurate identification of pseudoknots and long-range tertiary contacts (e.g., base pairs more than 20 nucleotides apart) remains a formidable challenge. This whitepaper analyzes the post-CASP15 landscape, detailing advanced computational and experimental strategies to bridge this accuracy gap, which is critical for modeling functional RNA architectures in biomedical research and drug discovery.

Quantitative Assessment from CASP15 and Recent Benchmarks

The table below summarizes key quantitative metrics from CASP15 RNA assessment and subsequent studies, highlighting the performance deficit on complex motifs.

Table 1: Performance Metrics on Pseudoknot & Tertiary Contact Prediction (CASP15 & Post-CASP15 Studies)

Method Category | Example Tools/Approaches | Pseudoknot F1-Score* | Tertiary Contact (Long-Range) F1-Score* | Key Limitation Identified
Comparative Modeling | R-scape, RAF | 0.45 - 0.60 | 0.20 - 0.35 | Requires deep, high-quality sequence alignments.
Deep Learning (Sequence Only) | SPOT-RNA2, MXfold2 | 0.55 - 0.70 | 0.25 - 0.40 | Struggles with evolutionarily rare contacts.
Deep Learning + Evolutionary | AlphaFold2 (adapted), RhoFold | 0.65 - 0.78 | 0.35 - 0.50 | Improved but still lags behind protein performance.
Integrative (Exp. Data) | SAXS, RIC-seq, DMS | 0.70 - 0.85 | 0.50 - 0.70 | Accuracy tied to experimental data quality/resolution.
Physics-Based & MD | Coarse-grained MD, IsRNA1 | 0.30 - 0.50 | 0.15 - 0.30 | Computationally expensive, often low precision.

*F1-Score ranges are approximate, compiled from published assessments. Higher is better (max 1.0).
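For readers who want to reproduce this kind of score on their own predictions, the following is a minimal sketch of a base-pair F1 calculation. It assumes structures are given in extended dot-bracket notation, where `[]` and `{}` mark pseudoknotted pairs; the function names are illustrative and not taken from any assessment pipeline.

```python
def pairs_from_dot_bracket(structure):
    """Return the set of (i, j) base pairs; '[]' and '{}' mark pseudoknots."""
    openers = {"(": ")", "[": "]", "{": "}"}
    closers = {close: open_ for open_, close in openers.items()}
    stacks = {open_: [] for open_ in openers}
    pairs = set()
    for idx, char in enumerate(structure):
        if char in openers:
            stacks[char].append(idx)
        elif char in closers:
            pairs.add((stacks[closers[char]].pop(), idx))
    return pairs

def f1_score(predicted, reference):
    """Harmonic mean of precision and recall over base-pair sets."""
    tp = len(predicted & reference)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(reference)
    return 2 * precision * recall / (precision + recall)

# The prediction recovers the helix but misses the pseudoknot entirely.
reference = pairs_from_dot_bracket("((..[[..))..]]")
predicted = pairs_from_dot_bracket("((......))....")
score = f1_score(predicted, reference)
```

Restricting both sets to pairs with `j - i > 20` before scoring would give a long-range variant in the spirit of the tertiary-contact column.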

Core Methodologies and Experimental Protocols

Protocol: Integrating Chemical Probing Data (DMS/MaP) for Constraint-Based Folding

Objective: Incorporate experimental single-nucleotide reactivity data to guide in silico folding and improve tertiary contact prediction.

  • Data Acquisition: Perform in vitro DMS (dimethyl sulfate) or SHAPE probing on the target RNA and read out modifications by mutational profiling (MaP). For cellular contexts, use in-cell variants of the same probing chemistry.
  • Reactivity Scoring: Calculate normalized reactivity scores per nucleotide. Positions with low reactivity are considered structurally constrained (paired or buried).
  • Constraint Encoding: Convert reactivities into soft probabilistic constraints (e.g., pseudo-energy bonuses/penalties) for folding algorithms. For example, low reactivity receives a bonus for being paired.
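  • Constrained Folding: Execute folding simulations using the derived constraints, e.g., ViennaRNA's RNAfold with the --shape option for soft reactivity-based constraints (its -C/--constraint option applies hard constraints instead), or Rosetta's FARFAR2.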
  • Ensemble Analysis: Cluster generated models, evaluate constraint satisfaction, and select top-scoring decoys for tertiary contact analysis.
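The constraint-encoding step can be sketched as follows. This uses the log-linear pseudo-free-energy form common in SHAPE-directed folding, ΔG = m·ln(reactivity + 1) + b, applied to nucleotides in paired positions. The slope and intercept below (2.6 and -0.8 kcal/mol) are frequently quoted defaults and should be treated as assumptions to be tuned per reagent and folding program.

```python
import math

def shape_pseudo_energy(reactivity, slope=2.6, intercept=-0.8):
    """Pseudo-free-energy term (kcal/mol) for a nucleotide in a paired position.

    Uses dG = slope * ln(reactivity + 1) + intercept. Missing or negative
    reactivities contribute no term.
    """
    if reactivity is None or reactivity < 0:
        return 0.0
    return slope * math.log(reactivity + 1.0) + intercept

# Low reactivity gives a bonus (negative term); high reactivity a penalty.
reactivities = [0.0, 0.1, 0.5, 1.5, None]
terms = [round(shape_pseudo_energy(r), 2) for r in reactivities]
```

With these parameters the term changes sign at a reactivity of roughly 0.36, so nucleotides below that value are rewarded for pairing and those above it are penalized.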

Protocol: Employing RIC-seq for Direct Tertiary Contact Mapping

Objective: Experimentally capture RNA-RNA proximal interactions within a cellular complex to guide 3D modeling.

  • Cell Lysis & In-Situ Crosslinking: Fix RNA-protein and RNA-RNA proximities in situ using formaldehyde or UV crosslinking.
  • RNase Digestion & Proximity Ligation: Partially digest RNA with RNase, leaving crosslinked fragments. Ligate RNA ends that are held in proximity.
  • Library Prep & Sequencing: Reverse transcribe, construct sequencing library, and perform paired-end deep sequencing.
  • Bioinformatic Analysis: Map chimeric reads to the genome/transcriptome. Identify recurrent ligation junctions as evidence of spatial proximity between distal RNA regions.
  • Constraint Application in Modeling: Use identified proximal pairs as distance restraints (e.g., 5-20 Å) in 3D structure modeling pipelines like Rosetta or 3dRNA.
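Once models have been generated, one simple way to rank them against proximity data is to count how many restraints each model satisfies. The sketch below assumes one representative atom per residue (for example C4') and a single upper distance bound; the function name and the 20 Å default are illustrative choices based on the range quoted above.

```python
import math

def restraint_satisfaction(coords, restraints, max_dist=20.0):
    """Fraction of (i, j) restraints whose atoms lie within max_dist angstroms.

    coords maps a residue number to one representative atom position (x, y, z).
    """
    if not restraints:
        return 0.0
    satisfied = sum(
        1 for i, j in restraints
        if math.dist(coords[i], coords[j]) <= max_dist
    )
    return satisfied / len(restraints)

# Toy model: three residues on a line, 15 A apart.
coords = {10: (0.0, 0.0, 0.0), 45: (15.0, 0.0, 0.0), 80: (30.0, 0.0, 0.0)}
restraints = [(10, 45), (45, 80), (10, 80)]
fraction = restraint_satisfaction(coords, restraints)
```

Because crosslinking data report on an ensemble, a model that violates a few restraints is not necessarily wrong; the score is best used to compare decoys rather than as a pass/fail filter.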

Visualizing Strategic Approaches

  • Target RNA Sequence → Deep Learning Prediction (e.g., RhoFold) → Integrative Modeling Engine: base pair probabilities
  • Target RNA Sequence → Comparative Evolutionary Analysis → Integrative Modeling Engine: covariation restraints
  • Experimental Data Integration → Integrative Modeling Engine: DMS/RIC-seq restraints
  • Integrative Modeling Engine (e.g., FARFAR2, SimRNA) → 3D Ensemble with Pseudoknots & Tertiary Contacts

Title: Integrative Modeling Workflow for RNA Structure

The Scientist's Toolkit: Key Research Reagent Solutions

Table 2: Essential Reagents & Tools for Advanced RNA Structure Prediction

Item | Function/Application in Prediction | Key Provider/Example
DMS (Dimethyl Sulfate) | Chemical probe for detecting unpaired adenosine and cytosine bases. Generates single-nucleotide reactivity constraints. | Sigma-Aldrich
N3-kethoxal | Selective chemical probe for unpaired guanosine bases. Complementary to DMS for fuller nucleotide coverage. | Merck
Formaldehyde | Crosslinking agent for fixing RNA-RNA proximities in RIC-seq and related protocols. | Thermo Fisher Scientific
T4 RNA Ligase 1 | Enzyme for ligating proximally crosslinked RNA fragments in RIC-seq library preparation. | New England Biolabs
Monarch RNA Purification Kits | High-yield, DNase-treated RNA isolation for in vitro probing experiments. | New England Biolabs
SuperScript IV Reverse Transcriptase | Engineered for high-efficiency cDNA synthesis from structured RNA and crosslinked fragments. | Thermo Fisher Scientific
ViennaRNA Package | Core software suite for RNA secondary structure prediction, folding, and constraint incorporation. | University of Vienna
Rosetta FARFAR2 | Fragment Assembly of RNA for ab initio 3D modeling with experimental restraints. | Rosetta Commons
SimRNA | Coarse-grained Monte Carlo simulator for 3D RNA folding using various restraint types. | SimRNA.org
R-scape | Statistical tool for identifying evolutionarily covarying base pairs from alignments. | R-scape/Eddy Lab

This guide, framed within the broader thesis on CASP15 RNA structure prediction assessment results, provides a technical framework for selecting and tuning computational models to predict the structure of distinct RNA classes. The CASP15 assessment revealed significant disparities in model performance across RNA types, underscoring the need for class-specific strategies. This document synthesizes current methodologies, data, and experimental protocols for researchers, scientists, and drug development professionals.

CASP15 RNA Assessment: Key Performance Insights

The CASP15 experiment provided a critical benchmark for RNA structure prediction. The results demonstrated that no single model performs optimally across all RNA structural classes. Performance is heavily influenced by RNA length, the presence of pseudoknots, multibranch loops, and non-canonical base pairs. The following table summarizes key quantitative findings from the assessment for major model types.

Table 1: Summary of CASP15 RNA Prediction Model Performance by RNA Class

RNA Structural Class / Feature | Top-Performing Model(s) | Average GDT_TS (Range) | Key Challenge
Small Simple Motifs (<50 nt) | AlphaFold2 (AF2) / RoseTTAFold | 75-85 | Limited to single structures; misses conformational diversity.
Large Riboswitches & Aptamers | DeepFoldRNA / DRfold | 65-75 | Modeling long-range tertiary contacts and ligand binding pockets.
RNA-Protein Complexes | AF2-multimer | 60-70 (RNA component) | Accurately positioning RNA within the complex interface.
RNAs with Pseudoknots | RhoFold / SPOT-RNA | 50-65 | Predicting mutually exclusive base-pairing patterns.
Multi-Helix Junctions | Fragment Assembly methods | 55-70 | Correct relative orientation of helical arms.
Genomic Length RNAs | Constraint-based Folding | N/A (Qualitative) | Computational tractability and inclusion of in vivo constraints.

Model Selection Guide by RNA Class

This section maps RNA characteristics to recommended model classes and tuning strategies.

Table 2: Model Selection and Tuning Matrix

RNA Class | Primary Characteristics | Recommended Base Model | Critical Tuning Parameters / Strategies
tRNA / miRNA | Small, high structure conservation, 2D structure critical. | SPOT-RNA, CONTRAfold | Use high-weight base-pairing constraints; limit conformational sampling.
Riboswitches | Ligand-binding pockets, complex tertiary folds, conformational change. | DeepFoldRNA, DRfold | Incorporate ligand density maps (if available) as restraints; focus on pocket region refinement with MD.
Ribozymes | Catalytic core, specific metal ion binding, often compact. | AlphaFold2 (modified) | Fix metal-ion binding site geometry with distance restraints; refine active site with QM/MM.
lncRNAs / Genomic | Very long, modular, protein-bound, in vivo structures. | Rosetta/FARFAR2 with experimental data | Integrate SHAPE-MaP, DMS-seq, and RIC-seq data as soft constraints; use fragment assembly.
Viral RNA Elements | Pseudoknots, multibranch junctions, replication frameworks. | RhoFold, MXfold2 | Enable pseudoknot prediction flags; apply specialized energy parameters for viral motifs.
RNA-Protein Complexes | Protein interface, binding-induced folding. | AF2-multimer, HADDOCK | Provide protein sequence/structure as input; co-predict interface.

Experimental Protocols for Generating Restraint Data

Model tuning for specific RNAs often requires integration of experimental data. Below are detailed protocols for key experiments that generate structural restraints.

Protocol: SHAPE-MaP for Probing RNA Secondary and Tertiary Structure

Objective: Obtain nucleotide-resolution chemical probing data to inform on RNA flexibility and base-pairing status. Key Reagents: See "The Scientist's Toolkit" below.

  • RNA Preparation: In vitro transcribe and gel-purify target RNA (>200 pmol). Refold in appropriate buffer (e.g., 50 mM HEPES pH 8.0, 100 mM KCl, 5 mM MgCl₂) by heating to 95°C for 2 min, then cooling on ice for 2 min, followed by incubation at 37°C for 20 min.
  • SHAPE Modification: Divide RNA into (+) and (-) reagent tubes. For (+), add 1-5 mM N-methylisatoic anhydride (NMIA) or 1M7 in DMSO. For (-), add DMSO only. Incubate at 37°C for 45 min.
  • RNA Cleanup: Ethanol precipitate RNA.
  • Mutational Profiling (MaP) RT: Perform reverse transcription with SuperScript II using a gene-specific primer, a high dNTP concentration (1 mM each), and, as is typical for MaP, a Mn²⁺-containing buffer so that the enzyme reads through modified sites and records them as mutations. Use thermocycling: 25°C for 5 min, 42°C for 45 min, 70°C for 15 min.
  • cDNA Amplification & Library Prep: Amplify cDNA by PCR using barcoded primers for Illumina. Purify and size-select libraries.
  • Sequencing & Analysis: Sequence on an Illumina MiSeq. Process data with the ShapeMapper 2 pipeline to generate reactivity profiles.
  • Constraint Conversion: Convert SHAPE reactivities to pseudo-energy constraints for use in models like Rosetta (e.g., high reactivity = penalty for pairing).
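The core of the analysis step is a background subtraction between the (+) and (-) reagent samples. The sketch below shows only that raw calculation, under the assumption that per-nucleotide mutation counts and read depths are already available; the `min_depth` threshold is an arbitrary illustrative value, and dedicated pipelines apply further filtering and normalization on top of this.

```python
def map_reactivity(mod_mut, mod_depth, unt_mut, unt_depth, min_depth=1000):
    """Background-subtracted mutation rate per nucleotide.

    Returns None at positions where either sample has fewer than
    min_depth reads, since the rate estimate is unreliable there.
    """
    profile = []
    for mm, md, um, ud in zip(mod_mut, mod_depth, unt_mut, unt_depth):
        if md < min_depth or ud < min_depth:
            profile.append(None)
        else:
            profile.append(mm / md - um / ud)
    return profile

# Three positions: near-background, strongly modified, and under-covered.
profile = map_reactivity(
    mod_mut=[50, 400, 10], mod_depth=[10000, 10000, 500],
    unt_mut=[40, 20, 1], unt_depth=[10000, 10000, 10000],
)
```

The resulting raw rates are usually rescaled to a common range before being converted into pseudo-energy terms.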

Protocol: RIC-seq for Capturing RNA-RNA Proximity

Objective: Identify spatially proximate RNA residues (in vivo or in vitro) to inform 3D modeling.

  • Crosslinking: For cells, treat with 0.3% formaldehyde for 10 min at room temperature. Lyse cells. For in vitro, incubate refolded RNA complex with 0.1% glutaraldehyde for 10 min on ice. Quench with 125 mM glycine.
  • RNase Digestion & Proximity Ligation: Partially digest RNA with RNase I (for single-strand bias) or micrococcal nuclease. Repair ends and ligate spatially proximate RNA fragments using T4 RNA Ligase 1.
  • Library Construction & Sequencing: Reverse transcribe, PCR amplify, and sequence on an Illumina platform.
  • Data Analysis: Use a chimeric-read analysis pipeline (e.g., RICpipe or similar) to identify chimeric reads. Build a contact map of proximal nucleotides.
  • Model Integration: Use proximal nucleotide pairs as distance restraints (e.g., < 20 Å) in tertiary structure modeling with programs like Rosetta or through custom scripts for neural network models.
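The contact-map step can be illustrated with a small binning routine. It assumes chimeric reads have already been reduced to pairs of nucleotide positions at each ligation junction; the bin size, read-count threshold, and minimum sequence separation are illustrative defaults rather than values from any published pipeline.

```python
from collections import Counter

def contact_map(junctions, bin_size=10, min_reads=3, min_separation=20):
    """Bin chimeric-read junctions and keep recurrent long-range contacts.

    junctions is an iterable of (pos_a, pos_b) nucleotide positions joined
    in one chimeric read. Returns {(bin_lo, bin_hi): read_count}.
    """
    counts = Counter()
    for a, b in junctions:
        lo, hi = sorted((a, b))
        if hi - lo < min_separation:
            continue  # short-range ligations say little about tertiary fold
        counts[(lo // bin_size, hi // bin_size)] += 1
    return {pair: n for pair, n in counts.items() if n >= min_reads}

junctions = [(12, 95), (98, 15), (11, 91), (40, 45), (30, 70)]
contacts = contact_map(junctions)
```

Each surviving bin pair can then be translated into an upper-bound distance restraint between representative residues of the two bins.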

Visualization of Workflows and Relationships

  • RNA Class Identification → Experimental Data Acquisition: informs needs
  • RNA Class Identification → Base Model Selection: guides choice
  • Experimental Data Acquisition → Parameter Tuning & Restraint Integration: provides constraints
  • Base Model Selection → Parameter Tuning & Restraint Integration
  • Parameter Tuning & Restraint Integration → Structure Prediction & Refinement → Validation (vs. Experimental)
  • Validation (vs. Experimental) → RNA Class Identification: feedback loop

Diagram Title: RNA Structure Prediction Model Tuning Workflow

  • Input: RNA Sequence → MSA + Covariance Analysis → Neural Network (Encoder) → Geometric Module (Decoder) → Predicted 3D Coordinates
  • Predicted 3D Coordinates → Refinement (MD/Energy Min.): initial model
  • Experimental Restraints → Refinement (MD/Energy Min.): guides
  • Refinement (MD/Energy Min.) → Final Model

Diagram Title: Deep Learning Model Pipeline with Experimental Integration

The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Reagents for RNA Structure Probing Experiments

Item | Function in Experiment | Example Product / Specification
NMIA or 1M7 | SHAPE reagent. Modifies flexible (unpaired) RNA nucleotides at the 2'-OH position, providing reactivity data. | 1-methyl-7-nitroisatoic anhydride (1M7), >95% purity, stored desiccated at -20°C.
SuperScript II/III | Reverse transcriptase for MaP. Lacks proofreading and, under MaP conditions, reads through modified sites and records them as mismatches during cDNA synthesis. | Invitrogen SuperScript II Reverse Transcriptase.
RNase I | Single-strand specific ribonuclease. Used in RIC-seq for partial digestion to generate fragments for proximity ligation. | Thermo Fisher RNase I, 100 U/μL.
T4 RNA Ligase 1 | Catalyzes RNA-RNA ligation. In RIC-seq, it ligates crosslinked, proximal RNA fragments. | NEB T4 RNA Ligase 1 (ssRNA Ligase), 10 U/μL.
DMS (Dimethyl Sulfate) | Chemical probe for adenine and cytosine accessibility. Methylates N1 of A and N3 of C in unstructured regions. | Sigma-Aldrich, ≥99% purity. Caution: Highly toxic.
5'-Biotinylated DNA Primers | For pulldown of specific RNAs in in vivo experiments or for immobilization during in vitro folding. | HPLC-purified, with C6 linker biotin at 5' end.
Structure-Specific RNases (e.g., RNase V1, RNase T1) | Enzymes that cleave double-stranded regions (V1) or single-stranded guanosine residues (T1). Provide complementary pairing data. | Ambion RNase V1; Thermo Fisher RNase T1.
PEG Crowding Agents (e.g., PEG 200) | Mimic the crowded intracellular environment for in vitro refolding, which can significantly alter RNA structure. | Sigma-Aldrich Polyethylene Glycol 200.

CASP15 RNA Showdown: Rigorous Performance Validation and Model Comparison

This whitepaper presents a quantitative analysis of leading computational methods for RNA tertiary structure prediction, evaluated on blind targets from the Critical Assessment of Structure Prediction (CASP15) experiment. The findings are framed within the broader thesis that while deep learning has revolutionized protein structure prediction, its application to RNA presents unique challenges due to RNA's increased conformational flexibility, complex non-canonical base pairing, and metal ion interactions. This benchmark assesses how current methods address these challenges in a blind testing scenario.

Experimental Protocols: CASP15 RNA Assessment

The core methodology follows the standardized CASP15 protocol for RNA structure prediction.

A. Target Selection & Distribution:

  • Source: Experimentalists deposited soon-to-be-published RNA structures with the CASP organizers.
  • Blinding: Sequences (and sometimes secondary structure constraints) were released to predictor groups. Solved 3D structures were withheld for assessment.
  • Target Complexity: Targets included single chains, RNA-protein complexes, and multi-chain RNAs, ranging from ~30 to more than 700 nucleotides.

B. Prediction Submission & Evaluation:

  • Prediction: Participating groups (e.g., DeepMind's AlphaFold2 variants, RoseTTAFold, specialized tools like SimRNA, FARFAR2) submitted predicted 3D coordinate models.
  • Primary Quantitative Metric: Root Mean Square Deviation (RMSD) over representative atoms (backbone P and C4', plus the glycosidic N1/N9 of each base) after optimal superposition on the experimental structure. Lower RMSD indicates higher accuracy.
  • Secondary Metrics: Interaction Network Fidelity (INF) for non-canonical pairs, and DockQ Score for RNA-protein interface accuracy in complexes.
  • Assessment: The CASP assessors performed independent, blinded calculations of these metrics to rank methods.
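As a concrete reference for the primary metric, RMSD over N matched atoms is sqrt((1/N) Σ dᵢ²), where dᵢ is the distance between corresponding atoms. The sketch below computes only this final step and assumes the model has already been optimally superposed on the reference (the superposition itself, typically done with the Kabsch algorithm, is not shown).

```python
import math

def rmsd(model, reference):
    """RMSD (angstroms) between two equally ordered, pre-superposed atom lists."""
    if len(model) != len(reference) or not model:
        raise ValueError("coordinate lists must be non-empty and equal length")
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(model, reference))
    return math.sqrt(squared / len(model))

# Three atoms displaced by 1, 1, and 2 angstroms from the reference.
reference = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
model = [(0.0, 1.0, 0.0), (5.0, -1.0, 0.0), (10.0, 2.0, 0.0)]
value = rmsd(model, reference)
```

Because deviations are squared, a single badly placed region can dominate the value, which is one reason complementary metrics are reported alongside it.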

Quantitative Benchmark Results

The table below summarizes the performance of top-performing methods on a representative subset of CASP15 RNA-only blind targets.

Table 1: Quantitative Benchmark of Top Methods on CASP15 RNA Blind Targets

Method Name (Group) | Core Approach | Avg. RMSD (Å) (All Atoms) | Best Target RMSD (Å) | Worst Target RMSD (Å) | INF Score* (Avg)
AlphaFold2 (AF2) | Deep Learning (MSA + Evoformer + Structure Module) | 4.5 | 1.2 (Target R1101) | 12.8 (Target R1103) | 0.71
RoseTTAFoldNA | Deep Learning (3-track network for sequence, distance, coordinates) | 5.8 | 2.1 | 14.5 | 0.65
FARFAR2 (Rosetta) | Fragment Assembly + Fragment Monte Carlo | 7.2 | 3.5 | 16.0 | 0.58
SimRNA | Coarse-grained Modeling + Statistical Potentials | 8.1 | 4.8 | 18.2 | 0.52
Reference (Baseline) | Classic Homology Modeling | 15.3 | 10.5 | 25.7 | 0.22

*INF Score: 1.0 indicates perfect prediction of non-canonical interaction networks.
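INF is commonly defined as the geometric mean of precision and sensitivity over the set of annotated interactions. The sketch below follows that definition; the interaction tuples and type labels (such as "cWW") are illustrative, and extracting them from coordinates requires a separate annotation tool that is not shown.

```python
import math

def inf_score(predicted, reference):
    """Interaction Network Fidelity: sqrt(precision * sensitivity)."""
    tp = len(predicted & reference)
    if not predicted or not reference or tp == 0:
        return 0.0
    precision = tp / len(predicted)
    sensitivity = tp / len(reference)
    return math.sqrt(precision * sensitivity)

# Interactions as (residue_i, residue_j, type); type labels are illustrative.
reference = {(1, 20, "cWW"), (2, 19, "cWW"), (5, 30, "tHS"), (6, 31, "tSH")}
predicted = {(1, 20, "cWW"), (2, 19, "cWW"), (5, 30, "cWW")}
score = inf_score(predicted, reference)
```

In this toy case the pair (5, 30) is predicted with the wrong interaction type, so it counts as both a false positive and a missed interaction. Restricting both sets to non-canonical types gives the non-canonical variant of the score.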

Key Interpretation: AF2-based approaches demonstrated superior average accuracy, particularly on targets with deep evolutionary information (multiple sequence alignments). However, high RMSD on certain targets highlights failures on more flexible or unique folds. Classical sampling methods (FARFAR2, SimRNA) showed more consistent but generally less precise performance.

Visualizing the Assessment Workflow & Key Challenges

Diagram 1: CASP15 RNA Prediction Assessment Workflow

Blind Target Selection → Sequence Release to Predictors → Model Generation (Deep Learning, e.g., AF2; or Physics/Sampling, e.g., FARFAR2) → Model Submission → Independent Assessment (RMSD, INF, DockQ) → Ranking & Publication

Diagram 2: Key RNA-Specific Challenges in Structure Prediction

RNA Blind Target → four challenges (Flexible Linkers & Loops; Non-Canonical Base Pairs; Metal Ion Dependence; Sparse Evolutionary Data) → High Prediction RMSD & Low INF Score

The Scientist's Toolkit: Key Research Reagent Solutions

This table details essential computational and experimental resources for research in this field.

Table 2: Essential Toolkit for RNA Structure Prediction & Validation

Item / Solution | Category | Primary Function in Research
AlphaFold2 (ColabFold) | Software | Provides state-of-the-art deep learning predictions for RNA and RNA-protein complexes via an accessible interface.
Rosetta FARFAR2 | Software | Samples RNA conformational space using fragment assembly and physics-based scoring, useful for de novo design.
DCA (Direct Coupling Analysis) | Algorithm | Infers evolutionary co-variance from MSAs to predict RNA-RNA or RNA-protein contacts for restraint generation.
Cryo-EM Structures | Data | High-resolution experimental structures from databases (PDB, EMDB) serve as critical benchmarks and training data.
SHAPE-MaP | Wet-lab Reagent | Selective 2'-Hydroxyl Acylation analyzed by Primer Extension and Mutational Profiling. Probes RNA backbone flexibility in vitro and in vivo to inform secondary/tertiary structure models.
Mg²⁺ / Mn²⁺ Chelators | Wet-lab Reagent | Used in crystallization and buffer optimization to study metal ion dependence of RNA folding and stability.
3dRNA | Software | A template-based method for RNA structure prediction, useful when homologous structures are available.

Strengths and Weaknesses Analysis by RNA Type (riboswitches, ribozymes, aptamers)

The Critical Assessment of Structure Prediction (CASP) is the gold-standard competition for evaluating protein and, more recently, RNA structure prediction methodologies. The CASP15 assessment highlighted significant progress in RNA tertiary structure prediction, driven primarily by deep-learning techniques adapted from protein folding (e.g., AlphaFold2). However, performance varied considerably across RNA functional types, underscoring the need for a nuanced analysis of the biophysical and experimental constraints inherent to different RNA classes. This whitepaper provides an in-depth technical analysis of three key functional RNA types—riboswitches, ribozymes, and aptamers—framed by the challenges and opportunities identified in CASP15. Understanding their distinct structural characteristics, flexibility, and ligand-binding mechanisms is critical for improving computational models and guiding rational drug and diagnostic design.

Quantitative Comparison of Key Attributes

Table 1: Comparative Analysis of Core Characteristics

Attribute | Riboswitches | Ribozymes | Aptamers
Primary Function | Gene regulation via metabolite binding | Catalysis of chemical reactions | Specific ligand binding (diagnostic/therapeutic)
Key Structural Motif | Complex aptamer domain + expression platform | Pre-organized active site (e.g., hammerhead, HDV) | Variable binding pocket, often G-quadruplexes or stem-loops
Typical Size (nt) | 70-200 | 30-200+ | 20-80 (core)
Ligand Dependency | High (conformational switch upon binding) | Often cofactor-dependent (e.g., Mg²⁺) | High (binding induces structure)
Structural Flexibility | Very High (transitions between states) | Moderate (requires precise active site geometry) | High (often from unstructured to structured)
CASP15 Avg. RMSD (Å)* | 8.5 - 15.2 (High) | 4.8 - 9.3 (Moderate) | 6.7 - 12.1 (High)
Key Strength | Exquisite specificity for small metabolites; natural regulatory logic. | High catalytic efficiency; potential for in vitro evolution. | Versatile target range (ions to cells); synthetic selection.
Key Weakness | Dynamic conformational switching is hard to capture statically. | Metal ion coordination geometry is challenging to predict. | In vitro selected structures may have multiple non-native conformations.
Therapeutic Potential | Novel antibacterial targets (exploiting metabolite sensing). | Gene therapy (self-cleaving motifs); biosensors. | Antidotes, targeted delivery (e.g., pegaptanib).

*RMSD (Root Mean Square Deviation) ranges are illustrative estimates based on CASP15 assessment data for targets representing these categories, reflecting the difficulty of prediction.

Detailed Experimental Protocols

3.1. Protocol: In-line Probing for Riboswitch/Aptamer Ligand Binding

  • Purpose: To map ligand-induced conformational changes at single-nucleotide resolution without chemical modifiers.
  • Reagents: 5'-end ³²P-labeled RNA, purified ligand, reaction buffer (e.g., 50 mM Tris-HCl pH 8.3, 20 mM MgCl₂, 100 mM KCl), alkaline phosphatase.
  • Procedure:
    • Labeling: RNA is dephosphorylated with alkaline phosphatase and 5'-end labeled using [γ-³²P]ATP and T4 polynucleotide kinase.
    • Equilibration: Labeled RNA (~50,000 cpm) is incubated with/without ligand in reaction buffer at 25°C for 40 hrs.
    • Cleavage: Spontaneous RNA cleavage at flexible ("unprotected") phosphodiester bonds occurs via transesterification.
    • Analysis: Reactions are quenched, resolved on 10% denaturing PAGE, and visualized by phosphorimaging. Protected regions (reduced cleavage) indicate ligand-stabilized structure.
  • Key Insight: Maps secondary structure under near-physiological conditions and, when repeated across a ligand titration, yields apparent Kd estimates.

3.2. Protocol: Kinetic Analysis of Ribozyme Cleavage

  • Purpose: Determine catalytic rate (kobs) and magnesium ion dependence.
  • Reagents: 5'-³²P-labeled ribozyme substrate, purified ribozyme (or cis-acting construct), reaction buffer (50 mM Tris-HCl pH 7.5, varying [MgCl₂]), stop solution (80% formamide, 50 mM EDTA).
  • Procedure:
    • Pre-incubation: Ribozyme is pre-folded in reaction buffer at desired temperature.
    • Initiation: Reaction is initiated by adding labeled substrate. Aliquots are removed at set time points (e.g., 0, 10s, 30s, 1m, 5m).
    • Quenching: Each aliquot is immediately quenched in stop solution on dry ice.
    • Separation: Products are separated by denaturing PAGE and quantified.
    • Fitting: Fraction cleaved vs. time is fit to a single-exponential curve: Fraction Cleaved = A(1 - e^{-kobs*t}), where kobs is the observed rate constant.
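The fitting step can be sketched without external libraries by linearizing the single-exponential model. With the amplitude A (the maximum fraction cleaved) treated as known, -ln(1 - F/A) = kobs·t, so kobs is the slope of a line through the origin. Fixing A in advance is a simplifying assumption made here for illustration; a nonlinear least-squares fit of both A and kobs is generally preferable for noisy data.

```python
import math

def fit_kobs(times, fractions, amplitude):
    """Estimate kobs from F(t) = A * (1 - exp(-kobs * t)) with A fixed.

    Linearizes to -ln(1 - F/A) = kobs * t and fits a line through the
    origin. Points with t <= 0 or F >= A are skipped.
    """
    num = den = 0.0
    for t, f in zip(times, fractions):
        if t <= 0 or f >= amplitude:
            continue
        y = -math.log(1.0 - f / amplitude)
        num += t * y
        den += t * t
    if den == 0.0:
        raise ValueError("no usable time points")
    return num / den

# Synthetic, noise-free data generated with kobs = 0.02 per second, A = 0.8,
# sampled at the example time points (0, 10 s, 30 s, 1 min, 5 min).
times = [0, 10, 30, 60, 300]
fractions = [0.8 * (1 - math.exp(-0.02 * t)) for t in times]
kobs = fit_kobs(times, fractions, amplitude=0.8)
```

Repeating the fit at each MgCl₂ concentration gives the kobs versus [Mg²⁺] profile used to assess magnesium dependence.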

Visualization of Logical and Experimental Frameworks

Unbound RNA (Apo State) → Ligand Binding (Metabolite, Ion, Drug): specific recognition → Conformational Rearrangement: induced fit or conformational capture → one of three outputs (Riboswitch: Altered Gene Expression; Ribozyme: Catalytic Activation; Aptamer: Signal Output, e.g., FRET)

Diagram 1: Generalized ligand-induced RNA functional switching.

5'-³²P Label RNA → Incubate ± Ligand (40 hrs, 25°C) → Spontaneous In-line Cleavage → Denaturing PAGE → Phosphorimaging → two readouts (Protected Regions: Structured/Bound; Cleaved Regions: Flexible/Unbound)

Diagram 2: In-line probing experimental workflow.

The Scientist's Toolkit: Key Research Reagent Solutions

Table 2: Essential Reagents for Functional RNA Analysis

Reagent/Material | Function/Application | Key Consideration
T4 Polynucleotide Kinase (T4 PNK) | 5'-end labeling of RNA with [γ-³²P]ATP for detection in probing/kinetic assays. | Use mutant versions (e.g., PNK M1) for efficient phosphorylation of RNA 5'-ends.
In vitro Transcription Kit (e.g., T7 RNA Polymerase based) | High-yield synthesis of homogeneous, unlabeled RNA for structural/biochemical studies. | NTP quality and DNA template purity are critical for yield and preventing premature termination.
Solid-Phase Synthesis Columns (2'-ACE protected) | Custom synthesis of chemically modified RNA oligos (e.g., for SELEX or stability). | Enables site-specific incorporation of fluorophores, biotin, or 2'-modifications (F, OMe).
Heparin | A polyanion competitor used in filter-binding assays to reduce non-specific RNA-protein interactions. | Critical for accurate determination of binding constants (Kd) in aptamer selection/purification.
TO1-Biotin (thiazole orange derivative) | Fluorogenic ligand for the Mango-II fluorescent RNA aptamer, used in cellular imaging. | Example of a synthetic ligand enabling visualization of RNA dynamics in live cells.
Biotinylated Metabolite Analogs | Capture agents for pull-down assays to isolate specific riboswitch-aptamer complexes from cellular lysates. | Used to validate in vivo targets and binding specificity of natural riboswitches.

This whitepaper, framed within a broader thesis assessing CASP15 RNA structure prediction results, provides a technical analysis of the alignment and divergence between computational predictions and experimental structures. The advent of deep learning models like AlphaFold2 and specialized RNA tools has revolutionized structural biology, necessitating a rigorous, quantitative comparison to experimental benchmarks to guide research and therapeutic development.

Quantitative Assessment of CASP15 RNA Predictors

The following tables summarize key performance metrics for top-performing RNA structure prediction groups in CASP15, comparing global and local accuracy measures.

Table 1: Global Accuracy Metrics for Top CASP15 RNA Predictors

Predictor Group | GDT-TS (Avg) | RMSD (Å) (Avg) | TM-Score (Avg) | Successful Targets (GDT-TS > 0.6)
AIchemy_RNA2 | 0.72 | 3.8 | 0.85 | 18/24
DeepFoldRNA | 0.68 | 4.5 | 0.81 | 15/24
RoseTTAFold2NA | 0.65 | 5.1 | 0.78 | 12/24
Baseline (MXfold2) | 0.51 | 8.3 | 0.65 | 5/24

Metrics: GDT-TS (Global Distance Test-Total Score), RMSD (Root Mean Square Deviation), TM-Score (Template Modeling Score). Data averaged over 24 assessed RNA targets.
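To make the GDT-TS scale concrete, the score is the mean, over the 1, 2, 4, and 8 Å cutoffs, of the fraction of atoms placed within each cutoff of their reference positions. The sketch below evaluates this for a single, already fixed superposition, which is a simplification: the full procedure searches over many superpositions to maximize each fraction, so the value here is a lower bound on the true score.

```python
import math

def gdt_ts(model, reference, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Mean fraction of atoms within each cutoff, on a 0-1 scale.

    Assumes one fixed superposition of model onto reference.
    """
    deviations = [math.dist(a, b) for a, b in zip(model, reference)]
    fractions = [
        sum(d <= c for d in deviations) / len(deviations) for c in cutoffs
    ]
    return sum(fractions) / len(cutoffs)

# Four atoms deviating by 0.5, 1.5, 3.0, and 9.0 angstroms.
reference = [(float(i), 0.0, 0.0) for i in range(4)]
model = [(0.0, 0.5, 0.0), (1.0, 1.5, 0.0), (2.0, 3.0, 0.0), (3.0, 9.0, 0.0)]
score = gdt_ts(model, reference)
```

Unlike RMSD, the one atom that is off by 9 Å lowers the score by a bounded amount, which is why the two metrics can rank models differently.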

Table 2: Divergence Analysis by Structural Element

Structural Element | Avg. Predicted RMSD (Å) | Avg. Experimental B-factor (Ų) | Key Divergence Point
Canonical Helices | 2.1 | 25.4 | Minor end-fraying
Non-Canonical Loops | 6.8 | 45.2 | Tertiary contact placement
Long-range Junctions | 8.5 | 52.1 | Global topology errors
Ligand-Binding Pockets | 7.2 | 38.7 | Base/ion coordination

Experimental Protocols for Benchmark Determination

The experimental structures used as CASP15 benchmarks were determined via high-resolution methods. Key protocols are detailed below.

Cryo-Electron Microscopy (Cryo-EM) for Large RNAs

Protocol: For target R1083 (a 200-nt riboswitch).

  • Sample Preparation: 2 mg/mL RNA in buffer (20 mM HEPES-KOH pH 7.5, 50 mM KCl, 2 mM MgCl₂) was vitrified using a Vitrobot Mark IV (Thermo Fisher) on UltrAuFoil R1.2/1.3 300-mesh grids.
  • Data Collection: Movies (40 frames, total dose 50 e⁻/Ų) collected on a 300 keV Krios G4 with a K3 detector at a nominal magnification of 105,000x (0.826 Å/pixel).
  • Processing: Motion correction (MotionCor2), CTF estimation (CTFFIND-4.2), particle picking (crYOLO). 850k particles were subjected to 2D classification, ab initio reconstruction, and non-uniform refinement in cryoSPARC v3.3.2, yielding a 2.8 Å map.
  • Model Building: Initial model placed with ModelAngelo. Manual adjustment in Coot v0.9.8 followed by real-space refinement in Phenix v1.20.

X-ray Crystallography for Medium-Sized Complexes

Protocol: For target R0981 (an 80-nt RNA-protein complex).

  • Crystallization: Complex (at 10 mg/mL) was crystallized via sitting-drop vapor diffusion against a reservoir of 0.1 M Tris-HCl pH 8.5, 25% PEG 3350, 0.2 M ammonium citrate.
  • Data Collection: A single crystal cryo-cooled in liquid N₂ yielded a 1.9 Å dataset at the APS GM/CA beamline 23-ID-D.
  • Phasing & Refinement: Molecular replacement using a homolog (PDB: 4PQR) with Phaser. Iterative rounds of manual building (Coot) and refinement (REFMAC5) with TLS parameters.

Pathway Analysis of Success and Divergence Factors

The logical relationship between prediction inputs, methods, and outcomes determining success or divergence is mapped below.

Figure: Logical Flow of Prediction Success and Divergence

  • Input Data (Sequence, Covariance) → Prediction Method
  • Physicochemical/Evolutionary Constraints → Prediction Method
  • Prediction Method → Predicted 3D Structure
  • Predicted 3D Structure → SUCCESS Factors (accurate for: canonical helices, base-pairing networks, rigid core domains)
  • Predicted 3D Structure → DIVERGENCE Factors (inaccurate for: flexible loops/junctions, ion/ligand interactions, dynamics-dependent folds)
  • Experimental Benchmark (High-Res Structure) → compared against both SUCCESS and DIVERGENCE Factors

The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Reagents for RNA Structure Determination & Validation

| Item | Function in Experiment | Example Product/Catalog |
|---|---|---|
| High-Purity NTPs | In vitro transcription for sample prep. | NEB N0450S (ATP, 100 mM) |
| Tagged Ribonucleotides | For phasing in X-ray crystallography (e.g., selenium derivatization). | Silantes 60101 (Selenium-labeled UTP) |
| Cryo-EM Grids | Support film for vitrification. | Quantifoil R1.2/1.3 Cu 300 mesh |
| Stabilizing Buffer Kit | Maintains native RNA fold during purification/analysis. | ThermoFisher J23146 (RNA Stable Buffer Kit) |
| Crosslinking Reagent | Captures transient RNA-protein interactions for structural analysis. | ThermoFisher 26106 (DSG, Disuccinimidyl Glutarate) |
| Divalent Metal Ion Solutions | Essential for folding; Mg²⁺/Mn²⁺ for crystallization. | Sigma-Aldrich 63020 (MgCl₂, Molecular Biology Grade) |
| Cryoprotectant | Prevents ice crystal formation in cryo-EM & X-ray. | Sigma-Aldrich H2779 (HEPES buffer) + Glycerol/PEG |
| RNase Inhibitor | Prevents degradation during long experiments. | Takara 2313A (Recombinant RNase Inhibitor) |

Predictions succeed most reliably in regions governed by strong evolutionary covariation and base-pairing thermodynamics, such as canonical helices. The primary divergence points occur in structurally plastic elements like loops and junctions, and in contexts where specific ion interactions or co-transcriptional folding dynamics dictate the final fold. Bridging this gap requires integrating experimental data on dynamics and energy landscapes into the next generation of predictive algorithms.

This whitepaper, framed within a broader thesis assessing CASP15 RNA structure prediction results, examines the critical "generality test" for computational methods. The core question is whether leading algorithms exhibit true generalization by accurately predicting structures for novel folds with no known structural homologs, or if their performance is contingent upon the presence of evolutionarily related templates in training data. This distinction is paramount for researchers and drug development professionals seeking reliable de novo prediction tools for novel non-coding RNAs and therapeutic targets.

Core Quantitative Assessment from CASP15

Data from the CASP15 RNA assessment highlight a pronounced performance gap between targets classified as "Easy" (with known structural homologs) and "Hard" (novel folds). The following table summarizes key performance metrics for leading groups (e.g., AlphaFold2, RoseTTAFold, and specialized RNA predictors).

Table 1: CASP15 RNA Prediction Performance Summary (Selected Groups)

| Target Classification | Example CASP15 Target ID | Best Performance (GDT_TS / lDDT) | Median Performance (GDT_TS) | Performance Delta (Hard vs. Easy) | Top Performing Method Class |
|---|---|---|---|---|---|
| Easy (Known Homolog) | R1101, R1102 | 0.85 - 0.92 | 0.78 | Baseline | Deep Learning (Integrated) |
| Hard (Novel Fold) | R1113, R1126 | 0.45 - 0.60 | 0.35 | -40% to -50% | Physics-Based Refinement |
| Template-Based | R1103 | 0.90+ | 0.82 | N/A | Comparative Modeling |
| Free Modeling (Novel) | R1120 | < 0.55 | < 0.30 | -55% | Experimental Mapping Guided |

Metrics: GDT_TS (Global Distance Test Total Score) for overall topology, lDDT (local Distance Difference Test) for local accuracy. Data synthesized from CASP15 assessment publications and presentations.
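To make the difference between the two metrics concrete, the following is a minimal sketch of an lDDT-style score. It assumes one representative atom per nucleotide and omits the stereochemistry checks used by full implementations, so it illustrates the idea rather than reproducing the assessors' tool. The 15 Å inclusion radius and the 0.5, 1, 2 and 4 Å tolerances follow the standard lDDT definition. The function name is illustrative.

```python
import math
from itertools import combinations


def lddt(model, reference, inclusion_radius=15.0,
         thresholds=(0.5, 1.0, 2.0, 4.0)):
    """Fraction of local reference distances preserved in the model.

    Simplified: one atom per nucleotide, no stereochemical validation.
    Returns a value between 0 and 1.
    """
    if len(model) != len(reference):
        raise ValueError("model and reference must have the same length")
    preserved = 0
    total = 0
    for i, j in combinations(range(len(reference)), 2):
        d_ref = math.dist(reference[i], reference[j])
        if d_ref >= inclusion_radius:
            continue  # only local contacts in the reference are scored
        d_mod = math.dist(model[i], model[j])
        deviation = abs(d_ref - d_mod)
        # Count the pair once for every tolerance it satisfies.
        preserved += sum(deviation < t for t in thresholds)
        total += len(thresholds)
    return preserved / total if total else 0.0


# Toy coordinates, not taken from any CASP target.
reference = [(0, 0, 0), (5, 0, 0), (10, 0, 0)]
model = [(0, 0, 0), (5, 0, 0), (13, 0, 0)]
score = lddt(model, reference)
```

Because the score compares internal distances only, it needs no superposition. This is why it complements GDT_TS, which depends on a global alignment.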

Experimental Protocols for Generality Validation

Protocol for Cross-Validation on Novel Folds

Objective: To rigorously test a model's generalization capability by evaluating it on folds excluded from training.

  • Dataset Curation: Compile a non-redundant set of RNA structures from the PDB. Cluster sequences at <30% identity. Split clusters into training and test sets, ensuring no fold similarity (via Dali or CE structural alignment) exists between sets.
  • Model Training: Train the prediction network (e.g., an end-to-end deep learning model) exclusively on the training set clusters.
  • Blind Testing: Predict structures for all sequences in the held-out test set clusters.
  • Analysis: Calculate standard metrics (GDT_TS, RMSD) per target. Compare average performance on "novel fold" test set versus a control test set containing homologs of training structures.
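The dataset-curation step can be sketched as a group-aware split. The example below assumes each record already carries a sequence-cluster label and a fold label (for instance from a prior clustering and structural-alignment pass, which is not shown). It merges any clusters and folds that are linked to each other, then assigns whole groups to either the training or the test set. The record layout and the function name are hypothetical.

```python
import random
from collections import defaultdict


def fold_aware_split(records, test_fraction=0.2, seed=0):
    """Split (seq_id, cluster_id, fold_id) records with no shared cluster or fold.

    Clusters and folds that co-occur are merged with a union-find structure,
    and each merged group is assigned wholesale to one side of the split.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for _, cluster_id, fold_id in records:
        parent[find(("cluster", cluster_id))] = find(("fold", fold_id))

    groups = defaultdict(list)
    for seq_id, cluster_id, _ in records:
        groups[find(("cluster", cluster_id))].append(seq_id)

    keys = sorted(groups)
    random.Random(seed).shuffle(keys)  # seeded for reproducibility

    n_total = len(records)
    train, test = [], []
    for key in keys:
        # Fill the test set first, then send remaining groups to training.
        target = test if len(test) < test_fraction * n_total else train
        target.extend(groups[key])
    return train, test
```

Splitting at the group level rather than the sequence level is the design choice that matters here: a random per-sequence split would let homologs of test structures leak into training and inflate the apparent generalization.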

Protocol for Ablation: Template Removal from Training

Objective: To quantify the contribution of evolutionary information to performance.

  • Input Featurization: For each training example, generate two input feature sets:
    • Set A: Includes multiple sequence alignment (MSA)-derived features (covariance, positional frequency).
    • Set B: Uses only sequence and predicted secondary structure.
  • Model Comparison: Train two model instances: Model A (full features) and Model B (ablated features).
  • Evaluation: Benchmark both models on the "Hard" novel fold targets from CASP15.
  • Output: The performance gap (Model A - Model B) isolates the "homology dependency" of the algorithm.
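The final output step reduces to a per-target difference averaged over the benchmark. A minimal sketch follows, assuming each model's results are available as a mapping from target ID to a score such as GDT_TS. The function name is hypothetical and the scores in the example are placeholders for illustration, not CASP15 results.

```python
from statistics import mean


def homology_dependency(scores_full, scores_ablated):
    """Mean and per-target gap (Model A - Model B) over shared targets."""
    shared = sorted(set(scores_full) & set(scores_ablated))
    if not shared:
        raise ValueError("no targets in common between the two models")
    per_target = {t: scores_full[t] - scores_ablated[t] for t in shared}
    return mean(per_target.values()), per_target


# Placeholder scores for illustration only.
model_a = {"R1113": 0.50, "R1126": 0.40}  # full MSA-derived features
model_b = {"R1113": 0.30, "R1126": 0.30}  # sequence + secondary structure only
mean_gap, gaps = homology_dependency(model_a, model_b)
```

Reporting the per-target gaps alongside the mean helps show whether the dependency is uniform or driven by a few targets with unusually deep alignments.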

Visualization of Workflows and Relationships

Diagram 1: Generality Test Evaluation Workflow

  • RNA Structure Database (e.g., PDB) → Clustering by Sequence & Fold
  • Clustering by Sequence & Fold → Strict Split: No Fold Overlap
  • Strict Split → Training Set (Known Folds) and Hold-Out Test Set (Novel Folds)
  • Training Set (Known Folds) → Prediction Model Training
  • Prediction Model Training and Hold-Out Test Set (Novel Folds) → Blind Prediction & Metric Calculation
  • Blind Prediction & Metric Calculation → Performance Gap Analysis: Generalization Score

Diagram 2: Homology Dependency in Prediction

  • RNA Sequence → Deep MSA (Evolutionary Data) → Trained Model (With Homology) → Accurate Structure (Known Homologs), high lDDT
  • RNA Sequence → Sequence-Only Features → Trained Model (Without Homology) → Inaccurate Structure (Novel Folds), low lDDT

The Scientist's Toolkit: Key Research Reagent Solutions

Table 2: Essential Materials for RNA Structure Prediction & Validation

| Item / Reagent | Function & Rationale |
|---|---|
| Rosetta FARFAR2 | A fragment assembly-based de novo RNA modeling suite. Essential for generating initial decoys without homology, serving as a baseline or input for deep learning refinement. |
| AlphaFold2 (w/ RNA mods) | Protein structure prediction engine adapted for RNA. Used to test the transferability of deep learning approaches and to generate predicted aligned error (PAE) maps for confidence estimation. |
| SHAPE-MaP Reagents (e.g., NAI, 1M7) | Provide experimental chemical probing data that informs on nucleotide flexibility/pairedness. Used as soft constraints to guide de novo folding or validate predictions. |
| CASP15 RNA Datasets | Curated benchmark of "Easy" and "Hard" targets. The gold standard for performing controlled generality tests and comparing method performance. |
| DCA / plmDCA Software | Direct Coupling Analysis tools. Infer evolutionary covariance from MSAs to predict base-base contacts, a crucial input feature for homology-dependent methods. |
| RNA-Puzzles Submissions Portal | Platform for blind RNA structure prediction challenges. Enables ongoing, community-wide testing of generality on newly solved structures. |
| MD Simulation Packages (e.g., AMBER, GROMACS) | For all-atom molecular dynamics refinement. Used to relax predicted models, sample conformational landscapes, and improve stereochemical quality. |
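The covariance signal that DCA-style tools extract from an MSA can be illustrated with a much simpler statistic. The sketch below computes the mutual information between two alignment columns. This is not DCA itself, which additionally separates direct from indirect couplings, and the function name and toy alignment are illustrative assumptions.

```python
import math
from collections import Counter


def column_mutual_information(msa, i, j):
    """Mutual information (bits) between alignment columns i and j.

    msa: list of equal-length aligned sequences. No gap handling or
    sequence reweighting is applied in this simplified version.
    """
    n = len(msa)
    counts_i = Counter(seq[i] for seq in msa)
    counts_j = Counter(seq[j] for seq in msa)
    counts_ij = Counter((seq[i], seq[j]) for seq in msa)
    mi = 0.0
    for (a, b), n_ab in counts_ij.items():
        p_ab = n_ab / n
        mi += p_ab * math.log2(p_ab / ((counts_i[a] / n) * (counts_j[b] / n)))
    return mi


# Toy alignment: columns 0 and 2 covary as Watson-Crick pairs, column 1 is fixed.
toy_msa = ["GAC", "CAG", "AAU", "UAA"]
paired = column_mutual_information(toy_msa, 0, 2)
unpaired = column_mutual_information(toy_msa, 0, 1)
```

Columns that change together to preserve base pairing score high, while a conserved column carries no covariation signal. This is the intuition behind using such features as contact predictors.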

Conclusion

The CASP15 assessment marks a definitive inflection point for RNA structure prediction, establishing deep learning as the dominant paradigm. While methods inspired by the protein-folding revolution have delivered unprecedented accuracy for many targets, significant gaps remain—particularly for large complexes and uniquely RNA-specific motifs. The convergence of expanded experimental datasets, refined neural architectures, and integrated biophysical principles will drive the next leap. For biomedical research, these advances are rapidly transforming RNA from a challenging target to a tractable one, accelerating the design of small-molecule drugs, antisense oligonucleotides, and mRNA therapeutics with precise structural underpinnings. The path forward is clear: a collaborative, iterative cycle between computational prediction and experimental validation is essential to fully unlock the therapeutic potential of the RNA structurome.