Structure Metrics
License: Structure Metrics is open source and free for academic and commercial use under an Apache-2.0 license. Please refer to the license for full terms.

This toolkit is open source. Any third-party models, product names, or trademarks referenced are the property of their respective owners, and Proto is not affiliated with them.


@misc{structuremetrics2024,
  title={Structure quality metrics for protein structure prediction filtering},
  author={Internal},
  year={2024},
  note={Internal proto-tools tool using biotite for SSE annotation}
}
proto-bio/proto-tools/proto_tools/tools/structure_scoring/structure_metrics
Function                  Description
run_structure_metrics()   Compute structural quality metrics (SS percentages, longest helix, gyration radius) from PDB files

Background

The tool assigns secondary structure using the P-SEA algorithm (Labesse, Colloc’h, Pothier, and Mornon, 1997) as implemented in Biotite. P-SEA classifies each protein residue as alpha helix, beta sheet, or loop from the Cα trace alone, using distance and angle patterns. The reported helix, sheet, and loop percentages summarise the overall secondary-structure composition of the structure, while the longest contiguous alpha-helix length is a separate scalar that captures the single longest helix rather than an average over the structure.

Radius of gyration is the mass-weighted root-mean-square distance of all atoms from the centre of mass. It is a standard scalar measure of overall structural compactness used in small-angle X-ray scattering, polymer physics, and protein structural analysis. For proteins of a given length, compact native folds produce smaller gyration radii than disordered or partially folded conformations, which is what makes the metric useful as an artifact filter for predicted structures.

Both the secondary-structure metrics and the gyration radius are useful as inexpensive sanity checks on structures produced by sequence-based predictors such as ESMFold, AlphaFold, Chai, Boltz, and Protenix. Predictors can default to extended helical bundles for low-confidence regions, and failed folds frequently appear as extended conformations with elevated gyration radii. The Biotite Python library (Kunzmann et al., 2023) provides the underlying secondary-structure annotation and gyration-radius implementations used by this tool.
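As an illustration of the radius-of-gyration definition given above, the following is a minimal standard-library sketch, not the Biotite implementation the tool actually uses. The function name, the toy coordinates, and the masses are all hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math


def gyration_radius(coords, masses):
    """Mass-weighted RMS distance of atoms from their centre of mass.

    coords: list of (x, y, z) tuples in Å; masses: one mass per atom.
    Illustrative only; the tool itself delegates this to Biotite.
    """
    if not coords or len(coords) != len(masses):
        raise ValueError("coords and masses must be non-empty and equal in length")
    total_mass = sum(masses)
    # Centre of mass: mass-weighted mean of each coordinate axis.
    com = [
        sum(m * c[axis] for c, m in zip(coords, masses)) / total_mass
        for axis in range(3)
    ]
    # Mass-weighted mean squared distance from the centre of mass.
    msd = sum(
        m * sum((c[axis] - com[axis]) ** 2 for axis in range(3))
        for c, m in zip(coords, masses)
    ) / total_mass
    return math.sqrt(msd)


# Toy "structures": the same four equal-mass atoms, compact vs. stretched along x.
compact = [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, -1.0, 0.0)]
extended = [(3.0, 0.0, 0.0), (-3.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, -1.0, 0.0)]
equal_masses = [12.0] * 4

rg_compact = gyration_radius(compact, equal_masses)
rg_extended = gyration_radius(extended, equal_masses)
```

Stretching the same atoms along one axis raises the value, which is the behaviour the filter relies on: at a fixed size, extended conformations score higher than compact ones.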

Learning Resources

  • Biotite documentation (TU Darmstadt). API reference for the secondary-structure annotation and gyration-radius computations used by this tool.

Tools

Structure Quality Metrics (structure-metrics)

Computes five quality metrics for each input structure: helix_pct, sheet_pct, loop_pct (secondary-structure composition on the 0 to 100 scale), longest_alpha_helix (residue count of the longest contiguous alpha-helical segment), and gyration_radius (radius of gyration in Å). Inputs are passed as a list of structures and results are returned in the same order.
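The composition metrics and the longest-helix length can be pictured as simple reductions over a per-residue annotation. The sketch below assumes a label sequence using 'a' for helix, 'b' for sheet, and 'c' for loop (the convention Biotite's SSE annotation returns) and assumes the percentages are taken over all labelled residues. The function name and the example string are hypothetical, and the tool's own handling of edge cases may differ.

```python
from itertools import groupby


def composition_metrics(sse):
    """Summarise a per-residue annotation ('a' helix, 'b' sheet, 'c' loop).

    Returns percentages on the 0 to 100 scale plus the longest run of 'a'.
    Assumption: the denominator is the number of labelled residues.
    """
    n = len(sse)
    if n == 0:
        return {"helix_pct": 0.0, "sheet_pct": 0.0, "loop_pct": 0.0,
                "longest_alpha_helix": 0}

    def pct(label):
        return 100.0 * sum(1 for s in sse if s == label) / n

    # groupby yields maximal runs of identical labels; keep only helix runs.
    longest = max(
        (len(list(run)) for label, run in groupby(sse) if label == "a"),
        default=0,
    )
    return {
        "helix_pct": pct("a"),
        "sheet_pct": pct("b"),
        "loop_pct": pct("c"),
        "longest_alpha_helix": longest,
    }


# Hypothetical 20-residue annotation: two helices (6 and 3 residues), one strand.
example = composition_metrics("ccaaaaaaccbbbbccaaac")
```

Here the two helices together account for 9 of 20 residues, but longest_alpha_helix reports only the longer one, which is why the two numbers carry different information.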

API Reference

structures (List[Structure], required)
    Structures to analyze. Accepts file paths, raw PDB/CIF content strings, or Structure objects per item.

verbose (integer, default: 0)
    Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.
device (string, default: "cpu")
    Device to run the tool on.
timeout (integer, default: 600)
    Maximum execution time in seconds. None waits indefinitely.
seed (integer)
    Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.

metrics (List[StructureQualityMetrics])
    Per-structure quality metrics, index-aligned with inputs.structures.
Metrics (one set per metrics item)
Metric                Type    Range          Availability
longest_alpha_helix   int     ≥ 0.0          always
gyration_radius       float   ≥ 0.0          always
helix_pct             float   0.0 to 100.0   always
sheet_pct             float   0.0 to 100.0   always
loop_pct              float   0.0 to 100.0   always

Applications

This tool is appropriate as a fast first-pass filter for batch screening of predicted protein structures. Representative applications include flagging predicted structures with unrealistically long alpha helices that often arise as artifacts on low-confidence regions, identifying extended or disordered conformations that fail to fold compactly, summarising the secondary-structure composition of a designed protein, and ranking generated structures by structural plausibility before more expensive downstream analyses.

Usage Tips

  • Inputs accept a list of Structure objects, file paths, or raw PDB or mmCIF content strings. A single bare input is automatically wrapped in a list. Each item is coerced to a Structure before analysis.
  • All five metrics are computed over every chain of the input structure. There is no per-chain breakdown at the tool level. To analyse a specific chain, extract that chain into its own Structure using Structure.select_chain() before passing it in.
  • Filter thresholds depend on the protein family. A 50-residue alpha helix is a strong artifact signal for a typical globular protein but is normal for coiled-coil and fibrous proteins. A gyration radius above 45 Å indicates failed folding for a 1000-residue globular protein but is expected for naturally elongated proteins. Calibrate thresholds against known structures of the protein family being screened.
  • The secondary_structure_percentages summary (helix_pct, sheet_pct, loop_pct) and longest_alpha_helix are derived from the same per-residue P-SEA annotation, so they can be read together. For example, in a 250-residue structure, helix_pct=80 corresponds to 200 helical residues, so longest_alpha_helix=200 means that every helical residue sits in one continuous helix spanning nearly the entire structure.
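A first-pass filter over the returned metrics might look like the sketch below. The Metrics dataclass is a hypothetical stand-in that only mirrors the documented field names of StructureQualityMetrics, and the helper names and example values are invented for illustration. The default thresholds (50 residues, 45 Å) are simply the figures quoted in the tips above and should be recalibrated for the protein family being screened.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    """Hypothetical stand-in mirroring the documented StructureQualityMetrics fields."""
    longest_alpha_helix: int
    gyration_radius: float  # Å
    helix_pct: float
    sheet_pct: float
    loop_pct: float


def passes_filter(m, max_helix_len=50, max_rg=45.0):
    """True when neither the longest helix nor the gyration radius is excessive."""
    return m.longest_alpha_helix <= max_helix_len and m.gyration_radius <= max_rg


def keep_indices(metrics, **thresholds):
    """Indices of structures that pass; results are index-aligned with the inputs."""
    return [i for i, m in enumerate(metrics) if passes_filter(m, **thresholds)]


# Invented example batch, in the same order as the (hypothetical) input structures.
batch = [
    Metrics(18, 21.5, 42.0, 23.0, 35.0),   # compact mixed alpha/beta fold
    Metrics(200, 80.3, 80.0, 0.0, 20.0),   # one very long helix
    Metrics(30, 61.0, 25.0, 5.0, 70.0),    # extended, mostly loop
]

kept_default = keep_indices(batch)
kept_relaxed = keep_indices(batch, max_rg=70.0)
```

Because the output is index-aligned with the inputs, returning indices rather than metric objects makes it straightforward to map survivors back to the original structures.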

Toolkit Notes

These apply to every Structure Metrics tool in this toolkit (structure-metrics).
  • Outputs are returned as typed metric objects. Each StructureQualityMetrics entry carries longest_alpha_helix (integer residue count), gyration_radius (Å), helix_pct, sheet_pct, and loop_pct (all on the 0 to 100 scale). Results can be exported to CSV or JSON through the standard export method.
  • The tool implementation runs entirely in-process and uses CPU only. Computation is performed in pure Python through Biotite, with no standalone environment or separate program invoked. Per-structure runtime is sub-second for typical protein sizes and scales linearly with the number of input structures.
Example notebook: See the full working example for a copy-paste-ready walkthrough.

Infrastructure Guides

The following guides cover how to run tools efficiently and at scale.

Tool Persistence

Keep a tool’s model warm across calls instead of reloading it every invocation.

Device Management

How GPUs are allocated to tools and how to target specific devices.

Parallel Execution

Fan a batch of inputs out across multiple GPUs.

Cloud Inference

Run tools on managed cloud infrastructure with no local setup.