ORFipy - Proto

License: ORFipy is open source and free for academic and commercial use under an MIT license. Please refer to the license for full terms.

This toolkit is open source. Any third-party models, product names, or trademarks referenced are the property of their respective owners, and Proto is not affiliated with them.

urmi-21/orfipy

Fast and flexible ORF finder

orfipy: a fast and flexible tool for extracting ORFs

Urminder Singh and Eve Syrkin Wurtele

Bioinformatics (2021)

@article{singh2021orfipy,
  title={orfipy: a fast and flexible tool for extracting ORFs},
  author={Singh, Urminder and Wurtele, Eve Syrkin},
  journal={Bioinformatics},
  volume={37},
  number={18},
  pages={3019--3020},
  year={2021},
  publisher={Oxford University Press},
  doi={10.1093/bioinformatics/btab090}
}

proto-bio/proto-tools/proto_tools/tools/orf_prediction/orfipy

Function	Description
`run_orfipy_prediction()`	ORF (Open Reading Frame) prediction using Orfipy

Background

ORFipy (Singh and Wurtele, 2021) was developed as a fast and flexible replacement for older ORF-extraction tools that struggle with the scale of contemporary genomic and transcriptomic datasets. The published work emphasises customisable search criteria together with high throughput, and reports that ORFipy scales to whole-genome and de novo transcriptome inputs that exceed what earlier ORF finders can comfortably process. The reference implementation is written in Python and is distributed through PyPI and bioconda.

An open reading frame is a continuous stretch of DNA bounded by an in-frame start and stop codon. ORF extraction is mechanistic rather than predictive. Every region that begins at a recognised start codon and continues in frame to the first downstream stop codon is reported, regardless of whether the resulting region encodes a biologically functional protein. This stands in contrast to gene-prediction tools such as Prodigal, which apply learned models to score whether each candidate ORF is likely to correspond to a real gene. ORFipy is appropriate when the goal is exhaustive enumeration of every candidate ORF for downstream filtering or annotation; a gene-prediction tool is appropriate when the goal is a curated set of likely coding genes.
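The start-to-first-stop rule can be illustrated with a minimal pure-Python sketch. This is not ORFipy's implementation; the function name `find_forward_orfs`, the tuple layout, and the choice to keep only the first start codon per stop are assumptions made for illustration. It scans the three forward frames and reports 1-indexed, inclusive coordinates with the stop codon included.

```python
def find_forward_orfs(seq, starts=("ATG",), stops=("TAA", "TAG", "TGA")):
    """Return (start, end, frame) tuples for forward-strand ORFs.

    Coordinates are 1-indexed and inclusive, and the stop codon is included.
    Illustrative only: this is not how ORFipy is implemented.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        orf_start = None  # 0-based index of the start codon of the open ORF
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if orf_start is None and codon in starts:
                orf_start = i  # open an ORF at the first start codon seen
            elif orf_start is not None and codon in stops:
                orfs.append((orf_start + 1, i + 3, frame + 1))
                orf_start = None  # close the ORF at the first in-frame stop
    return orfs


example = "CCATGAAATGATTTAGCC"
for start, end, frame in find_forward_orfs(example):
    print(start, end, frame, example[start - 1:end])
```

No score or model is involved: a region is reported purely because it satisfies the codon rule, which is why downstream filtering is needed.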

Learning Resources

urmi-21/orfipy (Wurtele Lab, Iowa State University). Official ORFipy repository and command-line reference.

Tools

Orfipy ORF Prediction (`orfipy-prediction`)

Scans one or more DNA sequences across the configured strand setting (three forward and three reverse reading frames by default) and returns every open reading frame that satisfies the configured start codon, stop codon, strand, and length filters. Each returned ORF carries its nucleotide sequence, translated amino-acid sequence, 1-indexed start and end positions on the parent sequence, strand, reading frame, and the parent sequence identifier.
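The six reading frames and the translated amino-acid sequence can be sketched in plain Python as follows. The helper names, the `"+"`/`"-"` strand labels, and the 1 to 3 frame numbering are assumptions for illustration and may not match the labels the tool returns. Only the standard genetic code (NCBI table 1) is encoded here.

```python
# Standard genetic code (NCBI table 1), indexed over the base order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement each base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(nt):
    """Translate an uppercase ACGT string codon by codon ('*' marks a stop)."""
    return "".join(CODON_TABLE[nt[i:i + 3]] for i in range(0, len(nt) - 2, 3))


def six_frames(seq):
    """Yield (strand, frame, subsequence) for the six reading frames."""
    for strand, template in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            yield strand, frame + 1, template[frame:]


for strand, frame, sub in six_frames("ATGAAATGA"):
    print(strand, frame, translate(sub))
```

Restricting the scan to one strand corresponds to iterating over only one of the two templates, which is why it roughly halves the work.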

API Reference

Input: OrfipyInput

sequences

List[string]

required

DNA sequence(s) to analyze for open reading frames, provided as a list of strings.

Config: OrfipyConfig

threads

integer

default:"4"

Number of CPU threads to use for processing each sequence. Since processing is batched per-sequence, this controls intra-sequence parallelism. Must be at least 1. Default: 4.

start_codons

List[string]

default:"['ATG', 'GTG', 'TTG']"

Start codons to recognize for ORF prediction. Multiple codons may be selected.

stop_codons

List[string]

default:"['TAA', 'TAG', 'TGA']"

Stop codons to recognize for ORF prediction. Multiple codons may be selected.

strand

enum

default:"b"

Which strand(s) to scan for ORFs. Available options: f (forward), r (reverse), b (both).

min_len

integer

default:"0"

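Minimum ORF length in nucleotides (not including stop codon unless include_stop=True). ORFs shorter than this are filtered out. Common values: 150 (approximately 50 amino acids), 300 (approximately 100 amino acids), 900 (approximately 300 amino acids).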

max_len

integer

default:"10000"

Maximum ORF length in nucleotides. ORFs longer than this are silently filtered out by orfipy; raise (e.g. 1_000_000_000) for genome-scale inputs. Default: 10000.

include_stop

boolean

default:"True"

Whether to include the stop codon in the reported ORF nucleotide sequence. If True, the stop codon is included in both the nucleotide sequence and length calculations. If False, the stop codon is excluded. Default: True.

ignore_case

boolean

default:"False"

Treat lowercase (soft-masked) nucleotides as ORF-eligible. Default: False.

partial_3

boolean

default:"False"

Report ORFs missing a stop codon at the 3’ end of the sequence. Default: False.

partial_5

boolean

default:"False"

Report ORFs missing a start codon at the 5’ end of the sequence. Default: False.

between_stops

boolean

default:"False"

Report ORFs spanning stop-to-stop (start codons ignored). Default: False.

translation_table

string

NCBI genetic code for translation. None uses the standard genetic code (table 1). Only tables supported by orfipy’s built-in translation table dict are available (NCBI tables 1-6, 9-14, 16, 21-30). Common options: "bacterial" (NCBI table 11) and "vertebrate_mitochondrial" (NCBI table 2).

verbose

integer

default:"0"

Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.

device

string

default:"cpu"

Device to run the tool on.

timeout

integer

default:"600"

Maximum execution time in seconds. None waits indefinitely.

seed

integer

Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.
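The documented fields and defaults can be summarised in a small stand-in dataclass. `OrfipyConfigSketch` is a hypothetical name and is not the real `OrfipyConfig`; it covers only a subset of the fields, and the `min_len`/`max_len` consistency check is an assumption added for illustration rather than documented behaviour.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class OrfipyConfigSketch:
    """Stand-in mirroring documented fields and defaults (not the real class)."""
    threads: int = 4
    start_codons: List[str] = field(default_factory=lambda: ["ATG", "GTG", "TTG"])
    stop_codons: List[str] = field(default_factory=lambda: ["TAA", "TAG", "TGA"])
    strand: str = "b"
    min_len: int = 0
    max_len: int = 10000
    include_stop: bool = True
    ignore_case: bool = False
    partial_3: bool = False
    partial_5: bool = False
    between_stops: bool = False
    translation_table: Optional[str] = None

    def __post_init__(self):
        # Constraints stated in the parameter descriptions.
        if self.threads < 1:
            raise ValueError("threads must be at least 1")
        if self.strand not in ("f", "r", "b"):
            raise ValueError("strand must be one of 'f', 'r', 'b'")
        # Assumed sanity check, not documented behaviour.
        if self.min_len > self.max_len:
            raise ValueError("min_len must not exceed max_len")


# Example settings for a bacterial genome: raise max_len so that long ORFs
# are not silently dropped, and select the bacterial translation table.
bacterial_genome = OrfipyConfigSketch(
    min_len=300,
    max_len=1_000_000_000,
    translation_table="bacterial",
)
print(bacterial_genome)
```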

Output: OrfipyOutput

predicted_orfs

List[array]

List of ORF results per input sequence. This is the source of truth for all predicted ORFs. Each inner list contains the ORFs found in a single input sequence.
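The nested shape of `predicted_orfs` (one inner list per input sequence) can be handled as in the sketch below. The records are hand-written dictionaries with hypothetical key names; the real result objects and their attribute names may differ.

```python
from collections import Counter

# Hypothetical stand-in for OrfipyOutput.predicted_orfs: one inner list per
# input sequence, in input order. Key names are illustrative assumptions.
predicted_orfs = [
    [  # ORFs found in the first input sequence
        {"sequence_id": "seq1", "start": 3, "end": 11, "strand": "f", "frame": 3},
        {"sequence_id": "seq1", "start": 5, "end": 16, "strand": "r", "frame": 1},
    ],
    [],  # the second input sequence produced no ORFs
]

# One count per input sequence, aligned with the order of the inputs.
orfs_per_sequence = [len(orfs) for orfs in predicted_orfs]

# Flatten when the per-sequence grouping is no longer needed.
all_orfs = [orf for orfs in predicted_orfs for orf in orfs]
strand_counts = Counter(orf["strand"] for orf in all_orfs)

print(orfs_per_sequence, dict(strand_counts))
```

Keeping an empty inner list for a sequence without ORFs preserves the alignment between inputs and results.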

Applications

This tool is appropriate for the upstream ORF-enumeration step of any analysis that begins with raw DNA sequences and needs candidate coding regions. Representative applications include cataloguing all ORFs in a newly assembled genome before annotation, extracting coding-sequence candidates from a de novo transcriptome assembly, generating an exhaustive ORF set for downstream filtering by length, codon usage, or homology to a known protein, and producing translated protein sequences for downstream language-model scoring or domain annotation.

Usage Tips

min_len is the primary control on the number of reported ORFs. At the default of min_len=0, every candidate region is reported, including many short ORFs that arise by chance in any DNA sequence and do not encode functional proteins. A threshold of min_len=150 (approximately 50 amino acids) excludes the majority of these short ORFs. A threshold of min_len=300 (approximately 100 amino acids) focuses the output on typical small proteins, and min_len=900 (approximately 300 amino acids) restricts the output to larger proteins. The threshold is specified in nucleotides.
start_codons should match the genetic context of the input. The default of ["ATG", "GTG", "TTG"] is appropriate for bacterial and archaeal sequences, in which alternative start codons account for approximately 15 to 20 percent of genes. A value of ["ATG"] is appropriate for stringent eukaryotic ORF analyses, and the inclusion of "CTG" is appropriate for organisms that use a non-standard genetic code in which CTG functions as an alternative start codon.
strand controls which DNA strands are scanned. The default of "b" scans both strands and reports ORFs from both the forward sequence and its reverse complement. A value of "f" or "r" restricts the scan to a single strand, which approximately halves the number of ORFs returned and is appropriate when the coding strand of the input is known in advance.
translation_table selects the genetic code used for amino-acid translation. The default value of None applies the standard genetic code (NCBI table 1). A value of "bacterial" selects the bacterial, archaeal, and plant plastid code (NCBI table 11), "vertebrate_mitochondrial" selects the vertebrate mitochondrial code (NCBI table 2), and the remaining supported NCBI tables are appropriate for organisms that use the corresponding alternative codes.
The partial-ORF flags allow incomplete reading frames at sequence boundaries. A value of partial_3=True reports ORFs that begin at a recognised start codon and continue to the 3’ end of the input without an in-frame stop codon. A value of partial_5=True reports ORFs that end at a recognised stop codon but begin at the 5’ end of the input without a recognised start codon. Both flags are disabled by default and are appropriate when the input represents a fragment of a larger sequence, such as a transcriptome contig.
between_stops=True reports every region between two in-frame stop codons regardless of whether a recognised start codon is present. This mode is appropriate for ribosome-profiling analyses that aim to identify all potentially translated regions. In this mode, partial_3 and partial_5 both behave as if enabled.
The output is exhaustive rather than curated. ORFipy reports every candidate ORF that satisfies the configured filters. Confirming the biological relevance of any individual ORF requires subsequent analyses such as homology search with BLAST, domain annotation with HMMER, or gene prediction with Prodigal.
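The nucleotide-to-amino-acid arithmetic behind the min_len thresholds above can be sketched as follows. The helper name `approx_protein_length` is hypothetical, and the sketch assumes that the measured length counts the stop codon exactly when include_stop=True, as the min_len description states.

```python
def approx_protein_length(min_len_nt, include_stop=True):
    """Approximate amino-acid count for an ORF of min_len_nt nucleotides.

    Assumes the ORF length counts the three stop-codon nucleotides when
    include_stop is True, and excludes them otherwise.
    """
    coding_nt = min_len_nt - 3 if include_stop else min_len_nt
    return max(coding_nt, 0) // 3


for nt in (150, 300, 900):
    with_stop = approx_protein_length(nt, include_stop=True)
    without_stop = approx_protein_length(nt, include_stop=False)
    print(f"min_len={nt}: {with_stop} aa (stop counted), {without_stop} aa (stop excluded)")
```

Under this assumption the two settings differ by one residue, which is why the amino-acid equivalents are approximate.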

Toolkit Notes

These notes apply to every ORFipy tool in this toolkit (orfipy-prediction).

max_len defaults to 10000 nucleotides and silently filters longer ORFs. Raise the limit (for example to 1_000_000_000) for genome-scale inputs to avoid losing long open reading frames without an error.
threads controls intra-sequence parallelism. The default of 4 is reasonable for single-genome inputs. Raise this on multi-core hosts when processing very large sequences. The tool processes each input sequence independently, so additional sequence-level parallelism can be achieved by running multiple instances of orfipy-prediction concurrently through a ToolPool.
Input sequences are normalised to uppercase and filtered to the four standard DNA nucleotides before scanning. Ambiguity codes such as N and IUPAC mixture codes are silently removed from the sequence, as are non-DNA characters. The remaining nucleotides are passed to ORFipy in their original order.
Position fields are 1-indexed to match standard biological residue numbering conventions. ORF start and end positions on the parent sequence follow the conventions used in PDB files, GenBank annotations, and the published literature, so positions can be compared directly against external references without conversion.
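The normalisation step described above can be approximated with the sketch below. The helper name `normalise_dna` is hypothetical and this is not the toolkit's code. The source does not say whether reported positions refer to the original or the filtered sequence, so it can be useful to count removed characters before comparing coordinates against external annotations.

```python
def normalise_dna(seq):
    """Uppercase the input and keep only A, C, G and T, preserving order."""
    return "".join(base for base in seq.upper() if base in "ACGT")


raw = "atgNNNaaaRtga"
clean = normalise_dna(raw)
removed = len(raw) - len(clean)

print(clean)
if removed:
    # If positions are reported on the filtered sequence (an assumption),
    # they will be shifted relative to the raw input.
    print(f"{removed} characters were dropped; coordinates may be shifted")

# A max_len guard in the same spirit: the default of 10000 would silently
# hide any ORF longer than that, so compare it with the input length.
default_max_len = 10000
needs_larger_max_len = len(clean) > default_max_len
```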

Example notebook: See the full working example for a copy-paste-ready walkthrough.

Infrastructure Guides

The following guides cover how to run tools efficiently and at scale.

Tool Persistence

Keep a tool’s model warm across calls instead of reloading it every invocation.

Device Management

How GPUs are allocated to tools and how to target specific devices.

Parallel Execution

Fan a batch of inputs out across multiple GPUs.

Cloud Inference

Run tools on managed cloud infrastructure with no local setup.
