
This toolkit is open source. Any third-party models, product names, or trademarks referenced are the property of their respective owners, and Proto is not affiliated with them.
Background
ORFipy (Singh and Wurtele, 2021) was developed as a fast and flexible replacement for older ORF-extraction tools that struggle with the scale of contemporary genomic and transcriptomic datasets. The published work emphasises customisable search criteria together with high throughput, and reports that ORFipy scales to whole-genome and de novo transcriptome inputs that exceed what earlier ORF finders can comfortably process. The reference implementation is written in Python and is distributed through PyPI and bioconda.
An open reading frame is a continuous stretch of DNA bounded by an in-frame start and stop codon. ORF extraction is mechanistic rather than predictive. Every region that begins at a recognised start codon and continues in frame to the first downstream stop codon is reported, regardless of whether the resulting region encodes a biologically functional protein. This stands in contrast to gene-prediction tools such as Prodigal, which apply learned models to score whether each candidate ORF is likely to correspond to a real gene. ORFipy is appropriate when the goal is exhaustive enumeration of every candidate ORF for downstream filtering or annotation; a gene-prediction tool is appropriate when the goal is a curated set of likely coding genes.
Learning Resources
- urmi-21/orfipy (Wurtele Lab, Iowa State University). Official ORFipy repository and command-line reference.
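The mechanistic definition given under Background can be made concrete with a short sketch. The function below is an illustration only and is not ORFipy's implementation or API. It scans the three forward reading frames, opens an ORF at the first recognised start codon, and closes it at the first in-frame stop codon. The function name, the stop-codon set (standard genetic code), and the choice to report only the outermost start codon for each stop are assumptions made for this example. Reverse-strand scanning and length filters are omitted.

```python
# Illustrative sketch only: not ORFipy's algorithm or API.
STARTS = {"ATG", "GTG", "TTG"}  # default start codons named in this document
STOPS = {"TAA", "TAG", "TGA"}   # assumption: stop codons of the standard genetic code


def forward_orfs(seq, starts=STARTS, stops=STOPS):
    """Return (start, end, frame) tuples for forward-strand ORFs.

    Positions are 1-indexed and inclusive, and the stop codon is included.
    Only the first start codon after the previous stop is used (assumption).
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # 0-based index of the currently open start codon
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon in starts:
                start = i
            elif start is not None and codon in stops:
                orfs.append((start + 1, i + 3, frame + 1))
                start = None
    return orfs


print(forward_orfs("CCATGAAATAGCC"))  # [(3, 11, 3)]
```

The single ORF found here is ATGAAATAG, which occupies positions 3 to 11 of the input and lies in the third forward frame. Nothing in the scan asks whether the region encodes a real protein, which is the sense in which extraction is mechanistic.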
Tools
Orfipy ORF Prediction (orfipy-prediction)
Scans one or more DNA sequences across the configured strand setting (three forward and three reverse reading frames by default) and returns every open reading frame that satisfies the configured start codon, stop codon, strand, and length filters. Each returned ORF carries its nucleotide sequence, translated amino-acid sequence, 1-indexed start and end positions on the parent sequence, strand, reading frame, and the parent sequence identifier.
API Reference
Input: OrfipyInput
Config: OrfipyConfig
- strand: f, r, b …
- min_len: … include_stop=True). ORFs shorter than this are filtered out. Common values: …
- max_len: … 1_000_000_000) for genome-scale inputs. Default: 10000.
- include_stop: If True, the stop codon is included in both the nucleotide sequence and length calculations. If False, the stop codon is excluded. Default: True.
- … Default: False.
- … Default: False.
- … Default: False.
- … Default: False.
- translation_table: None uses the standard genetic code (table 1). Only tables supported by orfipy’s built-in translation table dict are available (NCBI tables 1-6, 9-14, 16, 21-30). Common options: …
- … True is coerced to 1 and False to 0.
- … None waits indefinitely.
- … BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.
Output: OrfipyOutput
Applications
This tool is appropriate for the upstream ORF-enumeration step of any analysis that begins with raw DNA sequences and needs candidate coding regions. Representative applications include cataloguing all ORFs in a newly assembled genome before annotation, extracting coding-sequence candidates from a de novo transcriptome assembly, generating an exhaustive ORF set for downstream filtering by length, codon usage, or homology to a known protein, and producing translated protein sequences for downstream language-model scoring or domain annotation.
Usage Tips
- min_len is the primary control on the number of reported ORFs. At the default of min_len=0, every candidate region is reported, including many short ORFs that arise by chance in any DNA sequence and do not encode functional proteins. A threshold of min_len=150 (approximately 50 amino acids) excludes the majority of these short ORFs. A threshold of min_len=300 (approximately 100 amino acids) focuses the output on typical small proteins, and min_len=900 (approximately 300 amino acids) restricts the output to larger proteins. The threshold is specified in nucleotides.
- start_codons should match the genetic context of the input. The default of ["ATG", "GTG", "TTG"] is appropriate for bacterial and archaeal sequences, in which alternative start codons account for approximately 15 to 20 percent of genes. A value of ["ATG"] is appropriate for stringent eukaryotic ORF analyses, and the inclusion of "CTG" is appropriate for organisms that use a non-standard genetic code in which CTG functions as an alternative start codon.
- strand controls which DNA strands are scanned. The default of "b" scans both strands and reports ORFs from both the forward sequence and its reverse complement. A value of "f" or "r" restricts the scan to a single strand, which approximately halves the number of ORFs returned and is appropriate when the coding strand of the input is known in advance.
- translation_table selects the genetic code used for amino-acid translation. The default value of None applies the standard genetic code (NCBI table 1). A value of "bacterial" selects the bacterial, archaeal, and plant plastid code (NCBI table 11), "vertebrate_mitochondrial" selects the vertebrate mitochondrial code (NCBI table 2), and the remaining supported NCBI tables are appropriate for organisms that use the corresponding alternative codes.
- The partial-ORF flags allow incomplete reading frames at sequence boundaries. A value of partial_3=True reports ORFs that begin at a recognised start codon and continue to the 3’ end of the input without an in-frame stop codon. A value of partial_5=True reports ORFs that end at a recognised stop codon but begin at the 5’ end of the input without a recognised start codon. Both flags are disabled by default and are appropriate when the input represents a fragment of a larger sequence, such as a transcriptome contig.
- between_stops=True reports every region between two in-frame stop codons regardless of whether a recognised start codon is present. This is appropriate for ribosome-profiling analyses that aim to identify all potential translation regions, and implies that both partial_3 and partial_5 behave as if enabled.
- The output is exhaustive rather than curated. ORFipy reports every candidate ORF that satisfies the configured filters. Confirming the biological relevance of any individual ORF requires subsequent analyses such as homology search with BLAST, domain annotation with HMMER, or gene prediction with Prodigal.
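The relationship between a min_len threshold in nucleotides and the approximate protein lengths quoted in the tips can be checked with a few lines of arithmetic. The helper below is a hypothetical illustration, not part of the toolkit. It assumes that ORF lengths are whole multiples of three and that, when include_stop=True, the stop codon counts towards the length compared against min_len.

```python
# Hypothetical helper for illustration; not part of ORFipy or this toolkit.
def shortest_protein_length(min_len, include_stop=True):
    """Amino acids encoded by the shortest ORF that passes min_len (nucleotides)."""
    codons = -(-min_len // 3)  # round up to a whole number of codons
    if include_stop:
        codons -= 1  # assumption: the final counted codon is the stop codon
    return max(codons, 0)


for threshold in (150, 300, 900):
    print(threshold, shortest_protein_length(threshold))
# 150 49
# 300 99
# 900 299
```

Under these assumptions the quoted figures of roughly 50, 100, and 300 amino acids are one residue high when the stop codon is counted, and exact when include_stop=False.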
Toolkit Notes
These apply to every ORFipy tool in this toolkit (orfipy-prediction).
- max_len defaults to 10000 nucleotides and silently filters longer ORFs. Raise the limit (for example to 1_000_000_000) for genome-scale inputs to avoid losing long open reading frames without an error.
- threads controls intra-sequence parallelism. The default of 4 is reasonable for single-genome inputs. Raise this on multi-core hosts when processing very large sequences. The tool processes each input sequence independently, so additional sequence-level parallelism can be achieved by running multiple instances of orfipy-prediction concurrently through a ToolPool.
- Input sequences are normalised to uppercase and filtered to the four standard DNA nucleotides before scanning. Ambiguity codes such as N and IUPAC mixture codes are silently removed from the sequence, as are non-DNA characters. The remaining nucleotides are passed to ORFipy in their original order.
- Position fields are 1-indexed to match standard biological residue numbering conventions. ORF start and end positions on the parent sequence follow the conventions used in PDB files, GenBank annotations, and the published literature, so positions can be compared directly against external references without conversion.
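Because the removal of ambiguity codes is silent, it can be useful to check in advance how many characters an input would lose. The sketch below mimics the normalisation described in the notes above using only the Python standard library. The function name is hypothetical and the code is an approximation of the described behaviour, not the toolkit's own routine. The notes do not state whether reported positions are mapped back to the unfiltered input, so a non-zero removal count is a signal to verify coordinates before comparing them with external annotations.

```python
import re


# Hypothetical helper approximating the normalisation described above.
def normalise_dna(seq):
    """Uppercase seq and drop every character other than A, C, G, or T.

    Returns the cleaned sequence and the number of characters removed.
    """
    cleaned = re.sub(r"[^ACGT]", "", seq.upper())
    return cleaned, len(seq) - len(cleaned)


cleaned, removed = normalise_dna("atgNNNaaa-tag")
print(cleaned, removed)  # ATGAAATAG 4
```

In this example the cleaned sequence is four characters shorter than the input, so every position downstream of the removed run is shifted relative to the original string.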