
This toolkit is open source. Any third-party models, product names, or trademarks referenced are the property of their respective owners, and Proto is not affiliated with them.
Background
Prodigal (Hyatt, Chen, LoCascio, Land, Larimer, and Hauser, 2010) was developed as a fast and accurate replacement for earlier prokaryotic gene-prediction programs. The published method targets three specific objectives: improved gene-structure prediction, improved translation initiation site recognition, and a reduced false-positive rate. The authors report that Prodigal achieves favourable results against the gene finders that were the established standard at the time of publication, and the program has since become one of the most widely used tools for automated prokaryotic genome annotation.
Prokaryotic gene prediction is straightforward relative to eukaryotic gene prediction because prokaryotic genes are contiguous, are not interrupted by introns, and often begin with a Shine-Dalgarno ribosome binding site located a short distance upstream of the start codon, typically around 5 to 10 nucleotides. Prodigal exploits these regularities by combining a dynamic-programming search across candidate open reading frames with scoring terms for coding-region hexamer frequencies, the presence and strength of a recognised ribosome binding site motif, and the identity of the start codon.
The program supports two operating modes. In single-genome mode it first trains its scoring parameters on the input sequence itself and then predicts genes using those trained parameters, which requires at least approximately 100 kilobases of input sequence for reliable training. In meta mode it applies a set of pre-trained parameters from a curated panel of reference genomes, which is appropriate for short contigs, draft assemblies, and metagenomic samples.
This toolkit uses pyrodigal (Larralde, 2022), a Python interface to Prodigal that exposes the original C implementation through Python bindings with SIMD-accelerated coding-region scoring. The interface reproduces the predictions of the reference Prodigal program while removing the need to manage an external command-line invocation.
Learning Resources
- hyattpd/Prodigal (Hyatt, Oak Ridge National Laboratory). Official Prodigal source code and command-line reference.
- althonos/pyrodigal (Larralde, EMBL). Python interface to Prodigal used by this toolkit, with extended documentation and API reference at pyrodigal.readthedocs.io.
Tools
Prodigal ORF Prediction (prodigal-prediction)
Predicts protein-coding genes in one or more prokaryotic DNA sequences using Prodigal through the pyrodigal interface. Each returned gene carries its nucleotide and translated amino-acid sequence, 1-indexed start and end coordinates on the parent sequence, strand, reading frame, partial-gene status, GC content, start codon identity, and the detected ribosome binding site motif and spacer.
API Reference
Input: ProdigalInput
Config: ProdigalConfig
- translation_table: consulted only in single-genome mode (meta_mode=False). In meta mode, pre-trained metagenomic models use their own built-in translation tables and this parameter is ignored. Available options: standard, vertebrate_mitochondrial, yeast_mitochondrial, mycoplasma, invertebrate_mitochondrial, ciliate_nuclear, echinoderm_mitochondrial, euplotid_nuclear, bacterial, alternative_yeast_nuclear, ascidian_mitochondrial, alternative_flatworm_mitochondrial, blepharisma_nuclear, chlorophycean_mitochondrial, trematode_mitochondrial, scenedesmus_mitochondrial, thraustochytrium_mitochondrial, rhabdopleuridae_mitochondrial, candidate_division_sr1.
- mask: when True, treat runs of N bases as masked and do not call genes spanning them (equivalent to prodigal’s -m). Default: False.
- … True is coerced to 1 and False to 0.
- … None waits indefinitely.
- … BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.
Output: ProdigalOutput
Applications
This tool is appropriate for the gene-calling step of any analysis that begins with raw prokaryotic DNA sequences and needs a curated set of protein-coding genes rather than an exhaustive enumeration of all open reading frames. Representative applications include initial gene annotation of a newly assembled bacterial or archaeal genome, recovery of protein-coding genes from metagenomic contigs for downstream functional or taxonomic analysis, and generation of translated protein sequences for subsequent homology search or domain annotation.
Usage Tips
- meta_mode selects the operating mode and is the most consequential setting. The default of meta_mode=True applies a panel of pre-trained parameters that is appropriate for short contigs, draft assemblies, and metagenomic samples. A value of meta_mode=False instead trains scoring parameters on the input sequence itself and requires at least approximately 100 kilobases of input for reliable training. The single-genome mode is appropriate for complete or near-complete genomes when sufficient training sequence is available.
- translation_table selects the genetic code and is only consulted in single-genome mode. In meta mode the pre-trained metagenomic models carry their own internal translation tables, and this parameter has no effect on the output. The default value of "bacterial" corresponds to NCBI table 11 (bacterial, archaeal, and plant plastid code). A value of "mycoplasma" selects NCBI table 4 (Mycoplasma and Spiroplasma), and "standard" selects NCBI table 1 (the standard genetic code). Additional supported NCBI tables are appropriate for organisms that use the corresponding alternative codes.
- closed_ends controls whether partial genes at sequence boundaries are reported. The default of closed_ends=False allows partial genes at the 5’ and 3’ ends of each input sequence, which is appropriate for linear contigs and draft assemblies in which real genes may extend across the assembly boundary. A value of closed_ends=True prevents partial-gene predictions and is appropriate for complete circular genomes such as bacterial chromosomes and plasmids, in which there are no true sequence ends.
- min_gene is the minimum gene length in nucleotides and defaults to 90. This corresponds to approximately 30 amino acids. Lower values can be considered for draft assemblies in which short gene fragments are expected at contig boundaries, while higher values are appropriate when only larger, well-defined genes are of interest.
- mask=True excludes regions of unresolved nucleotides from gene calling. When the input contains runs of N characters representing low-quality or gap regions, a value of mask=True prevents Prodigal from calling genes that span those regions, which is appropriate for draft assemblies with significant unresolved sequence content.
- Partial-gene status is reported as a two-digit code on each predicted gene. A status of 00_00 indicates a complete gene with both a start and a stop codon present in the input. A status of 10_00 indicates a gene that is truncated at the 5’ end of the input, 00_01 indicates truncation at the 3’ end, and 10_01 indicates truncation at both ends. Partial genes commonly represent real coding sequences that extend beyond the boundary of the input and should not be excluded from downstream analyses without consideration.
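The settings above correspond closely to options that pyrodigal itself exposes. The following is a minimal sketch, assuming the toolkit forwards its configuration to pyrodigal.GeneFinder roughly as shown. The helper names (finder_kwargs, predict_genes) and the NCBI_TABLES subset are hypothetical and are not part of the toolkit's API. pyrodigal is imported lazily so that the mapping helper can be inspected without the library installed.

```python
from typing import Any, Dict

# Illustrative subset of translation_table option names and the NCBI table
# numbers given in the tips above (hypothetical constant, not toolkit API).
NCBI_TABLES: Dict[str, int] = {"standard": 1, "mycoplasma": 4, "bacterial": 11}


def finder_kwargs(meta_mode: bool = True, closed_ends: bool = False,
                  mask: bool = False, min_gene: int = 90) -> Dict[str, Any]:
    """Translate toolkit-style settings into pyrodigal.GeneFinder keywords.

    Assumption: the toolkit's field names correspond one-to-one to these
    pyrodigal arguments; the real wrapper may differ.
    """
    return {"meta": meta_mode, "closed": closed_ends,
            "mask": mask, "min_gene": min_gene}


def predict_genes(sequence: str, translation_table: str = "bacterial",
                  **settings: Any):
    """Predict genes on one sequence (requires the pyrodigal package)."""
    import pyrodigal  # third-party; imported here so the helpers stay importable

    kwargs = finder_kwargs(**settings)
    finder = pyrodigal.GeneFinder(**kwargs)
    if not kwargs["meta"]:
        # Single-genome mode: train on the input itself before predicting.
        # translation_table is only used on this path.
        finder.train(sequence, translation_table=NCBI_TABLES[translation_table])
    return finder.find_genes(sequence)
```

Under these assumptions, predict_genes(genome, meta_mode=False, closed_ends=True) would train on a complete circular genome before predicting, while predict_genes(contig) keeps the meta-mode default and never consults translation_table.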
Toolkit Notes
These apply to every Prodigal tool in this toolkit (prodigal-prediction).
- Prodigal is appropriate for prokaryotic genomes only. The scoring model is calibrated for bacterial and archaeal gene structure, and the program does not handle introns. Eukaryotic gene prediction requires a dedicated eukaryotic gene finder.
- Input sequences are accepted as a single string or a list of strings and are normalised to uppercase before gene calling. IUPAC ambiguity codes are permitted in the input. The validator raises an error when the input contains characters that are not recognised DNA nucleotides or IUPAC codes.
- num_threads controls the parallelism used to process multiple input sequences. Each input sequence is processed by an independent worker, so increasing the thread count benefits batches of many sequences but has no effect on a single input. The default automatically detects the number of available CPU cores.
- Position fields are 1-indexed to match standard biological residue numbering conventions. Gene start and end positions on the parent sequence follow the conventions used in GenBank annotations and the published literature, so positions can be compared directly against external references without conversion.
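Because positions are 1-indexed and inclusive, recovering a gene's nucleotides from the parent sequence in Python requires shifting the start by one. The sketch below illustrates that conversion. It assumes that start is never greater than end regardless of strand (the GenBank convention) and that the reverse strand is encoded as a negative number; the function names are hypothetical and the toolkit's actual field encodings may differ.

```python
# Complement table covering the four bases and the IUPAC ambiguity codes.
# Input is assumed to be uppercase, as the toolkit normalises it.
_COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract_gene(parent: str, start: int, end: int, strand: int) -> str:
    """Slice a gene out of its parent using 1-indexed, inclusive coordinates.

    Assumptions: start <= end on both strands, and strand < 0 marks the
    reverse strand.
    """
    if not 1 <= start <= end <= len(parent):
        raise ValueError("coordinates fall outside the parent sequence")
    # 1-indexed inclusive [start, end] becomes the 0-indexed slice [start-1:end].
    region = parent[start - 1:end]
    return reverse_complement(region) if strand < 0 else region


parent = "GGATGAAATAACC"
forward_gene = extract_gene(parent, 3, 11, 1)   # "ATGAAATAA"
reverse_gene = extract_gene(parent, 1, 4, -1)   # "ATCC"
```

The length of the extracted region is always end - start + 1, which is a quick sanity check when comparing coordinates against external annotations.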