Prodigal
License: Prodigal has a GPL-3.0 license. Please refer to the license for full terms.

This toolkit is open source. Any third-party models, product names, or trademarks referenced are the property of their respective owners, and Proto is not affiliated with them.


hyattpd/Prodigal
Prodigal Gene Prediction Software
Prodigal: prokaryotic gene recognition and translation initiation site identification
Doug Hyatt, Gwo-Liang Chen, … Loren J Hauser
BMC Bioinformatics (2010)
@article{hyatt2010prodigal,
  title={Prodigal: prokaryotic gene recognition and translation initiation site identification},
  author={Hyatt, Doug and Chen, Gwo-Liang and LoCascio, Philip F and Land, Miriam L and Larimer, Frank W and Hauser, Loren J},
  journal={BMC Bioinformatics},
  volume={11},
  number={1},
  pages={119},
  year={2010},
  publisher={BioMed Central},
  doi={10.1186/1471-2105-11-119}
}
proto-bio/proto-tools/proto_tools/tools/orf_prediction/prodigal
Function: run_prodigal_prediction()
Description: Prokaryotic ORF and gene prediction using Prodigal

Background

Prodigal (Hyatt, Chen, LoCascio, Land, Larimer, and Hauser, 2010) was developed as a fast and accurate replacement for earlier prokaryotic gene-prediction programs. The published method targets three objectives: improved gene-structure prediction, improved translation initiation site recognition, and a lower false-positive rate. The authors report that Prodigal compares favourably with the gene finders that were the established standard at the time of publication, and the program has since become one of the most widely used tools for automated prokaryotic genome annotation.

Prokaryotic gene prediction is more straightforward than eukaryotic gene prediction because prokaryotic genes are contiguous, are not interrupted by introns, and often begin with a Shine-Dalgarno ribosome binding site located a short distance upstream of the start codon, typically around 5 to 10 nucleotides. Prodigal exploits these regularities by combining a dynamic-programming search across candidate open reading frames with scoring terms for coding-region hexamer frequencies, the presence and strength of a recognised ribosome binding site motif, and the identity of the start codon.

The program supports two operating modes. In single-genome mode it first trains its scoring parameters on the input sequence itself and then predicts genes using those trained parameters, which requires roughly 100 kilobases or more of input sequence for reliable training. In meta mode it applies a set of pre-trained parameters from a curated panel of reference genomes, which is appropriate for short contigs, draft assemblies, and metagenomic samples.

This toolkit uses pyrodigal (Larralde, 2022), a Python interface to Prodigal that exposes the original C implementation through Python bindings with SIMD-accelerated coding-region scoring. The interface reproduces the predictions of the reference Prodigal program while removing the need to manage an external command-line invocation.
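
As a rough illustration of the mode choice described above, the following sketch picks a value for meta_mode from the total input length. The helper name choose_meta_mode and the use of total (rather than per-sequence) length are assumptions made for illustration; only the approximate 100-kilobase figure comes from the text.

```python
from typing import List

# Approximate training threshold quoted in the text; not an exact cutoff.
MIN_TRAINING_BP = 100_000


def choose_meta_mode(sequences: List[str], min_training_bp: int = MIN_TRAINING_BP) -> bool:
    """Return True (meta mode) when there is too little sequence to train on.

    Hypothetical helper: it sums the lengths of all input sequences and
    compares the total against the approximate training threshold.
    """
    total_bp = sum(len(seq) for seq in sequences)
    return total_bp < min_training_bp


if __name__ == "__main__":
    short_contigs = ["ACGT" * 500, "GGCC" * 250]  # 3,000 bp in total
    print(choose_meta_mode(short_contigs))  # True: too short to train on
```

The threshold is a guideline rather than a hard limit, so borderline inputs may be worth running in both modes and comparing.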

Tools

Prodigal ORF Prediction (prodigal-prediction)

Predicts protein-coding genes in one or more prokaryotic DNA sequences using Prodigal through the pyrodigal interface. Each returned gene carries its nucleotide and translated amino-acid sequence, 1-indexed start and end coordinates on the parent sequence, strand, reading frame, partial-gene status, GC content, start codon identity, and the detected ribosome binding site motif and spacer.
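
To make the "motif and spacer" fields concrete, the sketch below searches a fixed window upstream of a start codon for a Shine-Dalgarno-like motif. It is a deliberately naive exact-match search with an assumed motif list and an assumed 5 to 10 nucleotide spacer window; Prodigal's own scoring is more elaborate, and find_rbs is a hypothetical name, not part of the toolkit.

```python
from typing import Optional, Sequence, Tuple

# Assumed motif list for illustration, ordered from longest to shortest.
SD_MOTIFS = ("AGGAGG", "GGAGG", "AGGAG", "GGAG", "AGGA")


def find_rbs(
    seq: str,
    start: int,
    motifs: Sequence[str] = SD_MOTIFS,
    min_spacer: int = 5,
    max_spacer: int = 10,
) -> Optional[Tuple[str, int]]:
    """Return (motif, spacer) for the first motif found upstream of `start`.

    `start` is the 0-based index of the first base of the start codon.
    The spacer is the number of bases between the motif and the start codon.
    """
    for motif in motifs:
        for spacer in range(min_spacer, max_spacer + 1):
            begin = start - spacer - len(motif)
            if begin >= 0 and seq[begin:begin + len(motif)] == motif:
                return motif, spacer
    return None


if __name__ == "__main__":
    # 3 nt of padding, a 6 nt motif, a 6 nt spacer, then a short ORF.
    dna = "TTT" + "AGGAGG" + "ACAGCT" + "ATGAAATAA"
    print(find_rbs(dna, dna.index("ATG")))  # ('AGGAGG', 6)
```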

API Reference

input_sequences
List[string]
required
DNA sequence(s) to analyze for genes and open reading frames. Can be provided as a single string or a list of strings.
meta_mode
boolean
default:"True"
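Use meta mode for gene prediction. True applies pre-trained parameters (suited to short contigs, draft assemblies, and metagenomic samples); False trains on the input sequence itself (single-genome mode).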
translation_table
enum
default:"bacterial"
NCBI genetic code for translation. Only used in single-genome mode (meta_mode=False). In meta mode, pre-trained metagenomic models use their own built-in translation tables and this parameter is ignored. Available options: standard, vertebrate_mitochondrial, yeast_mitochondrial, mycoplasma, invertebrate_mitochondrial, ciliate_nuclear, echinoderm_mitochondrial, euplotid_nuclear, bacterial, alternative_yeast_nuclear, ascidian_mitochondrial, alternative_flatworm_mitochondrial, blepharisma_nuclear, chlorophycean_mitochondrial, trematode_mitochondrial, scenedesmus_mitochondrial, thraustochytrium_mitochondrial, rhabdopleuridae_mitochondrial, candidate_division_sr1
closed_ends
boolean
default:"False"
Prevent genes from running off sequence edges. True disallows partial genes at the sequence boundaries; False allows them.
mask
boolean
default:"False"
When True, treat runs of N bases as masked and do not call genes spanning them (equivalent to prodigal’s -m). Default: False.
min_gene
integer
default:"90"
Minimum gene length in nucleotides. Default: 90. Lower this value for draft assemblies where short fragments are expected.
num_threads
integer
Number of CPU threads for parallel processing of multiple sequences. Higher values speed up batch processing. Must be at least 1. Default: auto-detect and use all available CPU cores.
verbose
integer
default:"0"
Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.
device
string
default:"cpu"
Device to run the tool on.
timeout
integer
default:"600"
Maximum execution time in seconds. None waits indefinitely.
seed
integer
Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.
predicted_orfs
List[array]
List of ORF results per input sequence. Each inner list contains the ORF objects found in a single input sequence. The outer list order matches the input sequences.
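
The nesting described above (one inner list per input, in input order) can be walked as in the following sketch. The dictionaries stand in for the toolkit's ORF objects, and the keys start, end, and strand are placeholder names chosen for illustration, not the real attribute names.

```python
from typing import Dict, List, Tuple


def flatten_orfs(predicted_orfs: List[List[Dict]]) -> List[Tuple[int, Dict]]:
    """Pair every ORF with the index of the input sequence it came from."""
    return [
        (seq_index, orf)
        for seq_index, orfs in enumerate(predicted_orfs)
        for orf in orfs
    ]


# Mock result for three inputs; the second input produced no genes.
mock_result = [
    [{"start": 1, "end": 300, "strand": "+"}, {"start": 450, "end": 900, "strand": "-"}],
    [],
    [{"start": 10, "end": 130, "strand": "+"}],
]

if __name__ == "__main__":
    for seq_index, orf in flatten_orfs(mock_result):
        print(seq_index, orf["start"], orf["end"], orf["strand"])
```

Keeping the input index alongside each ORF makes it possible to trace a gene back to its parent sequence after the results have been flattened or filtered.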

Applications

This tool is appropriate for the gene-calling step of any analysis that begins with raw prokaryotic DNA sequences and needs a curated set of protein-coding genes rather than an exhaustive enumeration of all open reading frames. Representative applications include initial gene annotation of a newly assembled bacterial or archaeal genome, recovery of protein-coding genes from metagenomic contigs for downstream functional or taxonomic analysis, and generation of translated protein sequences for subsequent homology search or domain annotation.

Usage Tips

  • meta_mode selects the operating mode and is the most consequential setting. The default of meta_mode=True applies a panel of pre-trained parameters that is appropriate for short contigs, draft assemblies, and metagenomic samples. A value of meta_mode=False instead trains scoring parameters on the input sequence itself and requires at least approximately 100 kilobases of input for reliable training. The single-genome mode is appropriate for complete or near-complete genomes when sufficient training sequence is available.
  • translation_table selects the genetic code and is only consulted in single-genome mode. In meta mode the pre-trained metagenomic models carry their own internal translation tables, and this parameter has no effect on the output. The default value of "bacterial" corresponds to NCBI table 11 (bacterial, archaeal, and plant plastid code). A value of "mycoplasma" selects NCBI table 4 (Mycoplasma and Spiroplasma), and "standard" selects NCBI table 1 (the standard genetic code). Additional supported NCBI tables are appropriate for organisms that use the corresponding alternative codes.
  • closed_ends controls whether partial genes at sequence boundaries are reported. The default of closed_ends=False allows partial genes at the 5’ and 3’ ends of each input sequence, which is appropriate for linear contigs and draft assemblies in which real genes may extend across the assembly boundary. A value of closed_ends=True prevents partial-gene predictions and is appropriate for complete circular genomes such as bacterial chromosomes and plasmids, in which there are no true sequence ends.
  • min_gene is the minimum gene length in nucleotides and defaults to 90. This corresponds to approximately 30 amino acids. Lower values can be considered for draft assemblies in which short gene fragments are expected at contig boundaries, while higher values are appropriate when only larger, well-defined genes are of interest.
  • mask=True excludes regions of unresolved nucleotides from gene calling. When the input contains runs of N characters representing low-quality or gap regions, a value of mask=True prevents Prodigal from calling genes that span those regions, which is appropriate for draft assemblies with significant unresolved sequence content.
  • Partial-gene status is reported on each predicted gene as a code made of two two-digit fields joined by an underscore. A status of 00_00 indicates a complete gene with both a start and a stop codon present in the input. A status of 10_00 indicates a gene that is truncated at the 5’ end of the input, 00_01 indicates truncation at the 3’ end, and 10_01 indicates truncation at both ends. Partial genes commonly represent real coding sequences that extend beyond the boundary of the input and should not be excluded from downstream analyses without consideration.
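
The status codes listed in the last tip can be decoded with a small lookup. This sketch covers only the four codes named above and treats the code as a plain string; describe_partial and is_complete are hypothetical helpers, and how the toolkit exposes the status on each gene object is not assumed here.

```python
# (truncated at 5' end, truncated at 3' end) for each documented code.
PARTIAL_STATUS = {
    "00_00": (False, False),
    "10_00": (True, False),
    "00_01": (False, True),
    "10_01": (True, True),
}


def describe_partial(code: str) -> str:
    """Translate a partial-gene status code into a readable description."""
    if code not in PARTIAL_STATUS:
        raise ValueError(f"unrecognised partial status: {code!r}")
    five_prime, three_prime = PARTIAL_STATUS[code]
    if five_prime and three_prime:
        return "truncated at both ends"
    if five_prime:
        return "truncated at the 5' end"
    if three_prime:
        return "truncated at the 3' end"
    return "complete"


def is_complete(code: str) -> bool:
    """True only for genes with both a start and a stop codon in the input."""
    return describe_partial(code) == "complete"


if __name__ == "__main__":
    for status in ("00_00", "10_00", "00_01", "10_01"):
        print(status, "->", describe_partial(status))
```

Filtering with is_complete is one way to separate complete genes from partial ones while still retaining the partial set for separate inspection.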

Toolkit Notes

These apply to every Prodigal tool in this toolkit (prodigal-prediction).
  • Prodigal is appropriate for prokaryotic genomes only. The scoring model is calibrated for bacterial and archaeal gene structure, and the program does not handle introns. Eukaryotic gene prediction requires a dedicated eukaryotic gene finder.
  • Input sequences are accepted as a single string or a list of strings and are normalised to uppercase before gene calling. IUPAC ambiguity codes are permitted in the input. The validator raises an error when the input contains characters that are not recognised DNA nucleotides or IUPAC codes.
  • num_threads controls the parallelism used to process multiple input sequences. Each input sequence is processed by an independent worker, so increasing the thread count benefits batches of many sequences but has no effect on a single input. The default automatically detects the number of available CPU cores.
  • Position fields are 1-indexed to match standard biological residue numbering conventions. Gene start and end positions on the parent sequence follow the conventions used in GenBank annotations and the published literature, so positions can be compared directly against external references without conversion.
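
Because the reported coordinates are 1-indexed, they need a small adjustment before slicing a Python string. The sketch below assumes that the end coordinate is inclusive (as in GenBank features), that coordinates refer to the forward strand of the parent sequence, and that strand is given as +1 or -1; these are assumptions about the output format, and extract_gene is a hypothetical helper.

```python
# Complement table for unambiguous bases plus N; other IUPAC codes are left unchanged.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract_gene(parent: str, start: int, end: int, strand: int) -> str:
    """Slice a gene out of `parent` using 1-indexed, inclusive coordinates.

    Assumes `start` and `end` are forward-strand positions and that
    `strand` is +1 or -1 (an assumption about the output format).
    """
    if not 1 <= start <= end <= len(parent):
        raise ValueError("coordinates fall outside the parent sequence")
    region = parent[start - 1:end]  # 1-indexed inclusive -> 0-indexed half-open
    return reverse_complement(region) if strand < 0 else region


if __name__ == "__main__":
    print(extract_gene("CCATGAAATAAGG", 3, 11, +1))  # ATGAAATAA
    print(extract_gene("CCTTATTTCATGG", 3, 11, -1))  # ATGAAATAA
```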
Example notebook: See the full working example for a copy-paste-ready walkthrough.

Infrastructure Guides

The following guides cover how to run tools efficiently and at scale.

Tool Persistence

Keep a tool’s model warm across calls instead of reloading it every invocation.

Device Management

How GPUs are allocated to tools and how to target specific devices.

Parallel Execution

Fan a batch of inputs out across multiple GPUs.

Cloud Inference

Run tools on managed cloud infrastructure with no local setup.