License: MAFFT is open source and free for academic and commercial use under a BSD-3-Clause license. Please refer to the license for full terms.

Proto is not affiliated with RIMD. This toolkit is open source and builds on the implementation produced by this organization. Product names, logos, and trademarks are the property of their respective owners.


GSLBiotech/mafft
Align multiple amino acid or nucleotide sequences.
MAFFT multiple sequence alignment software version 7: improvements in performance and usability
Kazutaka Katoh and Daron M Standley
Molecular Biology and Evolution (2013)
@article{katoh2013mafft,
  title={MAFFT multiple sequence alignment software version 7: improvements in performance and usability},
  author={Katoh, Kazutaka and Standley, Daron M},
  journal={Molecular Biology and Evolution},
  volume={30},
  number={4},
  pages={772--780},
  year={2013},
  publisher={Oxford University Press},
  doi={10.1093/molbev/mst010}
}
proto-bio/proto-tools/proto_tools/tools/sequence_alignment/mafft
Function: run_mafft_align()
Description: Multiple sequence alignment (MSA) using MAFFT (Multiple Alignment using Fast Fourier Transform).

Background

MAFFT (Katoh and Standley, 2013) is a multiple sequence alignment program that constructs an alignment through progressive alignment along a guide tree followed by optional iterative refinement. In the fast FFT-NS methods, pairwise distances between input sequences are first estimated rapidly from shared 6-mers, and a Fast Fourier Transform applied to numerically encoded sequences (for proteins, residue volume and polarity) quickly locates homologous segments during alignment. A guide tree is built from the distances, sequences are progressively aligned along the tree, and the alignment is optionally refined by an iterative cycle that repeatedly removes and re-aligns subsets of sequences.

MAFFT exposes several algorithm variants that differ in pairwise scoring and refinement strategy:

  • FFT-NS-2 is the fast progressive method that MAFFT runs when no strategy is specified, and FFT-NS-i adds iterative refinement to it. Both are appropriate for large datasets.
  • L-INS-i (localpair) performs local pairwise alignment with iterative refinement and is appropriate for sequences with one alignable domain flanked by variable regions.
  • G-INS-i (globalpair) performs global pairwise alignment with iterative refinement and is appropriate for sequences of similar length.
  • E-INS-i (genafpair) is a local-alignment variant that handles sequences with multiple conserved domains separated by long unalignable regions.
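To make the k-mer distance idea concrete, here is a minimal sketch in Python. It is a simplified stand-in rather than MAFFT's actual distance formula: the function names and the normalisation by the shorter sequence are assumptions chosen for illustration, and k = 6 simply mirrors the 6-mer counting that MAFFT documents.

```python
from collections import Counter


def kmer_counts(seq: str, k: int = 6) -> Counter:
    """Count every overlapping k-mer in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_distance(a: str, b: str, k: int = 6) -> float:
    """Illustrative distance: 1 - (shared k-mers / k-mers in the shorter sequence).

    Assumption: this normalisation is for illustration only and is not
    MAFFT's exact formula.
    """
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    shared = sum((ca & cb).values())  # multiset intersection of k-mer counts
    shortest = min(sum(ca.values()), sum(cb.values()))
    if shortest == 0:  # at least one sequence is shorter than k
        return 1.0
    return 1.0 - shared / shortest
```

Identical sequences score 0.0 and sequences with no shared k-mer score 1.0. Because no alignment is computed, a full distance matrix for a guide tree can be filled in cheaply before any progressive alignment starts.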

Learning Resources

  • MAFFT software homepage (Osaka University). Official distribution site and user documentation for the command-line program that this toolkit invokes.
  • MAFFT algorithm comparison (Osaka University). A side-by-side comparison of the alignment algorithm variants that the align_method field selects.
  • MAFFT online server (Osaka University). Hosted entry point to the same MAFFT pipeline, useful for a quick browser-based alignment before scripting against the tool.

Tools

MAFFT Alignment (mafft-align)

Performs multiple sequence alignment over two or more input sequences using the bundled mafft command-line program. The selected algorithm variant is controlled by the align_method configuration field. The tool returns a typed MSA object containing the aligned sequences and their identifiers, with helpers for column statistics and serialisation to FASTA or A3M.

API Reference

Inputs

  • sequences (List[string], required): List of sequence strings (protein or nucleotide) to align. At least 2 sequences are required for alignment.
  • sequence_ids (array, optional): List of sequence identifiers. If not provided, sequences are assigned sequential IDs (seq_0, seq_1, …).

Configuration

  • align_method (enum, default: "auto"): Algorithm variant. Available options: "auto" (MAFFT picks by input size), "localpair" (L-INS-i), "globalpair" (G-INS-i), "genafpair" (E-INS-i).
  • max_iterations (integer, default: 0): Iterative-refinement cycles. 0 = no refinement; ~1000 enables the full *-INS-i pipelines with *pair methods.
  • threads (integer, default: 1): Number of CPU threads for parallel processing.
  • extra_args (List[string], default: []): Verbatim mafft CLI tokens for niche flags (e.g. ["--retree", "3", "--reorder"]).
  • verbose (integer, default: 0): Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.
  • device (string, default: "cpu"): Device to run the tool on.
  • timeout (integer, default: 600): Maximum execution time in seconds. None waits indefinitely.
  • seed (integer): Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.

Output

  • msa (MSA, required): The multiple sequence alignment result containing aligned sequences, sequence IDs, and original unaligned sequences.
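
The input rules above (at least two sequences, sequential default identifiers) can be sketched as a small helper that renders the FASTA text a wrapper would hand to mafft. The name build_fasta is hypothetical and this is not the toolkit's own validator; the check that sequence_ids matches sequences in length is an added assumption.

```python
from typing import List, Optional


def build_fasta(sequences: List[str], sequence_ids: Optional[List[str]] = None) -> str:
    """Validate inputs and render them as FASTA text (hypothetical helper)."""
    if len(sequences) < 2:
        raise ValueError("at least 2 sequences are required for alignment")
    if any(not s.strip() for s in sequences):
        raise ValueError("sequences must be non-empty")
    if sequence_ids is None:
        # Sequential default identifiers: seq_0, seq_1, ...
        sequence_ids = [f"seq_{i}" for i in range(len(sequences))]
    if len(sequence_ids) != len(sequences):
        # Assumption: one identifier per sequence is required.
        raise ValueError("sequence_ids must match sequences in length")
    return "".join(f">{sid}\n{seq}\n" for sid, seq in zip(sequence_ids, sequences))
```

Calling build_fasta(["ACGT", "ACGA"]) produces two records headed >seq_0 and >seq_1, while a single-sequence list raises ValueError before any external program is started.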

Applications

This tool is appropriate for any analysis that benefits from a multiple sequence alignment of homologous protein or nucleotide sequences. Common downstream uses include phylogenetic-tree inference, conservation analysis over alignment columns to identify functionally important residues, homology modelling against a related reference, motif and domain discovery across a protein family, and variant-effect analysis in the context of the conserved structural and functional positions revealed by the alignment.
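
As one illustration of conservation analysis over alignment columns, the sketch below computes per-column Shannon entropy from a list of aligned strings. It works on plain Python strings and does not use the toolkit's MSA helpers; the function name and the choice to ignore gap characters are assumptions.

```python
import math
from collections import Counter
from typing import List


def column_entropy(aligned: List[str]) -> List[float]:
    """Shannon entropy (bits) per alignment column, ignoring gap characters."""
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("aligned sequences must share one length")
    entropies = []
    for column in zip(*aligned):
        residues = [c for c in column if c != "-"]
        if not residues:
            entropies.append(0.0)  # all-gap column: nothing to measure
            continue
        total = len(residues)
        counts = Counter(residues)
        entropies.append(
            -sum((n / total) * math.log2(n / total) for n in counts.values())
        )
    return entropies
```

Low-entropy columns are candidates for functionally important positions. Ignoring gaps means a column with a single residue and many gaps also scores zero, so in practice an entropy score is usually read alongside a per-column gap fraction.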

Usage Tips

  • align_method="auto" is the default and lets MAFFT select an algorithm based on input size. Use localpair for sequences with a single conserved domain flanked by variable regions, globalpair for full-length homologs of similar length, and genafpair for multi-domain sequences separated by long unalignable regions. The *pair variants align every pair of sequences, so their cost grows as O(N^2) in the number of sequences N; they are appropriate for up to a few hundred sequences.
  • max_iterations=0 (the default) skips iterative refinement. Raise it to enable the full *-INS-i refinement pipeline when paired with one of the *pair methods. A value around 1000 is appropriate for high-accuracy alignments of small to medium datasets.
  • threads=1 is the default; raise it on large alignments. MAFFT parallelises both the all-against-all distance computation and the iterative refinement passes, so increasing the thread count yields substantial wall-time reductions on alignments of hundreds of sequences or more.
  • Inputs must contain at least two non-empty sequences; the input validator raises an error otherwise. When sequence_ids is omitted, identifiers are auto-generated as seq_0, seq_1, and so on.
  • extra_args accepts verbatim mafft CLI tokens. Pass any CLI flag not exposed as a typed field through this list. For example, ["--retree", "3", "--reorder"] builds the guide tree three times during the progressive stage and writes the sequences in aligned order rather than input order. Tokens are inserted before the input FASTA path and take precedence over MAFFT’s own defaults.
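
The tips above can be summarised as a mapping from configuration fields to a mafft command line. The sketch below is an assumption about how such a wrapper might assemble the command, not the toolkit's actual implementation; build_mafft_command is a hypothetical name, while --auto, --localpair, --globalpair, --genafpair, --maxiterate, and --thread are documented mafft flags.

```python
from typing import List, Optional

# Documented mafft flags for each align_method value.
METHOD_FLAGS = {
    "auto": "--auto",
    "localpair": "--localpair",
    "globalpair": "--globalpair",
    "genafpair": "--genafpair",
}


def build_mafft_command(
    input_path: str,
    align_method: str = "auto",
    max_iterations: int = 0,
    threads: int = 1,
    extra_args: Optional[List[str]] = None,
) -> List[str]:
    """Assemble an argv list for mafft (hypothetical wrapper logic)."""
    if align_method not in METHOD_FLAGS:
        raise ValueError(f"unknown align_method: {align_method!r}")
    cmd = ["mafft", METHOD_FLAGS[align_method]]
    if max_iterations > 0:
        # Assumption: the flag is only emitted when refinement is requested.
        cmd += ["--maxiterate", str(max_iterations)]
    cmd += ["--thread", str(threads)]
    cmd += list(extra_args or [])  # verbatim tokens go before the input path
    cmd.append(input_path)
    return cmd
```

With align_method="localpair" and max_iterations=1000 this yields the L-INS-i invocation that the MAFFT documentation gives, mafft --localpair --maxiterate 1000, followed by the thread flag, any extra tokens, and the input path.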

Toolkit Notes

These apply to every MAFFT tool in this toolkit (mafft-align).
  • Outputs are returned as typed MSA objects. The msa field of MafftOutput exposes the aligned sequences, their identifiers, alignment dimensions, column-level conservation statistics, and gap-statistics properties. The result serialises to FASTA or A3M through the standard export interface.
Example notebook: See the full working example for a copy-paste-ready walkthrough.
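
For readers unfamiliar with A3M, the sketch below shows the general convention for deriving A3M rows from an aligned FASTA: the first row acts as the query, columns where the query has a residue are match columns, residues in other columns become lowercase insertions, and gaps in those insert columns are dropped. It is an illustration of the format under those assumptions, not the toolkit's exporter, and to_a3m_rows is a hypothetical name.

```python
from typing import List


def to_a3m_rows(aligned: List[str]) -> List[str]:
    """Convert equal-length aligned rows to A3M rows, using row 0 as the query."""
    if not aligned or len({len(s) for s in aligned}) != 1:
        raise ValueError("need at least one row, all of equal length")
    query = aligned[0]
    match_cols = [c != "-" for c in query]
    rows = []
    for seq in aligned:
        out = []
        for char, is_match in zip(seq, match_cols):
            if is_match:
                out.append(char.upper())  # match column: keep residue or gap
            elif char != "-":
                out.append(char.lower())  # insertion relative to the query
            # gaps in insert columns are dropped
        rows.append("".join(out))
    return rows
```

Given the rows AC-GT, ACTGT, and A--GT, the query becomes ACGT, the second row becomes ACtGT with its extra residue in lowercase, and the third becomes A-GT. Rows may therefore differ in length, which is why A3M is more compact than aligned FASTA for alignments with many insertions.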

Infrastructure Guides

The following guides cover how to run tools efficiently and at scale.

Tool Persistence

Keep a tool’s model warm across calls instead of reloading it every invocation.

Device Management

How GPUs are allocated to tools and how to target specific devices.

Parallel Execution

Fan a batch of inputs out across multiple GPUs.

Cloud Inference

Run tools on managed cloud infrastructure with no local setup.