MAFFT - Proto

License: MAFFT is open source and free for academic and commercial use under a BSD-3-Clause license. Please refer to the license for full terms.

Proto is not affiliated with RIMD. This toolkit is open source and builds on the implementation produced by this organization. Product names, logos, and trademarks are the property of their respective owners.


GSLBiotech/mafft

Align multiple amino acid or nucleotide sequences.


MAFFT multiple sequence alignment software version 7: improvements in performance and usability

Kazutaka Katoh and Daron M Standley

Molecular Biology and Evolution (2013)


@article{katoh2013mafft,
  title={MAFFT multiple sequence alignment software version 7: improvements in performance and usability},
  author={Katoh, Kazutaka and Standley, Daron M},
  journal={Molecular Biology and Evolution},
  volume={30},
  number={4},
  pages={772--780},
  year={2013},
  publisher={Oxford University Press},
  doi={10.1093/molbev/mst010}
}


proto-bio/proto-tools/proto_tools/tools/sequence_alignment/mafft



Function	Description
`run_mafft_align()`	Multiple sequence alignment (MSA) using MAFFT (Multiple Alignment using Fast Fourier Transform)

Background

MAFFT (Katoh and Standley, 2013) is a multiple sequence alignment program that builds an alignment by progressive alignment along a guide tree, followed by optional iterative refinement. Pairwise distances between input sequences are first estimated rapidly using either k-mer counting or a Fast Fourier Transform that detects homologous segments in compositionally transformed sequences. A guide tree is built from these distances, sequences are progressively aligned along the tree, and the alignment is optionally refined by an iterative cycle that repeatedly removes and re-aligns subsets of sequences.

MAFFT exposes several algorithm variants that differ in pairwise scoring and refinement strategy:
FFT-NS-i. A fast progressive method with iterative refinement, appropriate for large datasets. MAFFT's own command-line default is the non-iterative FFT-NS-2; this toolkit defaults to auto.
L-INS-i (localpair). Local pairwise alignment with iterative refinement, appropriate for sequences with one alignable domain flanked by variable regions.
G-INS-i (globalpair). Global pairwise alignment with iterative refinement, appropriate for sequences of similar length.
E-INS-i (genafpair). A local-alignment variant that handles sequences with multiple conserved domains separated by long unalignable regions.
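The k-mer counting step described above can be illustrated with a minimal sketch. This is a simplification for intuition only, not MAFFT's actual distance measure, which differs in detail. The function names `kmer_counts`, `kmer_distance`, and `distance_matrix` are hypothetical and not part of MAFFT or this toolkit.

```python
from collections import Counter
from itertools import combinations
from typing import List


def kmer_counts(seq: str, k: int = 6) -> Counter:
    """Count every overlapping k-mer in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_distance(a: str, b: str, k: int = 6) -> float:
    """Return 1 - (shared k-mers / k-mers in the shorter sequence).

    Illustrative only: this is a toy stand-in for a fast,
    alignment-free distance, not MAFFT's real formula.
    """
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    shared = sum((ca & cb).values())  # multiset intersection
    shortest = min(sum(ca.values()), sum(cb.values()))
    if shortest == 0:  # a sequence shorter than k has no k-mers
        return 1.0
    return 1.0 - shared / shortest


def distance_matrix(seqs: List[str], k: int = 6) -> List[List[float]]:
    """Symmetric all-against-all distance matrix with a zero diagonal."""
    n = len(seqs)
    d = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d[i][j] = d[j][i] = kmer_distance(seqs[i], seqs[j], k)
    return d
```

A guide tree would then be built from such a matrix (for example by a clustering method such as UPGMA), which is why the distance step only needs to be fast and approximately right rather than exact.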

Learning Resources

MAFFT software homepage (Osaka University). Official distribution site and user documentation for the command-line program that this toolkit invokes.
MAFFT algorithm comparison (Osaka University). A side-by-side comparison of the alignment algorithm variants that the align_method field selects.
MAFFT online server (Osaka University). Hosted entry point to the same MAFFT pipeline, useful for a quick browser-based alignment before scripting against the tool.

Tools

MAFFT Alignment (`mafft-align`)

Performs multiple sequence alignment over two or more input sequences using the bundled mafft command-line program. The selected algorithm variant is controlled by the align_method configuration field. The tool returns a typed MSA object containing the aligned sequences and their identifiers, with helpers for column statistics and serialisation to FASTA or A3M.

API Reference


Input: MafftInput

sequences

List[string]

required

List of sequence strings (protein or nucleotide) to align. At least 2 sequences are required for alignment.

sequence_ids

array

Optional list of sequence identifiers. If not provided, sequences are assigned sequential IDs (seq_0, seq_1, …).
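The input rules above (at least two sequences, sequential default IDs) can be sketched as a small standalone check. This is an illustration under stated assumptions, not the toolkit's validator: `prepare_inputs` is a hypothetical helper, and the length-mismatch check between `sequences` and `sequence_ids` is an assumption that the documentation does not spell out.

```python
from typing import List, Optional, Tuple


def prepare_inputs(
    sequences: List[str],
    sequence_ids: Optional[List[str]] = None,
) -> Tuple[List[str], List[str]]:
    """Validate sequences and fill in default identifiers (hypothetical helper)."""
    if len(sequences) < 2:
        raise ValueError("at least 2 sequences are required for alignment")
    if any(not s.strip() for s in sequences):
        raise ValueError("sequences must be non-empty")
    if sequence_ids is None:
        # Documented default naming: seq_0, seq_1, ...
        sequence_ids = [f"seq_{i}" for i in range(len(sequences))]
    elif len(sequence_ids) != len(sequences):
        # Assumption: one identifier per sequence is expected.
        raise ValueError("sequence_ids must match sequences in length")
    return sequences, sequence_ids
```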


Config: MafftConfig

align_method

enum

default:"auto"

"auto" (MAFFT picks by input size), "localpair" (L-INS-i), "globalpair" (G-INS-i), or "genafpair" (E-INS-i). Available options: auto, localpair, globalpair, genafpair

max_iterations

integer

default:"0"

Maximum number of iterative-refinement cycles. 0 disables refinement; a value around 1000, combined with one of the *pair methods, enables the full *-INS-i pipelines.

threads

integer

default:"1"

Number of CPU threads for parallel processing.

extra_args

List[string]

default:"[]"

Verbatim mafft CLI tokens for niche flags (e.g. ["--retree", "3", "--reorder"]).

verbose

integer

default:"0"

Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.

device

string

default:"cpu"

Device to run the tool on.

timeout

integer

default:"600"

Maximum execution time in seconds. None waits indefinitely.

seed

integer

Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.


Output: MafftOutput

msa

MSA

required

The multiple sequence alignment result containing aligned sequences, sequence IDs, and original unaligned sequences.


aligned_sequences

List[string]

required

Aligned sequences with - characters indicating gaps. All sequences must have the same length.

sequence_ids

List[string]

required

Identifiers for each sequence. Auto-generated as seq_0, seq_1, … if not provided.
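The two invariants stated for the MSA fields (equal aligned lengths, one identifier per row) are easy to check on plain lists. The sketch below uses hypothetical helper names, `check_alignment` and `ungap`, and works on ordinary Python strings rather than the toolkit's MSA type.

```python
from typing import List


def check_alignment(aligned_sequences: List[str], sequence_ids: List[str]) -> int:
    """Verify the alignment invariants and return the alignment length."""
    if len(aligned_sequences) != len(sequence_ids):
        raise ValueError("one identifier is expected per aligned sequence")
    lengths = {len(s) for s in aligned_sequences}
    if len(lengths) > 1:
        raise ValueError(f"aligned sequences differ in length: {sorted(lengths)}")
    return lengths.pop() if lengths else 0


def ungap(aligned: str) -> str:
    """Recover the unaligned sequence by removing '-' gap characters."""
    return aligned.replace("-", "")
```

Removing the gaps from an aligned row should give back the corresponding input sequence, which makes `ungap` a cheap sanity check after an alignment run.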

Applications

This tool is appropriate for any analysis that benefits from a multiple sequence alignment of homologous protein or nucleotide sequences. Common downstream uses include phylogenetic-tree inference, conservation analysis over alignment columns to identify functionally important residues, homology modelling against a related reference, motif and domain discovery across a protein family, and variant-effect analysis in the context of the conserved structural and functional positions revealed by the alignment.

Usage Tips

align_method="auto" is the default and lets MAFFT select an algorithm based on input size. Use localpair for sequences with a single conserved domain flanked by variable regions, globalpair for full-length homologs of similar length, and genafpair for multi-domain sequences separated by long unalignable regions. The *pair variants compute all-against-all pairwise alignments, so their cost grows as O(N^2) in the number of sequences N; they are appropriate for up to a few hundred sequences.
max_iterations=0 (the default) skips iterative refinement. Raise it to enable the full *-INS-i refinement pipeline when paired with one of the *pair methods. A value around 1000 is appropriate for high-accuracy alignments of small to medium datasets.
threads=1 is the default; raise it on large alignments. MAFFT parallelises both the all-against-all distance computation and the iterative refinement passes, so increasing the thread count yields substantial wall-time reductions on alignments of hundreds of sequences or more.
Inputs must contain at least two non-empty sequences. The input validator hard-errors otherwise. Auto-generated identifiers default to seq_0, seq_1, and so on when sequence_ids is omitted.
extra_args accepts verbatim mafft CLI tokens. Pass any CLI flag not exposed as a typed field through this list (for example ["--retree", "3", "--reorder"] to build the guide tree three times and output sequences in aligned order rather than input order). Tokens are inserted before the input FASTA path and take precedence over MAFFT’s own defaults.
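To make the tips above concrete, the sketch below shows one plausible way the configuration fields could map onto a mafft command line. It is an assumption-laden illustration, not the toolkit's actual implementation: `build_mafft_command` is a hypothetical helper, and omitting `--maxiterate` when `max_iterations` is 0 is a guess. The flags themselves (`--auto`, `--localpair`, `--globalpair`, `--genafpair`, `--maxiterate`, `--thread`) are standard MAFFT options.

```python
from typing import List, Optional

# Mapping from align_method values to standard MAFFT flags.
METHOD_FLAGS = {
    "auto": "--auto",
    "localpair": "--localpair",
    "globalpair": "--globalpair",
    "genafpair": "--genafpair",
}


def build_mafft_command(
    input_fasta: str,
    align_method: str = "auto",
    max_iterations: int = 0,
    threads: int = 1,
    extra_args: Optional[List[str]] = None,
) -> List[str]:
    """Assemble an argv list for mafft (hypothetical helper; does not run it)."""
    if align_method not in METHOD_FLAGS:
        raise ValueError(f"unknown align_method: {align_method!r}")
    cmd = ["mafft", METHOD_FLAGS[align_method]]
    if max_iterations > 0:
        # Assumption: refinement is requested only for a positive count.
        cmd += ["--maxiterate", str(max_iterations)]
    cmd += ["--thread", str(threads)]
    # Verbatim tokens go before the input path, as the tip above describes.
    cmd += list(extra_args or [])
    cmd.append(input_fasta)
    return cmd
```

For example, `build_mafft_command("in.fasta", "localpair", 1000)` yields the equivalent of `mafft --localpair --maxiterate 1000 --thread 1 in.fasta`, which corresponds to the L-INS-i strategy.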

Toolkit Notes

These notes apply to every MAFFT tool in this toolkit (mafft-align).

Outputs are returned as typed MSA objects. The msa field of MafftOutput exposes the aligned sequences, their identifiers, alignment dimensions, column-level conservation statistics, and gap-statistics properties. The result serialises to FASTA or A3M through the standard export interface.
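For readers who want to see what FASTA serialisation and column-level conservation amount to, here is a minimal standalone sketch on plain lists of strings. The helpers `to_fasta` and `column_conservation` are hypothetical and are not the MSA object's own methods, and the conservation score used here (fraction of rows holding the most common non-gap residue) is only one of several possible definitions.

```python
from collections import Counter
from typing import List


def to_fasta(sequence_ids: List[str], aligned_sequences: List[str], width: int = 60) -> str:
    """Serialise an alignment as FASTA text, wrapping rows at `width` columns."""
    lines = []
    for sid, seq in zip(sequence_ids, aligned_sequences):
        lines.append(f">{sid}")
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


def column_conservation(aligned_sequences: List[str]) -> List[float]:
    """Per column: fraction of rows holding the most common non-gap residue."""
    n = len(aligned_sequences)
    scores = []
    for column in zip(*aligned_sequences):
        residues = Counter(c for c in column if c != "-")
        top = residues.most_common(1)[0][1] if residues else 0
        scores.append(top / n)  # gaps count against conservation
    return scores
```

Columns scoring close to 1.0 are candidates for functionally important positions, which is the conservation analysis mentioned under Applications.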

Example notebook: See the full working example for a copy-paste-ready walkthrough.

Infrastructure Guides

The following guides cover how to run tools efficiently and at scale.

Tool Persistence

Keep a tool’s model warm across calls instead of reloading it every invocation.

Device Management

How GPUs are allocated to tools and how to target specific devices.

Parallel Execution

Fan a batch of inputs out across multiple GPUs.

Cloud Inference

Run tools on managed cloud infrastructure with no local setup.