License: BLAST is licensed under Custom (NCBI BLAST+ public domain). Please refer to the license for full terms.

Proto is not affiliated with NCBI. This toolkit is open source and builds on the implementation produced by this organization. Product names, logos, and trademarks are the property of their respective owners.


blast.ncbi.nlm.nih.gov
Basic local alignment search tool
Stephen F Altschul, Warren Gish, … David J Lipman
Journal of Molecular Biology (1990)
@article{altschul1990blast,
  title={Basic local alignment search tool},
  author={Altschul, Stephen F and Gish, Warren and Miller, Webb and Myers, Eugene W and Lipman, David J},
  journal={Journal of Molecular Biology},
  volume={215},
  number={3},
  pages={403--410},
  year={1990},
  publisher={Elsevier},
  doi={10.1016/S0022-2836(05)80360-2}
}
proto-bio/proto-tools/proto_tools/tools/sequence_alignment/blast
Function                 Description
run_create_blast_db()    Create a local BLAST database from a FASTA file
run_blast_search()       Search sequences against BLAST databases (online or local)

Background

BLAST (Altschul et al., 1990) performs sequence-similarity search with a heuristic that approximates exhaustive Smith-Waterman local alignment at a fraction of its computational cost. The query is first broken into short fixed-length words, and word matches are located in the database (exact matches for nucleotide searches, high-scoring neighborhood words for protein searches). Each match is then extended in both directions until the running alignment score falls a set amount below the best score seen so far. The statistical significance of each surviving alignment is expressed as an E-value derived from Karlin-Altschul statistics: the number of alignments with at least the observed score that would be expected to occur by chance in a database of the given size.

BLAST supports five program variants that pair query and database types:

  • blastn aligns a nucleotide query against a nucleotide database.
  • blastp aligns a protein query against a protein database.
  • blastx translates a nucleotide query and aligns the translations against a protein database.
  • tblastn aligns a protein query against a database of translated nucleotide sequences.
  • tblastx translates both a nucleotide query and a nucleotide database and aligns the translations.

The toolkit’s local execution mode uses the NCBI BLAST+ command-line distribution (Camacho et al., 2009), which provides the blastn, blastp, blastx, tblastn, tblastx, and makeblastdb programs that this toolkit invokes. The remote execution mode dispatches to the public NCBI BLAST web service through the QBLAST API.
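The seed-and-extend idea and the E-value described above can be illustrated with a small Python sketch. This is a teaching example under simplifying assumptions, not the BLAST+ implementation: it uses exact word seeds only, ungapped extension, and arbitrary match/mismatch scores, and it omits neighborhood words, two-hit seeding, and gapped extension. The function names (index_words, extend, seed_and_extend, e_value) are hypothetical, and the lambda and K values passed to e_value are placeholders that in practice depend on the scoring system.

```python
import math
from collections import defaultdict


def index_words(subject: str, w: int) -> dict:
    """Map every length-w word in the subject to its start positions."""
    index = defaultdict(list)
    for i in range(len(subject) - w + 1):
        index[subject[i:i + w]].append(i)
    return index


def extend(query, subject, q, s, step, x_drop, match, mismatch):
    """Ungapped extension in one direction.

    Returns (best score gain, number of residues kept). Extension stops
    once the running score falls more than x_drop below the best seen.
    """
    best = score = best_len = n = 0
    while 0 <= q < len(query) and 0 <= s < len(subject):
        score += match if query[q] == subject[s] else mismatch
        n += 1
        if score > best:
            best, best_len = score, n
        elif best - score > x_drop:
            break
        q += step
        s += step
    return best, best_len


def seed_and_extend(query, subject, w=4, x_drop=3, match=1, mismatch=-2):
    """Return the best ungapped hit as (score, query_start, subject_start, length)."""
    index = index_words(subject, w)
    best_hit = None
    for q in range(len(query) - w + 1):
        for s in index.get(query[q:q + w], []):
            right, r_len = extend(query, subject, q + w, s + w, 1, x_drop, match, mismatch)
            left, l_len = extend(query, subject, q - 1, s - 1, -1, x_drop, match, mismatch)
            hit = (w * match + left + right, q - l_len, s - l_len, w + l_len + r_len)
            if best_hit is None or hit[0] > best_hit[0]:
                best_hit = hit
    return best_hit


def e_value(score, m, n, lam, k):
    """Karlin-Altschul expectation E = K * m * n * exp(-lambda * S)."""
    return k * m * n * math.exp(-lam * score)
```

Here m and n stand for the query length and the total database length, so the same raw score yields a larger E-value as the database grows.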

Learning Resources

  • NCBI BLAST web service (NCBI). The public hosted interface that the remote execution mode targets, useful for an interactive run before scripting against the tool.
  • NCBI BLAST+ User Manual (NCBI Bookshelf). The reference manual for the command-line distribution that the local execution mode runs.

Tools

Create BLAST Database (blast-create-db)

Builds a local BLAST database from a FASTA file using the BLAST+ makeblastdb program. The output is a set of indexed files referenced by a common stem path. The stem path is returned as db_path and can be passed directly as local_db to blast-search.

API Reference

fasta
string
required
Path to a FASTA file containing the sequences to be indexed into a BLAST database. The file must exist and contain valid FASTA-formatted sequences. For nucleotide databases, sequences should be DNA or RNA. For protein databases, sequences should be amino acids.
dbtype
enum
default:"nucl"
"nucl" for DNA/RNA, "prot" for protein. Must match the input FASTA. Available options: nucl, prot
out_prefix
string
File-path prefix for generated DB files; None falls back to the input FASTA stem.
title
string
Descriptive DB title shown in BLAST reports; makeblastdb falls back to the input file name when None.
parse_seqids
boolean
default:"False"
Parse FASTA seq IDs so blastdbcmd can address sequences by ID; required for v5 taxonomy lookups.
hash_index
boolean
default:"False"
Create a hash index of seq IDs (faster ID lookups).
blastdb_version
enum
default:"5"
DB format version. 5 (taxonomy-aware) is the upstream default since BLAST+ 2.10. Available options: 4, 5
max_file_sz
string
default:"1GB"
Max size per DB volume with a unit suffix (e.g. "1GB"); upstream caps at "4GB".
taxid
integer
NCBI taxonomy ID assigned to every sequence; set to tag a single-organism DB.
extra_args
List[string]
default:"[]"
Extra makeblastdb CLI tokens passed verbatim (e.g. ["-mask_data", "/path/to/mask"]). Escape hatch for flags not exposed as typed fields above.
verbose
integer
default:"0"
Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.
device
string
default:"cpu"
Device to run the tool on.
timeout
integer
default:"600"
Maximum execution time in seconds. None waits indefinitely.
seed
integer
Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.
db_path
string
required
The base path to the generated BLAST database files (without file extensions), returned as the tool's output. This path can be used directly as the value for the local_db parameter in BlastSearchConfig. For example, if db_path is "/data/mydb", makeblastdb will have created files such as "/data/mydb.nhr", "/data/mydb.nin", and "/data/mydb.nsq" for a nucleotide database, or "/data/mydb.phr", "/data/mydb.pin", and "/data/mydb.psq" for a protein database.
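To make the parameter list above concrete, the following sketch assembles a makeblastdb argument list from the same fields. The flags used (-in, -dbtype, -out, -title, -parse_seqids, -hash_index, -blastdb_version, -max_file_sz, -taxid) are standard makeblastdb options, but the helper name makeblastdb_command and the exact field-to-flag mapping are assumptions for illustration; the toolkit's own command construction may differ.

```python
from pathlib import Path
from typing import List, Optional


def makeblastdb_command(
    fasta: str,
    dbtype: str = "nucl",
    out_prefix: Optional[str] = None,
    title: Optional[str] = None,
    parse_seqids: bool = False,
    hash_index: bool = False,
    blastdb_version: int = 5,
    max_file_sz: str = "1GB",
    taxid: Optional[int] = None,
    extra_args: Optional[List[str]] = None,
) -> List[str]:
    """Build (but do not run) a makeblastdb command line. Hypothetical helper."""
    if dbtype not in ("nucl", "prot"):
        raise ValueError(f"dbtype must be 'nucl' or 'prot', got {dbtype!r}")
    # Assumed fallback: the FASTA path with its final suffix removed.
    out = out_prefix if out_prefix is not None else str(Path(fasta).with_suffix(""))
    cmd = [
        "makeblastdb", "-in", fasta, "-dbtype", dbtype, "-out", out,
        "-blastdb_version", str(blastdb_version), "-max_file_sz", max_file_sz,
    ]
    if title is not None:
        cmd += ["-title", title]
    if parse_seqids:
        cmd.append("-parse_seqids")
    if hash_index:
        cmd.append("-hash_index")
    if taxid is not None:
        cmd += ["-taxid", str(taxid)]
    # Extra tokens are appended verbatim, mirroring the extra_args escape hatch.
    cmd += list(extra_args or [])
    return cmd
```

The returned list can be handed to subprocess.run(cmd, check=True, timeout=600) on a machine where makeblastdb is on the PATH; the value passed to -out then plays the role of db_path.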

Applications

This tool is the prerequisite for any local BLAST workflow that searches against a custom reference set, such as an in-house genome assembly, a curated subset of a public database, or a panel of designed sequences. Building a local database once and reusing it across many queries avoids repeated network traffic to NCBI and gives full control over the reference content.

Usage Tips

  • dbtype must match the input FASTA type. Use "nucl" for nucleotide sequences and "prot" for amino-acid sequences. The configuration validator hard-errors on any other value, and a mismatch against the FASTA content will be caught by makeblastdb at runtime.
  • out_prefix defaults to the input FASTA stem in the same directory. Set it explicitly when the database should live in a different location or under a different name.
  • parse_seqids=True is required for FASTA identifiers to be addressable. Enable it when downstream calls need to retrieve sequences by identifier through blastdbcmd or when building a taxonomy-aware database. Pair it with hash_index=True for faster identifier lookups.
  • extra_args accepts verbatim makeblastdb CLI tokens. Use it for niche flags not exposed as typed fields, such as ["-mask_data", "/path/to/mask"] for premasking input or ["-taxid_map", "..."] for assigning taxonomy IDs per sequence from a mapping file.
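Because a dbtype mismatch only surfaces when makeblastdb runs, a quick pre-check on the FASTA content can catch it earlier. The sketch below is a heuristic, not part of the toolkit: the function name guess_dbtype, the nucleotide alphabet (A, C, G, T, U, N), and the 0.9 threshold are all assumptions, and short or unusual protein sequences could still be misclassified.

```python
def guess_dbtype(fasta_text: str, threshold: float = 0.9) -> str:
    """Guess "nucl" or "prot" from the residue composition of FASTA text.

    Heuristic: if at least `threshold` of the letters are A, C, G, T, U or N,
    treat the input as nucleotide; otherwise treat it as protein.
    """
    nucleotide = set("ACGTUN")
    total = 0
    nuc = 0
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and FASTA header lines
        for ch in line.upper():
            if ch.isalpha():
                total += 1
                nuc += ch in nucleotide
    if total == 0:
        raise ValueError("no sequence residues found")
    return "nucl" if nuc / total >= threshold else "prot"
```

A caller might compare the guess with the configured dbtype and warn on disagreement rather than fail, since the heuristic is not authoritative.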

Toolkit Notes

These apply to every BLAST tool in this toolkit (blast-search, blast-create-db).
  • Hits use the standard BLAST -outfmt 6 tabular schema. Each BlastHit carries the twelve canonical fields qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, and bitscore. pident is reported on a 0-to-100 scale.
  • The local installation downloads the platform-specific NCBI BLAST+ distribution on first use. The standalone setup pulls the appropriate NCBI BLAST+ tarball and extracts the blastn, blastp, blastx, tblastn, tblastx, and makeblastdb executables. No reference database is bundled, so local execution requires either a user-built database from blast-create-db or a separately downloaded NCBI database.
  • The two tools differ in execution mode. blast-search supports both online (search_mode="online", the default) and local (search_mode="local") execution. blast-create-db runs only locally because the NCBI web service does not expose makeblastdb.
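For readers working with raw -outfmt 6 output outside the toolkit, the twelve-column schema listed above can be parsed with a few lines of Python. The TabularHit dataclass and parse_outfmt6 function are illustrative names, not the toolkit's BlastHit class; the sketch assumes tab-separated lines with exactly the twelve default columns and skips blank lines and lines starting with "#".

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TabularHit:
    """One row of default BLAST -outfmt 6 output (hypothetical container)."""
    qseqid: str
    sseqid: str
    pident: float   # percent identity on a 0-to-100 scale
    length: int
    mismatch: int
    gapopen: int
    qstart: int
    qend: int
    sstart: int
    send: int
    evalue: float
    bitscore: float


def parse_outfmt6(text: str) -> List[TabularHit]:
    """Parse tab-separated outfmt 6 text into a list of TabularHit records."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 12:
            raise ValueError(f"expected 12 columns, got {len(fields)}")
        hits.append(
            TabularHit(
                fields[0],
                fields[1],
                float(fields[2]),
                *(int(x) for x in fields[3:10]),
                float(fields[10]),
                float(fields[11]),
            )
        )
    return hits
```

Note that sstart can be greater than send for nucleotide hits on the reverse strand, so downstream code should not assume the coordinates are ordered.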
Example notebook: See the full working example for a copy-paste-ready walkthrough.

Infrastructure Guides

The following guides cover how to run tools efficiently and at scale.

Tool Persistence

Keep a tool’s model warm across calls instead of reloading it every invocation.

Device Management

How GPUs are allocated to tools and how to target specific devices.

Parallel Execution

Fan a batch of inputs out across multiple GPUs.

Cloud Inference

Run tools on managed cloud infrastructure with no local setup.