BLAST - Proto

License: BLAST is licensed under Custom (NCBI BLAST+ public domain). Please refer to the license for full terms.

Proto is not affiliated with NCBI. This toolkit is open source and builds on the implementation produced by this organization. Product names, logos, and trademarks are the property of their respective owners.


blast.ncbi.nlm.nih.gov


Basic local alignment search tool

Stephen F Altschul, Warren Gish, … David J Lipman

Journal of Molecular Biology (1990)


@article{altschul1990blast,
  title={Basic local alignment search tool},
  author={Altschul, Stephen F and Gish, Warren and Miller, Webb and Myers, Eugene W and Lipman, David J},
  journal={Journal of Molecular Biology},
  volume={215},
  number={3},
  pages={403--410},
  year={1990},
  publisher={Elsevier},
  doi={10.1016/S0022-2836(05)80360-2}
}


proto-bio/proto-tools/proto_tools/tools/sequence_alignment/blast



Function	Description
`run_create_blast_db()`	Create a local BLAST database from a FASTA file
`run_blast_search()`	Search sequences against BLAST databases (online or local)

Background

BLAST (Altschul et al., 1990) performs sequence-similarity search with a heuristic algorithm that approximates exhaustive Smith-Waterman local alignment at a fraction of its computational cost. The query is first broken into short fixed-length words, matching words are located in the database (exact matches for nucleotide searches, words scoring above a threshold for protein searches), and each match is extended in both directions until the running alignment score falls a set amount below the best score seen so far. The statistical significance of each surviving alignment is expressed as an E-value derived from Karlin-Altschul statistics: the number of alignments with at least the observed score that would be expected to occur by chance in a database of the given size.

BLAST supports five program variants that pair query and database types:

blastn aligns a nucleotide query against a nucleotide database.
blastp aligns a protein query against a protein database.
blastx translates a nucleotide query and aligns the translations against a protein database.
tblastn aligns a protein query against a database of translated nucleotide sequences.
tblastx translates both query and database.

The toolkit’s local execution mode uses the NCBI BLAST+ command-line distribution (Camacho et al., 2009), which provides the blastn, blastp, blastx, tblastn, tblastx, and makeblastdb programs that this toolkit invokes. The remote execution mode dispatches to the public NCBI BLAST web service through the QBLAST API.
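The seed-and-extend idea described above can be illustrated with a toy sketch. This is not BLAST's implementation: it uses exact word matches only, extends without gaps, and the function names and scoring values (match=1, mismatch=-3, x_drop=5) are illustrative assumptions.

```python
from collections import defaultdict


def index_words(subject: str, w: int) -> dict[str, list[int]]:
    """Map every length-w word in the subject to its start positions."""
    index = defaultdict(list)
    for i in range(len(subject) - w + 1):
        index[subject[i:i + w]].append(i)
    return index


def extend_one_way(query, subject, qi, si, step, match, mismatch, x_drop):
    """Extend from (qi, si) in direction `step`; return (best_gain, best_len).

    Extension stops once the running score has fallen more than x_drop
    below the best score seen so far, or a sequence end is reached.
    """
    gain = best_gain = 0
    length = best_len = 0
    while 0 <= qi < len(query) and 0 <= si < len(subject):
        gain += match if query[qi] == subject[si] else mismatch
        length += 1
        if gain > best_gain:
            best_gain, best_len = gain, length
        elif best_gain - gain > x_drop:
            break
        qi += step
        si += step
    return best_gain, best_len


def seed_and_extend(query, subject, w=4, match=1, mismatch=-3, x_drop=5):
    """Return ungapped hits as (score, query_start, subject_start, length)."""
    index = index_words(subject, w)
    hits = set()
    for q in range(len(query) - w + 1):
        for s in index.get(query[q:q + w], []):
            r_gain, r_len = extend_one_way(
                query, subject, q + w, s + w, +1, match, mismatch, x_drop)
            l_gain, l_len = extend_one_way(
                query, subject, q - 1, s - 1, -1, match, mismatch, x_drop)
            score = w * match + r_gain + l_gain
            hits.add((score, q - l_len, s - l_len, w + l_len + r_len))
    return sorted(hits, reverse=True)


best = seed_and_extend("ACGTACGTTT", "GGACGTACGTCC")[0]
print(best)  # (8, 0, 2, 8)
```

Several seeds on the same diagonal collapse into one hit because every one of them extends to the same segment, which is why the results are collected in a set.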

Learning Resources

NCBI BLAST web service (NCBI). The public hosted interface that the remote execution mode targets, useful for an interactive run before scripting against the tool.
NCBI BLAST+ User Manual (NCBI Bookshelf). The reference manual for the command-line distribution that the local execution mode runs.

Tools

BLAST Search (`blast-search`)

Aligns a query sequence against a reference database and returns the resulting hits. The remote execution mode submits the query to the NCBI BLAST web service through the QBLAST API. The local execution mode invokes the appropriate BLAST+ program (blastn, blastp, blastx, tblastn, or tblastx) against a user-supplied database. The query field accepts either a raw nucleotide or protein sequence string or a path to a FASTA file, and the input form is detected automatically.
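The source does not describe how the automatic detection works. One plausible rule, sketched here purely as an assumption, is to treat the query as a FASTA path when it names an existing file and as a raw sequence otherwise. The function name `detect_query_type` is hypothetical and not part of the toolkit.

```python
from pathlib import Path


def detect_query_type(query: str) -> str:
    """Return 'fasta_path' if the query names an existing file, else 'sequence'.

    Hypothetical rule for illustration; the toolkit's validator may differ.
    """
    candidate = query.strip()
    if not candidate or "\n" in candidate:
        return "sequence"
    try:
        is_file = Path(candidate).is_file()
    except (OSError, ValueError):
        # Very long sequences are not valid file names on most systems.
        is_file = False
    return "fasta_path" if is_file else "sequence"


print(detect_query_type("ATGCGTAAA"))
```

Under this rule a mistyped path is silently treated as a sequence, so checking the resulting query_type after validation is a cheap sanity check.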

API Reference


Input: BlastSearchInput

query

string

required

A raw nucleotide/protein sequence (e.g. "ATGCGTAAA") or a path to a FASTA file.

query_type

enum

default:"sequence"

Automatically set to "sequence" or "fasta_path" during validation. Read-only; do not set manually. Available options: sequence, fasta_path


Config: BlastSearchConfig

search_mode

enum

default:"online"

"online" routes to NCBI QBLAST; "local" runs the BLAST+ CLI against a local database. Available options: online, local

program

enum

default:"blastn"

BLAST algorithm (blastn, blastp, blastx, tblastn, tblastx). Available options: blastn, blastp, blastx, tblastn, tblastx

database

enum

default:"nt"

NCBI database to search (online only). Available options: nt, nr, refseq_rna, refseq_protein, swissprot, pdb, pataa, patnt

entrez_query

string

Restrict online search with an Entrez query.

hitlist_size

integer

Number of hits to return (online only).

megablast

boolean

Use MegaBLAST (online, blastn only).

local_db

string

Path to a local BLAST database (local only, required).

num_threads

integer

default:"4"

CPU threads for local search.

evalue

number

E-value threshold (both modes).

word_size

integer

Word size for initial matches (both modes).

gapopen

integer

Cost to open a gap (both modes).

gapextend

integer

Cost to extend a gap (both modes).

matrix

string

Scoring matrix for protein searches (both modes).

reward

integer

Nucleotide match reward (blastn only, both modes).

penalty

integer

Nucleotide mismatch penalty (blastn only, both modes).

threshold

integer

Min word score for lookup table (protein only, both modes).

comp_based_stats

integer

Composition-based stats mode (protein only, both modes).

max_target_seqs

integer

Max aligned sequences to keep (local only).

perc_identity

number

Min percent identity filter (both modes).

qcov_hsp_perc

number

Min query coverage per HSP (local only).

soft_masking

boolean

Soft masking for initial matches (local only).

lcase_masking

boolean

Treat lowercase in FASTA as masked (both modes).

dust

string

Low-complexity filter for nucleotide queries (local only).

seg

string

Low-complexity filter for protein queries (local only).

task

string

Task preset (local only).

ungapped

boolean

Ungapped alignment only (both modes).

strand

string

Query strand (local only; for blastn/blastx/tblastx).

query_gencode

integer

Genetic code for translating query (blastx/tblastx, both modes).

db_gencode

integer

Genetic code for translating DB (tblastn/tblastx, both modes).

extra_args

List[string]

default:"[]"

Verbatim BLAST+ CLI tokens for niche flags not exposed above (e.g. ["-max_hsps", "1"]). Local mode only; online mode goes through NCBIWWW.qblast which doesn’t accept arbitrary CLI tokens.

verbose

integer

default:"0"

Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.

device

string

default:"cpu"

Device to run the tool on.

timeout

integer

default:"600"

Maximum execution time in seconds. None waits indefinitely.

seed

integer

Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.
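For local execution, typed fields such as the ones above have to be turned into a BLAST+ command line at some point. The following is a minimal sketch of such a translation, not the toolkit's actual code: the function name `build_blast_command` is hypothetical, only a handful of fields are covered, and the output format is fixed to tabular (`-outfmt 6`) on the assumption that hits are parsed from it.

```python
PROGRAMS = {"blastn", "blastp", "blastx", "tblastn", "tblastx"}


def build_blast_command(program, query_fasta, local_db, *, evalue=None,
                        word_size=None, max_target_seqs=None,
                        num_threads=4, extra_args=()):
    """Assemble an argv list for a local BLAST+ search (illustrative only)."""
    if program not in PROGRAMS:
        raise ValueError(f"unknown BLAST program: {program!r}")
    cmd = [program, "-query", query_fasta, "-db", local_db,
           "-outfmt", "6", "-num_threads", str(num_threads)]
    # Optional flags are emitted only when the field was set.
    optional = {
        "-evalue": evalue,
        "-word_size": word_size,
        "-max_target_seqs": max_target_seqs,
    }
    for flag, value in optional.items():
        if value is not None:
            cmd += [flag, str(value)]
    # extra_args are appended verbatim, mirroring the field description.
    cmd += list(extra_args)
    return cmd


print(build_blast_command("blastn", "query.fa", "/data/mydb",
                          evalue=1e-5, extra_args=["-max_hsps", "1"]))
```

The resulting list can be handed to `subprocess.run(cmd, capture_output=True, text=True, timeout=600)`. Passing a list rather than a shell string avoids quoting problems with paths that contain spaces.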


Output: BlastSearchOutput

hits

List[BlastHit]

BLAST alignment hits with standard tabular fields.

BlastHit

qseqid

string

required

Query sequence ID.

sseqid

string

required

Subject sequence ID.

pident

number

required

Percentage of identical matches.

length

integer

required

Alignment length.

mismatch

integer

required

Number of mismatches.

gapopen

integer

required

Number of gap openings.

qstart

integer

required

Start of alignment in query.

qend

integer

required

End of alignment in query.

sstart

integer

required

Start of alignment in subject.

send

integer

required

End of alignment in subject.

evalue

number

required

Expect value.

bitscore

number

required

Bit score.
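The twelve BlastHit fields listed above match, in order, the default columns of BLAST+ tabular output (`-outfmt 6`). The sketch below parses such output with the standard library only. The `Hit` dataclass and `parse_outfmt6` function are illustrative stand-ins, not the toolkit's own classes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    qseqid: str
    sseqid: str
    pident: float
    length: int
    mismatch: int
    gapopen: int
    qstart: int
    qend: int
    sstart: int
    send: int
    evalue: float
    bitscore: float


# One converter per column, in the default -outfmt 6 order.
_CASTS = (str, str, float, int, int, int, int, int, int, int, float, float)


def parse_outfmt6(text: str) -> list[Hit]:
    """Parse tab-separated BLAST output into Hit records."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        cols = line.split("\t")
        if len(cols) != len(_CASTS):
            raise ValueError(f"expected 12 columns, got {len(cols)}")
        hits.append(Hit(*(cast(col) for cast, col in zip(_CASTS, cols))))
    return hits


sample = "q1\tsubj1\t98.50\t200\t3\t0\t1\t200\t11\t210\t1e-100\t360\n"
print(parse_outfmt6(sample)[0].pident)  # 98.5
```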

Applications

This tool is the standard first step in any analysis that begins with an unknown sequence and asks what it resembles. Representative applications include functional annotation of a newly assembled gene through homology to characterised proteins, taxonomic identification of an environmental DNA fragment, off-target screening of a PCR primer or CRISPR guide against a reference genome, and tracing the evolutionary distribution of a gene across species.

Usage Tips

The program field must match the query and database types. Mismatched combinations return no hits and waste a search. Use blastn for nucleotide-against-nucleotide, blastp for protein-against-protein, blastx for a nucleotide query against a protein database, tblastn for a protein query against a nucleotide database, and tblastx for translated nucleotide against translated nucleotide.
Remote execution targets the NCBI BLAST web service and is limited by NCBI rate limits. The database field selects from the hosted reference databases (nt, nr, refseq_rna, refseq_protein, swissprot, pdb, pataa, patnt). High-throughput or batch workloads should use local execution to avoid being throttled or blocked by NCBI.
Local execution requires a local_db value pointing at a prebuilt database. Build one with blast-create-db or download a prebuilt NCBI database. The path is the database stem with no file extension. The configuration validator hard-errors when local_db is missing in local mode.
evalue is the primary parameter controlling how stringently hits are reported. The BLAST+ default of 10.0 is permissive and can return spurious hits. Set it to 1e-5 or stricter to filter out alignments that are likely to occur by chance, or use a higher value when searching for short or divergent matches.
extra_args accepts verbatim BLAST+ CLI tokens and applies only in local execution. Pass any CLI flag not exposed as a typed field through this list (for example ["-max_hsps", "1"]). The remote QBLAST API does not accept arbitrary CLI tokens, so extra_args is ignored when search_mode="online" and the configuration validator emits a warning in that case.
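The pairing and validation rules in the tips above can be summarised in a short sketch. Both helpers (`choose_program`, `check_config`) are hypothetical names written for illustration. They mirror the rules as stated here and are not the toolkit's validator.

```python
PROGRAM_FOR = {
    ("nucleotide", "nucleotide"): "blastn",
    ("protein", "protein"): "blastp",
    ("nucleotide", "protein"): "blastx",
    ("protein", "nucleotide"): "tblastn",
}


def choose_program(query_kind, db_kind, translated=False):
    """Pick the BLAST program for a query/database pairing."""
    if translated:
        # tblastx translates both sides, so both must be nucleotide.
        if (query_kind, db_kind) != ("nucleotide", "nucleotide"):
            raise ValueError("tblastx needs a nucleotide query and database")
        return "tblastx"
    try:
        return PROGRAM_FOR[(query_kind, db_kind)]
    except KeyError:
        raise ValueError(
            f"unsupported combination: {query_kind} vs {db_kind}") from None


def check_config(search_mode, local_db=None, extra_args=()):
    """Return warning strings; raise ValueError on a hard error."""
    warnings = []
    if search_mode == "local" and not local_db:
        raise ValueError("local_db is required when search_mode='local'")
    if search_mode == "online" and extra_args:
        warnings.append("extra_args is ignored when search_mode='online'")
    return warnings


print(choose_program("nucleotide", "protein"))  # blastx
print(check_config("online", extra_args=["-max_hsps", "1"]))
```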

Create BLAST Database (`blast-create-db`)

Builds a local BLAST database from a FASTA file using the BLAST+ makeblastdb program. The output is a set of indexed files referenced by a common stem path. The stem path is returned as db_path and can be passed directly as local_db to blast-search.

API Reference


Input: CreateBlastDbInput

fasta

string

required

Path to a FASTA file containing the sequences to be indexed into a BLAST database. The file must exist and contain valid FASTA-formatted sequences. For nucleotide databases, sequences should be DNA or RNA. For protein databases, sequences should be amino acids.


Config: CreateBlastDbConfig

dbtype

enum

default:"nucl"

"nucl" for DNA/RNA, "prot" for protein. Must match the input FASTA. Available options: nucl, prot

out_prefix

string

File-path prefix for generated DB files; None falls back to the input FASTA stem.

title

string

Descriptive DB title shown in BLAST reports; makeblastdb falls back to the input file name when None.

parse_seqids

boolean

default:"False"

Parse FASTA seq IDs so blastdbcmd can address sequences by ID; required for v5 taxonomy lookups.

hash_index

boolean

default:"False"

Create a hash index of seq IDs (faster ID lookups).

blastdb_version

enum

default:"5"

DB format version. 5 (taxonomy-aware) is the upstream default since BLAST+ 2.10. Available options: 4, 5

max_file_sz

string

default:"1GB"

Max size per DB volume with a unit suffix (e.g. "1GB"); upstream caps at "4GB".

taxid

integer

NCBI taxonomy ID assigned to every sequence; set to tag a single-organism DB.

extra_args

List[string]

default:"[]"

Extra makeblastdb CLI tokens passed verbatim (e.g. ["-mask_data", "/path/to/mask"]). Escape hatch for flags not exposed as typed fields above.

verbose

integer

default:"0"

Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.

device

string

default:"cpu"

Device to run the tool on.

timeout

integer

default:"600"

Maximum execution time in seconds. None waits indefinitely.

seed

integer


Output: CreateBlastDbOutput

db_path

string

required

The base path to the generated BLAST database files (without file extensions). This path can be used directly as the value for the local_db parameter in BlastSearchConfig. For example, if db_path is "/data/mydb", makeblastdb will have created multiple files like "/data/mydb.nhr", "/data/mydb.nin", "/data/mydb.nsq" (for nucleotide databases) or similar extensions for protein databases.
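Because db_path is a stem rather than a file, a quick existence check has to look for the component files. The sketch below checks the three core files for each database type (`.nhr`, `.nin`, `.nsq` for nucleotide, `.phr`, `.pin`, `.psq` for protein). The helper name `missing_db_files` is hypothetical, and the check assumes a single-volume database. Large databases split into numbered volumes use a different layout and would need a different check.

```python
from pathlib import Path

# Core header, index, and sequence files written by makeblastdb.
CORE_EXTENSIONS = {
    "nucl": (".nhr", ".nin", ".nsq"),
    "prot": (".phr", ".pin", ".psq"),
}


def missing_db_files(db_path: str, dbtype: str = "nucl") -> list[str]:
    """List the core database files that do not exist for the given stem."""
    if dbtype not in CORE_EXTENSIONS:
        raise ValueError(f"dbtype must be 'nucl' or 'prot', got {dbtype!r}")
    return [
        db_path + ext
        for ext in CORE_EXTENSIONS[dbtype]
        if not Path(db_path + ext).is_file()
    ]


print(missing_db_files("/data/mydb"))
```

An empty list means all three core files are present, so the stem is a reasonable candidate for local_db.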

Applications

This tool is the prerequisite for any local BLAST workflow that searches against a custom reference set, such as an in-house genome assembly, a curated subset of a public database, or a panel of designed sequences. Building a local database once and reusing it across many queries avoids repeated network traffic to NCBI and gives full control over the reference content.

Usage Tips

dbtype must match the input FASTA type. Use "nucl" for nucleotide sequences and "prot" for amino-acid sequences. The configuration validator hard-errors on any other value, and a mismatch against the FASTA content will be caught by makeblastdb at runtime.
out_prefix defaults to the input FASTA stem in the same directory. Set it explicitly when the database should live in a different location or under a different name.
parse_seqids=True is required for FASTA identifiers to be addressable. Enable it when downstream calls need to retrieve sequences by identifier through blastdbcmd or when building a taxonomy-aware database. Pair it with hash_index=True for faster identifier lookups.
extra_args accepts verbatim makeblastdb CLI tokens. Use it for niche flags not exposed as typed fields, such as ["-mask_data", "/path/to/mask"] for premasking input or ["-taxid_map", "..."] for assigning taxonomy IDs per sequence.
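As a rough illustration of how the configuration fields relate to makeblastdb options, the sketch below assembles a command line from them. The function name `build_makeblastdb_command` is hypothetical and the mapping is an assumption based on the field descriptions, not the toolkit's code. The flags themselves (`-in`, `-dbtype`, `-out`, `-title`, `-parse_seqids`, `-hash_index`, `-blastdb_version`, `-max_file_sz`, `-taxid`) are standard makeblastdb options.

```python
from pathlib import Path


def build_makeblastdb_command(fasta, dbtype="nucl", *, out_prefix=None,
                              title=None, parse_seqids=False,
                              hash_index=False, blastdb_version=5,
                              max_file_sz="1GB", taxid=None, extra_args=()):
    """Assemble an argv list for makeblastdb (illustrative only)."""
    if dbtype not in ("nucl", "prot"):
        raise ValueError(f"dbtype must be 'nucl' or 'prot', got {dbtype!r}")
    # Fall back to the FASTA stem, as the out_prefix description states.
    out = out_prefix if out_prefix is not None else str(Path(fasta).with_suffix(""))
    cmd = ["makeblastdb", "-in", fasta, "-dbtype", dbtype, "-out", out,
           "-blastdb_version", str(blastdb_version),
           "-max_file_sz", max_file_sz]
    if title is not None:
        cmd += ["-title", title]
    if parse_seqids:
        cmd.append("-parse_seqids")
    if hash_index:
        cmd.append("-hash_index")
    if taxid is not None:
        cmd += ["-taxid", str(taxid)]
    cmd += list(extra_args)  # passed through verbatim
    return cmd


print(build_makeblastdb_command("ref.fasta", parse_seqids=True, taxid=9606))
```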

Toolkit Notes

These apply to every BLAST tool in this toolkit (blast-search, blast-create-db).

Hits use the standard BLAST -outfmt 6 tabular schema. Each BlastHit carries the twelve canonical fields qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, and bitscore. pident is reported on a 0-to-100 scale.
The local installation downloads the platform-specific NCBI BLAST+ distribution on first use. The standalone setup pulls the appropriate NCBI BLAST+ tarball and extracts the blastn, blastp, blastx, tblastn, tblastx, and makeblastdb executables. No reference database is bundled, so local execution requires either a user-built database from blast-create-db or a separately downloaded NCBI database.
The two tools differ in execution mode. blast-search supports both online (search_mode="online", the default) and local (search_mode="local") execution. blast-create-db runs only locally because the NCBI web service does not expose makeblastdb.

Example notebook: See the full working example for a copy-paste-ready walkthrough.

Infrastructure Guides

The following guides cover how to run tools efficiently and at scale.

Tool Persistence

Keep a tool’s model warm across calls instead of reloading it every invocation.

Device Management

How GPUs are allocated to tools and how to target specific devices.

Parallel Execution

Fan a batch of inputs out across multiple GPUs.

Cloud Inference

Run tools on managed cloud infrastructure with no local setup.