AbLang - Proto

License: AbLang is open source and free for academic and commercial use under a BSD-3-Clause license. Please refer to the license for full terms.

Proto is not affiliated with Oxford Protein Informatics Group (OPIG). This toolkit is open source and builds on the implementation produced by this organization. Product names, logos, and trademarks are the property of their respective owners.

oxpig/AbLang2

An antibody-specific language model focusing on non-germline (NGL) residue prediction

AbLang: an antibody language model for completing antibody sequences

Tobias H Olsen, Iain H Moal and Charlotte M Deane

Bioinformatics Advances (2022)

@article{olsen2022ablang,
  title={AbLang: an antibody language model for completing antibody sequences},
  author={Olsen, Tobias H and Moal, Iain H and Deane, Charlotte M},
  journal={Bioinformatics Advances},
  volume={2},
  number={1},
  pages={vbac046},
  year={2022},
  publisher={Oxford University Press},
  doi={10.1093/bioadv/vbac046}
}

@article{olsen2024ablang2,
  title={Addressing the antibody germline bias and its effect on language models for improved antibody design},
  author={Olsen, Tobias H and Moal, Iain H and Deane, Charlotte M},
  journal={Bioinformatics},
  volume={40},
  number={11},
  pages={btae618},
  year={2024},
  publisher={Oxford University Press},
  doi={10.1093/bioinformatics/btae618}
}

proto-bio/proto-tools/proto_tools/tools/masked_models/ablang

Function	Description
`run_ablang_embeddings()`	Extract antibody sequence embeddings using AbLang (GPU)
`run_ablang_gradient()`	Compute AbLang masked pseudo-log-likelihood gradient for relaxed antibody sequences (GPU)
`run_ablang_sample()`	Restore masked antibody sequence positions using AbLang (GPU)
`run_ablang_score()`	Score antibody sequences using AbLang language model (GPU)

Background

AbLang (Olsen, Moal, and Deane, 2022) is a BERT-style masked language model trained exclusively on antibody variable-domain sequences from the OAS database. The published work demonstrates that AbLang restores residues missing from antibody sequence reads more accurately than germline-based imputation or the general-purpose ESM-1b protein language model, and runs approximately seven times faster than ESM-1b. Two single-chain checkpoints are provided, ablang1-heavy and ablang1-light, each with a 768-dimensional hidden representation.

AbLang-2 (Olsen, Moal, and Deane, 2024) is trained on both unpaired and paired antibody sequence data and addresses a germline-residue bias observed in earlier antibody language models that overweighted germline positions during training. The published analysis shows that AbLang-2 suggests a diverse set of valid mutations with high cumulative probability and provides paired-chain context for antibody design. The ablang2-paired checkpoint exposed by this toolkit has a 480-dimensional hidden representation.

Learning Resources

oxpig/AbLang (OPIG, University of Oxford). Official AbLang repository, source code, and reference implementation of the heavy- and light-chain checkpoints.
oxpig/AbLang2 (OPIG, University of Oxford). Official AbLang-2 repository for the paired heavy-plus-light checkpoint.
Observed Antibody Space (OPIG). Public antibody sequence database used to train the AbLang models.

Tools

AbLang Embeddings (`ablang-embedding`)

Computes per-sequence AbLang embeddings for a list of Antibody inputs. Each Antibody carries an optional heavy chain and an optional light chain, and the tool routes to ablang1-heavy, ablang1-light, or ablang2-paired based on which chains are present. The output is a list of mean-pooled embeddings (768-dimensional for the single-chain checkpoints, 480-dimensional for the paired checkpoint) together with attention masks that mark valid sequence positions.

API Reference

Source

Input: AbLangEmbeddingsInput

antibodies

List[Antibody]

required

Antibody sequence(s) to embed.

Show Antibody

heavy_chain

string

Heavy chain amino-acid sequence.

light_chain

string

Light chain amino-acid sequence.

Source

Config: AbLangEmbeddingsConfig

return_logits

boolean

default:"False"

Include per-position amino-acid logits in output (large; disable to save memory).

verbose

integer

default:"0"

Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.

device

string

default:"cuda"

Device to run on.

timeout

integer

default:"600"

Maximum execution time in seconds. None waits indefinitely.

seed

integer

Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.

batch_size

integer

default:"1"

Number of sequences to process per forward pass.

Source

Output: AbLangEmbeddingsOutput

results

List[SequenceEmbedding]

required

Per-sequence embedding results. Each SequenceEmbedding contains:

Show SequenceEmbedding

mean_embedding

List[number]

required

Mean-pooled embedding vector for one sequence.

attention_mask

List[integer]

required

Binary mask indicating valid positions (1) vs padding (0).

logits

array

Optional per-position amino acid logits for one sequence.

projection

Projection2D

Optional 2D coordinate from a UMAP projection of all embeddings in the same call. Populated when n_sequences >= 4; None otherwise (single-point or 2-3-point UMAP is meaningless).

Applications

This tool is appropriate for any antibody-sequence analysis that benefits from a learned representation. Representative applications include clustering antibody repertoires by sequence similarity in embedding space, ranking humanisation candidates by distance to a known humanised lead, identifying paired heavy-plus-light combinations with similar predicted binding behaviour, and providing input features to downstream classifiers for property prediction.
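As one illustration of ranking candidates by distance to a lead, the sketch below sorts toy embedding vectors by cosine similarity to a reference vector. Cosine similarity is an assumption chosen for the example, not a metric prescribed by the toolkit, and the two-dimensional vectors and candidate names are hypothetical stand-ins for real mean_embedding values.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def rank_by_similarity(
    reference: Sequence[float], candidates: Dict[str, Sequence[float]]
) -> List[str]:
    """Return candidate names ordered from most to least similar to the reference."""
    return sorted(
        candidates,
        key=lambda name: cosine_similarity(reference, candidates[name]),
        reverse=True,
    )


# Hypothetical toy embeddings (real ones are 768- or 480-dimensional).
lead = [1.0, 0.0]
toy_candidates = {
    "cand_a": [0.9, 0.1],
    "cand_b": [0.0, 1.0],
    "cand_c": [-1.0, 0.0],
}
ranking = rank_by_similarity(lead, toy_candidates)
print(ranking)  # ['cand_a', 'cand_b', 'cand_c']
```

Only embeddings produced by the same checkpoint should be compared this way, since the single-chain and paired checkpoints have different dimensionalities.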

Usage Tips

Provide both chains when available to get the paired representation. Setting both heavy_chain and light_chain on the Antibody input routes to ablang2-paired, which captures inter-chain co-evolutionary signals that the single-chain checkpoints cannot. Provide only one chain to use the corresponding single-chain model.
Use the returned attention mask when pooling or comparing positions. Variable-length sequences in a batch are padded to the longest input, and the attention mask flags which positions are real (1) versus padding (0). Downstream per-position analyses should respect the mask.
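The role of the attention mask can be sketched with a small masked mean-pooling routine. This is a minimal illustration, assuming a hypothetical per-position array (for example, per-position vectors derived from the optional logits); it is not the toolkit's internal pooling code.

```python
from typing import List, Sequence


def masked_mean_pool(
    hidden: Sequence[Sequence[float]], mask: Sequence[int]
) -> List[float]:
    """Average per-position vectors over the positions where mask == 1."""
    if len(hidden) != len(mask):
        raise ValueError("hidden states and mask must have the same length")
    n_valid = sum(mask)
    if n_valid == 0:
        raise ValueError("mask marks no valid positions")
    dim = len(hidden[0])
    pooled = [0.0] * dim
    for vec, keep in zip(hidden, mask):
        if keep:  # skip padded positions entirely
            for j, value in enumerate(vec):
                pooled[j] += value
    return [total / n_valid for total in pooled]


# Toy example: three real positions followed by one padded position.
hidden_states = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [100.0, 100.0]]
attention_mask = [1, 1, 1, 0]
pooled = masked_mean_pool(hidden_states, attention_mask)
print(pooled)  # [3.0, 4.0]
```

Without the mask, the padded row would pull the average far from the values of the real positions, which is why per-position analyses should filter on it first.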

AbLang Sampling (`ablang-sample`)

Restores masked positions in antibody sequences using the AbLang masked-language-model head. Positions to be restored are marked with an underscore (_) in the input sequence, and the tool samples a replacement amino acid at each masked position from the model’s predicted distribution. The sampling temperature is configurable, and greedy argmax decoding is selected by setting temperature=0.

API Reference

Source

Input: AbLangSampleInput

antibodies

List[Antibody]

required

Antibody sequence(s) with _ at positions to restore.

Show Antibody

heavy_chain

string

Heavy chain amino-acid sequence.

light_chain

string

Light chain amino-acid sequence.

Source

Config: AbLangSampleConfig

temperature

number

default:"1.0"

Softmax temperature for per-position amino-acid sampling. temperature == 0 selects greedy argmax decoding (equivalent to AbLang’s native restore mode). temperature == 1 samples from the unscaled model distribution; higher values flatten the distribution toward uniform, lower positive values sharpen it toward greedy.

align

boolean

default:"False"

Run ANARCI alignment first; enables restoration of unknown numbers of missing residues at chain termini. Forces greedy decoding (ANARCI’s spread-of-variants logic is incompatible with stochastic sampling).

return_logits

boolean

default:"False"

Include per-position logits in the output (large; disable to save memory). Triggers a second likelihood-mode forward pass per batch.

verbose

integer

default:"0"

Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.

device

string

default:"cuda"

Device to run on.

timeout

integer

default:"600"

Maximum execution time in seconds. None waits indefinitely.

seed

integer

batch_size

integer

default:"1"

Number of sequences per forward pass.

Source

Output: AbLangSampleOutput

logits

array

Per-position logits for each restored sequence. Shape is (num_sequences, seq_len, vocab_size=20). Only present if return_logits=True in config.

sequences

List[string]

required

Restored antibody sequences with masked positions replaced by model predictions.

Applications

This tool is appropriate for completing antibody sequences with missing residues, a common need when working with B-cell receptor sequencing reads that drop the first several N-terminal residues. Representative applications include filling sequencing-dropout positions before downstream structural prediction, exploring single-position substitutions in CDR or framework regions, and generating antibody-context-aware variants for humanisation or affinity-maturation campaigns.

Usage Tips

Use the underscore (_) as the mask character. Other placeholders such as *, X, or <mask> are not recognised. Each underscore in the input sequence is replaced with a sample drawn from the model distribution at that position.
temperature controls the sampling stochasticity. The default of 1.0 samples from the unscaled model distribution, producing different sequences across repeated calls. Set temperature=0 for greedy argmax decoding, which matches AbLang’s native restore mode and produces deterministic output. Lower positive values sharpen toward the top prediction, higher values flatten toward uniform. Use seed to make stochastic runs reproducible.
Set align=True to extend unknown-length termini. When the input sequence is shorter than expected, enabling ANARCI-based alignment lets AbLang restore residues at the N or C terminus as well as in the middle of the sequence. Setting align=True forces greedy decoding regardless of the temperature setting, since the ANARCI alignment is incompatible with stochastic sampling.
Set return_logits=True to recover the per-position amino-acid distribution. When enabled, the output carries a per-position logit matrix of shape (num_sequences, seq_len, 20) alongside the sampled sequence, which is useful for downstream re-ranking or post-hoc analysis. The default omits the logits to keep the response small.
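The interaction between the underscore mask, temperature, and seed described above can be sketched as follows. This is a simplified illustration under stated assumptions: the helper names, the dictionary of per-position logits, and the column order ACDEFGHIKLMNPQRSTVWY are hypothetical, and the real tool obtains its logits from the AbLang model rather than from a hand-written list.

```python
import math
import random
from typing import Dict, List, Optional

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed column order for this sketch


def sample_residue(logits: List[float], temperature: float, rng: random.Random) -> str:
    """Pick one amino acid from logits; temperature 0 means greedy argmax."""
    if len(logits) != len(AMINO_ACIDS):
        raise ValueError("expected one logit per amino acid")
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if temperature == 0:
        best = max(range(len(logits)), key=logits.__getitem__)
        return AMINO_ACIDS[best]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(AMINO_ACIDS, weights=weights, k=1)[0]


def fill_masks(
    sequence: str,
    logits_by_position: Dict[int, List[float]],
    temperature: float,
    seed: Optional[int] = None,
) -> str:
    """Replace every '_' with a residue drawn from that position's logits."""
    rng = random.Random(seed)
    restored = []
    for i, residue in enumerate(sequence):
        if residue == "_":
            restored.append(sample_residue(logits_by_position[i], temperature, rng))
        else:
            restored.append(residue)
    return "".join(restored)


# Toy logits that strongly favour leucine at the masked position.
toy_logits = [0.0] * len(AMINO_ACIDS)
toy_logits[AMINO_ACIDS.index("L")] = 5.0

greedy = fill_masks("EVQ_VE", {3: toy_logits}, temperature=0.0)
print(greedy)  # EVQLVE

# With a fixed seed, stochastic sampling is repeatable.
first = fill_masks("EVQ_VE", {3: toy_logits}, temperature=1.0, seed=7)
second = fill_masks("EVQ_VE", {3: toy_logits}, temperature=1.0, seed=7)
print(first == second)  # True
```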

AbLang Scoring (`ablang-score`)

Computes per-sequence scores under the AbLang masked-language-model head. The scoring_mode configuration field selects between pseudo-log-likelihood ("pseudo_log_likelihood") and confidence ("confidence") scoring.
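The pseudo-log-likelihood mode can be pictured as one masked prediction per position. The sketch below assumes a hypothetical `predict` callable that returns a probability for each amino acid at the masked position, and uses a uniform toy predictor in place of the AbLang model; the underscore mask character is used here purely for illustration.

```python
import math
from typing import Callable, Dict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def pseudo_log_likelihood(
    sequence: str, predict: Callable[[str, int], Dict[str, float]]
) -> float:
    """Sum log p(true residue | rest of sequence), masking one position at a time."""
    total = 0.0
    for i, true_residue in enumerate(sequence):
        masked = sequence[:i] + "_" + sequence[i + 1:]
        probs = predict(masked, i)  # one model pass per position, O(L) in total
        total += math.log(probs[true_residue])
    return total


def uniform_predictor(masked_sequence: str, position: int) -> Dict[str, float]:
    """Toy stand-in for a model: every amino acid is equally likely."""
    return {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}


pll = pseudo_log_likelihood("EVQ", uniform_predictor)
print(round(pll, 4))  # -8.9872
```

The loop makes the cost visible: a sequence of length L needs L masked predictions, which is why the single-pass confidence mode is faster.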

API Reference

Source

Input: AbLangScoringInput

antibodies

List[Antibody]

required

Antibody sequence(s) to score.

Show Antibody

heavy_chain

string

Heavy chain amino-acid sequence.

light_chain

string

Light chain amino-acid sequence.

Source

Config: AbLangScoringConfig

scoring_mode

enum

default:"pseudo_log_likelihood"

Scoring method. "pseudo_log_likelihood" masks each position individually (accurate, O(L) passes); "confidence" is a single-pass confidence proxy (faster, less accurate). Available options: pseudo_log_likelihood, confidence

verbose

integer

default:"0"

Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.

device

string

default:"cuda"

Device to run on.

timeout

integer

default:"600"

Maximum execution time in seconds. None waits indefinitely.

seed

integer

batch_size

integer

default:"1"

Number of sequences per forward pass.

return_logits

boolean

default:"False"

Include per-position logits in the output (large; disable to save memory). Triggers a second likelihood-mode forward pass per batch.

Source

Output: MaskedModelScoringOutput

scores

List[MaskedModelScoringMetrics]

required

List of scoring outputs, one per input sequence. Each entry is a Metrics subclass with scalar metrics (accessed via score.perplexity or score["perplexity"]) plus declared logits / vocab fields that carry raw model outputs when requested.

Show MaskedModelScoringMetrics

logits

array

Per-position logits array (seq_len, vocab_size). None unless return_logits=True.

vocab

array

Token ordering for logits.

primary_metric

string

Name of the metric that best summarizes the result overall (e.g. "avg_plddt" for AlphaFold2). Used by downstream UI and reporting to pick a headline value.

Metrics (one set per scores item)

Metric	Type	Range	Availability
`log_likelihood`	float	≤ 0.0	always
`avg_log_likelihood`	float	≤ 0.0	always
`perplexity`	float	≥ 1.0	always
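The ranges in the table are consistent with the conventional relationship between the three metrics, sketched below. This assumes the standard definitions (average over positions, perplexity as the exponential of the negative average); the source does not spell out the tool's exact formulas, and the function name is hypothetical.

```python
import math
from typing import Dict, Sequence


def summarize_log_probs(per_position_log_probs: Sequence[float]) -> Dict[str, float]:
    """Derive sequence-level metrics from per-position log-probabilities."""
    if not per_position_log_probs:
        raise ValueError("need at least one position")
    log_likelihood = sum(per_position_log_probs)
    avg_log_likelihood = log_likelihood / len(per_position_log_probs)
    perplexity = math.exp(-avg_log_likelihood)
    return {
        "log_likelihood": log_likelihood,
        "avg_log_likelihood": avg_log_likelihood,
        "perplexity": perplexity,
    }


# Two positions whose true residues received probabilities 0.5 and 0.25.
metrics = summarize_log_probs([math.log(0.5), math.log(0.25)])
print(round(metrics["perplexity"], 4))  # 2.8284
```

Because every log-probability is at most zero, the summed and averaged values are at most zero and the perplexity is at least one, matching the ranges listed above.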

Applications

This tool is appropriate for ranking antibody sequences by how “natural” they look under the model. Representative applications include selecting humanisation candidates closer to natural human antibody repertoires, flagging candidate sequences with low predicted naturalness for redesign, and ranking ProteinMPNN- or design-pipeline-generated sequences by pseudo-log-likelihood before more expensive downstream analyses.

Usage Tips

Pseudo-log-likelihood scores from different checkpoints sit on different scales and are not directly comparable. Each of ablang1-heavy, ablang1-light, and ablang2-paired was trained independently and produces scores on its own scale, so heavy-chain scores cannot be compared against light-chain scores and single-chain scores cannot be compared against paired-chain scores. Only compare antibodies that were scored with the same model variant.
Higher pseudo-log-likelihood corresponds to a more probable sequence under AbLang. Use scores comparatively across variants of the same antibody rather than as an absolute developability or affinity score. A high score reflects sequence likeness to the training distribution, not predicted experimental performance.

AbLang Gradient (`ablang-gradient`)

Computes the gradient of the AbLang masked pseudo-log-likelihood objective with respect to a relaxed antibody-logit input. The tool accepts an AntibodyLogits object whose heavy_chain and light_chain fields are per-position logit or probability matrices, masks each amino-acid position in turn, scores the bidirectional-context prediction with cross-entropy against the input distribution, and returns the gradient matrix together with the loss value and auxiliary metrics.

API Reference

Source

Input: AbLangGradientInput

antibody

AntibodyLogits

required

Antibody with relaxed sequence distributions. The model variant is selected automatically based on which chains are provided.

Show AntibodyLogits

heavy_chain

array

Heavy chain logits with shape (Lh, 20) in canonical amino-acid order.

light_chain

array

Light chain logits with shape (Ll, 20) in canonical amino-acid order.

temperature

number

Optional softmax temperature. When set, applies softmax(input / temperature) before computing the gradient. When None (default), the input is used as-is.

Source

Config: AbLangGradientConfig

use_ste

boolean

default:"False"

Straight-Through Estimator: hard one-hot in the forward pass with gradients flowing through soft probabilities. When False, uses soft blended embeddings directly.

compute_gradient

boolean

default:"True"

Run backward pass and return gradient. Set False for forward-only log-likelihood scoring (e.g. MCMC proposal ranking).

batch_size

integer

AA positions per forward pass for batched PLL. None auto-selects a per-model default (lower if OOM, higher for throughput).

verbose

integer

default:"0"

Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.

device

string

default:"cuda"

Device to run the model on.

timeout

integer

default:"600"

Maximum execution time in seconds. None waits indefinitely.

seed

integer

Source

Output: AbLangGradientOutput

gradient

array

Gradient w.r.t. input logits, or None when compute_gradient=False (forward-only scoring).

loss

number

required

Mean negative log-likelihood over AA positions.

metrics

Dict[string, any]

log_likelihood, avg_log_likelihood, perplexity, sequence_length, model_choice, objective.

vocab

List[string]

required

Amino-acid column ordering for the input logits.

Applications

This tool is appropriate for differentiable antibody-design pipelines that update a continuous sequence representation by gradient descent. Representative applications include relaxed-logit hallucination for antibody design, joint optimisation of AbLang likelihood together with structure-based losses such as AlphaFold2 hallucination, and incorporating an antibody-specific naturalness term into broader binder-design objectives.

Usage Tips

Input logits use the canonical protein order ACDEFGHIKLMNPQRSTVWY. The tool implementation internally maps to AbLang’s vocabulary order before the forward pass and returns the gradient in the same canonical order, so the user does not need to handle the AbLang-specific token order separately.
Set temperature to apply a softmax before scoring. When temperature is set, the tool implementation applies softmax(input / temperature) to the input logits before the forward pass. Leave temperature=None (the default) when the user already provides a normalised probability distribution.
Use the Straight-Through Estimator option for discrete-token gradients. Setting use_ste=True substitutes hard one-hot tokens in the forward pass while allowing gradients to flow through the soft probabilities, which can produce sharper update directions for some discrete-design loops. The default (use_ste=False) uses soft blended embeddings.
Set compute_gradient=False for forward-only scoring. This skips the backward pass and returns gradient=None together with the loss value, which is useful for ranking candidates from a Monte Carlo proposal without paying the backward-pass cost.
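Two of the steps above, the optional softmax(input / temperature) and the mapping between the canonical column order and a model-specific order, can be sketched in a few lines. The `MODEL_ORDER` string below is a hypothetical ordering used only for illustration and should not be read as AbLang's documented vocabulary; the helper names are likewise assumptions.

```python
import math
from typing import List, Sequence

CANONICAL_ORDER = "ACDEFGHIKLMNPQRSTVWY"
MODEL_ORDER = "MRHKDESTNQCGPAVIFYWL"  # hypothetical model-side ordering


def softmax_with_temperature(row: Sequence[float], temperature: float) -> List[float]:
    """Apply softmax(row / temperature) to one position's logits."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in row]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def reorder_columns(row: Sequence[float], source_order: str, target_order: str) -> List[float]:
    """Permute one row from source_order columns to target_order columns."""
    index = {aa: i for i, aa in enumerate(source_order)}
    return [row[index[aa]] for aa in target_order]


# One position whose logits favour alanine, given in canonical order.
logits_row = [0.0] * len(CANONICAL_ORDER)
logits_row[CANONICAL_ORDER.index("A")] = 2.0

probs = softmax_with_temperature(logits_row, temperature=1.0)
model_row = reorder_columns(probs, CANONICAL_ORDER, MODEL_ORDER)
round_trip = reorder_columns(model_row, MODEL_ORDER, CANONICAL_ORDER)

print(round_trip == probs)  # True
```

The same inverse permutation is what lets a gradient computed in the model's column order be handed back in canonical order.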

Toolkit Notes

These apply to every AbLang tool in this toolkit (ablang-embedding, ablang-gradient, ablang-sample, ablang-score).

All four tools route automatically among the three AbLang checkpoints based on the chains provided. Providing only a heavy chain selects ablang1-heavy, providing only a light chain selects ablang1-light, and providing both selects the paired ablang2-paired checkpoint. At least one chain must be set on each input.
Every antibody in a batched call must use the same chain configuration. The embedding, scoring, and sampling tools accept a list of antibodies in a single call, and every antibody in that list must provide the same combination of heavy and light chains so that all entries route to the same checkpoint. Mixed lists are rejected at input construction with a clear error.
AbLang is appropriate for antibody variable-domain sequences only. Non-antibody proteins should be analysed with a general-purpose protein language model such as ESM2 rather than AbLang, which was trained exclusively on antibody sequences and produces unreliable scores or embeddings outside that distribution.
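The routing and batch-consistency rules above can be summarised in a short sketch. The function names and the tuple representation of an antibody are hypothetical; the checkpoint names and the rules themselves come from the notes above, and the toolkit's actual validation code may differ.

```python
from typing import List, Optional, Tuple

# (heavy_chain, light_chain); either element may be None.
ChainPair = Tuple[Optional[str], Optional[str]]


def select_checkpoint(heavy: Optional[str], light: Optional[str]) -> str:
    """Choose a checkpoint name from which chains are present."""
    if heavy and light:
        return "ablang2-paired"
    if heavy:
        return "ablang1-heavy"
    if light:
        return "ablang1-light"
    raise ValueError("at least one chain must be set")


def checkpoint_for_batch(antibodies: List[ChainPair]) -> str:
    """Require every antibody in the batch to route to the same checkpoint."""
    if not antibodies:
        raise ValueError("batch is empty")
    names = {select_checkpoint(heavy, light) for heavy, light in antibodies}
    if len(names) > 1:
        raise ValueError(f"mixed chain configurations in one batch: {sorted(names)}")
    return names.pop()


print(select_checkpoint("EVQLVESGG", None))        # ablang1-heavy
print(select_checkpoint(None, "DIQMTQSPS"))        # ablang1-light
print(select_checkpoint("EVQLVESGG", "DIQMTQSPS"))  # ablang2-paired

try:
    checkpoint_for_batch([("EVQLVESGG", None), ("EVQLVESGG", "DIQMTQSPS")])
    mixed_rejected = False
except ValueError:
    mixed_rejected = True
print(mixed_rejected)  # True
```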

Example notebook: See the full working example for a copy-paste-ready walkthrough.

Infrastructure Guides

The following guides cover how to run tools efficiently and at scale.

Tool Persistence

Keep a tool’s model warm across calls instead of reloading it every invocation.

Device Management

How GPUs are allocated to tools and how to target specific devices.

Parallel Execution

Fan a batch of inputs out across multiple GPUs.

Cloud Inference

Run tools on managed cloud infrastructure with no local setup.
