License: AbLang is open source and free for academic and commercial use under a BSD-3-Clause license. Please refer to the license for full terms.

Proto is not affiliated with Oxford Protein Informatics Group (OPIG). This toolkit is open source and builds on the implementation produced by this organization. Product names, logos, and trademarks are the property of their respective owners.


oxpig/AbLang2
An antibody-specific language model focusing on NGL prediction
AbLang: an antibody language model for completing antibody sequences
Tobias H Olsen, Iain H Moal and Charlotte M Deane
Bioinformatics Advances (2022)
@article{olsen2022ablang,
  title={AbLang: an antibody language model for completing antibody sequences},
  author={Olsen, Tobias H and Moal, Iain H and Deane, Charlotte M},
  journal={Bioinformatics Advances},
  volume={2},
  number={1},
  pages={vbac046},
  year={2022},
  publisher={Oxford University Press},
  doi={10.1093/bioadv/vbac046}
}

@article{olsen2024ablang2,
  title={Addressing the antibody germline bias and its effect on language models for improved antibody design},
  author={Olsen, Tobias H and Moal, Iain H and Deane, Charlotte M},
  journal={Bioinformatics},
  volume={40},
  number={11},
  pages={btae618},
  year={2024},
  publisher={Oxford University Press},
  doi={10.1093/bioinformatics/btae618}
}
proto-bio/proto-tools/proto_tools/tools/masked_models/ablang
Function | Description
--- | ---
run_ablang_embeddings() | Extract antibody sequence embeddings using AbLang (GPU)
run_ablang_gradient() | Compute AbLang masked pseudo-log-likelihood gradient for relaxed antibody sequences (GPU)
run_ablang_sample() | Restore masked antibody sequence positions using AbLang (GPU)
run_ablang_score() | Score antibody sequences using AbLang language model (GPU)

Background

AbLang (Olsen, Moal, and Deane, 2022) is a BERT-style masked language model trained exclusively on antibody variable-domain sequences from the OAS database. The published work demonstrates that AbLang restores residues missing from antibody sequence reads more accurately than germline-based imputation or the general-purpose ESM-1b protein language model, and runs approximately seven times faster than ESM-1b. Two single-chain checkpoints are provided, ablang1-heavy and ablang1-light, each with a 768-dimensional hidden representation.

AbLang-2 (Olsen, Moal, and Deane, 2024) is trained on both unpaired and paired antibody sequence data and addresses a germline-residue bias observed in earlier antibody language models that overweighted germline positions during training. The published analysis shows that AbLang-2 suggests a diverse set of valid mutations with high cumulative probability and provides paired-chain context for antibody design. The ablang2-paired checkpoint exposed by this toolkit has a 480-dimensional hidden representation.

Learning Resources

  • oxpig/AbLang (OPIG, University of Oxford). Official AbLang repository, source code, and reference implementation of the heavy- and light-chain checkpoints.
  • oxpig/AbLang2 (OPIG, University of Oxford). Official AbLang-2 repository for the paired heavy-plus-light checkpoint.
  • Observed Antibody Space (OPIG). Public antibody sequence database used to train the AbLang models.

Tools

AbLang Embeddings (ablang-embedding)

Computes per-sequence AbLang embeddings for a list of Antibody inputs. Each Antibody carries an optional heavy chain and an optional light chain, and the tool routes to ablang1-heavy, ablang1-light, or ablang2-paired based on which chains are present. The output is a list of mean-pooled embeddings (768-dimensional for the single-chain checkpoints, 480-dimensional for the paired checkpoint) together with attention masks that mark valid sequence positions.

API Reference

Input:
  • antibodies (List[Antibody], required): Antibody sequence(s) to embed.

Configuration:
  • return_logits (boolean, default: False): Include per-position amino-acid logits in output (large; disable to save memory).
  • verbose (integer, default: 0): Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.
  • device (string, default: "cuda"): Device to run on.
  • timeout (integer, default: 600): Maximum execution time in seconds. None waits indefinitely.
  • seed (integer): Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.
  • batch_size (integer, default: 1): Number of sequences to process per forward pass.

Output:
  • results (List[SequenceEmbedding], required): Per-sequence embedding results. Each SequenceEmbedding contains:

Applications

This tool is appropriate for any antibody-sequence analysis that benefits from a learned representation. Representative applications include clustering antibody repertoires by sequence similarity in embedding space, ranking humanisation candidates by distance to a known humanised lead, identifying paired heavy-plus-light combinations with similar predicted binding behaviour, and providing input features to downstream classifiers for property prediction.
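
As a rough illustration of the distance-based ranking mentioned above, the following sketch orders candidates by cosine distance to a lead embedding. It does not call the toolkit: the function names and the toy 4-dimensional vectors are assumptions standing in for real 768- or 480-dimensional embeddings.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def rank_by_distance(lead, candidates):
    """Sort (name, distance) pairs by cosine distance to the lead embedding."""
    scored = [(name, cosine_distance(lead, emb)) for name, emb in candidates.items()]
    return sorted(scored, key=lambda item: item[1])

# Toy vectors (hypothetical); real embeddings would come from the embedding tool.
lead = [1.0, 0.0, 0.0, 0.0]
candidates = {
    "cand_a": [0.9, 0.1, 0.0, 0.0],
    "cand_b": [0.0, 1.0, 0.0, 0.0],
    "cand_c": [0.5, 0.5, 0.0, 0.0],
}
ranking = rank_by_distance(lead, candidates)
for name, distance in ranking:
    print(f"{name}: {distance:.3f}")
```

Cosine distance is only one reasonable choice here. Euclidean distance on normalised embeddings would give the same ordering.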

Usage Tips

  • Provide both chains when available to get the paired representation. Setting both heavy_chain and light_chain on the Antibody input routes to ablang2-paired, which captures inter-chain co-evolutionary signals that the single-chain checkpoints cannot. Provide only one chain to use the corresponding single-chain model.
  • Use the returned attention mask when pooling or comparing positions. Variable-length sequences in a batch are padded to the longest input, and the attention mask flags which positions are real (1) versus padding (0). Downstream per-position analyses should respect the mask.
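
The masking tip above can be sketched as a masked mean over per-position vectors. This is a minimal pure-Python illustration, assuming per-position vectors and a 0/1 mask are available as nested lists; masked_mean_pool is a hypothetical helper, not a toolkit function.

```python
def masked_mean_pool(hidden_states, attention_mask):
    """Average per-position vectors over positions where the mask is 1."""
    dim = len(hidden_states[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(hidden_states, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    if count == 0:
        raise ValueError("attention mask marks no valid positions")
    return [value / count for value in total]

# Three positions with 2-dimensional toy vectors; the last position is padding.
hidden = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = masked_mean_pool(hidden, mask)
print(pooled)
```

Without the mask, the padding vector would dominate the average, which is why padded positions are excluded before dividing.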

AbLang Sampling (ablang-sample)

Restores masked positions in antibody sequences using the AbLang masked-language-model head. Positions to be restored are marked with an underscore (_) in the input sequence, and the tool samples a replacement amino acid at each masked position from the model’s predicted distribution. The sampling temperature is configurable, and greedy argmax decoding is selected by setting temperature=0.

API Reference

Input:
  • antibodies (List[Antibody], required): Antibody sequence(s) with _ at positions to restore.

Configuration:
  • temperature (number, default: 1.0): Softmax temperature for per-position amino-acid sampling. temperature == 0 selects greedy argmax decoding (equivalent to ablang’s native restore mode). temperature == 1 samples from the unscaled model distribution; higher values flatten the distribution toward uniform, lower values sharpen toward greedy.
  • align (boolean, default: False): Run ANARCI alignment first; enables restoration of unknown numbers of missing residues at chain termini. Forces greedy decoding (ANARCI’s spread-of-variants logic is incompatible with stochastic sampling).
  • return_logits (boolean, default: False): Include per-position logits in the output (large; disable to save memory). Triggers a second likelihood-mode forward pass per batch.
  • verbose (integer, default: 0): Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.
  • device (string, default: "cuda"): Device to run on.
  • timeout (integer, default: 600): Maximum execution time in seconds. None waits indefinitely.
  • seed (integer): Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.
  • batch_size (integer, default: 1): Number of sequences per forward pass.

Output:
  • logits (array): Per-position logits for each restored sequence. Shape is (num_sequences, seq_len, vocab_size=20). Only present if return_logits=True in config.
  • sequences (List[string], required): Restored antibody sequences with masked positions replaced by model predictions.

Applications

This tool is appropriate for completing antibody sequences with missing residues, a common need when working with B-cell receptor sequencing reads that drop the first several N-terminal residues. Representative applications include filling sequencing-dropout positions before downstream structural prediction, exploring single-position substitutions in CDR or framework regions, and generating antibody-context-aware variants for humanisation or affinity-maturation campaigns.

Usage Tips

  • Use the underscore (_) as the mask character. Other placeholders such as *, X, or <mask> are not recognised. Each underscore in the input sequence is replaced with a sample drawn from the model distribution at that position.
  • temperature controls the sampling stochasticity. The default of 1.0 samples from the unscaled model distribution, producing different sequences across repeated calls. Set temperature=0 for greedy argmax decoding, which matches AbLang’s native restore mode and produces deterministic output. Lower positive values sharpen toward the top prediction, higher values flatten toward uniform. Use seed to make stochastic runs reproducible.
  • Set align=True to extend unknown-length termini. When the input sequence is shorter than expected, enabling ANARCI-based alignment lets AbLang restore residues at the N or C terminus as well as in the middle of the sequence. Setting align=True forces greedy decoding regardless of the temperature setting, since the ANARCI alignment is incompatible with stochastic sampling.
  • Set return_logits=True to recover the per-position amino-acid distribution. When enabled, the output carries a per-position logit matrix of shape (num_sequences, seq_len, 20) alongside the sampled sequence, which is useful for downstream re-ranking or post-hoc analysis. The default omits the logits to keep the response small.
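
The interaction between the underscore mask, temperature, and seed can be sketched with a toy stand-in for the model. Everything below is illustrative: restore, sample_residue, and toy_logits are hypothetical names, and the toy logits replace the real AbLang prediction head.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def sample_residue(logits, temperature, rng):
    """Pick an amino-acid index from logits; temperature 0 means argmax."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

def restore(sequence, logits_at, temperature=1.0, seed=None):
    """Replace each '_' using logits supplied by the logits_at callback."""
    rng = random.Random(seed)
    out = []
    for pos, residue in enumerate(sequence):
        if residue == "_":
            idx = sample_residue(logits_at(sequence, pos), temperature, rng)
            out.append(AMINO_ACIDS[idx])
        else:
            out.append(residue)
    return "".join(out)

def toy_logits(sequence, pos):
    """Stand-in for a model: strongly prefers 'S', mildly prefers 'T'."""
    logits = [0.0] * len(AMINO_ACIDS)
    logits[AMINO_ACIDS.index("S")] = 3.0
    logits[AMINO_ACIDS.index("T")] = 2.0
    return logits

greedy = restore("EVQL_E_GG", toy_logits, temperature=0)
sampled_a = restore("EVQL_E_GG", toy_logits, temperature=1.0, seed=7)
sampled_b = restore("EVQL_E_GG", toy_logits, temperature=1.0, seed=7)
print(greedy, sampled_a, sampled_b)
```

In this sketch, greedy decoding always fills both gaps with the top-scoring residue, while the two seeded stochastic calls agree with each other but may differ from the greedy result.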

AbLang Scoring (ablang-score)

Computes per-sequence scores under the AbLang masked-language-model head. The scoring_mode configuration field selects between pseudo-log-likelihood ("pseudo_log_likelihood") and confidence ("confidence") scoring.

API Reference

Input:
  • antibodies (List[Antibody], required): Antibody sequence(s) to score.

Configuration:
  • scoring_mode (enum, default: "pseudo_log_likelihood"): Scoring method. "pseudo_log_likelihood" masks each position individually (accurate, O(L) passes); "confidence" is a single-pass confidence proxy (faster, less accurate). Available options: pseudo_log_likelihood, confidence
  • verbose (integer, default: 0): Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.
  • device (string, default: "cuda"): Device to run on.
  • timeout (integer, default: 600): Maximum execution time in seconds. None waits indefinitely.
  • seed (integer): Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.
  • batch_size (integer, default: 1): Number of sequences per forward pass.
  • return_logits (boolean, default: False): Include per-position logits in the output (large; disable to save memory). Triggers a second likelihood-mode forward pass per batch.

Output:
  • scores (List[MaskedModelScoringMetrics], required): List of scoring outputs, one per input sequence. Each entry is a Metrics subclass with scalar metrics (accessed via score.perplexity or score["perplexity"]) plus declared logits / vocab fields that carry raw model outputs when requested.

Metrics (one set per scores item)

Metric | Type | Range | Availability
--- | --- | --- | ---
log_likelihood | float | ≤ 0.0 | always
avg_log_likelihood | float | ≤ 0.0 | always
perplexity | float | ≥ 1.0 | always

Applications

This tool is appropriate for ranking antibody sequences by how “natural” they look under the model. Representative applications include selecting humanisation candidates closer to natural human antibody repertoires, flagging candidate sequences with low predicted naturalness for redesign, and ranking ProteinMPNN- or design-pipeline-generated sequences by pseudo-log-likelihood before more expensive downstream analyses.

Usage Tips

  • Pseudo-log-likelihood scores from different checkpoints sit on different scales and are not directly comparable. Each of ablang1-heavy, ablang1-light, and ablang2-paired was trained independently and produces scores on its own scale, so heavy-chain scores cannot be compared against light-chain scores and single-chain scores cannot be compared against paired-chain scores. Only compare antibodies that were scored with the same model variant.
  • Higher pseudo-log-likelihood corresponds to a more probable sequence under AbLang. Use scores comparatively across variants of the same antibody rather than as an absolute developability or affinity score. A high score reflects sequence likeness to the training distribution, not predicted experimental performance.
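
For intuition about how the three metrics relate, here is a minimal sketch of pseudo-log-likelihood scoring with a toy predictor. The function names are hypothetical. The perplexity formula exp(-avg_log_likelihood) is the common convention and is consistent with the ranges listed for this tool, but the source does not state the formula, so treat it as an assumption.

```python
import math

def pseudo_log_likelihood(sequence, predict_masked):
    """Sum log p(residue | rest) over positions, masking one position per pass."""
    total = 0.0
    for pos, residue in enumerate(sequence):
        masked = sequence[:pos] + "_" + sequence[pos + 1:]
        probs = predict_masked(masked, pos)  # dict: amino acid -> probability
        total += math.log(probs[residue])
    return total

def summarise(log_likelihood, length):
    """Derive the per-residue average and perplexity from a total log-likelihood."""
    avg = log_likelihood / length
    return {
        "log_likelihood": log_likelihood,
        "avg_log_likelihood": avg,
        "perplexity": math.exp(-avg),  # assumed convention, see lead-in
    }

def toy_predict(masked, pos):
    """Stand-in model: probability 0.5 for 'A', the rest spread evenly."""
    others = "CDEFGHIKLMNPQRSTVWY"
    probs = {aa: 0.5 / len(others) for aa in others}
    probs["A"] = 0.5
    return probs

metrics = summarise(pseudo_log_likelihood("AAAA", toy_predict), 4)
print(metrics)
```

The loop makes one prediction per position, which is the O(L) cost noted for the pseudo_log_likelihood mode. Dividing by length gives a per-residue value, but this does not make scores from different checkpoints comparable.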

AbLang Gradient (ablang-gradient)

Computes the gradient of the AbLang masked pseudo-log-likelihood objective with respect to a relaxed antibody-logit input. The tool accepts an AntibodyLogits object whose heavy_chain and light_chain fields are per-position logit or probability matrices, masks each amino-acid position in turn, scores the bidirectional-context prediction with cross-entropy against the input distribution, and returns the gradient matrix together with the loss value and auxiliary metrics.
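
The loss described above can be sketched in plain Python as a mean cross-entropy between the input distribution at each position and a prediction made with that position hidden. This is only a forward-pass illustration under stated assumptions: uniform_predictor is a toy stand-in for the model, the 4-letter vocabulary is for brevity, and the backward pass that the tool performs would need an autograd framework.

```python
import math

def softmax(row, temperature=1.0):
    """Convert one row of logits to probabilities."""
    scaled = [x / temperature for x in row]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def masked_cross_entropy_loss(target_probs, predict_position):
    """Mean over positions of -sum_a p[a] * log q[a], with q from the masked context."""
    losses = []
    for pos, p in enumerate(target_probs):
        q = predict_position(target_probs, pos)  # prediction with position `pos` hidden
        losses.append(-sum(pa * math.log(qa) for pa, qa in zip(p, q) if pa > 0))
    return sum(losses) / len(losses)

def uniform_predictor(target_probs, pos):
    """Toy stand-in for the model: always predicts a uniform distribution."""
    n = len(target_probs[pos])
    return [1.0 / n] * n

# Two positions over a toy 4-letter vocabulary.
logits = [[2.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
probs = [softmax(row, temperature=1.0) for row in logits]
loss = masked_cross_entropy_loss(probs, uniform_predictor)
print(loss)
```

With a uniform predictor the loss equals log(4) whatever the input, which makes the example easy to check by hand. A real model's prediction would depend on the unmasked context.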

API Reference

Input:
  • antibody (AntibodyLogits, required): Antibody with relaxed sequence distributions. The model variant is selected automatically based on which chains are provided.
  • temperature (number): Optional softmax temperature. When set, applies softmax(input / temperature) before computing the gradient. When None (default), the input is used as-is.

Configuration:
  • use_ste (boolean, default: False): Straight-Through Estimator: hard one-hot in the forward pass with gradients flowing through soft probabilities. When False, uses soft blended embeddings directly.
  • compute_gradient (boolean, default: True): Run backward pass and return gradient. Set False for forward-only log-likelihood scoring (e.g. MCMC proposal ranking).
  • batch_size (integer): AA positions per forward pass for batched PLL. None auto-selects a per-model default (lower if OOM, higher for throughput).
  • verbose (integer, default: 0): Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.
  • device (string, default: "cuda"): Device to run the model on.
  • timeout (integer, default: 600): Maximum execution time in seconds. None waits indefinitely.
  • seed (integer): Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.

Output:
  • gradient (array): Gradient w.r.t. input logits, or None when compute_gradient=False (forward-only scoring).
  • loss (number, required): Mean negative log-likelihood over AA positions.
  • metrics (Dict[string, any]): log_likelihood, avg_log_likelihood, perplexity, sequence_length, model_choice, objective.
  • vocab (List[string], required): Amino-acid column ordering for the input logits.

Applications

This tool is appropriate for differentiable antibody-design pipelines that update a continuous sequence representation by gradient descent. Representative applications include relaxed-logit hallucination for antibody design, joint optimisation of AbLang likelihood together with structure-based losses such as AlphaFold2 hallucination, and incorporating an antibody-specific naturalness term into broader binder-design objectives.

Usage Tips

  • Input logits use the canonical protein order ACDEFGHIKLMNPQRSTVWY. The tool implementation internally maps to AbLang’s vocabulary order before the forward pass and returns the gradient in the same canonical order, so the user does not need to handle the AbLang-specific token order separately.
  • Set temperature to apply a softmax before scoring. When temperature is set, the tool implementation applies softmax(input / temperature) to the input logits before the forward pass. Leave temperature=None (the default) when the user already provides a normalised probability distribution.
  • Use the Straight-Through Estimator option for discrete-token gradients. Setting use_ste=True substitutes hard one-hot tokens in the forward pass while allowing gradients to flow through the soft probabilities, which can produce sharper update directions for some discrete-design loops. The default (use_ste=False) uses soft blended embeddings.
  • Set compute_gradient=False for forward-only scoring. This skips the backward pass and returns gradient=None together with the loss value, which is useful for ranking candidates from a Monte Carlo proposal without paying the backward-pass cost.
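
The column reordering described in the first tip can be pictured as a pair of index permutations. The sketch below is illustrative only: MODEL_ORDER is an assumed model-side ordering chosen for the example, not one confirmed by the source, and reorder_columns is a hypothetical helper.

```python
CANONICAL = "ACDEFGHIKLMNPQRSTVWY"
MODEL_ORDER = "MRHKDESTNQCGPAVIFYWL"  # illustrative model-side order (assumption)

# TO_MODEL[j] is the canonical column that feeds model column j.
TO_MODEL = [CANONICAL.index(aa) for aa in MODEL_ORDER]
# TO_CANONICAL[i] is the model column that holds canonical residue i.
TO_CANONICAL = [MODEL_ORDER.index(aa) for aa in CANONICAL]

def reorder_columns(matrix, index_map):
    """Return a copy of matrix whose column k is the input column index_map[k]."""
    return [[row[i] for i in index_map] for row in matrix]

# One position; each logit equals its canonical column index for easy checking.
canonical_logits = [[float(i) for i in range(20)]]
model_logits = reorder_columns(canonical_logits, TO_MODEL)
round_trip = reorder_columns(model_logits, TO_CANONICAL)
print(model_logits[0][:3], round_trip == canonical_logits)
```

Applying the inverse permutation to a gradient computed in model order returns it to canonical order, which is the behaviour the tip describes from the caller's side.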

Toolkit Notes

These apply to every AbLang tool in this toolkit (ablang-embedding, ablang-gradient, ablang-sample, ablang-score).
  • All four tools route automatically among the three AbLang checkpoints based on the chains provided. Providing only a heavy chain selects ablang1-heavy, providing only a light chain selects ablang1-light, and providing both selects the paired ablang2-paired checkpoint. At least one chain must be set on each input.
  • Every antibody in a batched call must use the same chain configuration. The embedding, scoring, and sampling tools accept a list of antibodies in a single call, and every antibody in that list must provide the same combination of heavy and light chains so that all entries route to the same checkpoint. Mixed lists are rejected at input construction with a clear error.
  • AbLang is appropriate for antibody variable-domain sequences only. Non-antibody proteins should be analysed with a general-purpose protein language model such as ESM2 rather than AbLang, which was trained exclusively on antibody sequences and produces unreliable scores or embeddings outside that distribution.
Example notebook: See the full working example for a copy-paste-ready walkthrough.
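
The routing and batch-consistency rules above can be sketched as follows. ToyAntibody, select_checkpoint, and checkpoint_for_batch are hypothetical stand-ins written for this illustration; the toolkit's own Antibody class and validation code may differ.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ToyAntibody:
    """Hypothetical stand-in for the toolkit's Antibody input."""
    heavy_chain: Optional[str] = None
    light_chain: Optional[str] = None

def select_checkpoint(antibody: ToyAntibody) -> str:
    """Map the chains present on one antibody to a checkpoint name."""
    if antibody.heavy_chain and antibody.light_chain:
        return "ablang2-paired"
    if antibody.heavy_chain:
        return "ablang1-heavy"
    if antibody.light_chain:
        return "ablang1-light"
    raise ValueError("at least one chain must be set")

def checkpoint_for_batch(antibodies: List[ToyAntibody]) -> str:
    """Require that every antibody in the batch routes to the same checkpoint."""
    if not antibodies:
        raise ValueError("empty batch")
    checkpoints = {select_checkpoint(ab) for ab in antibodies}
    if len(checkpoints) != 1:
        raise ValueError(f"mixed chain configurations: {sorted(checkpoints)}")
    return checkpoints.pop()

heavy_only = [ToyAntibody(heavy_chain="EVQL"), ToyAntibody(heavy_chain="QVQL")]
print(checkpoint_for_batch(heavy_only))
```

Running a check like this before submitting a batch is one way to split a mixed list into per-checkpoint groups up front, rather than relying on the error raised at input construction.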

Infrastructure Guides

The following guides cover how to run tools efficiently and at scale.

Tool Persistence

Keep a tool’s model warm across calls instead of reloading it every invocation.

Device Management

How GPUs are allocated to tools and how to target specific devices.

Parallel Execution

Fan a batch of inputs out across multiple GPUs.

Cloud Inference

Run tools on managed cloud infrastructure with no local setup.