License: Malinois is open source and free for academic and commercial use under the MIT License, which may require explicit attribution. Refer to the license for full terms.

Proto is not affiliated with the Broad Institute, The Jackson Laboratory, or Yale University. This toolkit is open source and builds on the implementations produced by these organizations. Product names, logos, and trademarks are the property of their respective owners.


sjgosai/boda2
Computational Optimization of DNA Activity (CODA)
Machine-guided design of cell-type-targeting cis-regulatory elements
Sager J. Gosai, Rodrigo I. Castro, … Ryan Tewhey
Nature (2024)
Machine-guided design of synthetic cell type-specific cis-regulatory elements
SJ Gosai, R. Castro, … R. Tewhey
bioRxiv (2023)
@article{gosai2024machine,
  title={Machine-guided design of cell-type-targeting cis-regulatory elements},
  author={Gosai, Sager J. and Castro, Rodrigo I. and Fuentes, Natalia and Butts, John C. and Mouri, Kousuke and Alasoadura, Michael and Kales, Susan and Nguyen, Thanh Thanh L. and Noche, Ramil R. and Rao, Arya S. and Joy, Mary T. and Sabeti, Pardis C. and Reilly, Steven K. and Tewhey, Ryan},
  journal={Nature},
  volume={634},
  pages={1211--1220},
  year={2024},
  doi={10.1038/s41586-024-08070-z},
  url={https://doi.org/10.1038/s41586-024-08070-z}
}

@misc{gosai2024machine_zenodo,
  title={Machine-guided design of cell-type-targeting cis-regulatory elements},
  author={Gosai, Sager and Castro, Rodrigo and Fuentes, Natalia and Butts, John and Mouri, Kousuke and Alasoadura, Michael and Kales, Susan and Nguyen, Thanh Thanh and Noche, Ramil and Rao, Arya and Joy, Mary Teena and Sabeti, Pardis and Reilly, Steven and Tewhey, Ryan},
  year={2024},
  publisher={Zenodo},
  version={1.0},
  doi={10.5281/zenodo.10698014},
  url={https://doi.org/10.5281/zenodo.10698014},
  note={Supplemental data and resources for the Nature article, including the Malinois model artifact}
}
proto-bio/proto-tools/proto_tools/tools/sequence_scoring/malinois
| Function | Description |
| --- | --- |
| run_malinois_gradient() | Compute differentiable Malinois MPRA activity losses and gradients for relaxed DNA logits (GPU) |
| run_malinois_score() | Score regulatory DNA activity using the Malinois MPRA model (GPU) |

Background

Malinois is the regulatory sequence model used in CODA (Gosai et al., 2024) for machine-guided design of cell-type-targeting cis-regulatory elements. It adapts the Basset-style convolutional architecture to MPRA data and predicts activity from a fixed 200-nucleotide insert once the assay flanks expected by the published checkpoint have been added. The model returns one raw activity value for each supported cell context: K562, HepG2, and SK-N-SH. The scoring wrapper averages the forward and reverse-complement predictions and returns the raw outputs for the requested cell types. The gradient wrapper applies max/min sigmoid objective terms to these raw scores and backpropagates through relaxed A,C,G,T logits, matching the Fast SeqProp-style design path used for regulatory DNA optimization.
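The reverse-complement averaging mentioned above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the toolkit's implementation: `toy_predict` is a hypothetical stand-in for a model forward pass, and the helper names are invented for this example.

```python
from typing import Callable, Dict

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase A/C/G/T string."""
    return seq.translate(_COMPLEMENT)[::-1]


def rc_averaged_scores(
    seq: str, predict: Callable[[str], Dict[str, float]]
) -> Dict[str, float]:
    """Average per-cell predictions over the forward and reverse-complement strands."""
    fwd = predict(seq)
    rev = predict(reverse_complement(seq))
    return {cell: 0.5 * (fwd[cell] + rev[cell]) for cell in fwd}


def toy_predict(seq: str) -> Dict[str, float]:
    """Hypothetical stand-in for the model (NOT Malinois): strand-sensitive toy numbers."""
    n = len(seq)
    return {
        "K562": seq.count("A") / n,
        "HepG2": seq.count("C") / n,
        "SKNSH": seq.count("G") / n,
    }


scores = rc_averaged_scores("AAAC", toy_predict)
```

Averaging over both strands makes the returned score independent of which strand of the insert is supplied, which is why the same sequence and its reverse complement receive identical values in this sketch.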

Tools

Malinois Score (malinois-score)

Scores one or more 200 bp DNA inserts and returns raw Malinois predictions keyed by requested cell type.

API Reference

Source
sequences
List[string]
required
DNA insert sequence(s) to score. A single string is normalized to a one-item list. Sequences must contain only A, C, G, and T; the configured seq_length is checked at run time.
Source
cell_types
List[string]
default:"['K562', 'HepG2', 'SKNSH']"
Cell-type outputs to return.
seq_length
integer
default:"200"
Expected insert length before MPRA flank padding.
artifact_path
string
default:""
Optional local override path to the Malinois model artifact tarball. Leave empty to download artifact_url into the managed weights cache.
artifact_url
string
HTTPS URL used to provision the Malinois artifact.
artifact_md5
string
default:"375142a714e7df73c463b46113a65210"
Optional MD5 checksum for the downloaded artifact.
malinois_dir
string
default:""
Optional local override directory containing unpacked artifact metadata. Leave empty to use the cache extraction directory.
batch_size
integer
default:"1"
Number of sequences to process in each GPU batch.
verbose
integer
default:"0"
Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.
device
string
default:"cuda"
Device used for inference.
timeout
integer
default:"600"
Maximum execution time in seconds. None waits indefinitely.
seed
integer
Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.
Source
results
List[MalinoisScoreResult]
required
Per-sequence Malinois predictions.
cell_types
List[string]
required
Cell types included in each result’s scores.
seq_length
integer
required
Expected insert length used for MPRA flank padding.
Metrics (one set per results item)
| Metric | Type | Range | Availability |
| --- | --- | --- | --- |
| K562 | float | unbounded | when requested |
| HepG2 | float | unbounded | when requested |
| SKNSH | float | unbounded | when requested |

Applications

Use this tool to rank regulatory DNA designs by predicted activity in K562, HepG2, or SK-N-SH cells, screen MPRA insert candidates, or compare candidate designs before selecting sequences for downstream validation.
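As one way to turn per-cell scores into a ranking, the sketch below orders designs by on-target score minus the highest off-target score. The specificity metric, the function names, and the numbers are all assumptions made for illustration; the values are placeholders, not real Malinois predictions.

```python
from typing import Dict, List, Tuple


def specificity(scores: Dict[str, float], target: str) -> float:
    """On-target score minus the highest off-target score (an illustrative metric)."""
    off_target = [value for cell, value in scores.items() if cell != target]
    return scores[target] - max(off_target)


def rank_designs(
    results: Dict[str, Dict[str, float]], target: str
) -> List[Tuple[str, float]]:
    """Sort designs from most to least specific for the target cell type."""
    ranked = [(name, specificity(s, target)) for name, s in results.items()]
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Made-up placeholder values keyed by the canonical cell-type names.
example_results = {
    "design_a": {"K562": 2.0, "HepG2": 0.5, "SKNSH": 0.0},
    "design_b": {"K562": 3.0, "HepG2": 2.5, "SKNSH": 1.0},
    "design_c": {"K562": 1.0, "HepG2": -1.0, "SKNSH": -1.5},
}
ranking = rank_designs(example_results, target="K562")
```

Note that `design_b` has the highest raw K562 score but ranks last here, because its off-target activity is nearly as high; ranking by raw activity alone would give a different order.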

Usage Tips

  • Sequence length is fixed by default. Inputs must match seq_length, which defaults to 200 bp.
  • Cell type keys are canonical. Request outputs as K562, HepG2, and SKNSH; SKNSH maps to the SK-N-SH model output.
  • Batch size affects throughput. Increase batch_size for many same-length inserts when GPU memory allows.
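The length and alphabet constraints in the tips above can be checked before any GPU time is spent. The following is a hedged sketch: `validate_inserts` and `batched` are hypothetical helpers, not part of the toolkit, and treating lowercase bases as invalid is an assumption.

```python
from typing import Iterator, List, Sequence, Union


def validate_inserts(
    sequences: Union[str, Sequence[str]], seq_length: int = 200
) -> List[str]:
    """Normalize to a list and check length and alphabet for each insert."""
    if isinstance(sequences, str):
        sequences = [sequences]  # a single string becomes a one-item list
    for index, seq in enumerate(sequences):
        if len(seq) != seq_length:
            raise ValueError(
                f"sequence {index} has length {len(seq)}, expected {seq_length}"
            )
        invalid = set(seq) - set("ACGT")
        if invalid:
            raise ValueError(
                f"sequence {index} contains invalid characters: {sorted(invalid)}"
            )
    return list(sequences)


def batched(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]
```

Validating up front gives a clear error per offending sequence instead of a failure partway through a long scoring run.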

Malinois Gradient (malinois-gradient)

Computes a weighted differentiable activity objective and, by default, returns the gradient with respect to batched relaxed DNA logits.
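To make the expected input shape concrete, the sketch below builds a (B, L, 4) nested list from a DNA string, with columns in A,C,G,T order and B=1. The helper name and the `on_value`/`off_value` magnitudes are arbitrary assumptions for illustration; any real design loop would choose its own initialization.

```python
from typing import List

VOCAB = ["A", "C", "G", "T"]  # column order used for logits and gradients


def sequence_to_logits(
    seq: str, on_value: float = 2.0, off_value: float = -2.0
) -> List[List[List[float]]]:
    """Encode one sequence as logits of shape (1, L, 4) favoring the given bases."""
    rows = []
    for base in seq:
        if base not in VOCAB:
            raise ValueError(f"unsupported base: {base!r}")
        rows.append([on_value if base == v else off_value for v in VOCAB])
    return [rows]  # leading batch axis, B=1 for a single candidate


logits = sequence_to_logits("ACGT")
```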

API Reference

Source
logits
List[array]
required
Batched relaxed DNA sequence logits with shape (B, L, 4) in A,C,G,T order. Use B=1 for a single design candidate.
temperature
number
default:"1.0"
Softmax temperature used to relax logits.
Source
loss_terms
List[MalinoisGradientLossTerm]
Per-cell objective terms summed into one scalar loss.
seq_length
integer
default:"200"
Expected insert length before Malinois flank padding.
artifact_path
string
default:""
Optional local artifact tarball path.
artifact_url
string
URL used to provision the Malinois artifact.
artifact_md5
string
default:"375142a714e7df73c463b46113a65210"
Expected checksum for the downloaded artifact.
malinois_dir
string
default:""
Optional extracted Malinois artifact directory.
soft
number
default:"1.0"
Blend factor between the hard argmax one-hot encoding (0) and softmax probabilities (1).
hard
number
default:"0.0"
Straight-through hard-forward coefficient.
compute_gradient
boolean
default:"True"
Run backward pass and return gradient.
verbose
integer
default:"0"
Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.
device
string
default:"cuda"
Device used for inference and backpropagation.
timeout
integer
default:"600"
Maximum execution time in seconds. None waits indefinitely.
seed
integer
Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.
Source
sample_metrics
List[MalinoisGradientSampleMetrics]
Per-sample metric containers with scalar loss and raw cell-type scores.
gradient
array
Gradient tensor matching input DNA logits, or None when compute_gradient=False.
loss
number
required
Sum of per-sample weighted scalar objective values. Per-sample values are available in sample_metrics.
metrics
Dict[string, any]
Legacy metadata bundle from the standalone worker, including raw scores, objective-term metadata, and runtime relaxation parameters.
vocab
List[string]
required
DNA column ordering for logits and gradient.
Metrics
| Metric | Type | Range | Availability |
| --- | --- | --- | --- |
| loss | float | ≥ 0.0 | always |
| K562 | float | unbounded | when requested |
| HepG2 | float | unbounded | when requested |
| SKNSH | float | unbounded | when requested |

Applications

Use this tool inside gradient-based DNA design loops to maximize activity in an on-target cell type while minimizing activity in off-target cell types. It is designed for optimizer calls rather than final biological validation.
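Because each call is a single evaluation, the surrounding optimizer owns the update step. The loop below is a minimal sketch of that pattern using plain gradient descent; `toy_evaluate` is a hypothetical stand-in that returns a loss and gradient for a simple quadratic objective, not a Malinois call, and the batch axis is dropped for brevity.

```python
from typing import Callable, List, Tuple

VOCAB = "ACGT"
Logits = List[List[float]]  # one candidate, shape (L, 4)


def toy_evaluate(logits: Logits, target: str) -> Tuple[float, Logits]:
    """Hypothetical stand-in for one loss/gradient evaluation (NOT Malinois)."""
    loss = 0.0
    grad = []
    for row, base in zip(logits, target):
        want = [1.0 if v == base else 0.0 for v in VOCAB]
        diff = [x - w for x, w in zip(row, want)]
        loss += 0.5 * sum(d * d for d in diff)
        grad.append(diff)  # gradient of the quadratic loss w.r.t. the logits
    return loss, grad


def optimize(
    logits: Logits,
    evaluate: Callable[[Logits], Tuple[float, Logits]],
    steps: int = 100,
    lr: float = 0.1,
) -> Tuple[Logits, List[float]]:
    """Run gradient descent, calling evaluate once per step."""
    history = []
    for _ in range(steps):
        loss, grad = evaluate(logits)
        history.append(loss)
        logits = [
            [x - lr * g for x, g in zip(row, grad_row)]
            for row, grad_row in zip(logits, grad)
        ]
    return logits, history


def decode(logits: Logits) -> str:
    """Read out the argmax base at each position."""
    return "".join(VOCAB[max(range(4), key=row.__getitem__)] for row in logits)


target = "ACGTTGCA"
start = [[0.0] * 4 for _ in target]
final, history = optimize(start, lambda lg: toy_evaluate(lg, target))
```

In a real design loop the `evaluate` callable would wrap the gradient tool, and the decoded sequence would then be re-scored with the scoring tool before any downstream validation.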

Usage Tips

  • Logits are batched. Pass logits with shape B x L x 4 in A,C,G,T order; use B=1 for a single candidate.
  • Directions are per term. The raw score is centered and scaled first; direction="max" then minimizes 1 - sigmoid(raw) and direction="min" minimizes sigmoid(raw).
  • Soft/hard mixing controls relaxation. soft=1.0, hard=0.0 is fully soft; increasing hard uses a straight-through hard-forward estimator.
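The tips above can be illustrated with a small pure-Python sketch. Several details here are assumptions rather than documented behavior: the parameter names `weight`, `center`, and `scale`, the use of logits divided by temperature inside the softmax, and the linear interpolation used for `soft`. The straight-through `hard` path is omitted because it only changes how gradients flow.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def term_loss(
    raw: float,
    direction: str,
    weight: float = 1.0,
    center: float = 0.0,
    scale: float = 1.0,
) -> float:
    """One objective term on a raw score (parameter names are assumptions)."""
    s = sigmoid((raw - center) / scale)
    if direction == "max":
        return weight * (1.0 - s)  # small when the score is high
    if direction == "min":
        return weight * s  # small when the score is low
    raise ValueError(f"unknown direction: {direction!r}")


def relax_position(
    logits: List[float], temperature: float = 1.0, soft: float = 1.0
) -> List[float]:
    """Blend softmax probabilities with the argmax one-hot for one position."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(logits)), key=logits.__getitem__)
    one_hot = [1.0 if i == best else 0.0 for i in range(len(logits))]
    # Assumed linear blend: soft=1 gives probabilities, soft=0 gives one-hot.
    return [soft * p + (1.0 - soft) * h for p, h in zip(probs, one_hot)]


# An on-target "max" term plus an off-target "min" term, summed into one scalar.
total_loss = term_loss(3.0, "max") + term_loss(-1.0, "min")
```

With these definitions each term lies between 0 and its weight, so a loss near 0 means every term is close to satisfied.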

Toolkit Notes

These apply to every Malinois tool in this toolkit (malinois-score, malinois-gradient).
  • Requires a GPU. Both tools load a PyTorch Malinois checkpoint and run most practically on CUDA.
  • Weights are provisioned automatically. By default, the standalone worker downloads the CODA Zenodo artifact into the managed model cache and verifies its MD5 checksum.
  • The gradient tool is a single evaluation. It returns one loss and optional gradient for the provided logits; run it from an optimizer for iterative design.
Example notebook: See the full working example for a copy-paste-ready walkthrough.
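When supplying a local tarball through artifact_path, the checksum can be confirmed independently of the worker. The sketch below uses only Python's standard hashlib; the helper names are hypothetical and the chunked read is a design choice to avoid loading a large artifact into memory.

```python
import hashlib


def file_md5(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifact(path: str, expected_md5: str) -> bool:
    """Return True when the file's MD5 matches the expected hex digest."""
    return file_md5(path) == expected_md5.strip().lower()
```

For the default artifact, `expected_md5` would be the documented artifact_md5 value, 375142a714e7df73c463b46113a65210.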

Infrastructure Guides

The following guides cover how to run tools efficiently and at scale.

Tool Persistence

Keep a tool’s model warm across calls instead of reloading it every invocation.

Device Management

How GPUs are allocated to tools and how to target specific devices.

Parallel Execution

Fan a batch of inputs out across multiple GPUs.

Cloud Inference

Run tools on managed cloud infrastructure with no local setup.

Additional Information