Malinois - Proto

License: Malinois is open source and free for academic and commercial use under the MIT license, which may require explicit attribution. Refer to the license for full terms.

Proto is not affiliated with the Broad Institute, The Jackson Laboratory, or Yale University. This toolkit is open source and builds on the implementations produced by these organizations. Product names, logos, and trademarks are the property of their respective owners.

sjgosai/boda2

Computational Optimization of DNA Activity (CODA)

68 stars

Machine-guided design of cell-type-targeting cis-regulatory elements

Sager J. Gosai, Rodrigo I. Castro, … Ryan Tewhey

Nature (2024)

Machine-guided design of synthetic cell type-specific cis-regulatory elements

SJ Gosai, R. Castro, … R. Tewhey

bioRxiv (2023)

@article{gosai2024machine,
  title={Machine-guided design of cell-type-targeting cis-regulatory elements},
  author={Gosai, Sager J. and Castro, Rodrigo I. and Fuentes, Natalia and Butts, John C. and Mouri, Kousuke and Alasoadura, Michael and Kales, Susan and Nguyen, Thanh Thanh L. and Noche, Ramil R. and Rao, Arya S. and Joy, Mary T. and Sabeti, Pardis C. and Reilly, Steven K. and Tewhey, Ryan},
  journal={Nature},
  volume={634},
  pages={1211--1220},
  year={2024},
  doi={10.1038/s41586-024-08070-z},
  url={https://doi.org/10.1038/s41586-024-08070-z}
}

@misc{gosai2024machine_zenodo,
  title={Machine-guided design of cell-type-targeting cis-regulatory elements},
  author={Gosai, Sager and Castro, Rodrigo and Fuentes, Natalia and Butts, John and Mouri, Kousuke and Alasoadura, Michael and Kales, Susan and Nguyen, Thanh Thanh and Noche, Ramil and Rao, Arya and Joy, Mary Teena and Sabeti, Pardis and Reilly, Steven and Tewhey, Ryan},
  year={2024},
  publisher={Zenodo},
  version={1.0},
  doi={10.5281/zenodo.10698014},
  url={https://doi.org/10.5281/zenodo.10698014},
  note={Supplemental data and resources for the Nature article, including the Malinois model artifact}
}

proto-bio/proto-tools/proto_tools/tools/sequence_scoring/malinois

Function	Description
`run_malinois_gradient()`	Compute differentiable Malinois MPRA activity losses and gradients for relaxed DNA logits (GPU)
`run_malinois_score()`	Score regulatory DNA activity using the Malinois MPRA model (GPU)

Background

Malinois is the regulatory sequence model used in CODA (Gosai et al., 2024) for machine-guided design of cell-type-targeting cis-regulatory elements. The model adapts the Basset-style convolutional architecture to MPRA data. It predicts activity from a fixed 200-nucleotide insert after the assay flanks expected by the published checkpoint are added, and it returns one raw activity value for each supported cell context: K562, HepG2, and SK-N-SH. The scoring wrapper averages forward and reverse-complement predictions and returns the raw outputs for the requested cell types. The gradient wrapper applies a "max" or "min" sigmoid objective term to each selected raw score and backpropagates through relaxed A,C,G,T logits, matching the Fast SeqProp-style design path used for regulatory DNA optimization.
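
The strand averaging described above can be illustrated with a minimal sketch. It is not the wrapper's actual code: `toy_predict` is a hypothetical stand-in for the Malinois forward pass, and the helper names are assumptions introduced only for this example.

```python
from typing import Callable, Dict

# Translation table for complementing A<->T and C<->G.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase A/C/G/T string."""
    return seq.translate(_COMPLEMENT)[::-1]


def strand_averaged(
    predict: Callable[[str], Dict[str, float]], seq: str
) -> Dict[str, float]:
    """Average per-cell-type predictions over both strands of `seq`."""
    forward = predict(seq)
    reverse = predict(reverse_complement(seq))
    return {cell: (forward[cell] + reverse[cell]) / 2.0 for cell in forward}


def toy_predict(seq: str) -> Dict[str, float]:
    """Hypothetical stand-in for the model: counts bases instead of predicting."""
    return {
        "K562": float(seq.count("G")),
        "HepG2": float(seq.count("A")),
        "SKNSH": 0.0,
    }


if __name__ == "__main__":
    print(strand_averaged(toy_predict, "GGAT"))
```

Averaging both orientations makes the reported score independent of which strand of the insert was supplied.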

Tools

Malinois Score (`malinois-score`)

Scores one or more 200 bp DNA inserts and returns raw Malinois predictions keyed by requested cell type.

API Reference

Input: MalinoisScoreInput

sequences

List[string]

required

DNA insert sequence(s) to score. A single string is normalized to a one-item list. Sequences must contain only A, C, G, and T; the configured seq_length is checked at run time.
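
The input rules above (single string normalized to a list, A/C/G/T only, length matching `seq_length`) can be checked before submitting a job. The following is a sketch of such a pre-check, not the tool's own validator; `normalize_inserts` is a hypothetical helper, and treating lowercase bases as invalid is an assumption.

```python
from typing import List, Union

VALID_BASES = frozenset("ACGT")


def normalize_inserts(
    sequences: Union[str, List[str]], seq_length: int = 200
) -> List[str]:
    """Mirror the documented input rules: wrap a lone string, then validate."""
    if isinstance(sequences, str):
        sequences = [sequences]
    for index, seq in enumerate(sequences):
        unexpected = set(seq) - VALID_BASES
        if unexpected:
            raise ValueError(
                f"sequence {index} contains invalid characters: {sorted(unexpected)}"
            )
        if len(seq) != seq_length:
            raise ValueError(
                f"sequence {index} has length {len(seq)}, expected {seq_length}"
            )
    return list(sequences)


if __name__ == "__main__":
    inserts = normalize_inserts("ACGT" * 50)
    print(len(inserts), len(inserts[0]))
```

Running this locally catches malformed inserts before any GPU time is spent.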

Config: MalinoisScoreConfig

cell_types

List[string]

default:"['K562', 'HepG2', 'SKNSH']"

Cell-type outputs to return.

seq_length

integer

default:"200"

Expected insert length before MPRA flank padding.

artifact_path

string

default:""

Optional local override path to the Malinois model artifact tarball. Leave empty to download artifact_url into the managed weights cache.

artifact_url

string

HTTPS URL used to provision the Malinois artifact.

artifact_md5

string

default:"375142a714e7df73c463b46113a65210"

Optional MD5 checksum for the downloaded artifact.

malinois_dir

string

default:""

Optional local override directory containing unpacked artifact metadata. Leave empty to use the cache extraction directory.

batch_size

integer

default:"1"

Number of sequences to process in each GPU batch.

verbose

integer

default:"0"

Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.

device

string

default:"cuda"

Device used for inference.

timeout

integer

default:"600"

Maximum execution time in seconds. None waits indefinitely.

seed

integer

Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.

Output: MalinoisScoreOutput

results

List[MalinoisScoreResult]

required

Per-sequence Malinois predictions.

MalinoisScoreResult fields:

sequence

string

required

DNA sequence that was scored.

sequence_length

integer

required

Length of the scored DNA sequence.

scores

MalinoisActivityMetrics

required

Malinois predictions keyed by requested cell type name.

cell_types

List[string]

required

Cell types included in each result’s scores.

seq_length

integer

required

Expected insert length used for MPRA flank padding.

Metrics (one set per results item)

Metric	Type	Range	Availability
`K562`	float	unbounded	when requested
`HepG2`	float	unbounded	when requested
`SKNSH`	float	unbounded	when requested

Applications

Use this tool to rank regulatory DNA designs by predicted activity in K562, HepG2, or SK-N-SH cells, screen MPRA insert candidates, or compare candidate designs before selecting sequences for downstream validation.
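
As one illustration of ranking, the sketch below orders designs by the gap between the on-target score and the highest off-target score. The plain dictionaries only imitate the documented `sequence` and `scores` fields, the numbers are made up, and the gap metric is an assumption chosen for the example rather than something the tool computes.

```python
from typing import Dict, List


def specificity_gap(scores: Dict[str, float], target: str) -> float:
    """On-target raw score minus the highest off-target raw score."""
    off_target = [value for cell, value in scores.items() if cell != target]
    if not off_target:
        raise ValueError("need at least one off-target cell type")
    return scores[target] - max(off_target)


def rank_designs(results: List[dict], target: str) -> List[dict]:
    """Sort results from most to least target-specific."""
    return sorted(
        results,
        key=lambda result: specificity_gap(result["scores"], target),
        reverse=True,
    )


# Made-up values shaped like the documented per-sequence results.
example_results = [
    {"sequence": "design_a", "scores": {"K562": 5.0, "HepG2": 1.0, "SKNSH": 0.5}},
    {"sequence": "design_b", "scores": {"K562": 6.0, "HepG2": 5.5, "SKNSH": 0.2}},
    {"sequence": "design_c", "scores": {"K562": 3.0, "HepG2": 0.0, "SKNSH": -1.0}},
]

if __name__ == "__main__":
    for result in rank_designs(example_results, target="K562"):
        print(result["sequence"], specificity_gap(result["scores"], "K562"))
```

Note that `design_b` has the highest K562 score but ranks last, because its HepG2 score is nearly as high.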

Usage Tips

Sequence length is fixed by default. Inputs must match seq_length, which defaults to 200 bp.
Cell type keys are canonical. Request outputs as K562, HepG2, or SKNSH; SKNSH maps to the SK-N-SH model output.
Batch size affects throughput. Increase batch_size for many same-length inserts when GPU memory allows.

Malinois Gradient (`malinois-gradient`)

Computes a weighted differentiable activity objective and, by default, returns the gradient with respect to batched relaxed DNA logits.

API Reference

Input: MalinoisGradientInput

logits

List[array]

required

Batched relaxed DNA sequence logits with shape (B, L, 4) in A,C,G,T order. Use B=1 for a single design candidate.

temperature

number

default:"1.0"

Softmax temperature used to relax logits.
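
To make the (B, L, 4) layout and the role of `temperature` concrete, here is a small pure-Python sketch. It assumes the common convention of dividing logits by the temperature before the softmax; the tool's exact relaxation is not specified here, and `relax_logits` is a hypothetical helper.

```python
import math
from typing import List

Logits = List[List[List[float]]]  # shape (B, L, 4), columns in A,C,G,T order


def relax_logits(logits: Logits, temperature: float = 1.0) -> Logits:
    """Apply a temperature-scaled softmax across the 4 base columns."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    relaxed = []
    for sequence in logits:
        rows = []
        for row in sequence:
            scaled = [value / temperature for value in row]
            peak = max(scaled)  # subtract the max for numerical stability
            exps = [math.exp(value - peak) for value in scaled]
            total = sum(exps)
            rows.append([value / total for value in exps])
        relaxed.append(rows)
    return relaxed


if __name__ == "__main__":
    single = [[[0.0, 0.0, 0.0, 0.0], [2.0, 0.0, 0.0, 0.0]]]  # B=1, L=2
    print(relax_logits(single, temperature=1.0))
    print(relax_logits(single, temperature=0.5))
```

Under this convention, lower temperatures push each position closer to a one-hot base choice, and higher temperatures flatten it toward uniform.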

Config: MalinoisGradientConfig

loss_terms

List[MalinoisGradientLossTerm]

Per-cell objective terms summed into one scalar loss.

MalinoisGradientLossTerm fields:

cell_type

enum

default:"K562"

Malinois output to optimize. Available options: K562, HepG2, SKNSH

direction

enum

default:"max"

"max" minimizes 1 - sigmoid((raw - center) / scale); "min" minimizes sigmoid((raw - center) / scale). Available options: max, min

weight

number

default:"1.0"

Non-negative scalar applied before terms are summed.

sigmoid_center

number

default:"4.0"

Raw Malinois score where the sigmoid is 0.5.

sigmoid_scale

number

default:"1.0"

Positive scale for the raw score transform.

seq_length

integer

default:"200"

Expected insert length before Malinois flank padding.

artifact_path

string

default:""

Optional local artifact tarball path.

artifact_url

string

URL used to provision the Malinois artifact.

artifact_md5

string

default:"375142a714e7df73c463b46113a65210"

Expected checksum for the downloaded artifact.

malinois_dir

string

default:""

Optional extracted Malinois artifact directory.

soft

number

default:"1.0"

Blend hard argmax one-hot (0) to softmax probabilities (1).

hard

number

default:"0.0"

Straight-through hard-forward coefficient.

compute_gradient

boolean

default:"True"

Run backward pass and return gradient.

verbose

integer

default:"0"

Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.

device

string

default:"cuda"

Device used for inference and backpropagation.

timeout

integer

default:"600"

Maximum execution time in seconds. None waits indefinitely.

seed

integer

Output: MalinoisGradientOutput

sample_metrics

List[MalinoisGradientSampleMetrics]

Per-sample metric containers with scalar loss and raw cell-type scores.

MalinoisGradientSampleMetrics fields:

loss_terms

List[Dict[string, any]]

Per-objective-term metadata, including direction, weight, sigmoid transform, and weighted score.

primary_metric

string

Name of the metric that best summarizes the result overall (e.g. "avg_plddt" for AlphaFold2). Used by downstream UI and reporting to pick a headline value.

gradient

array

Gradient tensor matching input DNA logits, or None when compute_gradient=False.

loss

number

required

Sum of per-sample weighted scalar objective values. Per-sample values are available in sample_metrics.

metrics

Dict[string, any]

Legacy metadata bundle from the standalone worker, including raw scores, objective-term metadata, and runtime relaxation parameters.

vocab

List[string]

required

DNA column ordering for logits and gradient.

Metrics

Metric	Type	Range	Availability
`loss`	float	≥ 0.0	always
`K562`	float	unbounded	when requested
`HepG2`	float	unbounded	when requested
`SKNSH`	float	unbounded	when requested

Applications

Use this tool inside gradient-based DNA design loops to maximize activity in an on-target cell type while minimizing activity in off-target cell types. It is designed for optimizer calls rather than final biological validation.
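
The shape of such a design loop can be sketched as follows. This is an illustration only: `toy_evaluate` is a hypothetical stand-in that returns a loss and a gradient with the same (B, L, 4) layout the tool documents, and plain gradient descent is used in place of whatever optimizer a real workflow would choose.

```python
from typing import Callable, List, Tuple

VOCAB = ["A", "C", "G", "T"]
Logits = List[List[List[float]]]  # shape (B, L, 4)


def toy_evaluate(logits: Logits, target: str = "ACGT") -> Tuple[float, Logits]:
    """Hypothetical stand-in for one gradient call: squared distance to a one-hot target."""
    loss = 0.0
    gradient = []
    for sequence in logits:
        grad_rows = []
        for position, row in enumerate(sequence):
            wanted = [1.0 if base == target[position] else 0.0 for base in VOCAB]
            loss += sum((x - w) ** 2 for x, w in zip(row, wanted))
            grad_rows.append([2.0 * (x - w) for x, w in zip(row, wanted)])
        gradient.append(grad_rows)
    return loss, gradient


def descend(
    logits: Logits,
    evaluate: Callable[[Logits], Tuple[float, Logits]],
    steps: int = 50,
    learning_rate: float = 0.1,
) -> Logits:
    """Repeat: evaluate once, then step the logits against the gradient."""
    for _ in range(steps):
        _, gradient = evaluate(logits)
        logits = [
            [
                [x - learning_rate * g for x, g in zip(row, grad_row)]
                for row, grad_row in zip(sequence, grad_sequence)
            ]
            for sequence, grad_sequence in zip(logits, gradient)
        ]
    return logits


def decode(logits: Logits) -> List[str]:
    """Pick the highest-logit base at each position."""
    return [
        "".join(VOCAB[max(range(4), key=row.__getitem__)] for row in sequence)
        for sequence in logits
    ]


if __name__ == "__main__":
    start = [[[0.0] * 4 for _ in range(4)]]  # B=1, L=4
    print(decode(descend(start, toy_evaluate)))
```

In a real loop, the stand-in would be replaced by one gradient evaluation per step, with the caller owning the optimizer state.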

Usage Tips

Logits are batched. Pass logits with shape B x L x 4 in A,C,G,T order; use B=1 for a single candidate.
Directions are per term. direction="max" minimizes 1 - sigmoid((raw - center) / scale) and direction="min" minimizes sigmoid((raw - center) / scale), where center and scale are the term's sigmoid_center and sigmoid_scale.
Soft/hard mixing controls relaxation. soft=1.0, hard=0.0 is fully soft; increasing hard uses a straight-through hard-forward estimator.
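
The per-term objective described in these tips can be written out directly. The sketch below follows the documented formulas and defaults (`sigmoid_center=4.0`, `sigmoid_scale=1.0`, `weight=1.0`); representing terms as plain dictionaries and the function names are assumptions for illustration.

```python
import math
from typing import Dict, List


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def term_loss(
    raw: float,
    direction: str,
    weight: float = 1.0,
    center: float = 4.0,
    scale: float = 1.0,
) -> float:
    """Weighted loss for one objective term on one raw score."""
    activation = sigmoid((raw - center) / scale)
    if direction == "max":
        value = 1.0 - activation
    elif direction == "min":
        value = activation
    else:
        raise ValueError("direction must be 'max' or 'min'")
    return weight * value


def objective(raw_scores: Dict[str, float], terms: List[dict]) -> float:
    """Sum the weighted per-term losses into one scalar."""
    return sum(
        term_loss(
            raw_scores[term["cell_type"]],
            term["direction"],
            term.get("weight", 1.0),
            term.get("sigmoid_center", 4.0),
            term.get("sigmoid_scale", 1.0),
        )
        for term in terms
    )


if __name__ == "__main__":
    terms = [
        {"cell_type": "K562", "direction": "max"},
        {"cell_type": "HepG2", "direction": "min", "weight": 0.5},
    ]
    print(objective({"K562": 6.0, "HepG2": 1.0}, terms))
```

Because each term lies between 0 and its weight, the summed loss stays non-negative whenever all weights are non-negative.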

Toolkit Notes

These apply to every Malinois tool in this toolkit (malinois-score, malinois-gradient).

Requires a GPU. Both tools load a PyTorch Malinois checkpoint and run most practically on CUDA.
Weights are provisioned automatically. By default, the standalone worker downloads the CODA Zenodo artifact into the managed model cache and verifies its MD5 checksum.
The gradient tool is a single evaluation. It returns one loss and optional gradient for the provided logits; run it from an optimizer for iterative design.
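
When supplying a local tarball through `artifact_path`, the same kind of checksum check can be run by hand. This sketch uses only Python's standard `hashlib`; `md5_of_file` and `verify_artifact` are hypothetical helper names, not part of the toolkit.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through MD5 so large tarballs are not loaded into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifact(path: str, expected_md5: str) -> bool:
    """Compare a file's MD5 hex digest with the expected value, ignoring case."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

For the default artifact, the expected value would be the documented `artifact_md5` default, `375142a714e7df73c463b46113a65210`.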

Example notebook: See the full working example for a copy-paste-ready walkthrough.

Infrastructure Guides

The following guides cover how to run tools efficiently and at scale.

Tool Persistence

Keep a tool’s model warm across calls instead of reloading it every invocation.

Device Management

How GPUs are allocated to tools and how to target specific devices.

Parallel Execution

Fan a batch of inputs out across multiple GPUs.

Cloud Inference

Run tools on managed cloud infrastructure with no local setup.

Additional Information

References

Gosai, S. J. et al. Machine-guided design of cell-type-targeting cis-regulatory elements. Nature 634, 1211-1220 (2024). DOI: 10.1038/s41586-024-08070-z
CODA/BODA2 repository: sjgosai/boda2
CODA supplemental data and resources: Zenodo record 10698014
