Running models on GPUs involves a recurring set of concerns: tracking which device each model occupies, handling the case where a requested device is busy, and releasing memory once a model is no longer needed. DeviceManager handles these automatically. The device field on a tool’s config selects where a model runs; DeviceManager tracks every allocation, places models on free GPUs, and evicts the least-recently-used worker when no managed GPU is free. The defaults suit most workloads without configuration: one model per GPU, least-recently-used eviction, and a full restart when an evicted tool is reused. This guide describes the default behavior and the configuration available when the defaults do not fit a particular setup.
python
from proto_tools.tools.structure_prediction.esmfold import (
    run_esmfold, ESMFoldInput, ESMFoldConfig,
)
from proto_tools.utils.tool_instance import ToolInstance
from proto_tools.utils.device_manager import DeviceManager, OffloadStrategy

1. Requesting a device

Every tool’s config has a device field. GPU tools default to "cuda" and CPU tools default to "cpu", but a more specific device can be requested. The accepted strings fall into two categories, which differ in how much control DeviceManager has over where the model lands.

General requests (DeviceManager chooses the GPU)

A general request names a class of device and lets DeviceManager pick the specific one. This is the appropriate choice in most cases: it requires no knowledge of which GPUs are currently free, and when every GPU is busy, DeviceManager performs eviction automatically.

| Value | Meaning |
| --- | --- |
| "cpu" | Run on CPU |
| "cuda" | Run on one GPU; DeviceManager picks which one |
| "cudax2", "cudax3", … | Run on N GPUs; DeviceManager picks which ones |

Specific requests (the caller chooses the GPU)

A specific request names an exact device. DeviceManager honors it, evicting whatever currently occupies the slot if it is taken. This is appropriate when there is a reason to pin to a known device: benchmarking, affinity with other work on the same card, or reproducing a particular placement.

| Value | Meaning |
| --- | --- |
| "cuda:0" | Run on exactly GPU 0 |
| "cuda:0,1" or "cuda:0,cuda:1" | Run on exactly GPUs 0 and 1 |

Example: general versus specific allocation

The following two calls use identical inputs but different device requests. The first asks for any GPU; the second pins to device 2.
python
# General request: DeviceManager picks the GPU
with ToolInstance.persist():
    output = run_esmfold(ESMFoldInput(complexes=["MKTLLILAVVAAALA"]))

# Specific request: we pick the GPU
with ToolInstance.persist():
    output = run_esmfold(
        ESMFoldInput(complexes=["MKTLLILAVVAAALA"]),
        ESMFoldConfig(device="cuda:2"),
    )

Figure: General request lands on any free GPU; specific request pins to the requested device.
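
To make the distinction between general and specific requests concrete, the following is a minimal sketch of how the device strings listed earlier in this section could be classified. It is not the proto_tools parser: `DeviceRequest` and `parse_device_request` are hypothetical names introduced here, and the validation rules are assumptions based only on the string forms documented above.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class DeviceRequest:
    """Hypothetical parsed form of a device string (not a proto_tools type)."""
    kind: str                                  # "cpu", "general", or "specific"
    count: int = 0                             # number of GPUs requested
    indices: Optional[Tuple[int, ...]] = None  # set only for specific requests


def parse_device_request(device: str) -> DeviceRequest:
    """Classify a device string using the forms documented in this guide."""
    device = device.strip()
    if device == "cpu":
        return DeviceRequest(kind="cpu")
    if device == "cuda":
        return DeviceRequest(kind="general", count=1)

    # "cudax2", "cudax3", ...: N GPUs, the manager chooses which ones.
    match = re.fullmatch(r"cudax(\d+)", device)
    if match:
        count = int(match.group(1))
        if count < 1:
            raise ValueError(f"GPU count must be positive: {device!r}")
        return DeviceRequest(kind="general", count=count)

    # "cuda:0", "cuda:0,1", or "cuda:0,cuda:1": the caller names exact GPUs.
    if device.startswith("cuda:"):
        indices = []
        for part in device.split(","):
            part = part.strip()
            if part.startswith("cuda:"):
                part = part[len("cuda:"):]
            if not part.isdigit():
                raise ValueError(f"Unrecognized device string: {device!r}")
            indices.append(int(part))
        return DeviceRequest(kind="specific", count=len(indices), indices=tuple(indices))

    raise ValueError(f"Unrecognized device string: {device!r}")
```

Under these assumptions, "cuda:0,1" and "cuda:0,cuda:1" parse to the same specific request, while "cudax2" asks for two GPUs without naming them.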

Limiting the managed devices

By default, DeviceManager treats every visible GPU as part of its allocation pool. On a shared machine, or when some cards are deliberately reserved for another workload, the pool can be narrowed to a subset. This is done either by setting the BIO_TOOLS_MANAGED_DEVICES environment variable before any tool runs, or by calling configure(managed_devices=...) at runtime.
python
dm = DeviceManager.get_instance()
dm.configure(managed_devices=["cuda:1"])

# General requests now only land on cuda:1
with ToolInstance.persist():
    output = run_esmfold(ESMFoldInput(complexes=["MKTLLILAVVAAALA"]))

# Equivalent via environment variable:
# export BIO_TOOLS_MANAGED_DEVICES="cuda:1"

# Reset back to full pool
DeviceManager.reset_instance()

Figure: Managed pool restricts general requests to the listed GPUs; the rest are reserved for other work.

General requests now land only on managed GPUs. Unmanaged cards remain untouched by DeviceManager and stay available for other work on the same machine.
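
As a rough illustration of what restricting the pool means for a general request, the sketch below selects free devices from a managed list only. The function name `pick_free_devices` and its return convention (None when too few devices are free, at which point eviction would be needed) are assumptions for this example, not part of proto_tools.

```python
from typing import Iterable, List, Optional, Set


def pick_free_devices(
    count: int, managed: Iterable[str], occupied: Set[str]
) -> Optional[List[str]]:
    """Return `count` free devices from the managed pool, or None if too few are free.

    Devices outside `managed` are never considered, which mirrors the idea
    that unmanaged cards are left alone.
    """
    if count < 1:
        raise ValueError("count must be at least 1")
    free = [device for device in managed if device not in occupied]
    return free[:count] if len(free) >= count else None


# With only cuda:1 managed, a general request can land nowhere else.
managed_pool = ["cuda:1"]
first_choice = pick_free_devices(1, managed_pool, occupied=set())
# Once cuda:1 is taken, no free device remains in the pool.
second_choice = pick_free_devices(1, managed_pool, occupied={"cuda:1"})
```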

2. Eviction strategies

DeviceManager’s allocation pool is finite. When every managed GPU is occupied and a new model needs to load, an existing worker must be displaced. DeviceManager evicts the least recently used worker to make room; what eviction does to that worker is configurable through offload_strategy (or the BIO_TOOLS_OFFLOAD_STRATEGY environment variable).
  • RESTART (the default) shuts the evicted worker down entirely. This frees all of its memory, but the next call to that tool pays the full startup cost to reload it.
  • CPU moves the evicted model into system RAM. The model stays loaded, and returning it to the GPU later is a fast copy rather than a full reload.
RESTART is appropriate when GPU memory is the tightest constraint and the reload cost on an occasional wake-up is acceptable. CPU is appropriate when system RAM is plentiful and a small set of models is cycled frequently enough that cold reloads are costly.
Not every tool can stay resident on a GPU between calls, and not every tool can be offloaded to CPU. For tools that cannot, DeviceManager falls back to RESTART behavior regardless of the configured strategy, and every call pays the full startup cost.
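
The fallback rule can be summarized in a few lines. This is a sketch under stated assumptions: `Strategy` stands in for the two strategies described above, and the `supports_cpu_offload` flag is a hypothetical way of recording whether a tool can be offloaded; how proto_tools actually tracks this is not covered in this guide.

```python
from dataclasses import dataclass
from enum import Enum


class Strategy(Enum):
    """Stand-in for the two strategies described above (not the proto_tools enum)."""
    RESTART = "restart"
    CPU = "cpu"


@dataclass
class WorkerCaps:
    """Hypothetical capability record for a worker."""
    name: str
    supports_cpu_offload: bool


def effective_strategy(configured: Strategy, worker: WorkerCaps) -> Strategy:
    """Return what eviction does to this worker.

    CPU offload applies only when it is both configured and supported;
    every other combination falls back to RESTART.
    """
    if configured is Strategy.CPU and worker.supports_cpu_offload:
        return Strategy.CPU
    return Strategy.RESTART
```

The point of the sketch is that the configured strategy is an upper bound: a tool never gets CPU offload unless it supports it, but it can always be restarted.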

RESTART (default)

Under the default strategy, eviction terminates the evicted worker’s subprocess. The GPU slot is fully freed and the new model takes its place immediately.
python
DeviceManager.reset_instance()
dm = DeviceManager.get_instance()
dm.configure(managed_devices=["cuda:0"])

with ToolInstance.persist_tool("esmfold", instance_name="A"):
    run_esmfold(ESMFoldInput(complexes=["MKTLLILAVVAAALA"]), instance="A")
    # A: cuda:0, ~9 GB used

    # other tasks here ...

    with ToolInstance.persist_tool("esmfold", instance_name="B"):
        run_esmfold(ESMFoldInput(complexes=["GAVLTVLLGGLLLA"]), instance="B")
        # RESTART evicts A; A fully shut down
        # B: cuda:0, ~9 GB used

DeviceManager.reset_instance()

Figure: RESTART eviction: A is fully shut down when B needs cuda:0.

A later call to the evicted tool behaves exactly like its first call: a fresh subprocess, a fresh model load, and the full cold-start cost.

CPU offload

With offload_strategy=OffloadStrategy.CPU, an evicted worker is not torn down. DeviceManager moves its weights into CPU memory and keeps the process alive. Promoting it back to the GPU later is a tensor copy, which is typically much faster than a fresh model load.
python
DeviceManager.reset_instance()
dm = DeviceManager.get_instance()
dm.configure(managed_devices=["cuda:0"], offload_strategy=OffloadStrategy.CPU)

with ToolInstance.persist_tool("esmfold", instance_name="A"):
    run_esmfold(ESMFoldInput(complexes=["MKTLLILAVVAAALA"]), instance="A")
    # A: cuda:0

    # other tasks here ...

    with ToolInstance.persist_tool("esmfold", instance_name="B"):
        run_esmfold(ESMFoldInput(complexes=["GAVLTVLLGGLLLA"]), instance="B")
        # A moved to CPU, B: cuda:0

        # other tasks here ...

        run_esmfold(ESMFoldInput(complexes=["MKTLLILAVVAAALA"]), instance="A")
        # Fast CPU → GPU swap (~8s vs ~17s cold reload)
        # A: cuda:0, B: cpu

DeviceManager.reset_instance()

Figure: CPU offload: evicted models move to CPU memory and come back fast when re-activated.

The cost is RAM: every offloaded worker stays resident in system memory instead of on the GPU. When the number of models exceeds what fits across both GPU and system memory, RESTART is the better choice.
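
A quick way to reason about the trade-off is to multiply the per-swap saving by the number of re-activations. The sketch below uses the approximate timings quoted in the example comment (~8 s swap versus ~17 s cold reload) as default values; they are illustrative figures for that one example, not measurements that apply to every model or machine, and the function name is hypothetical.

```python
def offload_time_saved_s(
    reactivations: int,
    cold_reload_s: float = 17.0,  # approximate cold reload time from the example
    swap_s: float = 8.0,          # approximate CPU-to-GPU swap time from the example
) -> float:
    """Total seconds saved by swapping from CPU instead of cold-reloading."""
    if reactivations < 0:
        raise ValueError("reactivations must be non-negative")
    return reactivations * (cold_reload_s - swap_s)


# Ten re-activations at roughly 9 s saved each.
saved = offload_time_saved_s(10)  # 90.0
```

The saving grows linearly with how often evicted models are reused, which is why CPU offload suits a small set of frequently cycled models.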

LRU eviction across multiple GPUs

LRU eviction is straightforward: every call to a run_* function updates the last-used time on its worker. When a new allocation needs a GPU and every GPU is occupied, DeviceManager scans its allocation map and selects the worker with the oldest last-used time, regardless of which GPU it occupies.
python
DeviceManager.reset_instance()
dm = DeviceManager.get_instance()
dm.configure(managed_devices=["cuda:0", "cuda:1"])

with ToolInstance.persist_tool("esmfold", instance_name="A"):
    run_esmfold(ESMFoldInput(complexes=["MKTLLILAVVAAALA"]), instance="A")
    # A: cuda:0

    # other tasks here ...

    with ToolInstance.persist_tool("esmfold", instance_name="B"):
        run_esmfold(ESMFoldInput(complexes=["GAVLTVLLGGLLLA"]), instance="B")
        # A: cuda:0, B: cuda:1

        # other tasks here ...

        with ToolInstance.persist_tool("esmfold", instance_name="C"):
            run_esmfold(ESMFoldInput(complexes=["MGQQPGKVLGDQRR"]), instance="C")
            # A is the least recently used; evicted from cuda:0
            # B: cuda:1, C: cuda:0

DeviceManager.reset_instance()

Figure: LRU eviction with 2 GPUs and 3 instances: A is evicted because it was used least recently.

The particular GPU that the least-recently-used worker occupies does not matter; DeviceManager selects the oldest worker anywhere in the pool and reuses its slot.
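
The selection rule can be sketched as a toy allocator. This is an illustration of LRU eviction with one worker per device, not DeviceManager's implementation: the class and method names are hypothetical, and a logical counter replaces wall-clock timestamps so the behavior is deterministic.

```python
import itertools
from typing import Dict, List, Optional


class LruAllocator:
    """Toy one-worker-per-device allocator with LRU eviction (illustrative only)."""

    def __init__(self, devices: List[str]) -> None:
        self.devices = list(devices)
        self.owner: Dict[str, str] = {}      # device -> worker name
        self.last_used: Dict[str, int] = {}  # worker name -> logical timestamp
        self._clock = itertools.count()

    def use(self, worker: str) -> str:
        """Record a call for `worker`, allocating a device first if needed."""
        device = self._device_of(worker)
        if device is None:
            device = self._allocate(worker)
        self.last_used[worker] = next(self._clock)
        return device

    def _device_of(self, worker: str) -> Optional[str]:
        for device, name in self.owner.items():
            if name == worker:
                return device
        return None

    def _allocate(self, worker: str) -> str:
        # Prefer a free device when one exists.
        for device in self.devices:
            if device not in self.owner:
                self.owner[device] = worker
                return device
        # Otherwise evict the worker with the oldest last-used time,
        # regardless of which device it occupies.
        victim_device = min(self.owner, key=lambda d: self.last_used[self.owner[d]])
        del self.last_used[self.owner[victim_device]]
        self.owner[victim_device] = worker
        return victim_device


alloc = LruAllocator(["cuda:0", "cuda:1"])
alloc.use("A")  # A takes cuda:0
alloc.use("B")  # B takes cuda:1
alloc.use("C")  # A is oldest, so C reuses cuda:0
```

Calling `use("A")` again between B and C would refresh A's timestamp, and B would become the eviction victim instead.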

3. Multiple models per device

DeviceManager’s default policy is one worker per GPU. When memory headroom allows, for example an 80 GB GPU and two 15 GB models that both need to be live, multiple workers can be packed onto the same card rather than contending for it.
DeviceManager does not track model sizes or estimate memory usage. Ensuring that the packed models actually fit is the caller’s responsibility; overcommitting produces an out-of-memory error.
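
Because the fit check is left to the caller, a small budget calculation before packing can help. The helper below is a hypothetical sketch: the function name and the 20% headroom reserved for activations and allocator overhead are assumptions, and real usage depends on batch size and sequence length. On a live system, a call such as `torch.cuda.mem_get_info` reports actual free memory and would be a more reliable input than nominal capacity.

```python
from typing import Iterable


def fits_on_device(
    model_sizes_gb: Iterable[float],
    capacity_gb: float,
    headroom_fraction: float = 0.2,  # assumed reserve for activations and overhead
) -> bool:
    """Return True if the models fit within capacity minus the reserved headroom."""
    if not 0.0 <= headroom_fraction < 1.0:
        raise ValueError("headroom_fraction must be in [0, 1)")
    usable_gb = capacity_gb * (1.0 - headroom_fraction)
    return sum(model_sizes_gb) <= usable_gb


# The example from the text: two 15 GB models on an 80 GB GPU.
two_models_fit = fits_on_device([15, 15], capacity_gb=80)
# Five such models exceed the budget once headroom is reserved.
five_models_fit = fits_on_device([15] * 5, capacity_gb=80)
```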

Packing four instances on two GPUs

With allow_multiple_per_device=True, new allocations round-robin across the pool instead of triggering eviction. Four instances across two GPUs place two on each.
python
DeviceManager.reset_instance()
dm = DeviceManager.get_instance()
dm.configure(managed_devices=["cuda:0", "cuda:1"], allow_multiple_per_device=True)

with ToolInstance.persist_tool("esmfold", instance_name="A"):
    with ToolInstance.persist_tool("esmfold", instance_name="B"):
        with ToolInstance.persist_tool("esmfold", instance_name="C"):
            with ToolInstance.persist_tool("esmfold", instance_name="D"):

                run_esmfold(ESMFoldInput(complexes=["MKTLLILAVVAAALA"]), instance="A")
                run_esmfold(ESMFoldInput(complexes=["GAVLTVLLGGLLLA"]), instance="B")
                run_esmfold(ESMFoldInput(complexes=["MGQQPGKVLGDQRR"]), instance="C")
                run_esmfold(ESMFoldInput(complexes=["ASTVKFLGPVLDAA"]), instance="D")
                # A: cuda:0, B: cuda:1, C: cuda:0, D: cuda:1

DeviceManager.reset_instance()

Figure: Four instances packed across two GPUs: A, C on cuda:0 and B, D on cuda:1.

Within the nested with blocks, every instance stays loaded: no eviction, no offload, and no restarts. The limit is GPU memory, less whatever is lost to fragmentation in PyTorch’s allocator.
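
The placement in the example follows directly from round-robin assignment. The sketch below reproduces it with a hypothetical helper; it assumes workers are assigned in the order they first run and ignores memory entirely, as the text above notes DeviceManager does.

```python
from typing import Dict, List


def round_robin_placement(workers: List[str], devices: List[str]) -> Dict[str, str]:
    """Assign each worker to a device by cycling through the pool in order."""
    if not devices:
        raise ValueError("at least one device is required")
    return {
        worker: devices[index % len(devices)]
        for index, worker in enumerate(workers)
    }


placement = round_robin_placement(["A", "B", "C", "D"], ["cuda:0", "cuda:1"])
# {'A': 'cuda:0', 'B': 'cuda:1', 'C': 'cuda:0', 'D': 'cuda:1'}
```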

Configuration reference

DeviceManager can be configured programmatically or through environment variables.

Environment variables

| Variable | Example | Description |
| --- | --- | --- |
| BIO_TOOLS_MANAGED_DEVICES | "cuda:0,cuda:1" | Restrict the device pool |
| BIO_TOOLS_OFFLOAD_STRATEGY | "restart" or "cpu" | Eviction strategy |
| BIO_TOOLS_ALLOW_MULTI_DEVICE | "true" or "false" | Allow multiple models per GPU |
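
For scripts that need to inspect these settings before any tool runs, the variables can be read with the standard library. The sketch below is not how proto_tools parses them: `EnvSettings` and `read_env_settings` are hypothetical names, the strict validation is an assumption, and the fallback values simply mirror the defaults described in this guide (every visible GPU, RESTART, one model per GPU).

```python
import os
from dataclasses import dataclass
from typing import List, Mapping, Optional


@dataclass
class EnvSettings:
    """Hypothetical container for the three documented variables."""
    managed_devices: Optional[List[str]]  # None means every visible GPU
    offload_strategy: str                 # "restart" or "cpu"
    allow_multiple_per_device: bool


def read_env_settings(env: Mapping[str, str] = os.environ) -> EnvSettings:
    """Read the documented variables, falling back to the documented defaults."""
    raw_devices = env.get("BIO_TOOLS_MANAGED_DEVICES", "")
    devices = [d.strip() for d in raw_devices.split(",") if d.strip()] or None

    strategy = env.get("BIO_TOOLS_OFFLOAD_STRATEGY", "restart").strip().lower()
    if strategy not in ("restart", "cpu"):
        raise ValueError(f"Unknown offload strategy: {strategy!r}")

    raw_multi = env.get("BIO_TOOLS_ALLOW_MULTI_DEVICE", "false").strip().lower()
    if raw_multi not in ("true", "false"):
        raise ValueError(f"Expected 'true' or 'false', got {raw_multi!r}")

    return EnvSettings(devices, strategy, raw_multi == "true")


# Passing a plain dict avoids touching the real environment.
settings = read_env_settings({
    "BIO_TOOLS_MANAGED_DEVICES": "cuda:0,cuda:1",
    "BIO_TOOLS_OFFLOAD_STRATEGY": "cpu",
})
```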

Programmatic configuration

python
from proto_tools.utils.device_manager import DeviceManager, OffloadStrategy

dm = DeviceManager.get_instance()
dm.configure(
    managed_devices=["cuda:0", "cuda:1"],
    offload_strategy=OffloadStrategy.CPU,
    allow_multiple_per_device=False,
)

Go deeper

For the implementation details behind this guide, consult the developer notes in the proto-tools repository: Device Management Reference (device strings, the allocation map, LRU eviction, RESTART vs CPU offload, managed devices, and packing).

Next Steps

  • Tool Persistence: Keep models loaded across calls to avoid repeated startup cost.
  • Parallel Execution: Fan work out across every managed GPU with ToolPool.