DeviceManager handles these automatically. The device field on a tool’s config selects where a model runs, and DeviceManager tracks every allocation, places models on free GPUs, and evicts the least-recently-used worker when memory runs out.
The defaults suit most workloads without configuration: one model per GPU, least-recently-used eviction, and a full restart when an evicted tool is reused. This guide describes the default behavior and the configuration available when the defaults do not fit a particular setup.
1. Requesting a device
Every tool’s config has a device field. GPU tools default to "cuda" and CPU tools default to "cpu", but a more specific device can be requested. The accepted strings fall into two categories, which differ in how much control DeviceManager has over where the model lands.
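The two categories of device string, detailed in the subsections that follow, can be told apart mechanically. The sketch below is an illustration only, not DeviceManager's own parser: `DeviceRequest` and `parse_device` are hypothetical names, and the validation rules (rejecting duplicate GPU ids, accepting any count of one or more after `cudax`) are assumptions.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class DeviceRequest:
    """Hypothetical parsed form of a device string (not a library type)."""
    kind: str                                  # "cpu", "general", or "specific"
    gpu_count: int                             # number of GPUs requested (0 for CPU)
    gpu_ids: Optional[Tuple[int, ...]] = None  # set only for specific requests


def parse_device(value: str) -> DeviceRequest:
    """Classify a device string as a CPU, general, or specific request."""
    text = value.strip().lower()
    if text == "cpu":
        return DeviceRequest("cpu", 0)
    if text == "cuda":
        return DeviceRequest("general", 1)

    # "cudax2", "cudax3", ...: N GPUs, placement left to the manager.
    multi = re.fullmatch(r"cudax(\d+)", text)
    if multi:
        count = int(multi.group(1))
        if count < 1:
            raise ValueError(f"GPU count must be at least 1: {value!r}")
        return DeviceRequest("general", count)

    # "cuda:0", "cuda:0,1", or "cuda:0,cuda:1": exact GPU ids.
    if text.startswith("cuda:"):
        ids = []
        for part in text.split(","):
            part = part.strip()
            if part.startswith("cuda:"):
                part = part[len("cuda:"):]
            if not part.isdigit():
                raise ValueError(f"Bad GPU id in device string: {value!r}")
            ids.append(int(part))
        if len(set(ids)) != len(ids):
            raise ValueError(f"Duplicate GPU id in device string: {value!r}")
        return DeviceRequest("specific", len(ids), tuple(ids))

    raise ValueError(f"Unrecognized device string: {value!r}")
```

Under these assumptions, `parse_device("cuda:0,1")` and `parse_device("cuda:0,cuda:1")` yield the same specific request for GPUs 0 and 1, while `parse_device("cudax2")` yields a general request for two GPUs with no ids attached.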
General requests (DeviceManager chooses the GPU)
A general request names a class of device and lets DeviceManager pick the specific one. This is the appropriate choice in most cases: it requires no knowledge of which GPUs are currently free, and when every GPU is busy, DeviceManager performs eviction automatically.

| Value | Meaning |
|---|---|
| "cpu" | Run on CPU |
| "cuda" | Run on one GPU; DeviceManager picks which one |
| "cudax2", "cudax3", … | Run on N GPUs; DeviceManager picks which ones |
Specific requests (the caller chooses the GPU)
A specific request names an exact device. DeviceManager honors it, evicting whatever currently occupies the slot if it is taken. This is appropriate when there is a reason to pin to a known device: benchmarking, affinity with other work on the same card, or reproducing a particular placement.

| Value | Meaning |
|---|---|
| "cuda:0" | Run on exactly GPU 0 |
| "cuda:0,1" or "cuda:0,cuda:1" | Run on exactly GPUs 0 and 1 |
Example: general versus specific allocation
The following two calls use identical inputs but different device requests. The first asks for any GPU; the second pins to device 2.
Limiting the managed devices
By default, DeviceManager treats every visible GPU as part of its allocation pool. On a shared machine, or when some cards are deliberately reserved for another workload, the pool can be narrowed to a subset. This is done either by setting the BIO_TOOLS_MANAGED_DEVICES environment variable before any tool runs, or by calling configure(managed_devices=...) at runtime.
2. Eviction strategies
DeviceManager’s allocation pool is finite. When every managed GPU is full and a new model needs to load, an existing worker must be displaced. DeviceManager evicts the least recently used worker to make room; what eviction does to that worker is configurable through offload_strategy (or the BIO_TOOLS_OFFLOAD_STRATEGY environment variable).
- RESTART (the default) shuts the evicted worker down entirely. This frees all of its memory, but the next call to that tool pays the full startup cost to reload it.
- CPU moves the evicted model into system RAM. The model stays loaded, and returning it to the GPU later is a fast copy rather than a full reload.
Not every tool can stay resident on a GPU between calls, and not every tool can be offloaded to CPU. For tools that cannot, DeviceManager falls back to RESTART behavior regardless of the configured strategy, and every call pays the full startup cost.
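The fallback rule above amounts to a small decision function. The sketch below is an assumption-laden illustration: the `OffloadStrategy` enum here is a local stand-in that mirrors the two documented values, and the `supports_cpu_offload` flag is a hypothetical name for whatever capability check the library performs internally.

```python
from enum import Enum


class OffloadStrategy(Enum):
    """Local stand-in mirroring the two documented strategies."""
    RESTART = "restart"
    CPU = "cpu"


def effective_strategy(configured: OffloadStrategy,
                       supports_cpu_offload: bool) -> OffloadStrategy:
    """Return the strategy that would actually apply to one evicted tool.

    `supports_cpu_offload` is a hypothetical per-tool capability flag.
    """
    if configured is OffloadStrategy.CPU and supports_cpu_offload:
        return OffloadStrategy.CPU
    # Either RESTART was configured, or the tool cannot be offloaded.
    return OffloadStrategy.RESTART
```

In this model, setting the CPU strategy is a request rather than a guarantee: a tool that cannot be offloaded still resolves to RESTART.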
RESTART (default)
Under the default strategy, eviction terminates the evicted worker’s subprocess. The GPU slot is fully freed and the new model takes its place immediately.
CPU offload
With offload_strategy=OffloadStrategy.CPU, an evicted worker is not torn down. DeviceManager moves its weights into CPU memory and keeps the process alive. Promoting it back to the GPU later is a tensor copy, which is orders of magnitude faster than a fresh model load.
LRU eviction across multiple GPUs
LRU eviction is straightforward: every call to a run_* function updates the last-used time on its worker. When a new allocation needs a GPU and every GPU is occupied, DeviceManager scans its allocation map and selects the worker with the oldest last-used time, regardless of which GPU it occupies.
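The selection rule described above can be modeled in a few lines. This is a toy simulation under the default one-worker-per-GPU policy, not the library's implementation: `LruPool`, `Worker`, and the tool names are hypothetical, and a logical counter stands in for real timestamps so the behavior is deterministic.

```python
import itertools
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Tuple


@dataclass
class Worker:
    tool: str
    last_used: int  # logical timestamp; larger means more recent


class LruPool:
    """Toy allocation map: one worker per GPU, LRU eviction when full."""

    def __init__(self, gpu_ids: Iterable[int]) -> None:
        self.slots: Dict[int, Optional[Worker]] = {g: None for g in gpu_ids}
        self._clock = itertools.count()

    def run(self, tool: str) -> Tuple[int, Optional[str]]:
        """Place `tool` and return (gpu_id, evicted_tool_or_None)."""
        # Already loaded: refresh its last-used time and reuse the slot.
        for gpu, worker in self.slots.items():
            if worker is not None and worker.tool == tool:
                worker.last_used = next(self._clock)
                return gpu, None
        # A free GPU exists: take it without evicting anything.
        for gpu, worker in self.slots.items():
            if worker is None:
                self.slots[gpu] = Worker(tool, next(self._clock))
                return gpu, None
        # Every GPU is occupied: evict the worker with the oldest timestamp,
        # regardless of which GPU it sits on.
        victim_gpu = min(self.slots, key=lambda g: self.slots[g].last_used)
        evicted = self.slots[victim_gpu].tool
        self.slots[victim_gpu] = Worker(tool, next(self._clock))
        return victim_gpu, evicted
```

With two GPUs, running `tool_a`, then `tool_b`, then `tool_a` again leaves `tool_b` as the least recently used worker, so a subsequent `tool_c` displaces `tool_b` from GPU 1 even though `tool_a` was loaded first.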
3. Multiple models per device
DeviceManager’s default policy is one worker per GPU. When memory headroom allows, for example an 80 GB GPU and two 15 GB models that both need to be live, multiple workers can be packed onto the same card rather than contending for it.
DeviceManager does not track model sizes or estimate memory usage. Ensuring that the packed models actually fit is the caller’s responsibility; overcommitting produces an out-of-memory error.
Packing four instances on two GPUs
With allow_multiple_per_device=True, new allocations round-robin across the pool instead of triggering eviction. Four instances across two GPUs place two on each.
Inside the with blocks, every instance stays loaded: no eviction, no offload, and no restarts. The ceiling is GPU memory, together with whatever fragmentation PyTorch can accommodate.
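The round-robin placement described in this section can be sketched independently of the library. The function below is a hypothetical helper, not part of DeviceManager, and it assumes that placement simply cycles through the managed GPUs in order; the instance names are placeholders.

```python
from collections import defaultdict
from itertools import cycle
from typing import Dict, List, Sequence


def pack_round_robin(instances: Sequence[str],
                     gpu_ids: Sequence[int]) -> Dict[int, List[str]]:
    """Assign instances to GPUs in rotation, with no eviction."""
    if not gpu_ids:
        raise ValueError("at least one GPU is required")
    placement: Dict[int, List[str]] = defaultdict(list)
    # cycle() repeats the GPU list, so instance i lands on GPU i mod len(gpu_ids).
    for instance, gpu in zip(instances, cycle(gpu_ids)):
        placement[gpu].append(instance)
    return dict(placement)


layout = pack_round_robin(["m0", "m1", "m2", "m3"], [0, 1])
# layout == {0: ["m0", "m2"], 1: ["m1", "m3"]}
```

Nothing in this sketch checks memory, which matches the caveat above: whether two instances actually fit on one card remains the caller's responsibility.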
Configuration reference
DeviceManager can be configured programmatically or through environment variables.
Environment variables

| Variable | Example | Description |
|---|---|---|
| BIO_TOOLS_MANAGED_DEVICES | "cuda:0,cuda:1" | Restrict device pool |
| BIO_TOOLS_OFFLOAD_STRATEGY | "restart" or "cpu" | Eviction strategy |
| BIO_TOOLS_ALLOW_MULTI_DEVICE | "true" or "false" | Multiple models per GPU |
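To make the three variables concrete, the sketch below reads them into a plain settings object. It is an illustration of the documented values and defaults only: `DeviceSettings` and `settings_from_env` are hypothetical names, and the parsing details (comma splitting, case-insensitive matching, rejecting unknown values) are assumptions about how such values might be interpreted, not a description of the library's behavior.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional, Tuple


@dataclass(frozen=True)
class DeviceSettings:
    """Hypothetical container for the three documented settings."""
    managed_devices: Optional[Tuple[str, ...]]  # None means every visible GPU
    offload_strategy: str                       # "restart" or "cpu"
    allow_multiple_per_device: bool


def settings_from_env(env: Mapping[str, str] = os.environ) -> DeviceSettings:
    """Read the documented variables, falling back to the documented defaults."""
    raw_devices = env.get("BIO_TOOLS_MANAGED_DEVICES", "")
    devices = tuple(d.strip() for d in raw_devices.split(",") if d.strip())

    strategy = env.get("BIO_TOOLS_OFFLOAD_STRATEGY", "restart").strip().lower()
    if strategy not in ("restart", "cpu"):
        raise ValueError(f"Unknown offload strategy: {strategy!r}")

    raw_multi = env.get("BIO_TOOLS_ALLOW_MULTI_DEVICE", "false").strip().lower()
    if raw_multi not in ("true", "false"):
        raise ValueError(f"Expected 'true' or 'false', got {raw_multi!r}")

    return DeviceSettings(
        managed_devices=devices or None,
        offload_strategy=strategy,
        allow_multiple_per_device=(raw_multi == "true"),
    )
```

Passing a plain dictionary instead of `os.environ` makes the function easy to exercise in isolation, for example `settings_from_env({"BIO_TOOLS_OFFLOAD_STRATEGY": "cpu"})`.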
Programmatic configuration
Go deeper
For the implementation details behind this guide, consult the developer notes in the proto-tools repository:
Device Management Reference
Device strings, the allocation map, LRU eviction, RESTART vs CPU offload, managed devices, and packing.
Next Steps
Tool Persistence
Keep models loaded across calls to avoid repeated startup cost.
Parallel Execution
Fan work out across every managed GPU with ToolPool.