proto_tools runs inside an isolated environment managed by ToolInstance. This isolation keeps heavy dependencies such as PyTorch and ESM out of the main environment, and it allows every tool to expose the same interface regardless of its underlying requirements. The isolation has a further consequence: by default, every call to a run_* function is fully self-contained. A fresh subprocess is created, the model is loaded, inference is performed, and the subprocess exits. No state carries over between calls, and no GPU memory is retained once a call has completed.
This behavior is appropriate in most situations. For a single call in a notebook, one model load is a reasonable price for returning to a clean state afterward. For a batch workload, however, such as folding many sequences, sweeping hyperparameters, or stepping through an optimization loop, reloading the model on every call becomes the dominant cost. A forward pass may take a second, whereas loading a model such as ESMFold can take tens of seconds.
For workloads that require the model to remain in memory across calls, ToolInstance provides a small family of persistence mechanisms. These are peers to the one-shot default rather than improvements upon it; each is suited to a different shape of workload. The table below summarizes when each applies, and the remainder of this guide describes them in order of increasing control.
| Method | Caches? | Cleanup | Best for |
|---|---|---|---|
| Default (one-shot) | No | Automatic | Single calls, safety first |
| ToolInstance.persist() | Yes, automatic | Automatic on exit | Batch jobs, optimization loops |
| ToolInstance.persist_tool() | Yes, named tool | Automatic on exit | Multiple instances, multi-GPU |
| ToolInstance.get() | Yes, until closed | Manual | Long-running sessions |
1. Default behavior (one-shot)
Without additional configuration, every run_* function is one-shot. Each call launches a fresh subprocess, loads the model, performs inference, and tears everything down on completion. Isolation is the objective: no process runs in the background, no GPU memory is held between calls, and no call can affect another. For a notebook or a short script that requires only a single prediction, this is the appropriate behavior.
The cost becomes apparent as soon as two calls are issued in succession, because both pay the model load in full.
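The arithmetic behind that cost can be sketched with a toy model. The timings below are illustrative assumptions taken from the rough figures in this guide (about a second per forward pass, tens of seconds per load); they are not measurements, and the function names are hypothetical.

```python
# Illustrative timings only; real values depend on the model and hardware.
LOAD_SECONDS = 30.0   # assumed: "tens of seconds" to load a model such as ESMFold
INFER_SECONDS = 1.0   # assumed: roughly one second per forward pass


def one_shot_cost(n_calls: int) -> float:
    """Every call pays the model load plus inference."""
    return n_calls * (LOAD_SECONDS + INFER_SECONDS)


def persistent_cost(n_calls: int) -> float:
    """The model is loaded once; each call then pays inference only."""
    if n_calls == 0:
        return 0.0
    return LOAD_SECONDS + n_calls * INFER_SECONDS


for n in (1, 2, 100):
    print(n, one_shot_cost(n), persistent_cost(n))
```

Under these assumptions a single call costs the same either way, while one hundred one-shot calls spend most of their time reloading the model.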
2. The persist() context manager
The most convenient way to enable persistence is ToolInstance.persist(). It operates in the manner of torch.inference_mode(): the block of code that should use persistence is wrapped in the context manager, and any tool invoked within that block is cached on first use and cleaned up automatically when the block exits. The tool does not need to be named in advance, and the tool-call signature is unchanged. When a loop invokes several tools, each is cached on first use, so a design loop that runs both ProteinMPNN and ESMFold keeps both resident for the lifetime of the block. For this reason persist() is the recommended choice for most batch workloads.
Inside the with block, the first call still incurs the full load cost, since the model must be brought into memory. Every subsequent call skips the load entirely and runs inference against the resident worker. When the block exits, the worker is shut down, GPU memory is released, and execution returns to the default isolated state with no explicit cleanup required.
This is the appropriate pattern for nearly every batch workload: loops over sequences, optimization passes, and any procedure that invokes the same tool more than once.
3. Named instances with persist_tool()
persist_tool(tool_name) is a narrower form of persist() that scopes persistence to a single named tool. For most batch workloads persist() is sufficient, as it already caches every tool invoked within the block. The reason to use persist_tool() is the need for more than one live worker for the same tool at the same time.
This requirement arises most often with multi-GPU configurations, in which one ESMFold worker is pinned to cuda:0, another to cuda:1, and each call is routed to the appropriate worker. persist_tool() supports this through the instance_name argument. Each named instance runs in its own subprocess. At call time, the handle returned by the context manager, or the instance name as a string, is supplied as the instance= argument to the run_* call, and the call is dispatched to that specific worker.
Nested with blocks keep both workers resident for the duration of the inner block, and the pool is torn down cleanly when the outermost block exits.
4. Manual lifecycle with get() and shutdown()
In some cases the relevant lifecycle is not a block at all. In a Jupyter notebook, for example, a model is typically kept resident across many cells, including idle periods, without enclosing the entire session in a single with statement. ToolInstance.get() provides this control directly: it creates, or retrieves, a persistent worker that remains available until it is explicitly shut down.
After get(), the worker is loaded and serves every subsequent run_* call that has a matching configuration. The worker is released by calling tool.shutdown(), or ToolInstance.shutdown_instance("esmfold") when the handle is not available. get() is the appropriate choice when the natural unit of work is a session rather than a block.
5. Automatic restart on configuration changes
Regardless of the persistence mode in use, a worker always reflects the configuration with which it was most recently loaded. When a load-time parameter changes, such as device, model_checkpoint, model_name, or any other field marked reload_on_change=True, the persistence layer detects the mismatch and transparently restarts the worker with the new configuration before running the call.
This is the reason persist() and get() do not take a device= argument of their own: they do not require one. The first call within the block establishes the configuration, and any later change is applied automatically.
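One way to picture the mismatch check is a worker that remembers the load-time fields it was started with and compares them on every call. The sketch below is a hypothetical illustration of that idea, not the library's mechanism; the set of reload fields is taken from the examples named above, and num_samples is an invented call-time parameter.

```python
# Hypothetical stand-in for illustration; not the proto_tools implementation.
RELOAD_FIELDS = ("device", "model_name", "model_checkpoint")


class ToyWorker:
    def __init__(self):
        self.loaded_config = None   # load-time fields of the current model
        self.loads = 0              # number of simulated (re)loads

    def run(self, x, **params):
        # Only load-time fields take part in the comparison.
        load_config = {k: params.get(k) for k in RELOAD_FIELDS}
        if load_config != self.loaded_config:
            self.loads += 1                   # simulate restarting the worker
            self.loaded_config = load_config
        return f"{x}@{load_config['device']}"


worker = ToyWorker()
worker.run("A", device="cuda:0")                 # first call: load
worker.run("B", device="cuda:0")                 # same config: reuse
worker.run("C", device="cuda:0", num_samples=4)  # call-time change only: reuse
worker.run("D", device="cuda:1")                 # load-time change: restart
print(worker.loads)
```

Four calls produce two loads in the toy: the initial one and the restart caused by the device change.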
6. Timeout
Every tool call enforces a timeout, which defaults to 600 seconds (ten minutes). When a single call exceeds the timeout, its subprocess is terminated and a TimeoutError is raised so that the remainder of the program can determine how to proceed.
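The general technique of terminating a subprocess that overruns its budget and surfacing a TimeoutError can be sketched with the Python standard library alone. The helper name run_with_timeout is hypothetical, and this is not how proto_tools is necessarily implemented; a short timeout is used so that the example finishes quickly.

```python
import subprocess
import sys


def run_with_timeout(code: str, timeout: float = 600.0) -> str:
    """Run a Python snippet in a child process, killing it after `timeout` seconds."""
    try:
        completed = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout,   # subprocess.run kills the child when this expires
            check=True,
        )
    except subprocess.TimeoutExpired as exc:
        raise TimeoutError(f"call exceeded {timeout} s") from exc
    return completed.stdout


try:
    run_with_timeout("import time; time.sleep(5)", timeout=0.5)
except TimeoutError as err:
    # The caller decides how to proceed: retry, skip the input, or abort.
    print("timed out:", err)
```

Catching the error at the call site, as in the last lines, lets a batch loop skip a slow input and continue with the rest.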
Go deeper
For the implementation details behind this guide, consult the developer notes in the proto-tools repository:
Tool Persistence Reference
Worker cache hierarchy, dispatch paths, config-change restarts, persistent workers, and tool pools.
Next Steps
Device Management
Control which GPU each tool call runs on.
Parallel Execution
Run many tool calls concurrently across workers.