Each tool in proto_tools runs inside an isolated environment managed by ToolInstance. This isolation keeps heavy dependencies such as PyTorch and ESM out of the main environment, and it allows every tool to expose the same interface regardless of its underlying requirements.

The isolation has a further consequence: by default, every call to a run_* function is fully self-contained. A fresh subprocess is created, the model is loaded, inference is performed, and the subprocess exits. No state carries over between calls, and no GPU memory is retained once a call has completed.

This behavior is appropriate in most situations. For a single call in a notebook, one model load is a reasonable price for returning to a clean state afterward. For a batch workload, however, such as folding many sequences, sweeping hyperparameters, or stepping through an optimization loop, reloading the model on every call becomes the dominant cost. A forward pass may take a second, whereas loading a model such as ESMFold can take tens of seconds.

For workloads that require the model to remain in memory across calls, ToolInstance provides a small family of persistence mechanisms. These are peers to the one-shot default rather than improvements upon it; each is suited to a different shape of workload. The table below summarizes when each applies, and the remainder of this guide describes them in order of increasing control.
| Method | Caches? | Cleanup | Best for |
| Default (one-shot) | No | Automatic | Single calls, safety first |
| ToolInstance.persist() | Yes, automatic | Automatic on exit | Batch jobs, optimization loops |
| ToolInstance.persist_tool() | Yes, named tool | Automatic on exit | Multiple instances, multi-GPU |
| ToolInstance.get() | Yes, until closed | Manual | Long-running sessions |

1. Default behavior (one-shot)

Without additional configuration, every run_* function is one-shot. Each call launches a fresh subprocess, loads the model, performs inference, and tears everything down on completion. Isolation is the objective: no process runs in the background, no GPU memory is held between calls, and no call can affect another. For a notebook or a short script that requires only a single prediction, this is the appropriate behavior. The cost becomes apparent as soon as two calls are issued in succession, because both pay the model load in full.
python
from proto_tools.tools.structure_prediction.esmfold import (
    run_esmfold, ESMFoldInput, ESMFoldConfig,
)

# Two consecutive calls; each loads the model from scratch
output1 = run_esmfold(
    ESMFoldInput(complexes=["MKTLLILAVVAAALA"]),
    ESMFoldConfig(device="cuda"),
)
# ~16s (model load + inference)

output2 = run_esmfold(
    ESMFoldInput(complexes=["MKTLLILAVVAAALA"]),
    ESMFoldConfig(device="cuda"),
)
# ~16s (model load + inference again)
Figure: Default one-shot: each call reloads the model and releases GPU memory.

This is the intended behavior for an isolated, one-off call. For a workload that invokes the same tool repeatedly, one of the persistence modes described below avoids paying the load cost on every call.
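
To make the trade-off concrete, the following is a back-of-the-envelope cost model, not a benchmark. The function names are hypothetical, and the 15-second load and 1-second inference figures are illustrative assumptions in the spirit of the timings quoted in this guide, not measurements.

```python
def one_shot_total(n_calls: int, load_s: float, infer_s: float) -> float:
    """One-shot mode: every call pays the model load and the forward pass."""
    return n_calls * (load_s + infer_s)


def persistent_total(n_calls: int, load_s: float, infer_s: float) -> float:
    """Persistent mode: one load up front, then inference only."""
    if n_calls == 0:
        return 0.0
    return load_s + n_calls * infer_s


# Illustrative figures only (assumed, not measured).
LOAD_S, INFER_S = 15.0, 1.0

for n in (1, 4, 100):
    one_shot = one_shot_total(n, LOAD_S, INFER_S)
    persistent = persistent_total(n, LOAD_S, INFER_S)
    print(f"{n:>3} calls: one-shot {one_shot:.0f}s, persistent {persistent:.0f}s")
```

Under these assumptions the two modes cost the same for a single call, and the gap grows linearly with the number of calls, which is why the one-shot default is reasonable for isolated calls and costly for loops.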

2. The persist() context manager

The most convenient way to enable persistence is ToolInstance.persist(). It operates in the manner of torch.inference_mode(): the block of code that should use persistence is wrapped in the context manager, and any tool invoked within that block is cached on first use and cleaned up automatically when the block exits. The tool does not need to be named in advance, and the tool-call signature is unchanged. When a loop invokes several tools, each is cached on first use, so a design loop that runs both ProteinMPNN and ESMFold keeps both resident for the lifetime of the block. For this reason persist() is the recommended choice for most batch workloads.
python
from proto_tools.utils.tool_instance import ToolInstance

sequences = [
    "MKTLLILAVVAAALA",
    "GAVLTVLLGGLLLA",
    "MGQQPGKVLGDQRR",
    "AAKIKVLGDQRRQA",
]

with ToolInstance.persist():
    for seq in sequences:
        output = run_esmfold(
            ESMFoldInput(complexes=[seq]),
            ESMFoldConfig(device="cuda"),
        )

# Call 1:    ~15s (model load + inference)
# Calls 2-4: <1s each (inference only)
# Total:     ~18s
Figure: With persist(): one load, many inference calls inside the with block.

Within the with block, the first call still incurs the full load cost, since the model must be brought into memory. Every subsequent call skips the load entirely and runs inference against the resident worker. When the block exits, the worker is shut down, GPU memory is released, and execution returns to the default isolated state with no explicit cleanup required. This is the appropriate pattern for nearly every batch workload: loops over sequences, optimization passes, and any procedure that invokes the same tool more than once.

3. Named instances with persist_tool()

persist_tool(tool_name) is a narrower form of persist() that scopes persistence to a single named tool. For most batch workloads persist() is sufficient, as it already caches every tool invoked within the block. The reason to use persist_tool() is the need for more than one live worker for the same tool at the same time. This requirement arises most often with multi-GPU configurations, in which one ESMFold worker is pinned to cuda:0, another to cuda:1, and each call is routed to the appropriate worker. persist_tool() supports this through the instance_name argument. Each named instance runs in its own subprocess. At call time, the handle returned by the context manager, or the instance name as a string, is supplied as the instance= argument to the run_* call, and the call is dispatched to that specific worker.
python
with ToolInstance.persist_tool("esmfold", instance_name="worker_a") as inst_a:
    with ToolInstance.persist_tool("esmfold", instance_name="worker_b"):

        # Route each call to a specific worker
        out_a = run_esmfold(
            ESMFoldInput(complexes=["MKTLLILAVVAAALA"]),
            ESMFoldConfig(device="cuda:0"),
            instance=inst_a,            # pass the instance object
        )

        out_b = run_esmfold(
            ESMFoldInput(complexes=["MKTLLILAVVAAALA"]),
            ESMFoldConfig(device="cuda:1"),
            instance="worker_b",        # or pass the instance name
        )
Figure: Named instances: two workers, each pinned to its own GPU and routed by instance name.

Nested with blocks keep both workers resident for the duration of the inner block, and the pool is torn down cleanly when the outermost block exits.
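
When a list of sequences has to be spread over several named workers, the routing decision can be computed ahead of time. The helper below is a hypothetical illustration using only the standard library; `assign_round_robin` is not part of proto_tools, and it only decides which instance name each sequence would be sent to.

```python
from itertools import cycle


def assign_round_robin(sequences, instance_names):
    """Pair each sequence with an instance name, cycling through the names."""
    if not instance_names:
        raise ValueError("at least one instance name is required")
    return list(zip(sequences, cycle(instance_names)))


plan = assign_round_robin(
    ["MKTLLILAVVAAALA", "GAVLTVLLGGLLLA", "MGQQPGKVLGDQRR"],
    ["worker_a", "worker_b"],
)

for seq, name in plan:
    print(f"{seq} -> {name}")
```

Each resulting name could then be supplied as the instance= argument of a run_* call inside the nested persist_tool() blocks, together with the device that worker was pinned to. A plain loop over the plan still issues the calls one after another; running them concurrently is a separate concern.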

4. Manual lifecycle with get() and shutdown()

In some cases the relevant lifecycle is not a block at all. In a Jupyter notebook, for example, a model is typically kept resident across many cells, including idle periods, without enclosing the entire session in a single with statement. ToolInstance.get() provides this control directly: it creates, or retrieves, a persistent worker that remains available until it is explicitly shut down.
python
# Create a persistent instance; stays cached until explicitly shut down
tool = ToolInstance.get("esmfold")

for seq in sequences:
    output = run_esmfold(
        ESMFoldInput(complexes=[seq]),
        ESMFoldConfig(device="cuda"),
    )

# Clean up when done; stops worker and evicts from cache
tool.shutdown()

# Alternatively, shut down by name when no handle to the instance is available
ToolInstance.shutdown_instance("esmfold")
Figure: Manual lifecycle: explicit get() loads the worker; shutdown() releases it on demand.

The contract is straightforward: after get(), the worker is loaded and serves every subsequent run_* call that has a matching configuration. The worker is released by calling tool.shutdown(), or ToolInstance.shutdown_instance("esmfold") when the handle is not available. get() is the appropriate choice when the natural unit of work is a session rather than a block.
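
Because cleanup is manual in this mode, an exception raised partway through a script can leave a worker running. One defensive pattern is to wrap the work in try/finally. The sketch below uses a hypothetical `StandInWorker` class in place of a real handle so that it runs anywhere; the only assumption carried over from this guide is that the handle exposes a shutdown() method.

```python
class StandInWorker:
    """Hypothetical stand-in for a persistent worker handle."""

    def __init__(self):
        self.alive = True

    def shutdown(self):
        self.alive = False


def process_all(worker, items, step):
    """Apply `step` to each item, shutting the worker down even on failure."""
    results = []
    try:
        for item in items:
            results.append(step(item))
    finally:
        worker.shutdown()  # runs on success and on error alike
    return results


worker = StandInWorker()
lengths = process_all(worker, ["MKT", "GAVL"], len)
print(lengths, worker.alive)
```

In a notebook, where the worker is meant to outlive any single cell, this pattern is less relevant and an explicit shutdown at the end of the session is the natural equivalent.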

5. Automatic restart on configuration changes

Regardless of the persistence mode in use, a worker always reflects the configuration with which it was most recently loaded. When a load-time parameter changes, such as device, model_checkpoint, model_name, or any other field marked reload_on_change=True, the persistence layer detects the mismatch and transparently restarts the worker with the new configuration before running the call. This is the reason persist() and get() do not take a device= argument of their own: they do not require one. The first call within the block establishes the configuration, and any later change is applied automatically.
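
The restart rule can be pictured as comparing a key built from the load-time fields only. The following is a toy model under stated assumptions, not the library's mechanism: `ToyConfig`, `load_key`, and `ToyWorkerCache` are hypothetical, the reload_on_change flag is stored here in dataclass field metadata purely for illustration, and `num_recycles` is an invented per-call setting.

```python
from dataclasses import dataclass, field, fields


@dataclass(frozen=True)
class ToyConfig:
    # Load-time fields: changing either one requires a fresh worker.
    device: str = field(default="cpu", metadata={"reload_on_change": True})
    model_name: str = field(default="toy", metadata={"reload_on_change": True})
    # Hypothetical per-call setting: changing it does not require a reload.
    num_recycles: int = 4


def load_key(config):
    """Collect only the fields whose change requires a reload."""
    return tuple(
        (f.name, getattr(config, f.name))
        for f in fields(config)
        if f.metadata.get("reload_on_change", False)
    )


class ToyWorkerCache:
    """Tracks the load-time key of the resident worker and counts reloads."""

    def __init__(self):
        self.key = None
        self.loads = 0

    def run(self, config):
        key = load_key(config)
        if key != self.key:   # first call, or a load-time field changed
            self.loads += 1   # simulate restarting the worker
            self.key = key
        return self.loads


cache = ToyWorkerCache()
cache.run(ToyConfig(device="cuda:0"))                  # first call: load
cache.run(ToyConfig(device="cuda:0", num_recycles=8))  # same key: reuse
cache.run(ToyConfig(device="cuda:1"))                  # device changed: restart
print(cache.loads)
```

One practical consequence of this rule is that alternating between two devices on a single worker triggers a restart on every switch; that is the situation named instances are designed for.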

6. Timeout

Every tool call enforces a timeout, which defaults to 600 seconds (ten minutes). When a single call exceeds the timeout, its subprocess is terminated and a TimeoutError is raised so that the remainder of the program can determine how to proceed.
python
# Set a 120-second timeout
output = run_esmfold(
    ESMFoldInput(complexes=["MKTLLILAVVAAALA"]),
    ESMFoldConfig(device="cuda", timeout=120),
)
For persistent workers, the timeout applies per call. A slow call that exceeds the timeout does not tear down the worker; subsequent calls continue to run against the loaded model as though the timeout had not occurred.
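
In a batch loop, a timed-out call is often something to record and skip rather than a reason to abort. The sketch below shows that pattern with a stand-in function so it runs without a GPU. It assumes the raised exception is Python's built-in TimeoutError; `run_with_skips` and `fake_run` are hypothetical names.

```python
def run_with_skips(items, run_one):
    """Run each item, recording the ones that raise TimeoutError."""
    results, timed_out = {}, []
    for item in items:
        try:
            results[item] = run_one(item)
        except TimeoutError:
            timed_out.append(item)  # note the failure and keep going
    return results, timed_out


def fake_run(seq):
    """Stand-in for a tool call that 'times out' on longer sequences."""
    if len(seq) > 10:
        raise TimeoutError(f"call exceeded timeout for {seq!r}")
    return len(seq)


results, timed_out = run_with_skips(
    ["MKT", "MKTLLILAVVAAALA", "GAVL"], fake_run
)
print(results, timed_out)
```

With a real tool, `run_one` would wrap the run_* call, and the sequences collected in `timed_out` could be retried later with a larger timeout value.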

Go deeper

For the implementation details behind this guide, consult the developer notes in the proto-tools repository:

Tool Persistence Reference: worker cache hierarchy, dispatch paths, config-change restarts, persistent workers, and tool pools.

Next Steps

Device Management

Control which GPU each tool call runs on.

Parallel Execution

Run many tool calls concurrently across workers.