Workflow Timeout Management#

Workflow Timeout Management is a deadline-based cancellation mechanism in DBOS Transact Python that ensures workflow executions do not exceed specified time limits. The system implements timeout enforcement through background threads that monitor workflow deadlines, event-based signaling for graceful shutdown, and durable tracking of timeout configurations in the system database. This mechanism protects against runaway workflows, enforces service-level agreements, and provides predictable execution behavior across distributed systems.

The timeout management system operates at workflow initialization time, where timeouts are converted to absolute deadlines and stored in the workflow status. When a workflow has an associated deadline, the system spawns a dedicated background thread that waits until the deadline expires, then triggers workflow cancellation through the system database. The persistence of timeout metadata enables consistent enforcement across workflow recovery, process restarts, and distributed execution scenarios.

Timeout management integrates with DBOS's broader workflow lifecycle, including queue management, recovery mechanisms, and parent-child workflow relationships. The system supports both relative timeout specifications (duration from workflow start) and absolute deadline timestamps, and it propagates deadlines to child workflows. A workflow that exceeds its timeout is cancelled, and a caller awaiting its result receives DBOSAwaitedWorkflowCancelledError. Cancelled workflows may be subject to recovery attempts based on configured retry policies.

Architecture and Implementation#

Background Thread Mechanism#

The timeout enforcement mechanism begins during workflow initialization in the _init_workflow function. When a workflow should execute (as determined by should_execute) and has a non-null workflow_deadline_epoch_ms, the system creates a background thread to monitor the deadline:

if should_execute and workflow_deadline_epoch_ms is not None:
    evt = threading.Event()
    dbos.background_thread_stop_events.append(evt)

    def timeout_func() -> None:
        try:
            assert workflow_deadline_epoch_ms is not None
            time_to_wait_sec = (
                workflow_deadline_epoch_ms - (time.time() * 1000)
            ) / 1000
            if time_to_wait_sec > 0:
                was_stopped = evt.wait(time_to_wait_sec)
                if was_stopped:
                    return
            dbos._sys_db.cancel_workflows([wfid])
        except Exception as e:
            dbos.logger.warning(
                f"Exception in timeout thread for workflow {wfid}: {e}"
            )

    timeout_thread = threading.Thread(target=timeout_func, daemon=True)
    timeout_thread.start()
    dbos._background_threads.append(timeout_thread)

The thread uses a threading.Event() for graceful shutdown signaling. This event-based approach allows the DBOS runtime to signal timeout threads to terminate when the application shuts down, preventing resource leaks. The thread calculates the remaining time until the deadline, waits for that duration (or until signaled to stop), and then invokes workflow cancellation.

Each timeout thread is registered with the DBOS instance in two collections: background_thread_stop_events for signaling and _background_threads for thread lifecycle management. The daemon flag ensures that timeout threads do not prevent the Python interpreter from exiting.
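The wait-or-stop pattern described above can be reproduced outside DBOS with only the standard library. The following is a minimal sketch, not the DBOS implementation: start_deadline_thread and the on_deadline callback are hypothetical names, and the callback stands in for the call to cancel_workflows.

```python
import threading
import time
from typing import Callable, List


def start_deadline_thread(
    deadline_epoch_ms: float,
    on_deadline: Callable[[], None],
    stop_event: threading.Event,
) -> threading.Thread:
    """Run on_deadline once the deadline passes, unless stop_event is set first."""

    def wait_then_fire() -> None:
        remaining_sec = (deadline_epoch_ms - time.time() * 1000) / 1000
        # Event.wait returns True if the event was set before the timeout elapsed.
        if remaining_sec > 0 and stop_event.wait(remaining_sec):
            return  # shutdown was requested, so skip the callback
        on_deadline()

    thread = threading.Thread(target=wait_then_fire, daemon=True)
    thread.start()
    return thread


fired: List[str] = []

# Case 1: the deadline passes, so the callback runs.
t1 = start_deadline_thread(
    time.time() * 1000 + 50, lambda: fired.append("expired"), threading.Event()
)
t1.join(timeout=2)

# Case 2: shutdown is signalled long before the deadline, so the callback is skipped.
stop = threading.Event()
t2 = start_deadline_thread(
    time.time() * 1000 + 60_000, lambda: fired.append("never"), stop
)
stop.set()
t2.join(timeout=2)
```

Because the thread blocks in Event.wait rather than time.sleep, a shutdown signal wakes it immediately instead of leaving it parked until the deadline.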

Durable Timeout Tracking#

Timeout configurations are stored persistently in the workflow_status table using two optional fields:

workflow_timeout_ms: A BigInteger field storing the relative timeout duration in milliseconds. This represents the maximum time allowed from workflow start to completion.
workflow_deadline_epoch_ms: A BigInteger field storing the absolute deadline as milliseconds since the Unix epoch. This is computed by adding the timeout to the workflow start time, or propagated from a parent workflow.

Both fields are nullable, indicating that timeouts are optional workflow configuration parameters. The distinction between relative timeouts and absolute deadlines serves different use cases: timeouts express "this should complete within X seconds," while deadlines express "this must complete by timestamp Y." Internally, the system converts timeouts to deadlines for consistent enforcement: at workflow initialization time for workflows that start immediately, and at dequeue time for queued workflows.

The persistence of these fields enables timeout enforcement across workflow recovery scenarios. If a workflow fails and is later recovered, the system can check whether the deadline has already expired before attempting to re-execute the workflow. Deadlines also propagate when workflows are forked, so a workflow restarted from a specific step inherits the original workflow's time constraints.
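The relationship between the two stored fields, and the expiry check a recovery path could perform, comes down to simple arithmetic. The sketch below is illustrative only; deadline_from_timeout and deadline_expired are hypothetical helpers, not DBOS functions.

```python
from typing import Optional


def deadline_from_timeout(
    start_epoch_ms: int, timeout_ms: Optional[int]
) -> Optional[int]:
    """Convert a relative timeout into an absolute deadline (both in milliseconds)."""
    if timeout_ms is None:
        return None  # no timeout configured, so no deadline
    return start_epoch_ms + timeout_ms


def deadline_expired(deadline_epoch_ms: Optional[int], now_epoch_ms: int) -> bool:
    """Return True only when a deadline exists and has been reached."""
    return deadline_epoch_ms is not None and now_epoch_ms >= deadline_epoch_ms


# A workflow started at t = 1_700_000_000_000 ms with a 5-second timeout.
deadline = deadline_from_timeout(1_700_000_000_000, 5_000)
```

Because the deadline is an absolute timestamp, the check gives the same answer on any process that reads the row, which is what makes enforcement survive restarts.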

When a workflow is forked, the system marks the original workflow with was_forked_from=True in the workflow_status table to track the forking relationship. This enables querying and filtering workflows that have been forked from, distinguishing them from workflows that are the result of forking operations.

Cancellation Process#

When a timeout expires, the cancel_workflows method in the system database layer performs the cancellation. This method is designed as a batch operation that can cancel multiple workflows in a single database transaction:

def cancel_workflows(
    self,
    workflow_ids: list[str],
) -> None:
    with self.engine.begin() as c:
        # Set the workflows' status to CANCELLED and remove them from any queue,
        # but only if the workflow is not already complete.
        c.execute(
            sa.update(SystemSchema.workflow_status)
            .where(SystemSchema.workflow_status.c.workflow_uuid.in_(workflow_ids))
            .where(
                SystemSchema.workflow_status.c.status.notin_(
                    [
                        WorkflowStatusString.SUCCESS.value,
                        WorkflowStatusString.ERROR.value,
                    ]
                )
            )
            .values(
                status=WorkflowStatusString.CANCELLED.value,
                queue_name=None,
                deduplication_id=None,
                started_at_epoch_ms=None,
                updated_at=func.extract("epoch", func.now()) * 1000,
            )
        )

The cancellation operation has several important characteristics:

Idempotent: The method only updates workflows that are not already in a terminal state (SUCCESS or ERROR), preventing race conditions where a workflow completes just as its timeout expires. This includes cancelling workflows in DELAYED, ENQUEUED, and PENDING states.
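Atomic: The database transaction ensures that all status updates occur atomically. Either every eligible workflow in the batch is cancelled, or none are. Workflows already in SUCCESS or ERROR are excluded by the status filter and left unchanged.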
Queue-aware: The operation clears queue-related metadata (queue_name, deduplication_id, started_at_epoch_ms), ensuring cancelled workflows are removed from queue processing. This applies to both ENQUEUED and DELAYED workflows, with DELAYED workflows being cancelled before they ever execute.
Timestamped: The updated_at field is set to the current time, providing an audit trail of when the cancellation occurred.
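The effect of the status filter can be tried out with a small stand-in. This sketch swaps in Python's built-in sqlite3 module for SQLAlchemy and PostgreSQL and uses a reduced three-column table, so the schema and the local cancel_workflows function are assumptions made for illustration, not the DBOS code.

```python
import sqlite3
from typing import Dict, List

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE workflow_status ("
    "workflow_uuid TEXT PRIMARY KEY, status TEXT, queue_name TEXT)"
)
conn.executemany(
    "INSERT INTO workflow_status VALUES (?, ?, ?)",
    [
        ("wf-pending", "PENDING", None),
        ("wf-enqueued", "ENQUEUED", "default_queue"),
        ("wf-done", "SUCCESS", None),
    ],
)


def cancel_workflows(workflow_ids: List[str]) -> None:
    """Cancel the given workflows unless they have already finished."""
    if not workflow_ids:
        return
    placeholders = ", ".join("?" for _ in workflow_ids)
    with conn:  # one transaction: commit on success, roll back on error
        conn.execute(
            "UPDATE workflow_status "
            "SET status = 'CANCELLED', queue_name = NULL "
            f"WHERE workflow_uuid IN ({placeholders}) "
            "AND status NOT IN ('SUCCESS', 'ERROR')",
            workflow_ids,
        )


def snapshot() -> Dict[str, str]:
    """Return the current status of every workflow, keyed by ID."""
    return dict(conn.execute("SELECT workflow_uuid, status FROM workflow_status"))


ids = ["wf-pending", "wf-enqueued", "wf-done"]
cancel_workflows(ids)
after_first = snapshot()
cancel_workflows(ids)  # repeating the call leaves the same final state
after_second = snapshot()
```

The completed workflow keeps its SUCCESS status even though its ID was in the batch, and the enqueued workflow loses its queue assignment in the same statement that marks it cancelled.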

After cancellation, the workflow's execution thread detects the status change and raises a DBOSWorkflowCancelledError. If the workflow is being awaited by a caller, this exception is converted to DBOSAwaitedWorkflowCancelledError to distinguish between direct workflow invocation and handle-based result retrieval.

Timeout Calculation and Propagation#

The _get_timeout_deadline function determines how timeouts and deadlines are computed for new workflows:

def _get_timeout_deadline(
    ctx: Optional[DBOSContext], queue: Optional[str]
) -> tuple[Optional[int], Optional[int]]:
    if ctx is None:
        return None, None
    # If a timeout is explicitly specified, use it over any propagated deadline
    if ctx.workflow_timeout_ms:
        if queue:
            # Queued workflows are assigned a deadline on dequeue
            return ctx.workflow_timeout_ms, None
        else:
            # Otherwise, compute the deadline immediately
            return (
                ctx.workflow_timeout_ms,
                int(time.time() * 1000) + ctx.workflow_timeout_ms,
            )
    # Otherwise, return the propagated deadline, if any
    else:
        return None, ctx.workflow_deadline_epoch_ms

The function distinguishes three cases, with an explicit timeout always taking precedence over a propagated deadline:

Explicit timeouts for immediate execution: When a workflow has an explicit workflow_timeout_ms and is not queued, the system computes an absolute deadline by adding the timeout to the current time. This ensures immediate workflows start counting down from their actual start time.
Explicit timeouts for queued workflows: When a workflow with an explicit timeout is enqueued (including those in DELAYED status), the function returns no deadline, and the deadline is computed only when the workflow is dequeued and begins execution. This prevents queued workflows from consuming their timeout budget while waiting in the queue. For DELAYED workflows, the countdown likewise begins only after the workflow transitions from DELAYED to ENQUEUED and is then dequeued, not during the delay interval.
Propagated deadlines: When no explicit timeout is set, the function returns the workflow_deadline_epoch_ms from the current context. This propagates parent workflow deadlines to child workflows, ensuring that child operations respect the parent's time constraints.

This propagation behavior enables deadline inheritance in workflow hierarchies. A parent workflow with a 10-second timeout can invoke child workflows, and those children automatically inherit the parent's absolute deadline. A child started 4 seconds into the parent's execution therefore has roughly 6 seconds left. This prevents child workflows from exceeding the parent's overall time budget.
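The three cases can be exercised with a standalone copy of the same logic. In this sketch, Ctx is a hypothetical stand-in for the two DBOSContext fields the function reads, and the current time is passed in as now_ms so the results are deterministic. It also assumes a child's context carries the parent's deadline and no explicit timeout.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Ctx:
    """Hypothetical stand-in for the two context fields used by the logic."""

    workflow_timeout_ms: Optional[int] = None
    workflow_deadline_epoch_ms: Optional[int] = None


def resolve(
    ctx: Ctx, queue: Optional[str], now_ms: int
) -> Tuple[Optional[int], Optional[int]]:
    """Return (timeout_ms, deadline_epoch_ms) following the rules described above."""
    if ctx.workflow_timeout_ms:
        if queue:
            # Queued: keep the timeout, defer the deadline until dequeue.
            return ctx.workflow_timeout_ms, None
        # Immediate: the deadline starts counting from now.
        return ctx.workflow_timeout_ms, now_ms + ctx.workflow_timeout_ms
    # No explicit timeout: pass through any inherited deadline.
    return None, ctx.workflow_deadline_epoch_ms


NOW = 1_000_000  # arbitrary start time in epoch milliseconds

# Parent started directly with a 10-second timeout.
parent = resolve(Ctx(workflow_timeout_ms=10_000), None, NOW)

# Child started 4 seconds later with no explicit timeout inherits the deadline.
child = resolve(Ctx(workflow_deadline_epoch_ms=parent[1]), None, NOW + 4_000)
child_remaining_ms = child[1] - (NOW + 4_000)

# The same timeout on a queued workflow yields no deadline yet.
queued = resolve(Ctx(workflow_timeout_ms=10_000), "default_queue", NOW)
```

Note that the truthiness test on workflow_timeout_ms means a timeout of 0 is treated the same as no timeout, so the function falls through to the propagated deadline in that case.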

Usage and Configuration#

Setting Timeouts via Python APIs#

Timeouts can be configured using the SetWorkflowTimeout context manager, which sets the timeout for all workflow invocations within its scope:

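from dbos import SetWorkflowTimeout
from dbos._error import DBOSAwaitedWorkflowCancelledError
import pytest

# blocked_workflow is assumed to be a @DBOS.workflow() that runs longer than the timeout
with SetWorkflowTimeout(0.1):  # 100 ms timeout
    with pytest.raises(DBOSAwaitedWorkflowCancelledError):
        blocked_workflow()  # Exceeds the timeout, so the call raises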

The context manager accepts timeout values in seconds as a float parameter and internally converts them to milliseconds for storage. It works with direct workflow invocation, start_workflow, start_workflow_async, and queued workflows through queue.enqueue.

Multiple SetWorkflowTimeout context managers can be nested, with inner timeouts overriding outer ones. Passing None as the timeout value clears any previously set timeout.
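The nesting and clearing semantics can be pictured with a small stand-in built on contextvars. This is not how SetWorkflowTimeout is implemented; set_timeout and the _timeout_ms variable are hypothetical names used only to show the scoping behavior described above.

```python
import contextvars
from contextlib import contextmanager
from typing import Iterator, List, Optional

# Holds the timeout (in milliseconds) that applies to the current scope.
_timeout_ms = contextvars.ContextVar("timeout_ms", default=None)


@contextmanager
def set_timeout(seconds: Optional[float]) -> Iterator[None]:
    """Set the timeout for the enclosed block and restore the outer value after."""
    value = None if seconds is None else int(seconds * 1000)
    token = _timeout_ms.set(value)
    try:
        yield
    finally:
        _timeout_ms.reset(token)


observed: List[Optional[int]] = []
with set_timeout(10.0):
    observed.append(_timeout_ms.get())      # outer timeout applies
    with set_timeout(0.5):
        observed.append(_timeout_ms.get())  # inner timeout overrides it
    with set_timeout(None):
        observed.append(_timeout_ms.get())  # None clears the timeout
    observed.append(_timeout_ms.get())      # outer timeout is restored
observed.append(_timeout_ms.get())          # nothing set outside any block
```

The value is converted from seconds to integer milliseconds on entry, mirroring the seconds-in, milliseconds-stored convention described for the real context manager.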

Accessing Timeout Information in Workflows#

Workflows can query their timeout configuration through the workflow context:

from dbos import DBOS
from dbos._context import assert_current_dbos_context

@DBOS.workflow()
def my_workflow():
    # Read the timeout settings attached to the currently running workflow
    ctx = assert_current_dbos_context()
    timeout_ms = ctx.workflow_timeout_ms  # int or None
    deadline_ms = ctx.workflow_deadline_epoch_ms  # int or None

This allows workflows to implement custom timeout-aware behavior, such as checking remaining time before starting expensive operations or adjusting batch sizes based on time constraints.
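As one way to act on these values, a workflow could turn the deadline into a remaining-time budget and size its work accordingly. The helpers below, remaining_seconds and pick_batch_size, are hypothetical and not part of DBOS. They take the deadline as a plain integer, such as the value read from ctx.workflow_deadline_epoch_ms.

```python
import time
from typing import Optional


def remaining_seconds(
    deadline_epoch_ms: Optional[int], now_epoch_ms: Optional[int] = None
) -> Optional[float]:
    """Seconds left before the deadline, or None when no deadline is set."""
    if deadline_epoch_ms is None:
        return None
    if now_epoch_ms is None:
        now_epoch_ms = int(time.time() * 1000)
    # Clamp at zero so an expired deadline never yields a negative budget.
    return max(0.0, (deadline_epoch_ms - now_epoch_ms) / 1000)


def pick_batch_size(
    remaining: Optional[float], sec_per_item: float, max_batch: int
) -> int:
    """Choose how many items fit in the remaining time, capped at max_batch."""
    if remaining is None:
        return max_batch  # no deadline, so use the full batch
    return max(0, min(max_batch, int(remaining / sec_per_item)))


# With 6 seconds left and 0.5 seconds per item, 12 items fit.
budget = remaining_seconds(10_000, now_epoch_ms=4_000)
batch = pick_batch_size(budget, sec_per_item=0.5, max_batch=100)
```

Passing now_epoch_ms explicitly keeps the calculation testable; inside a workflow it can be omitted so that the current wall-clock time is used.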

Async Workflow Timeouts#

Async workflows support the same timeout configuration:

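from dbos import DBOS, SetWorkflowTimeout

async def run_with_timeout():
    with SetWorkflowTimeout(0.1):  # 100 ms timeout
        handle = await DBOS.start_workflow_async(blocked_workflow)
        await handle.get_result()  # Raises DBOSAwaitedWorkflowCancelledError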

The timeout mechanism works identically for both synchronous and asynchronous workflows, with the background timeout thread operating independently of the async event loop.
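To see why a timer thread and an event loop do not interfere with each other, consider the following standalone sketch. It is a simplification under stated assumptions: WorkflowCancelled is a hypothetical exception, and the in-memory cancelled flag stands in for the status change that DBOS records in the system database.

```python
import asyncio
import threading


class WorkflowCancelled(Exception):
    """Hypothetical stand-in for a cancellation error."""


async def blocked_coroutine(cancelled: threading.Event) -> None:
    # Yield to the event loop between checks of the flag the timer thread sets.
    while not cancelled.is_set():
        await asyncio.sleep(0.01)
    raise WorkflowCancelled("deadline passed")


async def run_with_timeout(timeout_sec: float) -> str:
    cancelled = threading.Event()
    stop = threading.Event()

    def timeout_func() -> None:
        # Runs on its own OS thread, so waiting here never blocks the event loop.
        if not stop.wait(timeout_sec):
            cancelled.set()

    threading.Thread(target=timeout_func, daemon=True).start()
    try:
        await blocked_coroutine(cancelled)
        return "completed"
    except WorkflowCancelled:
        return "cancelled"
    finally:
        stop.set()  # release the timer thread if it is still waiting


result = asyncio.run(run_with_timeout(0.05))
```

The coroutine keeps yielding control while the timer counts down on a separate thread, so the same deadline logic serves synchronous and asynchronous code paths.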

PostgreSQL Stored Function Configuration#

For workflows enqueued through PostgreSQL, the enqueue_workflow stored function accepts optional timeout_ms and deadline_epoch_ms parameters:

SELECT enqueue_workflow(
    workflow_name := 'my.workflow',
    queue_name := 'default_queue',
    timeout_ms := 5000, -- 5 second timeout
    deadline_epoch_ms := NULL -- Or provide absolute deadline
);

This enables timeout configuration from database-driven workflow orchestration systems.

Timeout Expiration Behavior#

When a workflow exceeds its timeout:

Workflow Interruption: The background timeout thread invokes cancel_workflows, setting the workflow status to CANCELLED. This applies to workflows in any active state, including DELAYED, ENQUEUED, and PENDING. DELAYED workflows that are cancelled will never execute.
Exception Propagation: The workflow execution detects the cancellation and raises DBOSWorkflowCancelledError. A caller awaiting the workflow's result receives DBOSAwaitedWorkflowCancelledError instead.
Recovery Consideration: Cancelled workflows may be subject to recovery attempts based on the max_recovery_attempts configuration. If recovery attempts are exhausted, the status transitions to MAX_RECOVERY_ATTEMPTS_EXCEEDED, effectively placing the workflow in a dead letter queue state.
Queue Cleanup: Cancelled workflows are removed from any queues they were part of, with queue-related metadata cleared. DELAYED workflows exist in the queue but are not yet eligible for dequeue; when cancelled, they are cleaned up from the queue without ever consuming executor resources.

Relevant Code Files#

File	Purpose	Key Lines
dbos/_core.py	Core timeout mechanism implementation	452-474 (timeout thread), 1937-1956 (deadline calculation)
dbos/_sys_db.py	Workflow cancellation and database operations	703-728 (cancel_workflows)
dbos/_schemas/system_database.py	Database schema definitions	77-78 (timeout fields), 45-96 (workflow_status table)
dbos/_context.py	SetWorkflowTimeout context manager	453-502
tests/test_dbos.py	Synchronous workflow timeout tests	1788-1877
tests/test_async.py	Async workflow timeout tests	470-519

Workflow Recovery: Timeout management integrates with DBOS's workflow recovery system, with cancelled workflows eligible for recovery based on max_recovery_attempts configuration.
Queue Management: Queued workflows defer deadline calculation until dequeue time, ensuring fair timeout enforcement regardless of queue wait time. Workflows can be enqueued with a delay using the delay_seconds parameter, starting in DELAYED status until the delay expires, then transitioning to ENQUEUED status for normal processing.
Workflow Lifecycle: Timeouts interact with every stage of the workflow lifecycle: initialization, execution, completion, and recovery. The lifecycle includes the DELAYED → ENQUEUED transition for workflows enqueued with a delay.
Parent-Child Workflow Relationships: Deadlines propagate from parent workflows to children, enforcing hierarchical time constraints across distributed workflow graphs.