Queue Polling Backoff Strategy
Type: Topic
Status: Published
Created: Apr 7, 2026
Updated: Apr 27, 2026
Created by: Dosu Bot
Updated by: Dosu Bot

Queue Polling Backoff Strategy#

The queue polling backoff strategy is a dynamic, contention-aware interval management mechanism used in the DBOS queue system. Each queue worker thread continuously polls a PostgreSQL database for newly enqueued workflows, and the strategy automatically adjusts how frequently that poll occurs based on observed database contention. When contention is detected (specifically, when the database returns serialization or lock-acquisition errors), the worker immediately doubles its polling interval, reducing load on the database. As subsequent polls succeed, the interval gradually decays back toward the configured minimum, restoring responsiveness.

The strategy is designed to handle the inherent contention that arises when multiple DBOS worker processes independently poll the same queue database table. In DBOS's distributed architecture, all connected processes act as queue workers and periodically poll the database for new workflows to execute without any centralized coordination. This decentralized design improves horizontal scalability but concentrates concurrent write pressure on shared rows, making adaptive backoff necessary to prevent polling storms. Jitter is applied to every sleep interval to further stagger concurrent workers and prevent synchronized polling spikes.

This polling algorithm applies to both in-memory queues (now deprecated) and database-backed queues. Database-backed queues reload their configuration from the database at the start of each polling iteration, enabling dynamic reconfiguration without restart. In-memory queues use configuration that remains constant after initialization.


Backoff Algorithm#

The backoff algorithm operates on three interval values (the live interval plus a lower and an upper bound) and two update rules applied per polling iteration.

Interval Bounds#

The worker thread initializes three values at startup:

| Variable | Value | Description |
| --- | --- | --- |
| polling_interval | queue.polling_interval_sec | Current (live) polling interval; starts at the configured base |
| min_polling_interval | queue.polling_interval_sec | Lower bound; the interval never falls below the configured base. For database-backed queues, this value is reloaded from the database each loop iteration. For in-memory queues, it remains constant. |
| max_polling_interval | max(queue.polling_interval_sec, 120.0) | Upper bound; caps backoff at 120 seconds unless the configured base is higher. For database-backed queues, this value is reloaded from the database each loop iteration. For in-memory queues, it remains constant. |

Update Rules#

Two interval update rules execute on every iteration of the worker loop:

On contention detection (applied immediately, before sleep): the current interval doubles, clamped to the maximum:

polling_interval = min(max_polling_interval, polling_interval * 2.0)

After every iteration (applied unconditionally, whether the poll succeeded or failed): the interval decays by 10%, clamped to the minimum:

polling_interval = max(min_polling_interval, polling_interval * 0.9)

This combination produces an asymmetric response: contention causes an immediate doubling, while recovery is gradual and takes multiple iterations to return to the base interval. For example, starting from a 1-second base interval, a single contention event raises the interval to 2 seconds; it then takes 7 successive 0.9× decay steps to return to 1 second (2 × 0.9^7 ≈ 0.96, which is clamped up to the 1-second minimum).
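The two update rules can be checked with a few lines of Python. This is a minimal sketch, not the DBOS source: the function names on_contention, decay, and decays_until_floor are hypothetical, and only the two formulas themselves come from the text above.

```python
def on_contention(interval: float, max_interval: float) -> float:
    """Doubling rule: applied only when contention is detected."""
    return min(max_interval, interval * 2.0)


def decay(interval: float, min_interval: float) -> float:
    """Decay rule: shrink by 10%, never dropping below the minimum."""
    return max(min_interval, interval * 0.9)


def decays_until_floor(interval: float, min_interval: float) -> int:
    """Count how many decay steps it takes to get back to the minimum."""
    steps = 0
    while interval > min_interval:
        interval = decay(interval, min_interval)
        steps += 1
    return steps


base = 1.0
max_interval = max(base, 120.0)

after_contention = on_contention(base, max_interval)
recovery_steps = decays_until_floor(after_contention, base)
print(after_contention, recovery_steps)
```

Doubling is a single step, while undoing it takes several decay steps, which is the asymmetry described above.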

Jitter#

The actual sleep duration on each iteration is not exactly polling_interval but a uniformly sampled value in the range [0.95 × polling_interval, 1.05 × polling_interval]:

stop_event.wait(timeout=polling_interval * random.uniform(0.95, 1.05))

This ±5% jitter prevents multiple worker threads or processes that started with similar intervals from polling in lockstep, mitigating the thundering-herd problem.
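The following sketch isolates the jitter calculation so its range can be inspected. The helper name jittered_timeout and the seeded random.Random instance are assumptions made for illustration; the multiplier range and the use of an event wait are taken from the line above.

```python
import random
import threading


def jittered_timeout(polling_interval: float, rng: random.Random) -> float:
    """Scale the interval by a factor drawn uniformly from [0.95, 1.05]."""
    return polling_interval * rng.uniform(0.95, 1.05)


rng = random.Random(0)  # seeded only to make the demonstration repeatable
samples = [jittered_timeout(2.0, rng) for _ in range(1000)]
print(min(samples), max(samples))

# Event.wait returns True as soon as the event is set, so a worker
# sleeping this way wakes immediately when asked to stop.
stop_event = threading.Event()
stop_event.set()
stopped = stop_event.wait(timeout=jittered_timeout(2.0, rng))
```

Because the jitter is proportional, the spread between workers widens as the interval backs off, which helps most when contention is already high.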

Interval Progression Example#

Starting from polling_interval_sec = 1.0:

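| Event | Interval After Update |
| --- | --- |
| Initial | 1.00 s |
| Contention detected | 2.00 s |
| Iteration 1 (clean) | 1.80 s |
| Iteration 2 (clean) | 1.62 s |
| Iteration 3 (clean) | 1.46 s |
| Contention again | 2.92 s (doubled from 1.46 s; far below the 120 s cap) |
| … (11 clean iterations) | 1.00 s (clamped at the minimum) |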

Contention Detection#

The backoff is triggered exclusively by two specific PostgreSQL error conditions, surfaced through the SQLAlchemy/psycopg error hierarchy:

| Error Class | psycopg Type | Cause |
| --- | --- | --- |
| Serialization failure | errors.SerializationFailure | REPEATABLE READ snapshot conflict between concurrent transactions |
| Lock not available | errors.LockNotAvailable | FOR UPDATE NOWAIT failed because another session holds the row lock |

Both errors are wrapped by SQLAlchemy in an OperationalError. The worker catches OperationalError, inspects the underlying psycopg error, and only triggers backoff for these two specific subtypes. All other OperationalError subtypes and general exceptions are logged as warnings without modifying the polling interval.

When backoff triggers, a warning is logged that includes the queue name and the new polling interval value:

Contention detected in queue thread for <queue_name>. Increasing polling interval to <new_interval>.
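The classification step might look roughly like the sketch below. To keep it runnable without a database driver, the three exception classes are local stand-ins for sqlalchemy.exc.OperationalError, psycopg.errors.SerializationFailure, and psycopg.errors.LockNotAvailable; the helper is_contention is hypothetical. It assumes the wrapped driver error is reachable through the orig attribute, as it is on SQLAlchemy's DBAPI exception wrappers.

```python
class SerializationFailure(Exception):
    """Stand-in for psycopg.errors.SerializationFailure."""


class LockNotAvailable(Exception):
    """Stand-in for psycopg.errors.LockNotAvailable."""


class OperationalError(Exception):
    """Stand-in for sqlalchemy.exc.OperationalError (driver error in .orig)."""

    def __init__(self, message: str, orig: BaseException):
        super().__init__(message)
        self.orig = orig


CONTENTION_ERRORS = (SerializationFailure, LockNotAvailable)


def is_contention(exc: BaseException) -> bool:
    """True only for an OperationalError wrapping one of the two contention errors."""
    return isinstance(exc, OperationalError) and isinstance(exc.orig, CONTENTION_ERRORS)


lock_error = OperationalError("lock not available", LockNotAvailable())
other_error = OperationalError("connection reset", ConnectionError())
print(is_contention(lock_error), is_contention(other_error))
```

Anything that does not match would be logged and leave the polling interval unchanged.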

Sources of Database Contention#

Contention arises inside start_queued_workflows() in dbos/_sys_db.py, the method called by each worker to dequeue and claim workflows. Three architectural choices within this method are the primary contention sources:

1. REPEATABLE READ Transaction Isolation#

For PostgreSQL, start_queued_workflows() opens a transaction at REPEATABLE READ isolation level only when a global concurrency limit (queue.concurrency) or a rate limiter (queue.limiter) is configured:

if self.engine.dialect.name == "postgresql" and (
    queue.concurrency is not None or queue.limiter is not None
):
    c.execute(sa.text("SET TRANSACTION ISOLATION LEVEL REPEATABLE READ"))

This ensures that concurrent workers see consistent snapshots when checking rate limits and global concurrency counts. For queues with only worker_concurrency set, the default READ COMMITTED isolation level is used, reducing the risk of serialization failures. When REPEATABLE READ is active, a serialization failure occurs when a worker attempts to update or lock a row that another concurrent transaction modified and committed after the worker's snapshot was taken.

2. Row-Level Locking: SKIP LOCKED vs. NOWAIT#

The workflow-selection query uses FOR UPDATE locking, but the mode depends on whether a global concurrency limit is configured:

| Condition | Lock Mode | Behavior |
| --- | --- | --- |
| queue.concurrency is None | FOR UPDATE SKIP LOCKED | Workers skip rows locked by others; no contention errors |
| queue.concurrency is set | FOR UPDATE NOWAIT | Any locked row immediately raises LockNotAvailable |

When NOWAIT is in use, any two workers attempting to claim the same rows simultaneously will cause at least one of them to receive a LockNotAvailable error and back off.
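The isolation-level and lock-mode decisions from sections 1 and 2 can be summarized in one small function. This is an illustrative sketch for the PostgreSQL case only: DequeuePlan and plan_dequeue are hypothetical names, and the real method builds its queries differently.

```python
from typing import NamedTuple, Optional


class DequeuePlan(NamedTuple):
    isolation_level: str
    lock_clause: str


def plan_dequeue(concurrency: Optional[int], limiter: Optional[object]) -> DequeuePlan:
    """Pick isolation level and row-lock clause from the queue's limits."""
    # REPEATABLE READ only when a global limit or a rate limiter is configured.
    needs_snapshot = concurrency is not None or limiter is not None
    isolation = "REPEATABLE READ" if needs_snapshot else "READ COMMITTED"
    # NOWAIT only when a global concurrency limit is configured.
    lock = "FOR UPDATE NOWAIT" if concurrency is not None else "FOR UPDATE SKIP LOCKED"
    return DequeuePlan(isolation, lock)


print(plan_dequeue(None, None))
print(plan_dequeue(10, None))
print(plan_dequeue(None, object()))
```

Note the third case: a rate-limited queue without a global concurrency limit runs at REPEATABLE READ but still uses SKIP LOCKED, so it can hit serialization failures but not LockNotAvailable.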

3. Concurrent Read-Write Operations#

The method executes multiple SELECT queries (for rate-limit counting and, when global concurrency is configured, global concurrency counting) followed by per-row UPDATE operations, all within a single transaction. Worker concurrency (worker_concurrency) is now enforced using an in-memory map (ActiveWorkflowById.count_for_queue()) rather than a database query, eliminating that source of read-write overlap. The duration and read-write overlap of the transaction increase with queue depth and the number of configured limits, widening the window for snapshot conflicts between concurrent workers.

The UPDATE statement that transitions workflows from ENQUEUED to PENDING now includes a WHERE clause verifying the current status is ENQUEUED, preventing multiple workers from claiming the same workflow if they race on the same row.
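The effect of the status guard can be demonstrated with an in-memory SQLite database standing in for PostgreSQL. The table name, column names, and the try_claim helper are hypothetical, and checking the affected row count is one possible way for a caller to learn whether its claim won; the text above does not say how DBOS consumes the result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE workflow_status (workflow_id TEXT PRIMARY KEY, status TEXT)")
conn.execute("INSERT INTO workflow_status VALUES ('wf-1', 'ENQUEUED')")


def try_claim(conn: sqlite3.Connection, workflow_id: str) -> bool:
    """Move a workflow to PENDING only if it is still ENQUEUED."""
    cursor = conn.execute(
        "UPDATE workflow_status SET status = 'PENDING' "
        "WHERE workflow_id = ? AND status = 'ENQUEUED'",
        (workflow_id,),
    )
    # Zero affected rows means another claimant already changed the status.
    return cursor.rowcount == 1


first_claim = try_claim(conn, "wf-1")
second_claim = try_claim(conn, "wf-1")
print(first_claim, second_claim)
```

The second attempt matches no rows, so a late worker cannot re-claim a workflow that has already left the ENQUEUED state.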


Worker Thread Architecture#

The backoff logic is encapsulated entirely in the queue_worker_thread() function in dbos/_queue.py. A separate thread instance runs per queue, managed by the parent queue_thread() function.

Control Flow#

[Diagram: control flow of queue_worker_thread(); not rendered in this extract]
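As a rough textual substitute for the control flow, the loop can be sketched as follows. This is not the DBOS implementation: ContentionError, worker_loop, the poll callback, and max_iterations are hypothetical, and the ordering (jittered wait, then poll, then the two update rules within the same iteration) is an assumption based on the rules described earlier. Under that assumption, an iteration that hits contention ends at 1.8× its starting interval.

```python
import random
import threading


class ContentionError(Exception):
    """Hypothetical stand-in for the two PostgreSQL contention errors."""


def worker_loop(poll, stop_event, base_interval=1.0, max_iterations=None):
    """Poll until stopped; return the interval recorded after each iteration."""
    interval = base_interval
    min_interval = base_interval
    max_interval = max(base_interval, 120.0)
    history = []
    while not stop_event.is_set():
        if max_iterations is not None and len(history) >= max_iterations:
            break
        # Interruptible, jittered sleep: returns True if a stop was requested.
        if stop_event.wait(timeout=interval * random.uniform(0.95, 1.05)):
            break
        try:
            poll()
        except ContentionError:
            interval = min(max_interval, interval * 2.0)  # back off
        except Exception:
            pass  # a real worker would log a warning here
        interval = max(min_interval, interval * 0.9)  # unconditional decay
        history.append(interval)
    return history


outcomes = iter([True, False, False])  # first poll contends, next two succeed


def fake_poll():
    if next(outcomes):
        raise ContentionError()


history = worker_loop(fake_poll, threading.Event(), base_interval=0.01, max_iterations=3)
print(history)
```

The small base interval keeps the demonstration fast; the shape of the progression is the same at any scale.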

Key Implementation Details#

  • Configuration reloading (database-backed queues): At the start of each polling loop iteration, the worker reloads the queue configuration from the database by calling dbos._sys_db.get_queue(). This allows queue parameters — concurrency, worker_concurrency, limiter, priority_enabled, partition_queue, and polling_interval_sec — to be modified dynamically via the Queue.set_* methods or the client API without restarting the worker. In-memory queues do not reload configuration and use their initial values for the lifetime of the worker thread.
  • Queue deletion detection: For database-backed queues, if get_queue() returns None, the worker logs a message and exits the thread. This enables graceful worker shutdown when a database-backed queue is deleted from the queues table.
  • Graceful shutdown: The worker uses stop_event.wait(timeout=...) rather than time.sleep(), so it responds to stop signals immediately without waiting for the full polling interval to elapse.
  • Workflow execution: For each dequeued workflow ID, the thread calls execute_workflow_by_id() in a nested try/except, so a failure to launch one workflow does not abort the others or disrupt the backoff state.
  • Partitioned queues: When partition_queue=True, the thread first fetches all active partition keys and issues a separate start_queued_workflows() call per partition key. Each such call is an independent transaction and can independently produce contention errors.
  • Thread naming: Worker threads are named queue-worker-<queue_name> for observability.

Queue Manager Thread#

The parent queue_thread() function runs a single manager thread that monitors registered queues at a 1-second check interval, spawning a new queue_worker_thread for any queue whose worker thread is absent or has died. On shutdown, it joins all worker threads with a 10-second timeout per thread.
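A supervisor of this shape could be sketched as below. Everything here is illustrative: supervise, idle_worker, the queue names, and the choice to share one stop event between manager and workers are assumptions, not the DBOS code. Only the check interval, the respawn condition, the thread naming, and the 10-second join timeout come from the description above.

```python
import threading
import time


def supervise(queue_names, worker_target, workers, stop_event, check_interval=1.0):
    """Respawn missing or dead workers until stopped, then join them."""
    while not stop_event.is_set():
        for name in queue_names:
            thread = workers.get(name)
            if thread is None or not thread.is_alive():
                thread = threading.Thread(
                    target=worker_target,
                    args=(name, stop_event),
                    name=f"queue-worker-{name}",
                    daemon=True,
                )
                thread.start()
                workers[name] = thread
        stop_event.wait(timeout=check_interval)
    for thread in workers.values():
        thread.join(timeout=10.0)


def idle_worker(name, stop_event):
    """Placeholder worker that simply blocks until asked to stop."""
    stop_event.wait()


stop_event = threading.Event()
workers = {}
manager = threading.Thread(
    target=supervise,
    args=(["orders", "emails"], idle_worker, workers, stop_event, 0.01),
)
manager.start()
time.sleep(0.05)   # let the manager spawn both workers
stop_event.set()   # request shutdown
manager.join(timeout=5.0)
worker_names = sorted(t.name for t in workers.values())
print(worker_names)
```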


Configuration#

polling_interval_sec#

The polling_interval_sec parameter on the Queue constructor is the sole user-configurable input to the backoff algorithm:

Queue(
    name: str,
    concurrency: Optional[int] = None,
    limiter: Optional[QueueRateLimit] = None,
    *,
    worker_concurrency: Optional[int] = None,
    priority_enabled: bool = False,
    partition_queue: bool = False,
    polling_interval_sec: float = 1.0,
)

| Property | Detail |
| --- | --- |
| Default | 1.0 second |
| Validation | Must be strictly positive (> 0.0); raises ValueError otherwise |
| Role in backoff | Sets both the initial interval and the minimum floor to which the interval recovers |
| Effect on maximum | max_polling_interval = max(polling_interval_sec, 120.0), so values above 120 s shift the cap upward |
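The relationship between the configured value and the two bounds can be expressed in a few lines. This is a sketch under stated assumptions: interval_bounds is a hypothetical helper, and the error message is invented for illustration; only the positivity check and the max(…, 120.0) rule come from the table above.

```python
from typing import Tuple


def interval_bounds(polling_interval_sec: float = 1.0) -> Tuple[float, float]:
    """Return (min_polling_interval, max_polling_interval) for a configured base."""
    if polling_interval_sec <= 0.0:
        raise ValueError("polling_interval_sec must be strictly positive")
    return polling_interval_sec, max(polling_interval_sec, 120.0)


default_bounds = interval_bounds()        # base below the 120 s cap
slow_bounds = interval_bounds(300.0)      # base above 120 s lifts the cap
print(default_bounds, slow_bounds)
```

With a base above 120 seconds the minimum and maximum coincide, so the interval cannot move and backoff has no effect.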

Parameters That Amplify Contention#

Certain other Queue parameters increase the likelihood that contention errors will be encountered, indirectly making the backoff more aggressive:

| Parameter | Effect on Contention |
| --- | --- |
| concurrency (global limit) | Switches locking to FOR UPDATE NOWAIT, increasing LockNotAvailable errors under concurrent workers; requires REPEATABLE READ isolation |
| worker_concurrency | Enforced using an in-memory map; does not add database queries or widen transaction conflict windows |
| limiter | Adds a rate-count query to each transaction, widening the conflict window; requires REPEATABLE READ isolation |

The queue polling backoff strategy is distinct from several other interval and retry mechanisms in the DBOS system:

Scheduler Polling#

The DBOS dynamic scheduler uses a separate scheduler_polling_interval_sec parameter (default 30 seconds) to control how frequently it polls for schedule changes. Its jitter strategy differs: it caps random delay at 10 seconds regardless of interval size, whereas queue workers apply a proportional ±5% jitter. The scheduler does not implement contention-driven backoff.

Step-Level Retry Backoff#

DBOS provides independent exponential backoff for step retries, configured per step via the @DBOS.step() decorator with interval_seconds, max_attempts, and backoff_rate parameters. The maximum retry interval is capped at 3600 seconds. This mechanism governs re-execution of individual workflow steps on application-level failures, not database polling frequency.
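For contrast with the queue polling rules, a step retry schedule could be computed as in the sketch below. The formula is an assumption (the delay before retry n is interval_seconds × backoff_rate^n, capped at 3600 seconds, with one delay between each pair of consecutive attempts), and retry_delays is a hypothetical helper rather than a DBOS function.

```python
from typing import List


def retry_delays(
    interval_seconds: float,
    max_attempts: int,
    backoff_rate: float,
    cap_seconds: float = 3600.0,
) -> List[float]:
    """Delays between consecutive attempts, growing geometrically up to the cap."""
    return [
        min(interval_seconds * backoff_rate ** n, cap_seconds)
        for n in range(max_attempts - 1)
    ]


short_schedule = retry_delays(1.0, 5, 2.0)
capped_schedule = retry_delays(1000.0, 4, 3.0)
print(short_schedule, capped_schedule)
```

Unlike the queue polling interval, this schedule only grows; there is no decay step because each retry sequence ends when the step succeeds or attempts run out.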

PostgreSQL LISTEN/NOTIFY#

DBOS supports a use_listen_notify configuration option as an event-driven alternative to polling-based queue detection. When enabled, workers are notified of new queue entries rather than discovering them via periodic polls, eliminating poll-induced database load and making contention-driven backoff unnecessary for that notification path.


Relevant Code Files#

| File | Description | Key Components |
| --- | --- | --- |
| dbos/_queue.py | Queue class and worker thread implementation | Queue, queue_worker_thread(), queue_thread(), backoff logic |
| dbos/_sys_db.py | System database operations | start_queued_workflows(): transaction isolation, FOR UPDATE locking, workflow status updates |

  • DBOS Queue System: workflow queuing, concurrency limits, rate limiters, and partitioned queues
  • PostgreSQL Transaction Isolation: REPEATABLE READ semantics and snapshot conflict behavior
  • PostgreSQL Row-Level Locking: FOR UPDATE, SKIP LOCKED, and NOWAIT lock modes
  • Exponential Backoff: general algorithm class; doubling on failure, multiplicative decay on success
  • Thundering Herd Problem: the pathology that jitter and backoff jointly mitigate in distributed polling systems
  • Thread Lifecycle Management: how DBOS manages daemon threads and graceful shutdown via threading.Event
  • Database Connection Pooling: system database pool configuration (sys_db_pool_size, default 20) that bounds concurrent polling connections