83. Queue-Based Background Processing
Status: Accepted
Date: 2025-07-06
Context
Many operations in the Mercury system are long-running or should not block the main application thread. Examples include fetching large amounts of historical market data, generating complex reports, or calling external AI APIs. If we perform these tasks synchronously within an API request, it will lead to very long response times and a poor user experience. It also makes the system brittle; a failure in a long-running task could cause the entire request to fail.
Decision
We will use a queue-based architecture for all non-trivial background processing. Specifically, we will use BullMQ, a robust and popular message queue system for Node.js that is built on top of Redis.
Any operation that is expected to take more than a few hundred milliseconds, or that needs to be reliable and retryable, will be implemented as a background job.
- An API endpoint will receive the request, perform initial validation, and then add a job to a specific BullMQ queue.
- It will then immediately return a response to the user (e.g., a job ID for polling, or a simple "Accepted" status).
- A separate `mercury-worker` process (adr://comprehensive-consumer-orchestration) will listen to the queue, pick up the job, and execute the long-running task in the background.
Consequences
Positive:
- Improved Responsiveness & User Experience: The API responds almost instantly, as it only needs to add a job to the queue, not perform the work itself.
- Reliability & Durability: Jobs in the queue are persistent. If the worker process crashes, the jobs remain in the queue and will be processed when the worker restarts. BullMQ also provides automatic retries with exponential backoff for failed jobs.
- Decoupling & Scalability: The API producers are completely decoupled from the background consumers. We can scale the number of API servers and worker processes independently based on load. If there is a surge in requests, the queue will absorb the load gracefully.
- Rate Limiting & Concurrency Control: BullMQ provides fine-grained control over how many jobs are processed concurrently, which is essential for managing rate limits on external APIs.
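The retry and concurrency features above are configured through BullMQ's job and worker options. The shapes below follow BullMQ's documented options; the specific numbers are illustrative, not a project standard.

```javascript
// BullMQ-style job options: retry up to 5 times with exponential backoff.
const jobOptions = {
  attempts: 5,
  backoff: { type: 'exponential', delay: 1000 } // base delay of 1s
};

// BullMQ-style worker options: bound concurrency and rate-limit job starts,
// e.g. to stay within an external API's quota.
const workerOptions = {
  concurrency: 5,                       // at most 5 jobs in flight per worker
  limiter: { max: 10, duration: 1000 }  // at most 10 jobs started per second
};

// Exponential backoff doubles the delay on each failed attempt:
// baseDelay * 2^(attemptsMade - 1), i.e. 1s, 2s, 4s, 8s, ...
function backoffDelay(attemptsMade, baseDelay) {
  return baseDelay * 2 ** (attemptsMade - 1);
}
```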
Negative:
- Increased Complexity: A queue-based system introduces a new moving part (the Redis-based queue) and a new programming model (asynchronous jobs vs. synchronous requests).
- Eventual Consistency: The result of the operation is not available immediately. The client may need to poll a status endpoint to find out when the job is complete and get its result, which is a more complex client-side interaction.
Mitigation:
- Robust Tooling: BullMQ is a mature library with excellent features and a monitoring UI (`bull-board`, see adr://bull-dashboard-monitoring) that helps manage the complexity.
- Clear API Contracts for Asynchronous Operations: For operations that become asynchronous, we will establish a clear API contract (e.g., `POST /tasks` returns a `{ "jobId": "..." }`, and `GET /tasks/{jobId}` returns the status and result).
- Use Synchronous Processing Where Appropriate: We will not overuse queues. For operations that are fast, simple, and idempotent, we will continue to use a simple synchronous request-response model. Queues are for tasks that are slow, complex, or require high reliability.
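The asynchronous API contract can be sketched as two handlers, shown here as framework-agnostic functions (the route wiring and the `enqueue`/`lookup` dependencies are hypothetical placeholders, injected for illustration).

```javascript
// POST /tasks — validate, enqueue, and return 202 Accepted with a job ID.
// `enqueue` is a placeholder for whatever adds the job to the BullMQ queue.
function createTask(enqueue, payload) {
  const jobId = enqueue(payload);
  return { status: 202, body: { jobId } };
}

// GET /tasks/{jobId} — report the job's state and, once completed, its result.
// `lookup` is a placeholder for reading job state back from the queue.
function getTask(lookup, jobId) {
  const job = lookup(jobId);
  if (!job) {
    return { status: 404, body: { error: 'unknown job' } };
  }
  return {
    status: 200,
    body: { jobId, state: job.state, result: job.result ?? null }
  };
}
```

Clients poll `GET /tasks/{jobId}` until `state` reaches a terminal value; the 202 response makes it explicit that the work was accepted, not completed.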