101. BullMQ for Reliable Message Sending
Status: Accepted Date: 2025-07-06
Context
Sending messages through the Telegram API is an external network call that can fail. Furthermore, the Telegram API imposes strict rate limits on how many messages can be sent per second. If multiple parts of our application try to send messages at the same time, we could easily exceed these limits, causing failed messages. A simple, direct call to the Telegram API from our business logic is therefore not robust.
Decision
All outgoing Telegram messages will be sent via a BullMQ queue.
Instead of calling the Telegram API directly, any service wanting to send a message will add a SendMessageJob to a dedicated telegram-messages queue. A separate worker process (part of the mercury-worker) will consume jobs from this queue.
This consumer will be configured with BullMQ's rate-limiting features. For example, it can be configured to process no more than 20 jobs per second, ensuring we always stay within Telegram's global rate limit. It will also handle retries with exponential backoff for any messages that fail due to network errors or temporary Telegram API issues.
Consequences
Positive:
- Reliability and Durability: Messages are persisted in the Redis queue. If the application crashes, no messages are lost. The worker will send them when it restarts. The built-in retry logic handles transient network failures automatically.
- Rate Limit Management: Provides a single, centralized point of control for managing API rate limits. This effectively eliminates the risk of being rate-limited by Telegram, which is a common and difficult problem in bot development.
- Decoupling: The application's business logic is decoupled from the complexities of message delivery. It can fire-and-forget a job to the queue, confident that the message will eventually be sent.
- Improved Performance: The application logic can add a job to the queue (a very fast, local Redis operation) and move on, without waiting for the slower external network call to the Telegram API to complete.
Negative:
- Increased Complexity: Adds a queue and a worker to the architecture, which are more moving parts to manage compared to a direct API call.
- Delayed Sending: Messages are not sent instantly. There will be a small delay (from milliseconds to seconds) as the job waits in the queue to be processed.
Mitigation:
- Existing Infrastructure: We are already using BullMQ and a worker process (mercury-worker) for other background tasks (adr://queue-based-processing). We are simply adding a new queue and consumer to this existing infrastructure, so the marginal complexity is very low.
- Acceptable Delay: For almost all bot interactions, a sub-second delay in message delivery is acceptable and not noticeable to the user. The gains in reliability far outweigh this minor delay. For use cases that require a near-instantaneous response, we can create a separate high-priority queue.