55. Comprehensive Bot Health Monitoring
Status: Accepted
Date: 2025-07-06
Context
The mercury-bot is the central nervous system of our trading operation. Its health is paramount. If the bot is running but unable to connect to a critical dependency (like the PostgreSQL database, Redis, or the Bybit API), it is effectively non-operational, even if its own process is "up". We need a reliable way for our automated monitoring systems (like Prometheus) and for human operators to quickly assess the true health of the bot and its entire operational context.
Decision
The mercury-bot application will expose a comprehensive health check endpoint at /health. This endpoint will do more than just return a 200 OK status to indicate the process is running. It will perform a series of internal checks to verify its ability to function correctly.
The health check will validate:
- Database Connectivity: Ability to connect to the PostgreSQL database and perform a simple query (e.g., SELECT 1).
- Redis Connectivity: Ability to connect to the Redis server and execute a PING command.
- Exchange API Connectivity: Ability to reach the key external exchange API endpoints (e.g., Bybit's status endpoint).
- Queue Health: Check the connection to the BullMQ instance and optionally check for abnormal queue lengths or a high number of failed jobs.
The endpoint will return a detailed JSON response indicating the status (up or down) of each individual component, along with an overall status. The HTTP status code will be 200 only if all critical checks pass; otherwise, it will be 503 Service Unavailable.
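As a sketch of what this could look like with @nestjs/terminus (the library suggested under Mitigation below), the controller below wires the database, Redis, and exchange checks together. The Bybit URL, the REDIS_CLIENT injection token, and the custom Redis indicator are illustrative assumptions, not the final implementation; HttpHealthIndicator additionally requires the HttpModule from @nestjs/axios, and a BullMQ indicator would be added in the same style.

```ts
// Sketch only: assumes @nestjs/terminus and an injected ioredis client.
// The Bybit URL and the REDIS_CLIENT token are illustrative placeholders.
import { Controller, Get, Inject } from '@nestjs/common';
import {
  HealthCheck,
  HealthCheckError,
  HealthCheckService,
  HealthIndicatorResult,
  HttpHealthIndicator,
  TypeOrmHealthIndicator,
} from '@nestjs/terminus';
import type { Redis } from 'ioredis';

@Controller('health')
export class HealthController {
  constructor(
    private readonly health: HealthCheckService,
    private readonly db: TypeOrmHealthIndicator,
    private readonly http: HttpHealthIndicator,
    @Inject('REDIS_CLIENT') private readonly redis: Redis,
  ) {}

  @Get()
  @HealthCheck()
  check() {
    return this.health.check([
      // Critical: PostgreSQL reachability (issues a trivial query internally).
      () => this.db.pingCheck('database', { timeout: 1500 }),
      // Critical: Redis reachability via PING (custom indicator below).
      () => this.pingRedis('redis'),
      // Exchange reachability; the endpoint shown here is illustrative.
      () => this.http.pingCheck('bybit', 'https://api.bybit.com/v5/market/time'),
      // A BullMQ indicator (queue depth, failed-job count) would be added here.
    ]);
  }

  // Custom indicator: report "up" on a successful PING, otherwise signal failure
  // to Terminus so the overall status becomes "error" (HTTP 503).
  private async pingRedis(key: string): Promise<HealthIndicatorResult> {
    try {
      await this.redis.ping();
      return { [key]: { status: 'up' } };
    } catch (err) {
      throw new HealthCheckError('Redis check failed', {
        [key]: { status: 'down', message: (err as Error).message },
      });
    }
  }
}
```

With this wiring, Terminus itself produces the per-component JSON (overall status plus info, error, and details sections) and responds with 200 when every indicator is up and 503 Service Unavailable otherwise, which matches the contract described above.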
Consequences
Positive:
- Accurate Health Assessment: Provides a true, holistic view of the bot's ability to operate. A 200 OK response means the bot is not just running, but fully functional.
- Faster Incident Response: Allows monitoring systems to detect not just that the bot has crashed, but that it has lost connectivity to a key dependency. This leads to faster and more accurate alerting, helping operators pinpoint the root cause of a problem immediately.
- Enables Automated Recovery: A reliable health check is a prerequisite for automated systems like Kubernetes to perform actions like restarting a faulty container or redirecting traffic.
Negative:
- Performance Overhead: The health check endpoint performs I/O operations (to the database, Redis, etc.), which adds a small amount of load to the bot and its dependencies.
- Increased Complexity: The health check logic itself adds a small amount of complexity to the application's codebase.
Mitigation:
- Caching Health Status: To minimize performance impact, the results of the health checks can be cached in memory for a short period (e.g., 5-10 seconds). This ensures that frequent polling of the /health endpoint by monitoring systems does not overwhelm the application or its dependencies (see the caching sketch after this list).
- Standardized Libraries: We can use a standard library like @nestjs/terminus to implement the health checks. Such libraries provide a structured way to build comprehensive health indicators and reduce boilerplate code.
- Distinguishing Criticality: The health check logic will distinguish between critical failures (e.g., cannot connect to the database) and warnings (e.g., a specific non-essential API is down), returning the appropriate overall status.
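A minimal caching sketch, assuming an in-memory TTL of roughly 5 seconds and a hypothetical CachedHealthService wrapper (names and wiring are illustrative):

```ts
// Sketch only: a hypothetical in-memory cache so frequent /health polling
// does not re-run the expensive dependency checks on every request.
import { Injectable } from '@nestjs/common';
import { HealthCheckResult } from '@nestjs/terminus';

@Injectable()
export class CachedHealthService {
  // ~5 seconds, at the low end of the 5-10 second window discussed above.
  private readonly ttlMs = 5_000;
  private cached?: { result: HealthCheckResult; expiresAt: number };

  // Runs the supplied check only when the cached result has expired.
  // Failed checks throw and are not cached, so a recovering dependency
  // is re-probed on the very next poll.
  async get(run: () => Promise<HealthCheckResult>): Promise<HealthCheckResult> {
    const now = Date.now();
    if (this.cached && this.cached.expiresAt > now) {
      return this.cached.result;
    }
    const result = await run();
    this.cached = { result, expiresAt: now + this.ttlMs };
    return result;
  }
}
```

The health controller could then wrap its checks with something like cachedHealth.get(() => this.health.check([...])), so repeated scrapes within the TTL reuse the last result while a failing or recovering dependency is still re-checked promptly.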