55. Comprehensive Bot Health Monitoring
Status: Accepted
Date: 2025-07-06
Context
The mercury-bot is the central nervous system of our trading operation. Its health is paramount. If the bot is running but unable to connect to a critical dependency (like the PostgreSQL database, Redis, or the Bybit API), it is effectively non-operational, even if its own process is "up". We need a reliable way for our automated monitoring systems (like Prometheus) and for human operators to quickly assess the true health of the bot and its entire operational context.
Decision
The mercury-bot application will expose a comprehensive health check endpoint at /health. This endpoint will do more than just return a 200 OK status to indicate the process is running. It will perform a series of internal checks to verify its ability to function correctly.
The health check will validate:
- Database Connectivity: Ability to connect to the PostgreSQL database and perform a simple query (e.g., SELECT 1).
- Redis Connectivity: Ability to connect to the Redis server and execute a PING command.
- Exchange API Connectivity: Ability to reach the key external exchange API endpoints (e.g., Bybit's status endpoint).
- Queue Health: Check the connection to the BullMQ instance and optionally check for abnormal queue lengths or a high number of failed jobs.
The endpoint will return a detailed JSON response indicating the status (up or down) of each individual component, along with an overall status. The HTTP status code will be 200 only if all critical checks pass; otherwise, it will be 503 Service Unavailable.
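As a sketch of what this could look like with @nestjs/terminus (the library suggested under Mitigation below), the controller below wires the database, Redis, and exchange checks together. The Bybit URL, the REDIS_CLIENT injection token, and the custom Redis indicator are illustrative assumptions, not the final implementation; HttpHealthIndicator additionally requires the HttpModule from @nestjs/axios, and a BullMQ indicator would be added in the same style.

```ts
// Sketch only: assumes @nestjs/terminus and an injected ioredis client.
// The Bybit URL and the REDIS_CLIENT token are illustrative placeholders.
import { Controller, Get, Inject } from '@nestjs/common';
import {
  HealthCheck,
  HealthCheckError,
  HealthCheckService,
  HealthIndicatorResult,
  HttpHealthIndicator,
  TypeOrmHealthIndicator,
} from '@nestjs/terminus';
import type { Redis } from 'ioredis';

@Controller('health')
export class HealthController {
  constructor(
    private readonly health: HealthCheckService,
    private readonly db: TypeOrmHealthIndicator,
    private readonly http: HttpHealthIndicator,
    @Inject('REDIS_CLIENT') private readonly redis: Redis,
  ) {}

  @Get()
  @HealthCheck()
  check() {
    return this.health.check([
      // Critical: PostgreSQL reachability (issues a trivial query internally).
      () => this.db.pingCheck('database', { timeout: 1500 }),
      // Critical: Redis reachability via PING (custom indicator below).
      () => this.pingRedis('redis'),
      // Exchange reachability; the endpoint shown here is illustrative.
      () => this.http.pingCheck('bybit', 'https://api.bybit.com/v5/market/time'),
      // A BullMQ indicator (queue depth, failed-job count) would be added here.
    ]);
  }

  // Custom indicator: report "up" on a successful PING, otherwise signal failure
  // to Terminus so the overall status becomes "error" (HTTP 503).
  private async pingRedis(key: string): Promise<HealthIndicatorResult> {
    try {
      await this.redis.ping();
      return { [key]: { status: 'up' } };
    } catch (err) {
      throw new HealthCheckError('Redis check failed', {
        [key]: { status: 'down', message: (err as Error).message },
      });
    }
  }
}
```

With this wiring, Terminus itself produces the per-component JSON (overall status plus info, error, and details sections) and responds with 200 when every indicator is up and 503 Service Unavailable otherwise, which matches the contract described above.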
Consequences
Positive:
- Accurate Health Assessment: Provides a true, holistic view of the bot's ability to operate. A 200 OK response means the bot is not just running, but fully functional.
- Faster Incident Response: Allows monitoring systems to detect not just that the bot has crashed, but that it has lost connectivity to a key dependency. This leads to faster and more accurate alerting, helping operators pinpoint the root cause of a problem immediately.
- Enables Automated Recovery: A reliable health check is a prerequisite for automated systems like Kubernetes to perform actions like restarting a faulty container or redirecting traffic.
Negative:
- Performance Overhead: The health check endpoint performs I/O operations (to the database, Redis, etc.), which adds a small amount of load to the bot and its dependencies.
- Increased Complexity: The health check logic itself adds a small amount of complexity to the application's codebase.
Mitigation:
- Caching Health Status: To minimize performance impact, the results of the health checks can be cached in memory for a short period (e.g., 5-10 seconds). This ensures that frequent polling of the /health endpoint by monitoring systems does not overwhelm the application or its dependencies (see the caching sketch after this list).
- Standardized Libraries: We can use a standard library like @nestjs/terminus to implement the health checks. Such libraries provide a structured way to build comprehensive health indicators and reduce boilerplate code.
- Distinguishing Criticality: The health check logic will distinguish between critical failures (e.g., cannot connect to the database) and warnings (e.g., a specific non-essential API is down), returning the appropriate overall status.
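A minimal caching sketch, assuming an in-memory TTL of roughly 5 seconds and a hypothetical CachedHealthService wrapper (names and wiring are illustrative):

```ts
// Sketch only: a hypothetical in-memory cache so frequent /health polling
// does not re-run the expensive dependency checks on every request.
import { Injectable } from '@nestjs/common';
import { HealthCheckResult } from '@nestjs/terminus';

@Injectable()
export class CachedHealthService {
  // ~5 seconds, at the low end of the 5-10 second window discussed above.
  private readonly ttlMs = 5_000;
  private cached?: { result: HealthCheckResult; expiresAt: number };

  // Runs the supplied check only when the cached result has expired.
  // Failed checks throw and are not cached, so a recovering dependency
  // is re-probed on the very next poll.
  async get(run: () => Promise<HealthCheckResult>): Promise<HealthCheckResult> {
    const now = Date.now();
    if (this.cached && this.cached.expiresAt > now) {
      return this.cached.result;
    }
    const result = await run();
    this.cached = { result, expiresAt: now + this.ttlMs };
    return result;
  }
}
```

The health controller could then wrap its checks with something like cachedHealth.get(() => this.health.check([...])), so repeated scrapes within the TTL reuse the last result while a failing or recovering dependency is still re-checked promptly.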