119. Sampling-Based Candidate Collection

Status: Accepted
Date: 2025-07-06

Context

The Maat module is designed to collect "Bad Signal" (BS) candidates from across the system. Some sources of BS candidates can be extremely high-volume: reviewing every trade taken by the Morpheus shadow trading engine, for example, could generate thousands of candidates per day.

Processing and storing every single one of these candidates would create an enormous amount of data, and the human review process (adr://human-in-the-loop-validation) would be completely overwhelmed. We need a mechanism to control the volume of data collected while still getting a representative view of the system's performance.

Decision

The Maat module and its associated emitter services will implement a sampling-based collection strategy.

For high-volume data sources, the configuration will include a samplingProbability parameter (a number between 0 and 1). When a BS candidate is generated, the emitter service will use this probability to decide whether to actually emit the event or simply discard it. For example, with a samplingProbability of 0.1, only 10% of the potential candidates would be collected.
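
As a sketch of how an emitter might apply this (the maybeEmit and emitToMaat helpers and the BsCandidate type are illustrative, not an existing API), the decision reduces to a single comparison against a uniform random draw:

```typescript
// Illustrative shapes; the real emitter's interfaces may differ.
interface BsCandidate {
  source: string;
  payload: unknown;
}

// Hypothetical transport call standing in for the real emitter pipeline.
function emitToMaat(candidate: BsCandidate): void {
  console.log(`emitting candidate from ${candidate.source}`);
}

// With samplingProbability p in [0, 1], each candidate is kept
// independently with probability p.
function maybeEmit(candidate: BsCandidate, samplingProbability: number): void {
  if (Math.random() < samplingProbability) {
    emitToMaat(candidate);
  }
  // Otherwise the candidate is silently discarded.
}
```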

This sampling can be configured on a per-source basis. We might choose to sample Morpheus trade reviews at 5%, but always collect 100% of critical events like TAValidationService failures.
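
Concretely, a per-source configuration could look like the sketch below; the source keys and overall shape are hypothetical, and only the samplingProbability parameter itself is defined by this decision:

```typescript
// Hypothetical per-source sampling configuration.
const samplingConfig: Record<string, { samplingProbability: number }> = {
  // High-volume, routine source: keep 5% of candidates.
  "morpheus.trade-review": { samplingProbability: 0.05 },
  // Critical source: always collect.
  "ta-validation.failure": { samplingProbability: 1.0 },
};
```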

Consequences

Positive:

  • Manages Data Volume: Provides a simple and effective way to control the amount of data being sent to Maat for review, preventing the system (and the human reviewers) from being overwhelmed.
  • Reduces Cost: Less data stored and processed means lower storage and compute costs.
  • Configurable: The sampling rate is a configuration parameter, so it can be easily adjusted for different environments or different data sources without changing any code. We can increase the sampling rate if we are investigating a specific issue, and decrease it during normal operation.

Negative:

  • Potential to Miss Issues: The most significant drawback is that, by definition, we will not see every potential issue. A critical but rare issue might be missed if it falls into the unsampled majority (e.g., the 95% of events discarded at a 5% sampling rate).
  • Less Precise Metrics: Metrics based on the sampled data (e.g., "percentage of bad trades") will be an estimate based on the sample, not an exact count based on the total population.
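
To make the imprecision concrete: a count observed in the sample is scaled by 1/p to estimate the population count, and that estimate carries sampling noise. The helper below is an illustrative sketch, not part of Maat:

```typescript
/**
 * Estimate a population count from a sampled count.
 * With samplingProbability p, an observed count k estimates k / p
 * events in the full population; the standard error is roughly
 * sqrt(k) / p under a Poisson approximation.
 */
function estimatePopulationCount(sampledCount: number, samplingProbability: number) {
  const estimate = sampledCount / samplingProbability;
  const stdError = Math.sqrt(sampledCount) / samplingProbability;
  return { estimate, stdError };
}

// e.g. 40 bad trades seen at a 5% sampling rate suggests
// roughly 800 +/- 126 in the full population.
console.log(estimatePopulationCount(40, 0.05));
```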

Mitigation:

  • Smart Sampling and Stratification: We will not use a single, global sampling rate. We will apply different rates to different sources based on their criticality. Critical events (like validation failures) will always have a 100% sampling rate. Less critical, high-volume events (like routine trade reviews) will have a lower rate.
  • Adaptive Sampling: In the future, the sampling rate could be made dynamic. If the system detects an increase in a certain type of issue, it could automatically increase the sampling rate for that source to gather more data for analysis (a sketch follows this list).
  • Complementary Metrics: The sampled BS review is not our only form of monitoring. It is complemented by comprehensive, non-sampled metrics and dashboards in Grafana that track the overall population's performance. The BS review is for deep, qualitative analysis of specific examples, not for aggregate statistical tracking.
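
As a rough illustration of that future direction (entirely hypothetical; no adaptive mechanism exists today, and the thresholds and step sizes below are placeholders), an adaptive controller might raise a source's rate while its recent issue rate is elevated and decay it back toward the configured baseline otherwise:

```typescript
// Hypothetical adaptive controller for a single source's sampling rate.
function adaptSamplingRate(
  currentRate: number,
  recentIssueRate: number, // e.g. fraction of sampled candidates confirmed as issues
  baselineRate: number = 0.05,
  issueThreshold: number = 0.2,
): number {
  if (recentIssueRate > issueThreshold) {
    // Something looks wrong: collect more data, up to 100%.
    return Math.min(1.0, currentRate * 2);
  }
  // Normal operation: decay back toward the configured baseline.
  return Math.max(baselineRate, currentRate * 0.9);
}
```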