
56. Cassandra for Time-Series Data

Status: Accepted
Date: 2025-07-06

Context

The Mercury system needs to store and query massive volumes of time-series data, primarily historical OHLCV (Open, High, Low, Close, Volume) price data. The workload is write-heavy (a constant stream of new data points is ingested) and queries are almost always time-range scans. Using our primary PostgreSQL database for this would risk performance bottlenecks: a general-purpose relational database is not designed for the sustained write throughput and data volumes that a time-series workload demands.

Decision

We will use Apache Cassandra as the specialized database for storing all high-volume, time-series trading data. A dedicated mercury-cassandra module will encapsulate all logic for interacting with the Cassandra cluster.

The data models in Cassandra will be heavily optimized for our primary query patterns. For example, OHLCV data will be partitioned by the trading pair and clustered by time, which makes time-range queries extremely efficient. Our primary PostgreSQL database will remain the system of record for transactional, relational data (like users, accounts, tournament state), but not for high-volume historical market data.
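
As a minimal sketch of what such a data model might look like, the snippet below uses the DataStax Python driver (cassandra-driver) to create an OHLCV table partitioned by trading pair and clustered by time. The keyspace, table, and column names are illustrative, not Mercury's actual schema, and the daily bucket in the partition key is an assumption on our part, a common Cassandra pattern for keeping partitions bounded in size:

```python
# Illustrative sketch using the DataStax Python driver (pip install cassandra-driver).
# Keyspace, table, and column names are hypothetical, not Mercury's real schema.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

# SimpleStrategy is only suitable for a local sketch; a real cluster would
# use NetworkTopologyStrategy with per-datacenter replication factors.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS market_data
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")

# Partitioning by (pair, day) keeps any single partition bounded in size;
# clustering by bucket_ts DESC stores each day's candles contiguously in
# time order, so time-range reads are sequential and return newest-first.
session.execute("""
    CREATE TABLE IF NOT EXISTS market_data.ohlcv (
        pair      text,
        day       date,
        bucket_ts timestamp,
        open      decimal,
        high      decimal,
        low       decimal,
        close     decimal,
        volume    decimal,
        PRIMARY KEY ((pair, day), bucket_ts)
    ) WITH CLUSTERING ORDER BY (bucket_ts DESC)
""")
```

The exact bucket granularity (daily, weekly, per-symbol-resolution) would be chosen so that partitions stay within Cassandra's recommended size limits for our actual candle resolution and retention period.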

Consequences

Positive:

  • High Write Throughput: Cassandra is designed for extremely high write throughput, making it ideal for ingesting a constant stream of market data without performance degradation.
  • Scalability: Cassandra scales horizontally, meaning we can handle growing data volumes by simply adding more nodes to the cluster.
  • Query Performance: When the data model is designed correctly, time-range queries in Cassandra are significantly more performant at scale than in a traditional relational database (see the sketch after this list).
  • Separation of Concerns: This decision separates the concerns of transactional state (Postgres) and historical time-series data (Cassandra), allowing us to use the right tool for the right job and preventing the main database from becoming bloated with historical data.
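
To make the first and third points concrete, here is a hedged continuation of the earlier sketch (same hypothetical schema and session): writes go through a prepared statement and execute_async so ingestion is not bottlenecked on per-row round trips, and a time-range read touches exactly one partition, which is what makes it cheap at scale.

```python
# Continues the earlier sketch; `session` and the market_data.ohlcv table
# come from the Decision section. All names remain illustrative.
from datetime import date, datetime
from decimal import Decimal

# Ingest path: prepared statement plus execute_async, so a stream of
# candles can be written without blocking on each round trip.
insert = session.prepare("""
    INSERT INTO market_data.ohlcv
        (pair, day, bucket_ts, open, high, low, close, volume)
    VALUES (?, ?, ?, ?, ?, ?, ?, ?)
""")
future = session.execute_async(
    insert,
    ("BTC-USD", date(2025, 7, 6), datetime(2025, 7, 6, 12, 0),
     Decimal("108000"), Decimal("108250"), Decimal("107900"),
     Decimal("108100"), Decimal("42.5")),
)
future.result()  # block here only to surface any write error

# Read path: a time-range query constrained to a single (pair, day)
# partition, scanned in clustering order.
rows = session.execute(
    """
    SELECT bucket_ts, open, high, low, close, volume
    FROM market_data.ohlcv
    WHERE pair = %s AND day = %s AND bucket_ts >= %s AND bucket_ts < %s
    """,
    ("BTC-USD", date(2025, 7, 6),
     datetime(2025, 7, 6, 0, 0), datetime(2025, 7, 6, 12, 0)),
)
```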

Negative:

  • Increased Operational Complexity: We must deploy, manage, and monitor a separate, distributed Cassandra cluster in addition to our PostgreSQL database. This is a significant operational burden.
  • Different Consistency Model: Cassandra's consistency is tunable, but at its default levels it is eventually consistent, a more complex model to reason about than the strong, transactional consistency of PostgreSQL.
  • Specialized Skillset: Designing efficient Cassandra data models and managing a Cassandra cluster requires a specialized skillset that is different from relational database administration.

Mitigation:

  • Infrastructure as Code (IaC): The deployment and configuration of the Cassandra cluster will be fully automated using our Ansible playbooks to reduce the manual operational burden.
  • Careful Data Modeling: We will invest time in designing our Cassandra schemas carefully to match our query patterns, which is the key to performance in Cassandra. This is not a "schema-less" solution for us.
  • Tunable Consistency: For queries that require higher consistency, we can tune the consistency level on a per-query basis, trading off some availability for stronger consistency where needed (see the sketch after this list).
  • Focused Application: We are not using Cassandra for everything. Its use is strictly limited to the time-series workload it excels at, minimizing the scope of its more complex consistency model.
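
As an illustration of that tunable-consistency point, again using the hypothetical schema and session from the Decision section, the Python driver lets a single statement opt into a stronger consistency level while every other query keeps the cluster default:

```python
# Continues the earlier sketch; schema and session remain illustrative.
from datetime import date, datetime

from cassandra import ConsistencyLevel
from cassandra.query import SimpleStatement

# Most reads can run at the session default (e.g. LOCAL_ONE) for
# availability; this one statement requires a quorum of replicas to
# agree before the read returns.
stmt = SimpleStatement(
    """
    SELECT close FROM market_data.ohlcv
    WHERE pair = %s AND day = %s AND bucket_ts = %s
    """,
    consistency_level=ConsistencyLevel.QUORUM,
)
row = session.execute(
    stmt, ("BTC-USD", date(2025, 7, 6), datetime(2025, 7, 6, 12, 0))
).one()
```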