
Mercury Infrastructure Upgrade: From Redis Poverty to Production Beast 🚀

· 3 min read
Max Kaido
Architect

Today we diagnosed why Mercury was scheduling only 586 jobs instead of the expected 1,524 for TA availability checks. Spoiler alert: our infrastructure was basically asking Redis to bench press while sitting on the bar! 💪😅

The Great Redis Mystery 🕵️

Our Mercury trading system runs 5 variants (A, B, H, R, W) with a sophisticated job scheduling system. Each should schedule:

  • 508 USDT markets × 3 timeframes = 1,524 TA availability jobs

But we were only getting 586 jobs. The math was simple: something was dying silently.
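The size of the gap is easy to quantify. A quick sketch of the arithmetic, using only the figures above (the variable names are illustrative):

```typescript
// Figures come from the post; the names are illustrative only.
const usdtMarkets = 508;
const timeframes = 3;
const scheduledJobs = 586;

const expectedJobs = usdtMarkets * timeframes; // 1,524
const missingJobs = expectedJobs - scheduledJobs; // 938
const coveragePct = Math.round((scheduledJobs / expectedJobs) * 1000) / 10; // 38.5
```

In other words, a single run was covering a bit over a third of the jobs it should have.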

Root Cause: Algorithmic Brutality + Hardware Poverty

The Redis Killer Code

// This innocent-looking code was murdering Redis 💀
const existingJobs = await this.shadowOrdersQueue.getJobs(['waiting', 'delayed']);
const jobAlreadyExists = existingJobs.some(
  (job) =>
    job.name === MorpheusJobName.EXECUTE_SHADOW_ORDER &&
    job.data.orderId === orderId,
);

What this actually does:

  1. Fetches ALL 2,500+ jobs from Redis into memory
  2. Scans through every single job to check for duplicates
  3. Repeats this 1,524 times during bulk scheduling
  4. Creates O(n²) complexity when Redis offers O(1) with proper job IDs
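To make the cost concrete, here is a minimal in-memory sketch (no Redis or BullMQ involved) that counts comparisons for the scan-per-job approach versus a keyed lookup. The `FakeJob` type, the function names, and the generated order IDs are assumptions for illustration; the 2,500 and 1,524 figures come from the list above.

```typescript
// In-memory model only: FakeJob and both functions are hypothetical.
interface FakeJob {
  name: string;
  orderId: string;
}

// Mirrors the scan-based check: every new job walks the whole list.
function countScanComparisons(existing: FakeJob[], newOrderIds: string[]): number {
  const jobs = existing.slice();
  let comparisons = 0;
  for (const orderId of newOrderIds) {
    const alreadyExists = jobs.some((job) => {
      comparisons += 1;
      return job.orderId === orderId;
    });
    if (!alreadyExists) {
      jobs.push({ name: 'execute-shadow-order', orderId });
    }
  }
  return comparisons;
}

// Keyed alternative: one hash lookup per new job.
function countKeyedLookups(existing: FakeJob[], newOrderIds: string[]): number {
  const seen = new Set(existing.map((job) => job.orderId));
  let lookups = 0;
  for (const orderId of newOrderIds) {
    lookups += 1;
    if (!seen.has(orderId)) {
      seen.add(orderId);
    }
  }
  return lookups;
}

// 2,500 jobs already queued, 1,524 new ones to schedule (none are duplicates).
const queuedJobs: FakeJob[] = Array.from({ length: 2500 }, (_, i) => ({
  name: 'execute-shadow-order',
  orderId: `existing-${i}`,
}));
const incomingOrderIds = Array.from({ length: 1524 }, (_, i) => `new-${i}`);

const scanComparisons = countScanComparisons(queuedJobs, incomingOrderIds);
const keyedLookups = countKeyedLookups(queuedJobs, incomingOrderIds);
```

Run as written, the scan variant performs roughly five million comparisons against 1,524 keyed lookups, and this model does not even include the cost of pulling every job out of Redis on each pass.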

The Hardware Reality Check

# Our current "production" setup 🤡
redis:
  image: 'redis:alpine' # 1GB memory limit
  command: redis-server --appendonly yes # No limits, no config

Load Analysis:

  • 5 Mercury variants + 3 domain apps
  • BullMQ queues (high write volume)
  • TA cache data for 508 markets × multiple timeframes
  • Market data cache
  • All running on 2x ARM servers (4 CPU, 8GB RAM each)

Result: Redis memory exhaustion → connection timeouts → silent job failures.
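One way to keep failures like this from staying silent is to count accepted and rejected enqueue attempts and compare them against the expected total. The sketch below is an illustration under assumptions, not Mercury's code: `enqueue` stands in for whatever call actually adds a job, and the flaky stub simply throws for some inputs to imitate a timeout.

```typescript
// Hypothetical helper: `enqueue` is a stand-in for the real queue call.
interface EnqueueReport {
  expected: number;
  scheduled: number;
  failures: { key: string; reason: string }[];
}

async function enqueueAllAndReport(
  keys: string[],
  enqueue: (key: string) => Promise<void>,
): Promise<EnqueueReport> {
  const report: EnqueueReport = { expected: keys.length, scheduled: 0, failures: [] };
  for (const key of keys) {
    try {
      await enqueue(key);
      report.scheduled += 1;
    } catch (err) {
      // Record the failure instead of swallowing it.
      const reason = err instanceof Error ? err.message : String(err);
      report.failures.push({ key, reason });
    }
  }
  return report;
}

// Stub that imitates a connection timeout on every third job.
async function flakyEnqueue(key: string): Promise<void> {
  const index = Number(key.split('-')[1]);
  if (index % 3 === 0) {
    throw new Error('simulated connection timeout');
  }
}
```

A caller could then log or alert whenever `scheduled` is lower than `expected`, which would have flagged the 586-of-1,524 shortfall on the first run.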

The "Robust by Accident" Discovery

The funniest part? Our TA availability scheduler has built-in resilience:

// Keeps rescheduling until all markets are covered
// Mercury: "Redis failed me? Fine, I'll just keep trying!"

The system was literally self-healing through brute force scheduling. Eventually, after multiple runs, all 1,524 jobs would get scheduled. Peak Mercury engineering! 🚀
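The behavior described above amounts to a retry-until-covered loop. Here is a toy model, assuming (purely for illustration) that each run manages to schedule at most 586 of the still-missing jobs; `runUntilCovered` and its parameters are hypothetical names, not Mercury's scheduler.

```typescript
// Toy model: each pass schedules up to `perRunCapacity` of the missing keys.
function runUntilCovered(
  allKeys: string[],
  perRunCapacity: number,
  maxRuns: number,
): { runs: number; covered: number } {
  const covered = new Set<string>();
  let runs = 0;
  while (covered.size < allKeys.length && runs < maxRuns) {
    runs += 1;
    // Only jobs that are still missing are attempted again.
    const missing = allKeys.filter((key) => !covered.has(key));
    missing.slice(0, perRunCapacity).forEach((key) => covered.add(key));
  }
  return { runs, covered: covered.size };
}

const taJobKeys = Array.from({ length: 1524 }, (_, i) => `ta-job-${i}`);
const outcome = runUntilCovered(taJobKeys, 586, 10);
// outcome.runs === 3, outcome.covered === 1524
```

Under that assumption the loop converges in three runs, but only by repeating work that a healthy Redis would have accepted the first time.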

The Great Infrastructure Upgrade Plan

Current State: Poverty Edition

  • 2x Hetzner ARM VPS (4 CPU, 8GB RAM) - "not for high CPU load"
  • Redis running on bicycle wheels 🚲
  • Multiple services fighting for 8GB

Future State: Beast Mode

JANUS (The Beast - Xeon, Unlimited Power):
├── Mercury-TA (primary) - Heavy TA-Lib calculations
├── Domain apps (Arcana, Anytracker, Maschine) - 99.9% idle
└── Mercury-TA (failover) - Backup from Hetzner

ARM Server 1 (Hetzner):
├── Mercury variants A, B, H
├── Redis (dedicated namespace)
└── PostgreSQL (mercury-abh)

ARM Server 2 (Hetzner):
├── Mercury variants R, W
├── Redis (dedicated namespace)
└── PostgreSQL (mercury-rw)

Strategy Benefits

  1. Co-location: Redis + PostgreSQL + App on same instance = no network hop, only loopback latency
  2. Load separation: Heavy computation → JANUS, Trading logic → ARM servers
  3. Bulletproof: Server death affects only 2-3 variants, not everything
  4. Cost effective: ARM servers handle what they're good at, beast handles heavy lifting
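The failure-isolation claim can be checked against the planned layout with a few lines of code. The `placement` object below just transcribes which variants are planned for which ARM server; the server keys and the `blastRadius` helper are names made up for this sketch.

```typescript
// Planned variant placement, transcribed from the layout above.
// The server keys are illustrative labels, not real hostnames.
const placement: Record<string, string[]> = {
  'arm-server-1': ['A', 'B', 'H'],
  'arm-server-2': ['R', 'W'],
};

// Variants that go down with a given server, and those that keep running.
function blastRadius(failedServer: string): { down: string[]; up: string[] } {
  const down = placement[failedServer] ?? [];
  const up = Object.keys(placement)
    .filter((server) => server !== failedServer)
    .reduce<string[]>((acc, server) => acc.concat(placement[server] ?? []), []);
  return { down, up };
}

const ifServer1Dies = blastRadius('arm-server-1'); // down: A, B, H; up: R, W
const ifServer2Dies = blastRadius('arm-server-2'); // down: R, W; up: A, B, H
```

This only models the trading variants; JANUS and the Mercury-TA failover are outside the sketch.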

Technical Lessons Learned

1. Redis Performance Isn't the Problem

Redis can handle hundreds of thousands of operations per second on a single instance, and more with pipelining. The issue was:

  • Algorithmic complexity: O(n²) duplicate checking
  • Memory limits: 1GB trying to hold gigabytes
  • No configuration: Default limits for production load

2. ARM Servers Have Their Place

ARM architecture is great for:

  • Trading logic (sufficient CPU performance)
  • I/O bound operations (network, database)
  • Cost efficiency for sustained workloads

Not great for:

  • Heavy computational tasks (TA-Lib calculations)
  • Memory-intensive operations (large Redis datasets)

3. The Power of Proper Job Scheduling

// Fix: Use deterministic job IDs instead of scanning
await queue.add(jobName, data, {
  jobId: `execute-order-${orderId}`, // O(1) duplicate prevention
  removeOnComplete: false, // Keep finished jobs so the ID keeps blocking re-adds
  removeOnFail: false, // Same for failed jobs
});
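The same idea extends to the TA availability jobs: derive each ID from the market and timeframe so that re-running the scheduler cannot create duplicates. The sketch below only builds job descriptors, in the `{ name, data, opts }` shape that BullMQ's `addBulk` accepts; the job name, ID format, market symbols, and timeframe labels are all assumptions for illustration.

```typescript
// Hypothetical descriptor builder; nothing here talks to Redis.
interface TaJobData {
  market: string;
  timeframe: string;
}

interface TaJobDescriptor {
  name: string;
  data: TaJobData;
  opts: { jobId: string };
}

function buildTaAvailabilityJobs(markets: string[], timeframes: string[]): TaJobDescriptor[] {
  const jobs: TaJobDescriptor[] = [];
  for (const market of markets) {
    for (const timeframe of timeframes) {
      jobs.push({
        name: 'check-ta-availability', // assumed job name
        data: { market, timeframe },
        // Same inputs always yield the same ID, so a queue that
        // dedupes by ID can ignore repeats.
        opts: { jobId: `ta-availability-${market}-${timeframe}` },
      });
    }
  }
  return jobs;
}

const sampleMarkets = Array.from({ length: 508 }, (_, i) => `MARKET${i}USDT`); // placeholder symbols
const sampleTimeframes = ['1h', '4h', '1d']; // placeholder labels
const taJobs = buildTaAvailabilityJobs(sampleMarkets, sampleTimeframes);
const uniqueIds = new Set(taJobs.map((job) => job.opts.jobId));
// taJobs.length === 1524 and uniqueIds.size === 1524
```

With a real queue, this array could be handed to `queue.addBulk(taJobs)`, replacing 1,524 fetch-and-scan rounds with a single bulk call.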

4. Infrastructure Co-location Strategy

Placing related services together eliminates:

  • Network latency between Redis and app
  • Connection pool exhaustion across servers
  • Complex service discovery and networking
  • Cascade failures from network issues

The Cheap Ass Engineering Philosophy

Sometimes the best solutions come from constraints:

  1. Work with what you have until you hit real limits
  2. Profile before upgrading - understand your bottlenecks
  3. Horizontal scaling can be cheaper than vertical
  4. Robust-by-accident designs often work better than over-engineered ones

Our "accidental resilience" through multiple scheduling attempts taught us that eventual consistency can be a feature, not a bug.

Next Steps

  1. Immediate: Fix the O(n²) scheduler logic
  2. Short-term: Upgrade Redis configuration with proper memory limits
  3. Medium-term: Migrate to the JANUS + ARM hybrid architecture
  4. Long-term: Document this as a case study in "cheap ass engineering that actually works"

Conclusion

International Cheap Ass Day reminded us that constraints breed creativity. Our poverty-spec infrastructure forced us to:

  • Understand our bottlenecks deeply
  • Design resilient systems (accidentally)
  • Optimize algorithms instead of throwing hardware at problems
  • Plan sustainable growth without breaking the bank

Sometimes you need to run a Ferrari on bicycle wheels to truly appreciate proper tires! 🏎️

Happy International Cheap Ass Day! May your infrastructure be robust and your servers be cheap! 🎉