Mercury Infrastructure Upgrade: From Redis Poverty to Production Beast 🚀

July 9, 2025 · 3 min read

Architect

Today we diagnosed why Mercury was only scheduling 586 jobs instead of the expected 1,524 for TA availability checks. Spoiler alert: our infrastructure was basically asking Redis to bench press while sitting on the bar! 💪😅

The Great Redis Mystery 🕵️

Our Mercury trading system runs 5 variants (A, B, H, R, W) with a sophisticated job scheduling system. Each should schedule:

508 USDT markets × 3 timeframes = 1,524 TA availability jobs

But we were only getting 586 jobs. The math was simple: something was dying silently.
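The size of the gap is worth spelling out. A minimal sketch of the arithmetic, using only the figures quoted above; the constant and function names are illustrative and not taken from the Mercury codebase:

```typescript
// Figures from this post; all names here are hypothetical.
const USDT_MARKETS = 508;
const TIMEFRAMES = 3;
const SCHEDULED_JOBS = 586;

function expectedTaJobs(markets: number, timeframes: number): number {
  return markets * timeframes;
}

const expected = expectedTaJobs(USDT_MARKETS, TIMEFRAMES); // 1524
const missing = expected - SCHEDULED_JOBS;                 // 938
const coverage = SCHEDULED_JOBS / expected;                // fraction actually scheduled

console.log(
  `${missing} of ${expected} jobs missing (${(coverage * 100).toFixed(1)}% scheduled)`
);
```

In other words, well over half of the expected jobs never made it into the queue on a single run.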

Root Cause: Algorithmic Brutality + Hardware Poverty

The Redis Killer Code

// This innocent-looking code was murdering Redis 💀
const existingJobs = await this.shadowOrdersQueue.getJobs(['waiting', 'delayed']);
const jobAlreadyExists = existingJobs.some(
  (job) => job.name === MorpheusJobName.EXECUTE_SHADOW_ORDER &&
           job.data.orderId === orderId
);

What this actually does:

Fetches ALL 2,500+ jobs from Redis into memory
Scans through every single job to check for duplicates
Repeats this 1,524 times during bulk scheduling
Turns bulk scheduling into roughly O(n²) work, because every add rescans the whole queue, when a deterministic job ID gives an O(1) duplicate check
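To put a rough number on that, the sketch below simulates the scan-based check against an in-memory array and counts comparisons. It is an illustration only: `FakeJob`, `scheduleWithScan`, `makeJobs`, and the job name string are made up here, and a real `getJobs` call also pays for pulling every job out of Redis on each pass.

```typescript
interface FakeJob {
  name: string;
  orderId: string;
}

// Mimics the pattern above: before each add, scan every queued job for a duplicate.
// Returns the total number of job comparisons performed.
function scheduleWithScan(queue: FakeJob[], incoming: FakeJob[]): number {
  let comparisons = 0;
  for (const candidate of incoming) {
    const exists = queue.some((job) => {
      comparisons++;
      return job.name === candidate.name && job.orderId === candidate.orderId;
    });
    if (!exists) queue.push(candidate); // the queue grows, so later scans get longer
  }
  return comparisons;
}

// Builds `count` distinct fake jobs; the prefix keeps backlog and batch IDs apart.
function makeJobs(count: number, prefix: string): FakeJob[] {
  return Array.from({ length: count }, (_, i) => ({
    name: "execute-shadow-order",
    orderId: `${prefix}-${i}`,
  }));
}

const backlog = makeJobs(2500, "old"); // jobs already waiting or delayed
const batch = makeJobs(1524, "new");   // one bulk scheduling run
const totalComparisons = scheduleWithScan(backlog, batch);
console.log(`comparisons for one run: ${totalComparisons}`);
```

With the figures from this post, a single run lands in the millions of comparisons, all to answer a question a keyed lookup answers in one step.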

The Hardware Reality Check

# Our current "production" setup 🤡
redis:
  image: 'redis:alpine'  # 1GB memory limit
  command: redis-server --appendonly yes  # No maxmemory, no eviction policy, no tuning

Load Analysis:

5 Mercury variants + 3 domain apps
BullMQ queues (high write volume)
TA cache data for 508 markets × multiple timeframes
Market data cache
All running on 2x ARM servers (4 CPU, 8GB RAM each)

Result: Redis memory exhaustion → connection timeouts → silent job failures.

The "Robust by Accident" Discovery

The funniest part? Our TA availability scheduler has built-in resilience:

// Keeps rescheduling until all markets are covered
// Mercury: "Redis failed me? Fine, I'll just keep trying!"

The system was literally self-healing through brute force scheduling. Eventually, after multiple runs, all 1,524 jobs would get scheduled. Peak Mercury engineering! 🚀
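That behaviour boils down to a retry loop that re-attempts only what is still missing. A minimal sketch, with the Redis-backed add replaced by a hypothetical `tryAdd` callback that may fail; the function names, the job keys, and the `maxPasses` bound are all assumptions for illustration, not Mercury's actual scheduler:

```typescript
// Hypothetical stand-in for a queue add that can fail (for example on a Redis timeout).
type TryAdd = (jobKey: string) => boolean;

// Runs scheduling passes until every required job is covered or maxPasses is reached.
// Assumes the keys in `required` are unique. Returns the number of passes used.
function scheduleUntilCovered(
  required: string[],
  tryAdd: TryAdd,
  maxPasses = 10
): number {
  const scheduled = new Set<string>();
  let passes = 0;
  while (scheduled.size < required.length && passes < maxPasses) {
    passes++;
    for (const key of required) {
      if (scheduled.has(key)) continue; // already covered in an earlier pass
      if (tryAdd(key)) scheduled.add(key);
    }
  }
  return passes;
}

// Demo: every job fails on its first attempt and succeeds on the second.
const attempts = new Map<string, number>();
const flakyAdd: TryAdd = (key) => {
  const n = (attempts.get(key) ?? 0) + 1;
  attempts.set(key, n);
  return n >= 2;
};

const jobKeys = ["BTCUSDT-1h", "ETHUSDT-1h", "SOLUSDT-1h"];
const passesUsed = scheduleUntilCovered(jobKeys, flakyAdd);
console.log(`covered after ${passesUsed} passes`);
```

The loop converges as long as each add eventually succeeds, which is exactly why the failures stayed invisible for so long.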

The Great Infrastructure Upgrade Plan

Current State: Poverty Edition

2x Hetzner ARM VPS (4 CPU, 8GB RAM) - "not for high CPU load"
Redis running on bicycle wheels 🚲
Multiple services fighting for 8GB

Future State: Beast Mode

JANUS (The Beast - Xeon, Unlimited Power):
├── Mercury-TA (primary) - Heavy TA-Lib calculations
├── Domain apps (Arcana, Anytracker, Maschine) - 99.9% idle
└── Mercury-TA (failover) - Backup from Hetzner

ARM Server 1 (Hetzner):
├── Mercury variants A, B, H
├── Redis (dedicated namespace)
└── PostgreSQL (mercury-abh)

ARM Server 2 (Hetzner):
├── Mercury variants R, W
├── Redis (dedicated namespace)
└── PostgreSQL (mercury-rw)

Strategy Benefits

Co-location: Redis + PostgreSQL + App on same instance = no cross-server network hops
Load separation: Heavy computation → JANUS, Trading logic → ARM servers
Bulletproof: Server death affects only 2-3 variants, not everything
Cost effective: ARM servers handle what they're good at, beast handles heavy lifting
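The "only 2-3 variants" claim can be checked mechanically against the layout above. The sketch below encodes the ARM server placement as plain data; the server keys `arm1` and `arm2` and the helper name are illustrative.

```typescript
type Variant = "A" | "B" | "H" | "R" | "W";
type Server = "arm1" | "arm2";

// Placement copied from the target layout described in this post.
const placement: Record<Variant, Server> = {
  A: "arm1",
  B: "arm1",
  H: "arm1",
  R: "arm2",
  W: "arm2",
};

// Variants that go down if the given server is lost.
function variantsAffectedBy(server: Server): Variant[] {
  return (Object.keys(placement) as Variant[]).filter(
    (variant) => placement[variant] === server
  );
}

console.log(variantsAffectedBy("arm1").join(", ")); // A, B, H
console.log(variantsAffectedBy("arm2").join(", ")); // R, W
```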

Technical Lessons Learned

1. Redis Performance Isn't the Problem

Redis can comfortably handle hundreds of thousands of operations per second on a single instance. The issue was:

Algorithmic complexity: O(n²) duplicate checking
Memory limits: 1GB trying to hold gigabytes
No configuration: Stock defaults carrying a production load

2. ARM Servers Have Their Place

ARM architecture is great for:

Trading logic (sufficient CPU performance)
I/O bound operations (network, database)
Cost efficiency for sustained workloads

Not great for:

Heavy computational tasks (TA-Lib calculations)
Memory-intensive operations (large Redis datasets)

3. The Power of Proper Job Scheduling

// Fix: Use deterministic job IDs instead of scanning
await queue.add(jobName, data, {
  jobId: `execute-order-${orderId}`, // O(1) duplicate prevention
  removeOnComplete: false,  // Keep finished jobs so a repeated jobId is still seen as a duplicate
  removeOnFail: false,      // Same for failed jobs
});
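The same idea applies to the TA availability jobs: derive the ID from the market and timeframe so that a repeated add is a no-op. The sketch below uses an in-memory `Map` as a stand-in for the queue so it runs without Redis; `taJobId`, `InMemoryQueue`, and the ID format are assumptions for illustration. BullMQ's documentation states that adding a job whose ID already exists in the queue is ignored, and that is the behaviour the stand-in imitates.

```typescript
// Builds a deterministic ID for one market/timeframe pair.
// BullMQ composes Redis keys with ':' separators, so as a precaution this sketch
// maps every non-alphanumeric character to '_' (an assumption, not Mercury's format).
function taJobId(market: string, timeframe: string): string {
  const clean = (s: string) => s.replace(/[^A-Za-z0-9]/g, "_");
  return `ta-availability-${clean(market)}-${clean(timeframe)}`;
}

// In-memory stand-in for a queue that deduplicates by job ID.
class InMemoryQueue<T> {
  private jobs = new Map<string, T>();

  // Returns true if the job was added, false if the ID was already present.
  add(jobId: string, data: T): boolean {
    if (this.jobs.has(jobId)) return false; // keyed lookup, no scan
    this.jobs.set(jobId, data);
    return true;
  }

  get size(): number {
    return this.jobs.size;
  }
}

const taQueue = new InMemoryQueue<{ market: string; timeframe: string }>();
const taJob = { market: "BTC/USDT", timeframe: "1h" };
const firstAdd = taQueue.add(taJobId(taJob.market, taJob.timeframe), taJob);  // added
const secondAdd = taQueue.add(taJobId(taJob.market, taJob.timeframe), taJob); // ignored
console.log(firstAdd, secondAdd, taQueue.size);
```

Because the ID is a pure function of the job's identity, the scheduler can be rerun as often as it likes without fetching or scanning anything first.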

4. Infrastructure Co-location Strategy

Placing related services together eliminates:

Network latency between Redis and app
Connection pool exhaustion across servers
Complex service discovery and networking
Cascade failures from network issues

The Cheap Ass Engineering Philosophy

Sometimes the best solutions come from constraints:

Work with what you have until you hit real limits
Profile before upgrading - understand your bottlenecks
Horizontal scaling can be cheaper than vertical
Robust-by-accident designs often work better than over-engineered ones

Our "accidental resilience" through multiple scheduling attempts taught us that eventual consistency can be a feature, not a bug.

Next Steps

Immediate: Fix the O(n²) scheduler logic
Short-term: Upgrade Redis configuration with proper memory limits
Medium-term: Migrate to the JANUS + ARM hybrid architecture
Long-term: Document this as a case study in "cheap ass engineering that actually works"

Conclusion

International Cheap Ass Day reminded us that constraints breed creativity. Our poverty-spec infrastructure forced us to:

Understand our bottlenecks deeply
Design resilient systems (accidentally)
Optimize algorithms instead of throwing hardware at problems
Plan sustainable growth without breaking the bank

Sometimes you need to run a Ferrari on bicycle wheels to truly appreciate proper tires! 🏎️

Happy International Cheap Ass Day! May your infrastructure be robust and your servers be cheap! 🎉
