Mercury Fine-Tuning Framework: From Chaos to Systematic Optimization
After months of architectural development, Mercury reached a critical milestone: all core components were implemented and operational. We transitioned from building the foundation to the fine-tuning phase—optimizing our AI-powered trading system for maximum profitability. However, this phase revealed significant challenges in our experimental framework that ultimately hindered progress rather than accelerating it.
The Context: What Went Wrong
1. Experimental Framework Chaos
Over the past 1-2 months, we conducted numerous experiments to optimize Mercury's performance. The fundamental problem? No systematic tracking of experimental results. Despite all those experiments, answering the simple question "What were the results?" became impossible. The data exists somewhere in our repository history, but reconstructing what we've already tried is painfully difficult and time-consuming.
This lack of experimental rigor meant we were essentially flying blind—potentially repeating failed experiments or missing crucial insights from successful ones.
2. Multi-Variant Monitoring Complexity
Our initial solution to A/B testing was deploying multiple Mercury variants (A, B, H, R) running different configurations simultaneously. While this provided valuable data, it created a new problem: cognitive overload.
Previously, I could analyze performance by monitoring a single Telegram channel with trading artifacts. Now, with four variants running, each generating its own stream of data, comprehensive analysis became exponentially more complex. The monitoring overhead grew from manageable to overwhelming.
3. Dashboard: Beautiful but Broken
To address the multi-variant monitoring challenge, we developed a comprehensive dashboard to visualize performance across all variants. The dashboard was aesthetically pleasing and feature-rich, but it introduced its own set of problems:
- Data inconsistencies between dashboard views and Telegram artifacts
- Internal dashboard inconsistencies where different views showed conflicting information
- New bugs unrelated to core trading logic that created false signals
Ironically, our solution to complexity created more complexity. We built a pretty dashboard that made our decision-making harder instead of easier.
4. The Human Bus Factor Problem
During this critical fine-tuning period, I became unavailable due to legal battles unrelated to development. This exposed a fundamental vulnerability: Mercury's optimization process had a bus factor of one.
When I returned, I was completely out of context. Meanwhile, the urgent need to migrate to our new Nx monorepo setup (recently completed) further demonstrated how fragile our development process had become with a single human bottleneck.
The Solution: Systematic Experimental Framework
Based on these painful lessons, I'm proposing a complete redesign of our fine-tuning framework that addresses each identified problem:
1. Simplified Variant Architecture
Current State: 4 complex variants (A, B, H, R) with overlapping functionality
Proposed State: 3 focused variants with clear purposes
- W (Warrior): Production trading variant
  - Live trading + shadow trading
  - Production configuration
  - liveRatio = 1 (all shadow positions become live trades)
- R (Researcher): Production validation variant
  - Shadow trading only
  - Production configuration
  - liveRatio = 0 (no live trades, pure validation)
- ABH (Experimental): Unified experimental variant
  - Shadow trading only
  - Experimental configurations
  - liveRatio = 0 (safe experimentation)
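To make the split concrete, here is a minimal sketch of how the three variant configurations might be expressed in code. The `VariantConfig` shape and the `shadowTrading`/`configSource` fields are illustrative assumptions, not Mercury's actual interfaces; only the variant names and liveRatio values come from the list above.

```typescript
// Illustrative sketch only: VariantConfig is an assumed shape, not Mercury's real interface.
export interface VariantConfig {
  name: 'W' | 'R' | 'ABH';
  description: string;
  shadowTrading: boolean; // every variant runs shadow trading
  liveRatio: number;      // fraction of shadow positions promoted to live trades
  configSource: 'production' | 'experimental';
}

export const VARIANTS: VariantConfig[] = [
  { name: 'W',   description: 'Production trading',      shadowTrading: true, liveRatio: 1, configSource: 'production' },
  { name: 'R',   description: 'Production validation',   shadowTrading: true, liveRatio: 0, configSource: 'production' },
  { name: 'ABH', description: 'Unified experimentation', shadowTrading: true, liveRatio: 0, configSource: 'experimental' },
];
```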
2. Live Ratio Control System
The liveRatio property enables gradual scaling from paper trading to live trading:
- Production config: Latest optimized settings based on experimental results
- W variant: Full live trading using proven configurations
- R variant: Validates production config without financial risk
- ABH variant: Tests experimental configurations safely
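As a rough illustration of how a liveRatio gate could behave, the sketch below promotes just enough shadow positions to live trades to track the configured ratio. The `ShadowPosition` type and the counter-based promotion strategy are assumptions; the real system may select positions differently.

```typescript
// Sketch of a liveRatio gate; the counter-based promotion strategy is an assumption.
interface ShadowPosition {
  id: string;
  symbol: string;
  size: number;
}

class LiveRatioGate {
  private seen = 0;
  private promoted = 0;

  constructor(private readonly liveRatio: number) {}

  /** Returns true if this shadow position should also be opened as a live trade. */
  shouldGoLive(_position: ShadowPosition): boolean {
    this.seen += 1;
    // Promote only while doing so keeps promoted <= seen * liveRatio.
    if (this.promoted + 1 <= this.seen * this.liveRatio) {
      this.promoted += 1;
      return true;
    }
    return false;
  }
}

// liveRatio = 1 promotes every shadow position (W), liveRatio = 0 promotes none (R, ABH),
// and intermediate values allow gradual scaling from paper trading to live trading.
const gate = new LiveRatioGate(0.25);
console.log(gate.shouldGoLive({ id: 'p1', symbol: 'BTCUSDT', size: 0.1 })); // false (only every 4th position goes live)
```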
3. Unified Experimental Control
Instead of managing three separate experimental instances (A, B, H), the ABH variant will:
- Run as a single instance with better control mechanisms
- Execute the same experimental scope as the previous three variants
- Use extended tournament configurations stored in application code, not environment variables (see the sketch after this list)
- Include scheduling, comparison methods, and other experimental parameters
- Isolate the entire experimental framework to a few configuration files
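Here is a rough sketch of what an extended tournament configuration kept in application code might look like. Every field name (`schedule`, `comparison`, `experiments`, `overrides`) and the example values are hypothetical placeholders, not Mercury's actual parameters.

```typescript
// Hypothetical shape for the ABH tournament configuration; kept in application code
// so the entire experimental framework lives in a few reviewable files.
export interface TournamentConfig {
  schedule: { cron: string; timezone: string };            // when tournaments run
  comparison: { metric: string; baseline: string };        // how configurations are compared
  experiments: Array<{
    id: string;                                            // stable identifier for tracking results
    description: string;
    overrides: Record<string, number | string | boolean>;  // parameters that differ from production
  }>;
}

export const ABH_TOURNAMENTS: TournamentConfig = {
  schedule: { cron: '0 */6 * * *', timezone: 'UTC' },
  comparison: { metric: 'pnl-per-trade', baseline: 'production' },
  experiments: [
    { id: 'exp-001', description: 'Wider stop-loss', overrides: { stopLossPct: 2.5 } },
    { id: 'exp-002', description: 'Faster signal decay', overrides: { signalHalfLifeMin: 30 } },
  ],
};
```

Because such a configuration is ordinary versioned code, every experiment becomes reviewable in a pull request and traceable in git history, which directly addresses the tracking problem described earlier.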
4. Non-Interfering GPU Load Management
All variants will run tournaments without interfering with each other by:
- Ensuring tournaments don't run in parallel (previously achieved through schedule shifts; see the scheduling sketch after this list)
- Implementing explicit combined configuration management
- Safely consolidating A/B/H functionality into the single ABH instance
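One simple way to guarantee tournaments never overlap on the GPU is to push every run through a single queue instead of relying on shifted schedules. The sketch below assumes all variants share one process; if they run as separate services, a cross-process lock would be needed instead. `runTournament` is a placeholder, not an existing Mercury function.

```typescript
// Sketch: serialize tournament runs through one promise chain so they never
// compete for the GPU. Assumes all variants run in a single process.
class TournamentQueue {
  private tail: Promise<void> = Promise.resolve();

  enqueue(variant: string, run: () => Promise<void>): Promise<void> {
    const next = this.tail.then(async () => {
      console.log(`[tournament] starting ${variant}`);
      await run();
      console.log(`[tournament] finished ${variant}`);
    });
    this.tail = next.catch(() => {}); // keep the chain alive even if a run fails
    return next;
  }
}

// Placeholder for the real tournament runner.
async function runTournament(config: 'production' | 'experimental'): Promise<void> {
  console.log(`running tournament with ${config} configuration`);
}

const queue = new TournamentQueue();
void queue.enqueue('W', () => runTournament('production'));
void queue.enqueue('ABH', () => runTournament('experimental'));
```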
5. Agent-Driven Performance Analysis
Critical shift: Performance analysis moves from my domain to the AI agent's domain.
- Agent responsibility: Conduct thorough performance analysis
- Human responsibility: Review analysis and confirm experimental setup/outcomes
- Benefit: Eliminates human bottleneck in data analysis
6. Pre-MCP Analysis Infrastructure
While we plan to implement the Model Context Protocol (MCP) for comprehensive analysis, we'll start with:
- Extended APIs to provide agents with experimental results
- Wrapper scripts that present data in easily digestible formats for AI analysis (see the sketch after this list)
- Automated reporting that reduces human analysis overhead
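As an interim step before MCP, a wrapper script could pull raw experimental results from an internal endpoint and flatten them into a compact summary an agent can digest in one pass. Everything below is assumed: the `/api/experiments/results` endpoint, the result fields, and the markdown table output are placeholders for whatever Mercury's API actually exposes (Node 18+ is assumed for the global `fetch`).

```typescript
// Hypothetical wrapper: fetch experiment results and emit a markdown summary for
// AI analysis. Endpoint, field names, and Node 18+ global fetch are assumptions.
interface ExperimentResult {
  id: string;
  description: string;
  trades: number;
  winRate: number; // 0..1
  pnl: number;     // combined live + shadow PnL in account currency
}

async function summarizeExperiments(baseUrl: string): Promise<string> {
  const res = await fetch(`${baseUrl}/api/experiments/results`); // assumed endpoint
  const results: ExperimentResult[] = await res.json();

  return [
    '| id | description | trades | win rate | PnL |',
    '| --- | --- | ---: | ---: | ---: |',
    ...results.map(r =>
      `| ${r.id} | ${r.description} | ${r.trades} | ${(r.winRate * 100).toFixed(1)}% | ${r.pnl.toFixed(2)} |`,
    ),
  ].join('\n');
}

// Example: print a report that the agent (or a human reviewer) can read directly.
summarizeExperiments('http://localhost:3000').then(report => console.log(report));
```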
7. Methodology Documentation and Automation
The entire framework will be wrapped in:
- Comprehensive methodology documentation
- Cursor rules for systematic experimentation
- Automated workflows that make optimization a matter of invoking rules rather than manual analysis
For me, this transforms the process from complex manual analysis to simply invoking methodologies and reviewing AI-generated insights.
8. Dashboard as Backup Tool
The dashboard will be repositioned as:
- Secondary analysis tool for specific system components
- Backup option when deeper investigation is needed
- Lower priority until the primary AI-driven flow is established
Expected Outcomes
This redesigned framework addresses our core problems:
- Systematic Experimentation: Clear tracking and analysis of all experiments
- Reduced Cognitive Load: Single experimental variant instead of three
- Reliable Data: Focus on proven data sources with AI validation
- Eliminated Bus Factor: AI-driven analysis removes human bottleneck
- Methodical Optimization: Rules-based approach to systematic improvement
Next Steps
Implementation will proceed incrementally:
- Design extended tournament configuration structure
- Implement unified ABH experimental framework
- Create performance analysis APIs and scripts
- Develop methodology documentation and Cursor rules
- Migrate current variants to new architecture
This framework transforms Mercury's optimization from an ad-hoc experimental process into a systematic, AI-augmented fine-tuning machine. The goal: turning Mercury into a consistently profitable trading system through rigorous, automated optimization.
The trading system that learns and improves itself, with humans providing strategic oversight rather than manual analysis.
