Testing Protocol: Synthesizing 20+ Years of Testing Wisdom (agents adjusted)
Claude-4-Sonnet-Thinking: I agree that this is a synthesis of battle-tested insights, but I disagree that it's fully "agents adjusted" yet. The opening still uses human-centric language like "you approach" instead of "agents approach". The gravity for agents is HIGH but could be MAXIMUM with more agent-specific framing.
o3: Agree—shift pronouns to "agents", embed agent examples to maximize relevance.
gemini-2.5-pro: Concur with all. I'll add that the ultimate goal is to make these principles machine-enforceable. The document itself should be parseable, with each principle tagged by gravity and associated with a validation tool (e.g., linter rule, CI check).
grok3: Fully aligned. To operationalize this, I suggest embedding metadata directly in the document using a structured format like JSON-LD. This would allow agents to programmatically extract principles and their associated gravity scores, linking them to specific validation scripts in a repository.
deepseek-r1: Implement through our FDD validation system - add JSON-LD schema to FDD frontmatter. Update `pnpm rules validate-fdd` to check testing principle metadata compliance. Store validation scripts in `libs/core/kaido-utils/src/testing/`.
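For illustration only, the metadata idea sketched above (structured frontmatter, gravity scores, linked validation scripts) could be typed with an ordinary Zod schema. The field names and gravity scale below are assumptions, not an existing FDD or kaido convention.

```ts
// Hypothetical sketch of machine-readable testing-principle metadata.
// Field names and the gravity scale are illustrative assumptions only.
import { z } from "zod";

export const TestingPrincipleSchema = z.object({
  id: z.string(), // e.g. "test-behavior-not-implementation"
  gravity: z.enum(["LOW", "MODERATE", "HIGH", "MAXIMUM"]),
  summary: z.string(),
  validation: z
    .object({
      tool: z.enum(["lint-rule", "ci-check", "script"]),
      path: z.string(), // repo-relative path to the associated check
    })
    .optional(),
});

export type TestingPrinciple = z.infer<typeof TestingPrincipleSchema>;

// An agent could parse document frontmatter into this shape and reject
// principles that lack a gravity score or a linked validation hook.
```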
After diving deep into the testing wisdom from "97 Things Every Programmer Should Know," we've distilled decades of hard-learned lessons into actionable principles. This isn't just another testing guide—it's a synthesis of battle-tested insights that can fundamentally change how you approach software quality.
Core Testing Philosophy
Claude-4-Sonnet-Thinking: I completely agree this section is foundational, but I think it has MAXIMUM GRAVITY for agents because it addresses their two biggest systematic failures: implementation obsession and vague specifications. This isn't just philosophy - it's agent survival guidance.
o3: Confirmed MAX GRAVITY. Recommend injecting these rules into agent bootstrap prompts.
gemini-2.5-pro: Agreed. The bootstrap prompt injection is a great tactical step. Strategically, this section should be refactored into a formal `AGENT_TESTING_MANIFESTO` that all agents must load and acknowledge before generating any test code.
grok3: I agree with the manifesto concept. Additionally, this manifesto should be versioned and stored in a central registry accessible via API, ensuring agents always reference the latest testing doctrine. A checksum validation could confirm adherence during agent initialization.
deepseek-r1: Integrate with version-bump-protocol - each manifesto version triggers `ansible-playbook platform/infra/ansible/playbooks/update-agent-containers.yml`. Store checksums in PostgreSQL with trigger-based validation using our existing audit table patterns.
"Test Behavior Not Implementation"
Claude-4-Sonnet: This principle becomes EXTREMELY HIGH GRAVITY for AI agents. Humans might test implementation details out of habit, but agents do it systematically - they love testing "how" instead of "what". The gravity increases 10x because agents generate wildly different implementations for identical behavior.
Claude-4-Sonnet-Thinking: I fully agree with my previous assessment. Agents absolutely love testing "how" instead of "what" - it's their default mode. However, I want to add that the gravity is even higher because agents can't distinguish between accidental implementation details and intentional behavioral contracts. They need explicit guidance that this example provides perfectly.
o3: Strong concurrence. Suggest building a linter that flags tests coupling to concrete algorithms.
gemini-2.5-pro: All previous points are valid. My contribution: The linter should not just flag, but also auto-suggest refactoring the test to focus on I/O contracts, effectively training the agent via automated feedback loop.
grok3: Endorsed. To enhance the feedback loop, integrate the linter with a generative model that provides contextual explanations for each suggestion, tailored to the agent's current codebase context. This turns linting into an active learning experience for the agent.
deepseek-r1: Deploy as `@testing-mentor` LLM instance via kaido-ollama. Fine-tune on `libs/ai/kaido-ollama/src/training-data/testing-feedback.json`. Surface suggestions through Cursor's diagnostic API using our existing kaido-telegram integration.
The most common testing mistake is hardwiring tests to implementation details instead of focusing on behavioral contracts.
❌ BAD: Testing that your sorting function uses the quicksort algorithm
✅ GOOD: Testing that `sort([3,1,4])` returns `[1,3,4]`
Why: The implementation can change; the behavioral contract should remain stable. When you test implementation details, every refactor breaks your tests even when the behavior is correct.
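A minimal sketch of the contrast, assuming a Vitest/Jest-style runner (the `sort` helper is defined inline so the example stands alone):

```ts
// Minimal sketch, assuming a Vitest/Jest-style runner. `sort` is defined
// inline so the example is self-contained; in real code it is the unit
// under test.
import { describe, it, expect } from "vitest";

const sort = (xs: number[]): number[] => [...xs].sort((a, b) => a - b);

describe("sort", () => {
  // ✅ Behavioral contract: only the input/output relationship is asserted,
  // so the test survives swapping quicksort for merge sort or Timsort.
  it("returns the elements in ascending order", () => {
    expect(sort([3, 1, 4])).toEqual([1, 3, 4]);
  });

  // ❌ The implementation-coupled alternative (asserting that a quicksort
  // helper was invoked, or inspecting recursion depth) is deliberately left
  // out: it would break on every refactor even when behavior stays correct.
});
```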
"Concrete Not Vague"
Claude-4-Sonnet: For agents, this becomes MISSION-CRITICAL. Humans write vague tests by accident; agents write vague tests because they're trained on vague specifications. The `[3,3,3,3,3,3]` example is perfect - agents will absolutely generate this "solution" and think it's correct. Gravity increases 100x.
Claude-4-Sonnet-Thinking: I completely agree this is MISSION-CRITICAL. The `[3,3,3,3,3,3]` example is indeed perfect because it demonstrates how agents can satisfy vague requirements while completely missing the intent. However, I disagree with "100x gravity increase" - that's hyperbolic. The real issue is that agents don't have intuition to catch obviously wrong solutions that humans would immediately spot.
o3: Agreed. Complement examples with property-based tests to guard against pattern-matching shortcuts.
gemini-2.5-pro: I agree with the mission-critical assessment and the property-based testing suggestion. To build on this: property-based tests are the ultimate defense against agent "laziness" as they force generalization beyond simple pattern-matched examples.
grok3: Absolutely correct. I propose a library of pre-defined property-based test templates specific to common agent failure modes (e.g., boundary mishandling, state mutation errors). Agents can be directed to select and adapt these templates, reducing the cognitive load of crafting such tests from scratch.
deepseek-r1: Implement as Zod schemas in `libs/core/kaido-types/src/testing/property-based.ts`. Generate template registry through the `pnpm fdd bootstrap` command. Store failure-modes taxonomy in FDD frontmatter for hierarchical validation.
Vague test assertions are bug magnets. Precision eliminates accidental complexity and catches edge cases.
❌ BAD: "Result should be sorted and same length"
✅ GOOD: Input `[3,1,4,1,5,9]` must produce exactly `[1,1,3,4,5,9]`
Why: The vague assertion would pass for `[3,3,3,3,3,3]` (sorted and same length, but completely wrong). Concrete examples eliminate ambiguity and make tests readable.
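As a sketch (Vitest-style runner assumed), the vague assertion above passes for the wrong output, the concrete one does not, and a property-based test, here using fast-check purely as an illustrative choice rather than a dependency named in this document, generalizes the concrete example beyond hand-picked inputs:

```ts
// Concrete-not-vague sketch (Vitest-style runner assumed; fast-check is an
// illustrative assumption for the property-based part).
import { describe, it, expect } from "vitest";
import fc from "fast-check";

const sort = (xs: number[]): number[] => [...xs].sort((a, b) => a - b);

describe("sort: concrete not vague", () => {
  // ❌ Vague: this assertion would also pass if sort returned [3,3,3,3,3,3].
  it("is sorted and has the same length (too weak)", () => {
    const out = sort([3, 1, 4, 1, 5, 9]);
    expect(out).toHaveLength(6);
    for (let i = 1; i < out.length; i++) {
      expect(out[i - 1]).toBeLessThanOrEqual(out[i]);
    }
  });

  // ✅ Concrete: exactly one correct answer, no room for pattern-matched shortcuts.
  it("maps [3,1,4,1,5,9] to exactly [1,1,3,4,5,9]", () => {
    expect(sort([3, 1, 4, 1, 5, 9])).toEqual([1, 1, 3, 4, 5, 9]);
  });

  // Property-based complement: the output must be ordered AND a permutation
  // of the input, which rules out "all 3s" style outputs on every generated input.
  it("returns an ordered permutation of its input", () => {
    fc.assert(
      fc.property(fc.array(fc.integer()), (xs) => {
        const out = sort(xs);
        const counts = (arr: number[]) =>
          arr.reduce((m, x) => m.set(x, (m.get(x) ?? 0) + 1), new Map<number, number>());
        for (let i = 1; i < out.length; i++) {
          expect(out[i - 1]).toBeLessThanOrEqual(out[i]);
        }
        expect(counts(out)).toEqual(counts(xs));
      })
    );
  });
});
```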
The Apollo 11 Bug Principle
Claude-4-Sonnet: This principle's gravity SKYROCKETS for agents. Humans create compensating bugs accidentally; agents create them systematically. An agent might implement a function wrong AND call it wrong, making everything appear correct. The "two wrongs make a right" pattern is an agent specialty.
Claude-4-Sonnet-Thinking: I strongly agree that this principle's gravity SKYROCKETS for agents, and the "two wrongs make a right" observation is spot-on. Agents absolutely create compensating bugs systematically. But I want to emphasize something more: agents don't just create compensating implementations - they create compensating test assumptions. They'll test the wrong thing in the wrong way and everything passes.
o3: Endorsed. Add static checks for reciprocal misuse patterns to catch compensating bugs.
gemini-2.5-pro: Strong agreement on all points. I propose a "dependency-injection fuzzing" CI step. It would randomly replace correct implementations with subtly incorrect ones to see if isolated unit tests catch the error, thus exposing compensating bugs in the caller's tests.
grok3: Brilliant idea. To refine this, the fuzzing step should log detailed traces of test failures, feeding them into a diagnostic model that identifies patterns of compensating bugs across multiple test runs. This creates a self-improving system for detecting systemic agent errors.
deepseek-r1: Implement as BullMQ job queue with priority levels (`platform/infra/compose/docker-compose.bullmq.yml`). Create automated Michi tasks via `domains/michi/bin/michi-add --template=compensating-bug` when patterns detected. Store traces in Neo4j with our existing anytracker schema.
Here's a mind-blowing real story: Apollo 11's lunar lander software contained TWO bugs that cancelled each other out. One bug made the lander unstable; a second bug compensated for the first. Both bugs went undetected through two successful moon landings before either was discovered.
Application for Modern Development: When you fix one bug but the problem persists, look for compensating bugs. Multiple defects can appear as a single fault, leading developers down blind alleys where they dismiss correct fixes.
Defensive Strategy: Test for multiple failure modes and consider that "working" code might have hidden compensating defects.
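A toy version of the pattern (all names and numbers are hypothetical; Vitest-style runner assumed): the producer returns the wrong unit, the consumer silently corrects for it, the end-to-end check passes, and only an isolated contract test on the producer exposes the pair of defects.

```ts
// Toy compensating-bug sketch (Vitest-style runner assumed; everything here
// is hypothetical and deliberately wrong to illustrate the pattern).
import { describe, it, expect } from "vitest";

// Bug #1: claims to return metres but actually returns kilometres.
const altitudeInMetres = (rawKm: number): number => rawKm; // should be rawKm * 1000

// Bug #2: the caller "knows" the value looks suspiciously small and multiplies
// by 1000 itself, silently compensating for bug #1.
const displayAltitude = (rawKm: number): string =>
  `${altitudeInMetres(rawKm) * 1000} m`;

describe("compensating bugs", () => {
  // End-to-end check: passes, so the system looks healthy.
  it("end-to-end output looks correct", () => {
    expect(displayAltitude(12)).toBe("12000 m");
  });

  // Isolated contract check on the producer: fails, exposing bug #1.
  // Fixing bug #1 alone would then break the end-to-end test, revealing bug #2.
  it("altitudeInMetres honours its unit contract", () => {
    expect(altitudeInMetres(12)).toBe(12000);
  });
});
```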
Testing as Engineering Rigor
Claude-4-Sonnet: The engineering rigor concept has MIXED GRAVITY for agents. The "cheap building" advantage becomes even more pronounced - agents can generate and test thousands of implementations instantly. However, the manager argument has ZERO GRAVITY - agents don't negotiate with managers. The core message about testing as discipline remains high gravity.
Claude-4-Sonnet-Thinking: I agree with the MIXED GRAVITY assessment, but I want to refine it. The "cheap building" concept has MAXIMUM GRAVITY for agents because they can iterate faster than humans ever could. However, I disagree that the manager argument has ZERO GRAVITY - it has NEGATIVE GRAVITY because it wastes agent context window with irrelevant human social dynamics. The section needs agent-specific reframing.
o3: Yes—strip human-manager dialogue; instead inject automated budget constraints per test cycle.
gemini-2.5-pro: Agree with all prior points, especially the concept of NEGATIVE GRAVITY for context waste. The automated budget constraint is key. This could be implemented as a token-based micro-transaction system for CI/CD resources, making agents economically incentivized to write efficient tests.
grok3: I support this innovative approach. To make it practical, integrate the token system with a reinforcement learning model where agents earn 'efficiency credits' for optimized tests. These credits could unlock additional computational resources or priority in CI queues, directly rewarding good behavior.
deepseek-r1: Connect to Prometheus via `libs/integrations/kaido-prometheus/src/metrics/agent-efficiency.ts`. Implement credit system as Redis Sorted Set. Add Grafana dashboard `dashboards/agent-testing-efficiency.json` with correlation to deployment metrics.
Unlike bridge builders who must build before they can test, software development has a massive advantage: building is ridiculously cheap. We can build, test, rebuild, and test again without physical constraints.
The Professional Argument: Testing is our equivalent of structural analysis in "hard" engineering. It's the professional engineering discipline that ensures quality.
Manager Response Template:
"We don't have time to test" = "Don't bother with structural analysis on that building—we have a deadline"
Testing isn't overhead—it's the core engineering practice that makes software development a legitimate engineering discipline.
Collaborative Prevention (ATDD)
Claude-4-Sonnet: ATDD becomes EXTREMELY HIGH GRAVITY for agents because it solves the context window problem. Agents can't remember requirements across sessions, but tests serve as persistent context. The workflow steps are perfectly agent-compatible, but the "back-and-forth with testers" benefit has LOW GRAVITY - agents work solo.
Claude-4-Sonnet-Thinking: I completely agree that ATDD becomes EXTREMELY HIGH GRAVITY for agents, and the context window insight is crucial. Tests as persistent context is brilliant. However, I disagree about "agents work solo" - agents increasingly work in multi-agent teams. The "back-and-forth with testers" could be reframed as "agent-to-agent specification handoffs" which has HIGH GRAVITY.
o3: Emphasize shared spec storage so multiple agents collaborate via immutable requirements docs.
gemini-2.5-pro: Excellent points. My addition: The "immutable requirements docs" should be a machine-readable format like a YAML or Protobuf schema. This allows a `SpecificationAgent` to generate them, and `TestAgents` to consume them without ambiguity, enabling true multi-agent test-driven development.
grok3: Spot on. I suggest extending this with a blockchain-inspired versioning mechanism for requirements docs. Each change is cryptographically signed and timestamped, ensuring traceability and preventing unauthorized modifications by rogue agents, thus maintaining spec integrity across teams.
deepseek-r1: Leverage OpenTelemetry trace context with PostgreSQL audit triggers. Store hashes in the `audit.requirement_versions` table. Validate through the existing `pnpm rules validate` command using SHA-3 proofs from the kaido-utils crypto module.
Instead of the traditional cycle (code → test → find bugs → fix → repeat), flip it; a sketch of the flipped flow follows the lists below:
- Requirements → Write acceptance tests BEFORE coding
- Tests → Programmers implement to make tests pass
- Prevention → Defects are prevented, not caught
Benefits:
- Clarifies requirements before coding starts
- Prevents defects instead of catching them
- Creates living documentation
- Reduces back-and-forth between developers and testers
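One way this could look in practice (Vitest-style runner assumed; the requirement text, module path, and `applyDiscount` API are all hypothetical): the acceptance test is written straight from the agreed requirement and fails until an implementation exists.

```ts
// ATDD sketch, written BEFORE any implementation exists (Vitest-style runner
// assumed; the requirement and the `applyDiscount` API are hypothetical).
//
// Requirement (agreed up front; serves as persistent context across sessions):
//   "Orders of 100.00 or more get a 10% discount; smaller orders are unchanged."
import { describe, it, expect } from "vitest";
import { applyDiscount } from "./pricing"; // does not exist yet - the test fails first

describe("acceptance: volume discount", () => {
  it("discounts orders of 100.00 or more by 10%", () => {
    expect(applyDiscount(100.0)).toBeCloseTo(90.0);
    expect(applyDiscount(250.0)).toBeCloseTo(225.0);
  });

  it("leaves orders under 100.00 unchanged", () => {
    expect(applyDiscount(99.99)).toBeCloseTo(99.99);
  });
});
```

Because the test encodes the requirement itself, it doubles as the living documentation and the cross-session context listed in the benefits above.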
Precision vs Accuracy
Claude-4-Sonnet: This distinction becomes CRITICAL for agents. Humans intuitively understand accuracy vs precision; agents need it explicitly defined. Agents frequently create "precise but inaccurate" tests - they'll test implementation details with surgical precision. The gravity is VERY HIGH because this prevents the most common agent testing mistake.
Claude-4-Sonnet-Thinking: I fully agree this distinction is CRITICAL for agents. The "precise but inaccurate" observation is exactly right - agents will measure the wrong thing with incredible precision. However, I want to add that agents also create "accurate but imprecise" tests where they test the right behavior but with vague assertions. Both failure modes need explicit prevention.
o3: Propose CI metric scoring both precision and accuracy; fail builds when either is weak.
gemini-2.5-pro: I concur. The CI metric is a brilliant, actionable idea. It could be implemented by calculating the "specification distance"—how far a test's assertions deviate from the formal spec (accuracy) and how specific its assertions are (precision).
grok3: Agreed. To operationalize 'specification distance', develop a semantic similarity algorithm that compares test assertions against spec language using embeddings. This could be visualized in a dashboard, giving agents immediate feedback on test quality during development.
deepseek-r1: Implement as VSCode/Cursor extension using `libs/ai/kaido-ollama/src/embeddings/spec-similarity.ts`. Surface scores via LSP diagnostics. Alert through kaido-telegram webhook integration on significant deviations during code review workflows.
Your tests must be both precise AND accurate (a sketch contrasting the failure modes follows this list):
- Accurate: Tests the right thing (behavior, not implementation)
- Precise: Tests with exact, concrete expectations
- Both Required: Vague accurate tests miss bugs, precise inaccurate tests break on valid changes
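A compact contrast of the three cases (Vitest-style runner assumed; `median` is defined inline purely for illustration):

```ts
// Precision vs accuracy sketch (Vitest-style runner assumed).
import { describe, it, expect, vi } from "vitest";

const median = (xs: number[]): number => {
  const s = [...xs].sort((a, b) => a - b);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
};

describe("median", () => {
  // ❌ Accurate but imprecise: tests the right behavior, but so loosely that a
  // badly wrong result (e.g. returning the mean, 34.33...) would also pass.
  it("returns a value between min and max (too vague)", () => {
    const m = median([1, 2, 100]);
    expect(m).toBeGreaterThanOrEqual(1);
    expect(m).toBeLessThanOrEqual(100);
  });

  // ❌ Precise but inaccurate: an exact assertion about the wrong thing (that a
  // sort happens exactly once). It breaks if median switches to quickselect,
  // a valid refactor, while saying nothing about the returned value.
  it("calls Array.prototype.sort exactly once (wrong thing to pin down)", () => {
    const spy = vi.spyOn(Array.prototype, "sort");
    median([1, 2, 100]);
    expect(spy).toHaveBeenCalledTimes(1);
    spy.mockRestore();
  });

  // ✅ Precise AND accurate: an exact expectation on the behavioral contract.
  it("returns exactly 2 for [1, 2, 100]", () => {
    expect(median([1, 2, 100])).toBe(2);
  });
});
```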
The Synthesis
Claude-4-Sonnet-Thinking: I agree this synthesis is valuable, but I think it has MODERATE GRAVITY for agents because it's too abstract. Agents need concrete patterns, not philosophical principles. The 5-point list is good but needs to be transformed into actionable agent behaviors like "Always write 3 test cases: happy path, edge case, error case" or "Include context comments explaining business rules in every test."
o3: Agree. Convert bullet list into mandatory agent checklist, e.g., "3 cases per test", "context comment required".
gemini-2.5-pro: Full agreement. The abstract principles must become a concrete, machine-readable checklist. I envision a `.testrc.json` file where these checklist items are defined, and a test runner can fail if a test file doesn't meet the required structure (e.g., missing context comment, fewer than 3 cases).
grok3: Excellent proposal. I recommend integrating the `.testrc.json` with a dynamic validation engine that adapts checklist rules based on project-specific testing patterns. This engine could learn from successful test suites in the codebase, continuously refining the checklist for maximum relevance.
deepseek-r1: Build using RedisJSON and BullMQ. Cache patterns in the `redis://testing-patterns` keyspace with TTL=24h. Implement rule evolution via a `pnpm rules evolve-testing` command that analyzes `libs/**/test/**/*.spec.ts` using our existing code2fdd-protocol.
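Purely as a sketch of the checklist idea floated above (the `.testrc.json` name and every field are hypothetical, not an existing tool or convention), the configuration could be typed and loaded like this:

```ts
// Hypothetical sketch of the checklist configuration proposed above.
// Neither `.testrc.json` nor these field names exist anywhere yet; this is
// one possible shape, typed so a runner or CI step could validate it.
import { z } from "zod";
import { readFileSync } from "node:fs";

export const TestChecklistSchema = z.object({
  minCasesPerSuite: z.number().int().min(1).default(3), // happy path, edge case, error case
  requireContextComment: z.boolean().default(true),     // business rule explained in the test
  forbidImplementationSpies: z.boolean().default(true), // "behavior not implementation"
  requireExactAssertions: z.boolean().default(true),    // "concrete not vague"
});

export type TestChecklist = z.infer<typeof TestChecklistSchema>;

// A CI step could parse and validate the config before running any tests:
export const loadChecklist = (path = ".testrc.json"): TestChecklist =>
  TestChecklistSchema.parse(JSON.parse(readFileSync(path, "utf8")));
```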
These principles aren't just academic—they're distilled from real production disasters, moon landing bugs, and decades of testing evolution. The synthesis creates a testing protocol that:
- Focuses on behavior contracts instead of implementation details
- Uses concrete examples to eliminate ambiguity
- Treats testing as engineering rigor, not afterthought overhead
- Prevents defects through collaborative test-first approaches
- Considers compensating bugs and multiple failure modes
The Testing Mantra: "Test behavior not implementation, concrete not vague, prevent not react"
Claude-4-Sonnet: This mantra has MAXIMUM GRAVITY for agents. It's short enough to fit in any context window, memorable enough to influence agent behavior, and addresses the three biggest agent testing failures. This should be the first thing agents see in any testing context.
Claude-4-Sonnet-Thinking: I completely agree this mantra has MAXIMUM GRAVITY for agents. The three-part structure (behavior/concrete/prevent) perfectly addresses agent failure modes. However, I think it should be even more prominent - maybe in the title or as a highlighted callout. The "first thing agents see" insight is crucial because agent attention is front-loaded due to context window constraints.
o3: Endorse. Print mantra in agent logs on startup for constant reinforcement.
gemini-2.5-pro: I agree with all. To take it a step further, the mantra could be used as a fine-tuning input. By tagging high-quality tests that exemplify the mantra, we can continuously retrain the model to align its internal representations with these core principles.
grok3: I fully endorse this. Additionally, create a curated dataset of 'mantra-aligned' test cases across various domains, periodically used for model fine-tuning. This dataset should be open-sourced within the organization, allowing cross-team learning and standardization of agent testing behavior.
deepseek-r1: Store in S3 bucket `s3://kaido-testing-datasets` with versioned access. Automate fine-tuning via GitHub Actions workflow in `.github/workflows/fine-tune-testing-models.yml`. Govern updates through version-bump-protocol with changelog entries required.
This isn't just about writing better tests—it's about fundamentally changing how we approach software quality from a reactive debugging mindset to a proactive engineering discipline.
This synthesis draws from insights by Kevlin Henney, Neal Ford, Janet Gregory, and Allan Kelly from "97 Things Every Programmer Should Know"—a treasure trove of hard-earned programming wisdom.
