When Tests Become a Liability: The Cost of Brittle Test Suites
Part 1 of 9 in the series: Unit Testing: A Behavior-First Approach.
This post kicks off a weekly series where I share the hard-won lessons my teams and I learned about writing tests that actually help you ship faster. Each post stands on its own, but together they tell the story of how we went from dreading test maintenance to genuinely enjoying TDD. If you find this useful, the full series is linked at the bottom.
Reading Time: ~4 minutes
I've been building software for over a decade. I'm currently CTO at Outbound, and before that I was CTO at Script (acquired by Linq), Director of Engineering, and a senior engineer. Along the way I've built workflow engines, drag-and-drop page builders, paper eSign systems (think DocuSign), form builders, ABAC permissions systems, landing page builders, and declarative IaC-like engines for marketing campaigns. I started my first company not knowing how to do frontend work or run infrastructure, and grew into a full-stack developer comfortable with everything from container queries to security group IaC. If I've learned anything, it's been through countless mistakes doing hard things.
One of those mistakes: I set a goal of 80% test coverage and told my team to "test every class." It seemed like the right thing to do: measurable, achievable, and it gave stakeholders confidence. But here's what happened: we created a test suite that was worse than having no tests at all.
This reminds me of the Ford Pinto case study. Lee Iacocca set a clear goal: the car would weigh under 2,000 pounds and cost under $2,000. They hit both targets, but in doing so, they created a car with a dangerous fuel tank design. The metrics were achieved, but the outcome was catastrophic.
Similarly, "80% test coverage" is an easy metric to measure and report. But when we pursued it with a "test every class" mindset, we created brittle, hard-to-read tests that needed to be changed every time any code changed, including refactors. These test suites waste time and don't increase team velocity.
At Script, as requirements changed frequently (as they do at any startup), we started feeling the pain. Every feature addition, every refactor, every improvement meant updating dozens of test files. Developers began to dread making changes because of the test maintenance burden.
I know what some of you are thinking: "This is exactly why I don't write tests. They just slow you down." I've heard that objection hundreds of times, and I used to half-agree with it. But the problem was never testing itself. The problem was how we were testing. Stick with me. The payoff is worth it.
The Problem: Brittle Tests That Break on Every Change
Let me show you what I mean with some hypothetical code. Let's say we have a Task class that represents work items in our system:
```typescript
class Task {
  constructor(
    public id: string,
    public title: string,
    public status: 'PENDING' | 'IN_PROGRESS' | 'COMPLETED',
    public assignedTo: string
  ) {}

  markComplete(): void {
    this.status = 'COMPLETED';
  }

  assignTo(userId: string): void {
    this.assignedTo = userId;
  }
}
```
Simple enough. Now, let's look at how tests were written. With our "test every class" approach, we had tests scattered across multiple files, each directly instantiating Task objects:
Test File 1: task-service.test.ts
```typescript
describe('TaskService', () => {
  it('should retrieve a task by id', async () => {
    /** ⚠️ Coupled to Task constructor. Breaks if signature changes */
    const task = new Task(
      'task-123',
      'Fix bug in login',
      'IN_PROGRESS',
      'user-456'
    );

    const result = await taskService.getTaskById('task-123');

    /** ⚠️ Asserting individual properties, coupled to Task's internal structure */
    expect(result.id).toBe(task.id);
    expect(result.title).toBe(task.title);
    expect(result.status).toBe(task.status);
    expect(result.assignedTo).toBe(task.assignedTo);
  });

  it('should update task status', async () => {
    /** ⚠️ Another direct instantiation, same constructor dependency */
    const task = new Task(
      'task-123',
      'Fix bug in login',
      'PENDING',
      'user-456'
    );

    await taskService.updateStatus('task-123', 'IN_PROGRESS');

    expect(task.status).toBe('IN_PROGRESS');
  });
});
```
Test File 2: task-repository.test.ts
```typescript
describe('TaskRepository', () => {
  it('should save a task', async () => {
    /** ⚠️ Yet another file coupled to Task constructor */
    const task = new Task('task-123', 'Review PR', 'PENDING', 'user-789');

    await repository.save(task);

    /** ⚠️ Assertion mirrors Task's internal shape. Breaks if fields are renamed or restructured */
    expect(mockDb.save).toHaveBeenCalledWith({
      id: 'task-123',
      title: 'Review PR',
      status: 'PENDING',
      assignedTo: 'user-789',
    });
  });

  it('should find tasks by assignee', async () => {
    /** ⚠️ Two more direct instantiations. Two more breakpoints when constructor changes */
    const task1 = new Task('task-1', 'Task 1', 'PENDING', 'user-789');
    const task2 = new Task('task-2', 'Task 2', 'IN_PROGRESS', 'user-789');

    await repository.save(task1);
    await repository.save(task2);

    const results = await repository.findByAssignee('user-789');

    expect(results).toHaveLength(2);
  });
});
```
Test File 3: task-controller.test.ts
```typescript
describe('TaskController', () => {
  it('should create a task via HTTP', async () => {
    /** ⚠️ Same pattern, third file. Constructor dependency is spreading */
    const task = new Task('task-123', 'New feature', 'PENDING', 'user-456');

    const response = await controller.create({
      title: 'New feature',
      assignedTo: 'user-456',
    });

    expect(response.id).toBeDefined();
    expect(response.title).toBe(task.title);
  });

  it('should update task assignment', async () => {
    /** ⚠️ Every test file knows how to build a Task. That's the root problem */
    const task = new Task(
      'task-123',
      'Existing task',
      'IN_PROGRESS',
      'user-456'
    );

    await controller.updateAssignment('task-123', {
      assignedTo: 'user-789',
    });

    expect(task.assignedTo).toBe('user-789');
  });
});
```
You get the picture. We had dozens of test files, each creating Task objects directly in their tests.
When Any Change Breaks Everything
Then came the requirement: "We need to track task priority." Simple enough, right? We add a priority field to the Task class:
```typescript
class Task {
  constructor(
    public id: string,
    public title: string,
    public status: 'PENDING' | 'IN_PROGRESS' | 'COMPLETED',
    public assignedTo: string,
    public priority: 'LOW' | 'MEDIUM' | 'HIGH' // New property
  ) {}

  // ... rest of the class
}
```
Boom. Every single test that creates a Task object is now broken. TypeScript screams at us: "Expected 5 arguments, but got 4."
We had to update:
- task-service.test.ts - 2 test cases
- task-repository.test.ts - 2 test cases
- task-controller.test.ts - 2 test cases
- task-validator.test.ts - 3 test cases
- task-event-handler.test.ts - 4 test cases
- task-scheduler.test.ts - 5 test cases
- ... and 12 more test files
Total: 50+ test cases across 18 files needed updates.
And it wasn't just adding the parameter. We had to decide: what's the default priority? Should we assert on it? Does it affect the behavior we're testing? Each test file became a decision point.
This wasn't just about adding a property. Any refactor (extracting a method, renaming a variable, changing a constructor signature) meant updating multiple test files. The tests were so tightly coupled to implementation details that they broke on changes that didn't affect behavior at all.
The Root Cause: Tests Coupled to Implementation
The problem isn't the new property. It's that our tests are coupled to the implementation details of how Task objects are constructed. Every test knows exactly how to build a Task, which means every test breaks when the constructor changes, even when the behavior we're testing hasn't changed.
Just like the Pinto team met their goal but delivered a value-destroying outcome, a software project can hit 80% coverage and still have a test suite that slows the team down, obscures intent, and erodes confidence. Test coverage is only one data point. It tells you an output, not an outcome. It's a well-intended but ultimately defective metric, like lines of code or tokens consumed.
The Real Cost: Worse Than No Tests
Bad test suites are worse than no tests at all. Here's what we experienced:
- Wasted Time: Time spent updating tests that didn't need to change
- Cognitive Load: Developers have to remember (or be reminded by build checks) to update tests in multiple places
- Merge Conflicts: Multiple developers updating the same test files simultaneously
- Fear of Change: Developers hesitate to refactor because "it will break all the tests"
- False Negatives: Tests fail not because behavior changed, but because implementation changed
- Lost Trust: When tests break for non-behavioral reasons, developers start ignoring test failures
- Reduced Velocity: The team moves slower because every change requires test maintenance
- PR Noise: Reviewers develop PR fatigue because they can't tell which changes are meaningful and which are just refactor churn
The tests weren't reducing risk, they were increasing it. They weren't reducing stress, they were creating it. And they definitely weren't documenting behavior, they were documenting implementation details that changed constantly.
This last point deserves emphasis, because it's become one of the core ideas in this series: diffs are communication. When you open a pull request, every file that changed is saying something to the reviewer. If test files are changing, that should mean the public interface or requirements changed. But in our old test suite, test files changed on every PR, even pure refactors, which meant the signal was drowned in noise. Reviewers couldn't tell what was a real behavior change and what was just test maintenance busywork.
Common "Solutions" (That Don't Work)
When we hit this wall, we heard a few common responses:
"Just don't test. It's too expensive and time-consuming."
This feels tempting when tests are a burden. But the problem isn't testing. It's how we're testing. Good tests should make development faster, not slower. They should reduce risk, not increase it.
I'm a firm believer that tests do make you faster. If they don't speed up the initial implementation, they should pay dividends over time. Part of being a senior developer is performing global optimization for the team vs. just locally optimizing your own immediate workflow. I had to learn that the hard way. I spent years thinking I was "too fast" for tests, and I paid for it every time I had to debug a regression at 2 AM.
"This is just how it is. You need to deal with it."
This is defeatist. Yes, maintaining tests takes effort, but it shouldn't require updating 50+ files for a single property addition. There's a better way.
"Write fewer tests."
This misses the point. The problem isn't the number of tests. It's that our tests are brittle and coupled to implementation. We need better tests, not fewer tests.
"You have too many interfaces and abstractions. That's why it's complicated."
I hear this one a lot, and I get it. When you look at a clean architecture codebase for the first time, the number of files and interfaces can feel overwhelming. But here's the distinction I've come to appreciate: there's a difference between essential complexity and accidental complexity. The interfaces and layers exist because the business problem is complex. They help us organize that complexity so each piece can change independently. That's essential. What's accidental is when your tests force you to understand all of it just to make a simple change. The patterns in this series specifically address that accidental complexity in your test suite.
What Good Test Suites Should Do
Before we talk about solutions, let's be clear about what we're aiming for. Good test suites should:
- Reduce Risk: Catch bugs before they reach production
- Reduce Stress: Give developers peace of mind that as they modify the system, they are not breaking important business behaviors
- Document Behavior: Inform other developers about not only "what" the code does but show who the behavior is important to and for what reason. (This helps them ask better questions later to the right stakeholders.)
- Provide Design Pressure: Using TDD, tests guide us toward better design. They allow the developer to exercise the API and build a first-class DX from the start.
- Increase Velocity: All of the above should have a positive impact on global velocity (and probably local velocity as well)
Our brittle test suite was doing none of these things. It was increasing risk (by creating false negatives), increasing stress (by breaking on refactors), documenting implementation (not behavior), and slowing us down.
The Solution: Isolate Setup and Assertions
The core principle we need to adopt is: test files should only change when behavior or requirements change. This is borrowed from Clean Architecture, which tells us that, generally speaking, code that changes at the same rate for the same reasons should be co-located and modularized together. When a developer sees a test diff in a PR, it should be a clear signal that behavior is changing or new behavior is being added.
If adding a priority property to Task doesn't change the behavior we're testing, our tests shouldn't need to change. We need to isolate the setup (creating test data) and assertions (verifying outcomes) from the test logic itself.
This is where Builders and Test DSL patterns come in. Instead of every test knowing how to construct a Task, we centralize that knowledge in one place. When the constructor changes, only the builder needs updating, not every test file.
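To make that concrete, here's a minimal sketch of what a builder for our `Task` class could look like. The `TaskBuilder` name, the fluent `with*` methods, and the default values are all illustrative choices, not a specific library; the next post in the series covers this pattern in depth.

```typescript
// The Task class from the examples above, after the priority change.
class Task {
  constructor(
    public id: string,
    public title: string,
    public status: 'PENDING' | 'IN_PROGRESS' | 'COMPLETED',
    public assignedTo: string,
    public priority: 'LOW' | 'MEDIUM' | 'HIGH'
  ) {}
}

// One place that knows how to construct a Task. When the constructor
// changes (like adding priority), only this builder needs updating.
class TaskBuilder {
  private id = 'task-123';
  private title = 'Default task';
  private status: 'PENDING' | 'IN_PROGRESS' | 'COMPLETED' = 'PENDING';
  private assignedTo = 'user-456';
  private priority: 'LOW' | 'MEDIUM' | 'HIGH' = 'MEDIUM';

  withId(id: string): this { this.id = id; return this; }
  withTitle(title: string): this { this.title = title; return this; }
  withStatus(status: 'PENDING' | 'IN_PROGRESS' | 'COMPLETED'): this {
    this.status = status;
    return this;
  }
  assignedToUser(userId: string): this { this.assignedTo = userId; return this; }

  build(): Task {
    return new Task(this.id, this.title, this.status, this.assignedTo, this.priority);
  }
}

// Tests specify only what they care about; sensible defaults cover the rest.
const task = new TaskBuilder().withStatus('IN_PROGRESS').build();
```

Notice that the priority requirement never surfaces in the test: the builder answers the "what's the default priority?" question once, instead of once per test file.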
But tooling alone isn't enough. We also need to be stingy about what we expose. The more we encapsulate, the less our tests know about internals, and the less they break. This pushes us toward richer domain models that expose behavior over data. Instead of exposing a status enum, expose methods like isActive() or isArchived(). How that's calculated is an internal concern. Good software is easy to change, and you make it easy to change through proper layering and encapsulation. Builders and assertion chains help, but a well-designed public contract is what makes tests truly resilient.
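Here's a sketch of what that encapsulation could look like for our `Task` example. The `isActive()`/`isCompleted()` names are illustrative; the point is that the raw status enum becomes a private implementation detail.

```typescript
type TaskStatus = 'PENDING' | 'IN_PROGRESS' | 'COMPLETED';

class Task {
  // The status enum is private: neither tests nor callers can couple to it.
  private status: TaskStatus = 'PENDING';

  markComplete(): void {
    this.status = 'COMPLETED';
  }

  // Behavior-level queries. How they're computed is an internal concern,
  // so restructuring or renaming the enum breaks no tests.
  isActive(): boolean {
    return this.status !== 'COMPLETED';
  }

  isCompleted(): boolean {
    return this.status === 'COMPLETED';
  }
}

// Tests assert on behavior, not on internal representation.
const task = new Task();
task.markComplete();
```

If we later replace the enum with, say, a state machine or a timestamp, `isCompleted()` keeps its contract and every test that uses it keeps passing.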
This also has a nice side effect: we can more easily split our tests into a "file per feature or behavior" since setup and assertions can easily be shared. As you'll see in the next post, this provides additional clarity and documentation benefits to the developer.
What's Next: The Rest of This Series
This post identified the problem. The rest of this series is dedicated to solving it, one pattern at a time.
- Builders and Test DSL: We'll centralize test data generation so that a constructor change updates one file, not fifty.
- Assertion Chains: We'll replace brittle property-level assertions with behavior-focused verification that survives refactoring.
- Testing at the Layer of Behavior: We'll stop testing every class and start testing every behavior, reducing test count while increasing confidence.
- Tests as Living Documentation: We'll turn tests into executable specifications that communicate intent to developers, product managers, and AI agents.
- The Behavior-First Mindset: We'll explore the shift from code-first to behavior-first thinking and why it matters more than ever with AI writing code.
- Tests as PR Documentation: We'll use tests as the primary lens for code review, treating diffs as a communication medium.
- TDD with AI: We'll see how TDD becomes the standard workflow for supervising AI agents through tests.
- The Journey: We'll tie it all together and reflect on what changed.
Each post builds on the last, but they're designed to stand on their own. If any of the problems in this post felt familiar, I think you'll find the solutions worth your time.