Track down a failing test

Hand Sarvam Code a red test and let it dig: reproduce it, trace the failure to its root cause across files, fix it, and confirm the suite is green again.

Sarvam Code - billing-worker

› TestSettleInvoice fails intermittently in CI. Find out why and fix it.

• Let me reproduce it under repeated runs to surface the flake.

• Ran go test -run TestSettleInvoice -count=20

└ --- FAIL: TestSettleInvoice (0.01s)

settle_test.go:48: got status PENDING, want SETTLED

FAIL (3/20 runs)

• Reproduced 3 of 20. The settle goroutine isn't awaited before the assertion: a race, not a logic bug.

• Edited internal/billing/settle.go (+3 -1)

  62 - go s.markSettled(ctx, inv.ID)
  62 + if err := s.markSettled(ctx, inv.ID); err != nil {
  63 +     return fmt.Errorf("settle: %w", err)
  64 + }

• Ran go test -run TestSettleInvoice -count=50

└ ok internal/billing 0.624s

PASS (50/50 runs)

• Root cause was a fire-and-forget settle that the test could outrun. Now awaited and error-checked; 50/50 green. Want me to scan for other go-without-wait calls?

01 Point the agent at the failing test or paste the CI output.
02 It reproduces the failure locally and reads the code paths involved.
03 It forms a hypothesis, narrows it down, and traces to the root cause.
04 It writes the fix, then re-runs the test and the surrounding suite.
05 You review one diff that explains the bug and proves it's resolved.

Debugging is the part of engineering that resists shortcuts: the symptom is in one file, the cause is in another, and the fix is obvious only once you understand both. A flaky or intermittently failing test is worse still, because it hides behind timing and state that a single read of the code won't reveal.

Sarvam Code treats it as the investigation it is. It reproduces the failure, reads the relevant code paths the way you would, and reasons across files to separate the symptom from the cause. When it has a fix, it does not stop at the one test: it re-runs the surrounding suite so a fix in one place doesn't quietly break another. Every step is visible and gated, so you are reviewing a reasoned diagnosis, not a guess.

Long-horizon debugging is read-heavy and iterative, exactly the workload where an affordable model with a disciplined harness shines. You get to the root cause without paying frontier prices to think out loud.

Curious what it costs to run work like this? See cost per task, or the full pricing.
