Flaky Tests: The Complete Guide to Detection, Quarantine & Prevention (2026)

A flaky test passes and fails on the same code. This guide covers detecting, quarantining, and preventing flaky tests in CI.

İbrahim Süren

Founder · Jun 25, 2026 · 10 min read


A flaky test is one that passes and fails on the same code without any change. You manage flakiness in four moves: detect flakes from pass/fail history, quarantine the worst so they stop blocking releases, fix the underlying non-determinism, and track your flake rate so you can prove it's improving.

Key takeaways

A flaky test returns different results on unchanged code — it's a false signal, not a real bug.
Flakiness is widespread: Google's analysis found almost 16% of its tests showed some flakiness.
Detect flakes from history across runs, never from a single result.
Quarantine to stay unblocked, then fix the root-cause non-determinism — don't just retry forever.
Track flake rate over time so reliability work is measurable, not anecdotal.

Flaky tests are the quiet tax on every automated test suite. They pass on one run and fail on the next with no code change, they block releases at the worst possible moment, and — worst of all — they teach your team to ignore red builds. Once that happens, a real regression sails through behind the noise.

This is the complete guide to dealing with them: what a flaky test actually is, what causes flakiness, how to detect it automatically, how to contain it with quarantine and retries, how to prevent it at the root, and how to measure your way out. Where we reference Qualflare, our own platform, we describe only capabilities we can verify.

What is a flaky test?

A flaky test is a test that produces different results — sometimes passing, sometimes failing — on the same code, without any change to that code. Because the code under test didn’t change, a flaky failure tells you nothing reliable: it’s a false signal. The same property makes flakiness uniquely corrosive. A test that’s wrong half the time isn’t just useless; it actively trains engineers to distrust the suite.

Flakiness is not a rare edge case. In Google’s analysis of its own test suite, almost 16% of tests showed some level of flakiness, and the company sees a continual rate of about 1.5% of all test runs reporting a flaky result. At that prevalence, flakiness isn’t a problem you eliminate once — it’s one you manage continuously.

What causes flaky tests?

Every flaky test traces back to non-determinism — the result depends on something the test doesn’t control. Martin Fowler’s guide to eradicating non-determinism in tests remains the canonical breakdown. The common sources:

Timing and race conditions — fixed sleep() waits, assertions that fire before an async operation completes, animations.
Shared state — tests that leave data, files, or global state behind and pollute the next test.
Test-ordering dependencies — a test that only passes if another runs first.
External dependencies — real network calls, third-party APIs, or services that are occasionally slow or down.
Concurrency — tests that aren’t safe to run in parallel because they contend for ports, fixtures, or data.
Environment differences — time zones, locales, randomness, or resource limits that differ between machines. This is why tests so often pass locally but fail in CI.
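The shared-state and ordering causes above can be reproduced in a few lines. This is a minimal sketch, not taken from any real suite: `_cart`, the two check functions, and `run` are all hypothetical names, and `run` only simulates a test runner that executes checks in a given order.

```python
# Hypothetical module-level state shared by every check in the process.
_cart: list[str] = []


def check_add_item() -> bool:
    _cart.append("book")
    return len(_cart) == 1   # silently assumes the cart started empty


def check_cart_starts_empty() -> bool:
    return _cart == []       # fails if an earlier check left data behind


def run(order) -> dict[str, bool]:
    """Simulate one process: run the checks in `order`, report pass/fail."""
    _cart.clear()            # each simulated run starts from a clean process
    return {check.__name__: check() for check in order}


# Same code, two execution orders, two different verdicts.
print(run([check_cart_starts_empty, check_add_item]))
print(run([check_add_item, check_cart_starts_empty]))
```

Neither check is wrong on its own. The second ordering fails only because the first check leaked state, which is why a runner that shuffles or parallelizes tests can expose flakiness that a fixed order hides.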

Which tests are flakiest? Flakiness by layer

Flakiness isn’t spread evenly across the test pyramid. The higher up you go, the more moving parts a test depends on — and the more ways it can turn non-deterministic.

Test layer	Where the flakiness comes from	Relative flake risk
Unit	Shared module state, fake-timer mistakes, ordering leaks between tests	Low
Integration	Real databases, ports, fixtures, and service start-up timing	Medium
End-to-end	Network calls, browser rendering, animations, and full-stack timing	High

This is why an inverted pyramid — many end-to-end tests, few unit tests — is the single biggest structural cause of a flaky suite. End-to-end tests are worth keeping, but each one carries more flakiness risk, so they belong at the top of the pyramid in small numbers, not at its base.

How to detect flaky tests

You cannot spot a flaky test from a single run — passing once proves nothing, and failing once might be a real bug. The reliable approach is historical: record every test’s outcome on every run, keyed by test identity and commit, then flag the ones that flip result without a code change. The more runs in the history, the more confident the signal.

Good detection produces a flakiness score, not a binary flag: a test that fails 20% of the time needs different handling than one that fails 80%. Qualflare scores every test’s reliability from its run history and tracks a 90-day flakiness trend, so you can prioritize the worst offenders and confirm they actually improve after a fix. For the full approach, see how to detect flaky tests automatically; for a step-by-step setup, see the guide to setting up flaky test detection in CI.

A worked example makes the history-based logic concrete. Say checkout_test ran 50 times this week across three commits: it passed 41 times and failed 9, and those 9 failures are spread across all three commits, including the commit that’s currently green in production. Because the same commit both passed and failed, the failures can’t be explained by a code change; the test is flaky, failing 18% of its runs (9 of 50). A single-run view would have shown only the latest red and sent someone hunting for a regression that doesn’t exist. The history is what turns nine scattered failures into one clear verdict.
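The worked example above can be sketched in code. This is an illustration under stated assumptions, not Qualflare's scoring method: the `Run` record and `find_flaky` function are hypothetical, the commit names are invented, and the per-commit split of the 41 passes and 9 failures is made up to match the totals.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    test_id: str
    commit: str
    passed: bool


def find_flaky(history):
    """Return {test_id: failure_rate} for tests that both passed and
    failed on at least one single commit."""
    outcomes = defaultdict(set)            # (test_id, commit) -> outcomes seen
    totals = defaultdict(lambda: [0, 0])   # test_id -> [failures, runs]
    for run in history:
        outcomes[(run.test_id, run.commit)].add(run.passed)
        totals[run.test_id][0] += int(not run.passed)
        totals[run.test_id][1] += 1
    flaky_ids = {tid for (tid, _), seen in outcomes.items() if len(seen) == 2}
    return {tid: totals[tid][0] / totals[tid][1] for tid in flaky_ids}


history = []
# Assumed split: 41 passes and 9 failures spread over three commits.
for commit, passes, failures in [("a1", 14, 3), ("b2", 14, 3), ("c3", 13, 3)]:
    history += [Run("checkout_test", commit, True)] * passes
    history += [Run("checkout_test", commit, False)] * failures
# A test that fails every time on one commit is not flagged: it never flipped.
history += [Run("login_test", "c3", False)] * 4

print(find_flaky(history))
```

Only checkout_test is reported, with a failure rate of 9 in 50 runs. login_test is left out because it never changed outcome on the same commit, so its failures still deserve investigation as a possible real defect.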

How to handle flaky tests: quarantine and retries

Detection tells you which tests are flaky. Handling them is about staying unblocked without going blind.

Quarantine moves a known-flaky test out of the build’s blocking path: it still runs and reports, but its result no longer fails the pipeline. This keeps the build trustworthy for everyone else. The danger is that a quarantined test is invisible coverage — a real bug it would have caught now ships silently — so treat quarantine as temporary. Set an SLA (fix or delete within, say, two weeks) and track how long tests sit there.
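As a rough sketch of that behavior, the gate below lets quarantined tests run and report while keeping them out of the pass/fail decision. The function names, test names, and the shape of `results` are assumptions for illustration; real CI systems and test runners each have their own mechanism.

```python
def pipeline_passes(results: dict[str, bool], quarantined: set[str]) -> bool:
    """The build is green when every non-quarantined test passed."""
    return all(
        passed for test_id, passed in results.items()
        if test_id not in quarantined
    )


def muted_failures(results: dict[str, bool], quarantined: set[str]) -> list[str]:
    """Quarantined tests that failed: still reported, just not blocking."""
    return sorted(
        test_id for test_id, passed in results.items()
        if not passed and test_id in quarantined
    )


results = {"login_test": True, "checkout_test": False, "search_test": True}
quarantined = {"checkout_test"}

print(pipeline_passes(results, quarantined))   # build decision
print(muted_failures(results, quarantined))    # what quarantine is hiding
```

Reporting the muted failures on every build is one way to keep the invisible-coverage risk in front of the team while the test waits for a fix.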

Retries re-run a failed test and pass it if any attempt succeeds. They keep pipelines green, but a test that only passes on the second try is still flaky — and careless retries can mask a genuine intermittent bug in your product. Allow limited retries to stay moving, but record which tests needed a retry and how often; retry frequency is itself a flakiness signal.

Setting a quarantine policy

Quarantine without rules just hides tests forever, and the invisible-coverage risk compounds. A workable policy makes the trade-off explicit:

Entry criteria — a test is quarantined only once detection scores it flaky from history, never on a single red run.
An owner — every quarantined test gets an assignee, so it isn’t orphaned.
A time-boxed SLA — fix or delete within a fixed window (two weeks is a common default). A test that outlives its SLA is escalated or removed, not left to rot.
A size cap — limit how many tests can sit in quarantine at once, say 1% of the suite. Hitting the cap is the signal to stop adding features and pay down test debt.
Exit criteria — a test rejoins the blocking suite only when its flakiness score stays clean across a set number of runs, proving the fix held.

The cap matters more than teams expect: every quarantined test is coverage you’ve muted, and an unbounded quarantine list slowly blinds the suite to real regressions.

How to prevent flaky tests

Quarantine and retries are containment. The cure is removing the non-determinism:

Wait on conditions, not the clock — replace fixed sleeps with explicit waits for the state you expect.
Isolate state — each test sets up and tears down its own data; nothing leaks between tests.
Stub external dependencies — don’t let a third-party outage fail your unit tests.
Make tests order-independent — no test should rely on another running first.
Control inputs — pin time, seed randomness, fix locales and time zones.
Make tests parallel-safe — no shared ports, files, or fixtures across concurrent tests.
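Two of the fixes above, waiting on conditions and controlling inputs, can be sketched with the standard library alone. `wait_until` and `make_order_id` are hypothetical helpers written for this illustration; browser and UI frameworks ship their own explicit-wait APIs, which are usually the better choice where available.

```python
import random
import time
from datetime import datetime, timezone


def wait_until(condition, timeout: float = 5.0, interval: float = 0.05) -> bool:
    """Poll `condition` until it returns True or `timeout` seconds pass.
    Returns as soon as the state is reached instead of sleeping a fixed time."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    return bool(condition())   # one last check at the deadline


def make_order_id(rng: random.Random, now: datetime) -> str:
    """Hypothetical code under test: randomness and the clock are passed in,
    so a test can pin both."""
    return f"{now:%Y%m%d}-{rng.randint(1000, 9999)}"


# Controlled inputs: a fixed UTC timestamp and a seeded generator.
fixed_now = datetime(2026, 1, 1, tzinfo=timezone.utc)
first = make_order_id(random.Random(42), fixed_now)
second = make_order_id(random.Random(42), fixed_now)
print(first == second)   # same seed and same clock give the same ID
```

The design choice in both helpers is the same: the test states the condition or input it depends on explicitly, rather than inheriting it from the machine's speed, clock, or random state.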

A top-heavy suite makes all of this worse: end-to-end tests are the flakiest, so a suite that over-relies on them (an inverted test pyramid) will be flaky no matter how carefully you write each test.

Flaky tests by framework

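The principles are universal, but the mechanics differ per framework. Playwright, Cypress, pytest, Jest, and JUnit can each produce machine-readable results and retry data, either natively or through a plugin or build-tool setting (pytest, for example, has no retries in core and relies on a plugin such as pytest-rerunfailures), and that data feeds flaky detection once you send it to an observability layer.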
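One common interchange format is JUnit-style XML, which pytest, for instance, can write with its `--junitxml` option. The sketch below turns such a report into per-test outcomes that a history store could ingest. The sample report and the `parse_junit` helper are illustrative assumptions, and real reports carry more attributes than shown here.

```python
import xml.etree.ElementTree as ET

# Hypothetical report contents, trimmed to the elements used below.
SAMPLE_REPORT = """<testsuite name="checkout" tests="3">
  <testcase classname="tests.test_checkout" name="test_total" time="0.01"/>
  <testcase classname="tests.test_checkout" name="test_payment" time="1.20">
    <failure message="timeout waiting for gateway">AssertionError</failure>
  </testcase>
  <testcase classname="tests.test_checkout" name="test_coupon" time="0.00">
    <skipped/>
  </testcase>
</testsuite>"""


def parse_junit(xml_text: str) -> dict[str, str]:
    """Map 'classname::name' to 'passed', 'failed', or 'skipped'."""
    outcomes = {}
    for case in ET.fromstring(xml_text).iter("testcase"):
        test_id = f"{case.get('classname')}::{case.get('name')}"
        if case.find("failure") is not None or case.find("error") is not None:
            outcomes[test_id] = "failed"
        elif case.find("skipped") is not None:
            outcomes[test_id] = "skipped"
        else:
            outcomes[test_id] = "passed"
    return outcomes


for test_id, outcome in parse_junit(SAMPLE_REPORT).items():
    print(test_id, outcome)
```

Keying each outcome by a stable test identity is what lets results from different runs and different frameworks line up in one history.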

For framework-specific fixes, see the deep dives on Playwright flaky tests in CI, Cypress flaky tests, pytest flaky tests, and Jest flaky tests in CI.

Measuring your way out: flake rate

What gets measured gets fixed. Flake rate — the share of failures that turn out to be flaky rather than real — turns flakiness from an anecdote into a managed metric. Track it over time, set a threshold, and watch whether reliability work is actually paying off. Because reliable, fast tests lower change failure rate and shorten lead time, this work shows up directly in your DORA metrics too.
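Using the definition above (flaky failures as a share of all failures), a weekly trend check might look like the sketch below. The weekly counts and the 25% threshold are invented for illustration; the right threshold depends on your suite.

```python
def flake_rate(flaky_failures: int, total_failures: int) -> float:
    """Share of failures that turned out to be flaky rather than real."""
    return flaky_failures / total_failures if total_failures else 0.0


def trend(weeks, threshold: float):
    """Return (label, rate, within_threshold) for each week."""
    rows = []
    for label, flaky, total in weeks:
        rate = flake_rate(flaky, total)
        rows.append((label, round(rate, 2), rate <= threshold))
    return rows


# Hypothetical data: (week, flaky failures, total failures).
weekly = [("2026-W22", 30, 50), ("2026-W23", 18, 45), ("2026-W24", 8, 40)]
THRESHOLD = 0.25   # assumed target, not a recommendation

for label, rate, ok in trend(weekly, THRESHOLD):
    print(label, rate, "ok" if ok else "over threshold")
```

Classifying a failure as flaky or real still depends on run history, so a metric like this sits downstream of detection and is only as good as the data feeding it.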

Flaky tests are also rarely as numerous as they look: a wall of intermittent red usually clusters into a few shared causes, so fixing one root issue can quiet dozens of tests at once.

Start free with Qualflare — upload your CI results and get flaky-test scoring, failure clustering, and release-risk analysis on your own suite within minutes.

Frequently asked questions

What is a flaky test?

A flaky test is a test that produces different results — sometimes passing, sometimes failing — on the same code, without any change to that code. It’s usually caused by timing, shared state, test ordering, or environment differences, and it’s a false signal rather than a real defect.

How do you detect flaky tests?

By analyzing each test’s pass/fail history across many runs. A single run can’t prove a test is flaky; you need to see it flip outcome on unchanged code. Detection tools record every run’s results and flag tests that change result without a corresponding code change, ideally with a flakiness score.

Should you delete flaky tests?

Not as a first step. A flaky test usually covers something real, so deleting it removes coverage. Quarantine it so it stops blocking releases, fix the underlying non-determinism, then return it to the blocking suite. Delete only if the test has no value or duplicates coverage elsewhere.

Are retries a good way to handle flaky tests?

Retries keep builds green but hide the signal — a test that only passes on retry is still flaky and may be masking a real intermittent bug. Use limited retries to stay unblocked, but record which tests needed a retry so flakiness stays visible and gets fixed.

What’s an acceptable flaky test rate?

Lower is always better; the goal is a suite people trust. What matters more than a universal number is the trend: track your flake rate over time and drive it down. A suite where flaky failures are common enough that engineers ignore red builds has already crossed the line.