How to Detect Flaky Tests Automatically (2026)

You can't detect a flaky test from one run. The history-based approach to automatic flaky-test detection — signals, scoring vs flags, and tooling.

İbrahim Süren
Founder · Jun 25, 2026 · 6 min read
You can't detect a flaky test from a single run — passing once proves nothing and failing once might be a real bug. Automatic detection works by recording every test's pass/fail outcome across runs and flagging the ones that flip result without a code change, ideally with a flakiness score rather than a binary flag.

Key takeaways

  • A single run can't reveal flakiness — detection requires pass/fail history across runs.
  • Key signals: result flips on unchanged code, passes only on retry, and high run-to-run timing variance.
  • Prefer a flakiness score over a binary flag — a 20%-fail test needs different handling than an 80% one.
  • Options range from manual re-runs to CI plugins to an observability platform that scores tests automatically.
  • Detection is step one; it only pays off when it feeds quarantine and root-cause fixes.

Flaky tests are easy to feel and hard to pin down. Everyone knows the suite is “a bit flaky,” but which tests are flaky, how badly, and is the problem getting worse or better? Answering that precisely, and automatically rather than from memory, is what flaky-test detection does. This guide covers the approach that actually works, the signals to look for, and the tooling options. For the step-by-step pipeline setup, see the companion guide to setting up flaky test detection in CI. Where we reference Qualflare, our own platform, we describe only what it actually does.

Why you can’t detect flaky tests from one run

A single run can’t tell you a test is flaky. A pass might be luck; a single failure might be a genuine bug. Flakiness is defined by inconsistency on unchanged code, and inconsistency is only visible across multiple runs. This is the core reason naive approaches — re-running once, eyeballing a failure — don’t scale: they confuse “failed this time” with “unreliable over time.”
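To put rough numbers on why one run is not enough, here is a small sketch. It assumes each run of the test fails independently with a fixed probability, which is a simplification (real flaky failures often cluster by runner or load), and the function name is our own. It computes the chance that a given number of runs on the same code shows both a pass and a fail.

```python
def chance_of_seeing_flip(fail_rate: float, runs: int) -> float:
    """Probability that `runs` independent runs contain both a pass and a fail.

    Assumes a constant, independent per-run failure probability.
    """
    pass_rate = 1.0 - fail_rate
    # Subtract the two "no flip" cases: every run passed, or every run failed.
    return 1.0 - pass_rate ** runs - fail_rate ** runs


if __name__ == "__main__":
    # A test that fails 20% of the time, observed over more and more runs.
    for runs in (1, 2, 5, 10, 20):
        print(runs, round(chance_of_seeing_flip(0.2, runs), 3))
```

With a single run the probability is zero by construction: one outcome cannot disagree with itself. Under these assumptions, a test that fails 20% of the time shows both outcomes in only about a third of two-run samples, which is why a single re-run is a weak detector.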

History-based detection: the reliable approach

Flaky-test detection works by recording every test’s outcome on every run, keyed by a stable test identity and the commit it ran against, then flagging tests that flip result without a corresponding code change. The longer the history, the more confident the signal — a test that’s failed-then-passed on the same commit ten times is unambiguously flaky.
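A minimal sketch of that idea follows. The record shape (`RunRecord`) and the function name are illustrative assumptions, not any particular tool's API. It groups outcomes by test identity and commit, then counts the commits on which a test both passed and failed.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple


class RunRecord(NamedTuple):
    """One test execution (hypothetical record shape)."""
    test_id: str   # stable test identity, e.g. file path plus test name
    commit: str    # commit SHA the test ran against
    passed: bool


def find_flipping_tests(history: Iterable[RunRecord]) -> Dict[str, int]:
    """Return {test_id: number of commits on which the test both passed and failed}."""
    outcomes = defaultdict(set)
    for record in history:
        outcomes[(record.test_id, record.commit)].add(record.passed)

    flips = defaultdict(int)
    for (test_id, _commit), seen in outcomes.items():
        if len(seen) == 2:  # both True and False were observed on the same code
            flips[test_id] += 1
    return dict(flips)


if __name__ == "__main__":
    history = [
        RunRecord("test_checkout", "abc123", False),
        RunRecord("test_checkout", "abc123", True),   # flipped on unchanged code
        RunRecord("test_login", "abc123", True),
        RunRecord("test_login", "def456", False),     # different commit: may be a real bug
    ]
    print(find_flipping_tests(history))
```

Note that `test_login` is not flagged: its failure arrived with a new commit, so the history cannot distinguish it from a genuine regression.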

This is also why detection belongs in the observability layer rather than the test runner: the runner sees one execution, but detection needs the whole stream of runs. Google’s research underscores why this is worth automating: it identifies flakiness as one of the central challenges of automated testing, precisely because it’s pervasive and invisible to single-run thinking. In Google’s own suite, almost 16% of tests showed some level of flakiness and about 1.5% of all test runs reported a flaky result, a steady background rate that only history can separate from real failures.

What signals indicate flakiness

Beyond the headline signal — a result that flips on unchanged code — strong detectors weigh several:

  • Result flips per commit — the same test, same code, different outcomes.
  • Pass-on-retry — the test failed, then passed on an automatic retry with no change. Retry counts are a direct flakiness signal.
  • Timing variance — large run-to-run swings in duration often accompany races and timeouts.
  • Environment correlation — failures that track with parallelism, specific runners, or ordering rather than code (the classic pass-locally-fail-in-CI pattern).
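Two of the signals above, pass-on-retry and timing variance, are simple to compute once per-execution data is available. The sketch below assumes a hypothetical `Execution` record holding an attempt count, a final outcome, and a duration. It uses the coefficient of variation (standard deviation divided by mean) as one reasonable way to express timing variance; other measures would work too.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Sequence


@dataclass
class Execution:
    """One execution of a single test (hypothetical record shape)."""
    attempts: int      # 1 means no retry was needed
    passed: bool       # final outcome after any retries
    duration_s: float  # wall-clock duration in seconds


def pass_on_retry_rate(executions: Sequence[Execution]) -> float:
    """Fraction of executions that only passed after at least one retry."""
    if not executions:
        return 0.0
    retried_passes = sum(1 for e in executions if e.passed and e.attempts > 1)
    return retried_passes / len(executions)


def duration_cv(executions: Sequence[Execution]) -> float:
    """Coefficient of variation of duration; higher means less stable timing."""
    durations = [e.duration_s for e in executions]
    if len(durations) < 2:
        return 0.0
    avg = mean(durations)
    return pstdev(durations) / avg if avg > 0 else 0.0


if __name__ == "__main__":
    runs = [
        Execution(1, True, 1.0),
        Execution(2, True, 3.0),
        Execution(1, True, 1.0),
        Execution(3, True, 3.0),
    ]
    print(pass_on_retry_rate(runs), duration_cv(runs))
```

Neither number proves flakiness on its own. They are best read alongside the result-flip signal, since a slow but deterministic test can also show high timing variance.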

Flakiness scoring beats binary flags

A flaky/not-flaky flag throws away information. A flakiness score — how often and how recently a test flips — lets you triage: quarantine the test that fails 80% of the time today, and watch the one creeping from 2% to 10% before it becomes a problem. Scoring also tells you whether a fix worked, because you can see the score fall over subsequent runs. Qualflare scores every test’s reliability from its history and tracks a 90-day flakiness trend for exactly this reason.
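One way such a score could be computed is sketched below. This is an illustrative formula under our own assumptions, not a description of how Qualflare or any other tool calculates its score. It takes a test's outcomes on unchanged code, oldest first, and returns the share of run-to-run transitions that were flips, with recent transitions weighted more heavily through an exponential half-life (the `half_life` parameter and its default are hypothetical).

```python
from typing import Sequence


def flakiness_score(outcomes: Sequence[bool], half_life: float = 20.0) -> float:
    """Recency-weighted flip rate in the range 0..1.

    `outcomes` holds pass/fail results on unchanged code, oldest first.
    A transition loses half its weight for every `half_life` runs of age.
    """
    if len(outcomes) < 2:
        return 0.0  # a single run cannot show inconsistency

    weighted_flips = 0.0
    total_weight = 0.0
    last = len(outcomes) - 1
    for i in range(1, len(outcomes)):
        age = last - i                      # 0 for the most recent transition
        weight = 0.5 ** (age / half_life)
        total_weight += weight
        if outcomes[i] != outcomes[i - 1]:  # result changed between consecutive runs
            weighted_flips += weight
    return weighted_flips / total_weight


if __name__ == "__main__":
    stable = [True] * 24
    flipped_long_ago = [True, False, True, False] + [True] * 20
    flipping_now = [True] * 20 + [False, True, False, True]
    for name, series in [("stable", stable),
                         ("flipped long ago", flipped_long_ago),
                         ("flipping now", flipping_now)]:
        print(name, round(flakiness_score(series), 3))
```

The two flaky series contain the same number of flips, but the one flipping now scores higher. That ordering is what makes a score useful for triage and for confirming that a fix has taken hold.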

Ways to detect flaky tests

| Approach | How it works | Trade-off |
| --- | --- | --- |
| Manual re-runs | Re-run failed tests and see if they pass | Doesn’t scale; no history or scoring |
| Framework retry data | Use built-in retry results (Playwright, Cypress) as a signal | Per-run only; no cross-run trend |
| CI plugins | A pipeline step tracks repeated failures | Often CI-specific; limited analysis |
| Observability platform | Ingests every run’s results and scores flakiness automatically | Best at scale; keeps full cross-run history |

The further down the table, the more the system remembers — and memory is what detection runs on. An observability platform also connects detection to the rest of the picture: it can cluster a group of flaky failures to a shared cause, so you fix one root issue instead of chasing each test.
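As a rough illustration of what clustering failures by shared cause can mean, the sketch below groups failing tests by a normalized error message. This is a deliberately naive heuristic of our own, not how any specific platform does it: it replaces hex values and numbers with placeholders so that messages differing only in timings or ids end up with the same signature.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def failure_signature(message: str) -> str:
    """Normalize an error message so variable parts do not split clusters."""
    signature = re.sub(r"0x[0-9a-fA-F]+", "<hex>", message)  # addresses, hex ids
    signature = re.sub(r"\d+", "<n>", signature)             # timings, ports, counts
    return signature.strip().lower()


def cluster_failures(failures: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (test_id, error_message) pairs by normalized signature."""
    clusters = defaultdict(list)
    for test_id, message in failures:
        clusters[failure_signature(message)].append(test_id)
    return dict(clusters)


if __name__ == "__main__":
    failures = [
        ("test_a", "Timeout after 3000 ms waiting for db:5432"),
        ("test_b", "Timeout after 5000 ms waiting for db:5432"),
        ("test_c", "AssertionError: expected 2 got 3"),
    ]
    for signature, tests in cluster_failures(failures).items():
        print(signature, tests)
```

Here `test_a` and `test_b` land in one cluster because both time out waiting for the same database, which points at one root issue rather than two unrelated tests.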

From detection to action

Detection only pays off when it drives action. Once a test is scored flaky, quarantine the worst offenders so they stop blocking releases, then fix the underlying non-determinism and let the score confirm the fix. The full lifecycle — detect, contain, fix, measure — is covered in the complete guide to flaky tests.
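The hand-off from score to action can be as simple as a pair of thresholds. The sketch below is illustrative only: the cut-off values are hypothetical defaults, and the right numbers depend on your suite and release cadence.

```python
def triage(score: float, quarantine_at: float = 0.5, watch_at: float = 0.05) -> str:
    """Map a flakiness score (0..1) to an action. Thresholds are hypothetical."""
    if score >= quarantine_at:
        return "quarantine"  # stop this test from blocking releases
    if score >= watch_at:
        return "watch"       # not blocking yet, but trending in the wrong direction
    return "healthy"


if __name__ == "__main__":
    for score in (0.8, 0.1, 0.0):
        print(score, triage(score))
```

Re-running the same triage after a fix lands gives the "measure" step: a test that drops from quarantine back to healthy over subsequent runs is evidence the fix worked.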

Start free with Qualflare — upload your CI results and get automatic flaky scoring and a 90-day reliability trend on your own suite within minutes.

Frequently asked questions

How do you automatically detect flaky tests?

Record every test’s pass/fail result on every run, keyed by test identity and commit, then flag tests that change outcome without a code change. The more run history you have, the more confident the signal. Mature detection assigns each test a flakiness score based on how often and how recently it flips, rather than a simple flaky/not-flaky flag.

Can you detect flaky tests from a single test run?

No. A pass could be luck and a single failure could be a real bug. Flakiness is defined by inconsistent results on unchanged code, which only becomes visible across multiple runs. That’s why reliable detection is always history-based.

What signals indicate a test is flaky?

The strongest signal is a result that flips between pass and fail with no code change. Others include tests that fail then pass on automatic retry, large run-to-run variance in duration, and failures that correlate with parallelism or specific runners rather than code.

What tools detect flaky tests?

Options range from manual re-runs of failed tests, to your framework’s built-in retry data, to dedicated CI plugins, to a test observability platform that ingests results from every run and scores flakiness automatically. The platform approach scales best because it keeps the cross-run history detection depends on.
