Flaky Test Statistics 2026: How Common, Costly & Fixable They Are

Flaky test statistics for 2026: how common they are (Google ~16% of tests), what they cost (Slack ~28 min/failure), causes, and how top teams fix them.

İbrahim Süren

Founder · Jun 25, 2026 · 9 min read

Flaky tests are pervasive and expensive: Google found ~16% of its tests carry some flakiness and ~1.5% of all test runs report a flaky result; Slack measured ~28 minutes of manual triage per failed test and traced 57% of its build failures to flaky and failing tests. The most common root cause is async waits. This is a compiled, fully cited reference of what the public research and engineering reports actually say.

Key takeaways

Google: ~16% of its tests show some flakiness, and ~1.5% of all test runs report a flaky result.
Cost is real: Slack measured ~28 minutes of manual triage per failed test, and flaky + failing tests caused 57% of its build failures.
Async waits are the single most common cause (Luo et al.), ahead of concurrency and test-order dependencies.
Flakiness is near-universal — the vast majority of developers in Eck et al.'s survey called it a significant problem.
It's manageable, not eliminable: Slack's automated suppression cut flaky build-job failures from ~57% to under 5%.

Flaky tests — tests that pass and fail on the same code without any change — are one of the most studied and least solved problems in software testing. This is a compiled reference of what the public research and large-scale engineering reports actually say about how common, costly, and fixable they are. Every figure links to its primary source; where a number couldn’t be verified to a source that states it, we left it out. Where we reference Qualflare, our own platform, we describe only what it actually does.

A note on methodology: the numbers below are drawn from published academic studies and engineering blogs (Google, Slack, Meta) — not from Qualflare’s own data. They’re the figures the industry cites when it talks about flaky tests.

How common are flaky tests?

Flakiness is pervasive at scale. The most-cited data point comes from Google: in 2016, John Micco reported that “almost 16% of our tests have some level of flakiness associated with them” and that Google sees “a continual rate of about 1.5% of all test runs reporting a ‘flaky’ result.” Those two numbers — ~16% of tests, ~1.5% of runs — are the reference points the whole field uses.

That headline rate hides a sharp pattern by test size. Google’s own data, as compiled in Trunk’s 2024 analysis of 20.2 million CI jobs, shows that over a single week 0.5% of its small tests, 1.6% of its medium tests, and 14% of its large tests were flaky: the bigger and more integrated the test, the more ways it has to go non-deterministic. And small per-test rates compound at scale. Across 1,000 tests that each flake just 0.1% of the time, and assuming the flakes are independent of one another, the chance that at least one fails on a given run is 1 − 0.999^1000, or roughly 63%, which is why large suites can feel perpetually red.
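The compounding arithmetic above can be checked in a few lines of Python. This is a minimal sketch that assumes every test flakes independently at the same rate, which real suites rarely do exactly; the function name `prob_any_flake` is ours, for illustration only.

```python
def prob_any_flake(num_tests: int, per_test_flake_rate: float) -> float:
    """Chance that at least one of `num_tests` tests flakes in a single run.

    Assumes independent tests with an identical flake rate, so the chance
    that all of them pass cleanly is (1 - rate) ** num_tests.
    """
    return 1.0 - (1.0 - per_test_flake_rate) ** num_tests


if __name__ == "__main__":
    # Suite sizes are illustrative; 0.001 is the 0.1% rate from the text.
    for suite_size in (100, 1_000, 5_000):
        chance = prob_any_flake(suite_size, 0.001)
        print(f"{suite_size:>5} tests: {chance:.1%} chance of a red run")
```

For 1,000 tests at 0.1% this gives about 63%, matching the figure in the text. Correlated flakes (for example, a shared unstable service) would change the result, so treat it as an order-of-magnitude guide.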

Test failures also account for a striking share of red builds. At Slack, before they built automated detection, flaky and failing automated tests together accounted for 57% of build failures. That figure combines flaky failures with genuine ones, but it means more than half of red builds came from the test suite, and each of those had to be triaged to find out whether the failure was real.

And the problem is trending up, not down. Bitrise’s analysis of 10M+ mobile builds from 2022 to 2025 found the share of teams hitting flakiness rose from 10% in 2022 to 26% in 2025.

Metric	Figure	Source
Tests with some flakiness	~16%	Google (2016)
Test runs reporting a flaky result	~1.5%	Google (2016)
Large tests flaky per week (vs 0.5% small)	14%	Google
Build failures caused by flaky + failing tests	57% (pre-automation)	Slack
CI jobs in Trunk’s 2024 flakiness analysis	20.2 million	Trunk
Teams hitting flakiness (2022 → 2025)	10% → 26%	Bitrise

How much do flaky tests cost?

The cost is engineering time and lost trust. Slack measured roughly 28 minutes of manual triage per failed test — multiplied across a codebase with 16,000+ Android and 11,000+ iOS automated tests and 120+ developers opening 550+ pull requests a week, that is an enormous tax. After automating flaky-test suppression, Slack reported recovering about 553 hours of triage time and lifting main-branch stability from roughly 20% to 96%.
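To translate the per-failure figure into a budget for your own team, a back-of-the-envelope estimate is enough. The sketch below uses Slack's ~28 minutes as a default; the failure count in the usage line is a hypothetical input, not a number from any of the sources above.

```python
def triage_hours(failed_tests: int, minutes_per_failure: float = 28.0) -> float:
    """Estimated hours of manual triage for a given number of failed tests.

    The 28-minute default is Slack's reported average; your own figure
    may differ, so pass a measured value if you have one.
    """
    if failed_tests < 0 or minutes_per_failure < 0:
        raise ValueError("inputs must be non-negative")
    return failed_tests * minutes_per_failure / 60.0


if __name__ == "__main__":
    # Hypothetical: 150 failed tests needing triage in one week.
    print(f"{triage_hours(150):.0f} hours of triage")
```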

The less measurable cost is trust. Once a suite cries wolf often enough, engineers start ignoring red builds — and a real regression slips through behind the noise. That’s why Google frames flakiness as one of the main challenges of automated testing, not just a nuisance.

What causes flaky tests?

The canonical taxonomy comes from Luo et al.’s 2014 “An Empirical Analysis of Flaky Tests”, which studied commits that fixed flaky tests across dozens of open-source projects and identified ten categories of root cause. The single most common is asynchronous waits — tests that don’t wait properly for an operation to complete, about 45% of the flaky-test fixes they analyzed — followed by concurrency (race conditions, deadlocks) and test-order dependencies (tests that assume a particular execution order). The remaining categories include resource leaks, network, time, I/O, randomness, floating-point, and unordered collections.
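To make the async-wait category concrete, here is a minimal sketch of the usual fix: replace a fixed sleep with a bounded wait on the condition itself. The helper `wait_until` and the background job are hypothetical stand-ins written for this example, not code from Luo et al. or from any particular test framework.

```python
import threading
import time
from typing import Callable


def wait_until(predicate: Callable[[], bool], timeout: float = 2.0,
               interval: float = 0.01) -> bool:
    """Poll `predicate` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return predicate()  # one final check at the deadline


def start_background_job(done: threading.Event, delay: float) -> threading.Timer:
    """Hypothetical async operation: sets `done` after roughly `delay` seconds."""
    worker = threading.Timer(delay, done.set)
    worker.start()
    return worker


def test_job_completes() -> None:
    done = threading.Event()
    worker = start_background_job(done, delay=0.05)
    # Flaky pattern: time.sleep(0.05) followed by assert done.is_set().
    # It passes only when the scheduler happens to run the job in time.
    # Safer pattern: wait on the condition, capped by a generous timeout.
    assert wait_until(done.is_set, timeout=2.0)
    worker.join()
```

The timeout only bounds the worst case, so the test still finishes quickly when the job is fast. With a `threading.Event` specifically, `done.wait(timeout=2.0)` would do the same job; the polling helper is shown because it works for any condition.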

Every one of these is a form of non-determinism: the result depends on something the test doesn’t control. Martin Fowler’s guide to eradicating non-determinism in tests remains the canonical breakdown of how to fix each category. For the framework-specific versions, see our guides for Playwright, Cypress, pytest, and Jest & Vitest.

How developers experience flakiness

Flaky tests aren’t a fringe concern. In Eck et al.’s 2019 study, “Understanding Flaky Tests: The Developer’s Perspective,” a survey of 121 professional developers (median five years of industry experience) found that flakiness is perceived as significant by the vast majority of developers, regardless of team size or project domain. The same study had 21 developers classify 200 flaky tests they had previously fixed — grounding the taxonomy in tests engineers actually dealt with. A larger 2022 survey by Gruber and Fraser reached 335 developers and came to the same conclusion: flaky tests are a common and serious problem across domains.

How the biggest engineering teams handle it

The pattern across Google, Slack, Meta, and Uber is the same: measure flakiness from history, then act on it. You don’t eliminate it, you manage it.

Google quarantines flaky tests and tracks their rate continuously (the 16% / 1.5% figures come from that ongoing monitoring).
Slack built automated detection and suppression, driving flaky build-job failures from ~57% to under 5% and stability from ~20% to 96%.
Meta built a probabilistic flakiness score to quantify how flaky each test is over time, rather than treating “flaky” as a binary label.
Uber runs continuous detection across its monorepos — about 1,000 flaky tests out of 600K in its Go monorepo — while validating 2,500+ code changes a day.
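The idea of a continuous score rather than a binary label can be illustrated with a simple smoothed estimate. To be clear, this is not Meta's model, which the source describes only as probabilistic; it is a sketch of one common approach (the posterior mean of a Beta prior), and the prior values and the definition of a "flaky run" are assumptions chosen for the example.

```python
def flakiness_score(flaky_runs: int, total_runs: int,
                    prior_flaky: float = 1.0, prior_clean: float = 19.0) -> float:
    """Smoothed estimate of how often a test flakes, between 0 and 1.

    A "flaky run" is assumed here to be a run that failed and then passed
    on retry with no code change. The Beta(prior_flaky, prior_clean) prior
    (an assumed 5% baseline) keeps tests with little history from being
    scored on one or two unlucky runs.
    """
    if total_runs < 0 or not 0 <= flaky_runs <= total_runs:
        raise ValueError("need 0 <= flaky_runs <= total_runs")
    return (flaky_runs + prior_flaky) / (total_runs + prior_flaky + prior_clean)


if __name__ == "__main__":
    # Same 50% observed rate, very different amounts of evidence.
    print(round(flakiness_score(1, 2), 2))     # little history: about 0.09
    print(round(flakiness_score(50, 100), 2))  # long history: about 0.43
```

A score like this can be recomputed over a rolling window, which makes it possible to rank tests and watch a trend instead of arguing over whether a single test "is flaky".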

What you can do about it

The takeaway from the data is a four-step lifecycle, the same one the big teams use: detect flaky tests from their pass/fail history (you can’t tell from a single run), quarantine the worst so they stop blocking releases, fix the underlying non-determinism, and measure your flake rate over time so the work is trackable. Our complete guide to flaky tests walks the full lifecycle, how to detect flaky tests automatically covers the detection step, and the best tools for debugging flaky tests roundup compares the options.
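The detect and quarantine steps can be sketched from nothing more than a log of test results. The example below flags a test as flaky when the same commit produced both a pass and a fail, then lists tests over a threshold for quarantine. The `Run` record, the function names, and the 20% threshold are all hypothetical choices for illustration, not how any of the teams above or any specific tool implements it.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple


class Run(NamedTuple):
    test: str      # test identifier
    commit: str    # commit the run executed against
    passed: bool   # outcome of this run


def find_flaky(runs: Iterable[Run]) -> dict[str, float]:
    """Map each flaky test to the share of its commits with mixed outcomes.

    A test that always fails on a commit is treated as a real failure,
    not a flaky one, so it is left out of the result.
    """
    outcomes: dict[str, dict[str, set[bool]]] = defaultdict(lambda: defaultdict(set))
    for run in runs:
        outcomes[run.test][run.commit].add(run.passed)

    rates: dict[str, float] = {}
    for test, by_commit in outcomes.items():
        mixed = sum(1 for seen in by_commit.values() if len(seen) == 2)
        if mixed:
            rates[test] = mixed / len(by_commit)
    return rates


def to_quarantine(rates: dict[str, float], threshold: float = 0.2) -> list[str]:
    """Tests whose mixed-outcome rate meets the (assumed) threshold."""
    return sorted(test for test, rate in rates.items() if rate >= threshold)


if __name__ == "__main__":
    history = [
        Run("test_login", "abc1", True), Run("test_login", "abc1", False),
        Run("test_login", "abc2", True),
        Run("test_export", "abc1", False), Run("test_export", "abc1", False),
    ]
    print(to_quarantine(find_flaky(history)))
```

Keying on the commit is what separates flakiness from regressions: a single red run is ambiguous, while a pass and a fail on identical code is not.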

Qualflare does the detection and measurement automatically: it scores every test’s reliability from its CI history, clusters failures by root cause, and tracks a 90-day flakiness trend — the history-based approach the data shows actually works.

Start free with Qualflare — upload your CI results and see your own flaky-test rate and the worst offenders within minutes.

Frequently asked questions

How common are flaky tests?

Very. Google reported that almost 16% of its tests have some level of flakiness and that about 1.5% of all test runs report a flaky result. At Slack, flaky and failing automated tests accounted for 57% of build failures before they automated suppression.

How much do flaky tests cost?

They cost engineering time and release confidence. Slack measured roughly 28 minutes of manual triage per failed test, and its flaky-test automation recovered about 553 hours of triage time. More broadly, flaky failures erode trust in the suite, so real regressions get dismissed as “probably flaky.”

What causes flaky tests?

Luo et al.’s foundational analysis catalogued ten causes; the single most common is asynchronous waits (tests not waiting properly for an operation to finish), followed by concurrency (race conditions, deadlocks) and test-order dependencies. All are forms of non-determinism.

What share of build failures are caused by flaky tests?

It varies by org and suite, but it’s a large share: Slack traced 57% of build failures to flaky and failing tests, and Google frames flakiness as one of the main challenges of automated testing precisely because flaky failures are a big fraction of the failures engineers investigate.

Can you eliminate flaky tests entirely?

At scale, no — you manage flakiness continuously rather than eliminating it. Google treats it as an ongoing program; Slack’s automation drove flaky build-job failures from ~57% to under 5% and lifted main-branch stability from ~20% to 96%, but the work is continuous, not one-and-done.

How do large engineering teams measure flakiness?

From history across many runs, not a single result. Meta built a probabilistic flakiness score to quantify how flaky each test is over time; Google and Slack track per-test flake history to detect, quarantine, and suppress the worst offenders.