Three years ago, I wrote about auto-ticketing flaky specs (dear lord, please slow down time maybe a bit? kthx).
Since then, I’ve been loosely consulting on a small-to-medium Rails project that was struggling with technical debt, one aspect of which was a flaky test suite. The project had 231 flaky test reports, around 6,600 unit tests, and 342 feature specs running on RSpec, Capybara, and Selenium-Chrome.
Eventually, it reached 0 (zero) flaky tests, which, I think, makes it the only “production-grade” project with such a record I have ever seen. So, one might ask: how can I get there?
Time
It just takes time. Flaky unit tests are usually easy to reason about; the biggest chunk of work went into feature specs, because that area is the Wild Wild West. One example was a rogue request from another container leaking into an unrelated test: there was a tusd container handling uploads, and a test that didn’t properly wait for its upload to finish, so the resulting webhook would fail some subsequent, random test. Other issues included custom smooth-scroll logic causing clicks to land on unintended elements, and JavaScript components not being mounted before the test interacted with the page. The list goes on.
Trying a different driver
I tried Ferrum, via its Capybara driver, Cuprite. It won’t automagically speed up your test suite, but a different driver may surface different problems, because Ferrum talks to Chrome directly over the Chrome DevTools Protocol on a WebSocket. That alone is a big win, since you no longer have to install a matching version of chromedriver.
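If you want to try the switch yourself, a minimal registration might look roughly like the sketch below. It assumes the cuprite gem is in your Gemfile’s test group and that the file is loaded from rails_helper.rb; the spec/support/cuprite.rb location and the specific option values are my assumptions, and the options shown are only a small subset of what Cuprite and Ferrum accept.

```ruby
# spec/support/cuprite.rb (assumed location; require it from rails_helper.rb)
require "capybara/cuprite"

# Register :cuprite with explicit options (this replaces any default
# registration). Cuprite hands these options on to Ferrum, which launches
# Chrome itself, so no chromedriver binary is involved.
Capybara.register_driver(:cuprite) do |app|
  Capybara::Cuprite::Driver.new(
    app,
    window_size: [1200, 800], # fixed viewport so layout-dependent clicks stay stable
    process_timeout: 15       # seconds to wait for Chrome to boot on slow CI machines
  )
end

# Examples tagged js: true now run on Cuprite instead of Selenium.
Capybara.javascript_driver = :cuprite
```

Keeping the viewport fixed is a deliberate choice here: several of the problems above (smooth scrolling, clicks landing on the wrong element) depend on layout, so a stable window size removes one source of variation between local runs and CI.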
For example, it turned out that many Stimulus-related tests now failed because they tried to click a button before its Stimulus controller had connected. In that case, I would suggest rendering the problematic button as disabled and enabling it in the controller’s connect() method. In practice, the end user shouldn’t notice, and Capybara will wait for the button to become enabled when you call click_on. That simple change fixes the test without any weird wait-for/sleep hacks.
Speaking of which: if you’re explicitly waiting for elements or interactions outside Capybara’s built-in waiting behavior, chances are you’re doing something wrong, or your UI is missing a visible state change you could assert on.
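The disabled-button trick works because click_on only matches enabled buttons by default, and Capybara’s finders keep retrying until Capybara.default_max_wait_time (2 seconds by default) runs out. The toy model below is plain Ruby with no Capybara involved: FakeButton, with_retries, and click_on here are hypothetical stand-ins that illustrate the retry-until-timeout idea, not Capybara’s actual implementation.

```ruby
# Toy model (NOT Capybara's code): a button rendered disabled that a
# simulated Stimulus connect() enables a little later.
class FakeButton
  attr_reader :clicks

  def initialize
    @mutex = Mutex.new
    @disabled = true
    @clicks = 0
  end

  # Stands in for the controller's connect() removing the disabled attribute.
  def connect!
    @mutex.synchronize { @disabled = false }
  end

  def disabled?
    @mutex.synchronize { @disabled }
  end

  def click
    @mutex.synchronize { @clicks += 1 }
  end
end

class ElementNotFound < StandardError; end

# Hypothetical stand-in for Capybara's synchronize: re-run the block until it
# stops raising ElementNotFound or max_wait seconds have passed.
def with_retries(max_wait: 2.0, interval: 0.05)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + max_wait
  begin
    yield
  rescue ElementNotFound
    raise if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
    sleep interval
    retry
  end
end

# Like the real click_on, a disabled button does not count as a match.
def click_on(button, **opts)
  with_retries(**opts) do
    raise ElementNotFound, "no enabled button yet" if button.disabled?
    button.click
  end
end

button = FakeButton.new
connector = Thread.new { sleep 0.2; button.connect! } # controller connects "later"
click_on(button) # retries quietly until connect! has run
connector.join
puts button.clicks # prints 1
```

Note that no sleep appears in the test itself: the waiting is driven by a condition (the button becoming enabled), which is exactly what a fixed sleep cannot guarantee.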
In the end, the project stayed on Ferrum, since the switch also got rid of Selenium’s “invalid session id” exceptions, which had proven really difficult to debug in CI.
Logging
Some feature specs can feel impossible to debug. Ferrum produces very detailed logs that you can feed to your favorite LLM for cross-referencing, and in some cases that works wonders. Capybara::Cuprite::Driver accepts a logger, so you can save these logs as CI artifacts whenever a test fails. Then just ask your agent to analyze them, and you might get lucky.
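Ferrum documents its logger option as any object responding to puts, and Cuprite passes driver options through to Ferrum. One way to get per-test logs, assuming you only care about failures, is to buffer everything in memory and write a file only when an example fails. BufferedBrowserLog and the tmp/browser_logs directory are hypothetical names, and the RSpec wiring is left in comments because it needs the cuprite gem.

```ruby
require "fileutils"
require "tmpdir"

# Hypothetical in-memory sink for Ferrum's debug output. Ferrum only needs
# its logger to respond to #puts, so a plain object is enough.
class BufferedBrowserLog
  def initialize
    @lines = []
    @mutex = Mutex.new # Ferrum may call #puts from a background thread
  end

  # Called by Ferrum for every message it logs.
  def puts(message)
    @mutex.synchronize { @lines << message.to_s }
  end

  # Forget the previous example's output.
  def clear
    @mutex.synchronize { @lines.clear }
  end

  # Write the buffer to dir/<sanitized name>.log and return the file path.
  def dump(dir, name)
    FileUtils.mkdir_p(dir)
    path = File.join(dir, "#{name.gsub(/[^\w-]+/, '_')}.log")
    contents = @mutex.synchronize { @lines.join("\n") }
    File.write(path, contents)
    path
  end
end

# Wiring (commented out because it needs the cuprite gem and RSpec):
#   BROWSER_LOG = BufferedBrowserLog.new
#   Capybara.register_driver(:cuprite) do |app|
#     Capybara::Cuprite::Driver.new(app, logger: BROWSER_LOG)
#   end
#   RSpec.configure do |config|
#     config.before(:each, js: true) { BROWSER_LOG.clear }
#     config.after(:each, js: true) do |example|
#       # Only failing examples produce a file for CI to upload as an artifact.
#       BROWSER_LOG.dump("tmp/browser_logs", example.full_description) if example.exception
#     end
#   end

# Standalone usage:
log = BufferedBrowserLog.new
log.puts("-> Page.navigate")
log.puts("<- Page.frameNavigated")
dir = Dir.mktmpdir
path = log.dump(dir, "checkout flow fails")
puts File.basename(path) # prints checkout_flow_fails.log
```

Buffering in memory keeps passing runs cheap; only the failing examples leave a file behind, which also keeps the artifact small enough to hand to an LLM.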
Persistence
When you have 100+ flaky tests, the whole thing seems overwhelming. Go back to the first point and remember: it takes time. Treat it as a marathon, take the easy wins, and fix one test a week; at around 100 flaky tests, that gets you done in about two years. Having no flaky specs feels really good, and when something new does surface, there is much less resistance to addressing it.
I know some companies just love heroic efforts, but solving this isn’t about a single weekend of slopfixing. It’s about a deliberate, continuous effort toward a clear goal.