
Detecting flaky tests might be the hardest part

October 6, 2021


Today we are excited to release flakiness detection in Waldo.

For those not familiar with Waldo: our product allows nontechnical people to create, maintain and orchestrate automated tests for their mobile apps right from the web browser. Waldo is built as a “script-less” automation engine. This means that end users are not burdened with writing scripts, nor does Waldo conceptualize tests as scripts internally. This leads to interesting benefits, including a different way to address the problem of test flakiness.

The problem

Flakiness is probably the worst thing that can happen to automated end-to-end testing. A “flaky” test is simply one that yields different outcomes when played multiple times.

Flakiness causes most teams to doubt their automation results and leads them to consider errors to be “probably because of a blip or something”.

Because Waldo is an automation platform, we have always had the problem of flakiness in mind, and we see 2 ways to address it:

Reduce the likelihood of tests being flaky: Waldo has supported this from day 1. Indeed, this was one of the primary motivations for developing a purely script-less automation engine. A script is typically a hardcoded sequence of instructions describing how to replay an interaction: “after x amount of time, click on the button with ID click_me”. Wait times in particular are a major cause of flakiness, even more so in true end-to-end testing, where the app communicates with a real backend server. Waldo is instead an adaptive automation engine: it learns how to replay a particular test from an ever-growing library of previous successful runs of that test. You can think of recording a test as seeding this library.
Offer an experience around flaky tests: No matter how hard we try, flaky tests will happen. Even if Waldo were a perfect automation engine (and it is not, at least not yet), there is only so much that it can control. If the app is doing A/B testing, or has a race condition on a screen, or its backend is down for a few minutes while a test is running, any of these can cause flakiness as well.

We are just now releasing that second part, and we think it is a fundamental addition to the Waldo platform and to how people evaluate their test results.
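
To make the first point concrete, here is a minimal sketch of why a hardcoded wait time tends to produce flaky results. It is not how Waldo works internally: FakeScreen, fixed_wait_script and the load times are hypothetical names and numbers invented for this illustration. The same script is replayed against a screen whose button appears after a varying delay, as it would with a real backend.

```python
from dataclasses import dataclass


@dataclass
class FakeScreen:
    """Hypothetical stand-in for an app screen (not a Waldo API)."""
    button_ready_at: float  # seconds after the screen opens

    def button_visible(self, now: float) -> bool:
        return now >= self.button_ready_at


def fixed_wait_script(screen: FakeScreen, wait: float) -> bool:
    """Script-style replay: wait a hardcoded time, then try to click once.

    Returns True if the button was there to be clicked.
    """
    return screen.button_visible(wait)


# Made-up backend response times, in seconds, for five replays.
load_times = [0.8, 1.2, 2.5, 0.9, 3.1]

# The script always waits 2 seconds before clicking.
outcomes = [fixed_wait_script(FakeScreen(t), wait=2.0) for t in load_times]
print(outcomes)  # [True, True, False, True, False]
```

Nothing about the test changed between replays, yet it passes three times and fails twice, which is exactly the definition of a flaky test given earlier.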

How Waldo detects flaky tests

Most importantly, Waldo needs to be able to tell that a test is flaky.

Before this change, Waldo would partition automation results into 2 conceptual categories:

success: the test passes
error: the test fails

This meant that any time a test resulted in an error, one could understandably raise an eyebrow: “Wait, is that test really broken?”

Instead, Waldo now partitions automation results into 3 categories:

success: the test consistently passes
error: the test consistently fails, and it always fails for the same reason
flaky: the test has erratic behavior

This immediately provides 2 important benefits:

An error now really means that something has consistently diverged from the expected path.
When a test is deemed flaky, you can dive into the details to understand and potentially fix the cause of its flakiness.
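
The three categories above can be sketched in a few lines. This is only an illustration of the partition as described here, not Waldo's actual implementation: the Verdict enum, the classify function, and the convention that each replay is recorded as None for a pass or a failure-reason string are all assumptions made for the example. It also assumes that a test failing for different reasons across replays counts as flaky.

```python
from enum import Enum
from typing import Optional, Sequence


class Verdict(Enum):
    SUCCESS = "success"
    ERROR = "error"
    FLAKY = "flaky"


def classify(outcomes: Sequence[Optional[str]]) -> Verdict:
    """Classify a test from the outcomes of several replays.

    Each outcome is None when the replay passed, or a string naming
    the failure reason when it failed (a hypothetical convention).
    """
    if not outcomes:
        raise ValueError("need at least one replay to classify a test")

    distinct = set(outcomes)
    if distinct == {None}:
        return Verdict.SUCCESS  # consistently passes
    if len(distinct) == 1:
        return Verdict.ERROR    # consistently fails, same reason each time
    return Verdict.FLAKY        # mixed passes/failures or differing reasons


print(classify([None, None, None]))                          # Verdict.SUCCESS
print(classify(["button missing", "button missing"]))        # Verdict.ERROR
print(classify([None, "button missing", None]))              # Verdict.FLAKY
print(classify(["button missing", "backend unavailable"]))   # Verdict.FLAKY
```

Under this sketch, a verdict of error is only reached when every replay fails in the same way, which is what makes it trustworthy.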

Diving into flaky tests

If a test is considered flaky, Waldo displays something like this:

The Waldo platform has always been very visual. We at Waldo believe that a picture is worth a thousand words (and probably a million cryptic lines to stdout). When it comes to a flaky test, it is no different. Waldo shows the state in plain sight, with all the different outcomes that occurred while replaying it, so that it is straightforward for you to analyze what caused the flakiness.
