QA @ Square: Mobile QA processes that scale (Part 2)

January 23, 2019

This is part two of a two-part interview. Read part one on developing mobile QA processes for customer reliability.

Last week, we shared part one in our interview with Square. We discussed how they create mobile QA processes that ensure reliability for their millions of customers and that scale as they add new products, apps, and features.

For part two, we talk specifically about their tools for scaling, how they monitor and report bugs, and how they measure success. Read on for more!

Can you tell me about your unique work with Firebase?

Google has a suite of tools for mobile development called Firebase. One of the tools is Firebase Test Lab, which allows you to request both emulators and physical devices through their APIs and run tests against them.

This year, we made a full transition from hosting our own emulators and building out our own infrastructure to running everything on Firebase. The primary reason was scalability. We have a lot of UI tests at this point. We were outpacing the rate at which we could rack new servers and get them spun up. Our team members would have to wait for our servers to become available to run their UI tests. Google doesn’t have that problem, it turns out.

Our CI infrastructure takes the compiled app and all the tests and sends them over to Firebase. When we do that, we cut up all the test cases into buckets and distribute them across 100 different emulators at a time. That way, we get results back much faster than if the tests ran one after another.

I [Chris] think we have close to 24 to 30 hours’ worth of tests that run on every pull request. Obviously, we don’t want our developers sitting there for 30 hours waiting for a result. So we split it up and send it out to Firebase to do all that testing for us.
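To make the bucketing idea concrete, here is a minimal sketch of round-robin test sharding. It is not Square's CI code and does not call any Firebase API; the function names `shard_tests` and `ideal_wall_clock_minutes` and the test names are hypothetical. The estimate assumes the work divides perfectly evenly, which real test suites rarely do.

```python
from typing import List, Sequence


def shard_tests(test_names: Sequence[str], shard_count: int) -> List[List[str]]:
    """Distribute tests round-robin across shard_count buckets (hypothetical helper)."""
    if shard_count < 1:
        raise ValueError("shard_count must be at least 1")
    buckets: List[List[str]] = [[] for _ in range(shard_count)]
    # Sorting first keeps the assignment stable from one run to the next.
    for index, name in enumerate(sorted(test_names)):
        buckets[index % shard_count].append(name)
    return buckets


def ideal_wall_clock_minutes(total_test_hours: float, shard_count: int) -> float:
    """Best-case wall-clock time if the total work splits perfectly evenly."""
    return total_test_hours * 60 / shard_count


if __name__ == "__main__":
    # Made-up test names, for illustration only.
    tests = [f"UiTest{i:04d}" for i in range(1000)]
    buckets = shard_tests(tests, 100)
    print(len(buckets), len(buckets[0]))      # 100 buckets of 10 tests each
    print(ideal_wall_clock_minutes(30, 100))  # 18.0 minutes in the ideal case
```

Under that even-split assumption, 30 hours of tests spread across 100 emulators comes to about 18 minutes of waiting. Round-robin is only the simplest strategy; balancing buckets by historical test duration would usually get closer to the ideal.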

If we decided we need 110 emulators tomorrow, all we have to do is change a number real quick, and they give us more emulators. Although there’s a cost associated with that and we lose a little bit of control, the return on the time saved is worth it.

The Firebase analytics and UI are nice too. They provide activity logs of all the emulators that ran and a video of your running tests. That made a big difference because sometimes it’s hard to find what went wrong when a test fails. But with a video, we can see if the test was just on the wrong screen or something. We used to have screenshots, which were nice, but it’s not as easy to see what’s going on in them as it is in a video.

What does the end of your testing process look like?

For the last step, we release the new version to all of our Beta merchants. Then, if there’s no issue (if they don’t spot any major bugs), we start a percentage rollout of the build to all users, not just Beta. That means 5% of the people downloading our release app from the public Google Play Store will start getting the new version.

We have a whole set of monitoring tools for our crash rate, which watch for any spikes or newly introduced crashes. We try to correlate those with work that’s being done, such as a new feature being rolled out.

In conjunction with that, we also have a good set of tools for feature flagging the different features being released. You may get the new app, but the part of it that provides some new functionality may not be turned on for you.

If we do turn a feature on and see a massive spike in crashes, we can quickly roll that back in seconds without doing a new app release or server deploy. It’s usually just a flip of a server flag in a web app that all our devs have access to.
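As a rough illustration of that kind of server-side kill switch, here is a minimal sketch. The `FlagStore` class, the `new_checkout_flow` flag name, and the `checkout_screen` function are all hypothetical; an in-memory dictionary stands in for the real flag service, whose design the interview doesn’t describe.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class FlagStore:
    """Hypothetical in-memory stand-in for a server-side flag service."""

    flags: Dict[str, bool] = field(default_factory=dict)

    def set_flag(self, name: str, enabled: bool) -> None:
        self.flags[name] = enabled

    def is_enabled(self, name: str) -> bool:
        # Unknown flags default to off, so newly shipped code stays dark
        # until someone deliberately turns it on.
        return self.flags.get(name, False)


def checkout_screen(store: FlagStore) -> str:
    """The client checks the flag at runtime instead of baking the choice into the build."""
    if store.is_enabled("new_checkout_flow"):
        return "new checkout"
    return "existing checkout"


if __name__ == "__main__":
    store = FlagStore()
    print(checkout_screen(store))   # existing checkout (flag not set yet)
    store.set_flag("new_checkout_flow", True)
    print(checkout_screen(store))   # new checkout
    store.set_flag("new_checkout_flow", False)  # the "roll back in seconds" step
    print(checkout_screen(store))   # existing checkout
```

Because the decision is read at runtime, turning the flag off changes behavior without shipping a new build, which is the property the rollback relies on.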

We won’t go into full release until we’ve achieved a crash rate low enough to meet our standards.

What crash rate is ideal, realistically?

Crash rate is an interesting metric for us. Most companies base crash rate off of user sessions. But the way our merchants use our app is different from most apps. They’re usually sitting in the app for hours at a time and it’s always running. Whereas if you have Twitter, for instance, you’ll check it for five minutes and that’s one session for them. So they monitor crash rate that way.

Instead of monitoring crash rate per session, we measure it against payments. I think that’s important for companies to recognize: session crash rate is not necessarily always the most important metric. For us, if our app crashes one time within five hours or after 6,000 transactions, that’s not so bad.
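To see why the choice of denominator matters, here is a small sketch comparing the two ways of normalizing the same crash. The function name `crashes_per_thousand` is hypothetical, and the single long session is an illustrative assumption built from the 6,000-transaction figure above, not real Square data.

```python
def crashes_per_thousand(crash_count: int, unit_count: int) -> float:
    """Crashes per 1,000 units, where a unit is a session or a payment."""
    if unit_count <= 0:
        raise ValueError("unit_count must be positive")
    return 1000 * crash_count / unit_count


if __name__ == "__main__":
    # Illustrative assumption: one long merchant session that processes
    # 6,000 payments and crashes once.
    crashes, sessions, payments = 1, 1, 6000
    print(crashes_per_thousand(crashes, sessions))  # 1000.0 per 1,000 sessions
    print(crashes_per_thousand(crashes, payments))  # roughly 0.17 per 1,000 payments
```

Measured per session, that single crash looks like a 100% failure rate; measured per payment, it is a small fraction of one crash per thousand, which better reflects what the merchant experienced.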

With it being so critical that bugs are minimal for your users, how do you safeguard your QA process?

Let’s say we find an issue with our release that needs fixing in the app and it’s not something we can roll back via a server-side flag. We have what’s called the Change Control Board or CCB, which is exclusively for fixing bugs. It’s not meant for new features or anything like that. It’s a review process the engineer goes through before they can merge a fix or change to that release branch.

We try very hard to ensure that our product managers and engineers understand that once we cut a release branch, it should be stable and no new code should be introduced. But sometimes, we have to fix things missed earlier in the process. That’s where the CCB comes in.

Our CCB pull request process is:

Get approval of your pull request from two people (rather than the standard one).
Fill out a form explaining why the feature needs to be fixed now rather than waiting and what the impact is.

A good example is if the engineer messed up the logic for the feature flag. Maybe they didn’t wire it up correctly and now every merchant is getting a feature that’s not working. That would be a case where they need to make a CCB. They need to go in there and fix the logic or hard turn it off and wait until the next release train to roll out their feature.

How do those bugs or flaws sneak through the whole testing process?

To be honest, we probably forgot to test an area of the code or we didn’t think about an edge case. One of the things that comes with this type of fix is that our engineers also have to write a test that prevents the bug from being introduced again.

How do you “eat your own dog food” or test your own product as users at Square?

It’s tough because we’re not just software, we’re hardware too and most of our team members aren’t merchants. Our full-time jobs are here at Square. As a solution, we’ve introduced Square payments at lunch. We have two cafeterias onsite and a coffee bar.

Our QA analysts are there with them to help them with any issues and to report any bugs. That way, we make sure our team is using Square from the consumer end on a daily basis too.

Any other final challenges or lessons to add?

We’ve started doing automated hardware testing at Square. We used to only have the reader that connects to your phone. But now, we have actual Android devices with the reader built in. We have two, the Square Register and the Square Terminal. We have test racks that test the software with the hardware to stage payments. It’s an interesting end-to-end test. Any change to the firmware goes through that to make sure the whole thing still works. It’s kind of fun to watch too.

How do you do QA at your company?

We want to hear your stories! Send a note to shannon@waldo.io if you’d like to share your QA process with Waldo readers. You can also learn more about Waldo and give it a free trial.