Building Fast & Resilient Web Applications

You've applied all the best practices, set up audits and tests to detect performance regressions, released the new application to the world, and... lo and behold, the telemetry is showing that despite your best efforts, there are still many users—including those on "fast devices" and 4G networks—that are falling off the fast path: janky animations and scrolling, slow loading pages and API calls, and so on. Frustrating. There must be something wrong with the device, the network, or the browser—right?

Maybe there is. There is an infinite supply of reasons why the application can fall off the fast path: overloaded networks and servers, transient network routing issues, device throttling due to energy or heat constraints, competition for resources with other processes on the user's device, and the list goes on and on. It is impossible to anticipate all the edge cases that can knock our applications off the fast path, but one thing we know for certain: they will happen. The question is, how are you going to deal with them?

Carving out the fast path is not enough. We need to make our applications resilient.

Resilient applications provide guardrails that protect our users from the inevitable performance failures. They anticipate these problems ahead of time, have mechanisms in place to detect them, know how to adapt to them at runtime, and as a result, are able to deliver a reliable user experience despite these complications.

I won't rehash every point in the video, but let's highlight the key themes:

  1. (9m3s) Seemingly small amounts of performance variability in critical components quickly add up to create less than ideal conditions. We must design our systems to detect and deal with such cases: for example, set explicit SLAs on all requests and specify upfront how violations will be handled.

  2. (16m28s) The "performance inequality" gap is growing. There are two market forces at play: there is a race for features and performance, and there is high demand for lower prices. These are not entirely at odds: cheap devices are also getting faster, but the flagships are racing ahead at a much faster pace.

  3. (19m45s) "Fast" devices show spectacular peak performance in benchmarks, but real-world performance is more complicated: we often have to trade off raw performance against energy costs and thermal constraints, compete for shared resources with other applications, and so on.

  4. (23m35s) Mobile networks provide an infinite supply of performance entropy, regardless of the continent, country, and provider. For example, the chances of a device connecting to a 4G network in some of the largest European countries are effectively a coin flip, and just because you "have a signal" doesn't mean the connection will succeed; see "Resilient Networking".
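
To make the first theme concrete, here is a minimal TypeScript sketch of putting an explicit time budget on a request and deciding upfront what happens when the budget is violated. The names `withDeadline` and `Outcome`, and the "deadline" and "error" reasons, are illustrative assumptions rather than anything prescribed in the talk; the idea is only that every request carries a budget and a predetermined fallback.

```typescript
// Result of a request that was given an explicit time budget.
type Outcome<T> =
  | { status: "ok"; value: T }
  | { status: "fallback"; value: T; reason: "deadline" | "error" };

// Sentinel that lets us tell "the timer won" apart from any task result.
const DEADLINE = Symbol("deadline");

// Run `task` with a time budget. If the budget is exceeded or the task
// fails, return the caller-supplied fallback and say why.
async function withDeadline<T>(
  task: (signal: AbortSignal) => Promise<T>,
  budgetMs: number,
  fallback: () => T,
): Promise<Outcome<T>> {
  const controller = new AbortController();
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<typeof DEADLINE>((resolve) => {
    timer = setTimeout(() => resolve(DEADLINE), budgetMs);
  });

  try {
    const winner = await Promise.race([task(controller.signal), deadline]);
    if (winner === DEADLINE) {
      // Budget exceeded: signal the task to stop and serve the fallback.
      controller.abort();
      return { status: "fallback", value: fallback(), reason: "deadline" };
    }
    return { status: "ok", value: winner as T };
  } catch {
    // The task itself failed (network error, bad response, and so on).
    return { status: "fallback", value: fallback(), reason: "error" };
  } finally {
    clearTimeout(timer);
  }
}
```

In a browser, a call might look like `withDeadline((signal) => fetch(url, { signal }).then((r) => r.json()), 300, () => cachedValue)`, where `url`, the 300 ms budget, and `cachedValue` are placeholders. Returning the reason alongside the value makes it possible to report violations to telemetry instead of hiding them.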

If we ignore the above and only optimize for the fast path, we shouldn't be surprised when the application goes off the rails, and our users complain about unreliable performance. On the other hand, if we accept the above as "normal" operational constraints of a complex system, we can engineer our applications to anticipate these challenges, detect them, and adapt to them at runtime (31m39s):

  1. Treat offline as the norm.
  2. All requests must have a fallback.
  3. Use available APIs to detect device & network capabilities.
  4. Adapt application logic to match the device & network capabilities.
  5. Observe real-world performance (runtime, network) at runtime, goto(4).
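
Steps 3 to 5 can be sketched as a small decision function. The TypeScript below is a rough illustration under stated assumptions: the `Tier` names, the memory and CPU thresholds, the set of "slow" connection types, and the 1000 ms latency budget are all made up for the example and should be tuned against real telemetry. The fields read in `readSignals` mirror `navigator.onLine`, `navigator.hardwareConcurrency`, `navigator.deviceMemory`, and `navigator.connection`; the last two are not available in every browser, which is why every field is optional.

```typescript
type Tier = "offline" | "lite" | "full";

// Subset of the browser `navigator` object this sketch reads.
interface NavigatorLike {
  onLine?: boolean;
  hardwareConcurrency?: number;
  deviceMemory?: number;
  connection?: { effectiveType?: string; saveData?: boolean };
}

interface Signals {
  online: boolean;
  effectiveType?: string;
  saveData?: boolean;
  deviceMemoryGb?: number;
  cpuCores?: number;
}

// Step 3: detect capabilities, tolerating missing APIs.
function readSignals(nav?: NavigatorLike): Signals {
  return {
    // Assumption: with no information we still try the network.
    online: nav?.onLine ?? true,
    effectiveType: nav?.connection?.effectiveType,
    saveData: nav?.connection?.saveData,
    deviceMemoryGb: nav?.deviceMemory,
    cpuCores: nav?.hardwareConcurrency,
  };
}

function median(samples: number[]): number | undefined {
  if (samples.length === 0) return undefined;
  const sorted = [...samples].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Assumed set of connection types treated as slow.
const SLOW_TYPES = new Set(["slow-2g", "2g"]);

// Steps 4 and 5: pick an experience tier from the hints, and let observed
// latencies override optimistic hints when they exceed the budget.
function chooseTier(
  s: Signals,
  recentLatenciesMs: number[] = [],
  budgetMs = 1000,
): Tier {
  if (!s.online) return "offline";
  const observed = median(recentLatenciesMs);
  if (observed !== undefined && observed > budgetMs) return "lite";
  if (s.saveData) return "lite";
  if (s.effectiveType !== undefined && SLOW_TYPES.has(s.effectiveType)) return "lite";
  if ((s.deviceMemoryGb ?? Infinity) < 2 || (s.cpuCores ?? Infinity) <= 2) return "lite";
  return "full";
}
```

In a page, this could be driven by `chooseTier(readSignals(navigator as NavigatorLike), recentLatencies)`, re-evaluated whenever new measurements arrive. That re-evaluation is the "goto(4)" loop: the static hints are only a starting guess, and the observed latencies get the final say.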

Ilya Grigorik is a web ecosystem engineer, author of High Performance Browser Networking (O'Reilly), and Principal Engineer at Shopify.