For example, the duration of the fetch is a combination of network time of the request reaching the server, server processing time, and network time of the response. Each and every one of these steps "leaks" information both about the client and the server.
For example, if the total duration is very small (say, <10ms) then we can reasonably intuit that we might be talking to a local cache, which means that the client has previously fetched this resource. Alternatively, if the duration is slightly higher (say, <50ms) then we can reasonably guess that the client is on a low-latency network (e.g. fast 4G or WiFi). We can also append random data to the URL to make it unique and rule out the various HTTP caches along the way. From there, we can try making more requests to the server and observe how the fetch duration changes to infer change in server processing times and/or larger responses being sent to the client.
If we're really crafty, we can also use the properties of the network transport like CWND induced roundtrips in TCP (see TCP Slow Start), and other quirks of local network configuration, as additional signals to infer properties (e.g. size) of the response—see TIME, HEIST attacks. If the response is compressed and also happens to reflect submitted data, then there is also the possibility of using a compression oracle attack (see BREACH) to extract data from the response.
Each and every step in the fetch process—from the client generating the request and putting it on the wire, through the network hops to the server, the server processing time, the response properties, and the network hops back to the client—"leaks" information about the properties of the client, network, server, and response. This is not a bug; it's a fact of life. Borrowing an explanation from our physicist friends: putting a system to work amounts to extracting energy from it, which we can then measure and interrogate to learn facts about said system.
Eyes glazing over yet? The practical implication is that if the necessary server precautions are missing, the use of the above techniques can reveal private information about you and your relationship to that server - e.g. login status, group affiliation, and more. This requires a bit more explanation…
The fact that we can use side-channel information, such as the duration of a fetch, to extract information about the response is not, by itself, all that useful. After all, if I give you a URL you can just use your own HTTP client to fetch it and inspect the bytes on the wire. However, what does make it dangerous is if you can co-opt my client (my browser) to make an authenticated request on my behalf and inspect the (opaque) response that contains my private content. Then, even if you can't access the response directly, you can observe any of the aforementioned properties of the fetch and extract private information about my client and the response. Let's make it concrete…
Say there is kittens.com, on which I have an account to pin my favorite images: when I sign in, kittens.com responds with a private token that is used to authenticate me on future visits. Later, I head to shady.com to view more pictures of kittens... Unbeknownst to me, while I'm browsing shady.com, the page issues background requests on my behalf to kittens.com with the goal of attempting to learn something about my status on said site. How does shady.com make a credentialed request? A simple image element is sufficient:
<img src="https://kittens.com/favorites" alt="Yay authenticated kittens!">
<!-- Image element is not the only mechanism with this behavior, others
include script, object, video, etc. Also, there is JavaScript... -->
<script>
var img = new Image();
img.src = "https://kittens.com/favorites"
</script>
The browser processes the image element, initializes a request for https://kittens.com/favorites, attaches my HTTP cookies associated with kittens.com, and dispatches the request. The target server (kittens.com) sees a valid authentication cookie and dutifully sends back the HTML response containing my favorite kittens. Of course, the image tag will choke on the HTML and fire an error callback, but that doesn't matter: even though we can't inspect the response, we can still learn a lot by observing the timing of the authenticated request-response flow.
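To make this concrete, here is a hedged sketch of how shady.com might time the opaque request. The thresholds mirror the earlier intuition and are illustrative assumptions, not calibrated values; the measurement half is browser-only and shown in comments.

```javascript
// Illustrative timing classifier: the thresholds are assumptions, not
// calibrated values (a real attack would calibrate against the target).
function classifyDuration(ms) {
  if (ms < 10) return "local cache";               // resource fetched before
  if (ms < 50) return "low-latency network";       // e.g. WiFi or fast 4G
  return "higher-latency network or slow server";
}

// In the browser, the attacker page could drive it like this:
//
//   var img = new Image();
//   var start = performance.now();
//   img.onerror = function () {
//     // onerror fires even though the HTML response is opaque to us
//     console.log(classifyDuration(performance.now() - start));
//   };
//   // random suffix busts intermediate HTTP caches
//   img.src = "https://kittens.com/favorites?x=" + Math.random();
```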
With the benefit of a few decades of experience under our belt, and if we were rebuilding the web platform from scratch, we probably wouldn't allow such "no-cors" authenticated requests without explicit CORS opt-in from the server, just as we do today for XMLHttpRequest and the Fetch API. Alas, that would be a major breaking change, so it's off the table. However, not all is lost: kittens.com can deploy additional logic to protect itself, and its users, against such cross-origin attacks.
The core issue is that the browser attaches the target origin's cookies to "no-cors" requests regardless of the origin that initiates the request. In theory, the target origin could look at the Referer header, but the attacker can hide the initiating origin—e.g. via a no-referrer policy. Similarly, the Origin header is only sent on CORS requests, so that won't help either. However, SameSite cookies give us the exact behavior we want:
Here, we update [RFC6265] with a simple mitigation strategy that allows servers to declare certain cookies as "same-site", meaning they should not be attached to "cross-site" requests… Note that the mechanism outlined here is backwards compatible with the existing cookie syntax. Servers may serve these cookies to all user agents; those that do not support the "SameSite" attribute will simply store a cookie which is attached to all relevant requests, just as they do today.
SameSite cookies have two modes: "strict" and "lax". In strict mode, cookies are not sent on cross-site top-level navigations, which offers strong protection but requires some additional deployment considerations. In lax mode, cookies are sent for top-level navigations—e.g. navigations initiated by <a> elements, window.open(), <link rel=prerender>—which offers reasonable protection. Do read the IETF spec; it provides good guidance.
HTTP/1.1 200 OK
...
Set-Cookie: SID=31d4d96e407aad42; SameSite=Strict
Using our example above, if kittens.com sets the SameSite attribute on its authentication cookie, then the image request initiated by shady.com would not contain the authentication cookie, due to the mismatch between the initiating origin and the origin that set the cookie, and would result in a generic unauthenticated response—e.g. a redirect to a login page. If you're kittens.com, enabling SameSite cookies should be a no-brainer.
More generally, if your site or service does not intentionally provide cross-origin resources (e.g. embeddable widgets, site plugins, etc.), then you should use SameSite cookies as your default.
SameSite cookies are supported in Chrome (since M51) and Opera 39, and are under consideration in Firefox. Let's hope the other browsers are fast followers. Last but not least, it's worth noting that, as a user, you can also block third-party cookies in your browser to protect yourself from this type of cross-origin attack.
Maybe there is. There is an infinite supply of reasons why the application can fall off the fast path: overloaded networks and servers, transient network routing issues, device throttling due to energy or heat constraints, competition for resources with other processes on the user's device, and the list goes on and on. It is impossible to anticipate all the edge cases that can knock our applications off the fast path, but one thing we know for certain: they will happen. The question is, how are you going to deal with it?
Carving out the fast path is not enough. We need to make our applications resilient.
Resilient applications provide guardrails that protect our users from the inevitable performance failures. They anticipate these problems ahead of time, have mechanisms in place to detect them, know how to adapt to them at runtime, and as a result, are able to deliver a reliable user experience despite these complications.
I won't rehash every point in the video, but let's highlight the key themes:
(9m3s) Seemingly small amounts of performance variability in critical components quickly add up to create less-than-ideal conditions. We must design our systems to detect and deal with such cases—e.g. set explicit SLAs on all requests and specify upfront how violations will be handled.
(16m28s) The "performance inequality" gap is growing. There are two market forces at play: a race for features and performance, and high demand for lower prices. These are not entirely at odds; cheap devices are also getting faster, but the flagships are racing ahead at a much faster pace.
(19m45s) "Fast" devices show spectacular peak performance in benchmarks, but real-world performance is more complicated: we often have to trade off raw performance against energy costs and thermal constraints, compete for shared resources with other applications, and so on.
(23m35s) Mobile networks provide an infinite supply of performance entropy, regardless of the continent, country, and provider—e.g. the chances of a device connecting to a 4G network in some of the largest European countries are effectively a coin flip; just because you "have a signal" doesn't mean the connection will succeed; see "Resilient Networking".
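For example, one lightweight way to act on the first point (explicit SLAs on all requests) is to race each request against an explicit budget and handle violations deliberately. A minimal sketch, where the budget value and the fallback behavior are illustrative assumptions:

```javascript
// Race any promise-returning operation against an explicit time budget.
// If the budget is exceeded, fail fast so the app can degrade gracefully
// instead of hanging indefinitely.
function withSLA(promise, budgetMs) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error("SLA violated: " + budgetMs + "ms budget exceeded")),
      budgetMs
    );
  });
  // Clear the timer so a lost race doesn't leave a dangling rejection.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Usage (browser or Node 18+); endpoint and budget are placeholders:
//   withSLA(fetch("/api/critical"), 2000)
//     .then(handleResponse)
//     .catch(showFallbackUI); // explicit, pre-planned degraded experience
```

With fetch specifically, the same idea can be expressed with AbortController so the underlying request is actually cancelled rather than just abandoned.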
If we ignore the above and only optimize for the fast path, we shouldn't be surprised when the application goes off the rails, and our users complain about unreliable performance. On the other hand, if we accept the above as "normal" operational constraints of a complex system, we can engineer our applications to anticipate these challenges, detect them, and adapt to them at runtime (31m39s):
We're missing primitives that enable control over how and where CPU, GPU, and network resources are allocated by the browser. To the browser, all scripts look the same. To the developer, some are more important than others. Today, the web platform lacks the tools to bridge this gap, and that's at least one reason why delivering reliable performance is often an elusive goal for many.
Conceptually, the above problem is nothing new. For example, Linux control groups (cgroups) address the very same issues "higher up" in the stack: multiple processes compete for a finite number of available resources on the device, and cgroups provide a mechanism by which resource allocation (CPU, GPU, memory, network, etc.) can be specified and enforced at a per-process level - e.g. this process is allowed to use at most 10% of the CPU and 128MB of RAM, is rate-limited to 500Kbps of peak bandwidth, and is only allowed to download 10MB in total.
The problem is that we, as site developers, have no way to communicate and specify similar policies for resources that run on our sites. Today, including a script or an iframe gives it the keys to the kingdom: these resources execute with the same priority and with unrestricted access to the CPU, GPU, memory, and the network. As a result, the best we can do is cross our fingers and hope for the best.
As a thought experiment, it may be worth considering what a cgroups-like policy could look like in the browser, and what we would want to control. What follows is a handwavy sketch, based on frequent performance failure cases found in the wild and conversations with teams that have found themselves in these types of predicaments:
<!-- "background" group should receive low CPU and network priority
and consume at most 5% of the available CPU and network resources -->
<meta http-equiv="cgroup" name="background"
content="cpu-share 0.05; cpu-priority low;
net-share 0.05; net-priority low;">
<!-- "app" group should receive high CPU priority and be allowed to
consume up to 80% of available CPU resources (don't hog all of CPU),
but be allowed to consume all of the available network resources -->
<meta http-equiv="cgroup" name="app"
content="cpu-share 0.8; cpu-priority high;
net-share 1.0; net-priority high">
<!-- "ads" group should receive at most 20% of the cpu and have lower
scheduling and network priority then "app" content. -->
<meta http-equiv="cgroup" name="ads"
content="cpu-share 0.2; cpu-priority medium;
net-share 0.8; net-priority medium">
...
<!-- assign following resources to "app" group -->
<link cgroup="app" rel="stylesheet" href="/style.css">
<script cgroup="app" src="/app.js" async></script>
<!-- assign following resources to "ads" group -->
<script cgroup="ads" src="/ads-manager.js" async></script>
<iframe cgroup="ads" src="//3rdparty.com/widget"></iframe>
<!-- assign following resources to "background" group -->
<script cgroup="background" src="analytics.js" async></script>
The above is not an exhaustive list of plausible directives; don't fixate on the syntax. The key point, and question, is whether it would be useful—both to site developers and browser developers—to have such annotations communicate the preferred priorities and resource allocation strategy on their page - e.g. some scripts are more important than others, some network fetches should have lower relative priority, and so on.
Well, it may not be able to enforce them, in the strict sense of the word. For example, if a "background" script is scheduled and decides to monopolize the renderer thread and run for 20 frames, there isn't much that the runtime can do—today, at least. However, the runtime can use the provided information to decide which callback or function to schedule next, or how to prioritize the loading of resources. Some browsers may be able to do a better job of enforcing such policies than others, but even small scheduling optimizations can yield significant user-visible wins. Today, the browser is running blind.
Further, once the browser knows the "desired allocation", it can flag and warn the developer when there is a mismatch at runtime - e.g. it can fire events via PerformanceObserver to notify the app of violations, allowing the developer to gather and act on this data. In effect, this could be the first step towards enabling attribution and visibility into the real-world runtime performance and impact of various resources.
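PerformanceObserver already gives us this shape of API for real entry types such as "longtask"; a hypothetical "cgroup-violation" entry type could be consumed the same way. A sketch, with the aggregation split out so it is easy to follow:

```javascript
// Aggregate total duration by entry name, e.g. total blocking time
// attributable to each script or group, for later attribution.
function summarize(entries) {
  const byName = {};
  for (const e of entries) {
    byName[e.name] = (byName[e.name] || 0) + e.duration;
  }
  return byName;
}

// Browser-only wiring: "longtask" is a real entry type today; a
// "cgroup-violation" entry type is an assumption, not a shipped API.
if (typeof PerformanceObserver !== "undefined") {
  new PerformanceObserver(function (list) {
    console.log(summarize(list.getEntries()));
  }).observe({ entryTypes: ["longtask"] });
}
```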
Perhaps an idea worth exploring?
Except, what is an "average page", exactly? Intuitively, it is a page that is representative of the web at large in its payload size, distribution of bytes between different content types, etc. More technically, it is a measure of central tendency of the underlying distribution - e.g. for a normal distribution the average is the central peak, with 50% of values greater and 50% smaller than its value. Which, of course, raises the question: what is the shape and type of the distribution for transferred bytes, and does it match this model? Let's plot the histogram and the CDF plots...
Let's start with the obvious: the transfer size is not normally distributed, there is no meaningful "central value", and talking about the mean is misleading, if not deceiving - see "Bill Gates walks into a bar...". We need a much richer and more nuanced statistical vocabulary to capture what's going on here, and an even richer set of tools and methods to analyze how these values change over time. The "average page" is a myth.
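To see why, consider a toy example with made-up page weights: a single heavy outlier drags the mean far above what a "typical" page weighs, while the median barely moves.

```javascript
// Mean of a list of numbers.
function mean(xs) {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

// Median: middle value of the sorted list (average of the two middle
// values for even-length lists).
function median(xs) {
  const s = xs.slice().sort((a, b) => a - b);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
}

// Nine "typical" page weights (KB, made-up) plus one obese outlier:
const pagesKB = [300, 400, 500, 600, 700, 800, 900, 1000, 1100, 20000];
console.log(mean(pagesKB));   // 2630 — the misleading "average page"
console.log(median(pagesKB)); // 750 — what a typical page actually weighs
```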
Coming up with a small set of descriptive statistics for a dataset is hard, and attempting to reduce a dataset as rich as HTTP Archive down to a single one is an act of folly. Instead, we need to visualize the data and start asking questions.
For example, why are some pages so heavy? A cursory look shows that the heaviest ~3% by page weight, both for desktop (>7374KB) and mobile (>4048KB), are often due to a large number of (and/or heavy) images. Emphasis on often, because a deeper look at the most popular content types shows outliers in each and every category. For example, plotting the CDFs for desktop pages yields:
We have pages that fetch tens of megabytes of HTML, images, video, and fonts, as well as high single-digit megabytes of JavaScript and CSS. Each of these "obese" outliers is worth digging into, but we'll leave that for a separate investigation. Let's compare this data to the mobile dataset.
Lots of outliers as well, but the tails for mobile pages are not nearly as long. This alone explains much of the dramatic "average page" difference (desktop: 2227KB, mobile: 1253KB) — averages are easily skewed by a few large numbers. Focusing on the average leads us to believe that mobile pages are significantly "lighter", whereas in reality all we can say so far is that the desktop distribution has a longer tail with much heavier pages.
To get a better sense for the difference in distributions between the desktop and mobile pages, let's exclude the heaviest 3% that compress all of our graphs and zoom in on the [0, 97%] interval:
Mobile pages do appear to consume fewer bytes. For example, a 1000KB budget would allow the client to fully fetch ~38% of desktop pages vs. ~54% of mobile pages. However, while the savings for mobile pages are present across all content types, the absolute differences for most of them are not drastic. Most of the total byte difference is explained by fewer image bytes. Structurally, mobile pages are not dramatically different from desktop pages.
Comparing the CDFs against the year prior shows that transfer sizes for most content types have increased for both desktop and mobile pages. However, there are some unexpected and interesting results as well:
In terms of bytes fetched, for everything but images, mobile pages are roughly a year behind their desktop counterparts. Intuitively, this makes sense: just because we're working with a smaller screen doesn't mean the required functionality is less, or less complex.
My goal here is to raise questions, not to provide answers; this is a very shallow analysis of a very rich dataset. For a deeper and more hands-on look at this data, take a look at my Datalab workbook. Better yet, clone it, run your own analysis, and share your results! If we want to talk about the trends, outliers, and their causes on the web, then we need to understand this data at a much deeper level.
Unfortunately, many web applications get this wrong because they fail to account for the mobile lifecycle: they listen for the wrong events, which may never fire, or ignore the problem entirely at the high cost of a poor user experience. To be fair, the web platform doesn't make this easy, exposing (too) many different events: visibilityState, pageshow, pagehide, beforeunload, unload. Which should we use, and when?
You cannot rely on the pagehide, beforeunload, and unload events to fire on mobile platforms. This is not a bug in your favorite browser; this is due to how all mobile operating systems work. An active application can transition into a "background state" via several routes:

Once the application has transitioned to the background state, it may be killed without any further ceremony - e.g. the OS may terminate the process to reclaim resources, or the user can swipe away the app in the task manager. As a result, you should assume that "clean shutdowns" that fire the pagehide, beforeunload, and unload events are the exception, not the rule.
To provide a reliable and consistent user experience, both on desktop and mobile, the application must use the Page Visibility API and execute its session save and restore logic on every visibilitychange event. This is the only event your application can count on.
// query current page visibility state: prerender, visible, hidden
var pageVisibility = document.visibilityState;
// subscribe to visibility change events
document.addEventListener('visibilitychange', function() {
// fires when user switches tabs, apps, goes to homescreen, etc.
if (document.visibilityState == 'hidden') { ... }
// fires when app transitions from prerender, user returns to the app / tab.
if (document.visibilityState == 'visible') { ... }
});
If you're counting on unload to save state, record and report analytics data, and execute other relevant logic, then you're missing a large fraction of mobile sessions where unload will never fire. Similarly, if you're counting on the beforeunload event to prompt the user about unsaved data, then you're ignoring that "clean shutdowns" are the exception, not the rule.
Use the Page Visibility API and forget that the other events even exist. Treat every transition to visible as a new session: restore previous state, reset your analytics counters, and so on. Then, when the application transitions to hidden, end the session: save user and app state, beacon your analytics, and perform all other necessary work.
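As a sketch of the "end the session on hidden" pattern, where the /analytics endpoint and payload shape are assumptions for illustration:

```javascript
// Serialize whatever the app needs to end (and later restore) a session.
function sessionPayload(state) {
  return JSON.stringify({ endedAt: Date.now(), state: state });
}

// Browser-only wiring: sendBeacon queues the data reliably even as the
// page is being backgrounded, unlike an async XHR fired from unload.
if (typeof document !== "undefined") {
  document.addEventListener("visibilitychange", function () {
    if (document.visibilityState === "hidden") {
      navigator.sendBeacon("/analytics", sessionPayload({ page: location.pathname }));
    }
  });
}
```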
In the long term, all you need is the Page Visibility API. As of today, you will have to augment it with one other event — pagehide, to be specific — to account for the "when the page is being unloaded" case. For the curious, here's a full matrix of which events fire in each browser today (based on my manual testing):
- visibilitychange works reliably for task-switching on mobile platforms.
- beforeunload is of limited value, as it only fires on desktop navigations.
- unload does not fire on mobile, nor on desktop Safari.

The good news is that Page Visibility reliably covers task-switching scenarios across all platforms and browser vendors. The bad news is that, today, Firefox is the only implementation that fires the visibilitychange event when the page is unloaded — there are open Chrome, WebKit, and Edge bugs to address this. Once those are resolved, visibilitychange is the only event you'll need to provide a great user experience.
Modern browsers try their best to anticipate what connections the site will need before the actual request is made. By initiating early "preconnects", the browser can set up the necessary sockets ahead of time and eliminate the costly DNS, TCP, and TLS roundtrips from the critical path of the actual request. That said, as smart as modern browsers are, they cannot reliably predict all the preconnect targets for each and every website.
The good news is that we can — finally — help the browser; we can tell the browser which sockets we will need ahead of initiating the actual requests via the new preconnect hint shipping in Firefox 39 and Chrome 46! Let's take a look at some hands-on examples of how and where you might want to use it.
Your application may not know the full resource URL ahead of time due to conditional loading logic, UA adaptation, or other reasons. However, if the origin from which the resources are going to be fetched is known, then a preconnect hint is a perfect fit. Consider the following example with Google Fonts, both with and without the preconnect hint:
In the first trace, the browser fetches the HTML and discovers that it needs a CSS resource residing on fonts.googleapis.com. With that downloaded, it builds the CSSOM, determines that the page will need two fonts, and initiates requests for each from fonts.gstatic.com — first, though, it needs to perform the DNS, TCP, and TLS handshakes with that origin; once the socket is ready, both requests are multiplexed over the HTTP/2 connection.
<link href='https://fonts.gstatic.com' rel='preconnect' crossorigin>
<link href='https://fonts.googleapis.com/css?family=Roboto+Slab:700|Open+Sans' rel='stylesheet'>
In the second trace, we add the preconnect hint to our markup, indicating that the application will fetch resources from fonts.gstatic.com. As a result, the browser begins the socket setup in parallel with the CSS request, completes it ahead of time, and allows the font requests to be sent immediately! In this particular scenario, preconnect removes three RTTs from the critical path and eliminates over half a second of latency.
Note the crossorigin attribute on the preconnect hint: font resources are fetched in anonymous CORS mode, and the browser maintains a separate pool of sockets for this mode.
In addition to declaring preconnect hints via HTML markup, we can also deliver them via an HTTP Link header. For example, to achieve the same preconnect benefits as above, the server could have delivered the preconnect hint without modifying the page markup - see below. The Link header mechanism allows each response to indicate to the browser which other origins it should connect to ahead of time; for example, included widgets and dependencies can help optimize performance by indicating which other origins they will need, and so on.
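For instance, reusing the fonts origin from the earlier example, the response header could look like this (a sketch of the Link header syntax, not output captured from a real server):

```http
Link: <https://fonts.gstatic.com>; rel=preconnect; crossorigin
```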
We don't have to declare all preconnect origins upfront. The application can invoke preconnects in response to user input, anticipated activity, or other user signals with the help of JavaScript. For example, consider the case where an application anticipates the likely navigation target and issues an early preconnect:
function preconnectTo(url) {
var hint = document.createElement("link");
hint.rel = "preconnect";
hint.href = url;
document.head.appendChild(hint);
}
The user starts on jsbin.com; at the ~3.0 second mark the page determines that the user might be navigating to engineering.linkedin.com and initiates a preconnect for that origin; at the ~5.0 second mark the user initiates the navigation, and the request is dispatched without blocking on DNS, TCP, or TLS handshakes — nearly a second saved for the navigation!
Preconnect is an important tool in your optimization toolbox. As the above examples illustrate, it can eliminate many costly roundtrips from your request path — in some cases reducing request latency by hundreds or even thousands of milliseconds. That said, use it wisely: each open socket incurs costs on both the client and server, and you want to avoid opening sockets that might go unused. As always: apply, measure real-world impact, and iterate to get the best performance mileage out of this feature.
Finally, do note that preconnect directives are treated as optimization hints: the browser might not act on every directive every time, and it is allowed to adjust its logic and perform only a partial handshake - e.g. fall back to a DNS lookup only, or DNS+TCP for TLS connections. Keep this in mind when debugging.
Ok, so loading a page is complicated business, so what? Well, if there is no way to reliably predict how long a load might take, why do so many browsers still show a progress bar? At best, the 0-100 indicator is a lie that misleads the user; worse, its success criterion forces developers to optimize for "onload time", which misses the progressive rendering experience that modern applications aim to deliver. Browser progress bars fail both users and developers; we can and should do better.
To be clear, progress indicators are vital to helping the user understand that an operation is in progress. The browser needs to show some form of a busy indicator, and the important questions are: what type of indicator, whether progress can be estimated, and what criteria are used to trigger its display.
Some browsers have already replaced "progress bars" with "indeterminate indicators", dropping the pretense of predicting and estimating something that they can't. However, this treatment is inconsistent between browser vendors, and even between the same browser on different platforms — e.g. many mobile browsers use progress bars, whereas their desktop counterparts use indeterminate indicators. We need to fix this.
Also, while we're on the subject, what are the conditions that trigger the browser's busy indicator anyway? Today the indicator is shown only while the page is loading: it is active until the onload event fires, which is supposed to indicate that the page has finished fetching all of its resources and is now "ready". However, in a world optimized for progressive rendering, this is an increasingly unhelpful concept: the presence of an outstanding request does not mean the user can't or shouldn't interact with the page; many pages defer fetching and further processing until after onload; many pages trigger fetching and processing based on user input.
Time to onload is a bad performance metric, and one that developers have been gaming for a while. Making it the success criterion for the busy indicator is a decision worth revisiting. For example, instead of relying on what is now an arbitrary initialization milestone, what if the indicator represented the page's ability to accept and process user input?

The initial page load is simply a special case of painting the first frame (ideally in <1000ms), at which time the page is unable to process user input. Past the first frame, if the UI thread is busy once again, the browser can and should show the same indicator. Changing the busy indicator to signal interactivity would stop penalizing progressive rendering, remove the incentive to keep gaming onload, and create direct incentives for developers to build and optimize for smooth and jank-free experiences.
The ambiguity and lack of developer override in the above spec language is a big gap and a performance problem. First, the ambiguity leaves us with inconsistent behavior across browsers; second, the lack of developer override means that we are either rendering content that should be blocked, or unnecessarily blocking rendering where a fallback would have been acceptable. There isn't a single strategy that works best in all cases.
How often does the above algorithm get invoked? What's the delta between the time the browser was first ready to render text and the font became available? Speaking of which, how long does it typically take the font download to complete? Can we just initiate the font fetch earlier to solve the problem?
As it happens, Chrome already tracks the necessary metrics to answer all of the above. Open a new tab and head to chrome://histograms to inspect the metrics for your profile and navigation history (for the curious, check out histograms.xml in the Chromium source). The specific metrics we are interested in are:
- WebFont.HadBlankText: count of times text rendering was blocked.
- WebFont.BlankTextShownTime: duration of blank text due to blocked rendering.
- WebFont.DownloadTime.*: time to fetch the font, segmented by filesize.
- PLT.NT_Request: time to first response byte (TTFB).

Inspecting your own histograms will, undoubtedly, reveal some interesting insights. However, is your profile data representative of the global population? Chrome aggregates anonymized usage statistics from opted-in users to help the engineering team improve Chrome's features and performance, and I've pulled the same global metrics for Chrome for Android. Let's take a look...
| | 50th | 75th | 95th |
|---|---|---|---|
| WebFont.DownloadTime.0.Under10KB | ~400 ms | ~750 ms | ~2300 ms |
| WebFont.DownloadTime.1.10KBTo50KB | ~500 ms | ~900 ms | ~2600 ms |
| WebFont.DownloadTime.2.50KBTo100KB | ~600 ms | ~1100 ms | ~3800 ms |
| WebFont.DownloadTime.3.100KBTo1MB | ~800 ms | ~1500 ms | ~5000 ms |
| WebFont.BlankTextShownTime | ~350 ms | ~750 ms | ~2300 ms |
| PLT.NT_Request | ~150 ms | ~380 ms | ~1300 ms |

| | No blank text | Had blank text |
|---|---|---|
| WebFont.HadBlankText | ~71% | ~29% |
29% of page loads on Chrome for Android displayed blank text: the user agent knew the text it needed to paint, but was blocked from doing so due to the unavailable font resource. In the median case the blank text time was ~350 ms, ~750 ms for the 75th percentile, and a scary ~2300 ms for the 95th.
Looking at the font download times, it is also clear that even the smallest fonts (<10KB) can take multiple seconds to complete. Further, the time to fetch the font is significantly higher than the time to the first HTML response byte (see PLT.NT_Request), which may contain text that can be rendered. As a result, even if we were able to start the font fetch in parallel with the HTML request, there would still be many cases where we would have to block text rendering. More realistically, the font fetch is delayed until we know it is required, which means waiting for the HTML response, building the DOM, and resolving styles, all of which defer text rendering even further.
As the above data illustrates, fetching the font sooner and optimizing the resource filesize are both important but not sufficient to eliminate the "blank text problem". The network fetch may take a while, and we can't control that.
That said, knowing this, we can provide the necessary controls to developers to specify the desired text rendering strategy: there are cases where using a fallback is a valid strategy, and there are cases when rendering should be blocked. Both strategies are valid and can coexist on the same page depending on the content being rendered.
In short, text is almost always the single most important asset on the page, and we need to give developers control over how and when it's rendered. The CSS font rendering proposal should, I hope, resolve this.
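In the meantime, a page can implement its own rendering strategy with the CSS Font Loading API: render immediately with a fallback font, and switch to the web font only if it arrives within a time budget. A minimal sketch — the 300 ms budget, the `withTimeout` helper, and the `MyWebFont` name are illustrative assumptions, not part of any spec:

```javascript
// Resolve with the result of `promise` if it settles within `ms`
// milliseconds; otherwise resolve with `fallback` instead.
function withTimeout(promise, ms, fallback) {
  return Promise.race([
    promise,
    new Promise(function(resolve) {
      setTimeout(function() { resolve(fallback); }, ms);
    })
  ]);
}

// In the browser, document.fonts.load() (CSS Font Loading API) returns
// a promise for the requested font faces; give it a 300 ms budget:
//
// withTimeout(document.fonts.load('1em MyWebFont'), 300, null)
//   .then(function(faces) {
//     document.documentElement.className +=
//       faces ? ' webfont-active' : ' webfont-fallback';
//   });
```

The class toggle lets CSS decide which font stack to apply, so text is never invisible for longer than the stated budget.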
All connections are slow some of the time. All connections fail some of the time. All users experience these behaviors on their devices regardless of their carrier, geography, or underlying technology — 4G, 3G, or 2G.
Networks are not reliable, latency is not zero, and bandwidth is not infinite. Most applications ignore these simple truths and design for the best-case scenario, which leads to broken experiences whenever the network deviates from its optimal case. We treat these cases as exceptions but in reality they are the norm.
Building a product for a market dominated by 2G vs. 3G vs. 4G users might require an entirely different architecture and set of features. However, a 3G user is also a 2G user some of the time; a 4G user is both a 3G and a 2G user some of the time; all users are offline some of the time. A successful application is one that is resilient to fluctuations in network availability and performance: it can take advantage of the peak performance, but it plans for and continues to work when conditions degrade.
Failing to plan for variability in network performance is planning to fail. Instead, we need to accept this condition as a normal operational case and design our applications accordingly. A simple but effective strategy is to adopt a "Chaos Monkey approach" within our development cycle:
Degraded network performance and offline are the norm, not the exception. You can't bolt on an offline mode, or add a "degraded network experience" after the fact, just as you can't add performance or security as an afterthought. To succeed, we need to design our applications with these constraints in mind from the beginning.
Are you using a network proxy to emulate a slow network? That's a start, but it doesn't capture the real experience of your average user: a 4G user is fast most of the time and slow or offline some of the time. We need better tools that can emulate and force these behaviors when we develop our applications. Testing against localhost, where latency is zero and bandwidth is infinite, is a recipe for failure.
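One way to force these behaviors during development is to wrap the application's fetch function with an interceptor that randomly injects latency and failures — a sketch of the "Chaos Monkey approach" where the failure rate and delay bounds are made-up tuning knobs:

```javascript
// Wrap a fetch-like function: with probability `failRate` reject the
// request outright; otherwise delay it by a random amount (up to
// `maxDelay` ms) before passing it through. `random` is injectable
// so the behavior can be made deterministic in tests.
function chaosFetch(realFetch, opts) {
  var failRate = opts.failRate;          // e.g. 0.1 → 10% of requests fail
  var maxDelay = opts.maxDelay;          // e.g. 2000 → up to 2 s of latency
  var random = opts.random || Math.random;
  return function(url, init) {
    if (random() < failRate) {
      return Promise.reject(new Error('chaos: simulated network failure'));
    }
    var delay = Math.floor(random() * maxDelay);
    return new Promise(function(resolve) {
      setTimeout(function() { resolve(realFetch(url, init)); }, delay);
    });
  };
}

// Development builds only — swap it in for the real fetch:
// window.fetch = chaosFetch(window.fetch.bind(window),
//                           { failRate: 0.1, maxDelay: 2000 });
```

Running the app against this wrapper quickly surfaces code paths that assume the network is always fast and always available.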
We need APIs and frameworks that can facilitate and guide us to make the right design choices to account for variability in network performance. For the web, Service Worker is going to be a critical piece: it enables offline, and it allows full control over the request lifecycle, such as controlling SLAs, background updates, and more.
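For example, a Service Worker can enforce a response SLA: race the network against a timeout, and serve a cached copy when the network is too slow or unavailable. A sketch with the network and cache functions injected so the policy itself is testable; the 2-second budget is an illustrative assumption:

```javascript
// Try the network, but fall back to cacheLookup() if the response
// does not arrive within `slaMs` milliseconds, or if the network
// request fails outright. Both injected functions return promises.
function fetchWithSLA(networkFetch, cacheLookup, slaMs) {
  return function(request) {
    var network = networkFetch(request).catch(function() {
      return cacheLookup(request);       // hard failure: go to cache now
    });
    var timeout = new Promise(function(resolve) {
      setTimeout(function() { resolve(cacheLookup(request)); }, slaMs);
    });
    return Promise.race([network, timeout]);
  };
}

// In a Service Worker it might be wired up as:
// self.addEventListener('fetch', function(event) {
//   event.respondWith(
//     fetchWithSLA(fetch,
//                  function(req) { return caches.match(req); },
//                  2000)(event.request));
// });
```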
But, despite all of its pitfalls, UA/device detection is a fact of life, a growing business, and an enabling business requirement for many. The problem is that UA/device detection frequently misclassifies capable clients (e.g. IE11 was forced to change its UA string); leads to compatibility nightmares; and can't account for continually changing user and runtime preferences. That said, when deployed correctly it can also be a force for good.
Browser vendors would love to drop the User-Agent string entirely, but that would break too many things. However, while it is fashionable to demonize UA/device detection, the root problem is not in the intent behind it, but in how it is currently deployed. Instead of "detecting" (i.e. guessing) the client capabilities through an opaque version string, we need to change the model to allow the user agent to "report" the necessary capabilities.
Granted, this is not a new idea, but previous attempts seem to introduce as many issues as they solve: they seek to standardize the list of capabilities; they require agreement between multiple slow-moving parties (UA vendors, device manufacturers, etc); they are over-engineered - RDF, seriously? Instead, what we need is a platform primitive that is:
Here is the good news: this mechanism exists, it's Service Worker. Let's take a closer look...
Service worker is an event-driven Web Worker, which responds to events dispatched from documents and other sources… The service worker is a generic entry point for event-driven background processing in the Web Platform that is extensible by other specifications - see explainer, starter, and cookbook docs.
A simple way to understand Service Worker is to think of it as a scriptable proxy that runs in your browser and is able to see, modify, and respond to, all requests initiated by the page it is installed on. As a result, the developer can use it to annotate outbound requests (via HTTP request headers, URL rewriting) with relevant capability advertisements:
This is not a proposal or a wishlist; this is possible today, and is a direct result of enabling powerful low-level primitives in the browser - hooray. As such, now it's only a question of establishing the best practices: what do we report, in what format, and how do we optimize for interoperability? Let's consider a real-world example...
Our goal is to deliver the optimal — fast and visually pleasing — video startup experience to our users. Simply starting with the lowest bitrate is suboptimal: fast, but consistently poor visual quality for all users, even for those with a fast connection. Instead, we want to pick a starting bitrate that can deliver the best visual experience from the start, while minimizing playback delays and rebuffers. We don't need to be perfect, but we should account for the current network weather on the client. Once the video starts playing, the adaptive bitrate streaming will take over and adjust the stream quality up or down as necessary.
The combination of Service Worker and Network Information API make this trivial to implement:
```javascript
// register the service worker
navigator.serviceWorker.register('/worker.js').then(
  function(reg) { console.log('Installed successfully', reg) },
  function(err) { console.log('Worker installation failed', err) }
);

// ... worker.js
self.addEventListener('fetch', function(event) {
  var requestURL = new URL(event.request.url);

  // Intercept same-origin /video/* requests
  if (requestURL.origin == location.origin &&
      /^\/video\//.test(requestURL.pathname)) {
    // append the MD header, set value to NetInfo's downlinkMax:
    // http://w3c.github.io/netinfo/#downlinkmax-attribute
    event.respondWith(
      fetch(event.request.url, {
        headers: { 'MD': navigator.connection.downlinkMax }
      })
    );
  }
});
```
The service worker intercepts all same-origin /video/* requests; downlinkMax is available in Chrome 41; the server uses the reported MD value to determine the starting bitrate, and responds with the appropriate video chunk.

We have full control over the request flow and are able to add additional data to the request prior to dispatching it to the server. Best of all, this logic is transparent to the application, and you are free to customize it further. For example, want to add an explicit user override to set a starting bitrate? Prompt the user, send the value to the worker, and have it annotate requests with whatever value you feel is optimal.
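On the server side, the handler only needs to map the reported downlink estimate to a starting rendition. A sketch — the bitrate ladder, thresholds, and rendition names below are invented for illustration, not drawn from any real service:

```javascript
// Map the reported downlink estimate (Mbit/s, as carried in the MD
// request header) to a starting rendition. Thresholds are illustrative.
function pickStartingBitrate(downlinkMaxMbps) {
  var mbps = parseFloat(downlinkMaxMbps);
  if (isNaN(mbps)) return '480p';   // hint missing or invalid: safe default
  if (mbps >= 10)  return '1080p';
  if (mbps >= 5)   return '720p';
  if (mbps >= 2)   return '480p';
  return '240p';
}

// e.g. in a Node HTTP handler:
// var rendition = pickStartingBitrate(req.headers['md']);
```

Once playback begins, adaptive bitrate streaming takes over, so this mapping only needs to be roughly right, not perfect.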
Service Worker enables us (web developers) to define, customize, and deploy new capability reports at will: we can rewrite requests, implement content-type or origin specific rules, account for user preferences, and more. The new open questions are: what capabilities do our servers need to know about, and what's the best way to deliver them?
It will be tempting to report every plausibly useful property about a client. Please think twice before doing this, as it can add significant overhead to each request - be judicious. Similarly, it makes sense to optimize for interoperability: use parameter names and formats that work well with existing infrastructure and services - caches and CDNs, optimization services, and so on. For example, the MD and DPR request headers used in the above examples come from Client-Hints; the DPR and RW hints are already used to optimize images with the resrc.it service.

Now is the time to experiment. There will be missteps and poor initial implementations, but good patterns and best practices will emerge. Most importantly, the learning cycle for testing and improving this infrastructure is now firmly in the hands of web developers: deploy Service Worker, experiment, learn, and iterate.