Wikipedia regularly tops JDN’s webperf rankings in all categories. To understand how the site achieves its loading times, we met Gilles Dubuc, Senior Software Engineer at the Wikimedia Foundation. This webperf expert shares his views on the subject, and his reservations about the relative nature of the indicators used by industry professionals. Interview.
Who are you, Gilles Dubuc?
I’ve been doing web development for over 10 years, working for DeviantArt, among others, and I’ve been working at the Wikimedia Foundation since 2014. Wikipedia gets 16 billion page views every month, and the foundation has 350 employees, just over a third of whom are engineers – which is still very modest compared to the site’s traffic! I joined the foundation as part of a project for a feature that lets you quickly open images in Wikipedia articles to enlarge them.
This feature raised a number of issues: how to preserve loading times with a quality image? What image size should be used at what time? Should the JS be loaded in advance, or when the user interacts with the feature? All these questions naturally made web performance an important topic when developing this feature, and led me to take an interest in webperf as a whole on Wikipedia.
How did webperf become one of Wikipedia’s priorities?
We wanted to set up a team dedicated to webperf. For the past 4 years, 5 people have been working full-time on the subject. Although everyone has their own affinities, we work on all aspects: front-end and back-end, telemetry, continuous integration testing, site performance monitoring.
There’s also a double challenge: we have to work on web performance for both readers and contributors.
For readers, we have our own CDN with 5 datacenters around the world, so that content is served from the nearest point for each visitor.
More recently, our new datacenter in Singapore has enabled us to improve our Time To First Byte by 30% for Internet users connecting from Asia.
For contributors, on the other hand, the datacenter is currently only in the USA, a limitation that we will resolve in the medium term.
Our technical base is 15 years old, totally open source, and we are making evolutions as we go along, while maintaining what already exists, with severe budgetary constraints, as the Wikimedia Foundation is a non-profit organization that lives solely on donations.
We rely on our own physical infrastructure, and are currently working on a new Kubernetes-based architecture that brings us the benefits of the cloud. The tests we carry out to measure webperf are done without user data.
What webperf indicators do you track?
We run synthetic tests using tools such as WebPageTest and WebPageReplay, which enable us to measure synthetic indicators such as Speed Index. We also monitor RUM indicators collected directly from Internet users, such as Time To First Byte, which enables us to detect network problems, First Contentful Paint, and loadEventEnd.
This last indicator is often considered archaic, as it implies that all page content must be loaded in full, including that which is not visible. However, our research has shown that it currently correlates best with user perception.
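For readers less familiar with these RUM metrics, here is a minimal sketch – not Wikipedia’s actual instrumentation – of how Time To First Byte, First Contentful Paint and loadEventEnd can be read from the standard browser Performance APIs and beaconed to a collection endpoint (the /collect-rum URL is a hypothetical placeholder):

```typescript
// Minimal RUM collection sketch: gathers Time To First Byte, First Contentful
// Paint and loadEventEnd from the standard Performance APIs and beacons them
// to a (hypothetical) collection endpoint once the page has finished loading.
function collectRumMetrics(): void {
  // Navigation Timing entry for the current page load.
  const [nav] = performance.getEntriesByType('navigation') as PerformanceNavigationTiming[];
  if (!nav) return;

  // First Contentful Paint comes from the Paint Timing API.
  const fcpEntry = performance
    .getEntriesByType('paint')
    .find((e) => e.name === 'first-contentful-paint');

  const metrics = {
    ttfb: nav.responseStart - nav.startTime,        // Time To First Byte
    fcp: fcpEntry ? fcpEntry.startTime : undefined, // First Contentful Paint
    loadEventEnd: nav.loadEventEnd,                 // full load, incl. non-visible content
  };

  // sendBeacon survives page unload; '/collect-rum' is a placeholder endpoint.
  navigator.sendBeacon('/collect-rum', JSON.stringify(metrics));
}

// Wait until the load event has completed so loadEventEnd is populated.
window.addEventListener('load', () => setTimeout(collectRumMetrics, 0));
```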
When you develop new functionalities, how do you integrate webperf?
Webperf is at the heart of our priorities, and as soon as we launch a new feature, we do a lot of internal education. We’ve worked hard to ensure that we don’t come across as “the webperf police”, and to encourage people to think in terms of performance right from the design stage, just as we think in terms of “secure by design” products. Indeed, respecting webperf best practices means making choices in terms of architecture, so it’s better to think about it upstream rather than at the end, at the risk of having to redo everything.
Our motto: remain at least as efficient as the previous version.
Overall, I’d say we’re lucky to be able to experiment and adopt new technologies more quickly than other teams. We also want to be part of the future of the web!
Speaking of which, how do you see the future of webperf?
I’m a member of the W3C, and the Wikimedia Foundation will soon be a member too. In this context, I had the opportunity to take part in the last TPAC conference in Lyon. During my meetings with browser vendors, I noticed that they tended to come up with solutions without necessarily having feedback from many websites when the standards were first designed. They were very curious to know what was most useful for us in terms of webperf, and what might be blocking us at the moment.
The Wikimedia Foundation has expressed needs in terms of webperf measurement that require the definition of new standards. So, since this year, we’ve been taking part in trials with Chrome (Origin Trials) to test new features ahead of time and assess their relevance. The tests are carried out on Wikipedia traffic, for example, to measure page responsiveness using an API currently under development – which is currently particularly difficult to estimate.
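For context – and as a rough sketch rather than a description of Wikipedia’s setup – a Chrome Origin Trial is usually enabled by serving a trial token issued for the site’s origin, either as an Origin-Trial HTTP header or a meta tag; the token string below is a placeholder:

```typescript
// Sketch: opting a page into a Chrome Origin Trial. The token is issued by
// Chrome's Origin Trials registration for a specific origin and feature;
// the value below is a placeholder, not a real token.
// Server-side alternative: send an "Origin-Trial: <token>" HTTP header.
const trialMeta = document.createElement('meta');
trialMeta.httpEquiv = 'origin-trial';
trialMeta.content = 'PLACEHOLDER_ORIGIN_TRIAL_TOKEN';
document.head.appendChild(trialMeta);
```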
For example, we’d like to measure performance throughout the navigation, and not just at the moment of initial loading, as Time To Interactive currently does.
Event Timing, for instance, will enable us to evaluate the reaction time after clicking on a link, a button, a menu… This reaction time can affect the browsing experience and, consequently, Internet users’ perception of the site’s performance.
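As an illustration of the kind of measurement Event Timing enables, here is a sketch against the draft API – not Wikipedia’s code – using an arbitrary 100 ms threshold:

```typescript
// Observe input events (clicks, key presses…) via the Event Timing API and
// report how long the page took to react to each of them.
const eventObserver = new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    // duration ≈ time from the user's input until the next paint after the
    // event handlers have run.
    if (entry.duration > 100) { // 100 ms is an arbitrary, illustrative threshold
      console.log(`Slow ${entry.name} interaction: ${Math.round(entry.duration)} ms`);
    }
  }
});

// 'event' entries are only emitted by browsers that implement Event Timing.
eventObserver.observe({ type: 'event', buffered: true });
```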
UX really is a central element, so…
Yes! I took part in a research program in 2018 with Télécom ParisTech, to make the link between the perception of performance and the data we measure.
We collect a lot of browsing data, but in reality, we don’t know how to prioritize what’s important to web users. So we asked them openly: “Do you think the page loaded quickly?”, and we observed the answers in different contexts to see whether the context had any influence on the perception of speed.
In particular, we found that the state of mind at the time of viewing the page had no influence on the perception of speed. Even when visiting Wikipedia articles in the draft stage containing little information, which can be a source of frustration, users’ responses did not change. We concluded that external factors had little influence on speed perception.
The same was true of the Russian team’s loss at the last FIFA World Cup: the distribution of responses from Wikipedia users in that part of the world did not change.
So, does the external environment ever influence the perception of speed?
Of course it does. In the office, for example, Internet users adapt. They have less powerful equipment than at home, and slower connections, so their level of expectation drops.
This raises the question of the relevance and objectivity of indicators: do they really reflect users’ impressions?
Up to now, most of the indicators we’ve been able to measure on browsers have been created on the basis of what was easy to display. But I’d like to see us move towards indicators that reflect what people really feel.
But how can we bring performance indicators closer to the human experience? Are we moving away from Synthetic Monitoring?
Today, for example, to measure the actual display speed of content, we still rely on Synthetic Monitoring. In the future, we’ll have APIs that allow us to know when an image is actually displayed – and not just when the browser has downloaded it.
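The Element Timing API is one example of the kind of API he is referring to: a page can mark an image in its markup and observe when it was actually rendered on screen. A minimal sketch, with “hero-image” as a purely illustrative identifier:

```typescript
// Assumes the markup marks the image to watch:
//   <img src="lead-photo.jpg" elementtiming="hero-image">
// ('hero-image' and the file name are illustrative, not Wikipedia values.)
const elementObserver = new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    const e = entry as PerformanceEntry & {
      identifier: string;
      renderTime: number;
      loadTime: number;
    };
    // renderTime is when the pixels reached the screen; it is 0 for
    // cross-origin images served without Timing-Allow-Origin, so fall back
    // to loadTime in that case.
    const paintedAt = e.renderTime || e.loadTime;
    console.log(`Image "${e.identifier}" rendered at ${Math.round(paintedAt)} ms`);
  }
});

elementObserver.observe({ type: 'element', buffered: true });
```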
Let’s also take the example of Speed Index: we have to choose a reference screen size to measure it, even though we know that screen sizes vary from one device to another. What’s more, when we look at the Speed Index, we pretend that the user doesn’t interact until the page is fully loaded, whereas we know full well that everyone starts scrolling before then and wants to see what’s below the fold. This indicator has the merit of existing, but it can’t accurately translate the user experience.
That said, despite its limitations, Synthetic Monitoring has the advantage of offering a stability that organic data do not, so it will always be needed. It’s a very effective way of detecting changes in performance that come from us and not from the environment. These metrics are always useful over the very long term.
For example, at Wikipedia, we regularly have localized spikes in traffic linked to current events. If this unusual increase in traffic comes from a region of the world where the Internet is generally slow, it can have an impact on our overall webperf KPIs, even though in reality it doesn’t indicate any change in our performance in absolute terms.
Conversely, when new, more powerful phone models are massively adopted, webperf metrics improve, but that doesn’t say anything about our actual performance either. It could even be masking a drop in our webperf! So when we look at long-term performance trends, we need to do a thorough job of distinguishing between trends that are due to the environment and those that are due to our own performance. This is where synthetic monitoring comes in very handy.
Overall, loading speed and web performance are improving. But what I’m looking to observe as part of my research is also the increase in performance in absolute terms and in relation to expectations, outside a browsing context with ever faster phones and ever smoother networks. And why not, create new standards!
So what would you like to measure?
I’d like to move towards tools that reproduce human perception, beyond declarative feedback. If I take Wikipedia as an example, unlike a commercial site which can carry out A/B tests, it’s very difficult to measure what a reader has understood or retained from a page. Today, for example, we rely on time spent on the page, which is far from being a reliable indicator. After all, you could argue that if the information is better presented, the reader should need less time to find what he or she was looking for.
In fact, let’s take the example of previews in Google SERPs: does the user skip the page because the preview content has already provided enough information? Or because it wasn’t relevant? How can we measure what the user has learned?
As the notion of time is highly subjective, how do you assess it?
There are still a lot of avenues to explore in order to develop the indicators we have, whatever the field, webperf included. For example, I’d like to get closer to the human capacity to evaluate page loading time. And besides, if it’s faster, is it always better?
Various studies have shown that in certain cases, such as price comparison sites, a search that takes longer gives the impression of being more complete… In reality, many sites in this field could very well display their results instantly, but deliberately slow them down, because A/B tests have shown that they achieve a better conversion rate this way. So there are cases where “slow” or intentionally slowed loading times can be more reassuring. It all depends on the context.
The whole point of our current research is to find a way of transcribing the human browsing experience as faithfully as possible. Let’s hope we succeed, or come as close as possible!