Wednesday, December 21, 2016

How V8 measures real-world performance

Over the last year the V8 team has developed a new methodology to measure and understand real-world JavaScript performance. We’ve used the insights that we gleaned from it to change how the V8 team makes JavaScript faster. Our new real-world focus represents a significant shift from our traditional performance focus. We’re confident that as we continue to apply this methodology in 2017, it will significantly improve users’ and developers’ ability to rely on predictable performance from V8 for real-world JavaScript in both Chrome and Node.js.

The old adage “what gets measured gets improved” is particularly true in the world of JavaScript virtual machine (VM) development. Choosing the right metrics to guide performance optimization is one of the most important things a VM team can do over time. The following timeline roughly illustrates how JavaScript benchmarking has evolved since the initial release of V8:

Evolution of JavaScript benchmarks.

Historically, V8 and other JavaScript engines have measured performance using synthetic benchmarks. Initially, VM developers used microbenchmarks like SunSpider and Kraken. As the browser market matured a second benchmarking era began, during which they used larger but nevertheless synthetic test suites such as Octane and JetStream.

Microbenchmarks and static test suites have a few benefits: they’re easy to bootstrap, simple to understand, and able to run in any browser, making comparative analysis easy. But this convenience comes with a number of downsides. Because they include a limited number of test cases, it is difficult to design benchmarks which accurately reflect the characteristics of the web at large. Moreover, benchmarks are usually updated infrequently; thus, they tend to have a hard time keeping up with new trends and patterns of JavaScript development in the wild. Finally, over the years VM authors explored every nook and cranny of the traditional benchmarks, and in the process they discovered and took advantage of opportunities to improve benchmark scores by shuffling around or even skipping externally unobservable work during benchmark execution. This kind of benchmark-score-driven improvement and over-optimizing for benchmarks doesn’t always provide much user- or developer-facing benefit, and history has shown that over the long-term it’s very difficult to make an “ungameable” synthetic benchmark.

Measuring real websites: WebPageReplay & Runtime Call Stats

Given an intuition that we were only seeing one part of the performance story with traditional static benchmarks, the V8 team set out to measure real-world performance by benchmarking the loading of actual websites. We wanted to measure use cases that reflected how end users actually browsed the web, so we decided to derive performance metrics from websites like Twitter, Facebook, and Google Maps. Using a piece of Chrome infrastructure called WebPageReplay we were able to record and replay page loads deterministically.

In tandem, we developed a tool called Runtime Call Stats which allowed us to profile how different JavaScript code stressed different V8 components. For the first time, we had the ability not only to test V8 changes easily against real websites, but to fully understand how and why V8 performed differently under different workloads.

We now monitor changes against a test suite of approximately 25 websites in order to guide V8 optimization. In addition to the aforementioned websites and others from the Alexa Top 100, we selected sites which were implemented using common frameworks (React, Polymer, Angular, Ember, and more), sites from a variety of different geographic locales, and sites or libraries whose development teams have collaborated with us, such as Wikipedia, Reddit, Twitter, and Webpack. We believe these 25 sites are representative of the web at large and that performance improvements to these sites will be directly reflected in similar speedups for sites being written today by JavaScript developers.

For an in-depth presentation about the development of our test suite of websites and Runtime Call Stats, see the BlinkOn 6 presentation on real-world performance. You can even run the Runtime Call Stats tool yourself.

Making a real difference

Analyzing these new, real-world performance metrics and comparing them to traditional benchmarks with Runtime Call Stats has also given us more insight into how various workloads stress V8 in different ways.

From these measurements, we discovered that Octane performance was actually a poor proxy for performance on the majority of our 25 tested websites. You can see in the chart below: Octane’s color bar distribution is very different than any other workload, especially those for the real-world websites. When running Octane, V8’s bottleneck is often the execution of JavaScript code. However, most real-world websites instead stress V8’s parser and compiler. We realized that optimizations made for Octane often lacked impact on real-world web pages, and in some cases these optimizations made real-world websites slower.

Distribution of time running all of Octane, running the line-items of Speedometer and loading websites from our test suite on Chrome M57.

We also discovered that another benchmark was actually a better proxy for real websites. Speedometer, a WebKit benchmark that includes applications written in React, Angular, Ember, and other frameworks, demonstrated a very similar runtime profile to the 25 sites. Although no benchmark matches the fidelity of real web pages, we believe Speedometer does a better job of approximating the real-world workloads of modern JavaScript on the web than Octane.

Bottom line: A faster V8 for all

Over the course of the past year, the real-world website test suite and our Runtime Call Stats tool has allowed us to deliver V8 performance optimizations that speed up page loads across the board by an average of 10-20%. Given the historical focus on optimizing page load across Chrome, a double-digit improvement to the metric in 2016 is a significant achievement. The same optimizations also improved our score on Speedometer by 20-30%.

These performance improvements should be reflected in other sites written by web developers using modern frameworks and similar patterns of JavaScript. Our improvements to builtins such as Object.create and Function.prototype.bind, optimizations around the object factory pattern, work on V8’s inline caches, and ongoing parser improvements are intended to be generally applicable improvements to underlooked areas of JavaScript used by all developers, not just the representative sites we track.

We plan to expand our usage of real websites to guide V8 performance work. Stay tuned for more insights about benchmarks and script performance.

Posted by the V8 team


  1. This comment has been removed by the author.

  2. Speedier top websites are generally of the greater benefit to the common user, indeed. Congratulations!
    However, what made Chrome so special in the beginning are its platform-like characteristics. Developers could finally start writing complex games as web applications. Even more so with WebGL.
    Since games are not in the top 20, you could be approaching a missed opportunity here. Webby games could win again (on mobile)... But this requires focus.

    (Full disclosure - I do not develop, nor do I almost ever play games. :) I actually prefer regular web pages to be speedier, but I think still it is a missed opportunity for the web platform at large)

    1. We are updating the set of top 25 (they are now more) constantly. A game (which one?) is indeed missing. We are currently tracking Maps as a surrogate for JS/Graphic intensive pages.

    2. We have "traditional" benchmarks running games though like oortonline.

  3. I don't think it is a good way of testing. The real task, that needs a lot of performance, could be made after clicking some button on the website (e.g. in a web app or in a game), so fast parsing of HTML/CSS/JS, rendering the page and executing the initial JS does not really mean anything. You optimize your browser for pages that need to be loaded, but not for apps that need to perform tasks during the runtime.

    1. It is a similar challenge though. Nevertheless, that is one area we want to explore further.

  4. Are there any benchmarks that try to capture NodeJS performance? I'm guessing that if you're focusing on improving parsing, that will probably have benefits for IO-based Node applications. But Node processes are going to be longer running, so slower compilation would be favorable if it yields faster performance. Is there any core focus on this scenario?

    1. Node.js is using the ACMEAir benchmark for testing real-world performance. Good real-world workloads are harder to collect, because one cannot simply capture/replay HTTPs traffic and use web page replay.

      There are also short-running real-world Node.js workloads: Node.js used as part of the build chain e.g. as a transpiler. Startup in that case is more important than throughput.

    2. I'm also a bit surprised by the apparently lack of Nodejs benchmarks. Millions of applications in the world rely on nodejs/v8 performance thus it would seem logical to include some specific benchmarks in your performance toolchain.