Tuesday, September 11, 2018

Celebrating 10 years of V8

This month marks the 10-year anniversary of shipping not just Google Chrome, but also the V8 project. This post gives an overview of major milestones for the V8 project in the past 10 years as well as the years before, when the project was still secret.

A visualization of the V8 code base over time, created using gource.

Before V8 shipped: the early years

Google hired Lars Bak in the autumn of 2006 to build a new JavaScript engine for the Chrome web browser, which at the time was still a secret internal Google project. Lars had recently moved back to Aarhus, Denmark, from Silicon Valley. Since there was no Google office there and Lars wanted to remain in Denmark, Lars and several of the project’s original engineers began working on the project in an outbuilding on his farm. The new JavaScript runtime was christened “V8”, a playful reference to the powerful engine you can find in a classic muscle car. Later, when the V8 team had grown, the developers moved from their modest quarters to a modern office building in Aarhus, but the team took with them their singular drive and focus on building the fastest JavaScript runtime on the planet.

Launching and evolving V8

V8 went open-source the same day Chrome was launched: on September 2nd, 2008. The initial commit dates back to June 30th, 2008. Prior to that date, V8 development happened in a private CVS repository. Initially, V8 supported only the ia32 and ARM instruction sets and used SCons as its build system.

2009 saw the introduction of a brand new regular expression engine named Irregexp, resulting in performance improvements for real-world regular expressions. With the introduction of an x64 port, the number of supported instruction sets increased from two to three. 2009 also marked the first release of the Node.js project, which embeds V8. The possibility for non-browser projects to embed V8 was explicitly mentioned in the original Chrome comic. With Node.js, it actually happened! Node.js grew to be one of the most popular JavaScript ecosystems.

2010 witnessed a big boost in runtime performance as V8 introduced a brand-new optimizing JIT compiler. Crankshaft generated machine code that was twice as fast and 30% smaller than the previous (unnamed) V8 compiler. That same year, V8 added its fourth instruction set: 32-bit MIPS.

2011 came, and garbage collection was vastly improved. A new incremental garbage collector drastically reduced pause times while maintaining great peak performance and low memory usage. V8 introduced the concept of Isolates, which allows embedders to spin up multiple instances of the V8 runtime in a process, paving the way for lighter-weight Web Workers in Chrome. The first of V8’s two build system migrations occurred as we transitioned from SCons to GYP. We implemented support for ES5 strict mode. Meanwhile, development moved from Aarhus to Munich (Germany) under new leadership with lots of cross-pollination from the original team in Aarhus.

2012 was a year of benchmarks for the V8 project. The team did speed sprints to optimize V8’s performance as measured through the SunSpider and Kraken benchmark suites. Later, we developed a new benchmark suite named Octane (with V8 Bench at its core) that brought peak performance competition to the forefront and spurred massive improvements in runtime and JIT technology in all major JS engines. One outcome of these efforts was the switch from randomized sampling to a deterministic, count-based technique for detecting “hot” functions in V8’s runtime profiler. This made it significantly less likely that some page loads (or benchmark runs) would randomly be much slower than others.

2013 witnessed the appearance of a low-level subset of JavaScript named asm.js. Since asm.js is limited to statically-typed arithmetic, function calls, and heap accesses with primitive types only, validated asm.js code could run with predictable performance. We released a new version of Octane, Octane 2.0, with updates to existing benchmarks as well as new benchmarks that target use cases like asm.js. Octane spurred the development of new compiler optimizations like allocation folding and allocation-site-based optimizations for type transitioning and pretenuring that vastly improved peak performance. As part of an effort we internally nicknamed “Handlepocalypse”, the V8 Handle API was completely rewritten to make it easier to use correctly and safely. Also in 2013, Chrome’s implementation of TypedArrays in JavaScript was moved from Blink to V8.
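To get a feel for the subset, here is a minimal asm.js-style module (an illustrative sketch, not code from the V8 project; AsmModule and add are made-up names). The "use asm" pragma marks the module for validation, and coercions like x | 0 give every value a static type:

    function AsmModule(stdlib, foreign, heap) {
      "use asm";
      var HEAP32 = new stdlib.Int32Array(heap); // heap accesses are typed
      function add(x, y) {
        x = x | 0;              // parameter x: statically typed as int
        y = y | 0;              // parameter y: statically typed as int
        HEAP32[0] = x;          // primitive-typed heap store
        return (x + y) | 0;     // result coerced back to int
      }
      return { add: add };
    }

    // In a browser: stdlib is the global object, the heap an ArrayBuffer.
    var mod = AsmModule(window, {}, new ArrayBuffer(0x10000));
    mod.add(2, 3); // → 5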

In 2014, V8 moved some of the work of JIT compilation off the main thread with concurrent compilation, reducing jank and significantly improving performance. Later that year, we landed the initial version of a new optimizing compiler named TurboFan. Meanwhile, our partners helped port V8 to three new instruction set architectures: PPC, MIPS64, and ARM64. Following Chromium, V8 transitioned to yet another build system, GN. The V8 testing infrastructure saw significant improvements, with a Tryserver now available to test each patch on various build bots before landing. For source control, V8 migrated from SVN to Git.

2015 was a busy year for V8 on a number of fronts. We implemented code caching and script streaming, significantly speeding up web page load times. Work on our runtime system’s use of allocation mementos was published at ISMM 2015. Later that year, we kicked off work on a new interpreter named Ignition. We experimented with the idea of subsetting JavaScript with strong mode to achieve stronger guarantees and more predictable performance. We implemented strong mode behind a flag, but later found its benefits did not justify the costs. The addition of a commit queue made big improvements in productivity and stability. V8’s garbage collector also began cooperating with embedders such as Blink to schedule garbage collection work during idle periods. Idle-time garbage collection significantly reduced observable garbage collection jank and memory consumption. In December, the first WebAssembly prototype landed in V8.

In 2016, we shipped the last pieces of the ES2015 (previously known as “ES6”) feature set (including promises, class syntax, lexical scoping, destructuring, and more), as well as some ES2016 features. We also started rolling out the new Ignition and TurboFan pipeline, using it to compile and optimize ES2015 and ES2016 features, and shipping Ignition by default for low-end Android devices. Our successful work on idle-time garbage collection was presented at PLDI 2016. We kicked off the Orinoco project, a new mostly-parallel and concurrent garbage collector for V8, to reduce main-thread garbage collection time. In a major refocus, we shifted our performance efforts away from synthetic micro-benchmarks and instead began to seriously measure and optimize real-world performance. For debugging, the V8 inspector was migrated from Chromium to V8, allowing any V8 embedder (and not just Chromium) to use the Chrome DevTools to debug JavaScript running in V8. WebAssembly graduated from prototype to experimental support, in coordination with experimental WebAssembly support from other browser vendors. V8 received the ACM SIGPLAN Programming Languages Software Award. And another port was added: S390.
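A few of the ES2015 features mentioned above, in one small snippet (illustrative only):

    class Point {                             // class syntax
      constructor(x, y) {
        this.x = x;
        this.y = y;
      }
    }

    const { x, y } = new Point(1, 2);         // destructuring
    const double = (n) => n * 2;              // arrow functions

    Promise.resolve(double(x + y)).then((result) => {
      let message = `result = ${result}`;     // let is block-scoped
      console.log(message);                   // → 'result = 6'
    });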

In 2017, we finally completed our multi-year overhaul of the engine, enabling the new Ignition and TurboFan pipeline by default. This made it possible to later remove Crankshaft (130,380 deleted lines of code) and Full-codegen from the codebase. We launched Orinoco v1.0, including concurrent marking, concurrent sweeping, parallel scavenging, and parallel compaction. We officially recognized Node.js as a first-class V8 embedder alongside Chromium. Since then, it’s impossible for a V8 patch to land if doing so breaks the Node.js test suite. Our infrastructure gained support for correctness fuzzing, ensuring that any piece of code produces consistent results regardless of the configuration it runs in.

In an industry-wide coordinated launch, V8 shipped WebAssembly on by default. We implemented support for JavaScript modules as well as the full ES2017 and ES2018 feature sets (including async functions, shared memory, async iteration, rest/spread properties, and RegExp features). We shipped native support for JavaScript code coverage, and launched the Web Tooling Benchmark to help us measure how V8’s optimizations impact performance for real-world developer tools and the JavaScript output they generate. Wrapper tracing from JavaScript objects to C++ DOM objects and back allowed us to resolve long-standing memory leaks in Chrome and to handle the transitive closure of objects over the JavaScript and Blink heap efficiently. We later used this infrastructure to increase the capabilities of the heap snapshotting developer tool.
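For illustration, here is what a few of those ES2017/ES2018 features look like in practice (a sketch; collect is a made-up helper):

    // Async functions (ES2017) and async iteration (ES2018):
    async function collect(asyncIterable) {
      const items = [];
      for await (const item of asyncIterable) {
        items.push(item);
      }
      return items;
    }

    // Rest/spread properties and named capture groups (ES2018):
    const defaults = { retries: 3 };
    const options = { ...defaults, timeout: 100 };
    const { timeout, ...rest } = options;
    const match = '2018-09-11'.match(/(?<year>\d{4})/u);
    console.log(match.groups.year); // → '2018'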

2018 saw an industry-wide security event upend what we thought we knew about CPU information security with the public disclosure of the Spectre/Meltdown vulnerabilities. V8 engineers performed extensive offensive research to help understand the threat for managed languages and develop mitigations. V8 shipped mitigations against Spectre and similar side-channel attacks for embedders that run untrusted code.

Recently, we shipped a baseline compiler for WebAssembly named Liftoff which greatly reduces startup time for WebAssembly applications while still achieving predictable performance. We shipped BigInt, a new JavaScript primitive that enables arbitrary-precision integers. We implemented embedded builtins, and made it possible to lazily deserialize them, significantly reducing V8’s footprint for multiple Isolates. We made it possible to compile script bytecode on a background thread. We started the Unified V8-Blink Heap project to run a cross-component V8 and Blink garbage collection in sync. And the year’s not over yet…
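To illustrate BigInt: it keeps integer arithmetic exact where regular numbers lose precision (a quick, illustrative example):

    const maxSafe = Number.MAX_SAFE_INTEGER; // 9007199254740991
    console.log(maxSafe + 2);                // 9007199254740992 (rounded!)
    console.log(BigInt(maxSafe) + 2n);       // 9007199254740993n (exact)
    console.log(2n ** 128n);                 // arbitrary precision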

Performance ups and downs

Chrome’s V8 Bench score over the years shows the performance impact of V8’s changes. (We’re using the V8 Bench because it’s one of the few benchmarks that can still run in the original Chrome beta.)

Chrome’s V8 Bench score from 2008 to 2018

Our score on this benchmark went up over the last ten years!

However, you might notice two performance dips over the years. Both are interesting because they correspond to significant events in V8’s history. The performance drop in 2015 happened when V8 shipped baseline versions of ES2015 features. These features were cross-cutting in the V8 code base, and we therefore focused on correctness rather than performance for their initial release. We accepted these slight speed regressions to get features to developers as quickly as possible. In early 2018, the Spectre vulnerability was disclosed, and V8 shipped mitigations to protect users against potential exploits, resulting in another regression in performance. Luckily, now that Chrome is shipping Site Isolation, we can disable the mitigations again, bringing performance back on par.

Another take-away from this chart is that it starts to level off around 2013. Does that mean V8 gave up and stopped investing in performance? Quite the opposite! The flattening of the graph represents the V8 team’s pivot from synthetic micro-benchmarks (such as V8 Bench and Octane) to optimizing for real-world performance. V8 Bench is an old benchmark that doesn’t use any modern JavaScript features, nor does it approximate actual real-world production code. Contrast this with the more recent Speedometer benchmark suite:

Chrome’s Speedometer 1 score from 2013 to 2018

Although V8 Bench shows minimal improvements from 2013 to 2018, our Speedometer 1 score went up again during this same time period. (We used Speedometer 1 because Speedometer 2 uses modern JavaScript features that weren’t yet supported in 2013.)

Nowadays, we have even better benchmarks that more accurately reflect modern JavaScript apps, and on top of that, we actively measure and optimize for existing web apps.

Summary

Although V8 was originally built for Google Chrome, it has always been a stand-alone project with a separate codebase and an embedding API that allows any program to use its JavaScript execution services. Over the last 10 years, the open nature of the project has helped it become a key technology not only for the Web Platform, but in other contexts, like Node.js. Along the way, the project evolved and remained relevant despite many changes and dramatic growth.

Initially, V8 supported only two instruction sets. Over the last 10 years, the list of supported platforms grew to eight: ia32, x64, ARM, ARM64, 32- and 64-bit MIPS, 64-bit PPC, and S390. V8’s build system migrated from SCons to GYP to GN. The project moved from Denmark to Germany, and now has engineers all over the world, including in London, Mountain View, and San Francisco, with contributors outside of Google from many more places. We’ve transformed our entire JavaScript compilation pipeline from unnamed components to Full-codegen (a baseline compiler) and Crankshaft (a feedback-driven optimizing compiler) to Ignition (an interpreter) and TurboFan (a better feedback-driven optimizing compiler). V8 went from being “just” a JavaScript engine to also supporting WebAssembly. The JavaScript language itself evolved from ECMAScript 3 to ES2018; the latest V8 even implements post-ES2018 features.

The story arc of the Web is a long and enduring one. Celebrating Chrome and V8’s 10th birthday is a good opportunity to reflect that even though this is a big milestone, the Web Platform’s narrative has lasted for more than 25 years. We have no doubt the Web’s story will continue for at least that long in the future. We’re committed to making sure that V8, JavaScript, and WebAssembly continue to be interesting characters in that narrative. We’re excited to see what the next decade has in store. Stay tuned!

Monday, August 20, 2018

Liftoff: a new baseline compiler for WebAssembly in V8

V8 v6.9 includes Liftoff, a new baseline compiler for WebAssembly. Liftoff is now enabled by default on desktop systems. This article details the motivation to add another compilation tier and describes the implementation and performance of Liftoff.

Since WebAssembly launched more than a year ago, adoption on the web has been steadily increasing. Big applications targeting WebAssembly have started to appear. For example, Epic’s ZenGarden benchmark comprises a 39.5 MB WebAssembly binary, and AutoDesk ships as a 36.8 MB binary. Since compilation time is essentially linear in the binary size, these applications take a considerable time to start up. On many machines it’s more than 30 seconds, which does not provide a great user experience.

But why does it take this long to start up a WebAssembly app, if similar JS apps start up much faster? The reason is that WebAssembly promises to deliver predictable performance, so once the app is running, you can be sure to consistently meet your performance goals (e.g. rendering 60 frames per second, no audio lag or artifacts…). In order to achieve this, WebAssembly code is compiled ahead of time in V8, to avoid any compilation pause introduced by a just-in-time compiler that could result in visible jank in the app.

The existing compilation pipeline (TurboFan)

V8’s approach to compiling WebAssembly has relied on TurboFan, the optimizing compiler we designed for JavaScript and asm.js. TurboFan is a powerful compiler with a graph-based intermediate representation (IR) suitable for advanced optimizations such as strength reduction, inlining, code motion, instruction combining, and sophisticated register allocation. TurboFan’s design supports entering the pipeline very late, nearer to machine code, which bypasses many of the stages necessary for supporting JavaScript compilation. By design, transforming WebAssembly code into TurboFan’s IR (including SSA construction) happens in a straightforward single pass and is very efficient, partially due to WebAssembly’s structured control flow. Yet the backend of the compilation process still consumes considerable time and memory.

The new compilation pipeline (Liftoff)

The goal of Liftoff is to reduce startup time for WebAssembly-based apps by generating code as fast as possible. Code quality is secondary, as hot code will eventually be recompiled with TurboFan anyway. Liftoff avoids the time and memory overhead of constructing an IR and generates machine code in a single pass over the bytecode of a WebAssembly function.

The Liftoff compilation pipeline compared to the TurboFan compilation pipeline.

From the diagram above it is obvious that Liftoff should be able to generate code much faster than TurboFan since the pipeline only consists of two stages. In fact, the function body decoder does a single pass over the raw WebAssembly bytes and interacts with the subsequent stage via callbacks, so code generation is performed while decoding and validating the function body. Together with WebAssembly’s streaming APIs, this allows V8 to compile WebAssembly code to machine code while downloading over the network.
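For example, with the standard JavaScript API (a sketch; 'module.wasm' is a placeholder URL), compilation overlaps with the download:

    async function load() {
      // Compilation starts as soon as the first bytes arrive,
      // instead of waiting for the download to complete.
      const { instance } = await WebAssembly.instantiateStreaming(
          fetch('module.wasm'), { /* imports */ });
      return instance.exports;
    }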

Code generation in Liftoff

Liftoff is a simple code generator, and a fast one. It performs only one pass over the opcodes of a function, generating code for each opcode, one at a time. For simple opcodes like arithmetic operations, this is often a single machine instruction, but it can be more for others like calls. Liftoff maintains metadata about the operand stack in order to know where the inputs of each operation are currently stored. This virtual stack exists only during compilation. WebAssembly’s structured control flow and validation rules guarantee that the location of these inputs can be statically determined. Thus an actual runtime stack onto which operands are pushed and popped is not necessary. During execution, each value on the virtual stack will either be held in a register or be spilled to the physical stack frame of that function. For small integer constants (generated by i32.const), Liftoff only records the constant’s value in the virtual stack and does not generate any code. Only when the constant is used by a subsequent operation is it emitted or combined with that operation, for example by directly emitting an addl <reg>, <const> instruction on x64. This avoids ever loading that constant into a register, resulting in better code.

Let’s go through a very simple function to see how Liftoff generates code for that.
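Concretely, suppose the function body consists of just three instructions: get_local 0, get_local 1, and i32.add. The snippet below hand-encodes such a module using the standard binary layout and runs it (an illustrative sketch, not V8-internal code):

    const bytes = new Uint8Array([
      0x00, 0x61, 0x73, 0x6d,        // magic number: '\0asm'
      0x01, 0x00, 0x00, 0x00,        // binary format version 1
      0x01, 0x07, 0x01,              // type section, one entry:
      0x60, 0x02, 0x7f, 0x7f,        //   func with two i32 params
      0x01, 0x7f,                    //   and one i32 result
      0x03, 0x02, 0x01, 0x00,        // function section: one func of type 0
      0x07, 0x07, 0x01,              // export section, one entry:
      0x03, 0x61, 0x64, 0x64,        //   name 'add',
      0x00, 0x00,                    //   a function, index 0
      0x0a, 0x09, 0x01, 0x07, 0x00,  // code section: one body, no locals
      0x20, 0x00,                    //   get_local 0
      0x20, 0x01,                    //   get_local 1
      0x6a,                          //   i32.add
      0x0b,                          //   end
    ]);

    WebAssembly.instantiate(bytes).then(({ instance }) => {
      console.log(instance.exports.add(2, 3)); // → 5
    });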

This example function takes two parameters and returns their sum. When Liftoff decodes the bytes of this function, it begins by initializing its internal state for the local variables according to the calling convention for WebAssembly functions. For x64, V8’s calling convention passes the two parameters in the registers rax and rdx.

For get_local instructions, Liftoff does not generate any code, but instead just updates its internal state to reflect that these register values are now pushed on the virtual stack. The i32.add instruction then pops the two registers and chooses a register for the result value. We cannot use either of the input registers for the result, since both registers still appear on the stack for holding the local variables. Overwriting them would change the value returned by a later get_local instruction. So Liftoff picks a free register, in this case rcx, produces the sum of rax and rdx into that register, and pushes rcx onto the virtual stack.

After the i32.add instruction, the function body is finished, so Liftoff must assemble the function return. As our example function has one return value, validation requires that there must be exactly one value on the virtual stack at the end of the function body. So Liftoff generates code that moves the return value held in rcx into the proper return register rax and then returns from the function.

For the sake of simplicity, the example above does not contain any blocks (if, loop …) or branches. Blocks in WebAssembly introduce control merges, since code can branch to any parent block, and if-blocks can be skipped. These merge points can be reached from different stack states. Code following a merge point, however, must be generated against one specific stack state. Thus, Liftoff snapshots the current state of the virtual stack as the state which will be assumed for code following the new block (i.e. when returning to the control level where we currently are). The new block will then continue with the currently active state, potentially changing where stack values or locals are stored: some might be spilled to the stack or held in other registers. When branching to another block or ending a block (which is the same as branching to the parent block), Liftoff must generate code that adapts the current state to the expected state at that point, such that the code emitted for the target we branch to finds the right values where it expects them. Validation guarantees that the height of the current virtual stack matches the height of the expected state, so Liftoff need only generate code to shuffle values between registers and/or the physical stack frame as shown below.
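In pseudocode, the merge works roughly like this (a toy model; V8’s actual data structures differ). Each virtual-stack slot records where its value currently lives, e.g. { reg: 'rcx' } or { stackSlot: 2 }:

    // Adapt the current virtual-stack state to the snapshot expected at
    // the branch target, emitting moves for any slots that differ.
    function mergeIntoSnapshot(currentState, snapshot, emitMove) {
      // Validation guarantees both states have the same height.
      for (let i = 0; i < snapshot.length; i++) {
        const from = currentState[i];
        const to = snapshot[i];
        if (from.reg !== to.reg || from.stackSlot !== to.stackSlot) {
          emitMove(from, to); // shuffle between registers and stack slots
        }
      }
    }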

Let’s look at an example of that.

The example above assumes a virtual stack with two values on the operand stack. Before starting the new block, the top value on the virtual stack is popped as the argument to the if instruction. The remaining stack value needs to be put in another register, since it currently shadows the first parameter, but when branching back to this state we might need to hold two different values for the stack value and the parameter. In this case Liftoff chooses to duplicate it into the rcx register. This state is then snapshotted, and the active state is modified within the block. At the end of the block, we implicitly branch back to the parent block, so we merge the current state into the snapshot by moving register rbx into rcx and reloading register rdx from the stack frame.

Tiering up from Liftoff to TurboFan

With Liftoff and TurboFan, V8 now has two compilation tiers for WebAssembly: Liftoff as the baseline compiler for fast startup and TurboFan as the optimizing compiler for maximum performance. This poses the question of how to combine the two compilers to provide the best overall user experience.

For JavaScript, V8 uses the Ignition interpreter and the TurboFan compiler and employs a dynamic tier-up strategy. Each function is first executed in Ignition, and if the function becomes hot, TurboFan compiles it into highly-optimized machine code. A similar approach could also be used for Liftoff, but the tradeoffs are a bit different here:

  1. WebAssembly does not require type feedback to generate fast code. Where JavaScript greatly benefits from gathering type feedback, WebAssembly is statically typed, so the engine can generate optimized code right away.
  2. WebAssembly code should run predictably fast, without a lengthy warm-up phase. One of the reasons applications target WebAssembly is to execute on the web with predictable high performance. So we can neither tolerate running suboptimal code for too long, nor do we accept compilation pauses during execution.
  3. An important design goal of the Ignition interpreter for JavaScript is to reduce memory usage by not compiling functions at all. Yet we found that an interpreter for WebAssembly is far too slow to deliver on the goal of predictably fast performance. We did, in fact, build such an interpreter, but being 20× or more slower than compiled code, it is only useful for debugging, regardless of how much memory it saves. Given this, the engine must store compiled code anyway; in the end it should store only the most compact and most efficient code, which is TurboFan optimized code.

From these constraints we concluded that dynamic tier-up is not the right tradeoff for V8’s implementation of WebAssembly right now, since it would increase code size and reduce performance for an indeterminate time span. Instead, we chose a strategy of eager tier-up: immediately after Liftoff compilation of a module finishes, the WebAssembly engine starts background threads to generate optimized code for the module. This allows V8 to start executing code quickly (as soon as Liftoff is done), while still making the most performant TurboFan code available as early as possible.

The picture below shows the trace of compiling and executing the EpicZenGarden benchmark. It shows that right after Liftoff compilation we can instantiate the WebAssembly module and start executing it. TurboFan compilation still takes several more seconds, so during that tier-up period the observed execution performance will gradually increase since individual TurboFan functions will be used as soon as they are finished.

Performance

Two metrics are interesting for evaluating the performance of the new Liftoff compiler. First, we want to compare the compilation speed (i.e. time to generate code) with TurboFan’s. Second, we want to measure the performance of the generated code (i.e. execution speed). The first metric is the more interesting one here, since the goal of Liftoff is to reduce startup time by generating code as quickly as possible. On the other hand, the performance of the generated code should still be pretty good, since that code might execute for several seconds or even minutes on low-end hardware.

Performance of generating code

For measuring the compiler performance itself, we ran a number of benchmarks and measured the raw compilation time using tracing (see the picture above). We ran the benchmarks both on an HP Z840 machine (2 x Intel Xeon E5-2690 @ 2.6 GHz, 24 cores, 48 threads) and on a MacBook Pro (Intel Core i7-4980HQ @ 2.8 GHz, 4 cores, 8 threads). Note that Chrome currently does not use more than 10 background threads, so most of the cores of the Z840 machine are unused.

We execute four benchmarks:

  1. EpicZenGarden: The ZenGarden demo running on the Epic framework: https://s3.amazonaws.com/mozilla-games/ZenGarden/EpicZenGarden.html
  2. Tanks!: A demo of the Unity engine: https://webassembly.org/demo/
  3. AutoDesk: https://web.autocad.com/
  4. PSPDFKit: https://pspdfkit.com/webassembly-benchmark/

For each benchmark, we measure the raw compilation time using the tracing output as shown above. This number is more stable than any time reported by the benchmark itself, as it does not rely on a task being scheduled on the main thread and does not include unrelated work like creating the actual WebAssembly instance.

The graphs below show the results of these benchmarks. Each benchmark was executed three times, and we report the average compilation time.

Code Generation Performance: Liftoff vs. TurboFan on Macbook

Code Generation Performance: Liftoff vs. TurboFan on Z840

As expected, the Liftoff compiler generates code much faster both on the high-end desktop workstation and on the MacBook. The speedup of Liftoff over TurboFan is even bigger on the less-capable MacBook hardware.

Performance of the generated code

Even though the performance of the generated code is a secondary goal, we still want to preserve a good user experience during the startup phase, as Liftoff code might execute for several seconds before TurboFan code is finished.

For measuring Liftoff code performance, we turned off tier-up in order to measure pure Liftoff execution. In this setup, we execute two benchmarks:

  1. Unity headless benchmarks

    This is a number of benchmarks running in the Unity framework. They are headless, hence can be executed in the d8 shell directly. Each benchmark reports a score, which is not necessarily proportional to the execution performance, but good enough to compare the performance.

  2. PSPDFKit: https://pspdfkit.com/webassembly-benchmark/

    This benchmark reports the time it takes to perform different actions on a PDF document and the time it takes to instantiate the WebAssembly module (including compilation).

Just as before, we execute each benchmark three times and use the average of the three runs. Since the scale of the recorded numbers differs significantly between the benchmarks, we report the relative performance of Liftoff vs. TurboFan. A value of +30% means that Liftoff code runs 30% slower than TurboFan; negative numbers indicate that Liftoff executes faster.
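In other words (a sketch of the arithmetic, using hypothetical timings):

    // Positive → Liftoff is slower; negative → Liftoff is faster.
    function relativePerformance(liftoffTime, turbofanTime) {
      return (liftoffTime / turbofanTime - 1) * 100; // in percent
    }
    relativePerformance(13, 10); // → +30: Liftoff runs 30% slower

Here are the results: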

Liftoff Performance on Unity

On Unity, Liftoff code executes on average around 50% slower than TurboFan code on the desktop machine and 70% slower on the MacBook. Interestingly, there is one case (Mandelbrot Script) where Liftoff code outperforms TurboFan code. This is likely an outlier where, for example, the register allocator of TurboFan is doing poorly in a hot loop. We are investigating to see if TurboFan can be improved to handle this case better.

Liftoff Performance on PSPDFKit

On the PSPDFKit benchmark, Liftoff code executes 18-54% slower than optimized code, while initialization improves significantly, as expected. These numbers show that for real-world code which also interacts with the browser via JavaScript calls, the performance loss of unoptimized code is generally lower than on more computation-intensive benchmarks.

And again, note that for these numbers we turned off tier-up completely, so we only ever executed Liftoff code. In production configurations, Liftoff code will gradually be replaced by TurboFan code, such that the lower performance of Liftoff code lasts only for a short period of time.

Future work

After the initial launch of Liftoff, we are working to further improve startup time, reduce memory usage, and bring the benefits of Liftoff to more users. In particular, we are working on improving the following things:

  1. Port Liftoff to arm and arm64 to also use it on mobile devices. Currently, Liftoff is only implemented for Intel platforms (32 and 64 bit), which mostly captures desktop use cases. In order to also reach mobile users, we will port Liftoff to more architectures.
  2. Implement dynamic tier-up for mobile devices. Since mobile devices tend to have much less memory available than desktop systems, we need to adapt our tiering strategy for these devices. Just recompiling all functions with TurboFan easily doubles the memory needed to hold all code, at least temporarily (until Liftoff code is discarded). Instead, we are experimenting with a combination of lazy compilation with Liftoff and dynamic tier-up of hot functions in TurboFan.
  3. Improve performance of Liftoff code generation. The first iteration of an implementation is rarely the best one. There are several things which can be tuned to make Liftoff’s compilation even faster. This will gradually happen over the next releases.
  4. Improve performance of Liftoff code. Apart from the compiler itself, the size and speed of the generated code can also be improved. This will also happen gradually over the next releases.

Conclusion

V8 now contains Liftoff, a new baseline compiler for WebAssembly. Liftoff vastly reduces the start-up time of WebAssembly applications with a simple and fast code generator. On desktop systems, V8 still reaches maximum peak performance by recompiling all code in the background using TurboFan. Liftoff is enabled by default in V8 v6.9 (Chrome 69), and can be controlled explicitly with the --liftoff/--no-liftoff flags in d8 and the chrome://flags/#enable-webassembly-baseline flag in Chrome, respectively.

Posted by Clemens Hammacher, WebAssembly compilation maestro